Voice operation device, method and recording medium for recording voice operation program
A speech synthesis technology concerned with voice quality, applied in speech synthesis and speech analysis, which addresses the problem that a phoneme database holding only one speaker's phoneme data yields synthesized speech of a single voice quality regardless of the speech content
Inactive Publication Date: 2005-08-24
AI-Extracted Technical Summary
Problems solved by technology
 However, in the above-mentioned phoneme database, only one type of phoneme data uttered by a specific speaker (for example, a male speaker) is registered.
Therefore, for example, in the case of outputting text information ("ちよう..." or "...みたい...
Provided is a speech synthesizer capable of synthesizing speech of various voice qualities even in an environment where severe restrictions are imposed on hardware resources. The speech synthesizer 100, which holds only one type of phoneme data, is provided with a voice quality changing unit 250 and a voice quality database 260. The voice quality changing unit 250 searches the voice quality database 260 using a voice quality data number supplied from a text analysis unit 220 as a retrieval key to obtain voice quality parameters. Based on the acquired voice quality parameters, the voice quality changing unit 250 changes the voice quality of each phoneme represented by the phoneme data obtained by a phoneme data acquisition unit 230.
 Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings.
 A. This Embodiment
 FIG. 1 is a diagram showing a functional configuration of a speech synthesis apparatus 100 according to the present embodiment. In this embodiment, it is assumed that the speech synthesis apparatus 100 is installed in a mobile terminal with relatively limited hardware resources, such as a mobile phone, a PHS (Personal Handyphone System), or a PDA (Personal Digital Assistant); however, the present invention is not limited to this and can be used in various electronic devices.
 The input unit 210 supplies the text analysis unit 220 with text information input via an operation unit or the like (not shown). FIG. 2 is a diagram illustrating text information.
 The text content information is information indicating the content of the text to be output as synthesized speech (for example, "こんにちわ"). Although FIG. 2 shows text content information represented only by hiragana, the text content information is not limited to hiragana, and may be represented by various characters such as kanji, romaji, and katakana, and various symbols.
 The voice quality data numbers (voice quality specifying information) are unique numbers (K1 to Kn in FIG. 2 ) for identifying a plurality of voice quality parameters (phoneme data processing information), described later. In the present embodiment, by appropriately selecting and using a voice quality parameter, it is possible to obtain synthesized speech of various voice qualities from one type of phoneme data uttered by a specific speaker (in this embodiment, assumed to be a "male speaker"; described in detail later).
 Pitch information (pitch designation information) is information for assigning a pitch to the synthesized speech (in other words, designating the pitch of the synthesized speech), and is composed of scale information such as "C (do)" to "B (ti)" (see FIG. 2 ).
The text analysis unit 220 analyzes the text information supplied from the input unit 210 , and supplies the analysis results to the phoneme data acquisition unit 230 , the voice quality changing unit 250 , and the speech signal generation unit 270 , respectively. Specifically, when provided with the text information shown in FIG. 2 , the text analysis unit 220 first decomposes the text content information such as "こんにちわ" into the mora-unit phonemes "こ", "ん", "に", "ち", and "わ". A so-called mora is a syllable representing one pronunciation unit, basically consisting of one consonant and one vowel.
 The text analysis unit (acquisition unit) 220 , having decomposed the text content information into mora-unit phonemes in this manner, generates phoneme information (phoneme specifying information) for specifying each phoneme of the synthesized speech and sequentially supplies it to the phoneme data acquisition unit 230 . Next, the text analysis unit 220 obtains the voice quality data number (for example, K3) and the pitch information (for example, C (do)) from the text information, supplies the obtained voice quality data number to the voice quality changing unit 250 , and supplies the obtained pitch information to the speech signal generation unit 270 .
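 The routing described above, mora decomposition feeding the phoneme data acquisition unit, the voice quality data number feeding the voice quality changing unit, and the pitch information feeding the speech signal generation unit, can be sketched as follows. This is an illustrative sketch in Python; the function and field names are assumptions, not from the patent.

```python
def analyze_text(text_info):
    """Split the text content into mora-unit phonemes and pull out
    the routing data for the downstream units (hypothetical names)."""
    # In kana text, each character is (to a first approximation) one mora.
    phonemes = list(text_info["content"])        # "こんにちわ" -> ["こ","ん","に","ち","わ"]
    return {
        "phonemes": phonemes,                    # -> phoneme data acquisition unit 230
        "quality_number": text_info["quality"],  # -> voice quality changing unit 250
        "pitch": text_info["pitch"],             # -> speech signal generation unit 270
    }

result = analyze_text({"content": "こんにちわ", "quality": "K3", "pitch": "C"})
```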
 The phoneme data acquisition unit (first extraction unit) 230 searches the phoneme database 240 using the phoneme information supplied from the text analysis unit 220 as a key, thereby acquiring the phoneme data corresponding to the phoneme indicated by the phoneme information. FIG. 3 is a diagram illustrating the registered contents of the phoneme database 240 . As shown in FIG. 3 , in the phoneme database (first storage unit) 240, in addition to the series of phoneme data 1 to m representing each mora-unit phoneme ("あ", "い", … "ん", etc.) uttered by one male speaker, the number of such phoneme data (hereinafter referred to as the number of registered phoneme data) and the like are also registered.
 FIG. 4 is a diagram illustrating the configuration of phoneme data representing a certain phoneme (e.g., "こ"), and FIG. 5 is a diagram for explaining each frame information included in the phoneme data. FIG. 5A shows the relationship between the speech waveform vw and each frame FR when the male speaker reads a certain phoneme (for example, "こ"), and FIG. 5B , FIG. 5C , and FIG. 5D show the formant analysis results for the first frame FR1, the second frame FR2, and the nth frame FRn, respectively.
 As shown in FIG. 4 , the phoneme data is composed of the first frame information to the nth frame information. Each frame information includes the first formant information to the k-th formant information obtained by performing formant analysis on the corresponding frame FR (see FIG. 5 ), and a voiced/unvoiced discrimination flag indicating whether the voice of each frame FR is a voiced sound or an unvoiced sound (e.g., "1"=voiced, "0"=unvoiced).
 The first formant information to the k-th formant information constituting each frame information are each composed of a paired formant frequency F and formant amplitude A indicating the corresponding formant (see FIG. 5 B to D). For example, the first formant information to the k-th formant information constituting the first frame information are composed of the formant frequency and formant amplitude pairs (F11, A11), (F12, A12), … (F1k, A1k), and those constituting the n-th frame information are composed of the pairs (Fn1, An1), (Fn2, An2), … (Fnk, Ank) (see FIG. 5 D).
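 The frame/formant layout described above (n frames, each holding k formant frequency–amplitude pairs plus a voiced/unvoiced flag) can be modelled as a small data structure. A minimal sketch with hypothetical names; the numeric values are illustrative only:

```python
from dataclasses import dataclass

@dataclass
class Formant:
    frequency: float  # formant frequency F (Hz)
    amplitude: float  # formant amplitude A

@dataclass
class Frame:
    formants: list    # k pairs, e.g. (F11, A11) ... (F1k, A1k) for frame 1
    voiced: bool      # voiced/unvoiced discrimination flag ("1"=voiced, "0"=unvoiced)

# One phoneme's data = first frame information .. n-th frame information
phoneme_ko = [
    Frame([Formant(250.0, 0.8), Formant(2100.0, 0.3)], voiced=True),
    Frame([Formant(260.0, 0.7), Formant(2150.0, 0.25)], voiced=True),
]
```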
 The phoneme data acquisition unit 230 acquires the corresponding phoneme data based on each phoneme information (each phoneme information representing "こ", "ん", "に", "ち", "わ", etc.) supplied from the text analysis unit 220 , and supplies these phoneme data to the voice quality changing unit 250 .
 The voice quality changing unit 250 changes the voice quality of the phonemes represented by the phoneme data acquired by the phoneme data acquisition unit 230 . Specifically, the voice quality changing unit (second extraction unit) 250 first searches the voice quality database (second storage unit) 260 using the voice quality data number provided by the text analysis unit 220 as a search key to obtain the corresponding voice quality parameter. Then, the voice quality changing unit 250 changes the voice quality of each of the above-mentioned phonemes based on the acquired voice quality parameter.
 FIG. 6 is a diagram illustrating the registered contents of the voice quality database 260 .
 As shown in FIG. 6 , the voice quality database (second storage unit) 260 stores, as information necessary for changing the voice quality of each of the above-mentioned phonemes, voice quality parameters 1 to L indicating the processing contents of the phoneme data, and registration number information indicating the number of voice quality parameters.
 FIG. 7 is a diagram showing a configuration example of a voice quality parameter.
 As shown in FIG. 7 , the voice quality parameter (phoneme data processing information) includes a voice quality data number for specifying the parameter, a gender change flag indicating whether or not to change the gender of the synthesized speech, and first to k-th formant change information indicating the change contents of the first to k-th formants. When the gender change flag is set to "1", the voice quality changing unit 250 performs processing for changing the gender of the synthesized speech (hereinafter referred to as gender change processing), and when the flag is set to "0", it does not perform the gender change processing (described in detail later). In the present embodiment, the one type of phoneme data is assumed to be uttered by a male speaker, so when the gender change flag is set to "1", the characteristics of the synthesized speech are changed from male to female. On the other hand, when the gender change flag is set to "0", the characteristics of the synthesized speech remain male, unchanged.
 On the other hand, each formant change information includes basic waveform selection information for selecting the basic waveform (sine wave or the like, described later) of the formant, formant frequency change information indicating the change content of the formant frequency, and formant amplitude change information indicating the change content of the formant amplitude.
 Each formant frequency change information includes information indicating the conversion amount, oscillation speed, and oscillation amplitude of the formant frequency, and each formant amplitude change information likewise includes information indicating the conversion amount, oscillation speed, and oscillation amplitude of the formant amplitude. These conversion amounts, oscillation speeds, and oscillation amplitudes are described in detail later.
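 Putting the last three paragraphs together, one voice quality parameter bundles a gender change flag with per-formant change information (basic waveform selection, plus conversion amount, oscillation speed, and oscillation amplitude for both frequency and amplitude). A sketch of such a record follows; every field name and numeric value is an assumption for illustration:

```python
# Hypothetical in-memory form of one voice quality parameter (FIG. 7).
voice_quality_param = {
    "quality_number": "K3",   # identifies this parameter in the database
    "gender_change": 1,       # 1 = apply male->female gender change, 0 = keep as-is
    "formant_changes": [      # one entry per formant, up to k entries
        {
            "basic_waveform": "sine",
            "frequency": {"conversion": 1.1, "osc_speed": 5.0, "osc_amplitude": 0.02},
            "amplitude": {"conversion": 0.9, "osc_speed": 3.0, "osc_amplitude": 0.05},
        },
    ],
}
```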
 FIG. 8 is a flowchart showing the voice quality changing process executed by the voice quality changing unit 250 .
 After receiving the voice quality data number from the text analysis unit 220, the voice quality changing unit (generating unit) 250 searches the voice quality database 260 with the voice quality data number as a search key to obtain the corresponding voice quality parameter (step S1). Then, the voice quality changing unit 250 refers to the gender change flag included in the acquired voice quality parameter and judges whether the gender of the synthesized speech should be changed (that is, whether the gender change processing should be executed) (step S2). For example, when the gender change flag is set to "0", the voice quality changing unit 250 determines that the gender change should not be performed, skips step S3, and proceeds to step S4. When the gender change flag is set to "1", the voice quality changing unit 250 judges that the gender change should be performed, and proceeds to step S3 to execute the gender change processing.
 FIG. 9 is a diagram illustrating an example of the mapping function mf for the gender change processing, stored in a storage unit (not shown), and FIGS. 10 and 11 are graphs of the analysis results obtained when a male and a female read the same phoneme (for example, "あ"). In the mapping function mf shown in FIG. 9 , the horizontal axis represents the input frequency (the formant frequency input to the voice quality changing unit 250 ), the vertical axis represents the output frequency (the changed formant frequency output from the voice quality changing unit 250 ), and fmax represents the maximum input formant frequency. In the analysis graphs g1 and g2 shown in FIGS. 10 and 11 , the horizontal axis represents frequency and the vertical axis represents amplitude.
 By comparing the analysis graphs g1 and g2 shown in FIG. 10 and FIG. 11 , it can be seen that the first formant frequency fm1 to the fourth formant frequency fm4 of the male phoneme are lower than the first formant frequency ff1 to the fourth formant frequency ff4 of the female phoneme. Therefore, in the present embodiment, as shown in FIG. 9 , a phoneme with male characteristics is changed into a phoneme with female characteristics by the mapping function mf (see the solid line) located above the straight line n1 (input frequency = output frequency, see the dotted line).
 Specifically, the voice quality changing unit 250 uses the mapping function mf shown in FIG. 9 to convert each formant frequency of the input phoneme data toward higher frequencies. As a result, each formant frequency of the input male phoneme is changed to a formant frequency having female characteristics. Conversely, when the formant frequencies of a female phoneme are input, the mapping function mf' located below the straight line n1 can be used (see the portion indicated by the dotted line in FIG. 9 ).
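 A minimal sketch of such a mapping function: the text only requires that mf lie above the input = output diagonal (and mf' below it), so the concave power-law shape and the value of fmax used here are assumptions for illustration.

```python
FMAX = 8000.0  # assumed maximum input formant frequency (Hz)

def map_male_to_female(freq, alpha=0.8):
    """Concave mapping lying above the line output = input for 0 < freq < FMAX,
    raising formant frequencies toward female characteristics.
    alpha < 1 is an assumed shape parameter; alpha > 1 would give a curve
    below the diagonal, i.e. a candidate for the inverse mapping mf'."""
    return FMAX * (freq / FMAX) ** alpha

# Example: a 500 Hz male formant frequency is raised toward ~870 Hz.
raised = map_male_to_female(500.0)
```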
 After performing the above-described gender change processing, the voice quality changing unit 250 proceeds to step S4 and converts each formant frequency according to the conversion amount of each formant frequency indicated by each formant change information. Further, the voice quality changing unit 250 oscillates each converted formant frequency, executing the frequency oscillation processing (step S5 ).
 FIG. 12 is a diagram illustrating an oscillation table TA that is stored in a storage unit (not shown) and used in the frequency oscillation processing, and FIG. 13 is a diagram showing an example of the relationship between the oscillation values read from the oscillation table TA and time. In the present embodiment, for convenience of description, it is assumed that the same oscillation table TA is used to oscillate each of the formant frequencies described above, but a different oscillation table (with different oscillation values, etc.) may be used for each formant frequency.
 The oscillation table TA is a table in which oscillation values are registered in chronological order. The voice quality changing unit 250 performs the frequency oscillation processing by controlling the reading speed of the oscillation values registered in the oscillation table TA (or the number of oscillation values to skip, i.e., not read) according to the oscillation speed of the formant frequency indicated by each formant change information, and multiplying each read oscillation value by the oscillation amplitude of the formant frequency indicated by each formant change information. Thereby, a waveform in which the formant frequency fm shown in FIG. 14 oscillates at the oscillation speed sp and the oscillation amplitude lv is obtained. In the present embodiment, the oscillation table TA is used in order to reduce the amount of calculation for the oscillation of the formant frequency; however, instead of using the oscillation table TA, a predetermined function may be used to obtain the oscillation of the formant frequency.
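 The table-driven oscillation described above can be sketched as follows. The table contents, frame count, and parameter values are assumptions; only the mechanism (read speed controlled by the oscillation speed, read values scaled by the oscillation amplitude) follows the text.

```python
import math

# Assumed contents of the oscillation table TA: one sine cycle in 32 steps.
OSC_TABLE = [math.sin(2 * math.pi * i / 32) for i in range(32)]

def oscillate_frequency(base_freq, n_frames, osc_speed, osc_amplitude):
    """Per-frame formant frequency with table-based oscillation.
    osc_speed sets how far the table index advances per frame (a higher
    speed skips more entries); each read value is scaled by osc_amplitude."""
    out = []
    index = 0.0
    for _ in range(n_frames):
        value = OSC_TABLE[int(index) % len(OSC_TABLE)]
        out.append(base_freq * (1.0 + value * osc_amplitude))
        index += osc_speed  # reading speed / skip count
    return out

freqs = oscillate_frequency(700.0, 8, osc_speed=4.0, osc_amplitude=0.05)
```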
 After performing the frequency oscillation processing, the voice quality changing unit 250 proceeds to step S6 and converts each formant amplitude according to the conversion amount of each formant amplitude indicated by each formant change information. Furthermore, the voice quality changing unit 250 oscillates each converted formant amplitude, executes the amplitude oscillation processing (step S7 ), and then ends the processing. Since the oscillation table used in the amplitude oscillation processing, and the operation of oscillating each formant amplitude using it, can be described in substantially the same way as the oscillation of each formant frequency described above, the description is omitted here. The formant amplitudes may be oscillated using the same oscillation table as the formant frequencies, or using a different one.
 The voice quality changing unit (generating unit) 250 changes the voice quality of each phoneme (that is, processes the phoneme data) based on the acquired voice quality parameter (phoneme data processing information), and then supplies the basic waveform selection information, formant frequency, and formant amplitude of each formant to the speech signal generation unit 270 .
After receiving the basic waveform selection information supplied from the voice quality changing unit 250 , the speech signal generation unit 270 acquires the waveform data indicated by the basic waveform selection information from the waveform database 280 . The basic waveform indicated by the basic waveform selection information may be different for each formant. For example, the basic waveform of a low-frequency formant may be a sine wave, while the basic waveform of a high-frequency formant that expresses individuality may be a waveform other than a sine wave (such as a rectangular wave or a sawtooth wave). Of course, instead of using a plurality of basic waveforms, only a single basic waveform (for example, a sine wave) may be used.
 After selecting each waveform data in this manner, the speech signal generation unit (generating unit) 270 generates a formant waveform for each formant using the selected waveform data, each formant frequency, and each formant amplitude. Then, the speech signal generation unit 270 adds the respective formant waveforms together to generate a synthesized speech signal. Then, the speech signal generation unit 270 performs processing (hereinafter referred to as pitch assignment processing) for assigning, to the generated synthesized speech signal, the pitch indicated by the pitch information (pitch designation information) supplied from the text analysis unit 220 .
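 The per-formant synthesis and summation step can be sketched as below: each formant is rendered as an amplitude-scaled basic waveform at its formant frequency, and the results are summed. The sampling rate, waveform choice, and formant values are illustrative assumptions.

```python
import math

def formant_wave(freq, amp, n, sample_rate, waveform=math.sin):
    """One formant rendered with its basic waveform, frequency, and amplitude."""
    return [amp * waveform(2 * math.pi * freq * i / sample_rate) for i in range(n)]

def synthesize(formants, n, sample_rate):
    """Sum the per-formant waveforms into one synthesized speech signal."""
    total = [0.0] * n
    for freq, amp in formants:
        for i, v in enumerate(formant_wave(freq, amp, n, sample_rate)):
            total[i] += v
    return total

# Two assumed formants (frequency, amplitude); 0.1 s at 8 kHz.
sig = synthesize([(250.0, 0.8), (2100.0, 0.3)], n=800, sample_rate=8000)
```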
 FIG. 15 is a diagram for explaining the pitch assignment processing. To simplify the description, FIG. 15 shows, as an example, the case where a pitch is assigned to a sine-wave synthesized speech signal.
 The speech signal generation unit 270 calculates the period of the time envelope tp shown in FIG. 15 based on the pitch information supplied from the text analysis unit 220 . The pitch of the synthesized speech depends on the period of the time envelope tp: the longer the period, the lower the pitch, and the shorter the period, the higher the pitch. After obtaining the period of the time envelope tp in this way, the speech signal generation unit 270 repeatedly multiplies the synthesized speech signal by the time envelope tp at the obtained period, thereby obtaining a synthesized speech signal given the prescribed pitch.
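 The period/pitch relation described above can be sketched by repeatedly multiplying the signal with a time envelope whose period is the reciprocal of the desired pitch. The raised-cosine envelope shape used here is an assumption; the text fixes only the relation between envelope period and pitch.

```python
import math

def apply_pitch(signal, sample_rate, pitch_hz):
    """Multiply the synthesized signal by a repeating time envelope whose
    period is 1/pitch_hz (longer period = lower pitch, shorter = higher).
    The raised-cosine shape is an assumed envelope, zero at period edges."""
    period = int(sample_rate / pitch_hz)  # envelope period in samples
    out = []
    for i, s in enumerate(signal):
        phase = (i % period) / period
        envelope = 0.5 - 0.5 * math.cos(2 * math.pi * phase)
        out.append(s * envelope)
    return out

sr = 8000
carrier = [math.sin(2 * math.pi * 440 * i / sr) for i in range(sr // 10)]
voiced = apply_pitch(carrier, sr, pitch_hz=100)  # 100 Hz pitch envelope
```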
 FIG. 16 is a diagram illustrating an example of the formant waveform of a specific formant after the voice quality changing processing and the pitch assignment processing are performed. As shown in FIG. 16 , the processing related to the voice quality change (e.g., the oscillation of the formant frequency and formant amplitude) can be controlled in frame periods (frame units). The speech signal generation unit (generating unit) 270 obtains the synthesized speech signal to which the predetermined pitch is assigned, and then outputs it to the outside as synthesized speech. Thereby, the user can confirm the content of the text ("こんにちわ" etc.) input to the speech synthesis apparatus 100 as synthesized speech of the desired voice quality.
 As described above, according to the speech synthesis apparatus of the present embodiment, since the voice quality changing unit can perform various voice quality changing processes in units of formants, even if only one type of phoneme data (that is, only a specific speaker's phoneme data) is stored, speech synthesis of various voice qualities can still be performed.
 B. Other
 In the above-described embodiment, the text information input to the speech synthesis apparatus 100 includes pitch information as an example (see FIG. 2 ), but the text information need not include pitch information. To handle this case, substitute pitch information (see the parenthesized part in FIG. 3 ) may be registered in the phoneme database 240 in advance; when the text information does not include pitch information, the pitch indicated by the substitute pitch information (e.g., C (do)) is used as the pitch of the synthesized speech. Further, in addition to the substitute pitch information, the number of formant information items per frame shown in FIG. 4 may be pre-registered in the phoneme database 240 (formant number information, see the parentheses in FIG. 3 ).
 In addition, the various functions of the speech synthesis apparatus 100 described above are realized by a CPU (or DSP) executing a program stored in a memory such as a ROM; this program may be recorded on a recording medium such as a CD-ROM and distributed, or may be distributed via a communication network such as the Internet.
 In the above description, the voice quality changing processing is performed based on the voice quality data number obtained from the text information; however, it is also possible to automatically extract keywords from the input text information, and then automatically determine a voice quality suitable for the text information by referring, with the extracted keywords, to a database preset in the electronic device that holds keywords for each voice quality.