Speech synthesis device, speech synthesis method, and program
The speech synthesis device improves response time and enables detailed processing of prosodic features by generating intermediate representations and prosodic features in advance, facilitating sequential output of speech waveforms.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- KK TOSHIBA
- Filing Date
- 2022-03-22
- Publication Date
- 2026-07-01
AI Technical Summary
Conventional DNN speech synthesis technologies using encoder-decoder structures face long response times and difficulty in performing detailed processing of prosodic features such as pitch and intonation until the entire input is processed.
The speech synthesis device includes an analysis unit, a first processing unit with an encoder and a prosodic feature decoder, and a second processing unit with a speech waveform decoder. The first processing unit generates the intermediate representations and prosodic features in advance, which allows the second processing unit to output speech waveforms sequentially, improving response time and enabling detailed processing of prosodic features.
The proposed solution reduces response time by allowing output of speech waveforms while previous waveforms are being played and enables detailed processing of prosodic features before waveform generation, enhancing the quality and responsiveness of speech synthesis.
Abstract
Description
Technical Field
[0001] Embodiments of the present invention relate to a speech synthesis device, a speech synthesis method, and a program.
Background Art
[0002] In recent years, speech synthesis devices using deep neural networks (DNNs) have become known. In particular, several DNN speech synthesis techniques based on an encoder-decoder structure have been proposed.
[0003] For example, in Patent Document 1, a sequence-to-sequence recurrent neural network that inputs a sequence of natural language characters and outputs a spectrogram of oral speech has been proposed. Also, for example, in Non-Patent Document 1, a DNN speech synthesis technology based on an encoder-decoder structure using a self-attention mechanism that inputs a phoneme notation of natural language and outputs a mel spectrogram or a speech waveform via each of its duration, pitch, and energy has been proposed.
Prior Art Documents
Patent Documents
[0004]
Patent Document 1
Non-Patent Documents
[0005]
Non-Patent Document 1
Non-Patent Document 2
[0006] The present invention aims to provide a speech synthesis device, a speech synthesis method, and a program that improve the response time until waveform generation and enable detailed processing of prosodic features based on the entire input before waveform generation.
Means for Solving the Problem
[0007] The speech synthesis apparatus of this embodiment comprises an analysis unit, a first processing unit, and a second processing unit. The analysis unit analyzes input text and generates a sequence of language features including one or more vectors representing language features. The first processing unit comprises an encoder that converts the sequence of language features into an intermediate representation sequence including one or more vectors representing latent variables using a first neural network, and a prosodic feature decoder that generates prosodic features from the intermediate representation sequence using a second neural network. The second processing unit comprises a speech waveform decoder that sequentially generates a speech waveform from the intermediate representation sequence and the prosodic features using a third neural network.
Brief Description of the Drawings
[0008]
[Figure 1] Figure 1 shows an example of the functional configuration of the speech synthesis device according to the first embodiment.
[Figure 2] Figure 2 shows an example of a vector representation of context information in the first embodiment.
[Figure 3] Figure 3 is a flowchart showing an example of the speech synthesis method of the first embodiment.
[Figure 4] Figure 4 shows an example of the functional configuration of the prosodic feature decoder in the first embodiment.
[Figure 5] Figure 5 is a flowchart showing an example of a method for generating prosodic features according to the first embodiment.
[Figure 6] Figure 6 shows an example of the functional configuration of the speech synthesis device according to the second embodiment.
[Figure 7] Figure 7 is a flowchart showing an example of a speech synthesis method according to the second embodiment.
[Figure 8] Figure 8 is a diagram illustrating an example of processing by the processing unit of the second embodiment.
[Figure 9] Figure 9 shows an example of the functional configuration of the speech synthesis device according to the third embodiment.
[Figure 10] Figure 10 shows an example of the functional configuration of the continuous speech frame count generation unit of the third embodiment.
[Figure 11] Figure 11 shows an example of a pitch waveform in the third embodiment.
[Figure 12] Figure 12 is a flowchart showing an example of the speech synthesis method of the third embodiment.
[Figure 13] Figure 13 is a diagram for explaining an example of the processing of the continuous speech frame count generation unit of the third embodiment.
[Figure 14] Figure 14 shows an example of the functional configuration of the speech synthesis device of the fourth embodiment.
[Figure 15] Figure 15 is a flowchart showing an example of the speech synthesis method of the fourth embodiment.
[Figure 16] Figure 16 is a diagram for explaining an example of the processing of the first processing unit of the fourth embodiment.
[Figure 17] Figure 17 shows an example of the hardware configuration of the speech synthesis device of the first to fourth embodiments.
Mode for Carrying Out the Invention
[0009] In DNN speech synthesis using an encoder-decoder structure, two types of neural networks, an encoder and a decoder, are used. The encoder converts an input sequence into a latent variable. A latent variable is a value that cannot be directly observed from the outside, and in speech synthesis, a sequence of intermediate representations, which is the conversion result of each input, is used. The decoder converts the obtained latent variable (that is, the intermediate representation sequence) into acoustic feature quantities, speech waveforms, and the like. When the length of the intermediate representation sequence differs from the length of the sequence of acoustic feature quantities output by the decoder, the two are aligned either by an attention mechanism, as in Patent Document 1, or by separately obtaining the number of acoustic feature frames corresponding to each intermediate representation, as in Non-Patent Document 1.
[0010] However, because the conventional technology uses a decoder based on an attention mechanism, the entire input must be processed during synthesis, which makes the response time long. One conceivable improvement is to output all acoustic feature quantities and speech waveforms sequentially, but in that case detailed processing of prosody-related feature quantities (prosodic feature quantities), such as phoneme durations and the pitch and intonation of sounds, cannot be performed until the entire input has been processed.
[0011] Referring to the accompanying drawings below, embodiments of a speech synthesis apparatus, a speech synthesis method, and a program for solving the above problems will be described in detail.
[0012] (First Embodiment) First, an example of the functional configuration of the speech synthesis apparatus according to the first embodiment will be described.
[0013] [Example of Functional Configuration] FIG. 1 is a diagram showing an example of the functional configuration of a speech synthesis apparatus 10 according to the first embodiment. In the DNN speech synthesis with an encoder-decoder structure, the speech synthesis apparatus 10 outputs a sequence of intermediate representations and prosodic features in advance, and then sequentially outputs a speech waveform. Thereby, the response time is improved compared to the conventional DNN speech synthesis processing with an encoder-decoder structure.
[0014] The speech synthesis apparatus 10 according to the first embodiment includes an analysis unit 1, a first processing unit 2, and a second processing unit 3.
[0015] The analysis unit 1 analyzes the input text and generates a language feature sequence 101. The language feature sequence 101 is information in which the utterance information (language features) obtained by analyzing the input text is arranged in time-series order. As the utterance information (language features), for example, context information for the unit used to classify speech, such as phonemes, semi-phonemes, and syllables, is used.
[0016] FIG. 2 is a diagram showing an example of the vector representation of the context information according to the first embodiment. FIG. 2 is an example of the vector representation of the context information when a phoneme is used as the speech unit, and a sequence of these vector representations is used as the language feature sequence 101.
[0017] The vector representation in FIG. 2 includes a phoneme, phoneme type information, accent type, position within an accent phrase, word-ending information, and part-of-speech information. The phoneme is a one-hot vector identifying the phoneme. The phoneme type information is flag information indicating the type of the phoneme, that is, whether the phoneme is voiced or voiceless, along with more detailed phoneme type attributes.
[0018] The accent type is a numerical value indicating the accent type of the phoneme. The position within an accent phrase is a numerical value indicating the position of the phoneme within the accent phrase. The word-ending information is a one-hot vector indicating the word-ending information of the phoneme. The part-of-speech information is a one-hot vector indicating the part of speech of the phoneme.
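As an illustration only, the vector representation described in paragraphs [0017] and [0018] could be assembled as in the following sketch. The phoneme, word-ending, and part-of-speech inventories, the class name PhonemeContext, and the use of a single voiced/voiceless flag are assumptions made for the example and are not taken from the embodiment.

```python
from dataclasses import dataclass

# Hypothetical inventories; a real system would define its own sets.
PHONEMES = ["a", "i", "u", "e", "o", "k", "s", "t", "n", "pau"]
ENDINGS = ["none", "declarative", "interrogative"]
PARTS_OF_SPEECH = ["noun", "verb", "particle", "other"]


def one_hot(item, inventory):
    """Return a list with 1.0 at the index of item and 0.0 elsewhere."""
    vec = [0.0] * len(inventory)
    vec[inventory.index(item)] = 1.0
    return vec


@dataclass
class PhonemeContext:
    phoneme: str
    voiced: bool             # phoneme type flag (assumed: one flag only)
    accent_type: int         # numerical value
    position_in_phrase: int  # numerical value
    ending: str
    part_of_speech: str

    def to_vector(self):
        """Concatenate all fields into one context vector."""
        return (
            one_hot(self.phoneme, PHONEMES)
            + [1.0 if self.voiced else 0.0]
            + [float(self.accent_type), float(self.position_in_phrase)]
            + one_hot(self.ending, ENDINGS)
            + one_hot(self.part_of_speech, PARTS_OF_SPEECH)
        )


def language_feature_sequence(contexts):
    """Arrange the context vectors in time-series order."""
    return [c.to_vector() for c in contexts]
```

With these assumed inventories each vector has 20 elements: 10 for the phoneme, 1 type flag, 2 numerical values, 3 for the word ending, and 4 for the part of speech.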
[0019] Note that information other than the vector representation sequence in Figure 2 may be used as the language feature sequence 101. For example, the input text may be converted into a sequence of symbols such as the symbols for Japanese text-to-speech synthesis defined in the JEITA standard IT-4006, each symbol may be converted into a one-hot vector as speech information, and the sequence of these one-hot vectors arranged in chronological order may be used as the language feature sequence 101.
[0020] Returning to Figure 1, the first processing unit 2 comprises an encoder 21 and a prosodic feature decoder 22. The encoder 21 converts the language feature sequence 101 into an intermediate representation sequence 102.
[0021] As described above, the intermediate representation sequence 102 is a latent variable in the speech synthesizer 10 and contains information for obtaining the prosodic features 103 and speech waveform 104 in the subsequent prosodic feature decoder 22 and second processing unit 3, etc. Each vector included in the intermediate representation sequence 102 represents an intermediate representation. The sequence length of the intermediate representation sequence 102 is determined by the sequence length of the language feature sequence 101, but it does not need to match the sequence length of the language feature sequence 101. For example, multiple intermediate representations may correspond to one language feature.
[0022] The prosodic feature decoder 22 generates prosodic features 103 from the intermediate representation sequence 102.
[0023] The prosodic features 103 are features related to prosody, such as speech rate, pitch, and intonation, and include the number of continuous speech frames for each vector included in the intermediate representation sequence 102, and the pitch features in each speech frame. Here, a speech frame is a waveform extraction unit when analyzing a speech waveform to obtain acoustic features, and during synthesis, the speech waveform 104 is synthesized from the acoustic features generated for each speech frame. In the first embodiment, the interval between each speech frame is a fixed time length. The number of continuous speech frames represents the number of speech frames included in the speech section corresponding to each vector included in the intermediate representation sequence 102. Examples of pitch features include the fundamental frequency and the logarithm of the fundamental frequency.
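The relationship between the intermediate representation sequence and the number of continuous speech frames can be sketched as follows. This is a minimal illustration, assuming that each intermediate representation is simply repeated for its frame count and that the fixed frame interval is 5 ms; the function names and the 5 ms value are hypothetical.

```python
def expand_by_frame_counts(intermediate, frame_counts):
    """Repeat each intermediate vector for its number of continuous speech frames."""
    if len(intermediate) != len(frame_counts):
        raise ValueError("one frame count is required per intermediate vector")
    frames = []
    for vector, count in zip(intermediate, frame_counts):
        frames.extend([vector] * count)
    return frames


def section_durations(frame_counts, frame_shift_s=0.005):
    """Duration in seconds of the speech section for each vector.

    frame_shift_s is an assumed fixed frame interval (5 ms).
    """
    return [count * frame_shift_s for count in frame_counts]
```

For example, two intermediate vectors with frame counts 2 and 3 expand to five frame-level vectors, to which per-frame pitch features can then be attached.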
[0024] In addition to the above example, the prosodic feature 103 may also include the gain in each audio frame and the duration of each vector included in the intermediate representation sequence 102.
[0025] The second processing unit 3 includes a speech waveform decoder 31 that sequentially generates a speech waveform 104 from the intermediate representation sequence 102 and the prosodic features 103, and sequentially outputs the speech waveform 104. Here, sequential generation and output means dividing the intermediate representation sequence 102 into small sections from its beginning, performing waveform generation one section at a time, and outputting the speech waveform 104 for each section. For example, the speech waveform 104 is generated and output in units of a predetermined number of samples (a predetermined data length) chosen by the user. Because the computation for waveform generation is divided into sections, the speech for each section can be output and played back without waiting for the speech waveform 104 for the entire input text to be generated.
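Sequential generation and output in units of a predetermined number of samples can be sketched with a generator. This is only an illustration: synthesize_frame stands in for the speech waveform decoder, and hold_value is a hypothetical placeholder that does not perform real synthesis.

```python
def sequential_waveform(frames, samples_per_frame, chunk_samples, synthesize_frame):
    """Yield the waveform in chunks of chunk_samples samples, in time order.

    Frames are consumed from the beginning, and a chunk is emitted as soon as
    enough samples are available. The last chunk may be shorter.
    """
    buffer = []
    for frame in frames:
        buffer.extend(synthesize_frame(frame, samples_per_frame))
        while len(buffer) >= chunk_samples:
            yield buffer[:chunk_samples]
            buffer = buffer[chunk_samples:]
    if buffer:
        yield buffer


def hold_value(frame, n):
    """Hypothetical stand-in for waveform generation: repeat one value n times."""
    return [frame] * n
```

Because the function is a generator, the caller receives the first chunk after only as many frames have been processed as that chunk requires, rather than after the whole input has been synthesized.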
[0026] Specifically, the audio waveform decoder 31 comprises a spectral feature generation unit 311 and a waveform generation unit 312. The spectral feature generation unit 311 generates spectral features from the intermediate representation sequence 102 and the prosodic features 103.
[0027] Spectral features are features that represent the spectral characteristics of the speech waveform of each speech frame. The acoustic features necessary for speech synthesis consist of prosodic features and spectral features. Spectral features include information on the spectral envelope, which represents vocal tract characteristics such as the formant structure of the speech, and information on aperiodicity indices, which represent the mixing ratio of noise components excited by breath sounds, etc., and harmonic components excited by vocal cord vibrations. For example, spectral envelope information can include the Mel-cepstrum and Mel-linear spectral pairs. An example of aperiodicity indices can be the band aperiodicity index. In addition, waveform reproducibility may be improved by including features related to the phase spectrum in the spectral features.
[0028] For example, the spectral feature generation unit 311 generates spectral features for a number of audio frames corresponding to a predetermined number of samples, in chronological order, from the intermediate representation sequence 102 and the prosodic features 103.
[0029] The waveform generation unit 312 generates a synthesized waveform (speech waveform 104) by performing speech synthesis processing using the spectral features. For example, the waveform generation unit 312 sequentially generates the speech waveform 104 by generating a predetermined number of samples of the speech waveform 104 at a time, in chronological order, from the spectral features. Because the speech waveform 104 can be synthesized in chronological order in units of, for example, a user-specified number of samples, the response time until the speech waveform 104 is generated improves. The waveform generation unit 312 may also use the prosodic features 103 as needed when synthesizing the speech waveform 104.
[0030] [Examples of speech synthesis methods] Figure 3 is a flowchart illustrating an example of a speech synthesis method according to the first embodiment. First, the analysis unit 1 analyzes the input text and outputs a linguistic feature sequence 101 containing one or more vectors representing linguistic features (step S1). For example, the analysis unit 1 performs morphological analysis on the input text to obtain linguistic information necessary for speech synthesis, such as reading information and accent information, and outputs a linguistic feature sequence 101 from the obtained reading information and linguistic information. Alternatively, for example, the analysis unit 1 may create a linguistic feature sequence 101 from pre-created, modified reading and accent information for the input text.
[0031] Next, the first processing unit 2 outputs the intermediate representation sequence 102 and the prosodic features 103 by performing the processing in steps S2 and S3. Specifically, first, the encoder 21 converts the language feature sequence 101 into the intermediate representation sequence 102 (step S2). Subsequently, the prosodic feature decoder 22 generates the prosodic features 103 from the intermediate representation sequence 102 (step S3).
[0032] Next, the audio waveform decoder 31 of the second processing unit 3 performs the processing in steps S4 to S6. First, the spectral feature generation unit 311 generates the required amount of spectral features from the intermediate representation sequence 102 and the necessary prosodic features 103, such as the number of continuous audio frames for each vector included in the intermediate representation sequence 102 to be processed (step S4). Subsequently, the waveform generation unit 312 generates the required amount of audio waveform 104 using the spectral features (step S5). By performing processing such as playback and saving on the audio waveform 104 generated by the processing in step S5 asynchronously with respect to the second processing unit 3, the delay until playback starts due to waveform generation can be suppressed.
[0033] If the synthesis of all audio waveforms 104 is not complete (step S6, No), the process returns to step S4. By repeatedly executing steps S4 and S5, the entire audio waveform 104 can be generated. If the synthesis of all audio waveforms 104 is complete (step S6, Yes), the process ends.
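The loop over steps S4 to S6, combined with playback that runs asynchronously with respect to waveform generation, can be sketched with a producer and a consumer thread. This is a minimal illustration under assumptions: generate_chunks and play_chunk are hypothetical callables supplied by the caller, and None is used as an end-of-stream marker.

```python
import queue
import threading


def run_pipeline(generate_chunks, play_chunk):
    """Generate waveform chunks while earlier chunks are being played.

    generate_chunks: callable returning an iterable of waveform chunks
                     (each iteration corresponds to steps S4 and S5).
    play_chunk:      callable that plays or saves one chunk.
    """
    chunks = queue.Queue()

    def consumer():
        while True:
            chunk = chunks.get()
            if chunk is None:  # end-of-stream marker
                break
            play_chunk(chunk)

    player = threading.Thread(target=consumer)
    player.start()
    for chunk in generate_chunks():  # ends when synthesis is complete (S6, Yes)
        chunks.put(chunk)
    chunks.put(None)
    player.join()
```

The queue decouples the two sides, so generation of the next chunk does not wait for playback of the previous one to finish.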
[0034] Next, we will describe the details of each part of the speech synthesis device 10 of the first embodiment. [Details of each part] In the speech synthesis device 10 shown in Figure 1, the encoder 21 converts the language feature sequence 101 into an intermediate representation sequence 102 using a first neural network. By using structures such as a recurrent structure, a convolutional structure, and a self-attention mechanism that can process time series as the neural network, information about the preceding and succeeding sequences can be provided to the intermediate representation sequence 102.
[0035] Figure 4 shows an example of the functional configuration of the prosodic feature decoder 22 of the first embodiment. The prosodic feature decoder 22 of the first embodiment includes a continuous speech frame count generation unit 221 and a pitch feature generation unit 222.
[0036] The continuous audio frame count generation unit 221 generates the continuous audio frame count for each vector included in the intermediate representation sequence 102.
[0037] The pitch feature generation unit 222 generates pitch features for each audio frame from the intermediate representation sequence 102 based on the number of continuous audio frames for each vector. In addition, the prosodic feature decoder 22 may generate, for example, the gain for each audio frame.
[0038] The processing in the continuous speech frame count generation unit 221 and the pitch feature generation unit 222 uses neural networks included in the second neural network. The neural network used by the pitch feature generation unit 222 has, for example, a structure that can process time series, such as a recurrent structure, a convolutional structure, or a self-attention mechanism. This makes it possible to obtain pitch features for each speech frame that take preceding and succeeding information into account, thereby increasing the smoothness of the synthesized speech.
[0039] [Example of a method for generating prosodic features] Figure 5 is a flowchart showing an example of a method for generating prosodic features 103 in the first embodiment. First, the continuous speech frame count generation unit 221 generates the continuous speech frame count for each vector included in the intermediate representation sequence 102 (step S11). Next, the pitch feature generation unit 222 generates pitch features for each speech frame (step S12).
[0040] Furthermore, in the speech synthesis device 10 shown in Figure 1, the spectral feature generation unit 311 of the speech waveform decoder 31 of the second processing unit 3 generates the amount of spectral features necessary for the sequential generation of the speech waveform 104 from the intermediate representation sequence 102 and the prosodic features 103 using a neural network included in the third neural network. As the neural network, for example, a neural network having at least one of a recurrent structure and a convolutional structure is used. Specifically, by using a unidirectional gated recurrent unit (GRU) and a causal convolutional structure as the neural network, smooth spectral features can be generated without processing all speech frames. In addition, spectral features that reflect the time-series structure can be obtained, so smooth synthesized speech can be produced.
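The reason a causal structure suits sequential generation can be illustrated with a one-dimensional causal convolution: each output depends only on the current and past inputs, so processing the input chunk by chunk gives the same result as processing it all at once. The sketch below is a plain-Python illustration with hypothetical names and is not the network of the embodiment.

```python
def causal_conv(signal, kernel):
    """y[t] = sum over k of kernel[k] * x[t - k], with x[t] = 0 for t < 0."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * signal[t - k]
        out.append(acc)
    return out


class StreamingCausalConv:
    """Same filter, applied chunk by chunk while keeping only past inputs."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.history = [0.0] * (len(self.kernel) - 1)

    def process(self, chunk):
        padded = self.history + list(chunk)
        offset = len(self.history)
        out = []
        for t in range(len(chunk)):
            acc = 0.0
            for k, w in enumerate(self.kernel):
                acc += w * padded[offset + t - k]
            out.append(acc)
        keep = len(self.kernel) - 1
        self.history = padded[len(padded) - keep:] if keep else []
        return out
```

The streaming version keeps only the last len(kernel) - 1 inputs as state, which plays a role comparable to the hidden state carried forward by a unidirectional recurrent unit.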
[0041] The waveform generation unit 312 of the second processing unit 3 synthesizes the amount of audio waveform 104 necessary for sequential generation using signal processing or a vocoder provided by a neural network included in the third neural network. When using a neural network, for example, the waveform can be generated by a neural vocoder such as WaveNet proposed in Non-Patent Document 2.
[0042] As described above, the speech synthesis device 10 of the first embodiment comprises an analysis unit 1, a first processing unit 2, and a second processing unit 3. The analysis unit 1 analyzes the input text and generates a language feature sequence 101 that includes one or more vectors representing language features. In the first processing unit 2, the encoder 21 converts the language feature sequence 101 into an intermediate representation sequence 102 that includes one or more vectors representing latent variables using a first neural network. The prosodic feature decoder 22 generates prosodic features 103 from the intermediate representation sequence 102. In the second processing unit 3, the speech waveform decoder 31 sequentially generates a speech waveform 104 from the intermediate representation sequence 102 and the prosodic features 103.
[0043] As a result, the speech synthesizer 10 of the first embodiment improves the response time until waveform generation. Specifically, in the speech synthesizer 10 of the first embodiment, processing is divided into a first processing unit 2 and a second processing unit 3. The first processing unit 2 outputs an intermediate representation sequence 102 and prosodic features 103 in advance, and the second processing unit 3 outputs speech waveforms 104 sequentially. This makes it possible to output the next speech waveform 104 while the previous speech waveform 104 is being played. Therefore, the response time is reduced to the time until playback of the initial speech waveform 104 can start, which is an improvement over conventional techniques that obtain all acoustic features and speech waveforms 104 at once.
[0044] (Second Embodiment) Next, a second embodiment will be described. In the description of the second embodiment, explanations similar to those of the first embodiment will be omitted, and the differences from the first embodiment will be described.
[0045] [Example of functional configuration] Figure 6 shows an example of the functional configuration of the speech synthesis device 10-2 of the second embodiment. In the speech synthesis device 10-2 of the second embodiment, the first processing unit 2-2 further includes a processing unit 23. This makes it possible to perform detailed processing on the prosodic features 103 of the entire input text before processing by the second processing unit 3 to obtain the speech waveform 104.
[0046] When the processing unit 23 receives a processing instruction for the prosodic feature 103, it reflects that processing instruction in the prosodic feature 103. The processing instruction is received, for example, from user input.
[0047] A processing instruction is an instruction to change the value of each prosodic feature 103. For example, a processing instruction is an instruction to change the value of the pitch feature in each audio frame of a certain interval. Specifically, a processing instruction is an instruction to change the pitch from the 2nd frame to the 10th frame to 300Hz. Another example is a processing instruction to change the number of continuous audio frames for each vector included in the intermediate representation sequence 102. Another example is a processing instruction to change the number of continuous audio frames for the 17th intermediate representation included in the intermediate representation sequence 102 to 30.
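The two example instructions above can be sketched as simple edits on lists. The sketch assumes that frames and intermediate representations are numbered from 1 and that the frame range is inclusive; the function names are hypothetical.

```python
def set_pitch(pitch_hz, first_frame, last_frame, value_hz):
    """Return a copy with frames first_frame..last_frame set to value_hz.

    Frame numbers are assumed to be 1-based and inclusive.
    """
    edited = list(pitch_hz)
    for i in range(first_frame - 1, last_frame):
        edited[i] = value_hz
    return edited


def set_frame_count(frame_counts, index, new_count):
    """Return a copy with the frame count of the index-th vector (1-based) changed."""
    edited = list(frame_counts)
    edited[index - 1] = new_count
    return edited


# Usage with the values from the text above.
pitch = [200.0] * 12
pitch = set_pitch(pitch, 2, 10, 300.0)   # frames 2 to 10 become 300 Hz
counts = [5] * 20
counts = set_frame_count(counts, 17, 30)  # 17th intermediate representation
```

Both functions return new lists, so the prosodic features generated by the prosodic feature decoder remain available if an instruction needs to be undone.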
[0048] In addition to the above example, the processing instruction may also be an instruction to project the prosodic features 103 of the spoken audio of the input text. Specifically, the processing unit 23 uses the spoken audio of the input text that has been prepared in advance. The processing unit 23 then receives an instruction to project the prosodic features 103 generated from the input text by the analysis unit 1, encoder 21, and prosodic feature decoder 22 so as to match the prosodic features of the spoken audio. In this case, the desired processing result can be obtained without directly manipulating the values of the prosodic features 103 generated from the input text.
[0049] The second processing unit 3 receives the prosodic feature 103 generated by the prosodic feature decoder 22, or the prosodic feature 103 processed by the processing unit 23.
[0050] [Examples of speech synthesis methods] Figure 7 is a flowchart showing an example of a speech synthesis method according to the second embodiment. First, the analysis unit 1 analyzes the input text and outputs a language feature sequence 101 that includes one or more vectors representing language features (step S21). Next, the first processing unit 2-2 obtains an intermediate representation sequence 102 and prosodic features 103 from the language feature sequence 101 (step S22).
[0051] Next, the processing unit 23 determines whether or not to process the prosodic features 103 (step S23). The determination of whether or not to process the prosodic features 103 is made, for example, based on whether or not there are any unprocessed processing instructions for the prosodic features 103. Processing instructions are made, for example, by displaying values such as the pitch features and the duration of each phoneme generated based on the prosodic features 103 on a display device and editing the values by mouse operation or the like.
[0052] If the prosodic feature 103 is not processed (step S23, No), the process proceeds to step S25.
[0053] If the prosodic feature 103 is to be processed (step S23, Yes), the processing unit 23 reflects the processing instructions to the prosodic feature 103 (step S24). If it is necessary to regenerate the prosodic feature 103, such as when changing the number of continuous audio frames for each vector included in the intermediate representation sequence 102, the prosodic feature decoder 22 regenerates the prosodic feature 103. Processing of the prosodic feature 103 is repeated as long as the system accepts processing instructions from the user.
[0054] Next, the second processing unit 3 (audio waveform decoder 31) sequentially outputs the audio waveform 104 (step S25). The details of the process in step S25 are the same as in the first embodiment, so the explanation is omitted.
[0055] Next, the waveform generation unit 312 determines whether or not to reprocess the prosodic features 103 in order to resynthesize the audio waveform 104 (step S26). If the prosodic features 103 are to be reprocessed (step S26, Yes), the process returns to step S24. For example, if the desired audio waveform 104 is not obtained, the system accepts further processing instructions from the user and returns to the process in step S24.
[0056] If the prosodic feature 103 is not processed again (step S26, No), the process ends.
[0057] [Details of processing] The details of the processing when the processing is prosodic projection are described below. When the processing unit 23 receives a projection instruction for the prosodic features 103 of the utterance of the input text, the following processing is performed in step S24. First, the processing unit 23 analyzes the utterance and obtains the prosodic features 103. Of the prosodic features 103, the duration of each phoneme is obtained by performing phoneme alignment according to the content of the utterance and performing phoneme boundary extraction. In addition, the pitch features in each speech frame are obtained by performing acoustic feature extraction of the utterance. Next, the processing unit 23 changes the number of continuous speech frames of each vector included in the intermediate representation sequence 102 based on the phoneme duration obtained from the utterance. Then, the processing unit 23 changes the pitch features in each speech frame to match the pitch features extracted from the utterance. Other features included in the prosodic features 103 are similarly changed to match the features obtained by analyzing the utterance.
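Converting the phoneme durations obtained from the utterance into numbers of continuous speech frames can be sketched as follows. The sketch assumes a fixed 5 ms frame interval and one intermediate representation per phoneme; both are assumptions for illustration. Rounding the cumulative phoneme boundaries, rather than each duration separately, keeps the total frame count consistent with the total duration.

```python
def durations_to_frame_counts(durations_s, frame_shift_s=0.005):
    """Convert per-phoneme durations in seconds to per-phoneme frame counts.

    frame_shift_s is an assumed fixed frame interval (5 ms).
    """
    counts = []
    elapsed = 0.0
    previous_boundary = 0
    for duration in durations_s:
        elapsed += duration
        boundary = round(elapsed / frame_shift_s)  # phoneme end, in frames
        counts.append(boundary - previous_boundary)
        previous_boundary = boundary
    return counts
```

The resulting counts can then replace the numbers of continuous speech frames generated from the input text.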
[0058] Figure 8 is a diagram illustrating an example of processing by the processing unit 23 of the second embodiment. The example in Figure 8 is an example of processing when the processing unit 23 receives a projection instruction for the pitch features of the utterance of the input text. Pitch features 105 represent the pitch features generated by the prosodic feature decoder 22. Pitch features 106 represent the pitch features of the utterance of the input text (for example, the user's utterance). Pitch features 107 represent the pitch features generated by the processing unit 23. For example, the processing unit 23 generates pitch features 107 by processing pitch features 106 so that their maximum and minimum values (or mean and variance) match the maximum and minimum values (or mean and variance) of pitch features 105.
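The mean-and-variance variant of this processing can be sketched as a linear rescaling. This is an illustration under assumptions: every frame is treated as voiced, the values may be either fundamental frequencies or their logarithms, and the function name project_pitch is hypothetical.

```python
from statistics import fmean, pstdev


def project_pitch(utterance_pitch, generated_pitch):
    """Rescale the utterance contour to the mean and variance of the generated one.

    utterance_pitch corresponds to pitch features 106 and generated_pitch to
    pitch features 105; the return value corresponds to pitch features 107.
    """
    src_mean, src_std = fmean(utterance_pitch), pstdev(utterance_pitch)
    dst_mean, dst_std = fmean(generated_pitch), pstdev(generated_pitch)
    if src_std == 0.0:
        # A flat contour has no shape to preserve; use the target mean.
        return [dst_mean] * len(utterance_pitch)
    scale = dst_std / src_std
    return [dst_mean + (p - src_mean) * scale for p in utterance_pitch]
```

The shape of the utterance contour is preserved while its overall level and range follow the pitch generated from the input text.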
[0059] As explained above, in the speech synthesis device 10-2 of the second embodiment, the first processing unit 2-2 outputs prosodic features 103, and the processing unit 23 reflects the user's processing instructions. That is, since the prosodic features 103 for the entire input text are output before the generation of the speech waveform 104, it becomes possible to perform detailed processing on the entire input text before waveform generation. In conventional technology, when all acoustic features and speech waveforms 104 are output sequentially as a means of improving response time, it was difficult to perform detailed processing on the prosodic features 103 for the entire input text.
[0060] In the speech synthesizer 10-2 of the second embodiment, detailed processing of the pitch of each speech frame of the entire input text becomes possible before the processing by the second processing unit 3 that obtains the speech waveform 104. As a result, the second processing unit 3 can synthesize a speech waveform 104 that reflects the detailed processing instructions for the prosodic features 103 given by the user.
[0061] (Third embodiment) Next, the third embodiment will be described. In the description of the third embodiment, explanations similar to those of the first embodiment will be omitted, and the differences from the first embodiment will be described.
[0062] [Example of functional configuration] Figure 9 shows an example of the functional configuration of the speech synthesis device 10-3 of the third embodiment. In the speech synthesis device 10-3 of the third embodiment, speech frames are determined based on pitch. Specifically, the interval between speech frames is set to the pitch period. This makes it possible to apply precise speech analysis by pitch synchronization analysis in the third embodiment.
[0063] The speech synthesis device 10-3 of the third embodiment comprises an analysis unit 1, a first processing unit 2-3, and a second processing unit 3. The first processing unit 2-3 comprises an encoder 21 and a prosodic feature decoder 22. The prosodic feature decoder 22 comprises a continuous speech frame count generation unit 221 and a pitch feature generation unit 222.
[0064] Figure 10 shows an example of the functional configuration of the continuous audio frame count generation unit 221 of the third embodiment. The continuous audio frame count generation unit 221 of the third embodiment includes a coarse pitch generation unit 2211, a duration generation unit 2212, and a calculation unit 2213.
[0065] The coarse pitch generation unit 2211 generates the average pitch feature of each vector included in the intermediate representation sequence 102. The duration generation unit 2212 generates the duration of each vector included in the intermediate representation sequence 102. The average pitch feature and duration represent the average of the pitch features in each audio frame included in the audio interval corresponding to each vector, and the duration of the audio interval.
[0066] The calculation unit 2213 calculates the number of pitch waveforms for each vector included in the intermediate representation sequence 102 from the average pitch feature and duration of that vector.
[0067] A pitch waveform is the unit of waveform extraction from an audio frame in pitch synchronization analysis.
[0068] Figure 11 shows an example of a pitch waveform in the third embodiment. The pitch waveform is obtained as follows. First, the waveform generation unit 312 creates pitch mark information 108 representing the center time of each period of the periodic speech waveform 104 from the pitch features in each speech frame included in the prosodic features 103.
[0069] Next, the waveform generation unit 312 sets the position of the pitch mark information 108 as the center position and synthesizes the audio waveform 104 based on the pitch period. By synthesizing with the position of the appropriately assigned pitch mark information 108 as the center time, it becomes possible to synthesize appropriately even in response to local changes in the audio waveform 104, thereby reducing sound quality degradation.
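One plausible way to derive pitch mark times from per-frame pitch is to accumulate pitch periods. The following Python sketch assumes that the pitch feature of each frame is a fundamental frequency in Hz and that consecutive marks are one pitch period apart; the function name and the choice of start time are hypothetical, and the embodiment does not state how the pitch mark information 108 is actually computed.

```python
def pitch_marks_from_f0(f0_per_frame_hz, start_time_sec=0.0):
    """Return one pitch mark time (seconds) per frame.

    Assumption: each frame advances time by its own pitch period 1 / f0.
    """
    marks = []
    t = start_time_sec
    for f0 in f0_per_frame_hz:
        if f0 <= 0:
            raise ValueError("this sketch handles voiced frames only (f0 > 0)")
        marks.append(t)
        t += 1.0 / f0
    return marks


# Hypothetical contour: two frames at 100 Hz followed by one at 200 Hz.
marks = pitch_marks_from_f0([100.0, 100.0, 200.0])
```

With this reading, a higher pitch places the marks closer together, which is why the number of frames in an interval depends on its pitch.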
[0070] However, even within intervals of the same duration, intervals with higher pitches have more pitch waveforms, while intervals with lower pitches have fewer pitch waveforms, resulting in different numbers of audio frames in each interval. Therefore, the calculation unit 2213 does not directly calculate the number of continuous audio frames (number of pitch waveforms) for each vector included in the intermediate representation sequence 102, but rather calculates it from the duration of that vector and the average pitch feature.
[0071] [Examples of speech synthesis methods] Figure 12 is a flowchart showing an example of a speech synthesis method according to the third embodiment. First, the analysis unit 1 analyzes the input text and outputs a language feature sequence 101 containing one or more vectors representing language features (step S31). Next, the encoder 21 converts the language feature sequence 101 into an intermediate representation sequence 102 (step S32).
[0072] Next, the continuous speech frame count generation unit 221 generates the continuous speech frame count for each vector included in the intermediate representation sequence 102 (step S33). Next, the pitch feature generation unit 222 generates pitch features for each speech frame (step S34).
[0073] Next, the second processing unit 3 (speech waveform decoder 31) sequentially outputs a speech waveform 104 from the intermediate representation sequence 102 and the prosodic features 103 (step S35).
[0074] [Details of the process for generating the number of continuous audio frames] Figure 13 is a diagram illustrating an example of processing by the continuous audio frame count generation unit 221 of the third embodiment. First, the coarse pitch generation unit 2211 generates the average pitch feature of each vector included in the intermediate representation sequence 102 (step S41). Next, the duration generation unit 2212 generates the duration of each vector included in the intermediate representation sequence 102 (step S42). Note that the execution order of steps S41 and S42 may be reversed.
[0075] Next, the calculation unit 2213 calculates the number of pitch waveforms for each vector from the average pitch feature and duration of each vector included in the intermediate representation sequence 102 (step S43). The number of pitch waveforms obtained in step S43 is output as the number of continuous audio frames.
[0076] [Details of each part] The coarse pitch generation unit 2211 and the duration generation unit 2212 each use a neural network included in the second neural network to generate, from the intermediate representation sequence 102, the average pitch feature and the duration of each vector included in the intermediate representation sequence 102. Examples of neural network structures include multilayer perceptrons, convolutional structures, and recurrent structures. In particular, using a convolutional or recurrent structure allows time-series information to be reflected in the average pitch feature and duration.
[0077] The calculation unit 2213 calculates the number of pitch waveforms for each vector from the average pitch feature and duration of each vector included in the intermediate representation sequence 102. For example, if the average pitch feature of a vector (intermediate representation) in the intermediate representation sequence 102 is the average of the fundamental frequencies f (Hz) and the duration is d (seconds), then the number of pitch waveforms n for this vector (intermediate representation) is calculated as n = f × d.
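The relation n = f × d can be checked with a few lines of Python. The sketch below is illustrative only: the function name is hypothetical, and rounding the product to the nearest integer is an assumption, since the text gives the product without stating how a fractional result is handled.

```python
def pitch_waveform_count(mean_f0_hz, duration_sec):
    """Number of pitch waveforms n = f * d for one intermediate representation.

    Assumption: the product is rounded to the nearest integer.
    """
    if mean_f0_hz < 0 or duration_sec < 0:
        raise ValueError("mean_f0_hz and duration_sec must be non-negative")
    return round(mean_f0_hz * duration_sec)


# Two intervals of the same duration (0.1 s) but different average pitch.
n_high = pitch_waveform_count(200.0, 0.1)
n_low = pitch_waveform_count(100.0, 0.1)
```

The two calls show the effect noted in paragraph [0070]: for the same duration, the interval with the higher pitch yields more pitch waveforms.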
[0078] The pitch feature generation unit 222 may use the average pitch feature of each vector included in the intermediate representation sequence 102, in addition to the intermediate representation sequence 102, to determine the pitch in each audio frame. By doing so, the difference between the average pitch feature generated by the coarse pitch generation unit 2211 and the actually generated pitch is reduced, and it is expected that synthesized speech (audio waveform 104) with a duration close to that generated by the duration generation unit 2212 can be obtained.
[0079] As explained above, in the speech synthesis device 10-3 of the third embodiment, the processing is divided into a first processing unit 2-3 that generates prosodic features 103 and a second processing unit 3 that generates spectral features and speech waveforms 104, etc. Furthermore, the speech frame is determined based on pitch. As a result, the speech synthesis device 10-3 of the third embodiment makes it possible to utilize precise speech analysis by pitch synchronization analysis, improving the quality of the synthesized speech (speech waveform 104).
[0080] (Fourth embodiment) Next, the fourth embodiment will be described. In the description of the fourth embodiment, explanations similar to those of the first embodiment will be omitted, and the differences from the first embodiment will be described.
[0081] [Example of functional configuration] Figure 14 shows an example of the functional configuration of the speech synthesis device 10-4 of the fourth embodiment. The speech synthesis device 10-4 of the fourth embodiment comprises an analysis unit 1, a first processing unit 2-4, a second processing unit 3, a speaker identification information conversion unit 4, and a style identification information conversion unit 5. The first processing unit 2-4 comprises an encoder 21, a prosodic feature decoder 22, and an assignment unit 24.
[0082] In the speech synthesis device 10-4 of the fourth embodiment, the speaker identification information conversion unit 4, the style identification information conversion unit 5, and the assignment unit 24 reflect the speaker identification information and style identification information in the synthesized speech (speech waveform 104). As a result, the speech synthesis device 10-4 of the fourth embodiment can obtain synthesized speech with multiple speakers and styles.
[0083] Speaker identification information identifies the speaker given as input. For example, the speaker identification information is given as "Speaker No. 2" (a speaker identified by a number) or "the speaker of this audio" (a speaker indicated by speech).
[0084] Style identification information identifies the speaking style (e.g., emotion). For example, the style identification information is given as "Style 1" (a style identified by a number) or "the style of this voice" (a style indicated by speech).
[0085] The speaker identification information conversion unit 4 converts the speaker identification information into a speaker vector that represents the speaker's feature information. The speaker vector is a vector that allows the speech synthesis device 10-4 to use the speaker identification information. For example, if the speaker identification information specifies a speaker that the speech synthesis device 10-4 can synthesize, the speaker vector is the embedding vector corresponding to that speaker. If the speaker identification information is separately prepared speech of a speaker, the speaker vector is a vector, such as an i-vector, obtained from the acoustic features of that speech and a statistical model used for speaker identification, as proposed in Non-Patent Document 3, for example.
[0086] The style identification information conversion unit 5 converts style identification information, which identifies the speaking style, into a style vector that represents the feature information of the style. Like the speaker vector, the style vector is a vector that allows the speech synthesis device 10-4 to use the style identification information. For example, if the style identification information specifies a style that the speech synthesis device 10-4 can synthesize, the style vector is the embedding vector corresponding to that style. If the style identification information is separately prepared speech in that style, the style vector is a vector obtained by converting the acoustic features of that speech with a neural network or the like, for example, the Global Style Tokens (GST) proposed in Non-Patent Document 4.
[0087] The assignment unit 24 assigns feature information, such as speaker vectors and style vectors, to the intermediate representation sequence 102 obtained by the encoder 21.
[0088] [Examples of speech synthesis methods] Figure 15 is a flowchart showing an example of a speech synthesis method according to the fourth embodiment. First, the analysis unit 1 analyzes the input text and outputs a language feature sequence 101 that includes one or more vectors representing language features (step S51). Next, the speaker identification information conversion unit 4 converts the speaker identification information into a speaker vector using the method described above (step S52). Next, the style identification information conversion unit 5 converts the style identification information into a style vector using the method described above (step S53). Note that the execution order of steps S52 and S53 may be reversed.
[0089] Next, the assignment unit 24 assigns information such as speaker vectors and style vectors to the intermediate representation sequence 102, and the prosodic feature decoder 22 generates prosodic features 103 from the intermediate representation sequence 102 (step S54). Then, the second processing unit 3 (speech waveform decoder 31) sequentially outputs the speech waveform 104 from the intermediate representation sequence 102 and the prosodic features 103 (step S55).
[0090] [Details of the processing in the first processing unit] Figure 16 is a diagram illustrating an example of processing by the first processing unit 2-4 of the fourth embodiment. First, the encoder 21 converts the language feature sequence 101 into an intermediate representation sequence 102 (step S61).
[0091] Next, the assignment unit 24 assigns information such as speaker vectors and style vectors to the intermediate representation sequence 102 (step S62).
[0092] There are several possible methods for assigning information in step S62. For example, information may be assigned to the intermediate representation sequence 102 by adding the speaker vector and the style vector to each vector (intermediate representation) included in the intermediate representation sequence 102.
[0093] Alternatively, information may be assigned to the intermediate representation sequence 102 by concatenating the speaker vector and the style vector with each vector (intermediate representation) included in the intermediate representation sequence 102. Specifically, the components of an n-dimensional vector (intermediate representation) are concatenated with the components of an m1-dimensional speaker vector and the components of an m2-dimensional style vector to form an (n+m1+m2)-dimensional vector.
[0094] Alternatively, for example, the intermediate representation sequence 102 combined with the speaker vector and the style vector may be further transformed into a more suitable vector representation by applying a linear transformation.
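The three assignment options in step S62 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, vectors are represented as lists of floats, the addition variant assumes that the speaker vector and style vector have the same dimension as the intermediate representation, and the weight and bias of the linear transformation would in practice be learned parameters.

```python
def assign_by_addition(sequence, speaker_vec, style_vec):
    """Add the speaker and style vectors element-wise to every vector."""
    out = []
    for vec in sequence:
        if not (len(vec) == len(speaker_vec) == len(style_vec)):
            raise ValueError("addition requires equal dimensions")
        out.append([v + s + t for v, s, t in zip(vec, speaker_vec, style_vec)])
    return out


def assign_by_concatenation(sequence, speaker_vec, style_vec):
    """Concatenate n, m1 and m2 components into an (n + m1 + m2)-dimensional vector."""
    return [list(vec) + list(speaker_vec) + list(style_vec) for vec in sequence]


def linear_transform(sequence, weight, bias):
    """Apply y = W x + b to every vector; weight holds one row per output dimension."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in sequence
    ]


# Hypothetical sizes: n = 2, m1 = 1, m2 = 2.
sequence = [[1.0, 2.0], [3.0, 4.0]]
added = assign_by_addition(sequence, [1.0, 1.0], [0.5, 0.5])
concatenated = assign_by_concatenation(sequence, [0.5], [0.1, 0.2])
# Example weight that projects the 5-dimensional vectors back to 2 dimensions.
projected = linear_transform(concatenated, [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]], [0.0, 0.0])
```

Concatenation keeps the speaker and style information separate from the linguistic content at the cost of a larger vector, while addition keeps the dimension unchanged.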
[0095] Next, the prosodic feature decoder 22 generates prosodic features 103 from the intermediate representation sequence 102 obtained in step S62 (step S63).
[0096] Since the intermediate representation sequence 102 obtained in step S62 and the prosodic features 103 generated in step S63 reflect the speaker and style information, the speech waveform 104 obtained by the subsequent second processing unit 3 has the characteristics of that speaker and that style.
[0097] Furthermore, when the waveform generation unit 312 of the audio waveform decoder 31 of the second processing unit 3 generates a waveform using the neural network included in the third neural network, that neural network may utilize the speaker vector and the style vector. By doing so, it is expected that the accuracy of reproduction of the speaker and style of the synthesized speech (audio waveform 104) will be improved.
[0098] As described above, the speech synthesis device 10-4 of the fourth embodiment receives speaker identification information and style identification information and reflects them in the speech waveform 104, thereby obtaining synthesized speech (speech waveform 104) of multiple speakers and styles.
[0099] (Modifications) The analysis unit 1 of the speech synthesis device 10 (10-2, 10-3, 10-4) of the first to fourth embodiments may divide the input text into multiple subtexts and output a language feature sequence 101 for each subtext. For example, if the input text consists of multiple sentences, it may be divided into subtexts sentence by sentence, and a language feature sequence 101 may be obtained for each subtext. If multiple language feature sequences 101 are output, the subsequent processing is performed on each language feature sequence 101. For example, the language feature sequences 101 may be processed sequentially in chronological order. Alternatively, multiple language feature sequences 101 may be processed in parallel.
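A sentence-based division with sequential or parallel processing might look like the following Python sketch. The regular expression, the function names, and the use of a thread pool are assumptions for illustration; the `process` argument is a hypothetical stand-in for the per-subtext processing, and the naive splitting rule would also split at abbreviations or decimal points.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Split after sentence-ending punctuation (ASCII and Japanese), dropping
# any whitespace that follows it.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_into_subtexts(text):
    """Divide the input text into sentence-like subtexts."""
    return [s for s in _SENTENCE_END.split(text) if s.strip()]


def process_subtexts(subtexts, process, parallel=False):
    """Apply `process` to each subtext, keeping the original order."""
    if parallel:
        with ThreadPoolExecutor() as pool:
            return list(pool.map(process, subtexts))
    return [process(s) for s in subtexts]


subtexts = split_into_subtexts("Hello world. How are you? Fine.")
# `len` is only a placeholder for the analysis and synthesis steps.
results = process_subtexts(subtexts, len, parallel=True)
```

Because `Executor.map` returns results in input order, the parallel variant still yields outputs in the chronological order of the subtexts.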
[0100] Furthermore, the neural networks used in the speech synthesis devices 10 (10-2, 10-3, 10-4) of the first to fourth embodiments are all trained using statistical methods. In this process, by training several neural networks simultaneously, the overall optimal parameters can be obtained.
[0101] For example, in the speech synthesis device 10 of the first embodiment, the neural network used in the first processing unit 2 and the neural network used in the spectral feature generation unit 311 may be optimized simultaneously. This allows the speech synthesis device 10 to utilize the optimal neural network for generating both prosodic features 103 and spectral features.
[0102] Finally, examples of the hardware configuration of the speech synthesis devices 10 (10-2, 10-3, 10-4) of the first to fourth embodiments will be described. The speech synthesis devices 10 (10-2, 10-3, 10-4) of the first to fourth embodiments can be realized, for example, by using any computer device as the basic hardware.
[0103] [Example hardware configuration] Figure 17 shows examples of the hardware configuration of the speech synthesis devices 10 (10-2, 10-3, 10-4) according to the first to fourth embodiments. The speech synthesis devices 10 (10-2, 10-3, 10-4) according to the first to fourth embodiments include a processor 201, a main memory 202, an auxiliary storage device 203, a display device 204, an input device 205, and a communication device 206. The processor 201, main memory 202, auxiliary storage device 203, display device 204, input device 205, and communication device 206 are connected via a bus 210.
[0104] Note that the speech synthesis device 10 (10-2, 10-3, 10-4) need not include all of the above-described components. For example, if the speech synthesis device 10 (10-2, 10-3, 10-4) can use the input and display functions of an external device, it need not include the display device 204 and the input device 205.
[0105] The processor 201 executes the program read from the auxiliary storage device 203 into the main memory 202. The main memory 202 is memory such as ROM or RAM. The auxiliary storage device 203 is, for example, an HDD (hard disk drive) or a memory card.
[0106] The display device 204 is, for example, a liquid crystal display. The input device 205 is an interface for operating the speech synthesis device 10 (10-2, 10-3, 10-4). The display device 204 and the input device 205 may be implemented by a touch panel or the like that has both display and input functions. The communication device 206 is an interface for communicating with other devices.
[0107] For example, a program executed by the speech synthesis device 10 (10-2, 10-3, 10-4) is provided as a computer program product, recorded as a file in an installable or executable format on a computer-readable storage medium such as a memory card, hard disk, CD-RW, CD-ROM, CD-R, DVD-RAM, or DVD-R.
[0108] Alternatively, for example, the program executed by the speech synthesis device 10 (10-2, 10-3, 10-4) may be stored on a computer connected to a network such as the Internet, and provided by being downloaded via the network.
[0109] Alternatively, for example, the system may be configured to provide the program executed by the speech synthesis device 10 (10-2, 10-3, 10-4) via a network such as the Internet without requiring downloads. Specifically, the system may be configured to execute speech synthesis processing using a so-called ASP (Application Service Provider) type service, where the server computer does not transfer the program, but instead implements the processing function only by issuing execution instructions and obtaining the results.
[0110] Alternatively, for example, the program for the speech synthesis device 10 (10-2, 10-3, 10-4) may be provided pre-installed in ROM or the like.
[0111] The program executed by the speech synthesis device 10 (10-2, 10-3, 10-4) has a module configuration that includes those functions, among the functional configurations described above, that can be implemented by a program. In terms of actual hardware, the processor 201 reads the program from the storage medium and executes it, whereby each of these functional blocks is loaded onto the main memory 202. In other words, each of these functional blocks is generated on the main memory 202.
[0112] Furthermore, some or all of the above-mentioned functions may be implemented using hardware such as ICs instead of software.
[0113] Alternatively, each function may be implemented using multiple processors 201, in which case each processor 201 may implement one of the functions, or two or more of the functions.
[0114] While several embodiments of the present invention have been described, these embodiments are presented as examples only and are not intended to limit the scope of the invention. These novel embodiments can be carried out in a variety of other forms, and various omissions, substitutions, and modifications can be made without departing from the spirit of the invention. These embodiments and their variations are included in the scope and spirit of the invention, as well as in the invention described in the claims and its equivalents.
[Explanation of symbols]
[0115]
1 Analysis unit
2 First processing unit
3 Second processing unit
4 Speaker identification information conversion unit
5 Style identification information conversion unit
10 Speech synthesis device
21 Encoder
22 Prosodic feature decoder
23 Processing unit
24 Assignment unit
31 Speech waveform decoder
311 Spectral feature generation unit
312 Waveform generation unit
201 Processor
202 Main memory
203 Auxiliary storage device
204 Display device
205 Input device
206 Communication device
210 Bus
221 Continuous speech frame count generation unit
222 Pitch feature generation unit
2211 Coarse pitch generation unit
2212 Duration generation unit
2213 Calculation unit
Claims
1. A speech synthesis device comprising:
an analysis unit that analyzes input text and generates a language feature sequence including one or more vectors representing language features;
a first processing unit; and
a second processing unit, wherein
the first processing unit comprises: an encoder that converts the language feature sequence into an intermediate representation sequence including one or more vectors representing latent variables using a first neural network; and a prosodic feature decoder that generates prosodic features from the intermediate representation sequence using a second neural network,
the second processing unit comprises a speech waveform decoder that sequentially generates a speech waveform from the intermediate representation sequence and the prosodic features using a third neural network,
the speech waveform decoder of the second processing unit comprises a spectral feature generation unit that generates, in chronological order, spectral features for a number of speech frames corresponding to a predetermined number of samples from the intermediate representation sequence and the prosodic features, using a neural network having a recurrent structure included in the third neural network,
the prosodic feature decoder comprises: a continuous speech frame count generation unit that generates the number of continuous speech frames for each vector included in the intermediate representation sequence; and a pitch feature generation unit that generates pitch features in each speech frame using a neural network included in the second neural network, based on the number of continuous speech frames,
the speech frames are determined based on pitch, and
the continuous speech frame count generation unit comprises: a coarse pitch generation unit that generates an average pitch feature of each vector included in the intermediate representation sequence; a duration generation unit that generates a duration of each vector included in the intermediate representation sequence; and a calculation unit that calculates the number of pitch waveforms from the average pitch feature and the duration.
2. The speech synthesis device according to claim 1, wherein the speech waveform decoder of the second processing unit further comprises a waveform generation unit that sequentially generates the speech waveform by generating, in chronological order, the speech waveform for the predetermined number of samples from the spectral features.
3. The speech synthesis device according to claim 1 or 2, wherein the first processing unit further comprises a processing unit that processes the prosodic features, and the second processing unit receives the prosodic features generated by the prosodic feature decoder or the prosodic features processed by the processing unit.
4. The speech synthesis device according to claim 3, wherein the processing unit receives a processing instruction for the prosodic features from a user and processes the prosodic features based on the user's processing instruction, and the user's processing instruction includes an instruction to change a value of the prosodic features, or a projection instruction to project the prosodic features onto prosodic features obtained by speech analysis of an utterance of the input text.
5. The speech synthesis device according to any one of claims 1 to 4, further comprising a speaker identification information conversion unit that converts speaker identification information, which identifies a speaker, into a speaker vector representing feature information of the speaker, wherein the first processing unit further comprises an assignment unit that assigns the feature information of the speaker vector to the intermediate representation sequence.
6. The speech synthesis device according to any one of claims 1 to 5, further comprising a style identification information conversion unit that converts style identification information, which identifies a speaking style, into a style vector representing feature information of the style, wherein the first processing unit further comprises an assignment unit that assigns the feature information of the style vector to the intermediate representation sequence.
7. A speech synthesis method comprising:
a step in which an analysis unit analyzes input text and generates a language feature sequence including one or more vectors representing language features;
a step in which a first processing unit converts the language feature sequence into an intermediate representation sequence including one or more vectors representing latent variables using a first neural network;
a step in which the first processing unit generates prosodic features from the intermediate representation sequence using a second neural network; and
a step in which a second processing unit sequentially generates a speech waveform from the intermediate representation sequence and the prosodic features using a third neural network, wherein
the step of sequentially generating the speech waveform includes a step of generating, in chronological order, spectral features for a number of speech frames corresponding to a predetermined number of samples from the intermediate representation sequence and the prosodic features, using a neural network having a recurrent structure included in the third neural network,
the step of generating the prosodic features includes: a step of generating the number of continuous speech frames for each vector included in the intermediate representation sequence; and a step of generating pitch features in each speech frame using a neural network included in the second neural network, based on the number of continuous speech frames,
the speech frames are determined based on pitch, and
the step of generating the number of continuous speech frames includes: a step of generating an average pitch feature of each vector included in the intermediate representation sequence; a step of generating a duration of each vector included in the intermediate representation sequence; and a step of calculating the number of pitch waveforms from the average pitch feature and the duration.
8. A program for causing a computer to function as:
an analysis unit that analyzes input text and generates a language feature sequence including one or more vectors representing language features;
a first processing unit; and
a second processing unit, wherein
the first processing unit has the functions of: an encoder that converts the language feature sequence into an intermediate representation sequence including one or more vectors representing latent variables using a first neural network; and a prosodic feature decoder that generates prosodic features from the intermediate representation sequence using a second neural network,
the second processing unit has the function of a speech waveform decoder that sequentially generates a speech waveform from the intermediate representation sequence and the prosodic features using a third neural network,
the speech waveform decoder of the second processing unit has the function of a spectral feature generation unit that generates, in chronological order, spectral features for a number of speech frames corresponding to a predetermined number of samples from the intermediate representation sequence and the prosodic features, using a neural network having a recurrent structure included in the third neural network,
the prosodic feature decoder has the functions of: a continuous speech frame count generation unit that generates the number of continuous speech frames for each vector included in the intermediate representation sequence; and a pitch feature generation unit that generates pitch features in each speech frame using a neural network included in the second neural network, based on the number of continuous speech frames,
the speech frames are determined based on pitch, and
the continuous speech frame count generation unit has the functions of: a coarse pitch generation unit that generates an average pitch feature of each vector included in the intermediate representation sequence; a duration generation unit that generates a duration of each vector included in the intermediate representation sequence; and a calculation unit that calculates the number of pitch waveforms from the average pitch feature and the duration.