Text-based speech generation
By inserting additional phonemes into the initial phoneme sequence of text generation and using an expert model to determine the duration, combined with the speech characteristics of the target speaker, the problem of stiff and unfluent speech generation in the prior art is solved, and a more realistic and natural speech generation effect is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- MICROSOFT TECHNOLOGY LICENSING LLC
- Filing Date
- 2021-06-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing text-based speech generation technologies produce speech that differs from real human speech, sounding stiff and not fluent, especially when simulating human spoken language.
An initial phoneme sequence is generated and additional phonemes related to natural spoken language characteristics are inserted into it. The duration of the phonemes is then determined using an expert model, and natural spoken language-type speech is generated by combining the speech characteristics of the target speaker.
This improves the similarity between the generated speech and real, natural human speech, making the generated speech more realistic and vivid, with more varied rhythms and natural pauses and repetitions.
Figure CN115602145B_ABST
Abstract
Description
Background Technology
[0001] Text-based speech generation, also known as Text-to-Speech (TTS), is used to convert text into natural speech output. TTS is a type of speech synthesis application and plays an important role in applications such as assisted reading and voice prompts. However, the speech generated using current TTS methods still differs from real human speech. For example, the generated speech is more stiff and less fluent than real human speech. Therefore, methods capable of generating more realistic speech from text are needed.
Summary of the Invention
[0002] According to the implementation of this disclosure, a scheme for text-based speech generation is proposed. In this scheme, an initial phoneme sequence corresponding to the text is generated, the initial phoneme sequence including feature representations of multiple phonemes. A first phoneme sequence is generated by inserting feature representations of additional phonemes related to the characteristics of natural spoken language into the initial phoneme sequence. The duration of the phonemes is determined using an expert model corresponding to the phonemes in the multiple phonemes and the additional phonemes, and a second phoneme sequence is generated based on the first phoneme sequence. Based on the second phoneme sequence, the natural spoken language type of speech corresponding to the text is determined. In this way, the scheme can generate more realistic natural spoken language type speech with varied prosody based on the additional phonemes of the natural spoken language type and multiple expert models.
[0003] The summary section is provided to present the chosen concepts in a simplified form, which will be further described in the detailed description below. The summary section is not intended to identify key or principal features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Attached Figure Description
[0004] Figure 1 shows a block diagram of a computing device capable of implementing multiple implementations of the present disclosure;
[0005] Figure 2 shows a system architecture diagram for text-based speech generation according to an implementation of this disclosure;
[0006] Figure 3 shows a schematic diagram of the process of generating a second phoneme sequence using a duration determination module according to an implementation of this disclosure;
[0007] Figure 4 shows a flowchart of a text-based speech generation method according to an implementation of this disclosure; and
[0008] Figure 5 shows a flowchart of a method for training a text-based speech generation model according to an implementation of this disclosure.
[0009] In these accompanying figures, the same or similar reference symbols are used to indicate the same or similar elements.
Detailed Implementation
[0010] This disclosure will now be discussed with reference to several example implementations. It should be understood that these implementations are discussed only to enable those skilled in the art to better understand and thus implement this disclosure, and not to imply any limitation on the scope of this disclosure.
[0011] As used herein, the term "comprising" and its variations are to be interpreted as open-ended terms meaning "including but not limited to". The term "based on" is to be interpreted as "at least partially based on". The terms "one implementation" and "an implementation" are to be interpreted as "at least one implementation". The term "another implementation" is to be interpreted as "at least one other implementation". The terms "first", "second", etc., may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
[0012] As used herein, a "neural network" is capable of processing input and providing a corresponding output. It typically includes an input layer, an output layer, and one or more hidden layers between the input and output layers. Neural networks used in deep learning applications often include many hidden layers, thus extending the network's depth. The layers of a neural network are connected sequentially, so that the output of the previous layer is provided as the input to the next layer, where the input layer receives the input to the neural network, and the output of the output layer serves as the final output of the neural network. Each layer of a neural network includes one or more nodes (also called processing nodes or neurons), each node processing the input from the previous layer. Herein, the terms "neural network," "network," and "neural network model" are used interchangeably.
[0013] As mentioned above, current TTS solutions still produce speech that differs from real human speech. For example, the generated speech is more stiff and less fluent than real human speech. Conventional TTS solutions have proposed some methods to simulate the pitch and volume variations in real human speech, thereby generating higher-quality reading-style speech. However, the generated speech cannot accurately simulate the pauses, repetitions, and more varied prosody of spoken human speech. Therefore, there is still a need for solutions capable of generating natural spoken-style speech (also known as spontaneous speech) from text.
[0014] According to the implementation of this disclosure, a scheme for text-based speech generation is proposed, in which an initial phoneme sequence corresponding to the text is generated, the initial phoneme sequence including feature representations of multiple phonemes. A first phoneme sequence is generated by inserting feature representations of additional phonemes related to the characteristics of natural spoken language into the initial phoneme sequence. The duration of the phonemes is determined by utilizing an expert model corresponding to the phonemes among the multiple phonemes and the additional phonemes, and a second phoneme sequence is generated based on the first phoneme sequence. Based on the second phoneme sequence, the speech of the natural spoken language type corresponding to the text is determined. Various example implementations of this scheme are further described in detail below with reference to the accompanying drawings.
[0015] Figure 1 shows a block diagram of a computing device 100 capable of implementing multiple implementations of the present disclosure. It should be understood that the computing device 100 shown in Figure 1 is merely exemplary and should not constitute any limitation on the functionality and scope of the implementations described in this disclosure. As shown in Figure 1, computing device 100 is in the form of a general-purpose computing device. Components of computing device 100 may include, but are not limited to, one or more processors or processing units 110, memory 120, storage device 130, one or more communication units 140, one or more input devices 150, and one or more output devices 160.
[0016] In some implementations, computing device 100 can be implemented as various user terminals or service terminals with computing capabilities. Service terminals can be servers, large computing devices, etc., provided by various service providers. User terminals can be any type of mobile terminal, fixed terminal, or portable terminal, including mobile phones, sites, units, devices, multimedia computers, multimedia tablets, internet nodes, communicators, desktop computers, laptop computers, notebook computers, netbook computers, tablet computers, personal communication system (PCS) devices, personal navigation devices, personal digital assistants (PDAs), audio / video players, digital cameras / camcorders, positioning devices, television receivers, radio receivers, e-book devices, gaming devices, or any combination thereof, including accessories and peripherals of these devices, or any combination thereof. It is also foreseeable that computing device 100 can support any type of user-facing interface (such as "wearable" circuitry).
[0017] Processing unit 110 can be a physical or virtual processor and is capable of performing various processes according to programs stored in memory 120. In a multiprocessor system, multiple processing units execute computer-executable instructions in parallel to improve the parallel processing capability of computing device 100. Processing unit 110 may also be referred to as a central processing unit (CPU), microprocessor, controller, or microcontroller.
[0018] Computing device 100 typically includes multiple computer storage media. Such media can be any available media accessible to computing device 100, including but not limited to volatile and non-volatile media, removable and non-removable media. Memory 120 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. Memory 120 may include speech generation modules 122, which are configured to perform the functions of the various implementations described herein. Speech generation modules 122 can be accessed and operated by processing unit 110 to implement the corresponding functions.
[0019] Storage device 130 may be a removable or non-removable medium and may include machine-readable media capable of storing information and/or data and accessible within computing device 100. Computing device 100 may further include additional removable/non-removable, volatile/non-volatile storage media. Although not shown in Figure 1, disk drives for reading from or writing to removable, non-volatile disks and optical disc drives for reading from or writing to removable, non-volatile optical discs can be provided. In these cases, each drive can be connected to a bus (not shown) via one or more data media interfaces.
[0020] The communication unit 140 enables communication with other computing devices via a communication medium. Additionally, the functionality of the components of the computing device 100 can be implemented as a single computing cluster or multiple computing machines capable of communicating via communication connections. Therefore, the computing device 100 can operate in a networked environment using logical connections to one or more other servers, personal computers (PCs), or another general network node.
[0021] Input device 150 can be one or more various input devices, such as a mouse, keyboard, trackball, voice input device, etc. Output device 160 can be one or more output devices, such as a monitor, speaker, printer, etc. Computing device 100 can also communicate as needed, via communication unit 140, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that enable a user to interact with computing device 100, or with any device (e.g., network card, modem, etc.) that enables computing device 100 to communicate with one or more other computing devices. Such communication can be performed via an input/output (I/O) interface (not shown).
[0022] In some implementations, in addition to being integrated into a single device, some or all of the components of computing device 100 may be configured in the form of a cloud computing architecture. In a cloud computing architecture, these components can be remotely deployed and can work together to achieve the functionality described herein. In some implementations, cloud computing provides computing, software, data access, and storage services without requiring end users to know the physical location or configuration of the systems or hardware providing these services. In various implementations, cloud computing provides services over a wide area network (such as the Internet) using appropriate protocols. For example, cloud computing providers offer applications over a wide area network, and these applications can be accessed through a web browser or any other computing component. The software or components of the cloud computing architecture, along with the corresponding data, may be stored on servers at remote locations. Computing resources in a cloud computing environment may be consolidated at remote data center locations or they may be distributed. Cloud computing infrastructure can provide services through shared data centers, even if they appear as a single access point for users. Therefore, the components and functionality described herein can be provided from service providers at remote locations using a cloud computing architecture. Alternatively, they may also be provided from conventional servers, or they may be installed directly or otherwise on client devices.
[0023] The computing device 100 can perform text-based speech generation according to various implementations of this disclosure. For example, as shown in Figure 1, computing device 100 can receive text 170 via input device 150. Text 170 is used to generate the desired speech. Text 170 may include multiple text sequences. Input device 150 can transmit text 170 to speech generation module 122. Speech generation module 122 generates corresponding natural spoken language speech 190 based on text 170. Natural spoken language speech 190 has unique characteristics. Compared to reading-style speech, natural spoken language speech 190 can have more varied prosody. The prosody of speech can be characterized using the duration and pitch of phonemes. Natural spoken language speech 190 can include more varied phoneme durations. For example, in human spoken language, the lengthening or shortening of specific phonemes occurs more frequently.
[0024] Natural spoken language speech 190 may have additional phonemes. Additional phonemes can be phonemes of speech that have no actual meaning and do not provide additional information. Examples of additional phonemes may include phonemes indicating pauses, phonemes indicating repetition, and phonemes indicating idiomatic expressions. For example, pauses such as "um" or "ah" may occur in natural spoken language speech 190. In another example, humans tend to repeat a word after saying it, so that natural spoken language speech 190 may include phonemes indicating repetition. In yet another example, some people habitually say idiomatic expressions such as "right?" after saying a particular word, so that natural spoken language speech 190 may include phonemes indicating personal idiomatic expressions.
[0025] Figure 2 shows an architecture diagram of a system 200 for text-based speech generation according to an implementation of this disclosure. System 200 can be implemented in the computing device 100 of Figure 1. The system 200 can be an end-to-end neural network model. For example, as shown in Figure 2, system 200 may include a preprocessing module 210, an additional phoneme determination module 220, a duration determination module 230, and a postprocessing module 240.
[0026] The preprocessing module 210 preprocesses the received text 170. The preprocessing module 210 can perform phonetic conversion on the text 170. For example, phonetic conversion can convert the English text "It's called um right uh apple" into the corresponding phonemes "ih t s k ao l d ah m r ay t ah ae p ax l". Various phonetic conversion methods can be used to convert the text 170 into corresponding phonemes. The scope of this disclosure is not limited in terms of phonetic conversion methods.
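The phonetic conversion step can be illustrated with a minimal sketch based on dictionary lookup. The `LEXICON` table and the `text_to_phonemes` function are hypothetical names introduced only for illustration, and the phoneme symbols are modeled on the example above; an actual implementation may use any conversion method, such as a full pronunciation dictionary or a trained model.

```python
import re

# Hypothetical pronunciation lexicon (an assumption for illustration only);
# a real system would use a full dictionary or a trained conversion model.
LEXICON = {
    "it's": ["ih", "t", "s"],
    "called": ["k", "ao", "l", "d"],
    "um": ["ah", "m"],
    "right": ["r", "ay", "t"],
    "uh": ["ah"],
    "apple": ["ae", "p", "ax", "l"],
}


def text_to_phonemes(text):
    """Convert text to a flat list of phoneme symbols by lexicon lookup."""
    words = re.findall(r"[a-z']+", text.lower())
    phonemes = []
    for word in words:
        if word not in LEXICON:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(LEXICON[word])
    return phonemes


print(" ".join(text_to_phonemes("It's called um right uh apple")))
```

Raising an error for unknown words keeps the sketch honest about its limits; a practical converter would instead fall back to a learned model for out-of-vocabulary words.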
[0027] The preprocessing module 210 can also encode the phonemes obtained from the phonetic conversion to generate an initial phoneme sequence 250 corresponding to the text 170. The initial phoneme sequence 250 includes feature representations of multiple phonemes, each phoneme having a corresponding vector-like feature representation. The initial phoneme sequence 250 can be an initial feature representation used to represent the text 170. Various methods can be used to generate the initial phoneme sequence 250 based on the converted phonemes. The preprocessing module 210 can utilize an embedder to generate embeddings of the phonemes in vector form. The embedder can use a phoneme embedding algorithm to capture acoustic information (e.g., articulation features) in the phonemes to generate embeddings representing this acoustic information. The preprocessing module 210 can also utilize an encoder to encode the phoneme embeddings into feature representations of the phonemes. The encoder can be a network consisting of attention layers and convolutional layers. Training of the network for phoneme embedding and encoding will be described below. The scope of this disclosure is not limited in terms of the methods used for phoneme embedding and encoding.
[0028] The additional phoneme determination module 220 generates a first phoneme sequence 260 based on the initial phoneme sequence 250. The additional phoneme determination module 220 generates the first phoneme sequence 260 by inserting feature representations of additional phonemes into the initial phoneme sequence 250. In other words, the first phoneme sequence 260 includes not only feature representations of the multiple phonemes in the initial phoneme sequence 250, but also feature representations of the additional phonemes. As mentioned above, the additional phonemes are related to natural spoken language characteristics. For example, additional phonemes may be related to pauses, repetitions, or idioms.
[0029] In some implementations, the feature representation of the additional phonemes can be an embedding of the additional phonemes. The feature representation of the additional phonemes can also be a variation of the embedding of the additional phonemes. In other implementations, the feature representation of the additional phonemes can be determined by the additional phoneme determination module 220 based on the initial phoneme sequence 250.
[0030] The additional phoneme determination module 220 can determine the appropriate position to insert the additional phoneme in the initial phoneme sequence 250 based on the initial phoneme sequence 250. In other words, the additional phoneme determination module 220 can determine, based on the initial phoneme sequence 250, where and which additional phoneme to insert among the multiple phonemes corresponding to the text 170. For example, for the text 170 "This is an apple", the additional phoneme determination module 220 can determine to insert a phoneme indicating the pause "um" in the corresponding speech, and can determine to insert the pause "um" at the position between "This is" and "apple". In another example, for the same text 170 "This is an apple", the additional phoneme determination module 220 can determine to insert a phoneme indicating the idiom "right?" in the corresponding speech, and can determine to insert the idiom "right?" at the end of "This is an apple".
[0031] The additional phoneme determination module 220 can be a network composed of common neural network layers such as convolutional layers, linear layers, and normalization layers. In some implementations, the additional phoneme determination module 220 may include two 1D convolutional layers with ReLU activation functions, dropout layers, normalization layers, linear layers, and softmax layers. The softmax layer can be used to predict the probability that the additional phoneme belongs to different categories. For example, the categories of additional phonemes may include no additional phoneme, pause "um", pause "ah", repeat the previous word, idiom "right?", etc. The training of the additional phoneme determination module 220 will be detailed below. The scope of this disclosure is not limited in terms of model construction and training of the additional phoneme determination module 220.
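The category prediction and insertion described above can be sketched as follows. This is a simplified illustration rather than the disclosed network: it works on word-level tokens instead of phoneme feature representations, the per-position scores are hand-picked instead of produced by convolutional layers, and the rule of inserting the most probable category after each position is an assumption. The names `CATEGORIES`, `softmax`, and `insert_additional` are hypothetical.

```python
import math

# Hypothetical category inventory, modeled on the categories listed above.
CATEGORIES = ["none", "um", "ah", "repeat_previous_word", "right?"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def insert_additional(tokens, logits_per_position):
    """Insert the most probable additional item after each token.

    tokens: word-level stand-ins for phoneme feature representations.
    logits_per_position: one list of category scores per token.
    """
    output = []
    for token, logits in zip(tokens, logits_per_position):
        output.append(token)
        probs = softmax(logits)
        category = CATEGORIES[probs.index(max(probs))]
        if category == "none":
            continue
        if category == "repeat_previous_word":
            output.append(token)      # say the same word again
        else:
            output.append(category)   # filler such as "um" or "right?"
    return output


tokens = ["this", "is", "an", "apple"]
logits = [
    [4.0, 0.1, 0.1, 0.1, 0.1],  # after "this": no additional phoneme
    [0.2, 3.0, 0.5, 0.1, 0.1],  # after "is": pause "um"
    [4.0, 0.1, 0.1, 0.1, 0.1],  # after "an": no additional phoneme
    [0.3, 0.1, 0.1, 0.2, 2.5],  # after "apple": idiom "right?"
]
print(insert_additional(tokens, logits))
```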
[0032] By generating the first phoneme sequence 260 by inserting appropriate feature representations of additional phonemes at suitable positions in the initial phoneme sequence 250, the speech generated based on the text 170 can include more additional phonemes related to natural spoken language. In this way, the similarity between the generated speech and real human natural spoken language can be improved, making the generated natural spoken language type speech 190 sound more realistic and vivid.
[0033] Based on the first phoneme sequence 260, the duration determination module 230 generates the second phoneme sequence 270 by determining the duration of the phonemes in the first phoneme sequence 260. It should be understood that the phonemes in the first phoneme sequence 260 include any inserted additional phonemes. The duration of a phoneme can be measured in frames, with each frame lasting, for example, 10 ms. The duration determination module 230 can predict the corresponding duration, expressed in frames, for each phoneme in the first phoneme sequence 260. Specifically, the duration determination module 230 determines the duration of a phoneme by utilizing an expert model corresponding to the phonemes in the first phoneme sequence 260. The duration determination module 230 can utilize a Mixture of Experts (MoE) algorithm to determine the duration of a phoneme. Details of the duration determination module 230 are described below with reference to Figure 3.
[0034] Figure 3 shows a schematic diagram illustrating the generation of a second phoneme sequence 270 using a duration determination module 230 according to an implementation of the present disclosure. The duration determination module 230 may include a routing module (such as the routing module 310 shown in Figure 3) and multiple expert models. The routing module 310 can classify the phonemes in the first phoneme sequence 260 into different categories. The categories can be related to the duration of the phonemes. In some implementations, the routing module 310 can classify the phonemes into two categories, namely, long duration or short duration.
[0035] For different categories of phonemes, the duration of the corresponding phoneme can be predicted using the expert model that performs best for that category among multiple expert models. The multiple expert models can include two, three, or more expert models. For example, the multiple expert models can include the first expert model 320-1 and the second expert model 320-2 shown in Figure 3. In some implementations, the first expert model 320-1 can be used to predict the duration of phonemes classified as long-duration, and the second expert model 320-2 can be used to predict the duration of phonemes classified as short-duration.
[0036] In some implementations, predictions of the duration of the same phoneme from multiple expert models can be considered together. As an example, routing module 310 can determine the probability of a phoneme being classified into different categories while simultaneously determining the phoneme's classification. Multiple expert models can be used to predict the duration of the same phoneme. The probabilities of different categories can be used as weights to sum the durations of the same phoneme predicted by the multiple expert models. The weighted sum of durations can be used as the duration of the phoneme determined by duration determination module 230.
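The two ways of combining the routing module with the expert models can be sketched as below: selecting the expert for the most probable category, or weighting every expert's prediction by the category probability. The function names and the numeric values are hypothetical, and rounding the result to a whole number of frames is an assumption.

```python
def hard_duration(routing_probs, expert_predictions):
    """Use only the expert that corresponds to the most probable category."""
    best = routing_probs.index(max(routing_probs))
    return expert_predictions[best]


def mixture_duration(routing_probs, expert_predictions):
    """Weighted sum of all expert predictions, weighted by category probability."""
    if len(routing_probs) != len(expert_predictions):
        raise ValueError("one probability per expert is required")
    return sum(p * d for p, d in zip(routing_probs, expert_predictions))


# Hypothetical values for one phoneme: the router assigns probability 0.75
# to the long-duration category and 0.25 to the short-duration category.
routing_probs = [0.75, 0.25]
expert_predictions = [12.0, 4.0]  # frames, from the long and short experts

hard = hard_duration(routing_probs, expert_predictions)
soft = mixture_duration(routing_probs, expert_predictions)
frames = max(1, round(soft))      # whole frames; the rounding is an assumption
print(hard, soft, frames * 10, "ms at 10 ms per frame")
```

The weighted form lets an uncertain routing decision produce an intermediate duration instead of committing to one expert.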
[0037] The duration determination module 230 further updates the first phoneme sequence 260 based on the determined duration of the phonemes, thereby generating the second phoneme sequence 270. In some implementations, the duration determination module 230 may extend the first phoneme sequence 260 to update it based on the determined duration of the phonemes. In other words, the feature representations of the phonemes in the first phoneme sequence 260 can be arranged according to the corresponding durations. For example, if it is determined that the duration of the first phoneme in the first phoneme sequence 260 is 5 frames and the duration of the second phoneme is 2 frames, the first phoneme sequence 260 can be updated using the arrangement of the feature representation of the first phoneme repeated 5 times and the feature representation of the second phoneme repeated 2 times, as the second phoneme sequence 270.
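The expansion in this example can be sketched as follows, assuming each feature representation is a plain list of floats. The function name `expand_by_duration` is hypothetical.

```python
def expand_by_duration(features, durations):
    """Repeat each phoneme's feature vector once per frame of its duration."""
    if len(features) != len(durations):
        raise ValueError("one duration per phoneme is required")
    expanded = []
    for vector, frames in zip(features, durations):
        expanded.extend(list(vector) for _ in range(frames))
    return expanded


# Two phonemes with 2-dimensional feature representations (made-up values).
first_sequence = [[0.1, 0.2], [0.3, 0.4]]
second_sequence = expand_by_duration(first_sequence, [5, 2])
print(len(second_sequence))  # 5 + 2 = 7 frames
```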
[0038] In some implementations, if the first phoneme sequence 260 is already associated with an initial duration, the duration determination module 230 can update the first phoneme sequence 260 by extending or shortening the duration of the phonemes in the first phoneme sequence 260. For example, if the duration of the first phoneme in the first phoneme sequence 260 is determined to be 5 frames, the feature representation of the first phoneme repeated 3 times in the first phoneme sequence 260 can be extended to the feature representation of the first phoneme repeated 5 times, thereby updating the first phoneme sequence 260 as the second phoneme sequence 270.
[0039] The network structure of routing module 310 and multiple expert models can be similar to that of the additional phoneme determination module 220 described above. The training of routing module 310 and multiple expert models will be detailed below. The scope of this disclosure is not limited in terms of model building and training of duration determination module 230.
[0040] By updating the first phoneme sequence 260 based on the duration of phonemes, the speech generated from the text 170 can have more varied rhythms. In this way, the similarity between the generated speech and real human natural spoken language can be improved, making the generated natural spoken language type speech 190 sound more realistic and vivid.
[0041] Continuing to refer to Figure 2, based on the second phoneme sequence 270, the post-processing module 240 can determine the speech 190 of the natural spoken language type corresponding to the text 170. In some implementations, the post-processing module 240 can determine the pitch of the phonemes in the second phoneme sequence 270. The post-processing module 240 can update the second phoneme sequence 270 based on the determined pitches. Specifically, the post-processing module 240 can utilize a network similar to the additional phoneme determination module 220 to predict the pitch of the phonemes. The predicted pitches can be converted into pitch embedding vectors. The pitch embedding vectors can be added to the feature representation of the corresponding phoneme, thereby updating the second phoneme sequence 270. The scope of this disclosure is not limited in terms of the methods used to determine pitches.
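One possible way to convert predicted pitches into embedding vectors and add them to the feature representations is sketched below. The disclosure does not specify the conversion; quantizing pitch into a small number of bins and looking up one vector per bin is an assumption, as are the constants and the names `PITCH_TABLE`, `pitch_to_bin`, and `add_pitch`. The random table stands in for a learned embedding table.

```python
import random

NUM_BINS = 4                   # hypothetical number of pitch bins
DIM = 3                        # hypothetical feature dimension
F0_MIN, F0_MAX = 80.0, 400.0   # hypothetical pitch range in Hz

random.seed(0)
# Stand-in for a learned embedding table: one vector per pitch bin.
PITCH_TABLE = [
    [random.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(NUM_BINS)
]


def pitch_to_bin(f0_hz):
    """Map a pitch value to a bin index, clamping to the valid range."""
    clamped = min(max(f0_hz, F0_MIN), F0_MAX)
    position = (clamped - F0_MIN) / (F0_MAX - F0_MIN)
    return min(int(position * NUM_BINS), NUM_BINS - 1)


def add_pitch(frames, pitches):
    """Add the pitch embedding element-wise to each frame's feature vector."""
    return [
        [x + e for x, e in zip(frame, PITCH_TABLE[pitch_to_bin(f0)])]
        for frame, f0 in zip(frames, pitches)
    ]


updated = add_pitch([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]], [100.0, 240.0])
print(updated)
```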
[0042] In some implementations, post-processing module 240 may update the second phoneme sequence 270 based on the speech characteristics of the target speaker to generate a third phoneme sequence (not shown). The speech characteristics of the target speaker may be timbre. Post-processing module 240 may determine the speech 190 of the natural spoken language type corresponding to both text 170 and the target speaker based on the third phoneme sequence. Specifically, post-processing module 240 may update the second phoneme sequence 270 by adding an embedding vector indicating the speech characteristics of the target speaker to the feature representation of the corresponding phoneme. The scope of this disclosure is not limited in terms of the method used to determine the embedding vector indicating the speech characteristics of the target speaker.
[0043] In some implementations, the post-processing module 240 may use a decoder to generate a Mel spectrum corresponding to the text 170 based on the second phoneme sequence 270. This Mel spectrum can then be converted into speech, i.e., natural spoken language speech 190. The decoder can be any suitable network architecture, and the scope of this disclosure is not limited in this respect.
[0044] It should be understood that the structure and function of system 200 are described for illustrative purposes only and do not imply any limitation on the scope of the topics described herein. The topics described herein may be embodied in different structures and / or functions.
[0045] Figure 4 shows a flowchart of a text-based speech generation method 400 according to some implementations of the present disclosure. Method 400 can be implemented by computing device 100; for example, it can be implemented at speech generation module 122 in memory 120 of computing device 100.
[0046] As shown in Figure 4, at block 410, computing device 100 generates an initial phoneme sequence 250 corresponding to text 170, the initial phoneme sequence 250 including feature representations of multiple phonemes. At block 420, computing device 100 generates a first phoneme sequence 260 by inserting feature representations of additional phonemes into the initial phoneme sequence 250, the additional phonemes being related to the characteristics of natural spoken language. In some implementations, the additional phonemes include at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
[0047] At block 430, computing device 100 generates a second phoneme sequence 270 based on the first phoneme sequence 260 by determining the duration of a phoneme, among the multiple phonemes and the additional phonemes, using an expert model corresponding to that phoneme. In some implementations, generating the second phoneme sequence 270 based on the first phoneme sequence 260 includes: determining the category of a phoneme among the multiple phonemes and the additional phonemes; and predicting the duration of the phoneme using the expert model, among multiple expert models, that corresponds to the category.
[0048] At block 440, computing device 100 determines speech 190 of a natural spoken language type corresponding to text 170 based on the second phoneme sequence 270. In some implementations, determining speech 190 of a natural spoken language type corresponding to text 170 based on the second phoneme sequence 270 includes: updating the second phoneme sequence 270 based on speech characteristics of a target speaker to generate a third phoneme sequence; and determining speech 190 of a natural spoken language type corresponding to both text 170 and the target speaker based on the third phoneme sequence.
[0049] In this way, based on the additional phonemes and the varying duration of phonemes related to the characteristics of natural spoken language, the similarity between the generated speech and real human natural spoken language can be improved, making the generated natural spoken language type speech 190 sound more realistic and vivid.
[0050] The working principle of the text-based speech generation method according to implementations of this disclosure has been described in detail above with reference to Figures 1-4. The training process of the end-to-end neural network model used in this method is described below.
[0051] Figure 5 shows a flowchart of a method 500 for training a text-based speech generation model according to some implementations of the present disclosure. Method 500 can be implemented by computing device 100, for example at speech generation module 122 in memory 120 of computing device 100.
[0052] As shown in Figure 5, at box 510, computing device 100 uses a first training dataset to train a first model, which is used to generate speech based on text. The first model can generate speech corresponding to text 170. The first model can be any suitable TTS model, for example a multi-speaker TTS model. The first model can include modules similar to the preprocessing module 210 and postprocessing module 240 shown in Figure 2. The first model may also include a prosody determination module for predicting the duration and pitch of phonemes.
[0053] The first training dataset can be any suitable dataset for speech synthesis. In some implementations, the first training dataset may include text and its corresponding speech. Audio transcription methods can be used to obtain the corresponding text based on the original speech. The text and the original speech can be temporally aligned. In some implementations, the text can be converted into a corresponding series of phonemes. The first training dataset may include a series of phonemes and their corresponding original speech. The first training dataset may also include the duration of each phoneme. The first training dataset may also include the pitch of each phoneme extracted from the original speech. For a first model serving as a multi-speaker TTS model, the first training dataset may also include original speech from multiple speakers and their corresponding speaker identifiers.
[0054] At box 520, computing device 100 uses a second training dataset to fine-tune a second model generated based on the first model. This second model is used to generate natural spoken language-type speech based on text. The second model can be used, for example, to generate natural spoken language-type speech 190 based on text 170 as in the example shown in Figure 2. The second model may include the preprocessing module 210, the additional phoneme determination module 220, the duration determination module 230, and the postprocessing module 240 shown in Figure 2, or similar modules. Alternatively or additionally, the second model may also include any other suitable modules for generating natural spoken language-type speech.
[0055] The second training dataset can be any suitable dataset for synthesizing natural spoken language. It can be constructed from raw speech of the natural spoken language type. Compared to the first training dataset, the second training dataset has less training data. In other words, it can be constructed from less speech data. A method similar to that used to determine the first training dataset can be used to determine the corresponding text and a set of phonemes based on the raw speech. Based on the raw speech and the determined text and set of phonemes, the second training dataset can be constructed for specific modules in the second model.
[0056] In some implementations, the second model can be generated by adding an additional phoneme determination module 220 to the first model. As described above with reference to Figure 2, the additional phoneme determination module 220 is used to determine additional phonemes related to the characteristics of natural spoken language among the multiple phonemes corresponding to the speech of the natural spoken language type. The additional phonemes can be phonemes indicating pauses, phonemes indicating repetitions, and phonemes indicating idiomatic expressions. In this case, a second training dataset can be constructed specifically for training the additional phoneme determination module 220. Specifically, the additional phonemes in a series of phonemes determined from the original speech can be identified. A corresponding label can be assigned to each phoneme in the series according to whether it is followed by an additional phoneme. The label can indicate that the phoneme is not followed by an additional phoneme, or it can indicate the category of the additional phoneme that follows. For example, the label can indicate no additional phoneme, the pause "um", the pause "ah", a repetition of the previous word, or the idiomatic expression "right?". The additional phonemes can then be removed from the series of phonemes to generate a pure series of phonemes, in which each phoneme carries a label indicating the additional phoneme, if any, that followed it.
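The labeling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: phonemes are plain strings, the label names (`none`, `pause_um`, `pause_ah`) are invented for the example, and the handling of edge cases (a filler at the very start, or two fillers in a row) is a choice made here because the text does not address it.

```python
from typing import List, Tuple

NO_ADDITIONAL = "none"
# Hypothetical label inventory, for illustration only.
ADDITIONAL = {"um": "pause_um", "ah": "pause_ah"}


def build_labels(series: List[str]) -> List[Tuple[str, str]]:
    """Strip additional phonemes from a series and label each remaining
    phoneme with the category of the additional phoneme that followed it."""
    labeled: List[Tuple[str, str]] = []
    for ph in series:
        if ph in ADDITIONAL:
            if labeled:
                # Attach the label to the preceding regular phoneme.
                # Assumption: a second consecutive filler overwrites the first.
                prev, _ = labeled[-1]
                labeled[-1] = (prev, ADDITIONAL[ph])
            # Assumption: a filler with no preceding phoneme is dropped.
        else:
            labeled.append((ph, NO_ADDITIONAL))
    return labeled


pure_series = build_labels(["HH", "um", "AY", "ah"])
```

The result is the "pure" series with one label per phoneme, which is the shape of training data the paragraph describes for the additional phoneme determination module.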
[0057] The labeled pure series of phonemes can be used as the second training dataset to fine-tune the second model to which the additional phoneme determination module 220 has been added. In other words, a part of the parameters of the trained first model can be inherited as the corresponding parameters of the second model. While these inherited parameters are kept unchanged, the parameters in the additional phoneme determination module 220 can be specifically trained using the second training dataset. As described above with reference to Figure 2, the additional phoneme determination module 220 can receive an initial phoneme sequence 250. The initial phoneme sequence 250 can be generated by the embedder and the encoder based on the phonemes obtained from grapheme-to-phoneme conversion. Therefore, when training the additional phoneme determination module 220, the parameters of the trained embedder and encoder can be kept unchanged, and only the parameters in the additional phoneme determination module 220 are trained.
[0058] In some implementations, formula (1) can be used as the loss function for training the additional phoneme determination module 220:
[0059]
[0060] where [s_0, s_1, s_2] represents the probabilities that the phoneme is predicted as each of three different additional phoneme categories, s_0 represents the probability of no additional phoneme, s_1 represents the probability of the additional phoneme "um", s_2 represents the probability of the additional phoneme "ah", [y_0, y_1, y_2] represents the one-hot encoding of the true category label of the phoneme, and σ represents an adjustable parameter for adjusting the density of the additional phonemes.
[0061] In some implementations, the second model can be generated by adding a duration determination module 230 to the first model. Alternatively, the module for determining duration in the first model can be modified into the duration determination module 230 shown in Figure 2 to generate the second model. In this case, a second training dataset can be constructed for targeted training of the duration determination module 230. Specifically, an alignment tool can be used to determine the durations of a series of phonemes from the original speech. Details of the alignment tool are not elaborated here.
[0062] The second training dataset can be used to fine-tune the second model that includes the duration determination module 230. Similarly, a subset of the parameters of the trained first model can be inherited as the corresponding parameters of the second model. While these inherited parameters are kept unchanged, the parameters in the duration determination module 230 can be specifically trained using the second training dataset. For example, the parameters of the trained embedder and encoder can be kept unchanged, and only the parameters in the duration determination module 230 are trained.
[0063] Specifically, the duration determination module 230 can be trained using a set of phonemes labeled with their actual durations. In some implementations, an actual duration category can be determined for each phoneme based on its actual duration. The routing module 310 in the duration determination module 230 can then be trained using a set of phonemes labeled with their actual duration categories. As described above, the routing module 310 can classify phonemes into categories associated with phoneme duration. In other words, the routing module 310 can determine, among multiple expert models, the expert model corresponding to a phoneme. In some implementations, the parameters of each expert model can be initialized from the parameters of the module used to determine duration in the trained first model.
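Two of the steps above lend themselves to a short sketch: deriving duration-category labels for the routing module from actual durations, and initializing every expert from the first model's duration predictor. The category boundaries, the use of frames as the unit, and the dictionary used as a stand-in for model parameters are all assumptions made for illustration; the disclosure does not specify them.

```python
import copy
from bisect import bisect_right
from typing import Dict, List

# Hypothetical category boundaries in frames:
# [0, 5) -> category 0, [5, 15) -> category 1, 15 and above -> category 2.
BOUNDARIES = [5, 15]


def duration_category(frames: int) -> int:
    """Map an actual duration to the index of its duration category."""
    return bisect_right(BOUNDARIES, frames)


def routing_targets(durations: List[int]) -> List[int]:
    """Training targets for the routing module: one category per phoneme."""
    return [duration_category(d) for d in durations]


def init_experts(base_params: Dict[str, List[float]], n: int) -> List[Dict[str, List[float]]]:
    """Start each expert as an independent copy of the first model's
    duration predictor parameters (represented here as a plain dict)."""
    return [copy.deepcopy(base_params) for _ in range(n)]


targets = routing_targets([3, 5, 14, 15, 40])
experts = init_experts({"w": [0.1, 0.2]}, len(BOUNDARIES) + 1)
```

Deep copies are used so that fine-tuning one expert does not silently change the others, which would defeat the purpose of having one expert per category.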
[0064] In some implementations, the training of the additional phoneme determination module 220 and the duration determination module 230 in the second model can also be performed in phases. Specifically, the parameters of the additional phoneme determination module 220 can first be determined using the training dataset constructed for that module. Then, with the parameters of the additional phoneme determination module 220 inherited, the parameters of the duration determination module 230 can be determined using the training dataset constructed for the duration determination module 230.
[0065] In this way, by inheriting some parameters from the trained first model, the second model can be fine-tuned using a smaller amount of natural spoken language-type speech data, thereby improving training efficiency.
[0066] At box 530, computing device 100 uses a third training dataset to fine-tune the second model a second time. The twice fine-tuned second model is used to generate, based on text, natural spoken language-type speech that reflects the speech characteristics of the target speaker. The third training dataset can be constructed from the target speaker's raw speech, using the raw speech and a corresponding set of phonemes. Compared to the first and second training datasets, the third training dataset has less training data. It should be noted that the third training dataset can be constructed from speech data that is not of the natural spoken language type.
[0067] The second model can be fine-tuned a second time using a third training dataset to learn the speech characteristics of the target speaker. Similarly, a subset of the parameters in the fine-tuned second model can be kept unchanged, and modules in the second model specifically trained for the speech characteristics of the target speaker can be trained using the third training dataset. For example, the parameters of the trained embedder, encoder, additional phoneme determination module 220, and duration determination module 230 can be kept unchanged. Only the parameters of the layers in the post-processing module 240 used to learn the speech characteristics of the target speaker, such as the parameters of conditional layer normalization, can be trained.
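The paragraph above names conditional layer normalization as an example of the speaker-specific parameters. The sketch below shows one commonly used formulation, in which the scale and shift applied after normalization are computed from a speaker embedding by linear maps. This formulation, the function names, and the toy weights are assumptions; the disclosure does not define the layer.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def layer_norm(x: Sequence[float], eps: float = 1e-5) -> Vector:
    """Normalize a hidden vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weight: Matrix, bias: Vector, inp: Sequence[float]) -> Vector:
    """Plain matrix-vector product plus bias."""
    return [sum(w * i for w, i in zip(row, inp)) + b for row, b in zip(weight, bias)]


def conditional_layer_norm(x: Sequence[float], speaker_emb: Sequence[float],
                           w_scale: Matrix, b_scale: Vector,
                           w_shift: Matrix, b_shift: Vector) -> Vector:
    """Layer norm whose scale and shift depend on the speaker embedding."""
    scale = linear(w_scale, b_scale, speaker_emb)
    shift = linear(w_shift, b_shift, speaker_emb)
    return [g * n + s for g, n, s in zip(scale, layer_norm(x), shift)]


hidden = [1.0, 2.0, 3.0]
zeros = [[0.0, 0.0] for _ in hidden]
# With zero weights, unit scale bias and zero shift bias this is plain layer norm.
plain = conditional_layer_norm(hidden, [1.0, 0.0], zeros, [1.0] * 3, zeros, [0.0] * 3)
# A speaker-dependent shift: only the second embedding dimension contributes.
shift_w = [[0.0, 0.5] for _ in hidden]
shifted = conditional_layer_norm(hidden, [0.0, 1.0], zeros, [1.0] * 3, shift_w, [0.0] * 3)
```

In this formulation only the four small matrices and vectors would be trained for a new speaker, which is consistent with the paragraph's point that very few parameters need to change in the second fine-tuning.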
[0068] In this way, by inheriting some parameters of the fine-tuned second model, the second model can be fine-tuned again using a smaller amount of speech data from the target speaker, thereby improving training efficiency. The twice fine-tuned second model can generate, based on text 170, natural spoken language-type speech 190 that conforms to the speech characteristics of the target speaker.
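The staged fine-tuning described for boxes 520 and 530 amounts to a plan of which modules are trainable at each stage while the rest stay frozen. The sketch below records that plan as data. The module names, stage names, and dataset sizes are hypothetical; only the pattern (embedder and encoder frozen during the first fine-tuning, everything except the conditional layer normalization frozen during the second, and dataset sizes decreasing) comes from the text.

```python
from typing import Dict, List, Set

# Hypothetical module names for the second model.
SECOND_MODEL_MODULES: List[str] = [
    "embedder",
    "encoder",
    "additional_phoneme_module",
    "duration_module",
    "postprocessing_other",
    "conditional_layer_norm",
]

# Trainable modules per fine-tuning stage; all others are kept unchanged.
FINE_TUNING_PLAN: Dict[str, Set[str]] = {
    "stage2_additional_phonemes": {"additional_phoneme_module"},
    "stage2_durations": {"duration_module"},
    "stage3_target_speaker": {"conditional_layer_norm"},
}

# Hypothetical dataset sizes in hours of speech, for illustration only.
DATASET_HOURS: Dict[str, float] = {"first": 100.0, "second": 10.0, "third": 0.1}


def frozen_modules(stage: str) -> List[str]:
    """Modules whose parameters are kept unchanged in the given stage."""
    trainable = FINE_TUNING_PLAN[stage]
    return [m for m in SECOND_MODEL_MODULES if m not in trainable]


def sizes_decrease(hours: Dict[str, float]) -> bool:
    """Check that the three datasets shrink from stage to stage."""
    return hours["first"] > hours["second"] > hours["third"]
```

For example, `frozen_modules("stage3_target_speaker")` lists every module except the conditional layer normalization. In a deep learning framework the same plan would typically be applied by disabling gradient updates for the frozen modules' parameters.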
[0069] It should be understood that the strategy of training the speech synthesis model in stages based on the embodiments of this disclosure can also be applied to other scenarios. For example, a second model generated based on the first model can be fine-tuned using a second training dataset for speech of different natural language types, building upon the first model already trained. Examples of different natural language types of speech can include whisper-type speech, speech-type speech, etc. In this way, the need for training data for specific natural language types of speech can be reduced, thereby improving training efficiency.
[0070] The following are some example implementations of this disclosure.
[0071] In a first aspect, this disclosure provides a computer-implemented method. The method includes: generating an initial phoneme sequence corresponding to text, the initial phoneme sequence including feature representations of multiple phonemes; generating a first phoneme sequence by inserting feature representations of additional phonemes related to the characteristics of natural spoken language into the initial phoneme sequence; generating a second phoneme sequence based on the first phoneme sequence by determining the duration of phonemes using an expert model corresponding to the phonemes among the multiple phonemes and the additional phonemes; and determining the speech of the natural spoken language type corresponding to the text based on the second phoneme sequence.
[0072] In some implementations, the additional phonemes include at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
[0073] In some implementations, generating the second phoneme sequence 270 based on the first phoneme sequence 260 includes: determining the category of a phoneme among the multiple phonemes and the additional phonemes; and predicting the duration of the phoneme using the expert model, among multiple expert models, that corresponds to the category.
[0074] In some implementations, determining the speech 190 of the natural spoken language type corresponding to the text 170 based on the second phoneme sequence 270 includes: updating the second phoneme sequence 270 based on the speech characteristics of a target speaker to generate a third phoneme sequence; and determining, based on the third phoneme sequence, the speech 190 of the natural spoken language type corresponding to both the text 170 and the target speaker.
[0075] In a second aspect, this disclosure provides an electronic device. The electronic device includes: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform actions, the actions including: generating an initial phoneme sequence corresponding to text, the initial phoneme sequence including feature representations of a plurality of phonemes; generating a first phoneme sequence by inserting feature representations of additional phonemes related to the characteristics of natural spoken language into the initial phoneme sequence; determining the duration of phonemes using an expert model corresponding to the phonemes among the plurality of phonemes and the additional phonemes; generating a second phoneme sequence based on the first phoneme sequence; and determining speech of a natural spoken language type corresponding to the text based on the second phoneme sequence.
[0076] In some implementations, the additional phonemes include at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
[0077] In some implementations, generating the second phoneme sequence 270 based on the first phoneme sequence 260 includes: determining the category of a phoneme among the multiple phonemes and the additional phonemes; and predicting the duration of the phoneme using the expert model, among multiple expert models, that corresponds to the category.
[0078] In some implementations, determining the speech 190 of the natural spoken language type corresponding to the text 170 based on the second phoneme sequence 270 includes: updating the second phoneme sequence 270 based on the speech characteristics of a target speaker to generate a third phoneme sequence; and determining, based on the third phoneme sequence, the speech 190 of the natural spoken language type corresponding to both the text 170 and the target speaker. In another aspect, this disclosure provides a computer program product tangibly stored in a non-transitory computer storage medium and including machine-executable instructions that, when executed by a device, cause the device to perform the method described above.
[0079] In a third aspect, this disclosure provides a computer program product including machine-executable instructions that, when executed by a device, cause the device to perform the method of the first aspect.
[0080] In a fourth aspect, this disclosure provides a computer-readable medium having machine-executable instructions stored thereon, which, when executed by a device, cause the device to perform the method of the first aspect.
[0081] In a fifth aspect, this disclosure provides a computer-implemented method. The method includes: training a first model using a first training dataset, the first model being used to generate speech based on text; fine-tuning a second model generated based on the first model using a second training dataset, the second model being used to generate natural spoken language-type speech based on text; and further fine-tuning the second model using a third training dataset, the second model being used to generate natural spoken language-type speech based on text that relates to the speech characteristics of a target speaker; wherein the sizes of the first training dataset, the second training dataset, and the third training dataset decrease sequentially.
[0082] In some implementations, the additional phonemes include at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
[0083] In some implementations, fine-tuning the second model generated based on the first model using the second training dataset includes: adding an additional phoneme determination module to the first model to generate the second model, the additional phoneme determination module being used to determine additional phonemes related to the characteristics of natural spoken language among multiple phonemes corresponding to speech of a natural spoken language type; and training the additional phoneme determination module using the second training dataset.
[0084] In some implementations, using a second training dataset to fine-tune a second model generated based on a first model includes: using the second training dataset to train a duration determination module in the second model, the duration determination module being used to determine the duration of multiple phonemes corresponding to speech of a natural spoken language type.
[0085] In some implementations, determining the duration of multiple phonemes corresponding to speech of a natural spoken language type includes: identifying expert models among multiple expert models that correspond to the phonemes in the multiple phonemes; and using the expert models to determine the duration of the phonemes.
[0086] In a sixth aspect, this disclosure provides an electronic device. The electronic device includes: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform actions, the actions including: training a first model using a first training dataset, the first model being used to generate speech based on text; fine-tuning a second model generated based on the first model using a second training dataset, the second model being used to generate natural spoken language based on text; and further fine-tuning the second model using a third training dataset, the second model being used to generate natural spoken language based on text related to the speech characteristics of a target speaker; and wherein the sizes of the first training dataset, the second training dataset, and the third training dataset decrease sequentially.
[0087] In some implementations, the additional phonemes include at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
[0088] In some implementations, fine-tuning the second model generated based on the first model using the second training dataset includes: adding an additional phoneme determination module to the first model to generate the second model, the additional phoneme determination module being used to determine additional phonemes related to the characteristics of natural spoken language among multiple phonemes corresponding to speech of a natural spoken language type; and training the additional phoneme determination module using the second training dataset.
[0089] In some implementations, using a second training dataset to fine-tune a second model generated based on a first model includes: using the second training dataset to train a duration determination module in the second model, the duration determination module being used to determine the duration of multiple phonemes corresponding to speech of a natural spoken language type.
[0090] In some implementations, determining the duration of multiple phonemes corresponding to speech of a natural spoken language type includes: identifying expert models among multiple expert models that correspond to the phonemes in the multiple phonemes; and using the expert models to determine the duration of the phonemes.
[0091] In a seventh aspect, this disclosure provides a computer program product including machine-executable instructions that, when executed by a device, cause the device to perform the method of the fifth aspect.
[0092] In an eighth aspect, this disclosure provides a computer-readable medium having machine-executable instructions stored thereon, which, when executed by a device, cause the device to perform the method of the fifth aspect.
[0093] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that can be used include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.
[0094] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0095] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0096] Furthermore, although the operations are described in a specific order, this should not be understood as requiring that such operations be performed in the specific order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired result. In certain environments, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation may also be implemented individually or in any suitable sub-combination in multiple implementations.
[0097] Although the subject matter has been described using language specific to structural features and / or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.
Claims
1. A method for speech generation, comprising: generating an initial phoneme sequence corresponding to text, the initial phoneme sequence including feature representations of multiple phonemes; generating a first phoneme sequence by inserting feature representations of additional phonemes into the initial phoneme sequence, the additional phonemes being related to the characteristics of natural spoken language; generating a second phoneme sequence based on the first phoneme sequence by determining the duration of a phoneme, among the multiple phonemes and the additional phonemes, using an expert model corresponding to the phoneme; and determining, based on the second phoneme sequence, speech of a natural spoken language type corresponding to the text; wherein generating the second phoneme sequence based on the first phoneme sequence comprises: determining the category of the phoneme among the multiple phonemes and the additional phonemes; and predicting the duration of the phoneme using the expert model, among multiple expert models, that corresponds to the category.
2. The method of claim 1, wherein determining the speech of the natural spoken language type corresponding to the text based on the second phoneme sequence comprises: updating the second phoneme sequence based on speech characteristics of a target speaker to generate a third phoneme sequence; and determining, based on the third phoneme sequence, speech of the natural spoken language type corresponding to both the text and the target speaker.
3. The method of claim 1, wherein the additional phonemes comprise at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
4. A method for training a speech generation model, comprising: training a first model using a first training dataset, the first model being used to generate speech based on text; fine-tuning, using a second training dataset, a second model generated based on the first model, the second model being used to generate natural spoken language-type speech based on text; and fine-tuning the second model a second time using a third training dataset, the twice fine-tuned second model being used to generate, based on text, natural spoken language-type speech related to the speech characteristics of a target speaker; wherein the sizes of the first training dataset, the second training dataset, and the third training dataset decrease sequentially; wherein fine-tuning the second model generated based on the first model using the second training dataset comprises: training, using the second training dataset, a duration determination module in the second model, the duration determination module being used to determine the durations of multiple phonemes corresponding to speech of the natural spoken language type; and wherein determining the durations of the multiple phonemes corresponding to the speech of the natural spoken language type comprises: identifying, among multiple expert models, the expert model corresponding to a phoneme among the multiple phonemes; and determining the duration of the phoneme using the expert model.
5. The method of claim 4, wherein fine-tuning the second model generated based on the first model using the second training dataset comprises: adding an additional phoneme determination module to the first model to generate the second model, the additional phoneme determination module being used to determine additional phonemes related to the characteristics of natural spoken language among multiple phonemes corresponding to speech of the natural spoken language type; and training the additional phoneme determination module using the second training dataset.
6. The method of claim 5, wherein the additional phonemes comprise at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
7. An electronic device for speech generation, comprising: a processing unit; and a memory coupled to the processing unit and containing instructions stored thereon, the instructions, when executed by the processing unit, causing the device to perform actions comprising: generating an initial phoneme sequence corresponding to text, the initial phoneme sequence including feature representations of multiple phonemes; generating a first phoneme sequence by inserting feature representations of additional phonemes into the initial phoneme sequence, the additional phonemes being related to the characteristics of natural spoken language; generating a second phoneme sequence based on the first phoneme sequence by determining the duration of a phoneme, among the multiple phonemes and the additional phonemes, using an expert model corresponding to the phoneme; and determining, based on the second phoneme sequence, speech of a natural spoken language type corresponding to the text; wherein generating the second phoneme sequence based on the first phoneme sequence comprises: determining the category of the phoneme among the multiple phonemes and the additional phonemes; and predicting the duration of the phoneme using the expert model, among multiple expert models, that corresponds to the category.
8. The electronic device of claim 7, wherein determining the speech of the natural spoken language type corresponding to the text based on the second phoneme sequence comprises: updating the second phoneme sequence based on speech characteristics of a target speaker to generate a third phoneme sequence; and determining, based on the third phoneme sequence, speech of the natural spoken language type corresponding to both the text and the target speaker.
9. The electronic device of claim 7, wherein the additional phonemes comprise at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
10. An electronic device for training a speech generation model, comprising: Processing unit; as well as A memory, coupled to the processing unit and containing instructions stored thereon, which, when executed by the processing unit, cause the device to perform actions, including: A first model is trained using a first training dataset, and the first model is used to generate speech based on text; A second training dataset is used to fine-tune a second model generated based on the first model, the second model being used to generate natural spoken language-type speech based on the text; and The second model is further fine-tuned using a third training dataset. This second-fine-tuned model is then used to generate natural spoken language based on the text, which is related to the speech characteristics of the target speaker. The sizes of the first training dataset, the second training dataset, and the third training dataset decrease sequentially. The fine-tuning of the second model generated based on the first model using the second training dataset includes: The second training dataset is used to train a duration determination module in the second model, the duration determination module being used to determine the duration of multiple phonemes corresponding to the natural spoken language type of speech, and The determination of the duration of the plurality of phonemes corresponding to the speech of the natural spoken language type includes: Identify the expert model among multiple expert models that corresponds to the phonemes among the multiple phonemes; and The duration of the phoneme is determined using the expert model.
11. The electronic device of claim 10, wherein fine-tuning the second model generated based on the first model using the second training dataset comprises: adding an additional phoneme determination module to the first model to generate the second model, the additional phoneme determination module being used to determine, among multiple phonemes corresponding to the natural spoken language-type speech, additional phonemes related to the characteristics of natural spoken language; and training the additional phoneme determination module using the second training dataset.
12. The electronic device of claim 11, wherein the additional phonemes comprise at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
13. A computer program product comprising machine-executable instructions that, when executed by a device, cause the device to perform actions comprising: generating an initial phoneme sequence corresponding to text, the initial phoneme sequence including feature representations of multiple phonemes; generating a first phoneme sequence by inserting feature representations of additional phonemes into the initial phoneme sequence, the additional phonemes being related to the characteristics of natural spoken language; generating a second phoneme sequence based on the first phoneme sequence by determining the duration of a phoneme, among the multiple phonemes and the additional phonemes, using an expert model corresponding to that phoneme; and determining, based on the second phoneme sequence, the natural spoken language type of speech corresponding to the text; wherein generating the second phoneme sequence based on the first phoneme sequence comprises: determining the category of the phoneme among the multiple phonemes and the additional phonemes; and predicting the duration of the phoneme using the expert model, among multiple expert models, that corresponds to the category.
14. The computer program product of claim 13, wherein determining the natural spoken language type of speech corresponding to the text based on the second phoneme sequence comprises: updating the second phoneme sequence based on the speech characteristics of a target speaker to generate a third phoneme sequence; and determining, based on the third phoneme sequence, the natural spoken language type of speech corresponding to both the text and the target speaker.
15. The computer program product of claim 13, wherein the additional phonemes comprise at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
16. A computer program product comprising machine-executable instructions that, when executed by a device, cause the device to perform actions comprising: training a first model using a first training dataset, the first model being used to generate speech based on text; fine-tuning, using a second training dataset, a second model generated based on the first model, the second model being used to generate natural spoken language-type speech based on the text; and further fine-tuning the second model using a third training dataset, the further fine-tuned second model being used to generate natural spoken language-type speech that is based on the text and related to the speech characteristics of a target speaker; wherein the sizes of the first training dataset, the second training dataset, and the third training dataset decrease sequentially; wherein fine-tuning the second model generated based on the first model using the second training dataset comprises: training a duration determination module in the second model using the second training dataset, the duration determination module being used to determine the durations of multiple phonemes corresponding to the natural spoken language-type speech; and wherein determining the durations of the multiple phonemes corresponding to the natural spoken language-type speech comprises: identifying, among multiple expert models, the expert model that corresponds to a phoneme among the multiple phonemes; and determining the duration of the phoneme using the identified expert model.
17. The computer program product of claim 16, wherein fine-tuning the second model generated based on the first model using the second training dataset comprises: adding an additional phoneme determination module to the first model to generate the second model, the additional phoneme determination module being used to determine, among multiple phonemes corresponding to the natural spoken language-type speech, additional phonemes related to the characteristics of natural spoken language; and training the additional phoneme determination module using the second training dataset.
18. The computer program product of claim 17, wherein the additional phonemes comprise at least one of the following: phonemes indicating pauses; phonemes indicating repetitions; and phonemes indicating idiomatic expressions.
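Illustrative sketch (not part of the claims): the phoneme insertion and category-based expert routing recited in claims 13 and 15 might be pictured roughly as follows. Everything here is an assumption made for illustration: the names `Phoneme`, `EXPERTS`, `insert_additional`, and `expand_by_duration`, the category labels, the fixed frame counts that stand in for trained expert models, and the choice to build the second sequence by repeating each phoneme for its predicted number of frames. The claims do not specify any of these details.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Phoneme:
    symbol: str
    category: str  # hypothetical labels: "regular", "pause", "repeat", "idiom"


# A duration expert maps a phoneme to a duration in frames.
DurationExpert = Callable[[Phoneme], int]

# Placeholder experts, one per category. Real experts would be trained
# models; the constant frame counts are made up for this sketch.
EXPERTS: Dict[str, DurationExpert] = {
    "regular": lambda p: 5,
    "pause": lambda p: 12,
    "repeat": lambda p: 4,
    "idiom": lambda p: 8,
}


def insert_additional(
    initial: List[Phoneme], insertions: Dict[int, List[Phoneme]]
) -> List[Phoneme]:
    """Build the first sequence: insert additional phonemes before the
    given positions of the initial sequence (len(initial) means 'at end')."""
    first: List[Phoneme] = []
    for index, phoneme in enumerate(initial):
        first.extend(insertions.get(index, []))
        first.append(phoneme)
    first.extend(insertions.get(len(initial), []))
    return first


def expand_by_duration(
    first: List[Phoneme], experts: Dict[str, DurationExpert]
) -> List[Phoneme]:
    """Build the second sequence: route each phoneme to the expert for its
    category and repeat the phoneme for the predicted number of frames."""
    second: List[Phoneme] = []
    for phoneme in first:
        expert = experts[phoneme.category]  # pick the expert by category
        second.extend([phoneme] * expert(phoneme))
    return second


if __name__ == "__main__":
    initial = [Phoneme("h", "regular"), Phoneme("i", "regular")]
    first = insert_additional(initial, {1: [Phoneme("<pause>", "pause")]})
    second = expand_by_duration(first, EXPERTS)
    print([p.symbol for p in first], len(second))
```

In this sketch the pause phoneme is handled by a different expert than the regular phonemes, which is the point of routing by category: each expert can specialize in the duration behavior of one kind of phoneme.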
Citation Information
Patent Citations
Model training method, voice synthesis method, device and equipment and storage medium
CN111667816A
Systems and methods for multi-speaker neural text-to-speech
US20180336880A1
Method and apparatus for speech synthesis using paralinguistic variation
US8103505B1