Speech synthesis method, system, terminal and storage medium
By training a tone prediction model and an end-to-end speech synthesis model, and using fundamental frequency information for clustering and tone setting, the problem of low efficiency in existing speech synthesis systems is solved, and automatic tone annotation and speech synthesis are realized, thereby improving the efficiency of system construction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNISOUND INFORMATION TECH CO LTD
- Filing Date
- 2023-07-27
- Publication Date
- 2026-06-19
AI Technical Summary
Existing speech synthesis systems are inefficient to build because their pronunciation information relies on manually configured dictionaries.
By acquiring speech samples to train a tone prediction model, using fundamental frequency information for clustering and tone setting, and combining it with an end-to-end speech synthesis model, tone prediction and speech synthesis are automatically performed, avoiding the need for manually constructing pronunciation information rules.
This improves the efficiency of building speech synthesis systems by automating tone annotation and speech synthesis, so that pronunciation information rules no longer need to be constructed manually.
Figure CN116884389B_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a speech synthesis method, system, terminal, and storage medium.
Background Technology
[0002] Speech synthesis is a technology that transforms text information into fluent, understandable spoken output. The speech synthesis process involves first converting text information into linguistic features or phonemes, and then converting those features or phonemes into audio waveforms.
[0003] In existing speech synthesis processes, speech pronunciation information is generally constructed from manually configured dictionaries, resulting in low efficiency in building speech synthesis systems.
Summary of the Invention
[0004] The purpose of this invention is to provide a speech synthesis method, system, terminal, and storage medium, aiming to solve the problem of low efficiency in building existing speech synthesis systems.
[0005] The present invention is implemented as follows: a speech synthesis method, the method comprising:
[0006] Acquire speech samples and train a tone prediction model based on the speech samples;
[0007] The fundamental frequency information of the speech sample and the speech synthesis sample are obtained respectively to obtain the first fundamental frequency information and the second fundamental frequency information, and the first fundamental frequency information is clustered to obtain the first clustering information;
[0008] The tone of the speech synthesis samples is set according to the first clustering information and the second fundamental frequency information, and an end-to-end speech synthesis model is trained according to the speech synthesis samples after tone setting;
[0009] The pinyin information of the text to be synthesized is input into the trained tone prediction model to predict the tone and obtain the output pinyin. The output pinyin is then input into the trained end-to-end speech synthesis model to synthesize the speech and obtain the synthesized speech.
[0010] Preferably, the step of setting the tone of the speech synthesis sample based on the first clustering information and the second fundamental frequency information includes:
[0011] The fundamental frequency character information of each character in the speech synthesis sample is obtained from the second fundamental frequency information to obtain the second fundamental frequency character information, and the second fundamental frequency character information is sampled to obtain the second sampling information;
[0012] Calculate the distance between each item of second sampling information and each cluster center in the first clustering information to obtain the clustering distances;
[0013] The tone of the speech synthesis sample is set according to the tone of the tone category corresponding to the minimum clustering distance.
[0014] Preferably, training the tone prediction model based on the speech samples includes:
[0015] Align the sample text and speech information in the speech sample, and delete the tone of the aligned sample text;
[0016] The sample text after tone removal is input into the tone prediction model to predict the tone, and the tone prediction speech is obtained. The model loss is determined based on the tone prediction speech and the tone of the aligned sample text.
[0017] The parameters of the tone prediction model are updated based on the model loss until the tone prediction model converges.
[0018] Preferably, before inputting the pinyin information of the text to be synthesized into the trained tone prediction model for tone prediction, the method further includes:
[0019] The pronunciation type of each speech character in the text to be synthesized is queried, and the pronunciation type includes polyphonic type and monophonic type;
[0020] If the speech character is a monophonic type, the pronunciation of the speech character is determined according to a preset speech dictionary to obtain the character pronunciation;
[0021] If the speech character is of the polyphonic type, the corresponding word or text sentence is input into the polyphonic character prediction model for pronunciation prediction, and the pronunciation of the speech character is determined based on the pronunciation prediction result to obtain the character pronunciation;
[0022] Remove the tones from the pronunciation of each character to obtain the pinyin information.
[0023] Preferably, the clustering of the first fundamental frequency information includes:
[0024] The fundamental frequency character information of each character in the speech sample in the first fundamental frequency information is obtained to obtain the first fundamental frequency character information, and the first fundamental frequency character information is sampled to obtain the first sampling information;
[0025] The first sampling information is clustered according to the preset tone categories to obtain the first clustering information.
[0026] Another objective of this invention is to provide a speech synthesis system, the system comprising:
[0027] A tone training module is used to acquire speech samples and train a tone prediction model based on the speech samples;
[0028] The clustering module is used to obtain the fundamental frequency information of the speech samples and the speech synthesis samples respectively, to obtain the first fundamental frequency information and the second fundamental frequency information, and to cluster the first fundamental frequency information to obtain the first clustering information;
[0029] The speech training module is used to set the tone of the speech synthesis sample according to the first clustering information and the second fundamental frequency information, and to train an end-to-end speech synthesis model according to the speech synthesis sample after tone setting.
[0030] The speech synthesis module is used to input the pinyin information of the text to be synthesized into the trained tone prediction model to predict the tone, obtain the output pinyin, and input the output pinyin into the trained end-to-end speech synthesis model to synthesize the speech, thereby obtaining the synthesized speech.
[0031] Preferably, the speech training module is further used for:
[0032] The fundamental frequency character information of each character in the speech synthesis sample is obtained from the second fundamental frequency information to obtain the second fundamental frequency character information, and the second fundamental frequency character information is sampled to obtain the second sampling information;
[0033] Calculate the distance between each item of second sampling information and each cluster center in the first clustering information to obtain the clustering distances;
[0034] The tone of the speech synthesis sample is set according to the tone of the tone category corresponding to the minimum clustering distance.
[0035] Another objective of this invention is to provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described above.
[0036] Another objective of this invention is to provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.
[0037] In this embodiment of the invention, by inputting the pinyin information of the text to be synthesized into a trained tone prediction model for tone prediction, the tone information of the text to be synthesized can be automatically annotated to obtain the output pinyin. By inputting the output pinyin into a trained end-to-end speech synthesis model for speech synthesis, the synthesized speech corresponding to the text to be synthesized can be automatically obtained. There is no need to manually establish the construction rules for pronunciation information, which improves the efficiency of speech synthesis system construction.
Attached Figure Description
[0038] Figure 1 is a flowchart of the speech synthesis method provided in the first embodiment of the present invention;
[0039] Figure 2 is a flowchart of the speech synthesis method provided in the second embodiment of the present invention;
[0040] Figure 3 is a schematic diagram of the speech synthesis system provided in the third embodiment of the present invention;
[0041] Figure 4 is a schematic diagram of the structure of the terminal device provided in the fourth embodiment of the present invention.
Detailed Implementation
[0042] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0043] To illustrate the technical solution described in this invention, specific embodiments are described below.
[0044] Example 1
[0045] Please refer to Figure 1, which is a flowchart of a speech synthesis method provided in the first embodiment of the present invention. This speech synthesis method can be applied to any terminal device or system, and includes the following steps:
[0046] Step S10: Obtain speech samples and train a tone prediction model based on the speech samples;
[0047] Specifically, speech samples are obtained from a Cantonese speech recognition database. These speech samples include sample text and the corresponding speech information. A tone prediction model is trained based on the speech samples, so that the trained tone prediction model can predict the tones of input text and thereby annotate the input text with tones.
[0048] Optionally, training the tone prediction model based on the speech samples includes:
[0049] The sample text and speech information in the speech samples are aligned, and the tones of the aligned sample text are removed; wherein, by aligning the sample text and speech information, the accuracy of the correspondence between the sample text and speech information is effectively improved.
[0050] The sample text after tone removal is input into the tone prediction model to predict the tone, and the tone prediction speech is obtained. The model loss is determined based on the tone prediction speech and the tone of the aligned sample text.
[0051] The parameters of the tone prediction model are updated based on the model loss until the model converges. In this embodiment, an end-to-end Transformer model is trained with the pinyin information of the sample text without tones as input and the pinyin information with tones as output, to obtain the trained tone prediction model.
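As a rough illustration of how training pairs for the tone prediction model might be prepared, the sketch below strips the tones from toned pinyin to form the model input and keeps the toned sequence as the target. It assumes each syllable is romanized with a trailing tone digit (for example `nei5`); that notation and the helper names `strip_tone` and `make_training_pair` are illustrative assumptions and do not come from the patent text.

```python
import re

# Assumption: a syllable carries its tone as trailing digits, e.g. "nei5".
_TONE_DIGIT = re.compile(r"\d+$")


def strip_tone(syllable: str) -> str:
    """Remove a trailing tone digit from one romanized syllable."""
    return _TONE_DIGIT.sub("", syllable)


def make_training_pair(toned_syllables):
    """Return (input without tones, target with tones) for one sample text."""
    source = [strip_tone(s) for s in toned_syllables]
    target = list(toned_syllables)
    return source, target


# Hypothetical sample: the aligned sample text already carries tones.
source, target = make_training_pair(["nei5", "hou2"])
print(source, target)
```

A sequence model such as the Transformer mentioned above would then be trained to map `source` to `target`; the model itself is outside the scope of this sketch.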
[0052] Step S20: Obtain the fundamental frequency information of the speech sample and the speech synthesis sample respectively to obtain the first fundamental frequency information and the second fundamental frequency information, and cluster the first fundamental frequency information to obtain the first clustering information;
[0053] In this step, clustering the first fundamental frequency information groups the fundamental frequency information of each character in the speech sample into the preset tone categories. The preset tone categories can be set according to requirements. In this embodiment, there are 9 tone categories, each corresponding to a different tone.
[0054] Step S30: Set the tone of the speech synthesis sample according to the first clustering information and the second fundamental frequency information, and train an end-to-end speech synthesis model according to the speech synthesis sample after tone setting.
[0055] Setting the tone of the speech synthesis sample based on the first clustering information and the second fundamental frequency information sets the tone categories in the speech synthesis sample to the tone categories corresponding to the speech sample. For example, when the speech sample is Cantonese, the tones of the speech synthesis sample are set to the tones corresponding to Cantonese.
[0056] Optionally, setting the tone of the speech synthesis sample based on the first clustering information and the second fundamental frequency information includes:
[0057] The fundamental frequency character information of each character in the speech synthesis sample is obtained from the second fundamental frequency information to obtain the second fundamental frequency character information, and the second fundamental frequency character information is sampled to obtain the second sampling information;
[0058] Calculate the distance between each item of second sampling information and each cluster center in the first clustering information to obtain the clustering distances;
[0059] The tone of the speech synthesis sample is set according to the tone of the tone category corresponding to the minimum clustering distance.
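The tone-setting steps above amount to a nearest-center assignment, which can be sketched as follows. The source does not name a distance metric, so Euclidean distance is an assumption here, as are the function names and the representation of the first clustering information as a dictionary from tone label to center vector.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length contours."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_tone(contour, centers):
    """Return (tone label, distance) of the closest cluster center.

    contour: sampled fundamental frequency values for one character.
    centers: dict mapping a tone label to its cluster center vector
             (an assumed layout for the first clustering information).
    """
    best = min(centers, key=lambda label: euclidean(contour, centers[label]))
    return best, euclidean(contour, centers[best])


# Toy data with two tone categories and two sample points per contour.
toy_centers = {1: [200.0, 200.0], 2: [100.0, 150.0]}
label, distance = assign_tone([110.0, 140.0], toy_centers)
print(label, round(distance, 2))
```

In the embodiment described here, `centers` would hold 9 entries and each contour would hold the sampled points for one character.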
[0060] Step S40: Input the pinyin information of the text to be synthesized into the trained tone prediction model to perform tone prediction, obtain the output pinyin, and input the output pinyin into the trained end-to-end speech synthesis model to perform speech synthesis, obtain the synthesized speech;
[0061] The trained tone prediction model automatically annotates the tone information of the text to be synthesized to obtain the output pinyin, and the trained end-to-end speech synthesis model automatically synthesizes speech from the output pinyin to obtain the synthesized speech for the text to be synthesized.
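The two-stage inference flow of step S40 can be sketched as a simple function composition. Both models are passed in as callables; `predict_tones` and `synthesize_speech` are hypothetical stand-ins for the trained tone prediction model and the trained end-to-end speech synthesis model, whose real interfaces the source does not specify.

```python
def synthesize_text(toneless_pinyin, predict_tones, synthesize_speech):
    """Run tone prediction, then speech synthesis, on one utterance.

    toneless_pinyin: list of syllables without tones.
    predict_tones: callable returning the same syllables with tones added.
    synthesize_speech: callable turning toned pinyin into audio (any type).
    """
    output_pinyin = predict_tones(toneless_pinyin)
    if len(output_pinyin) != len(toneless_pinyin):
        raise ValueError("tone prediction must keep one output per syllable")
    return synthesize_speech(output_pinyin)


# Stub models for illustration only: every syllable gets tone 1, and the
# "audio" is just the joined pinyin string.
audio = synthesize_text(
    ["nei", "hou"],
    predict_tones=lambda seq: [s + "1" for s in seq],
    synthesize_speech=lambda seq: " ".join(seq),
)
print(audio)
```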
[0062] Optionally, before inputting the pinyin information of the text to be synthesized into the trained tone prediction model for tone prediction, the method further includes:
[0063] The pronunciation type of each speech character in the text to be synthesized is queried, where the pronunciation type includes polyphonic type and monophonic type;
[0064] If the speech character is of the monophonic type, the pronunciation of the speech character is determined according to a preset speech dictionary to obtain the character pronunciation; wherein, the preset speech dictionary can be set according to requirements, and the preset speech dictionary stores the correspondence between monophonic speech characters and their pronunciations;
[0065] If the speech character is of the polyphonic type, the corresponding phrase or text sentence is input into the polyphonic character prediction model for pronunciation prediction, and the pronunciation of the speech character is determined based on the pronunciation prediction result to obtain the character pronunciation; wherein, the polyphonic character prediction model can be set according to requirements, and the polyphonic character prediction model is used to predict the pronunciation of polyphonic characters;
[0066] Remove the tones from the pronunciation of each character to obtain the pinyin information.
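The preprocessing steps above can be sketched as a per-character lookup with a fallback to a prediction model. The dictionary contents, the set of polyphonic characters, and the `predict_polyphonic` callable are all hypothetical placeholders; the sketch also assumes pronunciations are written with a trailing tone digit.

```python
import re

# Assumption: pronunciations end in a tone digit, e.g. "hou2".
_TONE_DIGIT = re.compile(r"\d+$")


def text_to_toneless_pinyin(text, mono_dict, polyphonic_chars, predict_polyphonic):
    """Return the toneless pinyin of each character in text.

    mono_dict: preset dictionary mapping monophonic characters to pronunciations.
    polyphonic_chars: set of characters that have more than one pronunciation.
    predict_polyphonic: callable (text, index) -> pronunciation, standing in
        for the polyphonic character prediction model.
    """
    pinyin = []
    for index, char in enumerate(text):
        if char in polyphonic_chars:
            pronunciation = predict_polyphonic(text, index)
        else:
            pronunciation = mono_dict[char]
        pinyin.append(_TONE_DIGIT.sub("", pronunciation))
    return pinyin


# Toy dictionary with two monophonic entries and no polyphonic characters.
toy_dict = {"你": "nei5", "好": "hou2"}
print(text_to_toneless_pinyin("你好", toy_dict, set(), lambda t, i: ""))
```

Passing the whole text and the character index to the predictor lets a context-aware model see the surrounding word or sentence, which matches the description of the polyphonic branch.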
[0067] In this embodiment, by inputting the pinyin information of the text to be synthesized into the trained tone prediction model for tone prediction, the tone information of the text to be synthesized can be automatically labeled to obtain the output pinyin. By inputting the output pinyin into the trained end-to-end speech synthesis model for speech synthesis, the synthesized speech corresponding to the text to be synthesized can be automatically obtained. There is no need to manually establish the construction rules for pronunciation information, which improves the efficiency of speech synthesis system construction.
[0068] Example 2
[0069] Please refer to Figure 2, which is a flowchart of a speech synthesis method provided in the second embodiment of the present invention. This embodiment further refines step S20 in the first embodiment, including the following steps:
[0070] Step S21: Obtain the fundamental frequency character information of each character in the speech sample from the first fundamental frequency information to obtain the first fundamental frequency character information, and sample the first fundamental frequency character information to obtain the first sampling information;
[0071] In this step, the fundamental frequency information corresponding to each character in the speech sample is sampled to obtain the first sample information. The number of samples of the fundamental frequency information can be set according to the requirements. For example, the number of samples can be set to 10, 15 or 20. In this step, the number of samples is set to 10 points, that is, the fundamental frequency information corresponding to each character is sampled as 10 sample points.
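One way to reduce a variable-length fundamental frequency contour to a fixed number of points is linear interpolation at evenly spaced positions, sketched below. The source only states that each character is sampled to 10 points, so the interpolation scheme and the function name `resample_contour` are assumptions.

```python
def resample_contour(f0, num_points=10):
    """Resample one character's F0 contour to num_points values.

    Uses linear interpolation at evenly spaced positions from the first
    to the last frame (an assumed scheme, not stated in the source).
    """
    if not f0:
        raise ValueError("contour must contain at least one value")
    if num_points < 1:
        raise ValueError("num_points must be positive")
    if len(f0) == 1 or num_points == 1:
        return [float(f0[0])] * num_points
    step = (len(f0) - 1) / (num_points - 1)
    samples = []
    for k in range(num_points):
        position = k * step
        low = min(int(position), len(f0) - 1)
        high = min(low + 1, len(f0) - 1)
        fraction = position - low
        samples.append(f0[low] * (1 - fraction) + f0[high] * fraction)
    return samples


print(resample_contour([100.0, 200.0], 3))
```

Fixed-length vectors produced this way can be compared directly, which is what the clustering and distance steps require.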
[0072] Step S22: Perform tone clustering on the first sampled information according to the preset tone category to obtain the first clustering information;
[0073] In this step, all fundamental frequency information corresponding to the same character in the first sampling information is clustered into 9 categories. The algorithm used for tone clustering can be set according to requirements. In this step, tone clustering is performed based on the K-means clustering algorithm to obtain the first cluster information.
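For reference, a minimal K-means sketch in plain Python is given below. It is not the patent's implementation: the random initialization, the seed, the iteration limit, and the function names are assumptions chosen to keep the example self-contained.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Cluster points into k groups; return (centers, labels)."""
    rng = random.Random(seed)
    # Initialize centers from k distinct input points (requires k <= len(points)).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: _sq_dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy data: two well-separated groups of two-point contours.
toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, labels = kmeans(toy, k=2)
print(sorted(centers), labels)
```

For the setting described in this step, the call would be `kmeans(contours, k=9)` with each contour holding 10 sampled points; mapping each resulting cluster to a named tone is a separate step that this sketch does not cover.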
[0074] In this embodiment, by obtaining the fundamental frequency character information of each character in the first fundamental frequency information, the sampling of the fundamental frequency character information is effectively guaranteed. Based on the preset tone category, the first sampled information is clustered by tone, which effectively guarantees the calculation of the clustering distance between the first clustering information and the second fundamental frequency information.
[0075] Example 3
[0076] Please refer to Figure 3, which is a schematic diagram of the structure of a speech synthesis system 100 provided in the third embodiment of the present invention, including: a tone training module 10, a clustering module 11, a speech training module 12, and a speech synthesis module 13, wherein:
[0077] The tone training module 10 is used to acquire speech samples and train a tone prediction model based on the speech samples.
[0078] Optionally, the tone training module 10 is further configured to: align the sample text and speech information in the speech sample, and delete the tone of the aligned sample text;
[0079] The sample text after tone removal is input into the tone prediction model to predict the tone, and the tone prediction speech is obtained. The model loss is determined based on the tone prediction speech and the tone of the aligned sample text.
[0080] The parameters of the tone prediction model are updated based on the model loss until the tone prediction model converges.
[0081] Clustering module 11 is used to obtain the fundamental frequency information of the speech sample and the speech synthesis sample respectively, to obtain the first fundamental frequency information and the second fundamental frequency information, and to cluster the first fundamental frequency information to obtain the first clustering information.
[0082] Optionally, the clustering module 11 is further configured to: obtain the fundamental frequency character information of each character in the speech sample from the first fundamental frequency information to obtain the first fundamental frequency character information, and sample the first fundamental frequency character information to obtain the first sampling information;
[0083] The first sampling information is clustered according to the preset tone categories to obtain the first clustering information.
[0084] The speech training module 12 is used to set the tone of the speech synthesis sample according to the first clustering information and the second fundamental frequency information, and to train an end-to-end speech synthesis model according to the speech synthesis sample after tone setting.
[0085] Optionally, the speech training module 12 is further configured to: obtain the fundamental frequency character information of each character in the speech synthesis sample from the second fundamental frequency information to obtain the second fundamental frequency character information, and sample the second fundamental frequency character information to obtain the second sampling information;
[0086] Calculate the distance between each item of second sampling information and each cluster center in the first clustering information to obtain the clustering distances;
[0087] The tone of the speech synthesis sample is set according to the tone of the tone category corresponding to the minimum clustering distance.
[0088] The speech synthesis module 13 is used to input the pinyin information of the text to be synthesized into the trained tone prediction model to predict the tone, obtain the output pinyin, and input the output pinyin into the trained end-to-end speech synthesis model to synthesize the speech, thereby obtaining the synthesized speech.
[0089] Optionally, the speech synthesis module 13 is further configured to: query the pronunciation type of each speech character in the text to be synthesized, wherein the pronunciation type includes polyphonic type and monophonic type;
[0090] If the speech character is a monophonic type, the pronunciation of the speech character is determined according to a preset speech dictionary to obtain the character pronunciation;
[0091] If the speech character is of the polyphonic type, the corresponding word or text sentence is input into the polyphonic character prediction model for pronunciation prediction, and the pronunciation of the speech character is determined based on the pronunciation prediction result to obtain the character pronunciation;
[0092] The tone marks of each character's pronunciation are removed to obtain the pinyin information.
[0093] In this embodiment, by inputting the pinyin information of the text to be synthesized into the trained tone prediction model for tone prediction, the tone information of the text to be synthesized can be automatically labeled to obtain the output pinyin. By inputting the output pinyin into the trained end-to-end speech synthesis model for speech synthesis, the synthesized speech corresponding to the text to be synthesized can be automatically obtained. There is no need to manually establish the construction rules for pronunciation information, which improves the efficiency of speech synthesis system construction.
[0094] Example 4
[0095] Figure 4 is a structural block diagram of a terminal device 2 provided in the fourth embodiment of this application. As shown in Figure 4, the terminal device 2 in this embodiment includes: a processor 20, a memory 21, and a computer program 22 stored in the memory 21 and executable on the processor 20, such as a speech synthesis method program. When the processor 20 executes the computer program 22, it implements the steps in the various embodiments of the speech synthesis methods described above.
[0096] For example, the computer program 22 may be divided into one or more modules, which are stored in the memory 21 and executed by the processor 20 to complete this application. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 22 in the terminal device 2. The terminal device may include, but is not limited to, the processor 20 and the memory 21.
[0097] The processor 20 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.
[0098] The memory 21 can be an internal storage unit of the terminal device 2, such as a hard drive or memory of the terminal device 2. The memory 21 can also be an external storage device of the terminal device 2, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card equipped on the terminal device 2. Furthermore, the memory 21 can include both internal and external storage units of the terminal device 2. The memory 21 is used to store the computer program and other programs and data required by the terminal device. The memory 21 can also be used to temporarily store data that has been output or will be output.
[0099] Furthermore, the functional modules in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0100] If an integrated module is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. This computer-readable storage medium can be non-volatile or volatile. Based on this understanding, all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable storage medium can include: any entity or device capable of carrying computer program code, recording media, USB flash drives, portable hard drives, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the contents of a computer-readable storage medium may be appropriately added to or subtracted from the contents as required by the legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, a computer-readable storage medium may not include electrical carrier signals and telecommunication signals.
[0101] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.
Claims
1. A speech synthesis method, characterized in that the method includes:
Acquire speech samples and train a tone prediction model based on the speech samples;
The fundamental frequency information of the speech sample and the speech synthesis sample are obtained respectively to obtain the first fundamental frequency information and the second fundamental frequency information, and the first fundamental frequency information is clustered to obtain the first clustering information;
The tone of the speech synthesis samples is set according to the first clustering information and the second fundamental frequency information, and an end-to-end speech synthesis model is trained according to the speech synthesis samples after tone setting;
The pinyin information of the text to be synthesized is input into the trained tone prediction model to predict the tone and obtain the output pinyin, and the output pinyin is then input into the trained end-to-end speech synthesis model to synthesize speech and obtain the synthesized speech;
The step of setting the tone of the speech synthesis sample based on the first clustering information and the second fundamental frequency information includes:
The fundamental frequency character information of each character in the speech synthesis sample is obtained from the second fundamental frequency information to obtain the second fundamental frequency character information, and the second fundamental frequency character information is sampled to obtain the second sampling information;
Calculate the distance between each item of second sampling information and each cluster center in the first clustering information to obtain the clustering distances;
The tone of the speech synthesis sample is set according to the tone of the tone category corresponding to the minimum clustering distance;
The clustering of the first fundamental frequency information includes:
The fundamental frequency character information of each character in the speech sample is obtained from the first fundamental frequency information to obtain the first fundamental frequency character information, and the first fundamental frequency character information is sampled to obtain the first sampling information;
The first sampling information is clustered according to the preset tone categories to obtain the first clustering information.
2. The speech synthesis method as described in claim 1, characterized in that, The step of training the tone prediction model based on the speech samples includes: Align the sample text and speech information in the speech sample, and delete the tone of the aligned sample text; The sample text after tone removal is input into the tone prediction model to predict the tone, and the tone prediction speech is obtained. The model loss is determined based on the tone prediction speech and the tone of the aligned sample text. The parameters of the tone prediction model are updated based on the model loss until the tone prediction model converges.
3. The speech synthesis method as described in claim 1, characterized in that, Before inputting the pinyin information of the text to be synthesized into the trained tone prediction model for tone prediction, the method further includes: The pronunciation type of each speech character in the text to be synthesized is queried, and the pronunciation type includes polyphonic type and monophonic type; If the speech character is a monophonic type, the pronunciation of the speech character is determined according to a preset speech dictionary to obtain the character pronunciation; If the speech character is of the polyphonic type, the corresponding word or text sentence is input into the polyphonic character prediction model for pronunciation prediction, and the pronunciation of the speech character is determined based on the pronunciation prediction result to obtain the character pronunciation; Remove the tones from the pronunciation of each character to obtain the pinyin information.
4. A speech synthesis system, characterized in that the system includes:
A tone training module, used to acquire speech samples and train a tone prediction model based on the speech samples;
A clustering module, used to obtain the fundamental frequency information of the speech samples and the speech synthesis samples respectively, to obtain the first fundamental frequency information and the second fundamental frequency information, and to cluster the first fundamental frequency information to obtain the first clustering information;
A speech training module, used to set the tone of the speech synthesis sample according to the first clustering information and the second fundamental frequency information, and to train an end-to-end speech synthesis model according to the speech synthesis sample after tone setting;
A speech synthesis module, used to input the pinyin information of the text to be synthesized into the trained tone prediction model to predict the tone, obtain the output pinyin, and input the output pinyin into the trained end-to-end speech synthesis model to synthesize speech and obtain the synthesized speech;
The speech training module is further configured to:
Obtain the fundamental frequency character information of each character in the speech synthesis sample from the second fundamental frequency information to obtain the second fundamental frequency character information, and sample the second fundamental frequency character information to obtain the second sampling information;
Calculate the distance between each item of second sampling information and each cluster center in the first clustering information to obtain the clustering distances;
Set the tone of the speech synthesis sample according to the tone of the tone category corresponding to the minimum clustering distance;
The clustering module is further configured to:
Obtain the fundamental frequency character information of each character in the speech sample from the first fundamental frequency information to obtain the first fundamental frequency character information, and sample the first fundamental frequency character information to obtain the first sampling information;
Cluster the first sampling information according to the preset tone categories to obtain the first clustering information.
5. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the steps of the method as described in any one of claims 1 to 3.
6. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1 to 3.