Artificial intelligence-based speech synthesis method and device, computer equipment and medium
By introducing a difference loss function into the HiFi-GAN vocoder model, the discriminant performance of the discriminator is optimized, solving the problem of poor speech quality and improving the sound quality and accuracy of the speech synthesis method.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PING AN TECH (SHENZHEN) CO LTD
- Filing Date
- 2023-05-26
- Publication Date
- 2026-06-19
AI Technical Summary
In existing speech synthesis methods, the HiFi-GAN vocoder model uses the same loss function in the discriminator, resulting in poor speech quality and failing to achieve the desired effect.
A difference loss subfunction is introduced into the discriminator loss function of the HiFi-GAN vocoder model. By calculating the distance between the real speech signal and the predicted speech signal, the discriminator's discrimination performance is optimized, thereby improving the speech signal detail quality of the generator.
By optimizing the discriminator loss function, the adversarial training process of the HiFi-GAN model was improved, the generator's ability to learn detailed features was enhanced, and the sound quality and accuracy of speech synthesis were improved.
Figure CN116580690B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of digital medical technology, and in particular to a speech synthesis method, device, computer equipment and medium based on artificial intelligence.
Background Technology
[0002] With the development of digital healthcare technology, hospitals are procuring intelligent robots to guide and serve patients, reducing the workload of medical staff. These intelligent robots rely on voice technology, and speech generation is the aspect that best embodies intelligence, providing answers based on the different needs of patients. The demand for speech generation tasks in real-world scenarios is becoming increasingly widespread. Existing technologies offer many algorithms for speech synthesis, such as using GAN (Generative Adversarial Networks) models as vocoders to address the inefficiency of Auto Regression speech synthesis models.
[0003] To address the aforementioned issues, the generator and discriminator of the HiFi-GAN vocoder model can be used for speech processing, resulting in a vocoder that balances efficiency and quality. The HiFi-GAN vocoder model can generate 22.05 kHz speech 167.9 times faster than real time on a GPU and 13.4 times faster than real time on a CPU. However, the discriminator of the HiFi-GAN vocoder model retains the same loss function as general GAN models, so the speech quality obtained with the HiFi-GAN vocoder model does not yet reach the ideal level.
[0004] Therefore, improving the speech quality of speech synthesis methods has become an urgent problem to be solved.
Summary of the Invention
[0005] In view of this, embodiments of the present invention provide a speech synthesis method, apparatus, computer device and medium based on artificial intelligence to solve the problem of poor speech quality in existing speech synthesis methods.
[0006] Firstly, an artificial intelligence-based speech synthesis method is provided, the speech synthesis method comprising:
[0007] Obtain the speech text to be synthesized; input the speech text to be synthesized into a preset acoustic model to obtain the spectral features of the speech text;
[0008] The spectral features of the speech text are input into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis to obtain the speech audio of the speech text; wherein the HiFi-GAN vocoder model is trained using the optimized discriminator loss function, which includes:
[0009] A preset difference loss sub-function, determined by: acquiring the sample speech text, the real speech signal of the sample speech text, and the predicted speech signal of the sample obtained by speech synthesis in the HiFi-GAN vocoder model;
[0010] Calculating the distance between the real speech signal and the preset target speech signal to obtain the real speech distance; calculating the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance;
[0011] Determining the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance.
[0012] Secondly, an artificial intelligence-based speech synthesis device is provided, the speech synthesis device comprising:
[0013] The spectral feature extraction module is used to acquire the speech text to be synthesized; the speech text to be synthesized is input into a preset acoustic model to obtain the spectral features of the speech text;
[0014] A speech synthesis module is used to input the spectral features of the speech text into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis, thereby obtaining the speech audio of the speech text; wherein the HiFi-GAN vocoder model is trained using the optimized discriminator loss function, and the discriminator includes:
[0015] The difference loss sub-function calculation unit is used to acquire the real speech signal of the sample speech text and the predicted speech signal of the sample obtained by speech synthesis in the HiFi-GAN vocoder model; calculate the distance between the real speech signal and the preset target speech signal to obtain the real speech distance; calculate the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance; and determine the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance.
[0016] Thirdly, embodiments of the present invention provide a computer device, the computer device including a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the artificial intelligence-based speech synthesis method as described in the first aspect.
[0017] Fourthly, embodiments of the present invention provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the artificial intelligence-based speech synthesis method as described in the first aspect.
[0018] The advantages of this invention compared to the prior art are:
[0019] The speech synthesis method, apparatus, computer device, and medium of the present invention acquire the speech text to be synthesized, input it into a preset acoustic model to obtain the spectral features of the speech text, and input those spectral features into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis to obtain the speech audio of the speech text. The discriminator loss function of the HiFi-GAN vocoder model is optimized by introducing a difference loss sub-function into it. This improves the discriminator's discrimination performance and training process, that is, its ability to distinguish real speech signals from fake speech signals generated by the generator. As a result, the entire generator-discriminator adversarial training process of the HiFi-GAN model is improved, the discriminator and the generator can learn more detailed features simultaneously, and the detail quality of the speech signal synthesized by the final generator is improved.
Attached Figure Description
[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0021] Figure 1 is a schematic diagram of an application environment for a speech synthesis method provided in Embodiment 1 of the present invention;
[0022] Figure 2 is a flowchart of an artificial intelligence-based speech synthesis method provided in Embodiment 1 of the present invention;
[0023] Figure 3 is a schematic diagram of the structure of an artificial intelligence-based speech synthesis device provided in Embodiment 2 of the present invention;
[0024] Figure 4 is a schematic diagram of the structure of a computer device provided in Embodiment 3 of the present invention.
Detailed Implementation
[0025] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of the invention. However, those skilled in the art will understand that the invention can be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so as not to obscure the description of the invention with unnecessary detail.
[0026] It should be understood that, when used in this specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
[0027] It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.
[0028] As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if [described condition or event] is detected" may be interpreted, depending on the context, as meaning "once determined," "in response to determination," "once [described condition or event] is detected," or "in response to detection of [described condition or event]."
[0029] Furthermore, in the description of this invention and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0030] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of the invention include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.
[0031] The embodiments of this invention can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that utilize digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
[0032] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.
[0033] It should be understood that the sequence number of each step in the following embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0034] To illustrate the technical solution of the present invention, specific embodiments are described below.
[0035] The first embodiment of this invention provides an artificial intelligence-based speech synthesis method, which can be applied to the application environment shown in Figure 1, in which the client communicates with the server. Clients include, but are not limited to, handheld computers, desktop computers, laptops, ultra-mobile personal computers (UMPCs), netbooks, cloud computing devices, and personal digital assistants (PDAs). The server can be implemented using a standalone server or a server cluster consisting of multiple servers.
[0036] Referring to Figure 2, which is a flowchart of an artificial intelligence-based speech synthesis method provided in Embodiment 1 of the present invention, the speech synthesis method can be applied to the client in Figure 1. The computer device corresponding to the client connects to the target database via a preset Application Programming Interface (API); when the target database is driven to perform corresponding tasks, it generates task logs, which can be collected via the API. As shown in Figure 2, the speech synthesis method may include the following steps:
[0037] Step S201: Obtain the speech text to be synthesized.
[0038] In this step, the speech text to be synthesized specifically includes the text characters to be synthesized. All text characters are combined into a single speech text file for storage. For medical intelligent robots, which can collect sounds from the environment to provide corresponding responses, the speech text to be synthesized can be the response text based on the sounds in the environment.
[0039] Step S202: Input the speech text to be synthesized into a preset acoustic model to obtain the spectral features of the speech text.
[0040] In this step, a pre-defined acoustic model, a spectrogram prediction network, is used to extract acoustic features from the speech text to be synthesized, obtaining the Mel spectrum as the spectral features of the speech text. For example, this spectrogram prediction network is obtained by training a deep learning network after preprocessing the text sample data and the corresponding audio sample data.
[0041] For example, the text sample data is first preprocessed into a standard format, and then the standard format text and the corresponding audio sample data are aligned to obtain the training data. The training data is then input into the constructed deep learning network for deep training to obtain the spectrogram prediction network.
[0042] The standard text processing includes: trimming the original text sample data, removing special symbols, converting Arabic numerals into corresponding text words, and converting Chinese text data within the text sample data into Pinyin or phoneme format, thus obtaining standard format text. The audio sample data processing includes: deduplication, noise reduction, and trimming the audio sample data; the effect is to remove meaningless audio, thereby obtaining high-quality, low-noise, and clear audio data.
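The text-side preprocessing described above can be sketched roughly as follows. This is only an illustration for English-like input: the function name `normalize_text`, the digit table, and the set of retained punctuation are all assumptions, and the Pinyin or phoneme conversion for Chinese text is not shown.

```python
import re

# Hypothetical lookup table: spells out each Arabic digit on its own.
# A production front end would normalize whole numbers, dates, units, etc.
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def normalize_text(raw: str) -> str:
    """Trim text, spell out digits, and drop special symbols (illustrative only)."""
    text = raw.strip()
    # Replace every ASCII digit with its word, padded so words do not fuse.
    text = re.sub(r"[0-9]", lambda m: f" {DIGIT_WORDS[m.group()]} ", text)
    # Keep letters, whitespace, and basic punctuation; remove everything else.
    text = re.sub(r"[^A-Za-z\s,.?!']", "", text)
    # Collapse the runs of whitespace left behind by the previous steps.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_text("  Ward 12 is open @ 9!  "))
```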
[0043] The specific process of deep training includes: taking standard format text as samples, inputting audio sample data corresponding one-to-one with the standard format text as labels into the spectral prediction network, extracting the acoustic features of the audio, encoding the text, training the spectral prediction network, obtaining the model's weight parameters, and outputting the corresponding acoustic features, i.e., the Mel spectrum.
[0044] Step S203: Input the spectral features of the speech text into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis to obtain the speech audio of the speech text.
[0045] The HiFi-GAN vocoder model is trained using the optimized discriminator loss function, which includes:
[0046] A preset difference loss sub-function, determined by: acquiring the sample speech text, the real speech signal of the sample speech text, and the predicted speech signal of the sample obtained by speech synthesis in the HiFi-GAN vocoder model;
[0047] Calculating the distance between the real speech signal and the preset target speech signal to obtain the real speech distance; calculating the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance;
[0048] Determining the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance.
[0049] Specifically, by introducing a difference loss sub-function into the optimized discriminator loss function in the HiFi-GAN vocoder model, the discriminator's ability to distinguish between genuine and fake speech signals generated by the generator can be improved, thereby achieving the effect of detail differentiation. Simultaneously, through adversarial training of the HiFi-GAN vocoder model, this difference loss sub-function also enhances the detail realism of the speech signals generated by the generator in the HiFi-GAN vocoder model.
[0050] Optionally, the predicted speech signal of the sample obtained by speech synthesis from the HiFi-GAN vocoder model includes:
[0051] The sample speech text is input into a preset acoustic model to obtain the spectral features of the sample speech text;
[0052] The spectral features of the sample speech text are input into the generator in the HiFi-GAN vocoder model to generate the predicted speech signal of the sample.
[0053] The preset acoustic model is the spectrogram prediction network trained by the deep learning network in step S202. The speech text to be synthesized and the above-mentioned sample speech text are each input into the trained spectrogram prediction network to extract the spectral features of the speech text to be synthesized and of the sample speech text, respectively.
[0054] Optionally, before calculating the distance between the real speech signal and the preset target speech signal, the method further includes:
[0055] A voice signal is randomly selected from a pre-set real voice storage database as the target voice signal;
[0056] Calculating the distance between the real speech signal and the preset target speech signal to obtain the real speech distance, and calculating the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance, includes:
[0057] Calculate the vector difference between the real speech signal and the target speech signal, and take the L1 norm of this vector difference as the distance between the real speech signal and the preset target speech signal;
[0058] Calculate the vector difference between the real speech signal and the predicted speech signal, and take the L1 norm of this vector difference as the distance between the real speech signal and the predicted speech signal.
[0059] The target speech signal is a randomly sampled reference speech signal. Both this speech signal and the real speech signal of the sample speech text are real speech signals. Therefore, the real speech signal of the sample speech text is used as the sample anchor point. The distance between the sample anchor point and the reference speech signal, as well as the distance between the sample anchor point and the predicted speech signal, are calculated. The difference loss calculated based on the distance deviation between these two can be used to measure the discriminator's ability to distinguish between real and fake speech signals generated by the generator. Through adversarial training of the HiFi-GAN vocoder model using this difference loss function, the discriminator can narrow the distance between the sample anchor point and the real speech signal during training, while widening the distance between the sample anchor point and the speech signal generated by the generator, thereby improving the discriminator's ability to distinguish details.
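The two distances around the sample anchor can be illustrated with a small sketch. Plain Python lists stand in for speech signal vectors here, and the sample values, the function name `l1_distance`, and the requirement that both signals have the same length are assumptions made for the example.

```python
def l1_distance(u, v):
    """L1 norm of the element-wise difference between two equal-length signals."""
    if len(u) != len(v):
        raise ValueError("signals must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


# Toy signals (hypothetical values): the anchor is the real speech signal of
# the sample text, the reference is a randomly drawn real speech signal, and
# the predicted signal comes from the generator.
anchor = [0.0, 0.5, -0.5, 1.0]
reference = [0.1, 0.4, -0.4, 0.9]
predicted = [0.5, 0.0, 0.0, 0.5]

real_speech_distance = l1_distance(anchor, reference)
predicted_speech_distance = l1_distance(anchor, predicted)
print(real_speech_distance, predicted_speech_distance)
```

In this toy case the reference lies much closer to the anchor than the generated signal does, which is the ordering the discriminator is trained to reinforce.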
[0060] Optionally, determining the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance includes:
[0061] The deviation is obtained by calculating the difference between the real speech distance and the predicted speech distance;
[0062] The deviation is summed with a preset distance threshold parameter. When the summed value is positive, the summed value is used as the difference loss of the difference loss sub-function.
[0063] The distance threshold parameter that is summed with the deviation can be a preset, suitable constant, which helps the discriminator to better distinguish between real and predicted speech signals. In other embodiments, the distance threshold parameter can also be a variable parameter, serving as one of the parameters in each training iteration of the HiFi-GAN vocoder model. It can be adjusted incrementally within a preset parameter range, or varied randomly within a preset parameter range.
[0064] After the deviation is summed with the preset distance threshold parameter, the summed value is compared with zero. If the summed value is greater than zero, training has not yet achieved good speech differentiation: the sample anchor is not yet close enough to the real speech signal, or not yet far enough from the speech signal generated by the generator, and the summed value is used as the difference loss of the difference loss sub-function. If the summed value is less than or equal to zero, the sample anchor is already sufficiently close to the real speech signal and sufficiently far from the generated speech signal, so zero is used as the difference loss of the difference loss sub-function.
[0065] For example, the difference loss function mentioned above can be the triplet loss function, the formula of which is as follows:
[0066] $L_{triplet}(D;G) = \mathbb{E}_{(x,s)}\left[\max\left(\lVert a - x \rVert_1 - \lVert a - G(s) \rVert_1 + \mathrm{margin},\ 0\right)\right]$
[0067] Where $L_{triplet}(D;G)$ represents the difference loss of the difference loss sub-function, $a$ represents the sample anchor point, $x$ represents the real speech signal extracted from the real speech database, $G(s)$ represents the speech signal generated by the generator, margin is the preset distance threshold parameter, a constant greater than 0, and max takes the larger of its two arguments.
[0068] In the formula, the input of the difference loss sub-function is a triple consisting of the anchor example ($a$), the positive example ($x$), and the negative example ($G(s)$). The positive example is a sample that belongs to the same category as the anchor example, and the negative example is a sample that belongs to a different category. Similarity between samples is learned by optimizing the distance between the anchor example and the positive example to be smaller than the distance between the anchor example and the negative example.
[0069] In the above formula, the purpose of setting the distance threshold parameter is to achieve better training of the HiFi-GAN vocoder model and avoid the distance between the sample anchor point and the real speech signal after training being too close to the distance between the sample anchor point and the speech signal generated by the generator. Therefore, setting a distance threshold parameter can make the distance between the sample anchor point and the real speech signal smaller after the HiFi-GAN vocoder model is trained, while making the distance between the sample anchor point and the speech signal generated by the generator larger.
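The margin behaviour described above can be illustrated for a single triple as follows. This is a sketch under assumptions: the function names, the toy vectors, and the default margin of 1.0 are illustrative, and the expectation over training samples is omitted.

```python
def l1_distance(u, v):
    """L1 norm of the element-wise difference between two equal-length signals."""
    return sum(abs(a - b) for a, b in zip(u, v))


def triplet_difference_loss(anchor, positive, negative, margin=1.0):
    """max(||a - x||_1 - ||a - G(s)||_1 + margin, 0) for one triple."""
    real_distance = l1_distance(anchor, positive)       # anchor vs. real reference
    predicted_distance = l1_distance(anchor, negative)  # anchor vs. generated signal
    return max(real_distance - predicted_distance + margin, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.1]

# Generated signal far from the anchor: the margin is satisfied, loss is zero.
far_loss = triplet_difference_loss(anchor, positive, [1.0, 1.0])
# Generated signal close to the anchor: the margin is violated, loss is positive.
near_loss = triplet_difference_loss(anchor, positive, [0.3, 0.3])
print(far_loss, near_loss)
```

The loss stays positive until the generated signal is at least `margin` farther from the anchor than the real reference is, which is why the threshold keeps the two distances from ending up too close together.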
[0070] Optionally, the optimized discriminator loss function further includes:
[0071] A preset generative adversarial loss sub-function; the difference loss sub-function is weighted by a preset difference parameter to obtain a weighted difference loss sub-function;
[0072] The sum of the generative adversarial loss sub-function and the weighted difference loss sub-function is calculated to obtain the optimized discriminator loss function.
[0073] The expression for the generative adversarial loss subfunction is as follows:
[0074] $L_{adv}(D;G) = \mathbb{E}_{(x,s)}\left[(D(x)-1)^2 + (D(G(s)))^2\right]$
[0075] Where $x$ represents the real speech signal, $s$ represents the Mel spectrum, $L_{adv}(D;G)$ is the generative adversarial loss function of the discriminator, and $L_{adv}(G;D)$ is the generative adversarial loss function of the generator.
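As a rough numerical illustration of this least-squares adversarial loss, the sketch below approximates the expectation by a batch mean. The function name and the score values are hypothetical; each score stands for one scalar discriminator output.

```python
def discriminator_adv_loss(real_scores, fake_scores):
    """Batch-mean estimate of (D(x) - 1)^2 + (D(G(s)))^2."""
    real_term = sum((d - 1.0) ** 2 for d in real_scores) / len(real_scores)
    fake_term = sum(d ** 2 for d in fake_scores) / len(fake_scores)
    return real_term + fake_term


# A discriminator that scores real speech as 1 and generated speech as 0
# incurs no loss; one that outputs 0.5 everywhere is penalized on both terms.
perfect = discriminator_adv_loss([1.0, 1.0], [0.0, 0.0])
undecided = discriminator_adv_loss([0.5, 0.5], [0.5, 0.5])
print(perfect, undecided)
```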
[0076] Optionally, the expression for the optimized discriminator loss function is as follows:
[0077] $L_d = \sum_{k}\left[L_{adv}(D_k;G) + \lambda_{triplet}\, L_{triplet}(D_k;G)\right]$
[0078] Where $L_d$ is the value of the optimized discriminator loss function, $D_k$ represents the k-th sub-discriminator of the multi-period discriminator and multi-scale discriminator in the HiFi-GAN vocoder model, $G$ represents the generator in the HiFi-GAN vocoder model, $L_{adv}$ represents the generative adversarial loss sub-function, $L_{triplet}$ represents the difference loss sub-function, and $\lambda_{triplet}$ is the preset difference parameter, which is a parameter obtained during model training.
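One way to read the combination described in paragraphs [0071] and [0072] is sketched below, assuming the per-sub-discriminator losses are simply summed over all sub-discriminators. The function name, the input layout, and the default weight of 1.0 are assumptions made for illustration.

```python
def optimized_discriminator_loss(sub_losses, lambda_triplet=1.0):
    """Sum adv_k + lambda_triplet * triplet_k over all sub-discriminators.

    sub_losses: list of (adversarial_loss, difference_loss) pairs, one pair
    per sub-discriminator D_k (hypothetical precomputed scalar values).
    """
    return sum(adv + lambda_triplet * diff for adv, diff in sub_losses)


# Two sub-discriminators; the second already satisfies the margin (loss 0).
total = optimized_discriminator_loss([(0.5, 0.6), (0.2, 0.0)], lambda_triplet=0.5)
print(total)
```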
[0079] The aforementioned multi-period discriminator and multi-scale discriminator identify speech from two different perspectives. The multi-scale discriminator is derived from the MelGAN vocoder approach, which continuously averages the speech sequence, halving the length of the speech sequence one by one, and then applies several layers of convolution at different scales of the speech, finally flattening it as the output of the multi-scale discriminator. The multi-period discriminator, on the other hand, folds the one-dimensional audio sequence into a two-dimensional plane with different sequence lengths, and applies two-dimensional convolution on the two-dimensional plane.
[0080] Specifically, the multi-scale discriminator first performs "original-size discrimination" on the original sample points, using spectral normalization as the parameter normalization method for its one-dimensional convolutions. It then applies average pooling to the sample point sequence, halving the sequence length each time, and discriminates each "downsampled" sample point sequence, using weight normalization as the parameter normalization method for the one-dimensional convolutions. Each sub-discriminator at a specific scale first applies several convolutional layers, all using grouped convolution with parameters normalized by the corresponding method and each followed by leaky_relu activation; after these layers, a final convolutional layer with one output channel performs post-processing, and the result is flattened as the output.
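The successive halving of the sample sequence can be sketched as follows. The kernel size of 2 with stride 2 is a simplifying assumption chosen so that each pooling step exactly halves the length; the function names and the choice of three scales are likewise illustrative, and the convolutional layers are not shown.

```python
def avg_pool_halve(samples):
    """Average adjacent pairs (kernel 2, stride 2); a trailing odd sample is dropped."""
    return [(samples[i] + samples[i + 1]) / 2 for i in range(0, len(samples) - 1, 2)]


def multi_scale_inputs(samples, num_scales=3):
    """Return the original sequence plus successively halved versions of it."""
    scales = [list(samples)]
    for _ in range(num_scales - 1):
        scales.append(avg_pool_halve(scales[-1]))
    return scales


scales = multi_scale_inputs([1, 2, 3, 4, 5, 6, 7, 8])
for level, sequence in enumerate(scales):
    print(level, sequence)
```

Each entry of `scales` would be fed to its own sub-discriminator, so coarser scales see a smoothed, shorter view of the same audio.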
[0081] The multi-period discriminator primarily folds a one-dimensional sample point sequence into a two-dimensional plane with a certain period. For example, a one-dimensional sample point sequence [1,2,3,4,5,6] folded into a two-dimensional plane with a period of 3 would be [[1,2,3],[4,5,6]]. Then, a two-dimensional convolution is applied to this two-dimensional plane. Specifically, each sub-discriminator of a specific period is first padded to ensure that the number of sample points is an integer multiple of the period, facilitating the "folding" into a two-dimensional plane. Next, it enters multiple convolutional layers with output channels of [32,128,512,1024]. After convolution, leaky_relu activation is used, and the convolutional layer parameters are normalized using weight normalization. After multiple convolutional layers, a convolutional layer with 1024 input channels and 1 output channel is used for post-processing. Finally, the output is flattened as the final output of the multi-period discriminator. The multi-period discriminator contains multiple sub-discriminators with different periods; in the paper's code, the period numbers are set to [2,3,5,7,11].
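The padding and folding step can be sketched as below. Zero padding at the end of the sequence is an assumption made for simplicity (other padding modes are possible), and the function name is hypothetical.

```python
def fold_by_period(samples, period, pad_value=0.0):
    """Pad a 1-D sequence to a multiple of `period`, then fold it into rows."""
    padded = list(samples)
    remainder = len(padded) % period
    if remainder:
        # Append just enough padding to make the length a multiple of the period.
        padded += [pad_value] * (period - remainder)
    return [padded[i:i + period] for i in range(0, len(padded), period)]


print(fold_by_period([1, 2, 3, 4, 5, 6], period=3))
print(fold_by_period([1, 2, 3, 4, 5, 6, 7], period=3))
```

After folding, samples that are exactly one period apart sit in the same column, so a two-dimensional convolution along the columns can pick up periodic structure in the audio.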
[0082] Optionally, the HiFi-GAN vocoder model is obtained through adversarial training using the optimized discriminator loss function and the generator loss function, wherein the generator loss function includes:
[0083] The generator's audio generation adversarial loss, the spectral feature loss of the speech text, and the feature matching loss; the spectral feature loss is weighted using a preset spectral feature loss weight parameter to obtain a weighted spectral feature loss, and the feature matching loss is weighted using a preset feature matching loss weight parameter to obtain a weighted feature matching loss.
[0084] The generator loss function is obtained by summing the audio generation adversarial loss, the weighted spectral feature loss, and the weighted feature matching loss.
[0085] Introducing the Mel-spectrum loss into the generator loss function can improve the stability of the model in the early stages of training, the training efficiency of the generator, and the fidelity of the synthesized speech. The feature matching sub-loss introduced into the generator loss function measures the difference between features extracted from real and generated samples.
[0086] Specifically, the expression for the audio generation adversarial loss of the generator is as follows:
[0087] $L_{adv}(G;D) = \mathbb{E}_{(x,s)}\left[(D(G(s))-1)^2\right]$
[0088] Where $x$ represents the real speech signal, $s$ represents the Mel spectrum, and $L_{adv}(G;D)$ is the audio generation adversarial loss value of the generator.
[0089] The spectral feature loss is calculated by determining the L1 distance between the spectrum extracted from synthesized speech and the spectrum extracted from real speech. Specifically, the expression for the spectral feature loss of speech text is as follows:
[0090] $L_{Mel}(G) = \mathbb{E}_{(x,s)}\left[\lVert \phi(x) - \phi(G(s)) \rVert_1\right]$
[0091] Where $\phi$ represents the mapping function that converts speech into a Mel spectrum, and $L_{Mel}(G)$ represents the spectral feature loss.
[0092] The feature matching sub-loss is used to calculate the L1 distance between the outputs of each convolutional layer of real and synthetic samples. Specifically, the expression for the feature matching sub-loss is as follows:
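[0093] $L_{FM}(G;D) = \mathbb{E}_{(x,s)}\left[\sum_{i=1}^{T} \frac{1}{N_i} \lVert D^i(x) - D^i(G(s)) \rVert_1\right]$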
[0094] Where $T$ represents the number of layers in the discriminator that extract features, $D^i$ represents the features extracted by the $i$-th layer, $N_i$ represents the number of features extracted by the discriminator network in the $i$-th layer, and $L_{FM}(G;D)$ represents the feature matching sub-loss value.
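A small sketch of the feature matching computation follows, assuming each layer's features are given as a flat list and that per-layer L1 distances are normalized by the number of features in that layer. The function name and the feature values are hypothetical, and the expectation over samples is omitted.

```python
def feature_matching_loss(real_features, fake_features):
    """Sum over layers of the L1 distance between features, divided by N_i."""
    total = 0.0
    for real_layer, fake_layer in zip(real_features, fake_features):
        num_features = len(real_layer)  # N_i for this layer
        layer_l1 = sum(abs(r - f) for r, f in zip(real_layer, fake_layer))
        total += layer_l1 / num_features
    return total


# Two discriminator layers (T = 2) with 2 and 4 features respectively.
real = [[1.0, 2.0], [0.0, 0.0, 0.0, 0.0]]
fake = [[1.0, 4.0], [1.0, 1.0, 1.0, 1.0]]
fm_loss = feature_matching_loss(real, fake)
print(fm_loss)
```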
[0095] Therefore, the generator loss function can be expressed as:
[0096] $L_G = L_{adv}(G;D) + \lambda_{fm} L_{FM}(G;D) + \lambda_{mel} L_{Mel}(G)$
[0097] Where $L_G$ is the generator loss function, $L_{adv}(G;D)$ is the audio generation adversarial loss value of the generator, $L_{FM}(G;D)$ represents the feature matching sub-loss value, $\lambda_{fm}$ is the weight parameter for the feature matching sub-loss, $L_{Mel}(G)$ represents the spectral feature loss value, and $\lambda_{mel}$ is the weight parameter for the spectral feature loss.
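The weighted sum for the generator can be sketched as follows. The default weights of 2 and 45 are borrowed from the original HiFi-GAN paper and are not stated in this document, so they should be read as assumptions; the function names and input values are also illustrative.

```python
def generator_adv_loss(fake_scores):
    """Batch-mean estimate of (D(G(s)) - 1)^2."""
    return sum((d - 1.0) ** 2 for d in fake_scores) / len(fake_scores)


def generator_loss(adv_loss, fm_loss, mel_loss, lambda_fm=2.0, lambda_mel=45.0):
    """Adversarial loss plus weighted feature matching and Mel-spectrum losses."""
    return adv_loss + lambda_fm * fm_loss + lambda_mel * mel_loss


# Hypothetical values: the discriminator scores generated audio at 0.5,
# with precomputed feature matching and Mel-spectrum losses.
adv = generator_adv_loss([0.5, 0.5])
total_generator_loss = generator_loss(adv, fm_loss=1.0, mel_loss=0.1)
print(adv, total_generator_loss)
```

With a large Mel-spectrum weight, the spectral term dominates early in training, which is consistent with the stabilizing role the text attributes to it.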
[0098] This invention optimizes the discriminator loss function in the HiFi-GAN vocoder model by introducing a difference loss sub-function. This improves the discriminator's discrimination performance and training process, enhancing its ability to distinguish between genuine and fake speech signals generated by the generator. This, in turn, improves the overall adversarial training process between the HiFi-GAN model generator and discriminator, enabling both the discriminator and generator to learn more detailed features simultaneously. Ultimately, this improves the detail quality of the speech signal synthesized by the final model generator, thus increasing the accuracy of speech synthesis.
[0099] Corresponding to the method in the above embodiments, Figure 3 shows a structural block diagram of an artificial intelligence-based speech synthesis device according to Embodiment 2 of the present invention. This speech synthesis device is applied to a computer device, which connects to a target database through a preset application programming interface (API). When the target database is driven to run and perform corresponding tasks, corresponding task logs are generated, which can be collected through the API. For ease of explanation, only the parts relevant to the embodiments of the present invention are shown.
[0100] Referring to Figure 3, the speech synthesis device includes:
[0101] The language feature extraction module 31 is used to acquire the speech text to be synthesized and to extract, from the speech text, features that characterize its linguistic content as language information.
[0102] The speech synthesis module 32 is used to input the spectral features of the speech text into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis, thereby obtaining the speech audio of the speech text; wherein, the HiFi-GAN vocoder model is trained by using a discriminator to obtain the optimized discriminator loss function, and the loss calculation unit of the discriminator includes:
[0103] The difference loss sub-function calculation unit 33 is used to calculate the difference loss sub-function. This difference loss sub-function calculation unit includes:
[0104] The sample predicted speech signal extraction subunit 331 is used to acquire the sample speech text and the real speech signal of the sample speech text, as well as the predicted speech signal of the sample obtained by speech synthesis in the HiFi-GAN vocoder model.
[0105] The distance calculation subunit 332 is used to calculate the distance between the real speech signal and the preset target speech signal to obtain the real speech distance, and to calculate the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance.
[0106] The loss calculation subunit 333 is used to determine the difference loss subfunction based on the deviation between the real speech distance and the predicted speech distance.
[0107] Optionally, the above-mentioned sample prediction speech signal extraction subunit includes:
[0108] The data acquisition submodule is used to acquire sample speech text.
[0109] The spectral feature extraction submodule is used to input the sample speech text into a preset acoustic model to obtain the spectral features of the sample speech text;
[0110] The sample predicted speech signal synthesis submodule is used to input the spectral features of the sample speech text into the generator in the HiFi-GAN vocoder model to generate the predicted speech signal of the sample.
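The data flow of these three submodules (text, then spectral features, then predicted speech signal) can be sketched as below. The acoustic model and the generator are passed in as plain callables; the two `toy_*` functions are hypothetical stand-ins used only so that the sketch runs, and do not represent a real acoustic model or the HiFi-GAN generator.

```python
from typing import Callable, List

Spectrum = List[List[float]]  # frames x bins (assumed layout)
Waveform = List[float]


def predict_sample_signal(text: str,
                          acoustic_model: Callable[[str], Spectrum],
                          generator: Callable[[Spectrum], Waveform]) -> Waveform:
    """Sample text -> spectral features -> predicted speech signal."""
    spectrum = acoustic_model(text)   # spectral feature extraction submodule
    return generator(spectrum)        # predicted speech signal synthesis submodule


def toy_acoustic_model(text: str) -> Spectrum:
    # Stand-in: one single-bin "frame" per character.
    return [[float(ord(ch) % 7)] for ch in text]


def toy_generator(spectrum: Spectrum) -> Waveform:
    # Stand-in: one output sample per frame, scaled into [0, 1).
    return [frame[0] / 7.0 for frame in spectrum]


signal = predict_sample_signal("hi", toy_acoustic_model, toy_generator)
```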
[0111] Optionally, the difference loss sub-function calculation unit also includes:
[0112] The target speech signal determination subunit is used to randomly extract a speech signal from a preset real speech storage database as the target speech signal before calculating the distance between the real speech signal and the preset target speech signal;
[0113] The distance calculation subunit includes:
[0114] The first distance calculation submodule is used to calculate the real speech signal vector difference between the real speech signal and the target speech signal, and to take the L1 norm of the real speech signal vector difference to obtain the distance between the real speech signal and the preset target speech signal.
[0115] The second distance calculation submodule is used to calculate the predicted speech signal vector difference between the real speech signal and the predicted speech signal, and to take the L1 norm of the predicted speech signal vector difference to obtain the distance between the real speech signal and the predicted speech signal.
[0116] Optionally, the above-mentioned loss calculation subunit includes:
[0117] The deviation calculation submodule is used to calculate the difference between the real speech distance and the predicted speech distance to obtain the deviation;
[0118] The difference loss calculation submodule is used to subtract the deviation from a preset distance threshold parameter. When the difference is positive, the difference is used as the difference loss of the difference loss subfunction.
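The target selection, the two L1 distances and the thresholded difference loss can be sketched together as follows. This is a minimal illustration under stated assumptions: the signals are equal-length lists of samples, the deviation is the real speech distance minus the predicted speech distance, and the deviation is combined with the threshold in the summed form recited in claim 3 (deviation plus threshold, kept only when positive). The sign convention and the names `l1`, `pick_target` and `difference_loss` are assumptions, not the document's implementation.

```python
import random
from typing import Sequence

Signal = Sequence[float]


def l1(a: Signal, b: Signal) -> float:
    """L1 norm of the vector difference a - b."""
    if len(a) != len(b):
        raise ValueError("signals must have equal length")
    return sum(abs(x - y) for x, y in zip(a, b))


def pick_target(store: Sequence[Signal], rng: random.Random) -> Signal:
    """Randomly draw one real speech signal from the store as the target."""
    return rng.choice(store)


def difference_loss(real: Signal, target: Signal, predicted: Signal,
                    margin: float) -> float:
    """Hinge on (real distance - predicted distance + margin)."""
    real_distance = l1(real, target)          # anchor vs. another real signal
    predicted_distance = l1(real, predicted)  # anchor vs. generator output
    deviation = real_distance - predicted_distance
    return max(deviation + margin, 0.0)


real = [0.0, 1.0, 0.0]                 # anchor: real speech signal
predicted = [0.0, 0.5, 0.5]            # generator output for the same text
store = [[0.1, 0.9, 0.0]]              # hypothetical real speech store
target = pick_target(store, random.Random(0))
loss = difference_loss(real, target, predicted, margin=1.0)
```

In this form the loss reaches zero once the predicted signal is farther from the anchor than the target is by at least the threshold, which matches the usual behaviour of a triplet-style margin loss.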
[0119] Optionally, the discriminator further includes a generative adversarial loss function calculation unit. The discriminator is used to weight the difference loss sub-function using a preset difference parameter to obtain a weighted difference loss sub-function, and to calculate the sum of the generative adversarial loss function and the weighted difference loss sub-function to obtain the optimized discriminator loss function.
[0120] Optionally, the expression for the optimized discriminator loss function is as follows:
[0121]
[0122] where $L_d$ is the value of the optimized discriminator loss function, $D_k$ denotes the $k$-th sub-discriminator of the multi-period discriminator and multi-scale discriminator in the HiFi-GAN vocoder model, $G$ denotes the generator in the HiFi-GAN vocoder model, $L_{adv}$ denotes the generative adversarial loss sub-function, $L_{triplet}$ denotes the difference loss sub-function, and $\lambda_{triplet}$ denotes the preset difference parameter.
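One way the optimized discriminator loss could be assembled is sketched below. It assumes that each sub-discriminator contributes its own adversarial loss and difference loss, and that the weighted terms are summed over all sub-discriminators; the exact expression is not reproduced in the text, so this aggregation is an assumption, and `optimized_discriminator_loss` is a hypothetical name.

```python
from typing import Sequence


def optimized_discriminator_loss(adv_losses: Sequence[float],
                                 triplet_losses: Sequence[float],
                                 lambda_triplet: float) -> float:
    """Sum over sub-discriminators k of L_adv_k + lambda_triplet * L_triplet_k."""
    if len(adv_losses) != len(triplet_losses):
        raise ValueError("need one loss pair per sub-discriminator")
    return sum(adv + lambda_triplet * tri
               for adv, tri in zip(adv_losses, triplet_losses))


# Hypothetical per-sub-discriminator values for two sub-discriminators.
adv_losses = [0.5, 0.25]
triplet_losses = [0.2, 0.0]
l_d = optimized_discriminator_loss(adv_losses, triplet_losses,
                                   lambda_triplet=2.0)
```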
[0123] Optionally, the above HiFi-GAN vocoder model further includes a generator, which is obtained by adversarial training using the optimized discriminator loss function and the generator loss function. The loss calculation unit of the generator includes:
[0124] The generator includes an audio generation adversarial loss calculation unit, a speech text spectral feature loss calculation unit, and a feature matching loss calculation unit. The generator's loss calculation unit is used to weight the spectral feature loss according to preset spectral feature loss weight parameters to obtain a weighted spectral feature loss; to weight the feature matching loss using preset feature matching loss weight parameters to obtain a weighted feature matching loss; and to calculate the sum of the audio generation adversarial loss, the weighted spectral feature loss, and the weighted feature matching loss to obtain the generator's loss function.
[0125] It should be noted that the information interaction and execution process between the above modules are based on the same concept as the method embodiments of the present invention. For details on their specific functions and technical effects, please refer to the method embodiments section, which will not be repeated here.
[0126] Figure 4 is a schematic diagram of the structure of a computer device provided in Embodiment 3 of the present invention. As shown in Figure 4, the computer device of this embodiment includes: at least one processor (only one is shown in Figure 4), a memory, and a computer program stored in the memory and executable on the at least one processor. When the processor executes the computer program, the steps in the above-described speech synthesis method embodiments are implemented.
[0127] This computer device may include, but is not limited to, a processor and memory. Those skilled in the art will understand that Figure 4 is merely an example of a computer device and does not constitute a limitation on computer devices. A computer device may include more or fewer components than shown in the illustration, a combination of certain components, or different components; for example, it may also include network interfaces, displays, and input devices.
[0128] The processor referred to can be a CPU, but it can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.
[0129] Memory includes readable storage media, internal memory, etc., wherein internal memory can be the RAM of a computer device, providing an environment for the operation of the operating system and computer-readable instructions stored in the readable storage media. The readable storage media can be the hard drive of a computer device, or in other embodiments, it can be an external storage device of the computer device, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card. Furthermore, memory can include both internal storage units and external storage devices of a computer device. Memory is used to store the operating system, applications, bootloader, data, and other programs, such as program code for computer programs. Memory can also be used to temporarily store data that has been output or will be output.
[0130] Those skilled in the art will understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the functions described above can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this invention. The specific working process of the units and modules in the above device can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here. If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the present invention can implement all or part of the processes in the methods of the above embodiments by instructing related hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the above method embodiments. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. 
A computer-readable medium can include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.
[0131] The present invention may also implement all or part of the processes in the methods of the above embodiments by means of a computer program product. When the computer program product runs on a computer device, the computer device executes the steps in the above method embodiments.
[0132] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0133] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.
[0134] In the embodiments provided by this invention, it should be understood that the disclosed apparatus / computer devices and methods can be implemented in other ways. For example, the apparatus / computer device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.
[0135] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0136] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be included within the protection scope of the present invention.
Claims
1. An artificial intelligence-based speech synthesis method, characterized in that the speech synthesis method includes: obtaining the speech text to be synthesized; inputting the speech text to be synthesized into a preset acoustic model to obtain the spectral features of the speech text; and inputting the spectral features of the speech text into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis to obtain the speech audio of the speech text; wherein the HiFi-GAN vocoder model is trained using the optimized discriminator loss function, the optimized discriminator loss function includes a preset difference loss sub-function, and the difference loss sub-function is determined by: acquiring sample speech text, the real speech signal of the sample speech text, and the predicted speech signal of the sample obtained by speech synthesis in the HiFi-GAN vocoder model; calculating the distance between the real speech signal and a preset target speech signal to obtain the real speech distance; calculating the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance; and determining the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance; wherein, before calculating the distance between the real speech signal and the preset target speech signal, the method further includes: randomly selecting a speech signal from a preset real speech storage database as the target speech signal; and wherein calculating the distance between the real speech signal and the preset target speech signal to obtain the real speech distance, and calculating the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance, include: using the real speech signal as a sample anchor point, calculating the real speech signal vector difference between the sample anchor point and the target speech signal, and taking the L1 norm of the real speech signal vector difference to obtain the distance between the real speech signal and the preset target speech signal; and calculating the predicted speech signal vector difference between the predicted speech signal and the sample anchor point, and taking the L1 norm of the predicted speech signal vector difference to obtain the distance between the real speech signal and the predicted speech signal.
2. The artificial intelligence-based speech synthesis method of claim 1, wherein obtaining the predicted speech signal of the sample by speech synthesis in the HiFi-GAN vocoder model includes: inputting the sample speech text into a preset acoustic model to obtain the spectral features of the sample speech text; and inputting the spectral features of the sample speech text into the generator in the HiFi-GAN vocoder model to generate the predicted speech signal of the sample.
3. The artificial intelligence-based speech synthesis method of claim 1, wherein determining the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance includes: calculating the difference between the real speech distance and the predicted speech distance to obtain the deviation; and summing the deviation with a preset distance threshold parameter, wherein, when the summed value is positive, the summed value is used as the difference loss of the difference loss sub-function.
4. The artificial intelligence-based speech synthesis method of claim 1, wherein the optimized discriminator loss function further includes a preset generative adversarial loss function; the difference loss sub-function is weighted using a preset difference parameter to obtain a weighted difference loss sub-function; and the sum of the generative adversarial loss function and the weighted difference loss sub-function is calculated to obtain the optimized discriminator loss function.
5. The artificial intelligence-based speech synthesis method of claim 4, wherein the expression for the optimized discriminator loss function is as follows: where $L_d$ is the value of the optimized discriminator loss function, $D_k$ denotes the $k$-th sub-discriminator of the multi-period discriminator and multi-scale discriminator in the HiFi-GAN vocoder model, $G$ denotes the generator in the HiFi-GAN vocoder model, $L_{adv}$ denotes the generative adversarial loss sub-function, $L_{triplet}$ denotes the difference loss sub-function, and $\lambda_{triplet}$ denotes the preset difference parameter.
6. The artificial intelligence-based speech synthesis method of any one of claims 1 to 5, characterized in that the HiFi-GAN vocoder model is obtained through adversarial training using the optimized discriminator loss function and a generator loss function, wherein the generator loss function includes the generator's audio generation adversarial loss, the spectral feature loss of the speech text, and the feature matching loss; the spectral feature loss is weighted using a preset spectral feature loss weight parameter to obtain a weighted spectral feature loss; the feature matching loss is weighted using a preset feature matching loss weight parameter to obtain a weighted feature matching loss; and the generator loss function is obtained by summing the audio generation adversarial loss, the weighted spectral feature loss, and the weighted feature matching loss.
7. A speech synthesis device based on artificial intelligence, characterized in that the speech synthesis device includes: a spectral feature extraction module, used to acquire the speech text to be synthesized and to input the speech text to be synthesized into a preset acoustic model to obtain the spectral features of the speech text; and a speech synthesis module, used to input the spectral features of the speech text into a HiFi-GAN vocoder model with an optimized discriminator loss function for speech synthesis, thereby obtaining the speech audio of the speech text; wherein the HiFi-GAN vocoder model is trained by using a discriminator to obtain the optimized discriminator loss function, and the discriminator includes: a difference loss sub-function calculation unit, used to acquire sample speech text, the real speech signal of the sample speech text, and the predicted speech signal of the sample obtained by speech synthesis in the HiFi-GAN vocoder model; to calculate the distance between the real speech signal and a preset target speech signal to obtain the real speech distance; to calculate the distance between the real speech signal and the predicted speech signal to obtain the predicted speech distance; and to determine the difference loss sub-function based on the deviation between the real speech distance and the predicted speech distance; wherein the difference loss sub-function calculation unit further includes: a target speech signal determination subunit, used to randomly extract a speech signal from a preset real speech storage database as the target speech signal before the distance between the real speech signal and the preset target speech signal is calculated; and a distance calculation subunit, which includes: a first distance calculation submodule, used to take the real speech signal as a sample anchor point, calculate the real speech signal vector difference between the sample anchor point and the target speech signal, and take the L1 norm of the real speech signal vector difference to obtain the distance between the real speech signal and the preset target speech signal; and a second distance calculation submodule, used to calculate the predicted speech signal vector difference between the predicted speech signal and the sample anchor point, and to take the L1 norm of the predicted speech signal vector difference to obtain the distance between the real speech signal and the predicted speech signal.
8. A computer device, characterized in that the computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the speech synthesis method as described in any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the speech synthesis method as described in any one of claims 1 to 6.