Audio generation model training method and apparatus, and electronic device

By encoding sample audio using a neural network model to generate features at multiple time scales, and training the audio generation model, the problem of insufficient information reflection at different time scales in the audio generation model is solved, achieving higher accuracy and quality in audio generation.

WO2026130101A1 · PCT designated stage · Publication Date: 2026-06-25 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date: 2025-12-02
Publication Date: 2026-06-25

AI Technical Summary

Technical Problem

Existing audio generation models struggle to accurately reflect information about audio at different time scales during training, resulting in insufficient similarity between the reconstructed audio and the original audio.

Method used

By encoding sample audio through a neural network model, at least two first sample features are obtained, each corresponding to a different time scale. These features are used to generate reconstructed audio, and the neural network model is trained using the sample audio and the reconstructed audio to improve the accuracy of the audio generation model.

Benefits of technology

It enhances the information representation capability of sample audio at different time scales, improves the accuracy of audio generation models and the ability to generate high-quality audio.

✦ Generated by Eureka AI based on patent content.

Figure CN2025139207_25062026_PF_FP_ABST

Abstract

An audio generation model training method and apparatus, and an electronic device. The method comprises: acquiring sample audio and a neural network model to be trained (301); encoding the sample audio by using the neural network model to obtain at least two first sample features, wherein the at least two first sample features correspond to different time scales, an i-th first sample feature among the at least two first sample features is used for representing feature information of a plurality of audio segments obtained by segmenting the sample audio according to an i-th time scale, and i is a positive integer (302); generating reconstructed audio by means of the neural network model on the basis of the at least two first sample features (303); and training the neural network model by means of the sample audio and the reconstructed audio to obtain an audio generation model, wherein the audio generation model is used for generating second audio on the basis of first audio (304).

Description

Training methods, devices, and electronic equipment for audio generation models

[0001] This application claims priority to Chinese patent application No. 202411870691.0, filed on December 18, 2024, entitled “Training Method, Apparatus and Electronic Device for Audio Generation Model”.

Technical Field

[0002] This application relates to the field of artificial intelligence technology, and in particular to a training method, apparatus, and electronic device for an audio generation model.

Background Technology

[0003] In recent years, artificial intelligence technology has developed rapidly and has been widely applied across industries. Audio is an indispensable modality in these applications and is used in many scenarios. Taking audio transmission as an example, the source encodes audio with an audio generation model to obtain audio features and transmits these features to the destination. The destination then reconstructs the audio from these features using the audio generation model. In this process, it is desirable for the reconstructed audio to be as similar as possible to the original audio. Therefore, the training method of the audio generation model is particularly important.

Summary of the Invention

[0004] This application provides a training method, apparatus, and electronic device for an audio generation model, which can be used to train an audio generation model with higher accuracy, so as to generate high-quality audio based on the audio generation model. The technical solution includes the following contents.

[0005] In one aspect, a training method for an audio generation model is provided, the method comprising: acquiring sample audio and a neural network model to be trained; encoding the sample audio through the neural network model to obtain at least two first sample features, wherein the at least two first sample features correspond to different time scales, and the i-th first sample feature among the at least two first sample features is used to characterize the feature information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, where i is a positive integer; generating reconstructed audio based on the at least two first sample features through the neural network model; and training the neural network model through the sample audio and the reconstructed audio to obtain an audio generation model, wherein the audio generation model is used to generate a second audio based on a first audio.

[0006] In another aspect, a training device for an audio generation model is provided, the device comprising: an acquisition module for acquiring sample audio and a neural network model to be trained; an encoding module for encoding the sample audio through the neural network model to obtain at least two first sample features, wherein the at least two first sample features correspond to different time scales, and the i-th first sample feature among the at least two first sample features is used to characterize the feature information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, where i is a positive integer; a generation module for generating reconstructed audio based on the at least two first sample features through the neural network model; and a training module for training the neural network model through the sample audio and the reconstructed audio to obtain an audio generation model, the audio generation model being used to generate a second audio based on a first audio.

[0007] In another aspect, an electronic device is provided, comprising a processor and a memory, wherein the memory stores at least one computer program, which is loaded and executed by the processor to enable the electronic device to implement any of the above-described training methods for an audio generation model.

[0008] In another aspect, a computer-readable storage medium is also provided, wherein at least one computer program is stored in the computer-readable storage medium, the at least one computer program being loaded and executed by a processor to enable an electronic device to implement any of the above-described training methods for an audio generation model.

[0009] In another aspect, at least one computer program is also provided, the at least one computer program being loaded and executed by a processor to enable an electronic device to implement any of the above-described training methods for an audio generation model.

[0010] In another aspect, a computer program product is also provided, which stores at least one computer program, the at least one computer program being loaded and executed by a processor to enable an electronic device to implement any of the above-described training methods for an audio generation model.

[0011] The technical solution provided in this application brings at least the following beneficial effects.

[0012] The technical solution provided in this application encodes sample audio using a neural network model to obtain at least two first sample features. Since the i-th first sample feature represents information from multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, different first sample features can reflect information from the sample audio at different time scales, making each first sample feature more representative and accurately reflecting the sample audio. Based on this, highly accurate reconstructed audio can be generated from each first sample feature, improving the training effect and accuracy of the audio generation model when training the neural network model based on the reconstructed audio and sample audio. This allows for the generation of high-quality audio based on the audio generation model.

Attached Figure Description

[0013] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0014] Figure 1 is a schematic diagram of a computer system provided in an embodiment of this application.

[0015] Figure 2 is a schematic diagram of an audio generation process provided in an embodiment of this application.

[0016] Figure 3 is a flowchart of a training method for an audio generation model provided in an embodiment of this application.

[0017] Figure 4 is a schematic diagram of the structure of an adapter provided in an embodiment of this application.

[0018] Figure 5 is a schematic diagram of the structure of a feature extraction module provided in an embodiment of this application.

[0019] Figure 6 is a flowchart of a speech synthesis method provided in an embodiment of this application.

[0020] Figure 7 is a schematic diagram of the training of an audio generation model provided in an embodiment of this application.

[0021] Figure 8 is a schematic diagram of a speech synthesis interface provided in an embodiment of this application.

[0022] Figure 9 is a schematic diagram of the structure of a training device for an audio generation model provided in an embodiment of this application.

[0023] Figure 10 is a schematic diagram of the structure of a terminal device provided in an embodiment of this application.

[0024] Figure 11 is a schematic diagram of the structure of a server provided in an embodiment of this application.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0026] It should be noted that the terms "first," "second," etc., used in this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such terms can be used interchangeably where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0027] As shown in Figure 1, which is a schematic diagram of a computer system provided in an embodiment of this application, the computer system includes a terminal device 101 and a server 102. The terminal device 101 has a client (e.g., an audio / video application client) installed and running, and the server 102 provides background services for the client. An individual (user) 103 can use the terminal device 101 to achieve functions such as instant messaging, media viewing, and speech synthesis through interaction between the client and the server 102.

[0028] In one possible implementation, server 102 undertakes the primary computational work, while terminal device 101 undertakes the secondary computational work. Alternatively, server 102 undertakes the secondary computational work, while terminal device 101 undertakes the primary computational work. Or, terminal device 101 and server 102 collaborate on computation using a distributed computing architecture.

[0029] Terminal device 101 can be any electronic device that allows human-computer interaction with a user through one or more methods such as a keyboard, touchpad, remote control, voice interaction, or handwriting device. For example, terminal device 101 can be a smartphone, tablet computer, laptop computer, desktop computer, smart speaker, smartwatch, PC (Personal Computer), mobile phone, PDA (Personal Digital Assistant), wearable device, PPC (Pocket PC), smart car system, smart TV, etc.

[0030] Terminal device 101 can refer to one of multiple terminal devices, and this embodiment only uses terminal device 101 as an example. Those skilled in the art will understand that there can be more or fewer terminal devices 101. For example, there may be only one terminal device 101, or there may be dozens or hundreds, or even more. This embodiment does not limit the number or type of terminal devices 101. Multiple terminal devices 101 can also interact through server 102.

[0031] Server 102 can be a single server, a server cluster consisting of multiple servers, or any of the following: a cloud computing platform or a virtualization center. This embodiment of the application does not limit this. Server 102 communicates directly or indirectly with terminal device 101 via a wired or wireless network. Server 102 has data receiving, data processing, and data sending functions. Of course, server 102 may also have other functions, which are not limited in this embodiment of the application.

[0032] Those skilled in the art should understand that the terminal device 101 and server 102 described above are merely illustrative examples. Other existing or future terminal devices or servers that are applicable to this application should also be included within the scope of protection of this application, and are hereby incorporated by reference.

[0033] In one possible implementation, an audio generation model is trained by at least one of the terminal device 101 or the server 102 to generate new audio. The audio generation model is obtained by training a neural network model. Therefore, the audio generation model and the neural network model are similar in structure and function, but they may differ in their model parameters.

[0034] As shown in Figure 2, the neural network model includes a first encoder 206, a second encoder 207, a quantizer 208, and a decoder 209. The first encoder 206 is a network capable of encoding audio: it receives audio and encodes it into features representing the audio's information. The second encoder 207 is connected in series after the first encoder 206 and is a network capable of encoding audio features at different time scales: it receives the audio features output by the first encoder 206 and further encodes them based on at least two time scales. The quantizer 208 is connected in series after the second encoder 207 and is a network capable of quantizing and fusing features: it receives the features corresponding to each time scale output by the second encoder 207, quantizes them, and fuses the results. The decoder 209 is connected in series after the quantizer 208 and is a network capable of decoding features into audio: it receives the fused features output by the quantizer 208 and decodes them to obtain audio. As shown in Figure 2, this process may include the following steps.

[0035] Step 1: Encode the sample audio to obtain reference audio features. Specifically, the sample audio 201 can be input into the first encoder 206, which encodes the sample audio to obtain reference audio features 202. These reference audio features 202 characterize the sample audio. The reference audio features 202 can be denoted $Z_e$, with $Z_e \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ represents the real numbers, N represents the number of audio frames included in the sample audio, and d represents the dimension of the features of each audio frame.

[0036] Step 2: Encode the reference audio features to obtain multi-timescale features. Specifically, the reference audio features 202 can be input into the second encoder 207, and the second encoder 207 encodes the reference audio features to obtain multi-timescale features 203. The multi-timescale features 203 include first sample features at multiple timescales. For ease of description, different labels are used to represent the first sample features at different timescales. For example, labels 203-1, 203-2, and 203-3 are used to represent the first sample features at three timescales, respectively.

[0037] For example, $Z_1$ denotes the first sample feature 203-1, with $Z_1 \in \mathbb{R}^{N/2 \times d}$; $Z_2$ denotes the first sample feature 203-2, with $Z_2 \in \mathbb{R}^{N/4 \times d}$; and $Z_3$ denotes the first sample feature 203-3, with $Z_3 \in \mathbb{R}^{N/8 \times d}$. It can be seen that the first sample feature 203-1 includes features from N/2 audio frames, the first sample feature 203-2 includes features from N/4 audio frames, and the first sample feature 203-3 includes features from N/8 audio frames. Since the number of audio frames gradually decreases, the time scale gradually increases.

[0038] Furthermore, one mutual information value can be calculated based on the first sample feature 203-1 and the first sample feature 203-2, another based on the first sample feature 203-1 and the first sample feature 203-3, and a third based on the first sample feature 203-3 and the first sample feature 203-2. These three mutual information values are used to train the first encoder 206, the second encoder 207, the quantizer 208, and the decoder 209. Mutual information (MI) quantifies the degree of interdependence between two variables.
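As a point of reference, the mutual information of two discrete variables follows the standard definition $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$. The sketch below estimates this quantity from paired observations. It only illustrates what MI measures: the application does not state how mutual information is estimated for the continuous features 203-1 to 203-3, and the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Estimate I(X;Y) in nats from paired discrete observations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x,y) / (p(x) p(y)) expressed with raw counts
        mi += p_xy * math.log(c * n / (px[x] * py[y]))
    return mi


dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])    # fully dependent
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])  # independent
```

Fully dependent variables give the entropy of either variable (here log 2), while independent variables give zero.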

[0039] Step 3: Quantize and fuse the features at each time scale to obtain the reconstructed audio features. That is, quantizer 208 can quantize the first sample feature 203-1 to obtain one first quantized feature; quantizer 208 can quantize the first sample feature 203-2 to obtain another first quantized feature; quantizer 208 can quantize the first sample feature 203-3 to obtain yet another first quantized feature. By fusing these three first quantized features, the reconstructed audio features 204 are obtained. For example, the reconstructed audio features 204 are denoted $Z_q$, with $Z_q \in \mathbb{R}^{N/2 \times d}$.
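One way to picture Step 3 is the sketch below. It rests on assumptions that this passage does not confirm: quantization is modelled as a nearest-neighbour lookup in a small shared codebook, and fusion is modelled as repeating the coarser features up to N/2 frames and summing element-wise. The helper names (`quantize`, `upsample`, `fuse_scales`) and all numeric values are hypothetical.

```python
def quantize(frames, codebook):
    """Replace each frame by its nearest codebook vector (squared Euclidean distance)."""
    def nearest(frame):
        return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(frame, c)))
    return [list(nearest(f)) for f in frames]


def upsample(frames, target_len):
    """Repeat each frame so the sequence has target_len frames."""
    repeat = target_len // len(frames)
    return [list(f) for f in frames for _ in range(repeat)]


def fuse_scales(quantized_scales):
    """Align every scale to the finest one and sum element-wise (assumed fusion rule)."""
    target_len = len(quantized_scales[0])
    aligned = [upsample(q, target_len) for q in quantized_scales]
    return [[sum(vals) for vals in zip(*frames_at_t)] for frames_at_t in zip(*aligned)]


codebook = [[0.0, 0.0], [1.0, 1.0]]                        # toy codebook, d = 2
z1 = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [1.1, 0.9]]      # N/2 = 4 frames
z2 = [[0.9, 1.2], [0.1, -0.1]]                             # N/4 = 2 frames
z3 = [[0.6, 0.7]]                                          # N/8 = 1 frame

z_q = fuse_scales([quantize(z, codebook) for z in (z1, z2, z3)])
```

The fused result has the same number of frames as the finest scale, matching the N/2 × d shape stated for the reconstructed audio features.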

[0040] Step 4: Decode the reconstructed audio features to obtain the reconstructed audio. Specifically, the reconstructed audio features 204 are input into the decoder 209, and the decoder 209 decodes the reconstructed audio features 204 to obtain the reconstructed audio 205. In addition, L2 loss can be calculated based on the reconstructed audio 205 and the sample audio 201 to train the first encoder 206, the second encoder 207, the quantizer 208, and the decoder 209 using the L2 loss.
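A minimal sketch of the L2 loss in Step 4, assuming it is taken as the mean of the squared sample-wise differences between the two waveforms (whether the squared differences are averaged or summed is not stated here). The function name `l2_loss` is hypothetical.

```python
def l2_loss(sample_audio, reconstructed_audio):
    """Mean squared difference between two equal-length waveforms."""
    if len(sample_audio) != len(reconstructed_audio) or not sample_audio:
        raise ValueError("waveforms must be non-empty and of equal length")
    squared = ((s - r) ** 2 for s, r in zip(sample_audio, reconstructed_audio))
    return sum(squared) / len(sample_audio)


perfect = l2_loss([0.0, 1.0, -1.0], [0.0, 1.0, -1.0])  # identical waveforms
off_by_one = l2_loss([0.0, 0.0], [1.0, -1.0])          # every sample off by 1
```

The loss is zero only when the reconstructed audio matches the sample audio exactly, which is why minimizing it pushes the reconstruction toward the original.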

[0041] Steps 1 to 4 above can be executed multiple times to train the first encoder 206, the second encoder 207, the quantizer 208, and the decoder 209 multiple times, ultimately obtaining the audio generation model. The audio generation model can generate the second audio from the first audio based on the implementation principles of steps 1 to 4 above; the generation process will not be elaborated here.

[0042] This application provides a training method for an audio generation model, which can be applied to the aforementioned computer system and executed by at least one of terminal device 101 or server 102. For ease of description, terminal device 101 and server 102 are collectively referred to as electronic devices; that is, the method of this application embodiment is executed by an electronic device. As shown in Figure 3, the method includes the following steps.

[0043] Step 301: Obtain sample audio and the neural network model to be trained.

[0044] The sample audio can be any audio file; for example, it could be a person's voice. Electronic devices can acquire the sample audio using audio acquisition devices, or they can obtain it via the internet. The method of acquisition is not limited here.

[0045] Furthermore, the electronic device can also acquire a neural network model to be trained. For example, the neural network model is a pre-trained model that has undergone preliminary training and has audio generation capabilities. This application does not limit the structure, size, or network layers included in the neural network model.

[0046] Step 302: Encode the sample audio through a neural network model to obtain at least two first sample features; the at least two first sample features correspond to different time scales, and the i-th first sample feature among the at least two first sample features is used to characterize the feature information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, where i is a positive integer.

[0047] The time scale is a metric used to segment audio (e.g., sample audio). A larger time scale results in longer audio segments. For audio of fixed duration, longer segments mean fewer segments. For example, if there are three time scales, where the first time scale is twice the second and the second is twice the third, then the audio segments corresponding to the first time scale are twice as long as those corresponding to the second time scale, and so on. For audio of fixed duration, the number of audio segments corresponding to the first time scale is 0.5 times the number corresponding to the second time scale, and so on.
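The relationship between time scale and segment count can be checked with a small sketch. The segment lengths (8, 4, and 2 samples) and the 16-sample audio are illustrative values chosen so that each time scale is twice the next. The function name `segment` is hypothetical.

```python
def segment(samples, scale):
    """Split samples into consecutive segments of `scale` samples each."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return [samples[k:k + scale] for k in range(0, len(samples), scale)]


audio = list(range(16))  # fixed-duration audio: 16 samples
# Three time scales, each twice the next: 8, 4, and 2 samples per segment.
counts = {scale: len(segment(audio, scale)) for scale in (8, 4, 2)}
```

Here the first time scale yields 2 segments, the second 4, and the third 8, so each count is 0.5 times the next.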

[0048] In step 302, the sample audio can be input into a neural network model, which determines N first sample features, where N is a positive integer. These N first sample features correspond to N time scales. The i-th first sample feature (where i is a positive integer less than or equal to N) represents the information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale. As i increases, the time scale either becomes larger or smaller.

[0049] The information in an audio segment includes its semantics, emotion, timbre, and other characteristics. Understandably, different neural network models determine the first sample features in different ways.

[0050] In an exemplary embodiment, the neural network model includes a first encoder and a second encoder. Step 302 includes steps 3021 to 3022 (not shown in the figure).

[0051] Step 3021: Encode the sample audio using the first encoder to obtain reference audio features.

[0052] The first encoder is any type of encoder capable of encoding audio. For example, the first encoder can be an autoencoder or a variational autoencoder. The sample audio can be denoted $x \in \mathbb{R}^{T}$, where $\mathbb{R}$ represents the real numbers and T represents the duration of the sample audio. The sample audio x is input into the first encoder, which encodes it to obtain the reference audio features $Z_e \in \mathbb{R}^{N \times d}$ representing the sample audio. Here, N represents the number of audio frames included in the sample audio, and d represents the dimension of the features of each audio frame. For example, if the sample audio is a continuous analog signal, it can first be sampled through analog-to-digital conversion to obtain discrete signal amplitudes. The sampled signal amplitudes can then be quantized and converted into real numbers, forming a real-valued vector x. T can refer to the number of sampling points used to sample the audio.
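The sampling and quantization just described can be sketched as follows. The sine tone, the 16 kHz sampling rate, and the 16-bit depth are illustrative assumptions, not values from the application. The function name `sample_and_quantize` is hypothetical.

```python
import math


def sample_and_quantize(duration_s, sample_rate, freq_hz=440.0, bits=16):
    """Sample a sine tone and quantize each amplitude, returning a real-valued vector x."""
    num_points = round(duration_s * sample_rate)  # T: number of sampling points
    levels = 2 ** (bits - 1) - 1                  # largest signed integer level
    x = []
    for n in range(num_points):
        amplitude = math.sin(2 * math.pi * freq_hz * n / sample_rate)
        # Quantize to an integer level, then map back to a real number in [-1, 1].
        x.append(round(amplitude * levels) / levels)
    return x


x = sample_and_quantize(duration_s=0.01, sample_rate=16000)
T = len(x)
```

With these values, 10 ms of audio at 16 kHz gives T = 160 sampling points.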

[0053] For example, step 3021 includes: downsampling the fourth input information using the first encoder to obtain the j-th downsampling result; where j is a positive integer. When j is 1, the fourth input information includes sample audio; when j is greater than 1, the fourth input information includes the (j-1)-th downsampling result; and the last downsampling result is used to determine the reference audio features.

[0054] The first encoder includes at least one first encoding module. If there are at least two first encoding modules, the first encoding modules are connected in series, and the structures of any two first encoding modules can be the same or different. This application does not limit the structure of the first encoding module; any network with downsampling functionality can serve as the first encoding module, and the function of the first encoding module is not limited to downsampling.

[0055] For example, the first encoder includes four first encoding modules connected in series. Each first encoding module includes three residual units and one downsampling layer. The residual unit includes a convolutional layer, a normalization layer, and an activation layer. The convolutional layer performs convolution processing on the input of the residual unit, and the normalization layer normalizes the result of the convolutional layer. The residual unit is implemented in a skip-layer connection manner, allowing the input of the residual unit to be directly added to the output of the normalization layer, and then activated by the activation layer to obtain the output of the residual unit. The residual unit can reduce information loss, improve the feature representation ability, and thus improve model performance. The downsampling layer is used to downsample the input.
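The data flow of the residual unit (convolution, normalization, skip connection, activation) can be sketched for a single-channel sequence. The kernel weights, the per-sequence normalization, and the choice of ReLU as the activation are assumptions, since the passage does not specify them. The function names are hypothetical.

```python
import math


def conv1d_same(x, kernel):
    """1-D convolution with zero padding so the output length equals the input length."""
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(x))]


def normalize(x, eps=1e-5):
    """Shift to zero mean and scale to unit variance over the sequence."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_unit(x, kernel):
    """Convolve, normalize, add the unit's input (skip connection), then activate."""
    norm_out = normalize(conv1d_same(x, kernel))
    return [max(0.0, a + b) for a, b in zip(x, norm_out)]  # ReLU assumed


y = residual_unit([1.0, 2.0, 3.0, 4.0], [0.25, 0.5, 0.25])
passthrough = residual_unit([1.0, -2.0, 3.0], [0.0, 0.0, 0.0])
```

With an all-zero kernel the convolution branch contributes nothing, so the output is just the activated input. This shows how the skip connection preserves input information.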

[0056] In this example, the first encoding module downsamples the sample audio to obtain the first downsampled result. The second encoding module downsamples the first downsampled result to obtain the second downsampled result. The third encoding module downsamples the second downsampled result to obtain the third downsampled result. This process continues until the last downsampled result is obtained, which is then used to determine the reference audio features. By continuously downsampling the sample audio, information from the sample audio is continuously extracted, allowing the reference audio features to accurately characterize the sample audio, facilitating the subsequent generation of reconstructed audio based on the sample audio.

[0057] In practical applications, network layers such as convolutional layers and normalization layers can be connected in series before or after any first encoding module, without any restrictions.

[0058] For example, the first encoder includes four first encoding modules. A first one-dimensional convolutional layer is connected in series before the first first encoding module, and a second one-dimensional convolutional layer is connected in series after the last first encoding module. The convolutional layers are used to perform convolution processing. Based on this, the electronic device inputs sample audio into the first encoder. First, the sample audio is convolved by the first one-dimensional convolutional layer to obtain the convolution processing result. The first one-dimensional convolutional layer captures multiple local features (such as short-term features and spectral texture) in the sample audio, preserving the core structure of the sample audio and reducing signal distortion in subsequent downsampling. Next, the convolution processing result is progressively downsampled by the four first encoding modules until the last downsampled result is obtained. Finally, the last downsampled result is convolved by the second one-dimensional convolutional layer to obtain the reference audio features. The second one-dimensional convolutional layer extracts more global temporal dependencies from the last downsampled result. The downsampling factor (i.e., the stride of the downsampling) of each first encoding module is not limited here. For example, if the downsampling factors of the four first encoding modules are 2, 4, 5, and 8 respectively, then the total downsampling factor satisfies D = 2 × 4 × 5 × 8 = 320.
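The arithmetic for the total downsampling factor, and its effect on sequence length, can be checked with the sketch below. Block averaging stands in for each module's strided downsampling layer, and the 6400-sample waveform is a hypothetical input. Both are assumptions for illustration only.

```python
import math


def downsample(x, factor):
    """Shorten x by `factor` by averaging each block of `factor` values."""
    usable = len(x) - len(x) % factor  # drop a trailing partial block, if any
    return [sum(x[k:k + factor]) / factor for k in range(0, usable, factor)]


factors = [2, 4, 5, 8]             # downsampling factors of the four modules
total_factor = math.prod(factors)  # D = 2 * 4 * 5 * 8

signal = [0.0] * 6400              # hypothetical waveform with T = 6400 samples
lengths = [len(signal)]
for f in factors:
    # The j-th module consumes the (j-1)-th downsampling result.
    signal = downsample(signal, f)
    lengths.append(len(signal))
```

The lengths run 6400, 3200, 800, 160, 20, so the final length equals T / D.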

[0059] Step 3022: Perform convolution processing on the third input information using the second encoder to obtain the i-th convolution result. Based on the i-th convolution result, determine the i-th first sample feature. Wherein, when i is 1, the third input information includes the reference audio feature; when i is greater than 1, the third input information includes the (i-1)-th convolution result.

[0060] The second encoder is a hierarchical multi-scale encoder. The hierarchical multi-scale encoder is used to extract multi-scale features from audio features to accurately capture the differences in information density of audio features at different time scales, improving the representational power of each first sample feature. The second encoder includes at least two second encoding modules, and the structures of any two second encoding modules can be the same or different. In practical applications, any network with convolutional functionality can be used as a second encoding module. That is, the second encoding module includes convolutional layers for performing convolution processing, and the output of the convolutional layer can be regarded as the i-th convolution result. However, the function of the second encoding module is not limited to convolution. For example, the second encoding module includes three residual units in series and a downsampling layer, each residual unit including a convolutional layer. The residual unit performs convolution processing and normalization processing on its input, adds the input to the normalized result, and applies the activation to the sum to obtain the output of the residual unit. The downsampling layer is used to downsample its input (i.e., the output of the last residual unit), and the output of the downsampling layer can be regarded as the i-th convolution result.

[0061] In this example, the first second encoding module performs convolution processing on the reference audio features to obtain the first convolution result. For example, the first convolution result is used as the first first sample feature, or the first convolution result is further processed to obtain the first first sample feature. The second second encoding module performs convolution processing on the first convolution result to obtain the second convolution result. For example, the second convolution result is used as the second first sample feature, or the second convolution result is further processed to obtain the second first sample feature. The third second encoding module performs convolution processing on the second convolution result to obtain the third convolution result. For example, the third convolution result is used as the third first sample feature, or the third convolution result is further processed to obtain the third first sample feature. This process continues until the last first sample feature is obtained.

[0062] As described above, the embodiments of this application progressively perform convolution processing on the reference audio features. Through convolution processing, the time scale is continuously expanded. For example, the second encoder includes three second encoding modules, whose convolution scales are 2, 2, and 2 respectively. Each convolution process doubles the time scale. Therefore, the time scale corresponding to the second first sample feature is twice that of the first first sample feature, and the time scale corresponding to the third first sample feature is twice that of the second first sample feature.

[0063] For example, the first first sample feature is $Z_1 \in \mathbb{R}^{N/2 \times d}$, the second first sample feature is $Z_2 \in \mathbb{R}^{N/4 \times d}$, and the third first sample feature is $Z_3 \in \mathbb{R}^{N/8 \times d}$, where $\mathbb{R}$ represents the real numbers, N/2, N/4, and N/8 represent the number of audio frames corresponding to the three first sample features, and d represents the feature dimension of each audio frame.

[0064] Since the number of audio frames corresponding to the three first sample features is halved sequentially, the time scales corresponding to the three first sample features are doubled sequentially. The first first sample feature Z1 represents the information of the sample audio captured at the finest time scale, the second first sample feature Z2 represents the information of the sample audio captured at the intermediate time scale, and the third first sample feature Z3 represents the information of the sample audio captured at the coarsest time scale.
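
The relationship between frame count and time scale described above can be checked with a small sketch. The function name `multiscale_shapes` and the example values (400 reference frames, feature dimension 64) are illustrative assumptions and do not come from the application; the time scale is expressed relative to one frame of the reference audio features.

```python
def multiscale_shapes(num_frames, feature_dim, conv_scales):
    """Return (frames, dim, relative_time_scale) for each first sample feature.

    num_frames is N, the number of frames in the reference audio features.
    Each convolution scale divides the frame count and multiplies the
    time span covered by one frame.
    """
    shapes = []
    frames, time_scale = num_frames, 1
    for scale in conv_scales:
        frames //= scale          # fewer frames after each convolution
        time_scale *= scale       # each frame now covers a longer span
        shapes.append((frames, feature_dim, time_scale))
    return shapes


# Three second encoding modules with convolution scales 2, 2, 2.
print(multiscale_shapes(400, 64, [2, 2, 2]))
```

With N = 400 this gives N/2, N/4, and N/8 frames for Z1, Z2, and Z3, with Z1 at the finest and Z3 at the coarsest time scale.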

[0065] By using a second encoder to determine each first sample feature based on reference audio features representing the sample audio, information from the sample audio at different time scales is extracted. This enhances the representational power of each first sample feature, enabling it to accurately reflect the sample audio. Furthermore, features at different time scales increase feature diversity, which improves the generalization ability of the audio generation model. This allows the audio generation model to achieve high performance across various bitrate scenarios and generate high-quality audio.

[0066] In practical applications, network layers such as normalization layers and feedforward layers can be connected in series before or after any second coding module, without any restrictions.

[0067] In an exemplary embodiment, determining the i-th first sample feature based on the i-th convolution result includes: performing feature extraction on the i-th convolution result to obtain local features and global features, wherein the local features represent local information of multiple audio segments and the global features represent global information of multiple audio segments; and fusing the local features and global features to obtain the i-th first sample feature.

[0068] In this example, the i-th second encoding module also includes an adapter, which can be concatenated after the convolutional layer. The adapter includes a feature extraction module, which performs feature extraction on the i-th convolution result to obtain local and global features. This embodiment does not limit the structure of the adapter; in practical applications, the adapter can be flexibly built according to the application scenario.

[0069] For example, an adapter may include at least one Conformer layer; for instance, an adapter may include two Conformer layers connected in series. For ease of description, the following uses an adapter with one Conformer layer as an example to illustrate the structure and function of the adapter.

[0070] As shown in Figure 4, the adapter includes a backbone network, a projection layer, and at least one feature extraction module. The feature extraction module includes a Transformer block (Trans block) and a convolutional block, which are connected by bridging units. The Trans block is used to extract global features, and the convolutional block is used to extract local features. For ease of description, the k-th feature extraction module is identified by the label 40k (k is a positive integer). For example, the first feature extraction module is identified by label 401, and feature extraction module 401 includes a Trans block 401-1 and a convolutional block 401-2 connected by bridging units; the second feature extraction module is identified by label 402, and feature extraction module 402 includes a Trans block 402-1 and a convolutional block 402-2 connected by bridging units; and so on.

[0071] The backbone network is used to initially process the i-th convolution result. Different backbone network structures result in different processing methods. For example, the backbone network includes a 7×7 convolutional layer with a stride of 2, which performs convolution processing on the i-th convolution result. Following this convolutional layer is a 3×3 max-pooling layer, which performs max-pooling on the convolution result of the convolutional layer. The result of the max-pooling layer is used to determine the processing result of the backbone network. Next, the processing result of the backbone network is used as input to convolutional block 401-2. Simultaneously, a projection layer performs projection mapping on the processing result of the backbone network, inputting the projection mapping result into Trans block 401-1. Then, the output of Trans block 401-1 is used as input to Trans block 402-1, and the output of convolutional block 401-2 is used as input to convolutional block 402-2, and so on.

[0072] The structure of any feature extraction module is shown in Figure 5. The feature extraction module includes a convolutional block 501, a Trans block 502, and a bridging unit 503, the functions of which are described below.

[0073] Feature 504 can be input into convolutional block 501. Feature 504 includes the processing result of the backbone network or the output of the previous convolutional block. Convolutional block 501 includes 1×1 Conv-BN (i.e., a 1×1 scale convolutional layer and a batch normalization layer), 3×3 Conv-BN (i.e., a 3×3 scale convolutional layer and a batch normalization layer), and 1×1 Conv-BN. The convolutional layers are used to perform convolution processing, and the batch normalization layers are used to perform batch normalization processing. After feature 504 undergoes multiple convolutional and batch normalization processes, the processing result is added to feature 504 to obtain feature 505.

[0074] The bridging unit 503 includes a downsampling unit 503-1, and the output features of the first 3×3 Conv-BN in the convolutional block 501 can be input to the downsampling unit 503-1. The downsampling unit 503-1 includes a 1×1 Conv, a downsampling layer, and an LN (layer normalization layer). The downsampling layer performs average pooling and reshaping operations on the features, while the LN performs layer normalization. Feature 507 and the output features of the downsampling unit 503-1 are added together and input to the Trans block 502. Feature 507 includes either the projection mapping result or the output features of the previous Trans block. The Trans block 502 includes an LN and a multi-head self-attention layer, which performs multi-head self-attention processing. The input features of the Trans block 502 undergo layer normalization and multi-head self-attention processing, and the result is added to the input features to obtain a summed result. The Trans block 502 also includes an LN and an MLP (i.e., a multilayer perceptron), with the MLP used to perform mapping operations. The summed result undergoes layer normalization and the mapping operation, and the output is added to the summed result to obtain feature 508. Feature 508 can be used as a global feature in this embodiment of the application, or it can be input to the next Trans block.

[0075] The bridging unit 503 further includes an upsampling unit 503-2, and feature 508 can be input into the upsampling unit 503-2. The upsampling unit 503-2 includes an upsampling layer, a 1×1 Conv, and a batch normalization (BN). The upsampling layer is used to perform reshaping and interpolation operations. After reshaping, interpolation, convolution, and batch normalization, feature 508 is processed to obtain the output feature of the upsampling unit 503-2. The convolutional block 501 also includes a 1×1 Conv-BN. After convolution and batch normalization, feature 505 is added to the output feature of the upsampling unit 503-2 to obtain the summed result. The convolutional block 501 also includes a 3×3 Conv-BN and a 1×1 Conv-BN. After multiple convolution and batch normalization processes, the summed result is added to feature 505 to obtain feature 506. Feature 506 can be used as a local feature in this embodiment or input to the next convolutional block.

[0076] In summary, the adapter includes a backbone network, a projection layer, and at least one feature extraction module. The feature extraction module includes Trans blocks and convolutional blocks. The backbone network processes the i-th convolution result, and the projection layer performs projection mapping on the backbone network's processing result. For example, when the dimension of the processing result obtained by the backbone network on the i-th convolution result does not match the input dimension of the feature extraction module, a projection layer can be used to map the processing result to the required input dimension of the feature extraction module through a linear transformation. The k-th Trans block performs global feature extraction on the fifth input information. When k is 1, the fifth input information includes the projection mapping result of the projection layer; when k is a positive integer greater than 1, the fifth input information includes the output features of the (k-1)-th Trans block; and the output features of the last Trans block are global features. The k-th convolutional block performs local feature extraction on the sixth input information. When k is 1, the sixth input information includes the processing result of the backbone network; when k is a positive integer greater than 1, the sixth input information includes the output features of the (k-1)-th convolutional block; and the output features of the last convolutional block are local features.

[0077] Local features represent local information from multiple audio segments. Local information for each audio segment includes at least one of the following: fundamental frequency, timbre, pitch, and energy. These features characterize the basic properties and variation patterns of the sound. Global features represent global information from multiple audio segments. Global information for each audio segment includes at least one of the following: energy, zero-crossing rate, mean, and variance. These features characterize the overall properties and statistical characteristics of the sound. The adapter also includes a fusion network, which fuses the local and global features to obtain the i-th first sample feature. The structure of the fusion network is not limited here; for example, it may include a concatenation layer or an MLP (multilayer perceptron).

[0078] By fusing the local and global features of the i-th convolution result, the local and global information of the audio segment is integrated, ensuring the integrity of the audio segment information. This allows the i-th first sample feature to accurately represent multiple audio segments, demonstrating strong representation ability and improving the training effect of the model.

[0079] Step 303: Generate reconstructed audio based on at least two first sample features using a neural network model.

[0080] Since at least two first sample features can accurately characterize the sample audio, the neural network model can decode each first sample feature to reconstruct the sample audio, thus obtaining the reconstructed audio. It is understood that different neural network models have different decoding methods; one possible implementation is shown below. In this example, the neural network model includes a quantizer and a decoder. Step 303 includes steps 3031 to 3033 (not shown in the figure).

[0081] Step 3031: Quantize each of the first sample features in at least two first sample features using a quantizer to obtain each first quantized feature.

[0082] This application does not limit the structure of the quantizer; any network with quantization capabilities can be used as a quantizer. Each first sample feature is quantized using the quantizer to obtain the corresponding first quantized feature. For example, the first sample feature Z1 is quantized to obtain the first first quantized feature, the first sample feature Z2 is quantized to obtain the second first quantized feature, and the first sample feature Z3 is quantized to obtain the third first quantized feature. Quantizing the first sample features compresses the amount of data, which helps to improve the speed of subsequent decoding and increase the efficiency of audio generation.

[0083] In an exemplary embodiment, step 3031 includes: quantizing the first input information using a quantizer to obtain the i-th first quantized feature. Wherein, when i is 1, the first input information includes the first first sample feature among at least two first sample features; when i is greater than 1, the first input information includes the i-th first sample feature and the (i-1)-th residual feature, where the (i-1)-th residual feature characterizes the difference between the (i-1)-th first quantized feature and the (i-1)-th first sample feature.

[0084] In this example, the quantizer includes a first quantization module, which quantizes the first first sample feature to obtain the first first quantized feature. The quantizer also includes a second quantization module and a first downsampling module. The first downsampling module calculates the difference between the first first quantized feature and the first first sample feature. This downsampling module includes an average pooling layer, which performs an average pooling operation on the difference to obtain the first residual feature. The first residual feature and the second first sample feature are added together and then input to the second quantization module for quantization to obtain the second first quantized feature. The quantizer may also include a third quantization module and a second downsampling module. The second downsampling module calculates the difference between the second first quantized feature and the second first sample feature. This downsampling module includes an average pooling layer, which performs an average pooling operation on the difference to obtain the second residual feature. The second residual feature and the third first sample feature are added together and then input into the third quantization module, which performs quantization to obtain the third first quantized feature. This process continues until the last first quantized feature is obtained. Each downsampling module calculates the feature difference between a first quantized feature and the corresponding first sample feature, and aligns the dimension of this feature difference with the dimension of the next first sample feature for subsequent quantization.
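
The cascade of quantization and downsampling modules described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the application: the quantizer is a nearest-neighbour codebook lookup shared by all scales, the residual is taken as sample feature minus quantized feature, and average pooling uses a window of 2 because adjacent scales differ by a factor of 2. All function names are hypothetical.

```python
def nearest_code(vec, codebook):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, c)))


def quantize(frames, codebook):
    """Quantize every frame independently."""
    return [list(nearest_code(f, codebook)) for f in frames]


def avg_pool_pairs(frames):
    """Average adjacent frame pairs, halving the number of frames."""
    return [[(a + b) / 2 for a, b in zip(frames[t], frames[t + 1])]
            for t in range(0, len(frames) - 1, 2)]


def residual_multiscale_quantize(features, codebook):
    """features[i] is the (i+1)-th first sample feature, finest scale first."""
    quantized, residual = [], None
    for feat in features:
        # For i > 1 the input is the sample feature plus the previous residual.
        inp = feat if residual is None else [
            [a + b for a, b in zip(f, r)] for f, r in zip(feat, residual)]
        q = quantize(inp, codebook)
        quantized.append(q)
        # Difference between the sample feature and its quantized version,
        # pooled so that it matches the frame count of the next scale.
        diff = [[a - b for a, b in zip(f, qf)] for f, qf in zip(feat, q)]
        residual = avg_pool_pairs(diff)
    return quantized


codebook = [[0.0], [1.0]]
z1 = [[0.2], [0.9], [0.4], [0.6]]   # 4 frames
z2 = [[0.3], [0.8]]                 # 2 frames
z3 = [[0.7]]                        # 1 frame
quantized = residual_multiscale_quantize([z1, z2, z3], codebook)
```

Each returned feature keeps the frame count of its scale, and every frame is a codebook entry.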

[0085] By directly quantizing the first sample feature, the amount of feature data is reduced, thus increasing the speed of subsequent processing of the first quantized feature. Subsequently, in addition to quantizing the current first sample feature, the residual features before and after the previous quantization are also quantized to minimize information loss caused by feature quantization. While minimizing information loss, the amount of feature data is reduced, giving the first quantized feature strong representational power and improving the speed of subsequent processing of the first quantized feature. These characteristics of the first quantized feature contribute to improving the efficiency and accuracy of reconstructed audio generation, thereby enhancing the model's training performance.

[0086] Step 3032: The various first quantization features are fused through a quantizer to obtain the reconstructed audio features.

[0087] This application does not limit the fusion method of the first quantization features. For example, the reconstructed audio features can be obtained by concatenating the various first quantization features.

[0088] In an exemplary embodiment, there are N first quantization features, where N is a positive integer. Step 3032 includes: upsampling the second input information using a quantizer to obtain the (N-i)-th upsampling result; and fusing the (N-i)-th upsampling result and the i-th first quantization feature using the quantizer to obtain the (N-i)-th fusion result. When i is N-1, the second input information includes the N-th first quantization feature; when i is any positive integer from 1 to N-2, the second input information includes the (N-i-1)-th fusion result; and the (N-1)-th fusion result is the reconstructed audio feature.

[0089] In this example, the quantizer includes a first upsampling module. This first upsampling module upsamples the Nth first quantization feature to obtain a first upsampled result. The first upsampled result is then added to the (N-1)th first quantization feature to obtain a first fused result. Similarly, the quantizer may include a second upsampling module. This second upsampling module upsamples the first fused result to obtain a second upsampled result. The second upsampled result is then added to the (N-2)th first quantization feature to obtain a second fused result. The quantizer may also include a third upsampling module. This third upsampling module upsamples the second fused result to obtain a third upsampled result. The third upsampled result is then added to the (N-3)th first quantization feature to obtain a third fused result. This process continues until the (N-1)th fused result is obtained, which represents the reconstructed audio feature. The upsampling module is used to align the dimension of the fusion result or the dimension of the Nth first quantization feature with the dimension of the previous first quantization feature to facilitate subsequent fusion.
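
The coarse-to-fine fusion described above can be sketched as a simple loop. This sketch assumes that upsampling is done by repeating each frame twice, matching a factor of 2 between adjacent scales; the application does not specify the upsampling operation, and the function names are hypothetical.

```python
def upsample_repeat(frames, factor=2):
    """Upsample in time by repeating every frame `factor` times."""
    return [list(f) for f in frames for _ in range(factor)]


def fuse_top_down(quantized):
    """quantized[i] is the (i+1)-th first quantized feature, finest first.

    Start from the coarsest (N-th) feature, upsample it, add the next finer
    feature, and repeat until the finest feature has been fused.
    """
    fused = quantized[-1]
    for finer in reversed(quantized[:-1]):
        up = upsample_repeat(fused)
        fused = [[a + b for a, b in zip(u, f)] for u, f in zip(up, finer)]
    return fused


q1 = [[1], [2], [3], [4]]   # finest scale, 4 frames
q2 = [[10], [20]]           # 2 frames
q3 = [[100]]                # coarsest scale, 1 frame
print(fuse_top_down([q1, q2, q3]))  # [[111], [112], [123], [124]]
```

The result has the frame count of the finest first quantized feature, which is consistent with the reconstructed audio feature having N / 2 frames.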

[0090] By upsampling the second input information and fusing the upsampling result with the first quantization feature, the feature dimensions are aligned and the features are fused, so that the reconstructed audio features can accurately represent the information of the sample audio, thereby reconstructing the sample audio and improving the accuracy of the reconstructed audio.

[0091] Step 3033: Decode the reconstructed audio features using the decoder to obtain the reconstructed audio.

[0092] This application does not limit the structure of the decoder; any network with decoding capabilities can be used as a decoder. The reconstructed audio features can be denoted $Z_q \in \mathbb{R}^{N/2 \times d}$. The reconstructed audio features $Z_q$ are input into the decoder and decoded to obtain the reconstructed audio. Here $\mathbb{R}$ denotes the real numbers, $N/2$ is the number of audio frames included in the reconstructed audio features, $d$ is the dimension of each audio frame, and $T$ is the duration of the reconstructed audio. It should be noted that different decoder structures result in different decoding methods, which will not be elaborated upon here.

[0093] By quantizing each first sample feature to obtain each first quantized feature, and fusing these first quantized features to obtain reconstructed audio features, the amount of data required for reconstructed audio features is reduced, and the speed of subsequent decoding and reconstruction of audio features is increased, thereby improving the generation efficiency of reconstructed audio. Furthermore, the representational power of the reconstructed audio features can be improved, thereby increasing the accuracy of reconstructed audio and ultimately enhancing the training effect of the model.

[0094] In another possible implementation, step 303 includes steps 3034 to 3035 (not shown in the figure).

[0095] Step 3034: Generate second sample features based on the j-th downsampling result using a neural network model. The second sample features represent time-independent information in the sample audio.

[0096] The neural network model also includes a Time-invariant Representation Extraction (TIRE) module. The TIRE module enables the neural network model to generalize over time, allowing it to identify underlying patterns regardless of changes in the temporal form of the audio data. As mentioned above, the first encoder includes at least one first encoding module, where the j-th first encoding module determines the j-th downsampling result. The j-th downsampling result can be input into the TIRE module; for example, the second downsampling result can be input into the TIRE module. The TIRE module includes convolutional layers and activation layers. The convolutional layers perform convolutional processing, and the activation layers perform activation operations. The TIRE module can perform convolutional processing and activation operations multiple times on the j-th downsampling result and perform time-based averaging of the results to obtain time-invariant features, which are features independent of time. The TIRE module performs mapping and activation operations on the time-invariant features to obtain second sample features, which are also time-invariant features.
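
The time-averaging step is what removes the dependence on time. The sketch below keeps only an activation followed by averaging over the time axis; the convolutional layers and the final mapping of the TIRE module are omitted, ReLU is an assumed choice of activation, and the function names are hypothetical.

```python
def relu(x):
    """Assumed activation; the application does not name one."""
    return x if x > 0 else 0.0


def time_invariant_feature(frames):
    """Collapse a (frames x dim) feature into one dim-sized vector.

    Averaging over the time axis discards the order and position of
    frames, so the output no longer depends on time.
    """
    activated = [[relu(v) for v in frame] for frame in frames]
    n = len(activated)
    return [sum(column) / n for column in zip(*activated)]


frames = [[1.0, -2.0], [3.0, 4.0]]
m = time_invariant_feature(frames)
# Reordering the frames in time leaves the result unchanged.
assert m == time_invariant_feature(list(reversed(frames)))
```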

[0097] In sample audio, some information is stable and does not change over time. For example, the spectrum of the sample audio and the timbre of the person whose voice it belongs to do not change over time. The TIRE module extracts second sample features, which can characterize time-independent information in the sample audio, such as timbre and spectrum. The second sample feature can be denoted $m \in \mathbb{R}^{d}$, where $\mathbb{R}$ denotes the real numbers and $d$ is the dimension of the second sample feature.

[0098] Step 3035: Generate reconstructed audio based on the second sample features and each of the first sample features using a neural network model.

[0099] Following the implementation of steps 3031 to 3032, the reconstructed audio feature $Z_q$ can be determined based on each first sample feature. The reconstructed audio feature $Z_q$ and the second sample feature $m$ are input into the decoder, and the reconstructed audio is obtained through decoding. The decoding process will not be detailed here. The reconstructed audio is obtained by decoding the second sample features and each of the first sample features, thus reconstructing the sample audio based on time-independent information and information from different time scales within the sample audio. Because more comprehensive information is used during reconstruction, the accuracy of the reconstructed audio can be improved (closer to the sample audio), thereby improving the model training effect.

[0100] In an exemplary embodiment, step 3035 includes: dividing the second sample features to obtain at least two second sample sub-features; quantizing each of the second sample sub-features in the at least two second sample sub-features to obtain each second quantized feature; and generating reconstructed audio based on each second quantized feature and each first sample feature using a neural network model.

[0101] The neural network model also includes a group quantizer. First, the group quantizer divides the second sample feature into at least two second sample sub-features; for example, it divides the second sample feature into eight second sample sub-features. The division method is not limited here. Next, each second sample sub-feature is quantized by the group quantizer to obtain the second quantized feature. By fusing the various second quantized features, a fused feature is obtained. The fused feature can be denoted $m_q \in \mathbb{R}^{d}$, where $\mathbb{R}$ denotes the real numbers and $d$ is the dimension of the fused feature.
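
A minimal sketch of group quantization follows. It assumes equal-sized contiguous groups, one nearest-neighbour codebook per group, and fusion by concatenation; none of these choices is fixed by the application, and the function name is hypothetical. Two groups are used instead of eight to keep the example short.

```python
def group_quantize(vector, num_groups, codebooks):
    """Split `vector` into equal groups, quantize each, and concatenate.

    codebooks[g] is the list of candidate codes for group g; each code has
    the same length as one group.
    """
    size = len(vector) // num_groups
    fused = []
    for g in range(num_groups):
        sub = vector[g * size:(g + 1) * size]   # second sample sub-feature
        code = min(codebooks[g],
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(sub, c)))
        fused.extend(code)                      # second quantized feature
    return fused


m = [0.1, 0.9, 0.8, 0.2]
codebooks = [[[0, 1], [1, 0]], [[0, 1], [1, 0]]]
print(group_quantize(m, 2, codebooks))  # [0, 1, 1, 0]
```

The fused output has the same dimension d as the second sample feature, in line with the fused feature sharing the dimension of m.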

[0102] Following the implementation of steps 3031 to 3032, the reconstructed audio feature $Z_q$ can be determined based on each first sample feature. The reconstructed audio feature $Z_q$ and the fused feature $m_q$ are input into the decoder and decoded to obtain the reconstructed audio. By segmenting the second sample features, the segment-level second sample features are transformed into frame-level second sample sub-features, aligning each second sample sub-feature with the frame-level first sample features for easier subsequent decoding. Quantizing the second sample sub-features reduces the amount of feature data, thereby improving the speed of subsequent decoding.

[0103] Different decoders have different structures and decoding methods. One possible implementation is shown below.

[0104] The decoder includes at least one decoding module. If there are at least two decoding modules, they are connected in series, and the structures of any two decoding modules can be the same or different. This application does not limit the structure of the decoding modules; any network with upsampling functionality can serve as a decoding module, and the functionality of the decoding module is not limited to upsampling.

[0105] For example, the decoder includes four cascaded decoding modules, each consisting of three residual units and one upsampling layer. The residual unit comprises a transposed convolutional layer (also called a deconvolutional layer), a normalization layer, and an activation layer. The transposed convolutional layer performs transposed convolution processing on the input of the residual unit, and the normalization layer normalizes the result of the transposed convolutional layer's processing. The residual units are implemented with skip connections, allowing the input of the residual unit to be directly added to the output of the normalization layer, and then activated by the activation layer to obtain the output of the residual unit. The upsampling layer is used to upsample the input.

[0106] In this example, the fused feature $m_q$ can be input into any decoding module. The first decoding module upsamples the reconstructed audio feature $Z_q$ to obtain the first upsampling result. If the fused feature $m_q$ is input into the first decoding module, the first decoding module adds the reconstructed audio feature $Z_q$ and the fused feature $m_q$ and upsamples the sum to obtain the first upsampling result. The second decoding module upsamples the first upsampling result to obtain the second upsampling result. If the fused feature $m_q$ is input into the second decoding module, the second decoding module adds the first upsampling result and the fused feature $m_q$ and upsamples the sum to obtain the second upsampling result. The third decoding module upsamples the second upsampling result to obtain the third upsampling result. If the fused feature $m_q$ is input into the third decoding module, the third decoding module adds the second upsampling result and the fused feature $m_q$ and upsamples the sum to obtain the third upsampling result. This process is repeated until the last upsampling result is obtained, and the reconstructed audio is determined based on the last upsampling result.

[0107] In practical applications, network layers such as convolutional layers and normalization layers can be connected in series before or after any decoding module, without any restrictions.

[0108] For example, the decoder includes four decoding modules, with two one-dimensional convolutional layers cascaded before the first decoding module. These convolutional layers are used to perform convolution processing. Based on this, the electronic device inputs the reconstructed audio feature $Z_q$ and the fused feature $m_q$ into the decoder. The decoder first performs convolution processing on the reconstructed audio feature $Z_q$ through the one-dimensional convolutional layers to obtain a convolution result, and then, through the four decoding modules, progressively upsamples the convolution result and the fused feature $m_q$ to obtain the final upsampling result. The upsampling factor (i.e., the stride of the upsampling) of each decoding module is not limited here. For example, if the upsampling factors of the four decoding modules are 8, 5, 4, and 2 respectively, then the total upsampling factor satisfies D = 2 × 4 × 5 × 8 = 320.
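
The factor arithmetic in this example can be verified directly. The helper names and the frame count of 50 are illustrative assumptions; only the factors 8, 5, 4, and 2 come from the text.

```python
import math


def total_upsampling_factor(factors):
    """Product of the per-module upsampling factors."""
    return math.prod(factors)


def output_length(num_frames, factors):
    """Number of output time steps after all decoding modules."""
    return num_frames * total_upsampling_factor(factors)


factors = [8, 5, 4, 2]
print(total_upsampling_factor(factors))  # 320
print(output_length(50, factors))        # 16000
```

Under these factors, each input frame of the decoder expands to 320 output time steps.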

[0109] For example, the decoder and the first encoder have symmetrical structures, and the processing performed by the decoder is the inverse operation of the processing performed by the first encoder. For instance, the first encoder includes a Transformer encoding module, and the decoder includes a Transformer decoding module. Based on this, the first encoder can encode sample audio to obtain reference audio features and the j-th downsampled result, while the decoder can decode the reconstructed audio features and the j-th downsampled result to obtain the reconstructed audio.

[0110] Step 304: Train a neural network model using sample audio and reconstructed audio to obtain an audio generation model. The audio generation model is used to generate a second audio based on the first audio.

[0111] In this embodiment, the total loss of the neural network model can be determined using sample audio and reconstructed audio, and the neural network model can be trained once using the total loss. In practical applications, steps 301 to 304 can be executed multiple times. That is, the neural network model can be trained multiple times according to the implementation principle of this embodiment to finally obtain the audio generation model. The audio generation model and the neural network model are similar in structure and function, differing only in model parameters. The number of training iterations of the neural network model is not limited here. For example, the number of training iterations can be set based on human experience, or training can continue until the model's performance meets the requirements. It is understood that there are multiple ways to implement step 304, and several possible implementations are shown below.

[0112] In implementation method A, step 304 includes: determining the i-th first loss through the i-th first sample feature and the i-th first quantization feature, wherein the i-th first loss is used to characterize the difference between the i-th first sample feature and the i-th first quantization feature; and training a neural network model based on the sample audio, the reconstructed audio, and each first loss to obtain an audio generation model.

[0113] The i-th first loss can be calculated based on the i-th first quantization feature and the i-th first sample feature using the first loss function. The type of the first loss function is not limited here; for example, it can be an L1 loss function or an L2 loss function. The L1 loss function is also called the Mean Absolute Error (MAE) function, and the L2 loss function is also called the Mean Squared Error (MSE) function. Since there are at least two first sample features, each first sample feature corresponds to one first quantization feature, and each first sample feature and its corresponding first quantization feature can determine a first loss, therefore, there are also at least two first losses.
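
As an illustration, the L1 and L2 variants of the first loss can be written as below, with one loss per time scale. The function names are hypothetical, and features are represented as plain nested lists (frames × dimension).

```python
def l1_loss(sample, quantized):
    """Mean absolute error over all elements of two equally shaped features."""
    diffs = [abs(a - b) for fs, fq in zip(sample, quantized)
             for a, b in zip(fs, fq)]
    return sum(diffs) / len(diffs)


def l2_loss(sample, quantized):
    """Mean squared error over all elements of two equally shaped features."""
    diffs = [(a - b) ** 2 for fs, fq in zip(sample, quantized)
             for a, b in zip(fs, fq)]
    return sum(diffs) / len(diffs)


def first_losses(sample_features, quantized_features, loss_fn=l1_loss):
    """One first loss per (first sample feature, first quantized feature) pair."""
    return [loss_fn(z, q) for z, q in zip(sample_features, quantized_features)]


z = [[[1.0, 2.0]], [[0.5]]]      # two first sample features
q = [[[0.0, 4.0]], [[0.5]]]      # their first quantized features
print(first_losses(z, q))                    # [1.5, 0.0]
print(first_losses(z, q, loss_fn=l2_loss))   # [2.5, 0.0]
```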

[0114] Furthermore, according to implementation method C described below, at least one of a third loss or a fourth loss can be calculated based on the sample audio and the reconstructed audio. The third loss and/or the fourth loss and each of the first losses are then combined in a weighted sum to obtain the total loss. The neural network model is then trained using the total loss to obtain the audio generation model.

[0115] By training the model with sample audio and reconstructed audio, the model can be optimized to make the reconstructed audio approximate the sample audio, thus improving the accuracy of the audio generation model. Furthermore, by training the model with the first loss, the model can be optimized to make the first quantized feature approximate the first sample feature, reducing information loss during feature quantization and improving the accuracy of quantized features, thereby enhancing the accuracy of the generated audio.

[0116] In implementation method B, step 304 includes: determining a second loss using at least two first sample features, wherein the second loss characterizes the correlation between each pair of first sample features; and training a neural network model based on the sample audio, the reconstructed audio, and the second loss to obtain an audio generation model.

[0117] The correlation between any two first sample features can be determined using a correlation function. The type of correlation function is not limited here; for example, it can be a cosine similarity function, a mutual information function, or a KL divergence function. For instance, the mutual information function could be a contrastive log-ratio upper bound (CLUB) function or a variational contrastive log-ratio upper bound (VCLUB) function. These two functions are described in detail below.

[0118] When variables u and v are independent, and the conditional distribution P(u|v) is known, the CLUB function is defined as: $I_{CLUB}(u, v) = E_{P(u,v)}[\log P(u|v)] - E_{P(u)}[E_{P(v)}[\log P(u|v)]]$. Here, P(u|v) represents the probability that the distribution of variable u is generated based on the distribution of variable v, and $I_{CLUB}(u, v)$ represents the expected mutual information between variables v and u. log denotes the logarithm, and E denotes the mean (expectation). P(u, v) represents the joint distribution of variables u and v, P(u) represents the marginal distribution of variable u, and P(v) represents the marginal distribution of variable v. In practical applications, variables u and v can be independent or correlated; based on this, $I_{CLUB}(u, v)$ is an upper bound of $I(u, v)$, that is, $I(u, v) \le I_{CLUB}(u, v)$, where $I(u, v)$ represents the actual mutual information between variables v and u. When the mutual information function is a CLUB function, variables u and v are two different first sample features, and the mutual information between the two first sample features can be calculated according to the CLUB function described above.

[0119] The VCLUB function is a variational form of the CLUB function. It uses a neural network to determine the conditional distribution P(u|v) when that distribution is unknown. Assuming the conditional distribution determined by the neural network is $Q_\theta(u|v)$, the VCLUB function can be expressed as: $I_{VCLUB}(u, v) = E_{P(u,v)}[\log Q_\theta(u|v)] - E_{P(u)}[E_{P(v)}[\log Q_\theta(u|v)]]$. Here, $Q_\theta(u|v)$ represents the probability, determined by the neural network, that the distribution of variable u is generated from the distribution of variable v. In other words, $Q_\theta(u|v)$ is a variational approximation of P(u|v). Variational approximation refers to searching for the function that best approximates the objective function by adjusting the shape of the approximation function (e.g., its parameters). The degree of approximation can be measured by the Kullback-Leibler (KL) divergence. θ represents the network parameters of the neural network, and $I_{VCLUB}(u, v)$ represents the mutual information between variables v and u. The meanings of the other parameters in the VCLUB function can be found in the description of the CLUB function and are not repeated here. For example, $L_{u,v} = \log Q_\theta(u|v)$. When the mutual information function is a VCLUB function, variables u and v are two different first sample features, and the mutual information between the two first sample features can be calculated according to the VCLUB function described above.

[0120] For example, if there are at least two sample audio files, when calculating the mutual information between two first sample features, these two first sample features can correspond to the same sample audio file or different sample audio files, and the two first sample features can correspond to different time scales. Assuming the two first sample features correspond to time scales t1 and t2 respectively, then the two first sample features belonging to the same sample audio file at these two time scales constitute a positive sample pair, while the two first sample features belonging to different sample audio files at these two time scales constitute a negative sample pair. Based on each positive and negative sample pair, the mutual information between the first sample features corresponding to the two time scales is calculated.

[0121] For example, since different first sample features correspond to different time scales, they also correspond to different numbers of audio frames. Based on this, when calculating the mutual information between two first sample features, one of the first sample features can be expanded or compressed to align with the other, thus facilitating the calculation of mutual information. For instance, the number of audio frames for first sample feature Z1 is twice the number of audio frames for first sample feature Z2, and the number of audio frames for first sample feature Z2 is twice the number of audio frames for first sample feature Z3. Therefore, Z2 is repeated twice in terms of audio frames, and Z3 is repeated four times in terms of audio frames, so that the audio frames of Z2 and Z3 are aligned with the audio frames of Z1, facilitating the subsequent calculation of the mutual information between each pair of first sample features.
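The frame alignment in this example can be sketched as follows, assuming each feature is a list of frames. The helper name `repeat_frames` is hypothetical; the repetition factors 2 and 4 follow the Z1, Z2, Z3 example above.

```python
def repeat_frames(feature, factor):
    """Repeat every audio frame `factor` times along the frame axis."""
    aligned = []
    for frame in feature:
        for _ in range(factor):
            aligned.append(list(frame))
    return aligned


# Z1 has 4 frames, Z2 has 2 frames, Z3 has 1 frame (feature dimension 1).
z1 = [[0.1], [0.2], [0.3], [0.4]]
z2 = [[1.0], [2.0]]
z3 = [[9.0]]

z2_aligned = repeat_frames(z2, 2)  # [[1.0], [1.0], [2.0], [2.0]]
z3_aligned = repeat_frames(z3, 4)  # [[9.0], [9.0], [9.0], [9.0]]
```

After alignment, all three features have the same number of frames, so frame i of one feature can be paired with frame i of another.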

[0122] Based on this, the mutual information between the feature corresponding to time scale t1 (i.e., one first sample feature) and the feature corresponding to time scale t2 (i.e., another first sample feature) can be calculated using the VCLUB function. Here, Z1 represents the feature corresponding to time scale t1, and Z2 represents the feature corresponding to time scale t2. K represents the number of audio samples, and N/2 represents the number of audio frame features in each first sample feature after alignment. $Z_{1;k,i}$ represents the i-th audio frame feature in the first sample feature Z1 corresponding to the k-th audio sample. $Z_{2;k,i}$ represents the i-th audio frame feature in the first sample feature Z2 corresponding to the k-th audio sample. $Z_{1;l,i}$ represents the i-th audio frame feature in the first sample feature Z1 corresponding to the l-th audio sample. $Z_{1;k,i}$ and $Z_{2;k,i}$ form a positive sample pair, while $Z_{1;l,i}$ and $Z_{2;k,i}$ form a negative sample pair. $Q_\theta(Z_{1;k,i}|Z_{2;k,i})$ represents the conditional distribution of positive sample pairs determined by the neural network, $Q_\theta(Z_{1;l,i}|Z_{2;k,i})$ represents the conditional distribution of negative sample pairs determined by the neural network, and θ represents the network parameters of the neural network.
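A sample-based estimate built from positive and negative pairs can be sketched as below. This is only an illustration under stated assumptions, not the formula of the application: the conditional distribution is assumed to be a unit-variance Gaussian whose mean is produced by a hypothetical `predict_mean` callable (standing in for the neural network with parameters θ), and the averaging over K samples and the aligned frames is an assumed normalization.

```python
def log_q(u, v, predict_mean):
    """log Q_theta(u | v) up to an additive constant, assuming a
    unit-variance Gaussian centred on predict_mean(v)."""
    mean = predict_mean(v)
    return -0.5 * sum((x - m) ** 2 for x, m in zip(u, mean))


def vclub_estimate(z1, z2, predict_mean):
    """z1[k][i] and z2[k][i] are the i-th frame features of the k-th audio
    sample at two time scales, already aligned to the same frame count."""
    num_samples = len(z1)
    num_frames = len(z1[0])
    total = 0.0
    for k in range(num_samples):
        for i in range(num_frames):
            # Positive pair: both frames come from the same audio sample k.
            positive = log_q(z1[k][i], z2[k][i], predict_mean)
            # Negative pairs: Z1 frames of every sample l against Z2 of sample k.
            negative = sum(
                log_q(z1[l][i], z2[k][i], predict_mean)
                for l in range(num_samples)
            ) / num_samples
            total += positive - negative
    return total / (num_samples * num_frames)


def identity(v):
    """Toy stand-in for the learned network."""
    return v


correlated = vclub_estimate([[[0.0]], [[2.0]]], [[[0.0]], [[2.0]]], identity)
constant = vclub_estimate([[[1.0]], [[1.0]]], [[[1.0]], [[1.0]]], identity)
```

In this toy setting, features that vary together across samples give a positive estimate, while features that carry no sample-specific information give zero.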

[0123] Based on the above principle, the mutual information between any two first sample features can be determined. Taking a scenario with three time scales as an example, where each audio sample corresponds to three first sample features, denoted as Z1, Z2, and Z3 respectively, the mutual information between Z1 and Z2, the mutual information between Z1 and Z3, and the mutual information between Z2 and Z3 can each be determined according to the above implementation principle.

[0124] In this example, the second loss $L_{mi}$ can be obtained by summing or averaging the mutual information between every two first sample features. Furthermore, according to implementation method C, at least one of a third loss or a fourth loss can be calculated based on the sample audio and the reconstructed audio. The third or fourth loss, along with the second loss, is weighted to obtain the total loss. The neural network model is then trained using the total loss to obtain the audio generation model.
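The aggregation over feature pairs can be sketched as follows. The function name `second_loss` and the `mutual_information` callable are hypothetical; here the callable simply looks up made-up values so that the example stays self-contained.

```python
from itertools import combinations


def second_loss(features, mutual_information, reduce="mean"):
    """Sum or average the mutual information over every pair of features."""
    values = [mutual_information(a, b) for a, b in combinations(features, 2)]
    if reduce == "sum":
        return sum(values)
    return sum(values) / len(values)


# Placeholder mutual-information values for the three pairs (illustrative only).
mi_table = {("Z1", "Z2"): 0.5, ("Z1", "Z3"): 0.25, ("Z2", "Z3"): 0.75}


def lookup(a, b):
    return mi_table[(a, b)]


loss_sum = second_loss(["Z1", "Z2", "Z3"], lookup, reduce="sum")
loss_mean = second_loss(["Z1", "Z2", "Z3"], lookup, reduce="mean")
```

With three time scales there are three pairs, so the sum and the mean differ by a factor of three; either choice can be absorbed into the weight of the second loss.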

[0125] By training the model with sample audio and reconstructed audio, the model can be optimized to make the reconstructed audio approximate the sample audio, thus improving the accuracy of the audio generation model. Furthermore, training the model with the second loss optimizes it to reduce the correlation between every two first sample features. This results in lower correlation between the features extracted by the audio generation model at different time scales, allowing features at different time scales to independently, accurately, and fully represent the audio. In other words, features at different time scales can represent different aspects of the audio information, which reduces feature redundancy across time scales and improves the effectiveness of encoding and the representational power of the features, thereby improving the quality of the generated audio.

[0126] In implementation C, step 304 includes: determining at least one of a third loss or a fourth loss based on the sample audio and the reconstructed audio, wherein the third loss characterizes the difference between the sample audio and the reconstructed audio in the time domain and the fourth loss characterizes the difference between the sample audio and the reconstructed audio in the frequency domain; and training a neural network model based on at least one of the third loss or the fourth loss to obtain an audio generation model.

[0127] A third loss function can be used to calculate the third loss based on the waveform of the sample audio and the waveform of the reconstructed audio. The type of the third loss function is not limited here; for example, it can be an L1 loss function or an L2 loss function.

[0128] Additionally or alternatively, a fourth loss function can be used to calculate the fourth loss based on the spectrum of the sample audio and the spectrum of the reconstructed audio. The type of the fourth loss function is not limited here; for example, the fourth loss function can be an L1 loss function or an L2 loss function.

[0129] The total loss is determined based on at least one of the third or fourth losses. For example, the third and fourth losses can be weighted to obtain the total loss. The neural network model is then trained using the total loss to obtain the audio generation model. Training the model using the third and fourth losses can optimize the model towards making the reconstructed audio approximate the sample audio in both the time and frequency domains, thereby improving the accuracy of the audio generation model.
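The time-domain and frequency-domain losses can be illustrated with a minimal sketch. It assumes an L1 loss for both, and it assumes the spectrum is the magnitude of a discrete Fourier transform over the whole signal; the text does not specify the spectrum type, and a practical system might use a short-time or mel spectrum instead. All function names are hypothetical.

```python
import cmath


def magnitude_spectrum(waveform):
    """Magnitudes of the non-negative-frequency DFT bins of a real signal."""
    n = len(waveform)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(waveform)
        )
        spectrum.append(abs(coeff))
    return spectrum


def l1(a, b):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def time_domain_loss(sample, reconstructed):
    """Third loss: compares the two waveforms directly."""
    return l1(sample, reconstructed)


def frequency_domain_loss(sample, reconstructed):
    """Fourth loss: compares the two magnitude spectra."""
    return l1(magnitude_spectrum(sample), magnitude_spectrum(reconstructed))


sample = [0.0, 1.0, 0.0, -1.0]
reconstructed = [0.0, 0.5, 0.0, -0.5]
third = time_domain_loss(sample, reconstructed)
fourth = frequency_domain_loss(sample, reconstructed)
```

In this toy case the reconstruction has the right frequency content but half the amplitude, so both losses are nonzero; a perfect reconstruction gives zero for both.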

[0130] It is understood that the total loss includes, but is not limited to, the losses mentioned above. For example, the total loss may also include at least one of a fifth loss or a sixth loss, determined as follows. The sample audio is input into a discriminator, which extracts features of the sample audio and determines a first discrimination result based on those features, where the first discrimination result represents the probability that the sample audio is real data. Similarly, the reconstructed audio is input into the discriminator, which extracts features of the reconstructed audio and determines a second discrimination result based on those features, where the second discrimination result represents the probability that the reconstructed audio is real data. Here, real data refers to original data, such as the sample audio, while non-real data (i.e., fake data) refers to generated data, such as the reconstructed audio. The fifth loss is determined according to a fifth loss function based on the features of the sample audio and the features of the reconstructed audio, and characterizes the difference between the two features. For example, the fifth loss function is the L2 loss function. The sixth loss is determined according to a sixth loss function based on the second discrimination result, and characterizes the difference between the second discrimination result and specified data. For example, the sixth loss $L_g$ is calculated over a batch from the second discrimination result of the reconstructed audio corresponding to each k-th sample audio, where K represents the number of sample audio in the batch.
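The two discriminator-based losses can be sketched as below. The fifth loss follows the L2 example given above. For the sixth loss, the sketch assumes a least-squares form with a target value of 1.0 as the specified data; this is one common choice and is an assumption, since the exact formula is not reproduced in this text. The function names are hypothetical.

```python
def fifth_loss(sample_features, reconstructed_features):
    """L2 loss between discriminator features of the sample audio and of
    the reconstructed audio."""
    squared = [
        (r - f) ** 2 for r, f in zip(sample_features, reconstructed_features)
    ]
    return sum(squared) / len(squared)


def sixth_loss(second_results, target=1.0):
    """Assumed least-squares form: mean squared gap between each second
    discrimination result in the batch and the target value."""
    squared = [(target - p) ** 2 for p in second_results]
    return sum(squared) / len(squared)


# Batch of K = 2 reconstructed audio clips; the discriminator is fully
# convinced by the first one and undecided about the second.
l_feat = fifth_loss([1.0, 2.0], [1.0, 4.0])
l_g = sixth_loss([1.0, 0.5])
```

Under this assumption the sixth loss decreases as the discriminator rates the reconstructed audio as more likely to be real.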

[0131] Furthermore, the total loss can be determined based on at least one of the third or fourth losses, or by further combining at least one of the first, second, fifth, and sixth losses. For example, the total loss satisfies: $L = \lambda_t L_t + \lambda_f L_f + \lambda_g L_g + \lambda_{feat} L_{feat} + \lambda_w L_w + \lambda_{mi} L_{mi}$, where L represents the total loss; $L_t$ represents the third loss and $\lambda_t$ its weight; $L_f$ represents the fourth loss and $\lambda_f$ its weight; $L_g$ represents the sixth loss and $\lambda_g$ its weight; $L_{feat}$ represents the fifth loss and $\lambda_{feat}$ its weight; $L_w$ represents the first loss and $\lambda_w$ its weight; and $L_{mi}$ represents the second loss and $\lambda_{mi}$ its weight. The neural network model is trained using the total loss to obtain the audio generation model. Subsequently, the audio generation model can be used to generate a second audio based on the first audio.
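The weighted combination can be sketched as follows. The dictionary keys mirror the subscripts of the formula above; the function name `total_loss` and the numeric weights are hypothetical. Only the loss terms that are actually supplied contribute, which reflects that the total loss may use a subset of the six losses.

```python
def total_loss(losses, weights):
    """Weighted sum of whichever loss terms are present in `losses`."""
    return sum(weights[name] * value for name, value in losses.items())


# Illustrative weights only; the text does not give concrete values.
weights = {"t": 1.0, "f": 0.5, "g": 1.0, "feat": 2.0, "w": 0.25, "mi": 0.25}

# A configuration that uses the third, fourth, and second losses.
losses = {"t": 1.0, "f": 2.0, "mi": 4.0}
loss_value = total_loss(losses, weights)
```

When there are several first losses (one per time scale), they would first be combined into a single value for the `w` entry, for example by summation.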

[0132] In other words, the method in this application embodiment further includes: obtaining a first audio; encoding the first audio through an audio generation model to obtain at least two first audio features, the at least two first audio features corresponding to different time scales, the i-th first audio feature among the at least two first audio features being used to characterize the feature information of multiple audio segments obtained after segmenting the first audio according to the i-th time scale, where i is a positive integer; and generating a second audio based on the at least two first audio features through an audio generation model.

[0133] The first audio can be any audio. For example, the first audio could be someone's voice. Electronic devices can acquire the first audio through audio acquisition devices, or they can acquire it through the internet; the method of acquisition is not limited here.

[0134] Following the implementation principle of step 302, the first audio can be encoded using the audio generation model to obtain at least two first audio features; the encoding method will not be elaborated here. Next, following the implementation principle of step 303, the second audio can be generated based on the at least two first audio features using the audio generation model; the generation method will also not be elaborated here.

[0135] The audio generation model of this application embodiment can be applied in various scenarios. For example, the audio generation model can be applied in audio transmission scenarios. Exemplarily, both the source device (an electronic device) and the destination device (another electronic device) deploy the audio generation model. The source device acquires first audio, encodes the first audio using the audio generation model to obtain various first audio features, and quantizes each first audio feature to obtain various third quantization features. The source device transmits each third quantization feature to the destination device. The destination device fuses the various third quantization features and reconstructs the first audio based on the fused features to obtain a second audio. Since each third quantization feature can accurately represent the first audio, the reconstructed second audio is close to the first audio, thus ensuring the quality of the second audio. Furthermore, transmitting the third quantization features in an audio transmission scenario reduces the amount of data transmitted and saves transmission resources.

[0136] For example, audio generation models can be applied in speech synthesis scenarios. Exemplarily, an electronic device deploys an audio generation model. The device can acquire a first audio file, encode the first audio file using the audio generation model to obtain various first audio features, and quantize these first audio features to obtain various third quantized features. The electronic device also deploys other pre-trained models, which can update the various third quantized features based on the target text to obtain various fourth quantized features. The third quantized features represent the semantics, timbre, frequency, and other information of the first audio file, while the fourth quantized features represent the timbre, frequency, and other information of the first audio file, as well as the semantics of the target text. In other words, the third and fourth quantized features represent different text semantics but the same sound information. The electronic device can fuse the various fourth quantized features using the audio generation model and generate a second audio file based on the fused features. This second audio file speaks the target text using the voice of the first audio file, improving the quality of the synthesized speech.

[0137] It should be noted that all information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application have been authorized by the user or fully authorized by all parties, and the collection, use, and processing of related data must comply with the relevant laws, regulations, and standards of the relevant regions. For example, the sample audio and first audio involved in this application were obtained with full authorization.

[0138] The above method encodes sample audio using a neural network model to obtain at least two first sample features. Since the i-th first sample feature represents information from multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, different first sample features can reflect information from the sample audio at different time scales, making each first sample feature more representative and accurately reflecting the sample audio. Based on this, highly accurate reconstructed audio can be generated from each first sample feature. This improves the training effect and accuracy of the audio generation model when training the neural network model based on the reconstructed audio and sample audio, thus enabling the generation of high-quality audio based on the audio generation model.

[0139] The training method for the audio generation model has been described above from the perspective of methodology and steps. The following section will further elaborate on this in the context of specific scenarios. As mentioned above, the audio generation model of this application can be applied to scenarios such as audio transmission and speech synthesis. Based on this, taking the speech synthesis scenario as an example, as shown in Figure 6, the speech synthesis process includes the following steps.

[0140] Step 601: Train a neural network model using sample audio to obtain an audio generation model.

[0141] Figure 7 illustrates the training of an audio generation model provided in an embodiment of this application. This audio generation model is trained based on a neural network model. The neural network model includes a first encoder, which comprises four cascaded first encoding modules and a convolutional layer. Sample audio 701 is input into the first encoder, which encodes the sample audio to obtain reference audio features 702. The reference audio features 702 can be represented as $Z_e \in \mathbb{R}^{N \times d}$, where $\mathbb{R}$ represents the set of real numbers, N represents the number of audio frames, and d represents the feature dimension of each audio frame.

[0142] The neural network model also includes a second encoder, which encodes the reference audio features 702 to obtain multiple first sample features 703. Specifically, the second encoder includes three cascaded convolutional layers, with an adapter connected after each convolutional layer. After the reference audio features 702 are input into the second encoder, they are first convolved by convolutional layer 1, and the convolution result of convolutional layer 1 is processed by adapter 1 to obtain a first sample feature $Z_1 \in \mathbb{R}^{N/2 \times d}$. Next, the convolution result of convolutional layer 1 is convolved by convolutional layer 2, and the convolution result of convolutional layer 2 is processed by adapter 2 to obtain another first sample feature $Z_2 \in \mathbb{R}^{N/4 \times d}$. Then, the convolution result of convolutional layer 2 is convolved by convolutional layer 3, and the convolution result of convolutional layer 3 is processed by adapter 3 to obtain another first sample feature $Z_3 \in \mathbb{R}^{N/8 \times d}$.
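The cascade of convolutional layers and adapters can be sketched to show how the frame counts N/2, N/4, and N/8 arise. This is a shape-level illustration only: averaging adjacent frame pairs is a fixed-weight stand-in for a learned stride-2 convolution (the stride is an assumption inferred from the halving of frame counts), and the adapters are identity placeholders. The names `halve_frames` and `second_encoder` are hypothetical.

```python
def halve_frames(feature):
    """Stand-in for a stride-2 convolution: average each pair of
    neighbouring frames, halving the number of frames."""
    return [
        [(a + b) / 2.0 for a, b in zip(feature[i], feature[i + 1])]
        for i in range(0, len(feature) - 1, 2)
    ]


def second_encoder(reference_features, adapters):
    """Each layer convolves the previous convolution result; each adapter
    turns its layer's convolution result into one first sample feature."""
    sample_features = []
    current = reference_features
    for adapter in adapters:
        current = halve_frames(current)           # i-th convolution result
        sample_features.append(adapter(current))  # i-th first sample feature
    return sample_features


def identity(x):
    return x


# Reference audio features with N = 8 frames and d = 1.
z_e = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0], [12.0], [14.0]]
z1, z2, z3 = second_encoder(z_e, [identity, identity, identity])
```

Note that each convolutional layer consumes the previous convolution result rather than the previous adapter output, as described above.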

[0143] The neural network model also includes a quantizer, which quantizes each first sample feature 703 and fuses the results to obtain the reconstructed audio feature 704. Specifically, the quantizer includes three quantization modules. After the multiple first sample features 703 are input into the quantizer, the first sample feature Z1 is first quantized by quantization module 1 to obtain a first quantized feature; as shown in Figure 7, the first quantized feature of quantization module 1 is 9, 16, 52, 84, and so on. The difference between the first sample feature Z1 and the corresponding first quantized feature is downsampled, and the sum of the downsampled result and the first sample feature Z2 is quantized by quantization module 2 to obtain another first quantized feature; as shown in Figure 7, the first quantized feature of quantization module 2 is 71, 67, and so on. The difference between the first sample feature Z2 and the corresponding first quantized feature is downsampled, and the sum of the downsampled result and the first sample feature Z3 is quantized by quantization module 3 to obtain yet another first quantized feature; as shown in Figure 7, the first quantized feature of quantization module 3 is 12, and so on. Next, the first quantized feature of quantization module 3 is upsampled, the upsampled result is added to the first quantized feature of quantization module 2, the added result is upsampled, and the upsampled result is added to the first quantized feature of quantization module 1 to obtain the reconstructed audio feature 704. For example, the reconstructed audio feature 704 can be represented as $Z_q \in \mathbb{R}^{N/2 \times d}$.
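The coarse-to-fine quantization and the fine-to-coarse fusion can be sketched as below. The sketch makes several assumptions that are not in the application: rounding to the nearest multiple of a step size stands in for a learned quantization module, downsampling averages adjacent frame pairs, and upsampling repeats each frame twice. All function names are hypothetical.

```python
def quantize(feature, step=1.0):
    """Stand-in quantization module: round to the nearest multiple of step."""
    return [[round(x / step) * step for x in frame] for frame in feature]


def add(a, b):
    return [[x + y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def subtract(a, b):
    return [[x - y for x, y in zip(fa, fb)] for fa, fb in zip(a, b)]


def downsample(feature):
    """Average adjacent frame pairs (halves the frame count)."""
    return [
        [(x + y) / 2.0 for x, y in zip(feature[i], feature[i + 1])]
        for i in range(0, len(feature) - 1, 2)
    ]


def upsample(feature):
    """Repeat each frame twice (doubles the frame count)."""
    return [list(frame) for frame in feature for _ in range(2)]


def multiscale_quantize(sample_features):
    quantized, carried = [], None
    for z in sample_features:
        # From the second scale on, add the downsampled residual of the
        # previous scale to the current sample feature before quantizing.
        module_input = z if carried is None else add(z, carried)
        q = quantize(module_input)
        quantized.append(q)
        carried = downsample(subtract(z, q))
    return quantized


def fuse(quantized):
    """Upsample the coarsest quantized feature and add it to the next finer
    one, repeating until the finest scale is reached."""
    fused = quantized[-1]
    for q in reversed(quantized[:-1]):
        fused = add(upsample(fused), q)
    return fused


z1 = [[0.25], [1.25], [2.75], [3.75]]
z2 = [[1.5], [3.25]]
z3 = [[2.0]]
quantized = multiscale_quantize([z1, z2, z3])
z_q = fuse(quantized)
```

The fused feature has the same number of frames as Z1, which matches the N/2 frame count of the reconstructed audio feature.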

[0144] The neural network model also includes a time-invariant representation extraction module and a group quantizer. The time-invariant representation extraction module extracts features from the encoding result of the second of the first encoding modules (i.e., the j-th downsampling result mentioned above) to obtain the second sample features. The group quantizer then groups and quantizes these second sample features and fuses the results to obtain the fused feature $m_q$. The neural network model also includes a decoder, which consists of two cascaded convolutional layers and four decoding modules. The decoder processes the fused feature $m_q$ and the reconstructed audio feature 704 to obtain reconstructed audio 705.

[0145] Next, on the one hand, the mutual information between every two first sample features 703 is calculated, and on the other hand, the L2 loss between the sample audio 701 and the reconstructed audio 705 is calculated. The neural network model is trained using the mutual information and L2 loss to obtain the audio generation model. The implementation principle of step 601 has been described above and will not be repeated here.

[0146] Step 602: Input the first speech into the audio generation model to obtain the various quantization features corresponding to the first speech.

[0147] In speech synthesis scenarios, electronic devices can install speech synthesis applications, which provide an interface for speech synthesis. This application does not limit the style, size, or content of the interface; exemplarily, the speech synthesis interface is shown in Figure 8. In this example, the speech synthesis interface may include an audio input area 802, where a user can upload audio to enable the electronic device to acquire the first speech. The speech synthesis interface may also include an audio selection area 804, where a user can select audio to enable the electronic device to acquire the first speech.

[0148] The electronic device is equipped with an audio generation model. First speech can be input into the audio generation model, and processed by a first encoder, a second encoder, and a quantizer to obtain the corresponding quantization features. The method for determining the quantization features corresponding to the first speech can be found in the description of the first quantization feature above; the implementation principles are similar and will not be repeated here.

[0149] In practical applications, more information can be used to determine the various quantization features corresponding to the first speech. For example, an audio generation model can be used to determine the various quantization features corresponding to the first speech based on the first speech and its speech recognition results. For instance, the speech synthesis interface can also include a correction area 803, which displays the recognition result of the first speech and, based on the user's correction of the recognition result, displays the corrected recognition result. Electronic devices can use an audio generation model to determine the various quantization features corresponding to the first speech based on the first speech and its corrected speech recognition results.

[0150] Step 603: Update the quantization features corresponding to the first speech based on the target text to obtain the updated quantization features.

[0151] In this embodiment, the speech synthesis interface may further include a text acquisition area 805, which is used to acquire the target text input by the user, so that the electronic device can acquire the target text. The electronic device can update the various quantization features corresponding to the first speech based on the target text to obtain the updated quantization features. The updated quantization features can characterize the timbre, frequency, and other information of the first speech, and can also characterize the semantics of the target text.

[0152] Step 604: Synthesize the second speech using the audio generation model based on the updated quantization features.

[0153] Electronic devices can synthesize second speech based on updated quantization features using quantizers and decoders. The synthesis method of the second speech is similar to that of reconstructed audio generation, and therefore will not be elaborated upon here.

[0154] In practical applications, electronic devices can obtain the first speech, the speech recognition result, and the target text through the speech synthesis interface. The speech synthesis interface includes a first synthesis control 808. When the user triggers the first synthesis control 808, the electronic device immediately executes steps 602 to 604 to immediately synthesize the second speech. Alternatively, the speech synthesis interface includes a second synthesis control 809. When the user triggers the second synthesis control 809, the electronic device executes steps 602 to 604 after a set time to synthesize the second speech. The speech synthesis interface may also include other information, such as a first prompt message 801, a second prompt message 806, and a third prompt message 807. The first prompt message 801 indicates the speech synthesis process, the second prompt message 806 indicates the optimization method for the synthesized speech, and the third prompt message 807 indicates the usage method of the synthesized speech. The content of each prompt message is not limited here.

[0155] Figure 9 shows a schematic diagram of the structure of a training device for an audio generation model provided in an embodiment of this application. As shown in Figure 9, the device includes the following components.

[0156] The acquisition module 901 is used to acquire sample audio and the neural network model to be trained.

[0157] The encoding module 902 is used to encode sample audio through a neural network model to obtain at least two first sample features; the at least two first sample features correspond to different time scales, and the i-th first sample feature among the at least two first sample features is used to characterize the feature information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, where i is a positive integer.

[0158] The generation module 903 is used to generate reconstructed audio based on at least two first sample features using a neural network model.

[0159] Training module 904 is used to train a neural network model using sample audio and reconstructed audio to obtain an audio generation model, which is used to generate a second audio based on the first audio.

[0160] In one possible implementation, the neural network model includes a quantizer and a decoder.

[0161] The generation module 903 is used to quantize each of the first sample features in at least two first sample features through a quantizer to obtain each first quantized feature; to fuse each first quantized feature through a quantizer to obtain reconstructed audio features; and to decode the reconstructed audio features through a decoder to obtain reconstructed audio.

[0162] In one possible implementation, the generation module 903 is used to quantize the first input information through a quantizer to obtain the i-th first quantized feature.

[0163] When i is 1, the first input information includes the first first sample feature among at least two first sample features; when i is greater than 1, the first input information includes the i-th first sample feature and the (i-1)-th residual feature, and the (i-1)-th residual feature represents the difference between the (i-1)-th first quantization feature and the (i-1)-th first sample feature.

[0164] In one possible implementation, the number of first quantization features is N, where N is a positive integer.

[0165] The generation module 903 is used to upsample the second input information through a quantizer to obtain the (N-i)-th upsampled result, and to fuse the (N-i)-th upsampled result and the i-th first quantization feature through a quantizer to obtain the (N-i)-th fused result.

[0166] When i is N-1, the second input information includes the N-th first quantization feature; when i is any positive integer from 1 to N-2, the second input information includes the (N-i-1)-th fusion result, and the (N-1)-th fusion result is the reconstructed audio feature.

[0167] In one possible implementation, the training module 904 is used to determine the i-th first loss through the i-th first sample feature and the i-th first quantization feature, the i-th first loss being used to characterize the difference between the i-th first sample feature and the i-th first quantization feature; and to train a neural network model based on the sample audio, the reconstructed audio, and each first loss to obtain an audio generation model.

[0168] In one possible implementation, training module 904 is used to determine a second loss using at least two first sample features, the second loss representing the correlation between every two first sample features; and to train a neural network model based on sample audio, reconstructed audio, and the second loss to obtain an audio generation model.

[0169] In one possible implementation, the neural network model includes a first encoder and a second encoder.

[0170] The encoding module 902 is used to encode sample audio through the first encoder to obtain reference audio features; to perform convolution processing on the third input information through the second encoder to obtain the i-th convolution result; and to determine the i-th first sample feature based on the i-th convolution result.

[0171] When i is 1, the third input information includes reference audio features; when i is greater than 1, the third input information includes the (i-1)th convolution result.

[0172] In one possible implementation, the encoding module 902 is used to perform feature extraction on the i-th convolution result to obtain local features and global features. The local features represent the local information of multiple audio segments, and the global features represent the global information of multiple audio segments. The local features and global features are fused to obtain the i-th first sample feature.

[0173] In one possible implementation, the encoding module 902 is used to downsample the fourth input information through the first encoder to obtain the j-th downsampled result.

[0174] Where j is a positive integer, when j is 1, the fourth input information includes the sample audio, when j is greater than 1, the fourth input information includes the (j-1)th downsampling result, and the last downsampling result is used to determine the reference audio features.

[0175] In one possible implementation, the generation module 903 is used to generate a second sample feature based on the j-th downsampling result through a neural network model. The second sample feature represents time-independent information in the sample audio. The reconstructed audio is generated based on the second sample feature and each first sample feature through the neural network model.

[0176] In one possible implementation, the generation module 903 is used to divide the second sample features to obtain at least two second sample sub-features; quantize each of the second sample sub-features in the at least two second sample sub-features to obtain each second quantized feature; and generate reconstructed audio based on each second quantized feature and each first sample feature through a neural network model.
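The handling of the second sample feature can be sketched as follows. The sketch assumes that time-independent information is obtained by averaging the downsampled frames over time, that the resulting vector is split into equal-sized sub-features, and that each sub-feature is quantized by a nearest-neighbor lookup in its own codebook. These are illustrative assumptions rather than details given in the source, and all names and codebook values are hypothetical.

```python
def time_pool(frames):
    """Average a list of equal-length frame vectors over the time axis."""
    n = len(frames)
    return [sum(frame[d] for frame in frames) / n for d in range(len(frames[0]))]


def split(vector, num_parts):
    """Divide a vector into `num_parts` contiguous, equal-sized sub-features."""
    if len(vector) % num_parts != 0:
        raise ValueError("vector length must be divisible by num_parts")
    size = len(vector) // num_parts
    return [vector[k * size:(k + 1) * size] for k in range(num_parts)]


def nearest_code(sub_feature, codebook):
    """Return the codebook entry closest to the sub-feature (squared distance)."""
    return min(
        codebook,
        key=lambda code: sum((a - b) ** 2 for a, b in zip(sub_feature, code)),
    )


def quantize_sub_features(vector, codebooks):
    """Quantize each sub-feature with its own codebook."""
    subs = split(vector, len(codebooks))
    return [nearest_code(sub, cb) for sub, cb in zip(subs, codebooks)]
```

For example, pooling two four-dimensional frames and quantizing with two codebooks yields two second quantized features, one per sub-feature.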

[0177] In one possible implementation, training module 904 is used to determine at least one of a third loss or a fourth loss based on sample audio and reconstructed audio, wherein the third loss characterizes the difference between the sample audio and the reconstructed audio in the time domain and the fourth loss characterizes the difference between the sample audio and the reconstructed audio in the frequency domain; and to train a neural network model based on at least one of the third loss or the fourth loss to obtain an audio generation model.
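The third and fourth losses can be illustrated with the sketch below. It assumes a mean absolute error on the waveform for the time-domain loss and a mean absolute error between magnitude spectra, computed with a plain discrete Fourier transform, for the frequency-domain loss. A practical system would more likely compare short-time or mel spectra; the source does not specify the measure, and the function names are hypothetical.

```python
import cmath


def time_domain_loss(sample, reconstructed):
    """Mean absolute difference between the two waveforms."""
    if len(sample) != len(reconstructed):
        raise ValueError("waveforms must have the same length")
    return sum(abs(a - b) for a, b in zip(sample, reconstructed)) / len(sample)


def magnitude_spectrum(signal):
    """Magnitudes of the non-negative-frequency DFT bins of a real signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def frequency_domain_loss(sample, reconstructed):
    """Mean absolute difference between the two magnitude spectra."""
    if len(sample) != len(reconstructed):
        raise ValueError("waveforms must have the same length")
    spec_s = magnitude_spectrum(sample)
    spec_r = magnitude_spectrum(reconstructed)
    return sum(abs(a - b) for a, b in zip(spec_s, spec_r)) / len(spec_s)
```

Either loss, or a weighted sum of both, could then serve as the training objective.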

[0178] In one possible implementation, the acquisition module 901 is also used to acquire the first audio.

[0179] The encoding module 902 is also used to encode the first audio through the audio generation model to obtain at least two first audio features. The at least two first audio features correspond to different time scales. The i-th first audio feature among the at least two first audio features is used to characterize the feature information of multiple audio segments obtained after segmenting the first audio according to the i-th time scale, where i is a positive integer.

[0180] The generation module 903 is also used to generate a second audio based on at least two first audio features using an audio generation model.

[0181] The aforementioned device encodes sample audio using a neural network model to obtain at least two first sample features. Since the i-th first sample feature represents information from multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, different first sample features can reflect information from the sample audio at different time scales, making each first sample feature more representative and accurately reflecting the sample audio. Based on this, highly accurate reconstructed audio can be generated from each first sample feature. This improves the training effect and accuracy of the audio generation model when training the neural network model based on the reconstructed audio and sample audio, thereby enabling the generation of high-quality audio based on the audio generation model.
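The idea that different first sample features reflect the sample audio at different time scales can be illustrated with a toy example. The sketch below segments a waveform into non-overlapping segments of several lengths and describes each segment by its root-mean-square energy. The hand-written statistic only stands in for the learned encoder of the described model, and the names `segment`, `rms`, and `multi_scale_features` are hypothetical.

```python
import math


def segment(audio, segment_length):
    """Split the audio into non-overlapping segments, dropping any remainder."""
    last_start = len(audio) - segment_length
    return [
        audio[start:start + segment_length]
        for start in range(0, last_start + 1, segment_length)
    ]


def rms(samples):
    """Root-mean-square energy of one segment."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def multi_scale_features(audio, segment_lengths):
    """Return one feature sequence per time scale (one value per segment)."""
    return [
        [rms(seg) for seg in segment(audio, length)]
        for length in segment_lengths
    ]
```

A short segment length captures fine-grained changes, while a long segment length summarizes slower trends, which is why the features at different scales carry complementary information.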

[0182] It should be understood that the device shown in Figure 9 above is only illustrated by the division of the above-described functional modules. In practical applications, the functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the device and method embodiments provided in the above embodiments belong to the same concept, and their specific implementation process can be found in the method embodiments, which will not be repeated here.

[0183] Figure 10 shows a structural block diagram of a terminal device 1000 provided in an exemplary embodiment of this application. The terminal device 1000 includes a processor 1001 and a memory 1002.

[0184] Processor 1001 may include one or more processing cores, such as a quad-core processor, an octa-core processor, etc. Processor 1001 may be implemented using at least one hardware form selected from DSP (Digital Signal Processor), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). Processor 1001 may also include a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), is used to process data in the awake state; the coprocessor is a low-power processor used to process data in the standby state. In some embodiments, processor 1001 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, processor 1001 may also include an AI (Artificial Intelligence) processor, which is used to handle computational operations related to machine learning.

[0185] The memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage media in the memory 1002 are used to store at least one computer program, which is executed by the processor 1001 to implement the training method for the audio generation model provided in the method embodiments of this application.

[0186] In some embodiments, the terminal device 1000 may further include a peripheral device interface 1003 and at least one peripheral device. The processor 1001, memory 1002, and peripheral device interface 1003 can be connected via a bus or signal line. Each peripheral device can be connected to the peripheral device interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes at least one of the following: a radio frequency circuit 1004, a display screen 1005, a camera assembly 1006, an audio circuit 1007, and a power supply 1008.

[0187] Peripheral device interface 1003 can be used to connect at least one I / O (Input / Output) related peripheral device to processor 1001 and memory 1002. In some embodiments, processor 1001, memory 1002 and peripheral device interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 1001, memory 1002 and peripheral device interface 1003 can be implemented on separate chips or circuit boards, which is not limited in this embodiment.

[0188] The radio frequency (RF) circuit 1004 is used to receive and transmit RF signals, also known as electromagnetic signals. The RF circuit 1004 communicates with communication networks and other communication devices via electromagnetic signals. The RF circuit 1004 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. For example, the RF circuit 1004 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, etc. The RF circuit 1004 can communicate with other terminals through at least one wireless communication protocol. This wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and / or WiFi (Wireless Fidelity) networks. In some embodiments, the RF circuit 1004 may also include circuitry related to NFC (Near Field Communication), which is not limited in this application.

[0189] Display screen 1005 is used to display a UI (User Interface). This UI may include graphics, text, icons, videos, and any combination thereof. When display screen 1005 is a touch display screen, it also has the ability to collect touch signals on or above its surface. These touch signals can be input as control signals to processor 1001 for processing. In this case, display screen 1005 can also be used to provide virtual buttons and / or a virtual keyboard, also known as soft buttons and / or a soft keyboard. In some embodiments, display screen 1005 may be a single screen, disposed on the front panel of terminal device 1000; in other embodiments, display screen 1005 may be at least two, disposed on different surfaces of terminal device 1000 or in a folded design; in still other embodiments, display screen 1005 may be a flexible display screen, disposed on a curved or folded surface of terminal device 1000. Furthermore, display screen 1005 may also be configured as a non-rectangular, irregular shape, i.e., a non-rectangular screen. The display screen 1005 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).

[0190] The camera assembly 1006 is used to acquire images or videos. For example, the camera assembly 1006 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal. In some embodiments, there are at least two rear-facing cameras, each being any one of a main camera, a depth-sensing camera, a wide-angle camera, and a telephoto camera, so as to achieve background blurring by fusing the main camera and the depth-sensing camera, panoramic shooting and VR (Virtual Reality) shooting by fusing the main camera and the wide-angle camera, or other fusion shooting functions. In some embodiments, the camera assembly 1006 may also include a flash. The flash can be a single-color temperature flash or a dual-color temperature flash. A dual-color temperature flash refers to a combination of a warm-light flash and a cool-light flash, which can be used for light compensation at different color temperatures.

[0191] The audio circuit 1007 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, converting the sound waves into electrical signals that are input to the processor 1001 for processing, or input to the radio frequency circuit 1004 for voice communication. For stereo sound acquisition or noise reduction purposes, multiple microphones may be used, each located at a different part of the terminal device 1000. The microphone may also be an array microphone or an omnidirectional microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The speaker may be a conventional diaphragm speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, it can convert electrical signals not only into audible sound waves but also into inaudible sound waves for purposes such as distance measurement. In some embodiments, the audio circuit 1007 may also include a headphone jack.

[0192] The power supply 1008 is used to power the various components in the terminal device 1000. The power supply 1008 can be AC power, DC power, a disposable battery, or a rechargeable battery. When the power supply 1008 includes a rechargeable battery, the rechargeable battery can be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery that is charged via a wired line, while a wireless rechargeable battery is a battery that is charged via a wireless coil. The rechargeable battery can also be used to support fast charging technology.

[0193] In some embodiments, the terminal device 1000 further includes one or more sensors 1009. The one or more sensors 1009 include, but are not limited to: an acceleration sensor 1011, a gyroscope sensor 1012, a pressure sensor 1013, an optical sensor 1014, and a proximity sensor 1015.

[0194] The acceleration sensor 1011 can detect the magnitude of acceleration along the three coordinate axes of a coordinate system established by the terminal device 1000. For example, the acceleration sensor 1011 can be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 1001 can control the display screen 1005 to display the user interface in either a landscape or portrait view based on the gravitational acceleration signal acquired by the acceleration sensor 1011. The acceleration sensor 1011 can also be used to acquire motion data of a game or of the user.

[0195] The gyroscope sensor 1012 can detect the orientation and rotation angle of the terminal device 1000. The gyroscope sensor 1012 can work in conjunction with the acceleration sensor 1011 to collect the user's 3D movements on the terminal device 1000. Based on the data collected by the gyroscope sensor 1012, the processor 1001 can perform the following functions: motion sensing (e.g., changing the UI based on the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.

[0196] The pressure sensor 1013 can be disposed on the side bezel of the terminal device 1000 and / or on the lower layer of the display screen 1005. When the pressure sensor 1013 is disposed on the side bezel of the terminal device 1000, it can detect the user's grip signal on the terminal device 1000, and the processor 1001 can perform left / right hand recognition or quick operation based on the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is disposed on the lower layer of the display screen 1005, the processor 1001 can control the operable controls on the UI based on the user's pressure operation on the display screen 1005. The operable controls include at least one of button controls, scroll bar controls, icon controls, and menu controls.

[0197] An optical sensor 1014 is used to collect ambient light intensity. In one embodiment, the processor 1001 can control the display brightness of the display screen 1005 based on the ambient light intensity collected by the optical sensor 1014. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the display screen 1005 is decreased. In another embodiment, the processor 1001 can also dynamically adjust the shooting parameters of the camera assembly 1006 based on the ambient light intensity collected by the optical sensor 1014.

[0198] The proximity sensor 1015, also known as a distance sensor, is typically installed on the front panel of the terminal device 1000. The proximity sensor 1015 is used to detect the distance between the user and the front of the terminal device 1000. In one embodiment, when the proximity sensor 1015 detects that the distance between the user and the front of the terminal device 1000 is gradually decreasing, the processor 1001 controls the display screen 1005 to switch from a screen-on state to a screen-off state; when the proximity sensor 1015 detects that the distance between the user and the front of the terminal device 1000 is gradually increasing, the processor 1001 controls the display screen 1005 to switch from a screen-off state to a screen-on state.

[0199] Those skilled in the art will understand that the structure shown in Figure 10 does not constitute a limitation on the terminal device 1000, and that the terminal device 1000 may include more or fewer components than shown, combine certain components, or use a different component arrangement.

[0200] Figure 11 is a schematic diagram of the server structure provided in an embodiment of this application. The server 1100 can vary considerably due to different configurations or performance. It may include one or more processors 1101 and one or more memories 1102. The one or more memories 1102 store at least one computer program, which is loaded and executed by the one or more processors 1101 to implement the training method of the audio generation model provided in the above-described method embodiments. For example, the processor 1101 is a CPU. Of course, the server 1100 may also have wired or wireless network interfaces, a keyboard, and input / output interfaces for input and output. The server 1100 may also include other components for implementing device functions, which will not be elaborated here.

[0201] In an exemplary embodiment, a computer-readable storage medium is also provided, which stores at least one computer program that is loaded and executed by a processor to enable an electronic device to implement the training method of any of the above-described audio generation models.

[0202] For example, the aforementioned computer-readable storage media may be read-only memory (ROM), random access memory (RAM), compact disc read-only memory (CD-ROM), magnetic tape, floppy disk, and optical data storage devices, etc.

[0203] In an exemplary embodiment, a computer program is also provided. There is at least one such computer program, and the at least one computer program is loaded and executed by a processor to enable an electronic device to implement the training method of any of the above-described audio generation models.

[0204] In an exemplary embodiment, a computer program product is also provided, which stores at least one computer program that is loaded and executed by a processor to enable an electronic device to implement the training method of any of the above-described audio generation models.

[0205] It should be understood that "multiple" as used in this application refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.

[0206] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0207] The above description is merely an exemplary embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the principles of this application should be included within the protection scope of this application.

Claims

1. A method for training an audio generation model, the method comprising: acquiring sample audio and a neural network model to be trained; encoding the sample audio through the neural network model to obtain at least two first sample features, wherein the at least two first sample features correspond to different time scales, and the i-th first sample feature among the at least two first sample features is used to characterize the feature information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, where i is a positive integer; generating reconstructed audio through the neural network model based on the at least two first sample features; and training the neural network model using the sample audio and the reconstructed audio to obtain an audio generation model, wherein the audio generation model is used to generate second audio based on first audio.

2. The method of claim 1, wherein, The neural network model includes a quantizer and a decoder; the generation of reconstructed audio based on the at least two first sample features using the neural network model includes: Each first sample feature among the at least two first sample features is quantized by the quantizer to obtain each first quantized feature; The reconstructed audio features are obtained by fusing the various first quantization features using the quantizer; The reconstructed audio is obtained by decoding the reconstructed audio features using the decoder.

3. The method of claim 2, wherein, The step of quantizing each of the at least two first sample features using the quantizer to obtain each first quantized feature includes: The first input information is quantized by the quantizer to obtain the i-th first quantized feature; Wherein, when i is 1, the first input information includes the first first sample feature among the at least two first sample features; when i is greater than 1, the first input information includes the i-th first sample feature and the (i-1)-th residual feature, wherein the (i-1)-th residual feature characterizes the difference between the (i-1)-th first quantization feature and the (i-1)-th first sample feature.

4. The method according to claim 2 or 3, characterized in that there are N first quantized features, where N is a positive integer; the fusing of the first quantized features through the quantizer to obtain reconstructed audio features includes: upsampling the second input information through the quantizer to obtain the (N-i)-th upsampling result; and fusing the (N-i)-th upsampling result and the i-th first quantized feature through the quantizer to obtain the (N-i)-th fusion result; wherein, when i is N-1, the second input information includes the N-th first quantized feature; when i is any positive integer from 1 to N-2, the second input information includes the (N-i-1)-th fusion result; and the (N-1)-th fusion result is the reconstructed audio feature.

5. The method according to any one of claims 2 to 4, characterized in that, The step of training the neural network model using the sample audio and the reconstructed audio to obtain the audio generation model includes: The i-th first loss is determined by the i-th first sample feature and the i-th first quantization feature, and the i-th first loss is used to characterize the difference between the i-th first sample feature and the i-th first quantization feature; Based on the sample audio, the reconstructed audio, and each first loss, the neural network model is trained to obtain the audio generation model.

6. The method according to any one of claims 1 to 5, characterized in that, The step of training the neural network model using the sample audio and the reconstructed audio to obtain the audio generation model includes: A second loss is determined using the at least two first sample features, wherein the second loss characterizes the correlation between each pair of first sample features; Based on the sample audio, the reconstructed audio, and the second loss, the neural network model is trained to obtain the audio generation model.

7. The method according to any one of claims 1 to 6, characterized in that, The neural network model includes a first encoder and a second encoder; the process of encoding the sample audio through the neural network model to obtain at least two first sample features includes: The sample audio is encoded by the first encoder to obtain reference audio features; The third input information is processed by the second encoder to obtain the i-th convolution result, and the i-th first sample feature is determined based on the i-th convolution result. Wherein, when i is 1, the third input information includes the reference audio features; when i is greater than 1, the third input information includes the (i-1)th convolution result.

8. The method of claim 7, wherein, The step of determining the i-th first sample feature based on the i-th convolution result includes: Feature extraction is performed on the i-th convolution result to obtain local features and global features. The local features represent the local information of the multiple audio segments, and the global features represent the global information of the multiple audio segments. The local features and the global features are fused to obtain the i-th first sample feature.

9. The method according to claim 7 or 8, characterized in that, The step of encoding the sample audio through the first encoder to obtain reference audio features includes: The first encoder downsamples the fourth input information to obtain the j-th downsampling result; Wherein, j is a positive integer. When j is 1, the fourth input information includes the sample audio. When j is greater than 1, the fourth input information includes the (j-1)th downsampling result. The last downsampling result is used to determine the reference audio features.

10. The method of claim 9, wherein, The process of generating reconstructed audio based on the at least two first sample features using the neural network model includes: The neural network model generates a second sample feature based on the j-th downsampling result, and the second sample feature represents time-independent information in the sample audio. The neural network model generates reconstructed audio based on the second sample features and each of the first sample features.

11. The method of claim 10, wherein the generating of reconstructed audio based on the second sample feature and each of the first sample features using the neural network model includes: dividing the second sample feature to obtain at least two second sample sub-features; quantizing each of the at least two second sample sub-features to obtain each second quantized feature; and generating the reconstructed audio through the neural network model based on each second quantized feature and each first sample feature.

12. The method according to any one of claims 1 to 11, characterized in that, The step of training the neural network model using the sample audio and the reconstructed audio to obtain the audio generation model includes: Based on the sample audio and the reconstructed audio, at least one of a third loss or a fourth loss is determined, wherein the third loss characterizes the difference between the sample audio and the reconstructed audio in the time domain, and the fourth loss characterizes the difference between the sample audio and the reconstructed audio in the frequency domain; The neural network model is trained based on at least one of the third loss or the fourth loss to obtain the audio generation model.

13. The method according to any one of claims 1 to 12, characterized in that the method further includes: acquiring the first audio; encoding the first audio through the audio generation model to obtain at least two first audio features, wherein the at least two first audio features correspond to different time scales, and the i-th first audio feature among the at least two first audio features is used to characterize the feature information of multiple audio segments obtained after segmenting the first audio according to the i-th time scale, where i is a positive integer; and generating the second audio based on the at least two first audio features using the audio generation model.

14. An apparatus for training an audio generation model, the apparatus comprising: an acquisition module, used to acquire sample audio and a neural network model to be trained; an encoding module, used to encode the sample audio through the neural network model to obtain at least two first sample features, wherein the at least two first sample features correspond to different time scales, and the i-th first sample feature among the at least two first sample features is used to characterize the feature information of multiple audio segments obtained after segmenting the sample audio according to the i-th time scale, where i is a positive integer; a generation module, used to generate reconstructed audio based on the at least two first sample features using the neural network model; and a training module, used to train the neural network model using the sample audio and the reconstructed audio to obtain an audio generation model, wherein the audio generation model is used to generate second audio based on first audio.

15. An electronic device, comprising a processor and a memory, the memory storing at least one computer program, which is loaded and executed by the processor to enable the electronic device to implement the training method of the audio generation model as described in any one of claims 1 to 13.

16. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores at least one computer program, which is loaded and executed by a processor to enable an electronic device to implement the training method of the audio generation model as described in any one of claims 1 to 13.

17. A computer program product, characterized in that the computer program product stores at least one computer program, which is loaded and executed by a processor to enable an electronic device to implement the training method of the audio generation model as described in any one of claims 1 to 13.