Speech translation, model training method and device, equipment and storage medium

By extracting audio features directly from source language audio frames, determining target speech units, and synthesizing target language audio with a neural network and HiFi-GAN, the method avoids the intermediate text steps required by existing technologies and achieves efficient speech translation.

CN114783428B · Active · Publication Date: 2026-06-23 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date
2022-02-28
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Current speech translation technology requires an intermediate text step, making the translation process complex and inefficient.

Method used

Audio features are extracted directly from the source language audio frames to determine the target speech units. The target language audio is generated based on the time sequence. A neural network model is used for feature extraction and upsampling. The speech is synthesized by combining HiFi-GAN, omitting the intermediate text step.

Benefits of technology

It simplifies the voice translation process, improves translation efficiency, and ensures the accuracy and consistency of translations.

✦ Generated by Eureka AI based on patent content.

Figure CN114783428B_ABST

Abstract

The present disclosure provides a speech translation method, a model training method, and a corresponding apparatus, device, and storage medium. It relates to the technical field of data processing, and in particular to audio data processing. The specific implementation scheme is as follows: audio features of each audio frame in the source language audio to be translated are extracted; based on the audio features of each audio frame, the speech unit of the target language corresponding to each audio frame is determined as a target speech unit, wherein each speech unit is audio data corresponding to one acoustic category; and target language audio is generated based on the temporal order of the audio frames in the source language audio and the target speech unit corresponding to each audio frame. Applying the scheme provided by the embodiments of the present disclosure to speech translation can improve the efficiency of speech translation.

Description

Technical Field

[0001] This disclosure relates to the field of data processing technology, particularly to the field of audio data processing technology, and further to speech translation, model training methods, apparatus, devices, and storage media.

Background Technology

[0002] The process of speech translation refers to translating audio from a source language into audio from a target language. For example, if the source language is Chinese and the target language is Spanish, then the speech translation process described above refers to translating Chinese audio into Spanish audio.

[0003] The related art achieves speech translation through the following three steps: first, the source language audio is recognized to obtain the source language text; second, the source language text is translated into the target language text; and finally, the target language text is converted into speech to obtain the target language audio.

Summary of the Invention

[0004] This disclosure provides a speech translation method, a model training method, and a corresponding apparatus, device, and storage medium.

[0005] According to one aspect of this disclosure, a speech translation method is provided, comprising:

[0006] Extract the audio features of each audio frame in the source language audio to be translated;

[0007] Based on the audio features of each audio frame, the speech unit of the target language corresponding to each audio frame is determined as a target speech unit. Each speech unit is audio data corresponding to one acoustic category.

[0008] Based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame, the target language audio is generated.

[0009] According to another aspect of this disclosure, a model training method is provided, comprising:

[0010] The sample features of each sample audio frame in the sample source language audio are input into the initial speech conversion model to obtain, as output by the initial speech conversion model, the first speech unit of the target language corresponding to each sample audio frame and the second speech unit of the source language corresponding to each sample audio frame. Each speech unit is audio data corresponding to one acoustic category.

[0011] Based on the first speech unit, the second speech unit, the third speech unit, and the fourth speech unit, the first loss of the initial speech conversion model in performing speech unit conversion is calculated. Each third speech unit is a speech unit of the target language corresponding to an audio frame in the sample target language audio. Each fourth speech unit is a speech unit of the source language corresponding to a sample audio frame. The sample source language audio and the sample target language audio have the same semantics.

[0012] The initial speech conversion model is adjusted based on the first loss to obtain the target speech conversion model.

[0013] According to another aspect of this disclosure, a speech translation apparatus is provided, comprising:

[0014] The feature extraction module is used to extract the audio features of each audio frame in the source language audio to be translated;

[0015] The first unit determination module is used to determine, based on the audio features of each audio frame, the speech unit of the target language corresponding to each audio frame as a target speech unit. Each speech unit is audio data corresponding to one acoustic category.

[0016] The audio generation module is used to generate target language audio based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame.

[0017] According to another aspect of this disclosure, a model training apparatus is provided, comprising:

[0018] The sample unit determination module is used to input the sample features of each sample audio frame in the sample source language audio into the initial speech conversion model to obtain, as output by the initial speech conversion model, the first speech unit of the target language corresponding to each sample audio frame and the second speech unit of the source language corresponding to each sample audio frame. Each speech unit is audio data corresponding to one acoustic category.

[0019] The first loss calculation module is used to calculate the first loss of the initial speech conversion model in performing speech unit conversion, based on the first speech unit, the second speech unit, the third speech unit, and the fourth speech unit. Each third speech unit is a speech unit of the target language corresponding to an audio frame in the sample target language audio. Each fourth speech unit is a speech unit of the source language corresponding to a sample audio frame. The sample source language audio and the sample target language audio have the same semantics.

[0020] The first model acquisition module is used to adjust the model parameters of the initial speech conversion model based on the first loss to obtain the target speech conversion model.

[0021] According to another aspect of this disclosure, an electronic device is provided, comprising:

[0022] At least one processor; and

[0023] A memory communicatively connected to the at least one processor; wherein,

[0024] The memory stores instructions that can be executed by the at least one processor, which, when executed by the at least one processor, enables the at least one processor to perform the aforementioned speech translation method or model training method.

[0025] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform a speech translation method or a model training method.

[0026] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements a speech translation method or a model training method.

[0027] As can be seen from the above, the solution provided in this disclosure, in the process of speech translation from source language audio to target language audio, can determine the target speech units of the target language corresponding to each audio frame in the source language audio. Each target speech unit is a segment of audio data, and different target speech units have different pronunciations. Based on the determined target speech units, a complete target language audio can be obtained by combining them. The speech translation process of this solution only involves two steps: determining the target speech units and generating target language audio based on the target speech units. Unlike related technologies, the speech translation process of this solution only involves the conversion between audio data and does not require the use of text data. Therefore, it does not involve the conversion process between audio data and text data, making the speech translation process provided by this solution simpler and more efficient.

[0028] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0029] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0030] Figure 1A is a flowchart of the first speech translation method provided in an embodiment of this disclosure;

[0031] Figure 1B is a schematic diagram of the structure of an audio conversion model provided in an embodiment of this disclosure;

[0032] Figure 2A is a flowchart of the second speech translation method provided in an embodiment of this disclosure;

[0033] Figure 2B is a schematic diagram of the structure of a speech unit conversion model provided in an embodiment of this disclosure;

[0034] Figure 3 is a flowchart of the first model training method provided in an embodiment of this disclosure;

[0035] Figure 4 is a flowchart of the second model training method provided in an embodiment of this disclosure;

[0036] Figure 5 is a flowchart of the third model training method provided in an embodiment of this disclosure;

[0037] Figure 6 is a schematic diagram of the structure of the first speech translation device provided in an embodiment of this disclosure;

[0038] Figure 7 is a schematic diagram of the structure of the second speech translation device provided in an embodiment of this disclosure;

[0039] Figure 8 is a schematic diagram of the structure of the first model training device provided in an embodiment of this disclosure;

[0040] Figure 9 is a schematic diagram of the structure of the second model training device provided in an embodiment of this disclosure;

[0041] Figure 10 is a schematic diagram of the structure of the third model training device provided in an embodiment of this disclosure;

[0042] Figure 11 is a block diagram of an electronic device used to implement the speech translation and model training methods of the embodiments of this disclosure.

Detailed Implementation

[0043] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0044] First, the implementing entity of the embodiments of this disclosure will be described.

[0045] The embodiments of this disclosure can be executed by an electronic device with a speech translation function. The electronic device can be a translator device, mobile phone, computer, in-vehicle intelligent device, intelligent robot, etc.

[0046] Next, the application scenarios of the embodiments of this disclosure will be described.

[0047] The embodiments disclosed herein are applied to application scenarios that translate source language audio containing speaker voice into target language audio.

[0048] For example, after a user activates an electronic device, they can input audio in the source language. The device can then perform speech translation to obtain audio in the target language. This source language audio can be either audio captured by the electronic device itself while the user is speaking, or pre-captured audio input into the device.

[0049] The following provides a detailed description of the speech translation method provided in the embodiments of this disclosure.

[0050] See Figure 1A, which is a flowchart of the first speech translation method provided in an embodiment of this disclosure. The method includes the following steps S101-S103.

[0051] S101: Extract the audio features of each audio frame in the source language audio to be translated.

[0052] Specifically, each audio frame in the source language audio to be translated has a fixed duration, such as 25 ms or 30 ms. In this embodiment, the language of the source language audio itself is the source language. In addition, the time difference between the start times of any two adjacent audio frames can be less than the frame duration, so that adjacent audio frames overlap, that is, they share part of the same audio data. This time difference is called the frame shift and is, for example, 10 ms or 15 ms.
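
As a rough illustration of the framing described above, the following sketch splits a sampled signal into fixed-length frames whose start times are one frame shift apart. The function name `split_into_frames`, the 16 kHz sample rate, and the choice to drop a trailing partial frame are assumptions made for illustration and are not specified in this disclosure.

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a list of samples into fixed-length frames with a fixed frame shift.

    A frame shift shorter than the frame length makes adjacent frames overlap.
    Trailing samples that do not fill a whole frame are dropped (an assumption).
    """
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    shift = sample_rate * shift_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of placeholder audio at 16 kHz.
audio = [0.0] * 16000
frames = split_into_frames(audio)
print(len(frames), len(frames[0]))  # 98 400
```

With a 25 ms frame and a 10 ms shift, each frame shares its last 15 ms with the first 15 ms of the next frame.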

[0053] In one embodiment of this disclosure, audio features of each audio frame can be extracted based on factors such as root mean square energy, zero-crossing rate, and spectral flatness. For each audio frame, the FBank (filter bank) features can be extracted as the audio features of that audio frame. Alternatively, other forms of features of the audio frame can also be extracted as audio features; this embodiment does not limit this approach.
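
Two of the quantities named above, root mean square energy and zero-crossing rate, can be computed per frame as in the following sketch. The function names and the two-element feature layout are assumptions for illustration; spectral flatness and FBank features are omitted here because they require a spectral transform.

```python
import math


def rms_energy(frame):
    """Root mean square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_features(frame):
    """Return a small feature vector [rms, zcr] for one frame."""
    return [rms_energy(frame), zero_crossing_rate(frame)]


# A 25 ms frame of a 440 Hz sine sampled at 16 kHz (placeholder input).
frame = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(400)]
features = frame_features(frame)
print(features)
```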

[0054] S102: Based on the audio features of each audio frame, determine the speech unit of the target language corresponding to each audio frame, and use it as the target speech unit.

[0055] Each speech unit is audio data corresponding to a specific acoustic category. Different speech units correspond to different acoustic categories. Different audio frames can correspond to different target speech units or the same target speech unit. Audio frames belonging to the same acoustic category correspond to the same speech unit. For example, the aforementioned speech unit can be an audio segment pronounced (or with a pronunciation close to) the pinyin a, o, or e.

[0056] Audio data belonging to the same acoustic category (audio data of the same speech unit) have similar audio features. For example, audio data with a similarity higher than a preset similarity threshold are classified into the corresponding acoustic category. Since a speech unit includes audio data of the same acoustic category, the audio data within that unit are highly likely to have similar pronunciations and represent similar content.

[0057] In addition, the target language mentioned above can be the default language or a language selected by the user.

[0058] In one embodiment of this disclosure, the target speech unit corresponding to each audio frame can be determined based on the correspondence between the audio features of the preset audio frame and the speech unit of the target language.

[0059] In another embodiment of this disclosure, step S102 can also be implemented through steps S102A-S102D shown in Figure 2A, which will not be described in detail here.

[0060] S103: Generate target language audio based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame.

[0061] Specifically, the target language audio can be obtained by splicing together the target speech units according to the temporal order of the corresponding audio frames in the source language audio.
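
The splicing described above amounts to concatenating the audio data of each target speech unit in frame order. The sketch below assumes each speech unit is identified by an integer ID and stored as a short list of samples; the inventory `UNIT_WAVEFORMS` and its values are hypothetical.

```python
# Hypothetical inventory: speech unit ID -> short waveform (list of samples).
UNIT_WAVEFORMS = {
    0: [0.0, 0.1, 0.0],
    1: [0.5, -0.5],
    2: [0.2, 0.2, 0.2, 0.2],
}


def splice_units(unit_ids, inventory=UNIT_WAVEFORMS):
    """Concatenate unit waveforms in the order of their source frames."""
    audio = []
    for unit_id in unit_ids:      # unit_ids is already in temporal order
        audio.extend(inventory[unit_id])
    return audio


target_audio = splice_units([1, 0, 1])
print(target_audio)  # [0.5, -0.5, 0.0, 0.1, 0.0, 0.5, -0.5]
```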

[0062] Alternatively, in another embodiment of this disclosure, step S103 can also be achieved through the following steps A-C.

[0063] Step A: Extract features from the target speech unit sequence.

[0064] The aforementioned target speech unit sequence includes: target speech units corresponding to each audio frame, arranged in temporal order according to the corresponding audio frame in the aforementioned source language audio.

[0065] Specifically, step A above can be implemented using a neural network model. Specifically, the target speech unit sequence is input into the convolutional layer of the neural network model for convolution transformation, and the result of the convolution transformation is input into the normalization layer of the neural network model for normalization processing. Then, the linear transformation function contained in the neural network model is used to perform linear transformation processing on the normalized result, and the result of the linear transformation is used as the final feature extraction result.

[0066] In one embodiment of this disclosure, the convolutional layer, normalization layer, and linear transformation function described above may be included in the duration prediction module of the neural network model.

[0067] The duration prediction module described above consists of the following layers connected in sequence: Conv1D (one-dimensional convolution), Layer Normalization, Conv1D, Layer Normalization, and Linear (linear transformation). Conv1D performs initial feature extraction, Layer Normalization performs normalization, and Linear performs the linear transformation.
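
The layer order of the duration prediction module can be sketched with NumPy as follows. This is a minimal, untrained illustration: the embedding lookup that turns unit IDs into vectors, the kernel size of 3, the channel count, the single output value per unit, and the random weights are all assumptions, and no activation functions are added because the description above lists none.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1d(x, weight, bias):
    """1-D convolution over time with zero padding that keeps the length.

    x: (time, in_ch), weight: (kernel, in_ch, out_ch), bias: (out_ch,)
    """
    kernel = weight.shape[0]
    pad = kernel // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([
        np.einsum("kc,kco->o", padded[t:t + kernel], weight)
        for t in range(x.shape[0])
    ])
    return out + bias


def layer_norm(x, eps=1e-5):
    """Normalize each time step across its channels."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def duration_features(unit_ids, params):
    """Conv1D -> LayerNorm -> Conv1D -> LayerNorm -> Linear."""
    x = params["embedding"][unit_ids]                 # (time, channels)
    x = layer_norm(conv1d(x, params["w1"], params["b1"]))
    x = layer_norm(conv1d(x, params["w2"], params["b2"]))
    return x @ params["w_out"] + params["b_out"]      # (time, 1)


num_units, channels, kernel = 100, 8, 3
params = {
    "embedding": rng.normal(size=(num_units, channels)),
    "w1": rng.normal(size=(kernel, channels, channels)),
    "b1": np.zeros(channels),
    "w2": rng.normal(size=(kernel, channels, channels)),
    "b2": np.zeros(channels),
    "w_out": rng.normal(size=(channels, 1)),
    "b_out": np.zeros(1),
}
features = duration_features(np.array([3, 3, 17, 42, 42, 42]), params)
print(features.shape)  # (6, 1)
```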

[0068] Step B: Based on the feature extraction results, the above target speech unit sequence is upsampled to obtain the upsampled result.

[0069] Specifically, the feature extraction results represent the overall features of the target speech unit sequence. When the target speech unit sequence is upsampled, the feature extraction results guide the upsampling process so that it preserves those overall features and the upsampling results match the feature extraction results.

[0070] In one embodiment of this disclosure, the feature extraction results and the target speech unit sequence can be input together into the upsampler layer, which performs upsampling on the target speech unit sequence.
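
One possible reading of this upsampling step, common in unit-to-speech systems, is that the feature extraction result gives a duration (in output frames) for each speech unit and the upsampling repeats each unit that many times. That interpretation, the function name `upsample_units`, and the rounding and clamping rules are assumptions; the disclosure only states that both inputs are fed to the upsampling layer together.

```python
def upsample_units(unit_ids, durations):
    """Repeat each unit ID according to its predicted duration (in frames).

    Durations are rounded and clamped to at least 1 so that no unit
    disappears (the clamping rule is an assumption).
    """
    if len(unit_ids) != len(durations):
        raise ValueError("unit_ids and durations must have the same length")
    upsampled = []
    for unit_id, duration in zip(unit_ids, durations):
        repeats = max(1, round(duration))
        upsampled.extend([unit_id] * repeats)
    return upsampled


print(upsample_units([7, 12, 7], [2.2, 0.3, 3.0]))  # [7, 7, 12, 7, 7, 7]
```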

[0071] Step C: Based on the above upsampling results, perform speech synthesis to generate audio in the target language.

[0072] In one embodiment of this disclosure, the above-mentioned upsampling result can be input into a vocoder, which can then perform speech synthesis and output audio in the target language.

[0073] The aforementioned vocoder can be HiFi-GAN (High-Fidelity Generative Adversarial Network), which consists of Conv1D, ReLU (Rectified Linear Unit), and ResBlock (Residual Block) components.

[0074] Specifically, the HiFi-GAN described above consists of Conv1D, ReLU, Conv1D, ReLU, ResBlock, Conv1D, ReLU, Conv1D, and ReLU connected in sequence. HiFi-GAN is a vocoder in related technologies, which will not be described in detail in this embodiment.

[0075] In the process of generating target language audio, the embodiments of this disclosure first extract features from the target speech unit sequence to determine the overall features of the target speech unit sequence, and then upsample the target speech unit sequence based on the feature extraction results, so that the upsampling results maintain the original overall features of the target speech unit sequence, thereby ensuring the consistency of data before and after the upsampling process and improving the accuracy of the generated target language audio.

[0076] Furthermore, in this embodiment, the duration prediction module, the upsampler layer, and HiFi-GAN can together form an audio conversion model. See Figure 1B, which is a schematic diagram of the structure of an audio conversion model provided in an embodiment of this disclosure.

[0077] As shown in the figure, the target speech unit sequence is input into the duration prediction module to obtain the feature extraction result; the feature extraction result and the target speech unit sequence are then input into the upsampler layer to obtain the upsampling result; and the upsampling result is input into HiFi-GAN to obtain the target language audio.

[0078] Furthermore, the initial audio conversion model can be trained using sample target language audio and pre-obtained sample target speech unit sequences to obtain the audio conversion model.

[0079] Specifically, the sample target speech unit sequence contains the sample target speech units corresponding to each audio frame in the sample target language audio, arranged in the same order as the corresponding audio frames in the sample target language audio. The sample target speech unit sequence is input into the initial audio conversion model. Based on the output result and the sample target speech unit sequence, the loss of the initial audio conversion model in generating the target language audio is calculated. Based on the loss, the model parameters of the initial audio conversion model are adjusted to obtain the audio conversion model.

[0080] In one embodiment of this disclosure, the aforementioned loss can be calculated using the MSE (Mean Square Error) loss function.
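
For reference, the mean square error between two equal-length sequences can be computed as in the sketch below. Which two quantities are compared during training is not detailed here; the example assumes, for illustration only, that generated audio samples are compared with reference audio samples.

```python
def mse_loss(predicted, reference):
    """Mean squared error between two equal-length sample sequences."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


generated = [0.0, 0.5, -0.5, 1.0]   # placeholder model output
reference = [0.0, 0.0, -0.5, 0.0]   # placeholder reference audio
print(mse_loss(generated, reference))  # 0.3125
```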

[0081] As can be seen from the above, the solution provided in this disclosure, in the process of speech translation from source language audio to target language audio, can directly determine the target speech units corresponding to each audio frame in the source language audio. Each target speech unit is a segment of audio data, and different target speech units have different pronunciations. Based on the determined target speech units, a complete target language audio can be obtained by combining them. The speech translation process of this solution only involves two steps: determining the target speech units and generating target language audio based on the target speech units. Unlike related technologies, the speech translation process of this solution only involves the conversion between audio data and does not require the use of text data. Therefore, it does not involve the conversion process between audio data and text data, making the speech translation process provided by this solution simpler and more efficient.

[0082] See Figure 2A, which is a flowchart of the second speech translation method provided in an embodiment of this disclosure. Compared with the embodiment shown in Figure 1A, step S102 can be implemented by the following steps S102A-S102D.

[0083] S102A: Encode the audio features of each audio frame separately to obtain the encoded features of each audio frame.

[0084] Specifically, encoding the audio features of an audio frame is equivalent to quantizing the audio features. After encoding, the encoded features of each audio frame have the same form. Unifying the original audio features into the same form is beneficial for subsequent feature processing.

[0085] S102B: Perform feature mining on the encoded features of each audio frame to obtain the hidden layer features of each audio frame.

[0086] The hidden layer features are features that contain implicit information about the audio frames.

[0087] Specifically, feature mining can be performed based on an attention mechanism. By leveraging information such as the temporal relationships between audio frames and the importance of different audio frames within the source language audio, implicit information in the encoded features can be extracted, resulting in information-rich hidden layer features. Target speech units obtained from these hidden layer features are highly accurate.
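
As an illustration of attention-based feature mining, the sketch below applies single-head scaled dot-product self-attention to a sequence of encoded frame features, so that each output row mixes information from all frames according to learned importance weights. The specific attention variant, the projection matrices `w_q`, `w_k`, `w_v`, and the dimensions are assumptions; the disclosure does not specify which attention mechanism is used.

```python
import numpy as np


def self_attention(encoded, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of frame features.

    encoded: (time, dim). Returns hidden features of shape (time, dim_v)
    and the (time, time) attention weights.
    """
    q, k, v = encoded @ w_q, encoded @ w_k, encoded @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # (time, time)
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over frames
    return weights @ v, weights


rng = np.random.default_rng(0)
time_steps, dim = 5, 4
encoded = rng.normal(size=(time_steps, dim))          # placeholder encoded features
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
hidden, weights = self_attention(encoded, w_q, w_k, w_v)
print(hidden.shape, weights.sum(axis=-1))
```

Each row of `weights` sums to 1 and indicates how much every other frame contributes to that frame's hidden layer feature.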

[0088] S102C: Decode the hidden layer features of each audio frame to obtain the decoded features corresponding to each audio frame.

[0089] S102D: For each audio frame, determine the speech unit of the target language that matches the decoded features of the audio frame, and use the determined speech unit as the target speech unit corresponding to the audio frame.

[0090] Specifically, since different speech units correspond to different acoustic categories, that is, different speech units have different sounds and meanings, the characteristics of different speech units are also different.

[0091] In one embodiment of this disclosure, the decoding features of each audio frame can be matched with the features of speech units of each target language to determine the speech unit that matches the decoding features, and then the speech unit that matches the decoding features of the audio frame can be taken as the target speech unit corresponding to the audio frame.
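
The matching step can be sketched as picking, for each frame, the speech unit whose feature vector is most similar to the frame's decoded feature. Cosine similarity is an assumption here, as are the function names and the example vectors; the disclosure does not specify the matching criterion.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_unit(decoded_feature, unit_features):
    """Return the index of the unit feature most similar to the decoded feature."""
    scores = [cosine_similarity(decoded_feature, u) for u in unit_features]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical feature vectors for three target-language speech units.
unit_features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoded = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.0]]
print([match_unit(d, unit_features) for d in decoded])  # [1, 0]
```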

[0092] In this embodiment, the audio features of each audio frame are first encoded to obtain easily processed encoded features. Then, feature mining is performed on the encoded features based on an attention mechanism to obtain hidden layer features containing rich information. Finally, the target speech unit corresponding to each audio frame is determined based on the decoded features obtained after decoding the hidden layer features. In this embodiment, the target speech unit is determined by combining multiple types of information contained in the hidden layer features, making the target speech unit obtained in this embodiment more accurate.

[0093] Alternatively, step S102A can be achieved through step D1.

[0094] Step D1: Input the audio features of each audio frame into the encoding layer of the target speech conversion model, encode the audio features, and obtain the encoded features of each audio frame.

[0095] The aforementioned target speech conversion model also includes a feature mining layer, a decoding layer, and an output layer.

[0096] Specifically, the aforementioned feature mining layer performs feature mining on encoded features based on an attention mechanism, and includes a first feature mining layer and a second feature mining layer; the aforementioned decoding layer includes a first decoding layer and a second decoding layer; and the aforementioned output layer includes a first output layer and a second output layer.

[0097] See Figure 2B, which is a schematic diagram of the structure of a target speech conversion model provided in an embodiment of this disclosure.

[0098] The aforementioned speech unit conversion model includes an encoding layer, a first attention layer, a second attention layer, a first decoding layer, a second decoding layer, a first output layer, and a second output layer.

[0099] Specifically, the above-mentioned encoding layer is used to encode audio features to obtain encoded features, thereby realizing step S102A. The first attention layer is used to perform feature mining on the encoded features based on the attention mechanism to obtain hidden layer features, thereby realizing step S102B. The first decoding layer is used to decode the hidden layer features to obtain decoded features, thereby realizing step S102C. The first output layer is used to determine the target speech unit based on the decoded features, thereby realizing step S102D.

[0100] Furthermore, the aforementioned second attention layer is used to perform feature mining on the encoded features based on the attention mechanism, the second decoding layer is used to decode the feature mining results to obtain decoded features, and the second output layer is used to determine the source language speech units corresponding to the audio frames based on the decoded features. Steps D1-D4 aim to obtain the target speech units corresponding to each audio frame; therefore, during the execution of these steps, the second attention layer and the second decoding layer may be disabled, or the source language speech units obtained through the second attention layer and the second decoding layer may be ignored.
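
The data flow through the two branches can be sketched structurally as below. The class name `SpeechConversionModel`, the `with_source_branch` flag, and the stand-in identity and argmax layers are hypothetical and serve only to show that the encoding layer is shared and that the source-language branch can be skipped during translation.

```python
from dataclasses import dataclass
from typing import Callable, List

Features = List[List[float]]
Layer = Callable[[Features], Features]
OutputLayer = Callable[[Features], List[int]]


@dataclass
class SpeechConversionModel:
    """Shared encoder followed by a target-language and a source-language branch."""
    encode: Layer
    attend_target: Layer        # first attention layer
    decode_target: Layer        # first decoding layer
    output_target: OutputLayer  # first output layer
    attend_source: Layer        # second attention layer
    decode_source: Layer        # second decoding layer
    output_source: OutputLayer  # second output layer

    def forward(self, audio_features, with_source_branch=False):
        encoded = self.encode(audio_features)
        hidden = self.attend_target(encoded)
        target_units = self.output_target(self.decode_target(hidden))
        source_units = None
        if with_source_branch:  # used for training, skipped for translation
            hidden = self.attend_source(encoded)
            source_units = self.output_source(self.decode_source(hidden))
        return target_units, source_units


# Stand-in layers: identity transforms and a per-frame argmax output.
def identity(feats):
    return feats


def argmax_rows(feats):
    return [max(range(len(row)), key=row.__getitem__) for row in feats]


model = SpeechConversionModel(identity, identity, identity, argmax_rows,
                              identity, identity, argmax_rows)
print(model.forward([[0.1, 0.9], [0.7, 0.3]]))  # ([1, 0], None)
```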

[0101] Furthermore, the aforementioned encoding layer can be composed of four interconnected Conv1D layers, and the first decoding layer and the second decoding layer can each be composed of four interconnected Transformer Layers.

[0102] The above step S102B can be achieved through the following step D2.

[0103] Step D2: Input the encoded features of each audio frame into the above feature mining layer, perform feature mining on the encoded features, and obtain the hidden layer features of each audio frame.

[0104] Specifically, the encoded features can be input into the first feature mining layer in the feature mining layer to perform feature mining on the encoded features.

[0105] The above step S102C can be achieved through the following step D3.

[0106] Step D3: Input the hidden layer features of each audio frame into the above decoding layer and decode them to obtain the decoded features corresponding to each audio frame.

[0107] Specifically, the hidden layer features can be input into the first decoding layer in the decoding layer to decode the hidden layer features.

[0108] The above step S102D can be achieved through the following step D4.

[0109] Step D4: Input the decoding features corresponding to each audio frame into the above output layer to determine the speech unit of the target language that matches the decoding features of each audio frame, and use it as the target speech unit corresponding to each audio frame.

[0110] Specifically, the decoded features can be input into the first output layer in the output layer to determine the target speech unit corresponding to each audio frame.

[0111] In this embodiment of the disclosure, a trained neural network model, namely the target speech conversion model, is used to obtain the target speech units corresponding to each audio frame in the source language audio. The target speech conversion model is trained based on a large number of samples, which can ensure the accuracy of the obtained target speech units. In addition, the neural network model has a fast data processing speed, which can improve the efficiency of determining the target speech units.

[0112] See Figure 3, which is a flowchart of the first model training method provided in an embodiment of this disclosure. Specifically, the target speech conversion model can be obtained by training the initial speech conversion model through the following steps S301-S303.

[0113] S301: Input the sample features of each sample audio frame in the source language audio into the initial speech conversion model to obtain the first speech unit of the target language corresponding to each sample audio frame and the second speech unit of the source language corresponding to each sample audio frame, which are output by the initial speech conversion model.

[0114] Each speech unit is audio data corresponding to one acoustic category of the audio. The initial speech conversion model is a model that has not yet been trained into the target speech conversion model. Its structure is the same as that of the target speech conversion model, namely the model structure shown in Figure 2B described above.

[0115] In addition, the first speech unit is the speech unit output by the first decoding layer shown in Figure 2B, and the second speech unit is the speech unit output by the second decoding layer shown in Figure 2B.

[0116] S302: Based on the first speech unit, the second speech unit, the third speech unit, and the fourth speech unit, calculate the first loss of the speech unit conversion using the above initial speech conversion model.

[0117] The third and fourth speech units are predetermined speech units. Each third speech unit is the speech unit of the target language corresponding to an audio frame in the sample target language audio.

[0118] Each fourth speech unit is the speech unit of the source language corresponding to a sample audio frame. The sample source language audio and the sample target language audio have the same semantics and differ only in language.

[0119] Specifically, the sample target language audio can be obtained by performing speech translation on the sample source language audio using methods in related technologies, or by translating the content of the sample source language audio into the target language and then recording a speaker reading the translated content.

[0120] In one embodiment of this disclosure, the third speech unit may be determined manually by identifying each audio frame in the target language audio sample. Alternatively, the third speech unit may be obtained through step E as described below, which will not be detailed here.

[0121] In another embodiment of this disclosure, the fourth speech unit may be determined manually by identifying each audio frame in the sample source language audio. Alternatively, the fourth speech unit may be obtained through step F as described below, which will not be detailed here.

[0122] Furthermore, when calculating the first loss, a first sub-loss of the initial speech conversion model in determining target language speech units can be calculated from the first and third speech units, and a second sub-loss of the initial speech conversion model in determining source language speech units can be calculated from the second and fourth speech units. The first loss is then calculated by combining the first and second sub-losses.

[0123] Specifically, the first loss can be obtained by calculating the average, weighted average, or weighted sum of the first and second sub-losses. Furthermore, the calculation of the first and second sub-losses can be based on the Cross Entropy (CE) criterion.
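The combination described in S302 can be sketched as follows. This is only an illustration under stated assumptions: the model is assumed to emit a probability distribution over speech units per frame, the sub-losses are frame-averaged cross entropy, and the function names and default weights (0.5 each, which reduces to the plain average) are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean negative log-probability assigned to the reference unit of each frame."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def first_loss(target_probs, third_units, source_probs, fourth_units,
               w_target: float = 0.5, w_source: float = 0.5) -> float:
    """Weighted sum of the target-language and source-language sub-losses."""
    sub_loss_1 = cross_entropy(target_probs, third_units)   # first vs. third units
    sub_loss_2 = cross_entropy(source_probs, fourth_units)  # second vs. fourth units
    return w_target * sub_loss_1 + w_source * sub_loss_2


# Toy example: 2 frames, 2 candidate units per language.
target_probs = [[0.5, 0.5], [0.25, 0.75]]
source_probs = [[0.8, 0.2], [0.5, 0.5]]
loss = first_loss(target_probs, [0, 1], source_probs, [0, 0])
print(round(loss, 4))
```

Passing weights that do not sum to 1 gives the weighted-sum variant mentioned above.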

[0124] S303: Based on the first loss mentioned above, the model parameters of the initial speech conversion model are adjusted to obtain the target speech conversion model.

[0125] In this embodiment of the disclosure, during the training of the target speech conversion model, the parameters of the initial speech conversion model are often adjusted multiple times. After each adjustment of the model parameters, if the preset first training termination condition is not met, the process can return to the above step S301, input the sample features of each sample audio frame in the new sample source language audio into the above initial speech conversion model, and continue to execute the subsequent step S302 to further train the above initial speech conversion model until the above first training termination condition is met, and the target speech conversion model is obtained.

[0126] The first training termination condition can be either the number of times the model parameters are adjusted reaches a preset number, or the calculated first loss is lower than a first preset loss.
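The iteration over S301-S303 with the two termination conditions can be sketched as a generic loop. The helper `train_until_done` and the one-parameter `ToyModel` are hypothetical stand-ins for the initial speech conversion model; they are not taken from the source.

```python
import itertools
from typing import Callable, Iterable, Tuple


def train_until_done(step: Callable[[float], float], batches: Iterable[float],
                     max_updates: int, loss_threshold: float) -> Tuple[int, float]:
    """Run training steps until the update budget or the loss threshold is reached."""
    updates, loss = 0, float("inf")
    for batch in batches:
        loss = step(batch)  # forward pass, loss calculation, parameter adjustment
        updates += 1
        if updates >= max_updates or loss < loss_threshold:
            break
    return updates, loss


class ToyModel:
    """One-parameter stand-in: squared-error loss, plain gradient descent."""

    def __init__(self) -> None:
        self.w = 0.0

    def step(self, target: float) -> float:
        loss = (self.w - target) ** 2
        self.w -= 0.25 * 2 * (self.w - target)  # learning rate 0.25
        return loss


model = ToyModel()
updates, final_loss = train_until_done(model.step, itertools.repeat(3.0),
                                       max_updates=100, loss_threshold=0.1)
print(updates, final_loss)  # 5 0.03515625
```

In this toy run the loss threshold fires first; lowering `max_updates` makes the update-count condition fire instead.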

[0127] The target speech conversion model trained in this embodiment is used to determine the target speech units corresponding to each audio frame in the source language audio. Nevertheless, when training the initial speech conversion model, this embodiment uses not only the first and third speech units, which correspond to the target language, to adjust the model parameters and ensure the accuracy of the target speech units output by the model, but also the second and fourth speech units, which correspond to the source language. The two sets of speech units jointly constrain the training process, so that the output of the trained target speech conversion model is more accurate.

[0128] It should be noted that the source language audio samples and target language audio samples in this embodiment are from publicly available datasets. Furthermore, the source language audio samples and target language audio samples in this embodiment are not specific to any particular user and do not reflect the personal information of any particular user.

[0129] In one embodiment of this disclosure, for each audio frame in the target language audio sample, a third speech unit can be obtained through the following step E.

[0130] Step E: Input the audio features of the audio frame into the trained target speech unit determination model, and use the output as the third speech unit corresponding to the audio frame.

[0131] The target speech unit determination model mentioned above can be a HuBERT (Hidden-Unit Bidirectional Encoder Representation from Transformers) model. This model consists of four sequentially connected Conv1D layers and six sequentially connected Transformer Layers.

[0132] In another embodiment of this disclosure, for each sample audio frame in the sample source language audio, a fourth speech unit can be obtained through the following step F.

[0133] Step F: Input the audio features of the sample audio frame into the trained target speech unit determination model, and use the output as the fourth speech unit corresponding to the sample audio frame.

[0134] Specifically, the target speech unit determination model involved in step F has the same structure as the target speech unit determination model involved in step E, and will not be described again here.

[0135] In addition, the target speech unit determination model involved in steps E and F above can be trained using the first sample audio.

[0136] The target speech unit determination model involved in step E is used to determine the speech units corresponding to audio frames in the target language. This model needs to be trained using the first sample audio from the target language. Similarly, the target speech unit determination model involved in step F is used to determine the speech units corresponding to audio frames in the source language. This model also needs to be trained using the first sample audio from the source language. Since the first sample audio used during training is in a different language, the target speech unit determination models involved in steps E and F can be two different models.

[0137] Alternatively, the first sample audio of the target language and the first sample audio of the source language can be used to train the same model to obtain a target speech unit determination model. The trained target speech unit determination model can be applied to step E to determine the third speech unit corresponding to the audio frame of the target language, and can also be applied to step F to determine the fourth speech unit corresponding to the audio frame of the source language.

[0138] Furthermore, in this embodiment of the present disclosure, the third speech unit can be obtained through step E and the fourth speech unit through step F; only the third speech unit can be obtained through step E, with the fourth speech unit obtained by other means; only the fourth speech unit can be obtained through step F, with the third speech unit obtained by other means; or both the third and fourth speech units can be obtained by other means, without step E or step F.

[0139] In this embodiment, a pre-trained neural network model, i.e., a target speech unit determination model, is used to obtain the third and / or fourth speech units. These obtained third and / or fourth speech units are then used to train the initial speech conversion model. This eliminates the need for manually obtaining the third and fourth speech units, thereby reducing the cost and time required to acquire them before training the initial speech conversion model.

[0140] Figure 4 is a flowchart of the second model training method provided in this embodiment. The initial speech unit determination model is trained through the following steps S401-S403 to obtain the trained target speech unit determination model.

[0141] S401: Input each first audio frame of the first sample audio into the initial speech unit determination model to obtain the fifth speech unit corresponding to each first audio frame.

[0142] Sample audio with clear, standard pronunciation and little ambient noise can be selected as the first sample audio. The human speech features contained in such audio are relatively distinct, so using it to train the initial speech unit determination model helps the model learn human speech features and converge more quickly.

[0143] Furthermore, all the first sample audios belong to the same language, and the total duration of all the first sample audios is longer than a preset duration, such as 10 hours or 15 hours. Theoretically, the speech units corresponding to the audio frames contained in the longer first sample audios can cover all the different speech units that may appear in that language, thereby enabling the trained speech unit determination model to recognize various different speech units.

[0144] S402: Based on each fifth speech unit and the pre-obtained sample speech units corresponding to each first audio frame, calculate the second loss of the initial speech unit determination model in determining the speech units corresponding to audio frames.

[0145] The sample speech units corresponding to each audio frame can be obtained manually or obtained through steps G-H below, which will not be detailed here.

[0146] S403: Based on the second loss mentioned above, the model parameters of the initial speech unit determination model are adjusted to obtain the target speech unit determination model.

[0147] In this embodiment of the disclosure, during the training process of obtaining the target speech unit determination model, the parameters of the initial speech unit determination model are often adjusted multiple times. After each adjustment of the model parameters, if the preset second training termination condition is not met, the process can return to the above step S401, input the sample features of each sample audio frame in the new first sample audio into the above initial speech unit determination model, and continue to execute the subsequent step S402 to further train the above initial speech unit determination model until the above second training termination condition is met, and the target speech unit determination model is obtained.

[0148] The second training termination condition can be either the number of times the model parameters are adjusted reaches a preset number, or the calculated second loss is lower than a second preset loss.

[0149] In this embodiment of the disclosure, accurate sample speech units obtained in advance are used as training labels, and the initial speech unit determination model is trained on these labels. During training, the output of the initial speech unit determination model gradually approaches the sample speech units. As a result, the output of the trained target speech unit determination model approaches the accurate results and can be used to identify the speech units corresponding to different audio frames.

[0150] It should be noted that the first sample audio in this embodiment comes from a publicly available dataset. Furthermore, the first sample audio in this embodiment is not specific to any particular user and does not reflect the personal information of any particular user.

[0151] In one embodiment of this disclosure, sample speech units corresponding to each first audio frame can be obtained through the following steps G-H.

[0152] Step G: Based on the sample audio features of each first audio frame, perform clustering processing on each first audio frame according to the acoustic category to determine the acoustic category to which each first audio frame belongs.

[0153] The audio features mentioned above can be features in the form of FBank.

[0154] In one embodiment of this disclosure, product quantization can be applied to the sample audio features of each audio frame in order to cluster the audio frames. After clustering, audio frames that fall into the same cluster belong to the same acoustic category.

[0155] Step H: For each first audio frame, determine the speech unit corresponding to the acoustic category to which the first audio frame belongs as the sample speech unit corresponding to the first audio frame.

[0156] Specifically, the correspondence between acoustic categories and speech units can be pre-set, with each acoustic category corresponding to one speech unit. After determining the acoustic category to which an audio frame belongs, the speech unit corresponding to that acoustic category can be identified as the sample speech unit corresponding to that audio frame.

[0157] Alternatively, feature extraction can be performed on the first audio frames belonging to the same acoustic category to obtain audio features corresponding to that acoustic category. Speech units matching the features of that acoustic category are then selected as the speech units corresponding to that acoustic category. Therefore, the sample speech units corresponding to each first audio frame belonging to that acoustic category are all speech units corresponding to that acoustic category.

[0158] Therefore, in this embodiment, the sample speech units corresponding to each first audio frame in the first sample audio can be determined without manual identification. These sample speech units can then be used to train the initial speech unit determination model. Since obtaining the sample speech units does not require manual intervention, the cost and time required to obtain them before training the initial speech unit determination model can be reduced.
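Steps G-H can be sketched as clustering followed by a table lookup. Note the substitution: the source mentions product quantization, whereas this sketch uses plain k-means with a deterministic initialization, purely for brevity. The two-dimensional feature vectors, the unit labels `u17` and `u42`, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(frames: Sequence[Sequence[float]], k: int, iterations: int = 10) -> List[int]:
    """Step G (simplified): assign each frame to one of k acoustic categories."""
    centroids = [list(f) for f in frames[:k]]  # deterministic init: first k frames
    assignment = [0] * len(frames)
    for _ in range(iterations):
        # Assign every frame to its nearest centroid.
        assignment = [min(range(k), key=lambda c: squared_distance(f, centroids[c]))
                      for f in frames]
        # Move each centroid to the mean of its member frames.
        for c in range(k):
            members = [f for f, a in zip(frames, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Toy stand-ins for per-frame audio features (for example FBank vectors).
frames = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
categories = kmeans(frames, k=2)

# Step H: pre-set correspondence between acoustic categories and speech units.
category_to_unit: Dict[int, str] = {0: "u17", 1: "u42"}
sample_units = [category_to_unit[c] for c in categories]
print(sample_units)  # ['u17', 'u42', 'u17', 'u42']
```

The resulting `sample_units` list plays the role of the training labels used in S402.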

[0159] Since the target speech unit determination model mentioned above is only trained using the first sample audio, if the first sample audio is of a fixed type, such as audio from a fixed number of speakers, audio from a fixed gender, or audio from a fixed region, the limitations of the first sample audio are significant. The target speech unit determination model trained using the first sample audio can only accurately determine the speech units corresponding to audio frames of the same type as the first sample audio. Therefore, the target speech unit determination model trained in the aforementioned way has poor generalization ability. For this reason, the trained target speech unit determination model can be further trained.

[0160] Figure 5 is a flowchart of the third model training method provided in this embodiment of the disclosure. Compared with the embodiment shown in Figure 4, the following steps S404-S406 are further included after step S403.

[0161] When the target speech unit determination model is the HuBERT model, during the further training of the target speech unit determination model, a randomly initialized softmax layer can be added after the last Transformer Layer in the target speech unit determination model to assist in the further training of the target speech unit determination model.

[0162] S404: Input each second audio frame in the second sample audio into the target speech unit determination model above to obtain the sixth speech unit corresponding to each second audio frame.

[0163] The second sample audio and the first sample audio correspond to the same language and semantics, but their audio data are different.

[0164] Although the second sample audio corresponds to the same language and semantics as the first sample audio, meaning their content is the same, the audio data of the second sample audio differs significantly from that of the first sample audio due to factors such as the accents, speaking styles, speaking speeds, and volumes of different speakers, as well as the recording environment noise when the first and second sample audio were recorded.

[0165] S405: Based on the output speech unit sequence and the sample speech unit sequence, calculate the third loss of the target speech unit determination model in determining the speech units corresponding to audio frames.

[0166] The output speech unit sequence includes each sixth speech unit, and the order of each sixth speech unit is: the temporal order of the corresponding second audio frame in the second sample audio.

[0167] The above sample speech unit sequence includes: sample speech units corresponding to each first audio frame, and the arrangement order of each sample speech unit is: the temporal order of the corresponding first audio frame in the above first sample audio.

[0168] Specifically, before further training the target speech unit determination model, it is necessary to first determine the training labels used in the model training process, that is, the speech units actually corresponding to each audio frame in the second sample audio. However, since the second sample audio is substantially different from the first sample audio, it is difficult to accurately determine the audio frames in the second sample audio that represent the same content and correspond one-to-one with the first sample audio. Therefore, it is difficult to directly use the sample speech units corresponding to each audio frame in the first sample audio as the speech units corresponding to each audio frame in the second sample audio.

[0169] For example, suppose the speaker in the second sample audio speaks faster and the speaker in the first sample audio speaks more slowly, so that the second sample audio lasts 1 minute while the first sample audio lasts 1.5 minutes. Although the two correspond to the same language and semantics, the first sample audio is longer and contains more audio frames, so it is difficult to directly determine a one-to-one correspondence between the audio frames of the first sample audio and those of the second sample audio. Therefore, the sample speech units corresponding to each audio frame in the first sample audio cannot be used directly as the speech units corresponding to each audio frame in the second sample audio.

[0170] Although it is difficult to determine the speech units corresponding to each audio frame in the second sample audio separately, since the first and second sample audios correspond to the same language and semantics, the sequences obtained by arranging the speech units corresponding to each audio frame in the first and second sample audios according to their temporal order are theoretically similar. Therefore, in this embodiment, the sequence of sample speech units corresponding to the first sample audio can be directly used as the theoretically accurate value of the target speech unit determination model output, without needing to re-identify the actual speech units corresponding to each audio frame in the second sample audio.

[0171] In another embodiment of this disclosure, the aforementioned third loss can be calculated based on the CTC (connectionist temporal classification) loss function.
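To show why a CTC loss suits this alignment-free setting, here is a minimal sketch of the standard CTC forward computation. It is not taken from the source. It assumes the model emits a probability distribution per frame with index 0 reserved for the blank symbol, and that the reference sequence is no longer than the number of frames (for example after merging consecutive repeated units), which the source does not specify. Practical implementations work in log space to avoid underflow.

```python
import math
from typing import Sequence


def ctc_loss(probs: Sequence[Sequence[float]], target: Sequence[int],
             blank: int = 0) -> float:
    """Negative log-likelihood of `target` summed over all frame alignments."""
    # Extended target with blanks interleaved: [b, t1, b, t2, ..., b].
    ext = [blank]
    for unit in target:
        ext += [unit, blank]
    size = len(ext)

    # alpha[s]: probability of all partial alignments ending at ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        prev, alpha = alpha, [0.0] * size
        for s in range(size):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one symbol
            # Skipping a blank is allowed only between two different units.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += prev[s - 2]
            alpha[s] = total * probs[t][ext[s]]

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)  # raises ValueError if no alignment is possible


# Two frames, two classes (blank and unit 1), uniform probabilities.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(round(ctc_loss(uniform, [1]), 4))  # 0.2877
```

In the toy case, three alignments (unit-blank, blank-unit, unit-unit) each have probability 0.25, so the likelihood is 0.75 and the loss is -ln(0.75).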

[0172] S406: Based on the third loss mentioned above, adjust the model parameters of the target speech unit determination model to obtain a further trained target speech unit determination model.

[0173] In one embodiment of this disclosure, the parameters of the speech unit determination model and of the softmax layer added for training can be adjusted by fine-tuning. Specifically, the parameters of the Conv1D layers included in the speech unit determination model can be left unadjusted, which speeds up the model training process.

[0174] In addition, if the preset third training termination condition is not met after adjusting the model parameters, the process can return to step S404 above, input the audio frames of new second sample audio into the target speech unit determination model, and continue to train the target speech unit determination model until the third training termination condition is met, thereby obtaining the further trained target speech unit determination model.
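Leaving the Conv1D parameters unadjusted during fine-tuning can be sketched with a plain dictionary of named parameters. The parameter names (`conv1d.0.weight`, `transformer.0.weight`, `softmax.weight`), the gradient values, and the helper `fine_tune_step` are hypothetical; a real training framework would provide its own mechanism for freezing layers.

```python
from typing import Dict, List, Sequence


def fine_tune_step(params: Dict[str, List[float]], grads: Dict[str, List[float]],
                   learning_rate: float,
                   frozen_prefixes: Sequence[str] = ("conv1d.",)) -> None:
    """Apply one gradient-descent update in place, skipping frozen parameters."""
    for name, values in params.items():
        if name.startswith(tuple(frozen_prefixes)):
            continue  # Conv1D parameters are left unadjusted
        for i, g in enumerate(grads[name]):
            values[i] -= learning_rate * g


params = {
    "conv1d.0.weight": [1.0, 2.0],
    "transformer.0.weight": [1.0],
    "softmax.weight": [0.0],
}
grads = {
    "conv1d.0.weight": [1.0, 1.0],
    "transformer.0.weight": [1.0],
    "softmax.weight": [1.0],
}
fine_tune_step(params, grads, learning_rate=0.5)
print(params["conv1d.0.weight"], params["transformer.0.weight"], params["softmax.weight"])
# [1.0, 2.0] [0.5] [-0.5]
```

Skipping the frozen group means no update work is spent on it, which is one way the training process becomes faster.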

[0175] The third training termination condition mentioned above can be either: the number of times the model parameters are adjusted reaches a preset number, or the calculated third loss is lower than the third preset loss.

[0176] As can be seen from the above, this embodiment further trains the target speech unit determination model using second sample audio that has the same semantics and language as the first sample audio but different audio data. As a result, the further trained model can determine the speech units corresponding to audio frames of the same type as the first sample audio and also those of other types, which improves its generalization. Furthermore, because the sample speech unit sequence containing the sample speech units corresponding to each first audio frame is used directly during training, the time and cost of determining the actual speech units corresponding to each audio frame in the second sample audio are saved.

[0177] It should be noted that the second sample audio in this embodiment comes from a publicly available dataset. Furthermore, the second sample audio in this embodiment is not specific to any particular user and does not reflect the personal information of any specific user.

[0178] Corresponding to the above-described speech translation method, this disclosure also provides a speech translation device.

[0179] Figure 6 is a structural schematic diagram of the first speech translation device provided in this embodiment of the present disclosure. The device includes the following modules 601-603.

[0180] The feature extraction module 601 is used to extract the audio features of each audio frame in the source language audio to be translated;

[0181] The first unit determination module 602 is used to determine the speech unit of the target language corresponding to each audio frame based on the audio features of each audio frame, and to use it as the target speech unit. Each speech unit is: audio data of an acoustic category corresponding to the audio.

[0182] The audio generation module 603 is used to generate target language audio based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame.

[0183] As can be seen from the above, the solution provided in this disclosure, in the process of speech translation from source language audio to target language audio, can directly determine the target speech units corresponding to each audio frame in the source language audio. Each target speech unit is a segment of audio data, and different target speech units have different pronunciations. Based on the determined target speech units, a complete target language audio can be obtained by combining them. The speech translation process of this solution only involves two steps: determining the target speech units and generating target language audio based on the target speech units. Unlike related technologies, the speech translation process of this solution only involves the conversion between audio data and does not require the use of text data. Therefore, it does not involve the conversion process between audio data and text data, making the speech translation process provided by this solution simpler and more efficient.
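The division of the device into modules 601-603 can be sketched as a composition of three callables. The class name `SpeechTranslationDevice` and the toy stand-ins (fixed-size framing, a sign-based unit choice, one output sample per unit) are hypothetical placeholders and bear no relation to the actual models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SpeechTranslationDevice:
    feature_extraction: Callable[[Sequence[float]], List[List[float]]]  # module 601
    unit_determination: Callable[[List[List[float]]], List[int]]        # module 602
    audio_generation: Callable[[List[int]], List[float]]                # module 603

    def translate(self, source_audio: Sequence[float]) -> List[float]:
        features = self.feature_extraction(source_audio)  # per-frame audio features
        units = self.unit_determination(features)          # target speech units
        return self.audio_generation(units)                # target language audio


# Toy stand-ins so the data flow can be executed end to end.
device = SpeechTranslationDevice(
    feature_extraction=lambda audio: [list(audio[i:i + 2])
                                      for i in range(0, len(audio), 2)],
    unit_determination=lambda feats: [1 if sum(f) > 0 else 0 for f in feats],
    audio_generation=lambda units: [float(u) for u in units],
)
print(device.translate([0.5, 0.5, -1.0, 0.2]))  # [1.0, 0.0]
```

No text representation appears anywhere in the data flow, which is the point made in the paragraph above.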

[0184] Figure 7 is a structural schematic diagram of the second speech translation device provided in this embodiment of the disclosure. Compared with the embodiment shown in Figure 6, the first unit determination module 602 includes:

[0185] The encoding submodule 602A is used to encode the audio features of each audio frame separately to obtain the encoded features of each audio frame.

[0186] The feature mining submodule 602B is used to perform feature mining on the encoded features of each audio frame to obtain the hidden features of each audio frame. The hidden features are features that contain the implicit information of the audio frame.

[0187] The decoding submodule 602C is used to decode the hidden features of each audio frame to obtain the corresponding decoded features of each audio frame.

[0188] The unit determination submodule 602D is used to determine, for each audio frame, the target language speech unit that matches the decoded features of the audio frame, and to use the determined speech unit as the target speech unit corresponding to the audio frame.

[0189] In this embodiment, the audio features of each audio frame are first encoded to obtain easily processed encoded features. Then, feature mining is performed on the encoded features based on an attention mechanism to obtain hidden layer features containing rich information. Finally, the target speech unit corresponding to each audio frame is determined based on the decoded features obtained after decoding the hidden layer features. In this embodiment, the target speech unit is determined by combining multiple types of information contained in the hidden layer features, making the target speech unit obtained in this embodiment more accurate.

[0190] In one embodiment of this disclosure, the encoding submodule 602A is specifically used for:

[0191] The audio features of each audio frame are input into the encoding layer of the target speech conversion model, and the audio features are encoded to obtain the encoded features of each audio frame. The target speech conversion model also includes a feature mining layer, a decoding layer, and an output layer.

[0192] The feature mining submodule 602B is specifically used for:

[0193] The encoded features of each audio frame are input into the feature mining layer, and feature mining is performed on the encoded features to obtain the hidden layer features of each audio frame.

[0194] The decoding submodule 602C is specifically used for:

[0195] The hidden features of each audio frame are input into the decoding layer, and the hidden features are decoded to obtain the decoding features corresponding to each audio frame.

[0196] The unit determination submodule 602D is specifically used for:

[0197] The decoding features corresponding to each audio frame are input into the output layer to determine the speech unit of the target language that matches the decoding features of each audio frame, which is then used as the target speech unit corresponding to each audio frame.

[0198] In this embodiment of the disclosure, a trained neural network model, namely the target speech conversion model, is used to obtain the target speech units corresponding to each audio frame in the source language audio. The target speech conversion model is trained based on a large number of samples, which can ensure the accuracy of the obtained target speech units. In addition, the neural network model has a fast data processing speed, which can improve the efficiency of determining the target speech units.

[0199] In one embodiment of this disclosure, the audio generation module 603 is specifically used for:

[0200] Feature extraction is performed on the target speech unit sequence, wherein the target speech unit sequence includes: target speech units corresponding to each audio frame arranged in temporal order according to the corresponding audio frames in the source language audio;

[0201] Based on the feature extraction results, the target speech unit sequence is upsampled to obtain the upsampled result;

[0202] Based on the upsampling results, speech synthesis is performed to generate audio in the target language.

[0203] In the process of generating target language audio, the embodiments of this disclosure first extract features from the target speech unit sequence to determine the overall features of the target speech unit sequence, and then upsample the target speech unit sequence based on the feature extraction results, so that the upsampling results maintain the original overall features of the target speech unit sequence, thereby ensuring the consistency of data before and after the upsampling process and improving the accuracy of the generated target language audio.
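The first two stages of audio generation can be sketched as follows. The source does not specify the feature extractor or the upsampling method, so this sketch assumes an embedding-table lookup and nearest-neighbour repetition; the table contents and function names are hypothetical, and the final synthesis stage is omitted.

```python
from typing import Dict, List, Sequence


def extract_features(units: Sequence[int],
                     table: Dict[int, List[float]]) -> List[List[float]]:
    """Map each speech unit to a feature vector, preserving temporal order."""
    return [list(table[u]) for u in units]


def upsample(features: Sequence[Sequence[float]], factor: int) -> List[List[float]]:
    """Repeat each feature vector `factor` times (nearest-neighbour upsampling)."""
    return [list(vec) for vec in features for _ in range(factor)]


# Hypothetical embedding table for two speech units.
embedding_table = {3: [0.1, 0.2], 7: [0.3, 0.4]}
unit_sequence = [3, 7]  # target speech units in source-frame order

features = extract_features(unit_sequence, embedding_table)
upsampled = upsample(features, factor=2)
print(upsampled)  # [[0.1, 0.2], [0.1, 0.2], [0.3, 0.4], [0.3, 0.4]]
```

Because every output vector is copied from an input vector in order, the upsampled result keeps the overall features and ordering of the original sequence.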

[0204] Corresponding to the above-described model training method, this disclosure also provides a model training apparatus.

[0205] Figure 8 is a structural schematic diagram of the first model training device provided in the embodiments of this disclosure. The device includes the following modules 801-803.

[0206] The sample unit determination module 801 is used to input the sample features of each sample audio frame in the sample source language audio into the initial speech conversion model to obtain the first speech unit of the target language corresponding to each sample audio frame and the second speech unit of the source language corresponding to each sample audio frame, both of which are output by the initial speech conversion model. Each speech unit is audio data corresponding to one acoustic category of the audio.

[0207] The first loss calculation module 802 is used to calculate the first loss of the initial speech conversion model for speech unit conversion based on the first speech unit, the second speech unit, the third speech unit, and the fourth speech unit. The third and fourth speech units are predetermined speech units. Each third speech unit is a speech unit of the target language corresponding to an audio frame in the sample target language audio, and each fourth speech unit is a speech unit of the source language corresponding to a sample audio frame. The sample source language audio and the sample target language audio have the same semantics.

[0208] The first model acquisition module 803 is used to adjust the model parameters of the initial speech conversion model based on the first loss to obtain the target speech conversion model.

[0209] The target speech conversion model trained in this embodiment is used to determine the target speech unit corresponding to each audio frame in the source language audio. Nevertheless, when training the initial speech conversion model, this embodiment uses not only the first and third speech units, which correspond to the target language, to adjust the model parameters and ensure the accuracy of the target speech units output by the model, but also the second and fourth speech units, which correspond to the source language. The two sets of speech units jointly constrain the training process, so that the output of the trained target speech conversion model is more accurate.

[0210] In one embodiment of this disclosure, for each audio frame in the target language audio sample, a third speech unit corresponding to the audio frame is obtained by a third unit determination module, wherein the third unit determination module is specifically used for:

[0211] The audio features of the audio frame are input into the trained target speech unit determination model, and the output is used as the third speech unit corresponding to the audio frame.

[0212] and / or

[0213] For each sample audio frame in the source language audio, the fourth speech unit corresponding to that sample audio frame is obtained through the fourth unit determination module. The fourth unit determination module is specifically used for:

[0214] The audio features of the sample audio frame are input into the trained target speech unit determination model, and the output is used as the fourth speech unit corresponding to the sample audio frame.

[0215] In this embodiment, a pre-trained neural network model, i.e., a target speech unit determination model, is used to obtain the third and / or fourth speech units. These obtained third and / or fourth speech units are then used to train the initial speech conversion model. This eliminates the need for manually obtaining the third and fourth speech units, thereby reducing the cost and time required to acquire them before training the initial speech conversion model.
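The labeling step described above can be sketched as follows. This is only an illustration: the trained target speech unit determination model is replaced by a hypothetical linear scorer (`weights`, `bias`), and the function names `determine_units` and `make_training_labels` are assumptions, not part of this disclosure.

```python
import numpy as np

def determine_units(frame_features, weights, bias):
    """Hypothetical stand-in for the trained unit determination model.

    Scores each frame against every speech unit and returns the id of the
    best-scoring unit per frame.
    """
    scores = np.asarray(frame_features, dtype=float) @ weights + bias
    return scores.argmax(axis=1)

def make_training_labels(target_frames, source_frames, weights, bias):
    """Label both sides of a parallel sample with the same trained model."""
    third_units = determine_units(target_frames, weights, bias)   # target language
    fourth_units = determine_units(source_frames, weights, bias)  # source language
    return third_units, fourth_units
```

The resulting unit ids would then serve as the predetermined third and fourth speech units when the initial speech conversion model is trained.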

[0216] Referring to Figure 9, which is a schematic diagram of the structure of the second model training device provided in the embodiments of this disclosure, the initial speech unit determination model is trained through the following modules to obtain the trained target speech unit determination model. The device includes the following modules 901-903.

[0217] The fifth unit acquisition module 901 is used to input each first audio frame of the first sample audio into the initial speech unit determination model to obtain the fifth speech unit corresponding to each first audio frame.

[0218] The second loss calculation module 902 is used to calculate, based on each fifth speech unit and the pre-obtained sample speech unit corresponding to each first audio frame, the second loss of the initial speech unit determination model in determining the speech units corresponding to audio frames.

[0219] The second model acquisition module 903 is used to adjust the model parameters of the initial speech unit determination model based on the second loss to obtain the target speech unit determination model.

[0220] In this embodiment of the disclosure, accurate sample speech units obtained in advance are used as training labels, and the initial speech unit determination model is trained on these labels. During training, the outputs of the initial speech unit determination model gradually approach the sample speech units. As a result, the outputs of the trained target speech unit determination model approach the accurate results, and the model can be used to identify the speech units corresponding to different audio frames.
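A minimal sketch of one such training iteration is given below. It assumes, purely for illustration, that the speech unit determination model is a single linear layer followed by softmax and that the second loss is a mean cross-entropy minimized by gradient descent. This disclosure does not specify the model architecture, the loss function, or the optimizer, and the names `train_step` and `lr` are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_step(weights, features, sample_units, lr=0.5):
    """One gradient step on the assumed second loss (mean cross-entropy).

    features: (T, D) first-audio-frame features
    sample_units: (T,) pre-obtained sample speech unit ids (the labels)
    Returns the updated weights and the loss measured before the update.
    """
    num_frames = len(sample_units)
    probs = softmax(features @ weights)  # predicted "fifth speech unit" distribution
    loss = -np.log(probs[np.arange(num_frames), sample_units]).mean()
    grad_logits = probs.copy()
    grad_logits[np.arange(num_frames), sample_units] -= 1.0
    grad = features.T @ grad_logits / num_frames
    return weights - lr * grad, float(loss)
```

Calling `train_step` repeatedly on the same labeled frames drives the predicted units toward the sample speech units, which is the behavior described in the paragraph above.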

[0221] In one embodiment of this disclosure, sample speech units corresponding to each first audio frame are obtained through a sample unit obtaining module, wherein the sample unit obtaining module is specifically used for:

[0222] Based on the sample audio features of each first audio frame, clustering processing is performed on each first audio frame according to the acoustic category to determine the acoustic category to which each first audio frame belongs.

[0223] For each first audio frame, the speech unit corresponding to the acoustic category to which the first audio frame belongs is determined as the sample speech unit corresponding to the first audio frame.

[0224] Therefore, in this embodiment, the sample speech units corresponding to each first audio frame in the first sample audio can be determined without manual identification. These sample speech units can then be used to train the initial speech unit determination model. Since obtaining the sample speech units does not require manual intervention, the cost and time required to obtain them before training the initial speech unit determination model can be reduced.
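The clustering step can be illustrated with a small k-means sketch, in which each cluster index plays the role of an acoustic category and therefore of a sample speech unit. K-means is an assumption made here for illustration; this disclosure only states that the first audio frames are clustered by acoustic category. The function name `kmeans_units` and its parameters are hypothetical.

```python
import numpy as np

def kmeans_units(features, num_units, num_iters=20, seed=0):
    """Cluster frame features and return one unit id per frame.

    features: (T, D) sample audio features of the first audio frames
    num_units: assumed size of the speech unit inventory
    """
    features = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers from distinct, randomly chosen frames.
    init = rng.choice(len(features), size=num_units, replace=False)
    centers = features[init].copy()

    def assign_frames(current_centers):
        diffs = features[:, None, :] - current_centers[None, :, :]
        return (diffs ** 2).sum(axis=2).argmin(axis=1)

    for _ in range(num_iters):
        assignment = assign_frames(centers)
        for k in range(num_units):
            members = features[assignment == k]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[k] = members.mean(axis=0)

    # The cluster index of each frame serves as its sample speech unit.
    return assign_frames(centers), centers
```

The unit ids returned here are arbitrary cluster indices; only their consistency across frames matters when they are used as training labels.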

[0225] Referring to Figure 10, which is a schematic diagram of the structure of the third model training device provided in the embodiments of this disclosure: compared with the embodiment shown in Figure 9, the device further includes the following modules 904-906:

[0226] The sixth unit acquisition module 904 is used to input each second audio frame in the second sample audio into the target speech unit determination model to obtain the sixth speech unit corresponding to each second audio frame, wherein the second sample audio and the first sample audio correspond to the same language and semantics, but the audio data are different.

[0227] The third loss calculation module 905 is used to calculate, based on the output speech unit sequence and the sample speech unit sequence, the third loss of the target speech unit determination model in determining the speech units corresponding to audio frames. The output speech unit sequence includes each sixth speech unit, arranged in the temporal order of the corresponding second audio frames in the second sample audio. The sample speech unit sequence includes the sample speech units corresponding to each first audio frame, arranged in the temporal order of the corresponding first audio frames in the first sample audio.

[0228] The third model acquisition module 906 is used to adjust the model parameters of the target speech unit determination model based on the third loss to obtain the parameter-adjusted target speech unit determination model.

[0229] As can be seen from the above, this embodiment uses a second sample audio that has the same semantics and language as the first sample audio but different audio data to further train the speech unit determination model. As a result, the further trained model is not limited to determining speech units for audio frames of the same type as the first sample audio, which improves its generalization. Furthermore, because the sample speech unit sequence containing the sample speech units corresponding to each first audio frame is used directly for model training, the time and cost required to determine the actual speech units corresponding to each audio frame in the second sample audio are saved.
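Because the second sample audio and the first sample audio contain different audio data, the two unit sequences may differ in length, so a frame-by-frame comparison is not generally possible. This disclosure does not state how the third loss is computed. The sketch below shows one way to compare two such sequences: merge consecutive repeated units, then compute a normalized edit distance. Both the merging step and the edit distance are assumptions, and the measure is not differentiable, so it illustrates a sequence-level discrepancy rather than a loss that could be used directly for gradient-based parameter adjustment. All function names are hypothetical.

```python
from itertools import groupby

def collapse_repeats(units):
    """Merge runs of identical consecutive units, e.g. [5, 5, 7] -> [5, 7]."""
    return [unit for unit, _ in groupby(units)]

def edit_distance(a, b):
    """Levenshtein distance between two unit sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def sequence_discrepancy(output_units, sample_units):
    """Normalized edit distance between the collapsed unit sequences."""
    out = collapse_repeats(output_units)
    ref = collapse_repeats(sample_units)
    return edit_distance(out, ref) / max(len(ref), 1)
```

For example, `sequence_discrepancy([5, 5, 7, 7, 7, 2], [5, 7, 7, 2, 2])` is zero under these assumptions, because both sequences reduce to the same unit order even though their frame counts differ.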

[0230] The collection, storage, use, processing, transmission, provision, and disclosure of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0231] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0232] This disclosure provides an electronic device, including:

[0233] At least one processor; and

[0234] A memory communicatively connected to the at least one processor; wherein,

[0235] The memory stores instructions that can be executed by the at least one processor, which enables the at least one processor to perform the above-described speech translation and model training methods.

[0236] This disclosure provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to perform a speech translation and model training method.

[0237] This disclosure provides a computer program product, including a computer program that, when executed by a processor, implements a speech translation and model training method.

[0238] Figure 11 shows a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and / or claimed herein.

[0239] As shown in Figure 11, device 1100 includes a computing unit 1101, which can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 1102 or a computer program loaded from storage unit 1108 into random access memory (RAM) 1103. The RAM 1103 may also store various programs and data required for the operation of device 1100. The computing unit 1101, ROM 1102, and RAM 1103 are interconnected via bus 1104. Input / output (I / O) interface 1105 is also connected to bus 1104.

[0240] Multiple components in device 1100 are connected to I / O interface 1105, including: input unit 1106, such as keyboard, mouse, etc.; output unit 1107, such as various types of monitors, speakers, etc.; storage unit 1108, such as disk, optical disk, etc.; and communication unit 1109, such as network card, modem, wireless transceiver, etc. Communication unit 1109 allows device 1100 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0241] The computing unit 1101 can be various general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1101 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1101 performs the various methods and processes described above, such as speech translation methods and model training methods. For example, in some embodiments, the speech translation methods and model training methods can be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 1108. In some embodiments, part or all of the computer program can be loaded and / or installed on device 1100 via ROM 1102 and / or communication unit 1109. When the computer program is loaded into RAM 1103 and executed by the computing unit 1101, one or more steps of the speech translation methods and model training methods described above can be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform speech translation methods or model training methods by any other suitable means (e.g., by means of firmware).

[0242] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0243] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0244] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0245] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0246] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0247] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0248] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed in this disclosure can be achieved, and this is not limited herein.

[0249] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A model training method, comprising: The sample features of each sample audio frame in the source language audio are input into the initial speech conversion model to obtain the first speech unit of the target language corresponding to each sample audio frame and the second speech unit of the source language corresponding to each sample audio frame output by the initial speech conversion model. Each speech unit is audio data of an acoustic category corresponding to the audio. The sample features include FBank features. The source language and the target language are different languages. The first sub-loss is calculated based on the first and third speech units, and the second sub-loss is calculated based on the second and fourth speech units. The first loss for speech unit conversion of the initial speech conversion model is calculated by combining the first and second sub-losses. Each third speech unit is a speech unit of the target language corresponding to each audio frame in the sample target language audio, and each fourth speech unit is a speech unit of the source language corresponding to each sample audio frame. The sample source language audio and the sample target language audio have the same semantics. Based on the first loss, the model parameters of the initial speech conversion model are adjusted to obtain the target speech conversion model; The method further includes: The audio features of each audio frame in the target language audio sample are input into the trained target speech unit determination model, and the output result is used as the third speech unit corresponding to that audio frame. The audio features of each sample audio frame in the source language audio are input into the trained target speech unit determination model, and the output result is used as the fourth speech unit corresponding to that sample audio frame. 
The training labels used by the target speech unit determination model during training are the sample speech units corresponding to each first audio frame; the sample speech units corresponding to each first audio frame are obtained in the following way: Based on the sample audio features of each first audio frame, clustering processing is performed on each first audio frame according to the acoustic category to determine the acoustic category to which each first audio frame belongs. For each first audio frame, the speech unit corresponding to the acoustic category to which the first audio frame belongs is determined as the sample speech unit corresponding to the first audio frame.

2. The method according to claim 1, wherein the trained target speech unit determination model is obtained in the following manner: Each first audio frame of the first sample audio is input into the initial speech unit determination model to obtain the fifth speech unit corresponding to each first audio frame; Based on each fifth speech unit and the sample speech unit corresponding to each first audio frame, the second loss of the speech unit determination model for determining the speech unit corresponding to the audio frame is calculated. Based on the second loss, the model parameters of the initial speech unit determination model are adjusted to obtain the target speech unit determination model.

3. The method according to claim 2, wherein, After adjusting the model parameters of the initial speech unit determination model based on the second loss to obtain the target speech unit determination model, the method further includes: Each second audio frame in the second sample audio is input into the target speech unit determination model to obtain the sixth speech unit corresponding to each second audio frame. The second sample audio and the first sample audio correspond to the same language and semantics, but the audio data are different. Based on the output speech unit sequence and the sample speech unit sequence, the third loss of the speech unit corresponding to the audio frame determined by the target speech unit determination model is calculated. The output speech unit sequence contains each sixth speech unit, and the arrangement order of each sixth speech unit is: the temporal order of the corresponding second audio frame in the second sample audio. The sample speech unit sequence contains: the sample speech units corresponding to each first audio frame, and the arrangement order of each sample speech unit is: the temporal order of the corresponding first audio frame in the first sample audio. Based on the third loss, the model parameters of the target speech unit determination model are adjusted to obtain the parameter-adjusted target speech unit determination model.

4. A speech translation method, comprising: Extract audio features from each audio frame in the source language audio to be translated, including FBank features; Based on the audio features of each audio frame, the speech units of the target language corresponding to each audio frame are determined as target speech units. Each speech unit is audio data of an acoustic category corresponding to the audio, and the source language and the target language are different languages. Based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame, the target language audio is generated. The step of determining the target language speech unit corresponding to each audio frame based on the audio features of each audio frame, as the target speech unit, includes: The audio features of each audio frame are input into the target speech conversion model to obtain the speech units of the target language corresponding to each audio frame determined by the target speech conversion model, which are used as target speech units. The target speech conversion model is trained by the method described in any one of claims 1-3.

5. The method according to claim 4, wherein, The step of determining the target language speech unit corresponding to each audio frame based on the audio features of each audio frame, as the target speech unit, includes: The audio features of each audio frame are encoded separately to obtain the encoded features of each audio frame. Feature mining is performed on the encoded features of each audio frame to obtain the hidden features of each audio frame. The hidden features are features that contain the implicit information of the audio frame. The hidden features of each audio frame are decoded to obtain the decoded features corresponding to each audio frame. For each audio frame, the speech unit of the target language that matches the decoding features of the audio frame is determined, and the determined audio unit is taken as the target speech unit corresponding to the audio frame.

6. The method according to claim 5, wherein, The process of encoding the audio features of each audio frame to obtain the encoded features of each audio frame includes: The audio features of each audio frame are input into the encoding layer of the target speech conversion model, and the audio features are encoded to obtain the encoded features of each audio frame. The target speech conversion model also includes a feature mining layer, a decoding layer, and an output layer. The process of feature mining the encoded features of each audio frame to obtain the hidden layer features of each audio frame includes: The encoded features of each audio frame are input into the feature mining layer, and feature mining is performed on the encoded features to obtain the hidden layer features of each audio frame. The process of decoding the hidden features of each audio frame to obtain the decoded features corresponding to each audio frame includes: The hidden features of each audio frame are input into the decoding layer, and the hidden features are decoded to obtain the decoding features corresponding to each audio frame. For each audio frame, determining the target language speech unit that matches the decoding features of that audio frame, and using the determined audio unit as the target speech unit corresponding to that audio frame, includes: The decoding features corresponding to each audio frame are input into the output layer to determine the speech unit of the target language that matches the decoding features of each audio frame, which is then used as the target speech unit of the corresponding audio frame.

7. The method according to any one of claims 4-6, wherein, The process of generating target language audio based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame includes: Feature extraction is performed on the target speech unit sequence, wherein the target speech unit sequence includes: target speech units corresponding to each audio frame arranged in temporal order according to the corresponding audio frames in the source language audio; Based on the feature extraction results, the target speech unit sequence is upsampled to obtain the upsampled result; Based on the upsampling results, speech synthesis is performed to generate audio in the target language.

8. A model training device, comprising: The sample unit determination module is used to input the sample features of each sample audio frame in the source language audio into the initial speech conversion model to obtain the first speech unit of the target language corresponding to each sample audio frame and the second speech unit of the source language corresponding to each sample audio frame output by the initial speech conversion model. Each speech unit is: audio data of an acoustic category corresponding to the audio. The sample features include FBank features. The source language and the target language are different languages. The first loss calculation module is used to calculate a first sub-loss based on the first speech unit and the third speech unit, and to calculate a second sub-loss based on the second speech unit and the fourth speech unit. The first loss for speech unit conversion of the initial speech conversion model is calculated by combining the first sub-loss and the second sub-loss. Each third speech unit is a speech unit of the target language corresponding to each audio frame in the sample target language audio, and each fourth speech unit is a speech unit of the source language corresponding to each sample audio frame. The sample source language audio and the sample target language audio have the same semantics. The first model acquisition module is used to adjust the model parameters of the initial speech conversion model based on the first loss to obtain the target speech conversion model. For each audio frame in the target language audio sample, the third speech unit corresponding to the audio frame is obtained through the third unit determination module. The third unit determination module is specifically used to: input the audio features of the audio frame into the trained target speech unit determination model, and use the output result as the third speech unit corresponding to the audio frame. 
For each sample audio frame in the source language audio, the fourth speech unit corresponding to the sample audio frame is obtained through the fourth unit determination module. The fourth unit determination module is specifically used to: input the audio features of the sample audio frame into the trained target speech unit determination model, and use the output result as the fourth speech unit corresponding to the sample audio frame. The training labels used by the target speech unit determination model during training are the sample speech units corresponding to each first audio frame; the sample speech units corresponding to each first audio frame are obtained through the sample unit acquisition module, which is specifically used for: Based on the sample audio features of each first audio frame, clustering processing is performed on each first audio frame according to the acoustic category to determine the acoustic category to which each first audio frame belongs. For each first audio frame, the speech unit corresponding to the acoustic category to which the first audio frame belongs is determined as the sample speech unit corresponding to the first audio frame.

9. The apparatus according to claim 8, wherein the initial speech unit determination model is trained using the following modules to obtain a trained target speech unit determination model: The fifth unit acquisition module is used to input each first audio frame of the first sample audio into the initial speech unit determination model to obtain the fifth speech unit corresponding to each first audio frame. The second loss calculation module is used to calculate the second loss of the speech unit corresponding to the audio frame determined by the initial speech unit determination model based on each fifth speech unit and the pre-obtained sample speech unit corresponding to each first audio frame; The second model acquisition module is used to adjust the model parameters of the initial speech unit determination model based on the second loss to obtain the target speech unit determination model.

10. The apparatus according to claim 9, wherein, The device further includes: The sixth unit acquisition module is used to input each second audio frame in the second sample audio into the target speech unit determination model to obtain the sixth speech unit corresponding to each second audio frame, wherein the second sample audio and the first sample audio correspond to the same language and semantics, but the audio data are different; The third loss calculation module is used to calculate the third loss of the speech unit corresponding to the audio frame determined by the target speech unit determination model based on the output speech unit sequence and the sample speech unit sequence. The output speech unit sequence contains each sixth speech unit, and the arrangement order of each sixth speech unit is: the temporal order of the corresponding second audio frame in the second sample audio. The sample speech unit sequence contains: the sample speech units corresponding to each first audio frame, and the arrangement order of each sample speech unit is: the temporal order of the corresponding first audio frame in the first sample audio. The third model acquisition module is used to adjust the model parameters of the target speech unit determination model based on the third loss, so as to obtain a further trained target speech unit determination model.

11. A speech translation device, comprising: The feature extraction module is used to extract audio features from each audio frame in the source language audio to be translated, and the audio features include filter bank FBank features; The first unit determination module is used to determine the speech unit of the target language corresponding to each audio frame based on the audio features of each audio frame, and to use it as the target speech unit. Each speech unit is audio data of an acoustic category corresponding to the audio, and the source language and the target language are different languages. The audio generation module is used to generate target language audio based on the temporal order of each audio frame in the source language audio and the target speech unit corresponding to each audio frame; The first unit determination module is specifically used to input the audio features of each audio frame into the target speech conversion model to obtain the speech units of the target language corresponding to each audio frame determined by the target speech conversion model, which serve as the target speech units. The target speech conversion model is trained by the method described in any one of claims 1-3.

12. The apparatus according to claim 11, wherein the first unit determination module comprises: an encoding submodule, configured to encode the audio features of each audio frame to obtain encoded features of each audio frame; a feature mining submodule, configured to perform feature mining on the encoded features of each audio frame to obtain hidden features of each audio frame, the hidden features being features that contain implicit information of the audio frame; a decoding submodule, configured to decode the hidden features of each audio frame to obtain decoded features corresponding to each audio frame; and a unit determination submodule, configured to determine the speech unit of the target language that matches the decoded features of each audio frame, and to take the determined speech unit as the target speech unit corresponding to that audio frame.

13. The apparatus according to claim 12, wherein: the encoding submodule is specifically configured to input the audio features of each audio frame into an encoding layer of the target speech conversion model, which encodes the audio features to obtain the encoded features of each audio frame, the target speech conversion model further including a feature mining layer, a decoding layer, and an output layer; the feature mining submodule is specifically configured to input the encoded features of each audio frame into the feature mining layer, which performs feature mining on the encoded features to obtain the hidden features of each audio frame; the decoding submodule is specifically configured to input the hidden features of each audio frame into the decoding layer, which decodes the hidden features to obtain the decoded features corresponding to each audio frame; and the unit determination submodule is specifically configured to input the decoded features corresponding to each audio frame into the output layer to determine the speech unit of the target language that matches the decoded features of each audio frame, which is taken as the target speech unit corresponding to that audio frame.
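
The four-layer data flow of claims 12 and 13 (encoding layer, feature mining layer, decoding layer, output layer) can be illustrated with a toy Python sketch. Every transform below is an assumed placeholder: the claims do not specify the layer types, and the fixed weight matrices, the neighbor-averaging "mining" step, and the argmax output rule are hypothetical choices made only to show how per-frame features become per-frame unit identifiers.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(weights: Matrix, x: Vector) -> Vector:
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def encoding_layer(features: List[Vector], w_enc: Matrix) -> List[Vector]:
    """Encode each frame independently (toy: linear map followed by tanh)."""
    return [[math.tanh(v) for v in matvec(w_enc, f)] for f in features]


def feature_mining_layer(encoded: List[Vector]) -> List[Vector]:
    """Toy stand-in for feature mining: average each frame with its neighbors
    so the hidden feature carries context from adjacent frames."""
    hidden = []
    for i in range(len(encoded)):
        window = encoded[max(0, i - 1):i + 2]
        hidden.append([sum(col) / len(window) for col in zip(*window)])
    return hidden


def decoding_layer(hidden: List[Vector], w_dec: Matrix) -> List[Vector]:
    """Map hidden features to one score per candidate speech unit."""
    return [matvec(w_dec, h) for h in hidden]


def output_layer(decoded: List[Vector]) -> List[int]:
    """Pick, for each frame, the unit whose score is highest."""
    return [max(range(len(scores)), key=scores.__getitem__) for scores in decoded]


def determine_target_units(features: List[Vector], w_enc: Matrix, w_dec: Matrix) -> List[int]:
    encoded = encoding_layer(features, w_enc)
    hidden = feature_mining_layer(encoded)
    decoded = decoding_layer(hidden, w_dec)
    return output_layer(decoded)


# Usage with identity weights and a two-unit vocabulary (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
units = determine_target_units(frames, identity, identity)
```

The sketch returns exactly one unit per input frame, matching the per-frame correspondence the claims require.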

14. The apparatus according to any one of claims 11-13, wherein the audio generation module is specifically configured to: perform feature extraction on a target speech unit sequence, the target speech unit sequence including the target speech units corresponding to the audio frames, arranged in the temporal order of the corresponding audio frames in the source language audio; upsample the target speech unit sequence based on the feature extraction result to obtain an upsampling result; and perform speech synthesis based on the upsampling result to generate the target language audio.
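
The three stages of claim 14 (feature extraction on the unit sequence, upsampling, speech synthesis) can be sketched in Python as follows. The document's summary names HiFi-GAN for synthesis; this sketch does not implement it and substitutes a trivial sine oscillator. The lookup table mapping each unit to a pitch, the fixed repetition factor, and all function names are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def extract_unit_features(units: Sequence[int], table: Dict[int, float]) -> List[float]:
    """Feature extraction: look up one feature per unit (toy: a frequency in Hz)."""
    return [table[u] for u in units]


def upsample(features: Sequence[float], factor: int) -> List[float]:
    """Repeat each unit-level feature so there is one value per output sample."""
    return [f for f in features for _ in range(factor)]


def synthesize(per_sample_freq: Sequence[float], sample_rate: int) -> List[float]:
    """Toy vocoder stand-in: a phase-continuous sine oscillator."""
    phase = 0.0
    audio = []
    for freq in per_sample_freq:
        phase += 2.0 * math.pi * freq / sample_rate
        audio.append(math.sin(phase))
    return audio


def generate_target_audio(units: Sequence[int], table: Dict[int, float],
                          factor: int, sample_rate: int) -> List[float]:
    features = extract_unit_features(units, table)
    return synthesize(upsample(features, factor), sample_rate)


# Usage: three target speech units in temporal order, four samples per unit.
unit_table = {0: 220.0, 1: 440.0}
audio = generate_target_audio([0, 1, 1], unit_table, factor=4, sample_rate=16000)
```

Upsampling is needed because the unit sequence has one entry per audio frame, whereas the synthesized waveform needs many samples per frame.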

15. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-3 or 4-7.

16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-3 or 4-7.

17. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-3 or 4-7.