A voice conversion method, device, equipment and readable storage medium

By encoding and modeling speech information using a three-head encoder and adjusting speech rate, combined with a pre-trained timbre conversion model, the problems of inaccurate speech conversion and large data requirements in existing technologies are solved, achieving efficient and clear speech conversion results.

CN116631373B · Active · Publication Date: 2026-06-26 · MIGU CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
MIGU CO LTD
Filing Date
2023-07-06
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing technologies cannot efficiently convert input speech into a specific person's voice, and deep learning-based voice changing technology requires a large amount of data and cannot achieve high-quality streaming speech generation.

Method used

A three-head encoder is used to encode and model speech information, extracting speech content, environmental noise and fundamental frequency information respectively. After adjusting the speech rate, the input is fed into a pre-trained timbre conversion model to generate target acoustic features.

Benefits of technology

It achieves efficient and accurate speech conversion, improves the clarity and robustness of voice information, and allows users to adjust the speech rate as needed, thus enhancing the user experience.

✦ Generated by Eureka AI based on patent content.

Figure CN116631373B_ABST

Abstract

The application provides a speech conversion method, device and equipment and a readable storage medium. The method comprises the following steps: obtaining speech information to be processed; based on a three-head encoder, encoding and modeling speech content, environmental noise and fundamental frequency information in the speech information to be processed respectively to obtain encoded and modeled speech information; changing the time sequence of the encoded and modeled speech information to adjust the speech speed of the encoded speech information; inputting the speech information with adjusted speech speed into a previously trained timbre conversion model corresponding to a target user to obtain target acoustic features, wherein the timbre of the target acoustic features is the same as the timbre of the target user. Thus, the timbre conversion of the speech information to be processed can be performed as required, and the conversion method is more efficient and accurate. Multi-dimensional encoding of the speech information can improve the robustness of the speech in a noisy environment. Speech speed control can make the speech more in line with user requirements.

Description

Technical Field

[0001] The present invention relates to the field of audio processing technology, and in particular to a speech conversion method, apparatus, device and readable storage medium.

Background Technology

[0002] Among related technologies, voice changing technology mainly includes traditional real-time voice changing technology and deep learning-based voice changing technology. The former converts audio into a distinctive voice based on audio recognition features, while the latter achieves real-time voice changing by establishing and training a deep learning model.

[0003] The problems with the related technologies are as follows: traditional real-time voice changing technology can only support converting input speech into some preset characteristic voices (such as the voices of specific cartoon characters), but cannot convert input speech into the voice of a specific person; while deep learning-based voice changing technology requires a large amount of data (usually at least hours of data) to achieve the conversion of a specific person's voice, and cannot achieve the generation of high-quality streaming speech data.

Summary of the Invention

[0004] This invention provides a speech conversion method, apparatus, device, and readable storage medium to solve the technical problems of speech conversion methods in the related art that cannot convert speech into the voice of a specific person, have poor applicability, require a large amount of training data, and have low speech conversion efficiency.

[0005] In a first aspect, embodiments of the present invention provide a speech conversion method, the method comprising:

[0006] Acquire the voice information to be processed;

[0007] Based on a three-head encoder, the speech content, environmental noise, and fundamental frequency information in the speech information to be processed are encoded and modeled to obtain the encoded and modeled speech information; wherein, the three-head encoder consists of a speech content encoder, an environmental noise encoder, and a prosodic wave encoder.

[0008] The timing of the encoded speech information is modified to adjust the speech rate of the encoded speech information;

[0009] The speech information after adjusting the speech rate is input into a pre-trained timbre conversion model corresponding to the target user to obtain the target acoustic features;

[0010] Wherein, the speech content of the target acoustic feature is the same as the speech content of the speech information, and the timbre of the target acoustic feature is the same as the timbre of the target user.

[0011] Optionally, based on a three-head encoder, the speech content, environmental noise, and fundamental frequency information in the speech information to be processed are encoded and modeled separately to obtain the encoded and modeled speech information, including:

[0012] Based on the speech content encoder, the speech content is autoencoded and mapped to a high-dimensional space to obtain a high-order feature representation of the speech content; wherein, the high-order feature representation is in the form of a two-dimensional matrix;

[0013] The environmental noise is extracted based on the environmental noise encoder, and the environmental noise is mapped to high-order environmental sound features; wherein the environmental sound features are in the form of a two-dimensional matrix;

[0014] Based on the prosodic wave encoder, the fundamental frequency information is extracted and the fundamental frequency information is discretized to obtain the prosodic wave encoding result; wherein, the prosodic wave encoding result is in the form of a one-dimensional matrix;

[0015] The environmental sound features and the prosodic wave encoding results are respectively converted into formats to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the higher-order feature expression.

[0016] The higher-order feature representation of the speech content, the higher-order ambient sound features, and the prosodic wave coding results are superimposed to obtain the encoded speech information.

[0017] Optionally, modifying the timing of the encoded speech information to adjust the speech rate includes:

[0018] The encoded speech information is expanded along its time sequence and linearly interpolated to slow down the speech rate; this includes:

[0019] Receive a speech rate slowing instruction; wherein the speech rate slowing instruction is used to indicate that the adjusted speech rate is 1 / N times the original speech rate, where N is a positive integer greater than 1;

[0020] According to the speech rate slowing instruction, the encoded values of the encoded speech information are expanded; wherein, the expansion is: arranging the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, inserting N-1 empty positions after each encoded value, until N-1 empty positions have been inserted after the end of the sequence.

[0021] Linear interpolation is performed on the empty positions to slow down the speech rate of the encoded speech information. The value inserted at an empty position before the end of the sequence is determined from the two adjacent original encoded values before and after that position, and the value inserted at an empty position after the end of the sequence is determined from the encoded value at the end of the sequence.
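The expansion and interpolation steps above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function name `slow_down` is hypothetical, scalar values stand in for per-frame encoded vectors (the same arithmetic would apply to each dimension), and it assumes that the positions after the end of the sequence simply repeat the final value, since the source only says they are "determined based on" it.

```python
def slow_down(values, n):
    """Stretch a sequence to n times its length by linear interpolation.

    values: list of floats, one encoded value per time step.
    n: integer greater than 1; the resulting speech rate is 1/n of the original.
    """
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    out = []
    for i, current in enumerate(values):
        out.append(current)
        # Assumption: slots after the final value repeat that value.
        nxt = values[i + 1] if i + 1 < len(values) else current
        for k in range(1, n):
            # Slot k lies k/n of the way from the current value to the next.
            out.append(current + (nxt - current) * k / n)
    return out


print(slow_down([0.0, 2.0, 4.0], 2))  # [0.0, 1.0, 2.0, 3.0, 4.0, 4.0]
```

With N = 2 the three original values become six, so the same content spans twice as many time steps.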

[0022] Optionally, modifying the timing of the encoded speech information to adjust the speech rate includes:

[0023] The encoded speech information is subjected to temporal deletion operations to increase the speech rate; this includes:

[0024] Receive a speech rate increase instruction; wherein the speech rate increase instruction is used to indicate that the adjusted speech rate is A times the original speech rate, where A is a positive integer greater than 1;

[0025] According to the speech rate increase instruction, encoded values of the encoded speech information are deleted; wherein, the deletion operation is: arranging the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, deleting the A-1 encoded values that follow each retained encoded value, until the end of the sequence.
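Read as "keep one encoded value out of every A", the deletion operation reduces to strided selection. The sketch below is an illustration under that reading; the function name `speed_up` is hypothetical.

```python
def speed_up(values, a):
    """Keep every a-th encoded value, dropping the a-1 values that follow it.

    values: list with one encoded value per time step.
    a: integer greater than 1; the resulting speech rate is a times the original.
    """
    if a < 2:
        raise ValueError("a must be an integer greater than 1")
    # Slicing with step a retains indices 0, a, 2a, ... and discards the rest.
    return values[::a]


print(speed_up([10, 11, 12, 13, 14, 15, 16], 3))  # [10, 13, 16]
```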

[0026] Optionally, before inputting the speech information after adjusting the speech rate into a pre-trained timbre conversion model to obtain the target acoustic features, the method further includes:

[0027] Determine the timbre expression features of the target user;

[0028] Based on the timbre expression features, the general timbre conversion model is adaptively trained to obtain the timbre conversion model;

[0029] The general timbre conversion model is trained on a preset amount of voice data from a preset number of users, together with the Mel feature data and the timbre expression features corresponding to that voice data.

[0030] Optionally, determining the timbre expression features of the target user includes:

[0031] Obtain the target user's voice information;

[0032] The voice information of the target user is converted into Mel acoustic features;

[0033] The Mel acoustic features are input into the TDNN network to obtain global timbre features;

[0034] The global timbre features are averaged to obtain the timbre expression features.
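As a rough illustration of the last two steps, the sketch below passes Mel features through a single time-delay (TDNN-style) layer and averages the result. Several things here are assumptions rather than details from the source: the averaging is taken over time frames, the network is reduced to one layer with untrained random weights, and the context offsets and the 192-dimensional output size are arbitrary. The function names are hypothetical.

```python
import numpy as np


def tdnn_layer(x, weight, bias, offsets=(-2, 0, 2)):
    """One time-delay layer: splice frames at fixed offsets, then affine + ReLU."""
    num_frames = x.shape[0]
    left, right = -min(offsets), max(offsets)
    # Keep only output frames whose whole context window lies inside the input.
    spliced = np.concatenate(
        [x[left + o : num_frames - right + o] for o in offsets], axis=1
    )
    return np.maximum(spliced @ weight + bias, 0.0)


def timbre_expression(mel, weight, bias):
    """Average frame-level outputs over time into one fixed-size vector."""
    frame_features = tdnn_layer(mel, weight, bias)
    return frame_features.mean(axis=0)


rng = np.random.default_rng(0)
mel = rng.standard_normal((200, 80))                 # [T, 80] Mel features
weight = rng.standard_normal((3 * 80, 192)) * 0.05   # untrained, illustrative only
bias = np.zeros(192)
print(timbre_expression(mel, weight, bias).shape)    # (192,)
```

Averaging over time yields a vector whose size does not depend on the length of the target user's recording, which is what lets it condition the conversion model.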

[0035] Optionally, after inputting the speech information with adjusted speech rate into a pre-trained timbre conversion model to obtain the target acoustic features, the method further includes:

[0036] The target acoustic features are sliced to obtain acoustic segments of the target acoustic features;

[0037] The acoustic segment is input into a vocoder and converted into time-domain sampling points to generate a speech segment;

[0038] The speech segments are then smoothly fused to generate a speech stream.

[0039] Optionally, slicing the target acoustic features to obtain acoustic segments of the target acoustic features includes:

[0040] Based on a preset segment length t, the target acoustic feature is sliced to obtain M basic acoustic segments of length t; where T is the total length of the target acoustic feature, and both T and t are positive integers. When T is divisible by t, M = T / t; when T is not divisible by t, M is the integer quotient of T / t plus 1.

[0041] At the end of each basic acoustic segment, an additional acoustic feature of one unit length is taken.

[0042] The basic acoustic segment is superimposed with the selected acoustic feature of one unit length to obtain M acoustic segments of length t+1.
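A minimal sketch of this slicing, assuming the additional unit-length feature is the frame immediately following each basic segment (the text above does not state where it is taken from). The function name `slice_features` is hypothetical. The final segment has no following frame, so in this sketch it is left without the extra frame.

```python
def slice_features(frames, t):
    """Split frames into M segments of t frames plus one look-ahead frame each."""
    total = len(frames)
    m = -(-total // t)  # ceiling of T / t, matching the two cases for M
    segments = []
    for i in range(m):
        start = i * t
        # The end index includes one extra frame; slicing clips at the sequence end.
        segments.append(frames[start : start + t + 1])
    return segments


print(slice_features(list(range(7)), 3))  # [[0, 1, 2, 3], [3, 4, 5, 6], [6]]
```

Under this assumption, the last frame of each segment is the same frame as the first frame of the next one, which gives adjacent segments a one-frame overlap to blend across.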

[0043] Optionally, the acoustic segment is input to a vocoder and converted into time-domain sampling points to generate a speech segment, including:

[0044] An acoustic segment of length t+1 is input into the vocoder to generate a speech segment of length K*(t+1).

[0045] Each unit length of acoustic feature is converted into K time-domain sampling points by the vocoder, where K is a positive integer.

[0046] Optionally, the speech segments are smoothed and fused to generate a speech stream, including:

[0047] In each speech segment except the last one, the last K time-domain sampling points, corresponding to the last unit length, are set as the tail buffer;

[0048] In each speech segment except the first one, the first K time-domain sampling points, corresponding to the first unit length, are set as the head fusion region;

[0049] For two adjacent speech segments, the loudness of the tail buffer of the earlier speech segment is attenuated, and the loudness of the head fusion region of the later speech segment is enhanced;

[0050] All processed speech segments are merged and spliced together to generate a speech stream.
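One plausible reading of the attenuate, enhance and splice steps is a crossfade with overlap-add, sketched below. The linear fade curve, the overlap-add itself, and the names `toy_vocoder` and `fuse` are assumptions; the toy vocoder merely repeats each frame value K times and stands in for a real neural vocoder. The hand-written segments assume each segment ends with the first frame of the next one.

```python
def toy_vocoder(segment, k):
    """Hypothetical stand-in for a vocoder: each frame becomes k samples."""
    return [float(frame) for frame in segment for _ in range(k)]


def fuse(speech_segments, k):
    """Overlap-add adjacent segments with a linear crossfade over k samples."""
    stream = list(speech_segments[0])
    for nxt in speech_segments[1:]:
        tail = stream[-k:]   # tail buffer of the audio produced so far
        head = nxt[:k]       # head fusion region of the next segment
        mixed = []
        for j in range(k):
            fade_in = (j + 1) / (k + 1)  # rises toward 1 across the overlap
            mixed.append(tail[j] * (1.0 - fade_in) + head[j] * fade_in)
        stream[-k:] = mixed          # the overlap is emitted once, blended
        stream.extend(nxt[k:])       # the rest of the next segment follows
    return stream


K = 4
frame_segments = [[0, 1, 2, 3], [3, 4, 5, 6], [6]]  # T = 7, t = 3, one-frame overlap
speech = [toy_vocoder(seg, K) for seg in frame_segments]
stream = fuse(speech, K)
print(len(stream))  # 28
```

Because each overlap is emitted only once, the stream length equals K times the total number of frames T, so the extra frame per segment adds no duration.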

[0051] In a second aspect, embodiments of the present invention provide a speech conversion device, the device comprising:

[0052] The voice information acquisition module is used to acquire the voice information to be processed.

[0053] The encoding modeling module is used to encode and model the speech content, environmental noise, and fundamental frequency information in the speech information to be processed based on a three-head encoder, so as to obtain the encoded and modeled speech information; wherein, the three-head encoder consists of a speech content encoder, an environmental noise encoder, and a prosodic wave encoder;

[0054] The speech rate adjustment module is used to modify the timing of the encoded and modeled speech information in order to adjust the speech rate of the encoded speech information.

[0055] The target acoustic feature acquisition module is used to input the speech information after adjusting the speech rate into a pre-trained timbre conversion model corresponding to the target user to obtain the target acoustic features;

[0056] Wherein, the speech content of the target acoustic feature is the same as the speech content of the speech information, and the timbre of the target acoustic feature is the same as the timbre of the target user.

[0057] Optionally, the encoding modeling module is further configured to map the speech content to a high-dimensional space through autoencoding based on the speech content encoder to obtain a high-order feature representation of the speech content; wherein the high-order feature representation is in the form of a two-dimensional matrix;

[0058] The environmental noise is extracted based on the environmental noise encoder, and the environmental noise is mapped to high-order environmental sound features; wherein the environmental sound features are in the form of a two-dimensional matrix;

[0059] Based on the prosodic wave encoder, the fundamental frequency information is extracted and the fundamental frequency information is discretized to obtain the prosodic wave encoding result; wherein, the prosodic wave encoding result is in the form of a one-dimensional matrix;

[0060] The environmental sound features and the prosodic wave encoding results are respectively converted into formats to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the higher-order feature expression.

[0061] The higher-order feature representation of the speech content, the higher-order ambient sound features, and the prosodic wave coding results are superimposed to obtain the encoded speech information.

[0062] Optionally, the speech rate adjustment module is further configured to perform extended linear interpolation on the encoded speech information in a temporal sequence to slow down the speech rate of the encoded speech information; including:

[0063] Receive a speech rate slowing instruction; wherein the speech rate slowing instruction is used to indicate that the adjusted speech rate is 1 / N times the original speech rate, where N is a positive integer greater than 1;

[0064] According to the speech rate slowing instruction, the encoded values of the encoded speech information are expanded; wherein, the expansion is: arranging the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, inserting N-1 empty positions after each encoded value, until N-1 empty positions have been inserted after the end of the sequence.

[0065] Linear interpolation is performed on the empty positions to slow down the speech rate of the encoded speech information. The value inserted at an empty position before the end of the sequence is determined from the two adjacent original encoded values before and after that position, and the value inserted at an empty position after the end of the sequence is determined from the encoded value at the end of the sequence.

[0066] Optionally, the speech rate adjustment module is further configured to perform temporal deletion operations on the encoded and modeled speech information to speed up the speech rate of the encoded speech information; including:

[0067] Receive a speech rate increase instruction; wherein the speech rate increase instruction is used to indicate that the adjusted speech rate is A times the original speech rate, where A is a positive integer greater than 1;

[0068] According to the speech rate increase instruction, encoded values of the encoded speech information are deleted; wherein, the deletion operation is: arranging the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, deleting the A-1 encoded values that follow each retained encoded value, until the end of the sequence.

[0069] Optionally, the device further includes:

[0070] The timbre expression feature determination module is used to determine the timbre expression features of the target user before inputting the speech information after adjusting the speech rate into the pre-trained timbre conversion model to obtain the target acoustic features;

[0071] The training module is used to adaptively train the general timbre conversion model based on the timbre expression features to obtain the timbre conversion model;

[0072] The general timbre conversion model is trained on a preset amount of voice data from a preset number of users, together with the Mel feature data and the timbre expression features corresponding to that voice data.

[0073] Optionally, the timbre expression feature determination module is also used to acquire the voice information of the target user;

[0074] The voice information of the target user is converted into Mel acoustic features;

[0075] The Mel acoustic features are input into the TDNN network to obtain global timbre features;

[0076] The global timbre features are averaged to obtain the timbre expression features.

[0077] Optionally, the device further includes:

[0078] The slicing module is used to slice the target acoustic features after inputting the speech information with adjusted speech rate into a pre-trained timbre conversion model to obtain the target acoustic features, thereby obtaining acoustic segments of the target acoustic features.

[0079] The speech segment generation module is used to input the acoustic segment into a vocoder and convert it into time-domain sampling points to generate a speech segment;

[0080] The speech stream generation module is used to smoothly fuse the speech segments to generate a speech stream.

[0081] Optionally, the slicing module is further configured to slice the target acoustic feature based on a preset segment length t to obtain M basic acoustic segments of length t; wherein T is the total length of the target acoustic feature, and both T and t are positive integers. When T is divisible by t, M = T / t; when T is not divisible by t, M is the integer quotient of T / t plus 1.

[0082] At the end of each basic acoustic segment, an additional acoustic feature of one unit length is taken.

[0083] The basic acoustic segment is superimposed with the selected acoustic feature of one unit length to obtain M acoustic segments of length t+1.

[0084] Optionally, the speech segment generation module is further configured to input an acoustic segment of length t+1 into the vocoder to generate a speech segment of length K*(t+1).

[0085] Each unit length of acoustic feature is converted into K time-domain sampling points by the vocoder, where K is a positive integer.

[0086] Optionally, the speech stream generation module is further configured to set the last K time-domain sampling points of the last unit length in each speech segment except the last speech segment as a tail buffer.

[0087] In each speech segment except the first one, the first K time-domain sampling points, corresponding to the first unit length, are set as the head fusion region;

[0088] For two adjacent speech segments, the loudness of the tail buffer of the earlier speech segment is attenuated, and the loudness of the head fusion region of the later speech segment is enhanced;

[0089] All processed speech segments are merged and spliced together to generate a speech stream.

[0090] In a third aspect, embodiments of the present invention provide an electronic device, including: a processor, a memory, and a program stored in the memory and executable on the processor, wherein when the program is executed by the processor, it implements the steps of the speech conversion method as described in the first aspect.

[0091] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the speech conversion method as described in the first aspect.

[0092] Therefore, the voice information to be processed can be converted in timbre as needed, and the conversion method is more efficient and accurate; multi-dimensional encoding of voice information can accurately extract human voice information from the voice information to be processed, while discarding irrelevant information such as environmental noise, improving the robustness of voice in noisy environments and improving the clarity of voice information; speech rate control allows users to adjust the speech rate according to actual needs, which can improve the user experience.

Attached Figure Description

[0093] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:

[0094] Figure 1 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0095] Figure 2 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0096] Figure 3 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0097] Figure 4 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0098] Figure 5 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0099] Figure 6 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0100] Figure 7 is a flowchart of a speech conversion method provided in an embodiment of the present invention;

[0101] Figure 8 is a schematic diagram illustrating the fusion and splicing of speech segments provided in an embodiment of the present invention;

[0102] Figure 9 is a structural block diagram of a speech conversion device provided in an embodiment of the present invention;

[0103] Figure 10 is a structural block diagram of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0104] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0105] Figure 1 shows a speech conversion method according to an embodiment of the present invention, the method comprising:

[0106] Step S101: Obtain the voice information to be processed;

[0107] Step S102: Based on the three-head encoder, the speech content, environmental noise and fundamental frequency information in the speech information to be processed are encoded and modeled to obtain the encoded and modeled speech information.

[0108] Step S103: Modify the timing of the encoded speech information to adjust the speech rate of the encoded speech information;

[0109] Step S104: Input the speech information after adjusting the speech rate into the pre-trained timbre conversion model corresponding to the target user to obtain the target acoustic features;

[0110] Wherein, the speech content of the target acoustic feature is the same as the speech content of the speech information, and the timbre of the target acoustic feature is the same as the timbre of the target user.

[0111] In step S101, the voice information to be processed can be obtained through professional recording equipment (such as recording voice through a microphone and then transmitting the recording file to a computer or the cloud), online speech recognition API (Application Programming Interface), etc. The voice information acquisition method can be selected according to actual needs.

[0112] In step S102, after acquiring the speech information to be processed, the speech information can be input into a three-head encoder for encoding. The three-head encoder can also be a multi-head encoder. "Head" can be understood as dimension. That is, the speech information to be processed is encoded in multiple dimensions by a multi-head encoder. Multi-dimensional encoding can accurately extract the human voice information in the speech information to be processed, while discarding irrelevant information such as environmental noise, thereby improving the clarity of the speech information.

[0113] In one possible implementation, as shown in Figure 2, step S102, in which the speech content, environmental noise, and fundamental frequency information in the speech information to be processed are encoded and modeled based on a three-head encoder to obtain the encoded and modeled speech information, includes:

[0114] Step S1021: Based on the speech content encoder, the speech content is mapped to a high-dimensional space through autoencoding to obtain the high-order feature representation of the speech content;

[0115] Wherein, the high-order feature representation is in the form of a two-dimensional matrix;

[0116] Step S1022: Extract environmental noise based on the environmental noise encoder and map the environmental noise into high-order environmental sound features;

[0117] Wherein, the environmental sound features are in the form of a two-dimensional matrix;

[0118] Step S1023: Extract fundamental frequency information based on the prosodic wave encoder, and discretize the fundamental frequency information to obtain the prosodic wave encoding result;

[0119] Wherein, the prosodic wave encoding result is in the form of a one-dimensional matrix;

[0120] Step S1024: Convert the format of the environmental sound features and prosodic wave encoding results respectively to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the higher-order feature expression.

[0121] Step S1025: Superimpose the high-order feature representation of the speech content, the high-order ambient sound features, and the prosodic wave coding results to obtain the encoded speech information.

[0122] In the implementation shown in Figure 2, the three-head encoder consists of a speech content encoder, an ambient noise encoder, and a prosodic wave encoder. Correspondingly, the speech content, ambient noise, and fundamental frequency information in the input speech information to be processed can be encoded and modeled separately. Before encoding and modeling, Mel features can be used to represent the speech content information. For example, the speech content information can be represented as a two-dimensional matrix of [T, 80], the ambient noise can be represented as a two-dimensional matrix of [T, 256], and the fundamental frequency information can be represented as a one-dimensional matrix of [T].

[0123] It should be noted that Mel features are acoustic features designed specifically for human auditory perception. By considering the frequency perception characteristics of the human auditory system, they divide the frequency axis into several Mel frequency bands, thus more accurately reflecting human auditory characteristics. Mel features are commonly used in speech signal processing, speaker recognition, and voiceprint recognition. In the three matrices mentioned above, 80 and 256 represent the dimensions of the Mel features, while T represents the temporal length of the Mel features. 80 and 256 can be modified according to actual needs. Generally, a smaller number of dimensions can reduce model complexity but will result in some information loss, while a larger number of dimensions can extract more information but will increase the computational load of the model. In practical applications, a comprehensive trade-off between accuracy and efficiency should be struck to select an appropriate Mel feature dimension. In the above implementation, 80 and 256 are selected as the Mel feature dimensions for speech content information and environmental noise, respectively.

[0124] The speech content encoder in the three-head encoder consists of coding blocks based on a self-attention mechanism. The speech content is self-encoded and mapped to a high-dimensional space, resulting in a high-order feature representation of the speech content, which is a two-dimensional matrix of [T, 512]. The dimension of the Mel features in the two-dimensional matrix can also be set according to actual needs. In this implementation, a dimension of 512 is used as an example. It should be noted that the high-dimensional space can be understood as the hidden feature space. In the high-dimensional space, low-dimensional features can be decoupled to form higher-dimensional features (e.g., from 80 dimensions to 512 dimensions). The decoupled features are clearer and can be processed by the subsequent decoder (the timbre conversion model corresponding to the target user).

[0125] The environmental noise encoder in the three-head encoder is composed of a one-dimensional convolutional network. The environmental noise in the speech information to be processed will be identified and mapped into high-order environmental sound features, which are expressed as a two-dimensional matrix of [T, 256]. Similarly, 256 is only an example of the dimension of the Mel feature.

[0126] The prosodic wave encoder in the three-head encoder works by using the YIN fundamental frequency extraction algorithm to extract the fundamental frequency information of the speech information to be processed, and then discretizing the extracted fundamental frequency values to obtain the prosodic wave coding result. The prosodic wave coding result is expressed as a one-dimensional matrix [T].
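The source does not specify how the fundamental frequency values are discretized. The sketch below shows one common choice, logarithmically spaced bins with a reserved code for unvoiced frames, purely as an assumption; the function name `quantize_f0`, the bin count and the 50 to 600 Hz range are all hypothetical. It takes per-frame fundamental frequency values as input, such as those a YIN implementation would produce, and does not perform the extraction itself.

```python
import math


def quantize_f0(f0_hz, n_bins=256, fmin=50.0, fmax=600.0):
    """Map per-frame F0 values in Hz to integer codes; 0 marks unvoiced frames."""
    log_min, log_max = math.log(fmin), math.log(fmax)
    codes = []
    for f in f0_hz:
        if f <= 0.0:
            codes.append(0)  # reserved code for frames without a pitch
            continue
        f = min(max(f, fmin), fmax)  # clamp into the supported range
        pos = (math.log(f) - log_min) / (log_max - log_min)  # position in [0, 1]
        # Voiced frames use codes 1 .. n_bins - 1.
        codes.append(1 + min(int(pos * (n_bins - 1)), n_bins - 2))
    return codes


print(quantize_f0([0.0, 50.0, 110.0, 220.0, 600.0]))
```

The result has one integer per frame, which matches the one-dimensional [T] form of the prosodic wave coding result.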

[0127] After obtaining the high-order feature representation of the speech content, the high-order ambient sound features, and the prosodic wave coding results, a fully connected layer can be used to convert the high-order ambient sound features into a two-dimensional matrix of size [T, 512], thereby expanding their Mel feature dimension from 256 to 512. The prosodic wave coding results are likewise expanded: they are first converted from [T] to a matrix of size [T, 1], and a one-dimensional convolution is then used to convert that into a matrix of size [T, 512].

[0128] It should be noted that since the high-order feature representation of the speech content is a two-dimensional matrix of [T, 512], this can be used as a benchmark to convert the format of the high-order ambient sound features and prosodic wave coding results separately, so as to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the high-order feature representation. 512 is only an illustrative example of Mel feature dimension; the final result is that the Mel feature dimensions of the high-order feature representation of the speech content, the high-order ambient sound features, and the prosodic wave coding results are all the same.

[0129] Finally, the high-order feature representation of the speech content, the high-order ambient sound features, and the prosodic wave coding results can be fused and superimposed to obtain the coding result of the three-head encoder (speech information after coding modeling). The three-head encoder can encode speech in multiple dimensions, greatly improve the robustness of the model in noisy environments, accurately extract human voice information from the speech information to be processed, and discard irrelevant information such as ambient noise, thereby improving the clarity of speech information.
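The dimension alignment and superposition described in the preceding paragraphs can be sketched as follows. This is a plain-Python illustration using nested lists, not the trained network. The weight matrices `w_ambient` and `w_prosody` are hypothetical stand-ins for the fully connected layer and the one-dimensional convolution, and the convolution is assumed to have kernel size 1 so that it reduces to a per-frame linear map.

```python
def linear(x, weight):
    """Apply a [D_in, D_out] weight matrix to every row of a [T, D_in] matrix."""
    return [
        [sum(xi * w for xi, w in zip(row, col)) for col in zip(*weight)]
        for row in x
    ]


def fuse(content, ambient, prosody, w_ambient, w_prosody):
    """Bring the three encoder outputs to [T, D] and add them element-wise.

    content : [T, D]    speech-content features
    ambient : [T, D_a]  ambient-sound features, projected by w_ambient [D_a, D]
    prosody : [T]       discretized F0 codes, expanded to [T, 1] and then
                        projected by w_prosody [1, D] (kernel-size-1
                        convolution, an assumption of this sketch)
    """
    ambient_d = linear(ambient, w_ambient)
    prosody_d = linear([[p] for p in prosody], w_prosody)
    return [
        [c + a + p for c, a, p in zip(row_c, row_a, row_p)]
        for row_c, row_a, row_p in zip(content, ambient_d, prosody_d)
    ]
```

In the text the shared dimension D is 512 and D_a is 256; the sketch works for any sizes as long as the three inputs share the same temporal length T.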

[0130] In step S103, the speech rate of the encoded and modeled speech information can be adjusted. The two implementations below describe how to slow down the speech rate and how to speed it up, respectively.

[0131] In one possible implementation, step S103, modifying the temporal sequence of the encoded and modeled speech information to adjust the speech rate, includes: performing extended linear interpolation on the encoded and modeled speech information in terms of temporal sequence to slow down the speech rate of the encoded speech information. As shown in Figure 3, this includes:

[0132] Step S1031: Receive the instruction to slow down the speech rate;

[0133] The speech rate slowing command indicates that the adjusted speech rate is 1 / N times the original speech rate, where N is a positive integer greater than 1.

[0134] Step S1032: Expand the encoded value of the encoded speech information according to the speech rate slowing instruction;

[0135] The expansion is as follows: arrange the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, insert N-1 empty positions after each position, including N-1 empty positions after the end of the sequence.

[0136] Step S1033: Perform linear interpolation on the empty spaces to slow down the speech rate of the encoded speech information;

[0137] The value inserted in an empty position before the end of the sequence is determined based on the two adjacent original encoded values before and after that position, and the value inserted in an empty position after the end of the sequence is determined based on the encoded value at the end of the sequence.

[0138] In the implementation shown in Figure 3, extended linear interpolation can be used to adjust the encoded and modeled speech information and slow down the speech rate. In an exemplary application scenario, assume the encoded values of the speech information are [1,2,3,4]. To slow the speech rate to half of the original (N = 2), the encoded value sequence is first expanded to [1,_,2,_,3,_,4,_]. The missing values in the middle are determined by linear interpolation between the encoded values before and after them, so the sequence becomes [1,1.5,2,2.5,3,3.5,4,_]. The missing value at the end is obtained by copying the previous value, so the final encoded value sequence is [1,1.5,2,2.5,3,3.5,4,4]. After this temporal expansion, the speech rate slows down. The speech rate can therefore be adjusted according to actual application needs, which improves the user experience and indirectly expands the applicability of the speech conversion method. Moreover, since interpolation is a linear method, it can maintain the sound quality and intonation of the original speech, avoiding audio distortion or unnaturalness.
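Steps S1031 to S1033 can be sketched in a few lines. The function name `slow_down` is hypothetical, and the sketch works on scalar encoded values as in the worked example; for real feature frames, the same interpolation would presumably be applied to each feature dimension.

```python
def slow_down(codes, n):
    """Stretch a sequence to n times its length by linear interpolation.

    After every original value, n - 1 values are inserted. Between two
    original values they are linearly interpolated; after the final value
    they repeat that value.
    """
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    out = []
    for i, value in enumerate(codes):
        out.append(value)
        # Neighbour after the gap; at the end of the sequence reuse the last value
        nxt = codes[i + 1] if i + 1 < len(codes) else value
        for k in range(1, n):
            out.append(value + (nxt - value) * k / n)
    return out
```

With `slow_down([1, 2, 3, 4], 2)` the result is `[1, 1.5, 2, 2.5, 3, 3.5, 4, 4.0]`, which reproduces the worked example.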

[0139] In one possible implementation, step S103, modifying the temporal sequence of the encoded speech information to adjust its speech rate, includes: performing a time-series deletion operation on the encoded speech information to speed up its speech rate. As shown in Figure 4, this includes:

[0140] Step S105: Receive the instruction to increase speech rate;

[0141] The speech rate increase command indicates that the adjusted speech rate is A times the original speech rate; A is a positive integer greater than 1.

[0142] Step S106: According to the speech rate adjustment instruction, delete the encoded value of the encoded speech information;

[0143] The deletion operation is as follows: arrange the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, delete the A-1 positions that follow each retained position, until the end of the sequence is reached.

[0144] In the implementation shown in Figure 4, values can be deleted to adjust the encoded and modeled speech information and speed up the speech rate. Specifically, the encoded values of the encoded speech information can be deleted according to the speech rate increase instruction. If the instruction indicates an acceleration of A times, the encoded values are arranged into a sequence and, starting from the beginning of the sequence, the A-1 positions that follow each retained position are deleted until the end of the sequence is reached. In an exemplary application scenario with A = 2, assuming the encoded values of the encoded speech information are [1,2,3,4], the sequence of encoded values after deletion is [1,3]. It is important to note that the first and last encoded values of the sequence should be retained. The speech rate can therefore be adjusted according to actual application needs, which improves the user experience and indirectly broadens the applicability of the speech conversion method.
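The deletion operation can be sketched as a strided selection. The function name `speed_up` is hypothetical. The sketch follows the worked example ([1,2,3,4] becomes [1,3] for A = 2); any special handling of the final encoded value is left out because the text does not describe how it would be carried out.

```python
def speed_up(codes, a):
    """Keep every a-th encoded value, dropping the a - 1 values after it."""
    if a < 2:
        raise ValueError("a must be an integer greater than 1")
    return list(codes[::a])
```

For a sequence of length T the result has ceil(T / a) values, so the speech rate after conversion is roughly A times the original.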

[0145] It should be noted that after step S102 is executed, there are two possible orders. Step S103 can be executed first, followed by step S104: the speech rate of the encoded and modeled speech information is adjusted first, and the speech information with adjusted speech rate is then input into the pre-trained timbre conversion model to obtain the target acoustic features. Alternatively, step S104 can be executed first, followed by step S103: the encoded and modeled speech information is input into the pre-trained timbre conversion model to obtain the target acoustic features, and the speech rate of the target acoustic features is then adjusted.

[0146] With the first order, the speech rate is controlled directly on the multidimensional encoding results. The speech rate can be adjusted easily by linear interpolation (inserting values), so audio processing operations such as speech rate adjustment are implemented through simple linear methods, which simplifies the audio processing pipeline and improves its efficiency.

[0147] In one possible implementation, before step S104, inputting the speech information after adjusting the speech rate into the pre-trained timbre conversion model to obtain the target acoustic features, the method further includes: determining the timbre expression features of the target user. Determining the timbre expression features of the target user includes: acquiring the speech information of the target user; converting the speech information of the target user into Mel acoustic features; inputting the Mel acoustic features into the TDNN network to obtain global timbre features; and performing mean processing on the global timbre features to obtain timbre expression features.

[0148] In specific application scenarios, the target speaker's speech can first be converted into Mel acoustic features. For example, the target speaker's speech can be converted into a two-dimensional matrix of [T, 80], where T is the temporal length of the Mel feature and 80 is the dimension of the Mel feature (80 is an example). Then, the Mel acoustic features are input into the TDNN network to obtain global timbre features. It should be noted that during model training, the model output is the speaker label (the speaker label refers to the speaker's identity information in the speech signal), and the cross-entropy loss function is used as the supervision signal. During inference, after the Mel features are input into the model, the output of the last fully connected layer can be extracted as the global timbre features, expressed as a two-dimensional matrix of [T, 256]. The global timbre features are then averaged over the time dimension, that is, [T, 256] is converted to [1, 256].
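The final averaging step, which turns the [T, 256] global timbre features into a single [1, 256] timbre expression feature, can be sketched as below. The function name `mean_pool` is hypothetical, and the TDNN network that produces the per-frame features is not modeled here.

```python
def mean_pool(frames):
    """Average a [T, D] matrix over the time axis, returning a [1, D] matrix."""
    if not frames:
        raise ValueError("at least one frame is required")
    t = len(frames)
    # zip(*frames) iterates over the D feature columns
    return [[sum(column) / t for column in zip(*frames)]]
```

Because the result no longer depends on T, recordings of different lengths yield timbre expression features of the same shape.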

[0149] It should be noted that in speech processing, Mel acoustic features are a commonly used way to represent timbre features. When input into a TDNN network, dynamic information about timbre changes can be extracted. These dynamic timbre features can be used in scenarios such as speaker recognition and emotion recognition of speech signals.

[0150] Furthermore, model training typically requires input samples and corresponding labels, and the model's output needs to be compared with the labels to calculate the error. In speech signal processing, if tasks such as speech recognition or speaker identification are required, the speaker label corresponding to each speech sample needs to be used as a supervision signal to guide model training. During model training, the error needs to be calculated by comparing the model's output with the true labels, and then the error is used to adjust the model parameters. A commonly used loss function is the cross-entropy loss function.

[0151] After obtaining the timbre expression features by implementing the above method, the general timbre conversion model can be adaptively trained based on the timbre expression features to obtain the timbre conversion model corresponding to the target user. The general timbre conversion model is trained based on the Mel feature data and timbre feature expression corresponding to the speech data of a preset number of users.

[0152] It should be noted that current deep learning-based voice-changing technologies require hours of data for model training, which is time-consuming and labor-intensive. However, the method described in this invention—adaptively training a general timbre conversion model based on timbre expression features to obtain a timbre conversion model corresponding to the target user—can extract the timbre features of the target speaker based on a small amount of data (e.g., minutes of data) and use it for subsequent small-sample learning of the model. This significantly reduces the amount of training data required for the voice-changing model while accurately representing the timbre features of the target speaker.

[0153] Specifically, few-shot learning methods require only minutes of data (e.g., twenty minutes) to train the voice-changing model (the aforementioned timbre conversion model), reducing training steps and enabling rapid convergence and construction, thus significantly improving model building efficiency. In specific applications, timbre features can be input into the pre-trained model (the aforementioned general timbre conversion model). During pre-training, a 300-hour dataset of Chinese and English speech from thousands of speakers can be obtained, processed into multi-dimensional Mel features (e.g., 80-dimensional), and the timbre features of each speech segment can be extracted. The general timbre conversion model is then trained based on the multi-dimensional Mel features and timbre feature expressions, outputting Mel feature data. After pre-training, the parameters of the encoder module can be frozen, and the Adam optimizer can be used to iterate the pre-trained model (using a fixed learning rate for Adam optimizer parameter settings, with up to 2000 iterations). Furthermore, the model can employ an encoder-decoder architecture, and the loss function can be L2 Loss. Therefore, under limited data conditions, the model's generalization performance can be improved, enhancing its application effectiveness. Furthermore, it can achieve accurate model training results with less data, and can also improve the efficiency of model training, thereby improving the efficiency of speech conversion.

[0154] It's important to note that when fine-tuning the model using the Adam optimizer, the encoder module parameters can be kept fixed and not updated; only the remaining parts of the model can be optimized. The Adam algorithm, a commonly used stochastic gradient descent optimization algorithm, can improve the model's convergence speed and generalization ability. By fine-tuning the model, a model more adapted to the target speaker's timbre can be obtained, thus achieving better timbre conversion results.
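The idea of fine-tuning with a frozen encoder can be illustrated with a deliberately tiny toy, not the actual timbre conversion model. The "model" below is a scalar product `dec * (enc * x)`, where `enc` stands in for the frozen encoder parameters and `dec` for the remaining trainable parameters. The update rule is the standard Adam step with a fixed learning rate, the loss is the mean squared (L2) error, and the iteration cap of 2000 mirrors the figure mentioned above. All names and hyperparameter values are assumptions of this sketch.

```python
import math


def fine_tune(xs, ys, enc, dec, lr=0.01, steps=2000,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit y ~ dec * (enc * x) with an L2 loss, updating only `dec`.

    `enc` is treated as frozen: no gradient is computed for it and it is
    returned unchanged.
    """
    m = v = 0.0  # Adam first and second moment estimates for `dec`
    for t in range(1, steps + 1):
        # Gradient of the mean squared error with respect to `dec` only
        grad = sum(2 * (dec * enc * x - y) * enc * x
                   for x, y in zip(xs, ys)) / len(xs)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)  # bias-corrected moments
        v_hat = v / (1 - beta2 ** t)
        dec -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return enc, dec
```

In a deep-learning framework the same effect is usually obtained by excluding the encoder parameters from the optimizer or disabling their gradients; the toy only makes the freezing explicit.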

[0155] In one possible implementation, as shown in Figure 5, after step S104, in which the speech information after adjusting the speech rate is input into the pre-trained timbre conversion model to obtain the target acoustic features, the method further includes:

[0156] Step S501: Slice the target acoustic features to obtain acoustic segments of the target acoustic features;

[0157] Step S502: Input the acoustic segment into the vocoder and convert it into time-domain sampling points to generate a speech segment;

[0158] Step S503: Perform smooth fusion processing on the speech segments to generate a speech stream.

[0159] The above steps will now be described in detail. In one possible implementation, as shown in Figure 6, step S501, slicing the target acoustic features to obtain acoustic segments of the target acoustic features, includes:

[0160] Step S601: Using the preset segment length t as a reference, slice the target acoustic features to obtain M basic acoustic segments of length t;

[0161] Here, T is the total length of the target acoustic feature, and both T and t are positive integers. When T is divisible by t, M = T / t; when T is not divisible by t, M is the integer quotient of T / t plus 1.

[0162] Step S602: At the end of each basic acoustic segment, take an additional acoustic feature of one unit length.

[0163] Step S603: Superimpose the basic acoustic segment and the selected acoustic feature of one unit length to obtain M acoustic segments of length t+1.
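Steps S601 to S603 can be sketched as follows. The function name `slice_with_lookahead` is hypothetical. The text does not say what happens when fewer than t+1 frames remain (the last segment, or a final segment shorter than t), so this sketch pads by repeating the last available frame; that padding rule is an assumption.

```python
def slice_with_lookahead(frames, t):
    """Split frames into M = ceil(T / t) segments of length t + 1.

    Each segment holds t base frames plus one extra frame taken right
    after its end. Missing frames at the end of the input are filled by
    repeating the last available frame (assumption of this sketch).
    """
    if t < 1:
        raise ValueError("t must be a positive integer")
    if not frames:
        return []
    total = len(frames)
    m = -(-total // t)  # ceiling division: T/t, or quotient + 1 if not divisible
    segments = []
    for i in range(m):
        segment = list(frames[i * t:(i + 1) * t + 1])
        while len(segment) < t + 1:
            segment.append(segment[-1])
        segments.append(segment)
    return segments
```

For ten frames and t = 4 this gives three segments, and the extra frame of each segment equals the first frame of the next one, which is what later allows the two regions to be blended.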

[0164] In one possible implementation, step S502, inputting an acoustic segment into the vocoder and generating a speech segment, includes: inputting an acoustic segment of length t+1 into the vocoder to generate a speech segment of length K*(t+1); wherein each unit length of acoustic feature is converted into K time-domain sampling points by the vocoder, where K is a positive integer.

[0165] In one possible implementation, as shown in Figure 7, step S503, smoothing and fusing the speech segments to generate a speech stream, includes:

[0166] Step S701: In each speech segment except the last one, set the K time-domain sampling points of the last unit length as the tail buffer;

[0167] Step S702: In each speech segment except the first one, set the K time-domain sampling points of the first unit length as the head fusion region;

[0168] Step S703: In two adjacent speech segments, the loudness of the tail buffer of the first speech segment is attenuated, and the loudness of the head fusion region of the second speech segment is enhanced.

[0169] Step S704: Merge and splice all processed speech segments to generate a speech stream.

[0170] Slicing is performed based on a preset segment length t. Each slice yields a segment of length t+1 (the basic acoustic segment plus one additional unit of acoustic feature taken at its end). Each unit-length acoustic feature is converted by the vocoder into K (e.g., K = 256) time-domain sampling points, so the input acoustic segment is converted into a speech segment of length 256*(t+1). If this speech segment were returned directly, jagged sounds would occur at the connection between the current and the next speech segment, affecting sound quality. To solve this problem, the method shown in Figure 7 can be used: the sampling points of the last unit length in a speech segment are set as the tail buffer, and the sampling points of the first unit length are set as the head fusion region. Linear loudness attenuation is applied to the tail buffer of the preceding speech segment, and linear loudness enhancement is applied to the head fusion region of the following speech segment. After a smooth transition through fusion and splicing, a speech stream is generated and streaming output is achieved (the fusion and splicing of the tail buffer and the head fusion region is shown in Figure 8). This enables the smooth splicing and output of discrete speech segments, ensuring a smooth transition between speech segments, improving the fluency of speech, solving the problem of jagged sounds, and enhancing speech quality.

[0171] The methods of loudness attenuation and loudness enhancement will now be explained.

[0172] When performing loudness attenuation, the following linear attenuation formula can be used (taking K=256 as an example):

[0173] y = (1 - w) * x; (1)

[0174] w = n / 256; (2)

[0175] Here, n is the distance of each sampling point from the start of the tail buffer. As shown in Figure 8, counting from left to right in the tail buffer, the distance of the first sampling point on the left is 0, the distance of the second sampling point is 1, the distance of the third sampling point is 2, and the distance of the (n+1)th sampling point is n (where n is a natural number). w is the weight of the tail buffer, x is the value of each sampling point, and y is the linearly attenuated value.

[0176] When enhancing loudness, the following linear enhancement formula can be used (taking K=256 as an example):

[0177] j = r * x; (3)

[0178] r = n / 256; (4)

[0179] Here, n is the distance of each sampling point from the start of the head fusion region. As shown in Figure 8, counting from left to right in the head fusion region, the distance of the first sampling point on the left is 0, the distance of the second sampling point is 1, the distance of the third sampling point is 2, and the distance of the (n+1)th sampling point is n (where n is a natural number). r is the weight of the head fusion region, x is the value of each sampling point, and j is the linearly enhanced value.

[0180] The implementation shown in Figure 7 provides a linear method to attenuate the loudness of the tail buffer in the previous speech segment and enhance the loudness of the head fusion region in the next speech segment. The attenuated and enhanced regions are then added together and fused. This ensures a smooth transition between the previous and next speech segments, solves the problem of jagged sounds at the segment connection, realizes streaming speech output, enhances sound quality, and broadens application scenarios (e.g., streaming speech output can be applied to live streaming scenarios).
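Steps S701 to S704 together with the linear attenuation and enhancement weights can be sketched as a linear cross-fade. The function name `crossfade_concat` is hypothetical, and the sketch assumes that every speech segment contains at least K samples and that the tail buffer of one segment and the head fusion region of the next cover the same stretch of audio, so the two weighted regions can be added sample by sample.

```python
def crossfade_concat(segments, k):
    """Splice speech segments, blending each tail buffer with the next head region.

    For sample n (0 <= n < k) of the overlap:
      tail weight  1 - n / k  (linear attenuation)
      head weight  n / k      (linear enhancement)
    The tail of the last segment is kept unchanged.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not segments:
        return []
    out = list(segments[0])
    for nxt in segments[1:]:
        tail = out[-k:]  # tail buffer of the audio produced so far
        del out[-k:]
        for n in range(k):
            weight = n / k
            out.append((1 - weight) * tail[n] + weight * nxt[n])
        out.extend(nxt[k:])  # remainder of the next segment, including its tail
    return out
```

Each splice shortens the total length by K samples, because the tail buffer and the head fusion region are merged into a single region of K samples.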

[0181] In summary, in this embodiment of the invention, the speech of the target speaker is first collected for timbre modeling to obtain timbre expression features, and few-shot learning based on these features yields a timbre conversion model corresponding to the target speaker. Next, the speech information to be converted is multi-head encoded, and the results of the multi-head encoding are interpolated (or pruned) in the temporal dimension as needed to adjust the speech rate. The speech information with adjusted speech rate is then input into the timbre conversion model to obtain target acoustic features that have the same speech content as the source speech and the same timbre as the target user. Finally, the target acoustic features are sliced, the slices are input into a vocoder to generate speech segments, and the generated speech segments are smoothed and fused to generate a speech stream. Therefore, with few-shot learning, only a small amount of speech data (e.g., 3-5 minutes) is needed to convert input speech into the voice of any person. By encoding the speech in multiple dimensions, linear interpolation (or deletion) can be used to easily control the speech rate, while improving robustness in noisy environments. Through speech slicing and smooth fusion, streaming output can be achieved, resulting in high-quality real-time voice changing effects.

[0182] Furthermore, the method shown in the embodiments of the present invention also has the following technical effects:

[0183] It can help solve user pain points, has a wide range of applications, and has a large potential market. For example, it can provide voice conversion support for the person behind a virtual character, enabling the production of corresponding dubbing for subsequent animation generation; it can be used in any live streaming application (including the currently popular live streaming applications driven by the person behind a virtual character); and it can be applied to other voice interaction and voice-changing entertainment scenarios.

[0184] Specifically, when operating a digital human IP, the digital human's appearance and voice need to remain unchanged. However, in the current market, there are frequent issues where the voice of the digital human changes after the person behind the character is replaced. The method shown in this invention allows any person's voice to be converted to the same timbre, so even if the person behind the character is changed, the voice of the currently operating digital human will not change. Furthermore, in various voice interaction scenarios, users may use voice-changing technology to hide their real voice for privacy or entertainment reasons. The method shown in this invention can help users convert their own voice into any voice.

[0185] Figure 9 shows a speech conversion device according to an embodiment of the present invention. As shown in Figure 9, the device 90 includes:

[0186] The voice information acquisition module 901 is used to acquire voice information to be processed.

[0187] The encoding modeling module 902 is used to encode and model the speech content, environmental noise and fundamental frequency information in the speech information to be processed based on a three-head encoder, so as to obtain the encoded and modeled speech information; wherein, the three-head encoder consists of a speech content encoder, an environmental noise encoder and a prosodic wave encoder;

[0188] The speech rate adjustment module 903 is used to modify the timing of the encoded speech information to adjust the speech rate of the encoded speech information.

[0189] The target acoustic feature acquisition module 904 is used to input the speech information after adjusting the speech rate into a pre-trained timbre conversion model corresponding to the target user to obtain the target acoustic features;

[0190] Among them, the speech content of the target acoustic feature is the same as the speech content of the speech information, and the timbre of the target acoustic feature is the same as the timbre of the target user.

[0191] In one possible implementation, the encoding modeling module 902 is further used to map the speech content to a high-dimensional space through an autoencoder based on the speech content encoder to obtain a high-order feature representation of the speech content; wherein the high-order feature representation is in the form of a two-dimensional matrix;

[0192] Environmental noise is extracted based on an environmental noise encoder, and the environmental noise is mapped to high-order environmental sound features; wherein, the environmental sound features are in the form of a two-dimensional matrix;

[0193] The fundamental frequency information is extracted based on the prosodic wave encoder, and the fundamental frequency information is discretized to obtain the prosodic wave coding result; wherein, the prosodic wave coding result is in the form of a one-dimensional matrix;

[0194] The formats of the environmental sound features and the prosodic wave coding results are converted to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the high-order feature representation.

[0195] The high-order feature representation of the speech content, the high-order ambient sound features, and the prosodic wave coding results are superimposed to obtain the encoded speech information.

[0196] In one possible implementation, the speech rate adjustment module 903 is further used to perform temporal augmented linear interpolation on the encoded speech information to slow down the speech rate of the encoded speech information; including:

[0197] Receive a speech rate slowdown command; the speech rate slowdown command indicates that the adjusted speech rate is 1 / N times the original speech rate, where N is a positive integer greater than 1;

[0198] According to the speech rate slowing instruction, the encoded values of the encoded speech information are expanded; the expansion is as follows: the encoded values are arranged into a sequence of encoded values and, starting from the beginning of the sequence, N-1 empty positions are inserted after each position, until the end of the sequence.

[0199] Linear interpolation is performed on the empty positions to slow down the speech rate of the encoded speech information. The interpolated value at an empty position is determined based on the two adjacent original encoded values before and after that position.

[0200] In one possible implementation, the speech rate adjustment module 903 is further used to perform temporal deletion operations on the encoded speech information to speed up the speech rate of the encoded speech information; including:

[0201] Receive speech rate increase command; the speech rate increase command indicates that the adjusted speech rate is A times the original speech rate, where A is a positive integer greater than 1;

[0202] According to the speech rate adjustment instruction, the encoded values of the encoded speech information are deleted. The deletion operation is as follows: the encoded values are arranged into a sequence of encoded values and, starting from the beginning of the sequence, the A-1 positions that follow each retained position are deleted, until the end of the sequence.

[0203] In one possible implementation, device 90 further includes:

[0204] The timbre expression feature determination module is used to determine the timbre expression features of the target user before inputting the speech information after adjusting the speech rate into the pre-trained timbre conversion model to obtain the target acoustic features;

[0205] The training module is used to adaptively train a general timbre conversion model based on timbre expression features to obtain a timbre conversion model.

[0206] The general timbre conversion model is trained based on the Mel feature data and the timbre feature representations corresponding to the speech data of a preset number of users.

[0207] In one possible implementation, the timbre expression feature determination module is also used to acquire the voice information of the target user;

[0208] Convert the target user's voice information into Mel acoustic features;

[0209] The Mel acoustic features are input into the TDNN network to obtain global timbre features;

[0210] The global timbre features are averaged to obtain the timbre expression features.

[0211] In one possible implementation, device 90 further includes:

[0212] The slicing module is used to slice the target acoustic features after inputting the speech information with adjusted speech rate into a pre-trained timbre conversion model to obtain the target acoustic features.

[0213] The speech segment generation module is used to input acoustic segments into the vocoder, which converts them into time-domain sampling points to generate speech segments.

[0214] The speech stream generation module is used to smoothly merge speech segments to generate a speech stream.

[0215] In one possible implementation, the slicing module is further used to slice the target acoustic feature based on a preset segment length t to obtain M basic acoustic segments of length t; where T is the total length of the target acoustic feature, and both T and t are positive integers. When T is divisible by t, M = T / t; when T is not divisible by t, M is the integer quotient of T / t plus 1.

[0216] At the end of each basic acoustic segment, an additional acoustic feature of one unit length is taken.

[0217] By superimposing the basic acoustic segment with the selected acoustic feature of one unit length, M acoustic segments of length t+1 are obtained.

[0218] Optionally, the speech segment generation module is also used to input an acoustic segment of length t+1 into the vocoder to generate a speech segment of length K*(t+1).

[0219] Each unit length of acoustic feature is converted into K time-domain sampling points by a vocoder, where K is a positive integer.

[0220] In one possible implementation, the speech stream generation module is further configured to set, in each speech segment except the last one, the K time-domain sampling points of the last unit length as the tail buffer.

[0221] In each speech segment except the first one, the K time-domain sampling points of the first unit length are set as the head fusion region;

[0222] In two adjacent speech segments, the loudness of the tail buffer of the first speech segment is attenuated, and the loudness of the head fusion region of the second speech segment is enhanced.

[0223] All processed speech segments are merged and spliced together to generate a speech stream.

[0224] The multi-head encoder used in this invention significantly improves the robustness and generalization of the voice-changing model, making stable voice changing feasible in noisy and complex environments. The speech rate controller makes the speech rate after voice changing controllable, thereby improving the flexibility of the voice-changing technology. For example, in live streaming scenarios, if the speaker speaks too fast, the speech rate can be slowed down as needed to improve the listening experience. The buffered spectrogram slicing method performs buffered slicing processing on acoustic features, resulting in better continuity between consecutive speech segments. The waveform segment smoothing method ensures smooth transitions between speech segments.

[0225] Therefore, by learning from few samples, only a small amount of speech data (e.g., 3-5 minutes) is needed to convert input speech into the voice of any person. By encoding the speech in multiple dimensions, linear interpolation (or deletion) can be used to easily control the speech rate, while improving robustness in noisy environments. Through speech slicing and smooth fusion, streaming output can be achieved, resulting in high-quality real-time voice changing effects.

[0226] This invention also provides an electronic device 100. As shown in Figure 10, it includes: a processor 1001, a memory 1002, and a program stored on the memory 1002 and executable on the processor 1001. When the program is executed by the processor, it implements the steps of the speech conversion method shown in the above embodiments.

[0227] This invention also provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the steps of the speech conversion method shown in the above embodiments and achieves the same technical effect. To avoid repetition, it will not be described again here. The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, etc.

[0228] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0229] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk), and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0230] The embodiments of the present invention have been described above with reference to the accompanying drawings. However, the present invention is not limited to the specific embodiments described above, which are merely illustrative and not restrictive. Under the guidance of the present invention, those skilled in the art can devise many other forms without departing from the spirit of the invention and the scope of the claims, and all of these forms fall within the protection scope of the present invention.

Claims

1. A speech conversion method, characterized in that the method includes: acquiring the speech information to be processed; based on a three-head encoder, encoding and modeling the speech content, environmental noise, and fundamental frequency information in the speech information to be processed, respectively, to obtain encoded and modeled speech information, wherein the three-head encoder consists of a speech content encoder, an environmental noise encoder, and a prosodic wave encoder; changing the timing of the encoded and modeled speech information to adjust the speech rate of the encoded speech information; and inputting the speech information with the adjusted speech rate into a pre-trained timbre conversion model corresponding to the target user to obtain target acoustic features, wherein the speech content of the target acoustic features is the same as the speech content of the speech information, and the timbre of the target acoustic features is the same as the timbre of the target user; wherein encoding and modeling the speech content, environmental noise, and fundamental frequency information respectively based on the three-head encoder includes: based on the speech content encoder, autoencoding the speech content and mapping it to a high-dimensional space to obtain a high-order feature representation of the speech content, wherein the high-order feature representation is in the form of a two-dimensional matrix; based on the environmental noise encoder, extracting the environmental noise and mapping it to high-order environmental sound features, wherein the environmental sound features are in the form of a two-dimensional matrix; based on the prosodic wave encoder, extracting the fundamental frequency information and discretizing it to obtain a prosodic wave encoding result, wherein the prosodic wave encoding result is in the form of a one-dimensional matrix; converting the format of the environmental sound features and of the prosodic wave encoding result, respectively, to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the high-order feature representation; and superimposing the high-order feature representation of the speech content, the high-order environmental sound features, and the prosodic wave encoding result to obtain the encoded and modeled speech information.
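As an illustration of the format conversion and superposition steps in claim 1, the following is a minimal sketch only. The claim does not state how the fundamental frequency is discretized, how the one-dimensional result is expanded to two dimensions, or what "superimposed" means numerically. The sketch assumes uniform binning of F0 in Hz, tiling of each frame's code across the Mel dimension, element-wise addition, and environmental sound features that already share the content matrix's shape. The names `discretize_f0`, `expand_to_2d`, and `superimpose`, and the parameters `f0_min`, `f0_max`, and `n_bins`, are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # rows = time frames, columns = Mel feature dimension


def discretize_f0(f0_hz: List[float], f0_min: float = 50.0,
                  f0_max: float = 600.0, n_bins: int = 256) -> List[int]:
    """Map per-frame F0 values to integer codes (assumed uniform binning).

    Code 0 is reserved for unvoiced frames (F0 <= 0); voiced frames map
    to codes 1 .. n_bins - 1.
    """
    codes = []
    for f in f0_hz:
        if f <= 0.0:
            codes.append(0)
            continue
        clipped = min(max(f, f0_min), f0_max)
        ratio = (clipped - f0_min) / (f0_max - f0_min)
        codes.append(1 + int(ratio * (n_bins - 2)))
    return codes


def expand_to_2d(codes: List[int], mel_dim: int) -> Matrix:
    """Assumed format conversion: repeat each frame's code across mel_dim."""
    return [[float(c)] * mel_dim for c in codes]


def superimpose(*features: Matrix) -> Matrix:
    """Assumed superposition: element-wise sum of equally shaped matrices."""
    rows, cols = len(features[0]), len(features[0][0])
    for m in features:
        if len(m) != rows or any(len(r) != cols for r in m):
            raise ValueError("all feature matrices must share one shape")
    return [[sum(m[i][j] for m in features) for j in range(cols)]
            for i in range(rows)]


# Toy example: 2 frames, Mel dimension 2.
content = [[1.0, 2.0], [3.0, 4.0]]
noise = [[0.5, 0.5], [0.5, 0.5]]
prosody = expand_to_2d(discretize_f0([0.0, 50.0]), mel_dim=2)
encoded = superimpose(content, noise, prosody)
```

A trained system would more plausibly use learned embeddings or projections for the format conversion; the tiling here only shows how three streams of different rank can be brought to one shape before they are combined.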

2. The method according to claim 1, characterized in that changing the timing of the encoded and modeled speech information to adjust the speech rate of the encoded speech information includes: performing time-sequence expansion with linear interpolation on the encoded speech information to slow down the speech rate, which includes: receiving a speech rate slowing instruction, wherein the speech rate slowing instruction indicates that the adjusted speech rate is 1/N times the original speech rate, where N is a positive integer greater than 1; expanding the encoded values of the encoded speech information according to the speech rate slowing instruction, wherein the expansion is: arranging the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, inserting N-1 empty positions after each encoded value, up to and including N-1 empty positions after the last encoded value of the sequence; and performing linear interpolation on the empty positions to slow down the speech rate of the encoded speech information, wherein the value inserted at an empty position before the end of the sequence is determined from the two adjacent original encoded values before and after that empty position, and the value inserted at an empty position after the end of the sequence is determined from the encoded value at the end of the sequence.
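The expansion and interpolation in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions: each encoded value is treated as one frame (a list of floats), and the empty positions after the last frame are filled by copying the last frame, which is one possible reading of "determined based on the encoded value of the end of the sequence". The function name `slow_down` is hypothetical.

```python
from typing import List

Frame = List[float]


def slow_down(frames: List[Frame], n: int) -> List[Frame]:
    """Stretch a frame sequence to n times its length (speech rate 1/n).

    After every original frame, n - 1 new frames are inserted. Between two
    original frames they are linearly interpolated; after the last frame
    they repeat the last frame (assumption).
    """
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    out: List[Frame] = []
    for i, cur in enumerate(frames):
        nxt = frames[i + 1] if i + 1 < len(frames) else cur
        for k in range(n):
            w = k / n  # k = 0 reproduces the original frame
            out.append([a + (b - a) * w for a, b in zip(cur, nxt)])
    return out


stretched = slow_down([[0.0], [2.0]], n=2)
# stretched is [[0.0], [1.0], [2.0], [2.0]]
```

The output always has N times as many frames as the input, so the downstream timbre conversion model produces speech N times as long with the same content.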

3. The method according to claim 1, characterized in that changing the timing of the encoded and modeled speech information to adjust the speech rate of the encoded speech information includes: performing a time-sequence deletion operation on the encoded speech information to increase the speech rate, which includes: receiving a speech rate increase instruction, wherein the speech rate increase instruction indicates that the adjusted speech rate is A times the original speech rate, where A is a positive integer greater than 1; and deleting encoded values of the encoded speech information according to the speech rate increase instruction, wherein the deletion operation is: arranging the encoded values into a sequence of encoded values and, starting from the beginning of the sequence, deleting the next A-1 positions after each retained position, until the end of the sequence.
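The deletion operation in claim 3 amounts to keeping one encoded value out of every A. The following is a minimal sketch, assuming the first value of each group of A is the one retained; the function name `speed_up` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def speed_up(frames: Sequence[T], a: int) -> List[T]:
    """Keep every a-th frame, deleting the a - 1 frames that follow it."""
    if a < 2:
        raise ValueError("a must be an integer greater than 1")
    return list(frames[::a])


faster = speed_up([0, 1, 2, 3, 4, 5, 6], a=3)
# faster is [0, 3, 6]
```

Because frames are dropped rather than averaged, this is cheap enough for streaming use, at the cost of discarding the information in the deleted frames.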

4. The method according to claim 1, characterized in that, before inputting the speech information with the adjusted speech rate into the pre-trained timbre conversion model to obtain the target acoustic features, the method further includes: determining the timbre expression features of the target user; and adaptively training a general timbre conversion model based on the timbre expression features to obtain the timbre conversion model; wherein the general timbre conversion model is trained on a preset amount of voice data from a preset number of users, together with the Mel feature data and the timbre expression features corresponding to that voice data.

5. The method according to claim 4, characterized in that determining the timbre expression features of the target user includes: obtaining the target user's voice information; converting the voice information of the target user into Mel acoustic features; inputting the Mel acoustic features into a time-delay neural network (TDNN) to obtain global timbre features; and averaging the global timbre features to obtain the timbre expression features.
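The final averaging step of claim 5 can be sketched as below. The claim does not specify over what the global timbre features are averaged; this sketch assumes the TDNN yields one fixed-length vector per utterance (or per chunk) of the target user's speech and that the average is an element-wise mean across those vectors. The TDNN itself is not modeled, and `average_timbre` is a hypothetical name.

```python
from typing import List


def average_timbre(global_features: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized timbre vectors (assumed reading)."""
    if not global_features:
        raise ValueError("at least one timbre vector is required")
    dim = len(global_features[0])
    if any(len(v) != dim for v in global_features):
        raise ValueError("all timbre vectors must have the same length")
    count = len(global_features)
    return [sum(v[d] for v in global_features) / count for d in range(dim)]


# Two hypothetical TDNN outputs for two utterances of the target user.
timbre = average_timbre([[1.0, 2.0], [3.0, 4.0]])
# timbre is [2.0, 3.0]
```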

6. The method according to claim 1, characterized in that, after inputting the speech information with the adjusted speech rate into the pre-trained timbre conversion model to obtain the target acoustic features, the method further includes: slicing the target acoustic features to obtain acoustic segments of the target acoustic features; inputting the acoustic segments into a vocoder, which converts them into time-domain sampling points to generate speech segments; and smoothly fusing the speech segments to generate a speech stream.

7. The method according to claim 6, characterized in that slicing the target acoustic features to obtain acoustic segments of the target acoustic features includes: slicing the target acoustic features based on a preset segment length t to obtain M basic acoustic segments of length t, where T is the total length of the target acoustic features and both T and t are positive integers; when T is divisible by t, M = T / t, and when T is not divisible by t, M equals the integer quotient of T / t plus 1; taking, at the end of each basic acoustic segment, an additional acoustic feature of one unit length; and superimposing each basic acoustic segment with the additional acoustic feature of one unit length to obtain M acoustic segments of length t+1.
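The buffered slicing in claim 7 can be sketched as follows. This is a minimal sketch, assuming the additional unit taken at the end of each basic segment is the first frame of the following segment. As a simplification, the last segment is not padded, so it can be shorter than t+1 in this sketch. The function name `slice_with_buffer` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_with_buffer(features: Sequence[T], t: int) -> List[List[T]]:
    """Cut features into M = ceil(T / t) segments of up to t + 1 frames.

    Each segment holds t basic frames plus one buffer frame borrowed from
    the start of the next segment, when such a frame exists.
    """
    if t < 1:
        raise ValueError("t must be a positive integer")
    total = len(features)
    m = -(-total // t)  # ceiling division: T/t if divisible, else T//t + 1
    return [list(features[i * t: i * t + t + 1]) for i in range(m)]


segments = slice_with_buffer([0, 1, 2, 3, 4, 5, 6], t=3)
# segments is [[0, 1, 2, 3], [3, 4, 5, 6], [6]]
```

The shared frame (3 and 6 in the example) is what later gives two adjacent speech segments an overlapping region to fuse.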

8. The method according to claim 7, characterized in that inputting the acoustic segments into the vocoder and converting them into time-domain sampling points to generate speech segments includes: inputting each acoustic segment of length t+1 into the vocoder to generate a speech segment of length K*(t+1), wherein the vocoder converts each unit length of acoustic features into K time-domain sampling points, where K is a positive integer.

9. The method according to claim 8, characterized in that smoothly fusing the speech segments to generate a speech stream includes: in each speech segment except the last one, setting the last K time-domain sampling points, which correspond to the last unit length, as the tail buffer; in each speech segment except the first one, setting the first K time-domain sampling points, which correspond to the first unit length, as the head fusion region; for every two adjacent speech segments, attenuating the loudness of the tail buffer of the earlier speech segment and enhancing the loudness of the head fusion region of the later speech segment; and merging and splicing all processed speech segments to generate the speech stream.
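The smooth fusion in claim 9 can be sketched as an overlap-add crossfade. This is a minimal sketch only: the claim says the tail buffer is attenuated and the head fusion region is enhanced, but it gives no weighting curve and does not say how the two regions are combined. The sketch assumes a linear fade and that the two K-sample regions are summed, since they cover the same shared acoustic frame. The function name `fuse_segments` is hypothetical.

```python
from typing import List


def fuse_segments(segments: List[List[float]], k: int) -> List[float]:
    """Join speech segments, crossfading k overlapping samples at each seam.

    The last k samples of the stream so far (tail buffer) fade out while
    the first k samples of the next segment (head fusion region) fade in.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not segments:
        return []
    stream = list(segments[0])
    for seg in segments[1:]:
        if len(stream) < k or len(seg) < k:
            raise ValueError("every segment needs at least k samples")
        start = len(stream) - k
        for j in range(k):
            w = (j + 1) / (k + 1)  # fade-in weight; fade-out weight is 1 - w
            stream[start + j] = stream[start + j] * (1.0 - w) + seg[j] * w
        stream.extend(seg[k:])  # samples after the head fusion region
    return stream


# Two toy speech segments whose one-sample overlap (k = 1) already agrees.
stream = fuse_segments([[1.0, 1.0, 2.0], [2.0, 3.0, 3.0]], k=1)
# stream is [1.0, 1.0, 2.0, 3.0, 3.0]
```

With segments of K*(t+1) samples each, every seam consumes K samples of overlap, so each non-final segment contributes K*t new samples to the stream and the result can be emitted incrementally as segments arrive.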

10. A voice conversion device, characterized in that the device includes: a voice information acquisition module, used to acquire the speech information to be processed; an encoding modeling module, used to encode and model, based on a three-head encoder, the speech content, environmental noise, and fundamental frequency information in the speech information to be processed, respectively, so as to obtain encoded and modeled speech information, wherein the three-head encoder consists of a speech content encoder, an environmental noise encoder, and a prosodic wave encoder; a speech rate adjustment module, used to change the timing of the encoded and modeled speech information in order to adjust the speech rate of the encoded speech information; and a target acoustic feature acquisition module, used to input the speech information with the adjusted speech rate into a pre-trained timbre conversion model corresponding to the target user to obtain target acoustic features, wherein the speech content of the target acoustic features is the same as the speech content of the speech information, and the timbre of the target acoustic features is the same as the timbre of the target user; wherein encoding and modeling the speech content, environmental noise, and fundamental frequency information respectively based on the three-head encoder includes: based on the speech content encoder, autoencoding the speech content and mapping it to a high-dimensional space to obtain a high-order feature representation of the speech content, wherein the high-order feature representation is in the form of a two-dimensional matrix; based on the environmental noise encoder, extracting the environmental noise and mapping it to high-order environmental sound features, wherein the environmental sound features are in the form of a two-dimensional matrix; based on the prosodic wave encoder, extracting the fundamental frequency information and discretizing it to obtain a prosodic wave encoding result, wherein the prosodic wave encoding result is in the form of a one-dimensional matrix; converting the format of the environmental sound features and of the prosodic wave encoding result, respectively, to generate two-dimensional matrices with the same Mel feature dimension as the two-dimensional matrix of the high-order feature representation; and superimposing the high-order feature representation of the speech content, the high-order environmental sound features, and the prosodic wave encoding result to obtain the encoded and modeled speech information.

11. An electronic device, characterized in that it includes: a processor, a memory, and a program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the speech conversion method as described in any one of claims 1 to 9.

12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the speech conversion method as described in any one of claims 1 to 9.