A method for extracting Zhuang language audio
By aligning Zhuang language audio and text in the time and frequency domains, and combining a Fourier transform with a speech separation network, the method resolves the mismatch between Zhuang language audio and text, improving translation accuracy and speech recognition performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGXI UNIV
- Filing Date
- 2024-01-08
- Publication Date
- 2026-06-23
AI Technical Summary
In the training of Zhuang language speech translation models, mismatches between Zhuang audio and text degrade the quality of the speech database. Existing technologies have failed to effectively handle the Zhuang language's distinctive inverted structures and polyphonic characters.
The Zhuang language audio and text are aligned in the time and frequency domains, and noise is filtered out by Fourier transform. A speech separation network is constructed, including an encoder, a global encoder, a wizard module, a separation module, and a decoder. The model is optimized using a preset loss function to ensure that the text and audio information correspond completely.
It improves the accuracy of Zhuang language audio translation, solves the problems of Zhuang language's unique inverted structure and polyphonic characters, and ensures the accuracy of subsequent speech recognition and user experience.
Figure CN117877489B_ABST
Description
Technical Field
[0001] This invention relates to the field of speech extraction, specifically to a method for extracting Zhuang language audio.
Background Technology
[0002] Speaker recognition technology is a comprehensive technology that integrates knowledge from multiple fields. Because people differ in their vocal cords, vocal tracts, and even lip shapes, as well as in their speaking habits, their voices differ to varying degrees. These differences may be subtle, but good feature extraction amplifies them, giving rise to the biometric feature known as the "voiceprint." Like fingerprints or iris features, voiceprint features are reliable and unique, and thus meet the prerequisites for biometric identification. Voiceprint recognition technology is therefore widely used in security fields such as financial security, social security, and communication security, as well as in smart homes.
[0003] Training a Zhuang language speech translation model requires an accurate Zhuang language speech database containing Zhuang language audio and the corresponding text. The text is generally in Chinese characters. However, due to the significant differences between the Zhuang language structure and Chinese characters, directly segmenting the text based on speech length or text length may cause mismatches between the text and the audio, increasing the amount of dirty (mismatched) audio in the database and degrading database quality. Therefore, this invention proposes a Zhuang language audio extraction method to solve the problems existing in the prior art.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention proposes a Zhuang language audio extraction method. The method aligns audio and text in the time domain and adds frequency domain alignment on top of the time domain alignment to make the results more accurate. Aligning in both domains ensures that the text and audio information correspond completely.
[0005] The technical solution of this invention is implemented as follows: a method for extracting Zhuang language audio, comprising the following steps:
[0006] Step 1: Obtain the first text set and Zhuang language speech through video processing. The first text set includes multiple first texts A and corresponding text timelines a.
[0007] Step 2: Input the video and extract the audio file. Perform a Fourier transform on the audio file to obtain the audio spectrum, set a waveform amplitude threshold, and filter out noise (audio whose waveform amplitude is below the threshold) to obtain the target audio B and the target audio timeline b.
[0008] Step 3: Align the text timeline a with the target audio timeline b, and adjust the length of the text timeline a to match the length of the target audio timeline b, to obtain multiple second texts C and the timeline c corresponding to the second texts.
[0009] Step 4: Based on the timeline c corresponding to the second texts, extract the final audio from the audio file obtained in step 2. Extracting the final audio with the adjusted second texts ensures that the audio content and the text content are completely consistent, thereby improving the accuracy of the translation.
[0010] Step 5: Obtain mixed speech sample data. The mixed speech sample data is a single-channel speech signal. The mixed speech sample data includes at least one of the following: noise signal, interference speech signal, reverberation signal, and guide speech. The guide speech includes the speech corresponding to the target object.
[0011] Step 6: Construct a speech separation network. The speech separation network includes an encoder, a global encoder, a wizard module, a separation module, and a decoder. The encoder and global encoder output the features of the speech signal. The wizard module outputs weight values based on a comparison of the guide speech with the mixed speech sample data. The separation module obtains the high-dimensional mapping of the target speech. The decoder decodes the data to obtain the target speech.
[0012] Step 7: Input the mixed speech sample data into the speech separation network to obtain the predicted target speech.
[0013] Step 8: Update the speech separation network based on the preset loss function and the predicted target speech to obtain the speech extraction model.
[0014] Step 9: Extract the target object's speech signal from the speech data to be processed using the speech extraction model; the speech data to be processed includes a single-channel speech signal.
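The amplitude-threshold filtering of step 2 can be illustrated with a minimal sketch. It covers only the thresholding part, not the Fourier transform, and it assumes frame-wise peak detection on a list of samples. The function name `find_target_segments`, the 20 ms frame length, and the choice of peak amplitude per frame are illustrative assumptions and are not specified by the invention.

```python
from typing import List, Optional, Tuple


def find_target_segments(
    samples: List[float],
    sample_rate: int,
    threshold: float,
    frame_ms: int = 20,  # assumed frame length, not taken from the source
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans whose frame peak amplitude reaches the threshold.

    Frames whose peak amplitude is below the threshold are treated as noise
    and dropped; the returned spans play the role of the target audio timeline b.
    """
    frame_len = max(1, sample_rate * frame_ms // 1000)
    segments: List[Tuple[float, float]] = []
    start: Optional[int] = None  # first sample index of the span being built

    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        loud = max(abs(x) for x in frame) >= threshold
        if loud and start is None:
            start = i  # a new target span begins
        elif not loud and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None  # the span ended at this quiet frame

    if start is not None:  # audio ended while still above the threshold
        segments.append((start / sample_rate, len(samples) / sample_rate))
    return segments
```

With a 1 kHz sample rate, 100 quiet samples followed by 200 loud samples and 100 quiet samples yield a single span from 0.1 s to 0.3 s.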
[0015] A further improvement is that, in step four, the timeline c corresponding to the second texts includes multiple time segments, which are used to extract multiple final audio segments at once. When the target video and the first text correspond one-to-one, the cutting can be performed directly.
[0016] A further improvement is that, in step three, the target audio B includes target audios 1, 2, and 3. Suppose that target audio 1 corresponds to multiple first texts. Target audio 1 then needs to be trimmed. The trimming is performed in step three as follows: align the starting point of first text 1 with the starting point of the target audio and input the ending point of first text 1 to obtain second text 1; align the starting point of first text 2 with the input ending point of first text 1 and input the ending point of first text 2 to obtain second text 2; and so on, aligning the starting point of first text n with the input ending point of first text n-1 and inputting the ending point of first text n to obtain second text n. Target audio 1 is then trimmed. Because model training limits the audio duration, a target audio that is too long will contain multiple first texts and needs to be trimmed according to those first texts.
[0017] A further improvement is that, in step three, the target audio B includes target audios 1, 2, and 3. Suppose that target audio 1 corresponds to an inverted sentence that contains inappropriate statements and whose reversed word order is ambiguous. Target audio 1 then needs to be trimmed. The trimming is performed in step three as follows: semantically confirm the inappropriate statements and divide them into continuous segments 1, 2, 3, ..., n, which are defined as first texts 1, 2, 3, ..., n. Align the starting point of first text 1 with the starting point of the target audio and input the ending point of first text 1 to obtain second text 1; align the starting point of first text 2 with the input ending point of first text 1 and input the ending point of first text 2 to obtain second text 2; and so on, aligning the starting point of first text n with the input ending point of first text n-1 and inputting the ending point of first text n to obtain second text n. Target audio 1 is then trimmed. Inverted sentences in attributive clauses are currently identified manually.
[0018] A further improvement is that an inverted structure speech database is established. In step three, the target audio B includes target audios 1, 2, and 3. Suppose that target audio 1 corresponds to an inverted structure. Target audio 1 then needs to be trimmed. The trimming is performed in step three as follows: semantically confirm target audio 1, identify the inverted structure, compare it with the inverted structure speech database, and cut it into continuous segments 1, 2, 3, ..., n, which are defined as first texts 1, 2, 3, ..., n. Align the starting point of first text 1 with the starting point of the target audio and input the ending point of first text 1 to obtain second text 1; align the starting point of first text 2 with the input ending point of first text 1 and input the ending point of first text 2 to obtain second text 2; and so on, aligning the starting point of first text n with the input ending point of first text n-1 and inputting the ending point of first text n to obtain second text n. Target audio 1 is then trimmed. (Inverted structures exist in short sentences, such as "tonight" and "my home", and need to be identified in advance; otherwise they easily cause ambiguity.)
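The chained trimming described above, in which each first text starts where the previous one ended, can be sketched as follows. The function name `chain_text_segments` and the representation of the input ending points as a list of seconds are illustrative assumptions.

```python
from typing import List, Tuple


def chain_text_segments(
    audio_start: float, end_points: List[float]
) -> List[Tuple[float, float]]:
    """Build one (start, end) span per first text.

    The first span starts at the start of the target audio; every later
    span starts at the ending point that was input for the previous text.
    """
    spans: List[Tuple[float, float]] = []
    start = audio_start
    for end in end_points:
        if end <= start:
            raise ValueError("ending points must be strictly increasing")
        spans.append((start, end))
        start = end  # the next text begins where this one ended
    return spans
```

For a target audio starting at 2.0 s with input ending points 3.5 s, 5.0 s, and 7.25 s, the sketch returns the spans (2.0, 3.5), (3.5, 5.0), and (5.0, 7.25), which correspond to second texts 1, 2, and 3.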
[0019] A further improvement is that, in step five, after the speech extraction model is obtained, the speech data to be processed is input into the quantized and fine-tuned network model. The model's calculations then yield the separation result for the target speech, meeting the requirements of subsequent speech recognition. In some implementations, after the speech extraction model is trained, it can be tested and validated to confirm the training effect. Test and validation sample data can be drawn from the mixed speech sample data once it has been obtained.
[0020] A further improvement is that, in step six, a speech separation network construction module is used to construct the speech separation network. The trained speech extraction model is then used to extract a test target speech signal from the test sample data, and the extracted test target speech signal is compared with the verification sample data. The speech extraction model is optimized based on the comparison results. Analyzing the consistency between the predicted results and the original results allows the prediction accuracy of the model to be judged effectively.
[0021] A further improvement is that after obtaining the speech extraction model in step seven, the speech of the target object in a single-channel speech can be extracted accurately and effectively, thereby effectively ensuring the subsequent application process.
[0022] A further improvement is made in step eight, which is used to construct a speech separation network. The speech separation network includes an encoder, a global encoder, a wizard module, a separation module, and a decoder. The encoder and the global encoder output the features of the speech signal. The wizard module outputs weight values based on a comparison of the guide speech with the mixed speech sample data. The separation module obtains the high-dimensional mapping of the target speech. The decoder decodes the data to obtain the target speech.
[0023] A further improvement is made in step nine, in which the single-channel speech signal is extracted, and the extracted speech can be used for speech recognition and other purposes in subsequent processes, thereby improving the user experience.
[0024] Compared with existing technologies, this invention has the following advantages. The disclosed Zhuang language audio extraction method aligns audio and text in the time domain and adds frequency domain alignment on top of the time domain alignment to make the results more accurate. Aligning in both domains ensures that the text and audio information correspond completely. In the Zhuang language, one sound may correspond to multiple characters and multiple sounds may correspond to one character, so direct one-sound-to-one-character alignment in the time domain leads to a mismatch between text and audio information, resulting in inaccurate information for subsequent use of the audio data. In addition, because of the particular structure of the Zhuang language, inverted structures may appear within a short sentence, which requires semantic confirmation within a target audio; otherwise, ambiguity will arise. Such inverted structures do not exist in Chinese or other local dialects, and no existing technology provides technical inspiration for text segmentation within a short sentence. This solution addresses this type of problem.
Attached Figure Description
[0025] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0026] Figure 1 is a schematic diagram of the audio file of the present invention;
[0027] Figure 2 is the first text clipping image of the present invention;
[0028] Figure 3 is the second text clipping image of the present invention.
Detailed Implementation
[0029] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0030] In the description of this invention, it should be noted that, unless otherwise explicitly specified and limited, the terms installation, connection, and linking should be interpreted broadly. For example, they can refer to fixed connections, detachable connections, or integral connections; they can refer to mechanical connections or electrical connections; they can refer to direct connections or indirect connections through an intermediate medium; and they can refer to the internal communication between two components.
[0031] Referring to Figures 1, 2, and 3, this invention discloses a method for extracting Zhuang language audio, comprising the following steps:
[0032] Step 1: Obtain the first text set and Zhuang language speech through video processing. The first text set includes multiple first texts A and corresponding text timelines a.
[0033] Step 2: Input the video and extract the audio file. Perform a Fourier transform on the audio file to obtain the audio spectrum, set a waveform amplitude threshold, and filter out noise (audio whose waveform amplitude is below the threshold) to obtain the target audio B and the target audio timeline b.
[0034] Step 3: Align the text timeline a with the target audio timeline b, and adjust the length of the text timeline a to match the length of the target audio timeline b, to obtain multiple second texts C and the timeline c corresponding to the second texts.
[0035] Step 4: Based on the timeline c corresponding to the second texts, extract the final audio from the audio file obtained in step 2. Extracting the final audio with the adjusted second texts ensures that the audio content and the text content are completely consistent, thereby improving the accuracy of the translation.
[0036] Step 5: Obtain mixed speech sample data. The mixed speech sample data is a single-channel speech signal. The mixed speech sample data includes at least one of the following: noise signal, interference speech signal, reverberation signal, and guide speech. The guide speech includes the speech corresponding to the target object.
[0037] Step 6: Construct a speech separation network. The speech separation network includes an encoder, a global encoder, a wizard module, a separation module, and a decoder. The encoder and global encoder output the features of the speech signal. The wizard module outputs weight values based on a comparison of the guide speech with the mixed speech sample data. The separation module obtains the high-dimensional mapping of the target speech. The decoder decodes the data to obtain the target speech.
[0038] Step 7: Input the mixed speech sample data into the speech separation network to obtain the predicted target speech.
[0039] Step 8: Update the speech separation network based on the preset loss function and the predicted target speech to obtain the speech extraction model.
[0040] Step 9: Use a speech extraction model to extract the target object's speech signal from the speech data to be processed; the speech data to be processed includes single-channel speech signals.
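The length adjustment in step 3 can be illustrated with a minimal sketch. The invention does not state how the text timeline is adjusted, so the linear rescaling used here is an assumption, as are the function name `fit_text_timeline` and the representation of timelines as (start, end) pairs in seconds.

```python
from typing import List, Tuple

Span = Tuple[float, float]


def fit_text_timeline(text_spans: List[Span], audio_span: Span) -> List[Span]:
    """Linearly rescale text timeline a so it covers exactly audio timeline b.

    text_spans: ordered (start, end) spans of the first texts (timeline a).
    audio_span: (start, end) of the target audio (timeline b).
    Returns the adjusted spans, which play the role of timeline c.
    """
    if not text_spans:
        return []
    a_start, a_end = text_spans[0][0], text_spans[-1][1]
    b_start, b_end = audio_span
    if a_end <= a_start or b_end <= b_start:
        raise ValueError("timelines must have positive length")
    scale = (b_end - b_start) / (a_end - a_start)
    # Shift every boundary to the audio start, then stretch or shrink it.
    return [
        (b_start + (s - a_start) * scale, b_start + (e - a_start) * scale)
        for s, e in text_spans
    ]
```

For example, two text spans covering 0 s to 4 s are compressed to fit a target audio running from 1 s to 3 s, giving the spans (1.0, 2.0) and (2.0, 3.0).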
[0041] In step four, the timeline c corresponding to the second text includes multiple time segments. These multiple time segments are used to extract multiple final audio segments at once. If the target video and the first text correspond one-to-one, the video can be cut directly.
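Extracting several final audio segments in one pass can be sketched as below, assuming the audio is held as a list of samples and timeline c is a list of (start, end) pairs in seconds. The function name `cut_segments` is illustrative.

```python
from typing import List, Sequence, Tuple


def cut_segments(
    samples: Sequence[float],
    sample_rate: int,
    timeline: List[Tuple[float, float]],
) -> List[List[float]]:
    """Slice one clip out of the audio for each time segment of timeline c."""
    clips: List[List[float]] = []
    for start_s, end_s in timeline:
        # Convert seconds to sample indices and clamp to the audio length.
        lo = max(0, round(start_s * sample_rate))
        hi = min(len(samples), round(end_s * sample_rate))
        clips.append(list(samples[lo:hi]))
    return clips
```

Each clip pairs with the second text that shares its time segment, so the text and audio of a database entry come from the same span.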
[0042] In step three, the target audio B includes target audios 1, 2, and 3. Assume that one of the target audios, target audio 1, corresponds to multiple first texts. At this time, target audio 1 needs to be trimmed. The trimming steps are in step three and include aligning the starting point of the first text 1 with the starting point of the target audio and inputting the ending point of the first text 1 to obtain the second text 1; aligning the starting point of the first text 2 with the input ending point of the first text 1 and inputting the ending point of the first text 2 to obtain the second text 2; aligning the starting point of the first text n with the input ending point of the first text n - 1 and inputting the ending point of the first text n to obtain the second text n. Target audio 1 is trimmed. Due to the limitation of the audio duration in model training, when the target audio is too long, it will contain multiple first texts and needs to be trimmed according to the first texts.
[0043] In step three, the target audio B includes target audios 1, 2, and 3. Assume that target audio 1 corresponds to an inverted sentence with inappropriate words, whose reversed word order is ambiguous. Target audio 1 then needs to be trimmed. The trimming is performed in step three as follows: semantically confirm the inappropriate sentence and divide it into continuous segments 1, 2, 3, ..., n, which are defined as first texts 1, 2, 3, ..., n. Align the starting point of first text 1 with the starting point of the target audio and input the ending point of first text 1 to obtain second text 1; align the starting point of first text 2 with the input ending point of first text 1 and input the ending point of first text 2 to obtain second text 2; and so on, aligning the starting point of first text n with the input ending point of first text n - 1 and inputting the ending point of first text n to obtain second text n. Target audio 1 is then trimmed. Inverted sentences in the Zhuang language are currently identified manually.
[0044] An inverted structure speech database is established. In step three, the target audio B includes target audios 1, 2, and 3. Assume that target audio 1 corresponds to an inverted structure. Target audio 1 then needs to be trimmed. The trimming is performed in step three as follows: semantically confirm target audio 1, identify the inverted structure, compare it with the inverted structure speech database, and cut it into continuous segments 1, 2, 3, ..., n, which are defined as first texts 1, 2, 3, ..., n. Align the starting point of first text 1 with the starting point of the target audio and input the ending point of first text 1 to obtain second text 1; align the starting point of first text 2 with the input ending point of first text 1 and input the ending point of first text 2 to obtain second text 2; and so on, aligning the starting point of first text n with the input ending point of first text n - 1 and inputting the ending point of first text n to obtain second text n. Target audio 1 is then trimmed. (Inverted structures exist in short sentences, such as "tonight" and "my home" in reverse order, and need to be identified in advance; otherwise they easily cause ambiguity.)
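The comparison against an inverted structure database can be sketched as a simple lookup over adjacent word pairs. Everything in the sketch is hypothetical: the database is a small in-memory dictionary, the entries are English glosses standing in for Zhuang words, and the decomposition of "tonight" into "night this" is an assumed illustration of reverse order. The invention's actual database is a speech database whose format is not disclosed.

```python
from typing import Dict, List, Tuple

# Hypothetical database: reversed word pair -> meaning in normal order.
INVERTED_PAIRS: Dict[Tuple[str, str], str] = {
    ("night", "this"): "tonight",
    ("home", "my"): "my home",
}


def find_inverted_structures(glosses: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start_index, end_index, meaning) for each known inverted pair.

    Indices are word positions, with end_index exclusive, so a hit marks
    a continuous segment that should be kept together when cutting.
    """
    hits: List[Tuple[int, int, str]] = []
    for i in range(len(glosses) - 1):
        pair = (glosses[i], glosses[i + 1])
        if pair in INVERTED_PAIRS:
            hits.append((i, i + 2, INVERTED_PAIRS[pair]))
    return hits
```

Marking such pairs before segmentation keeps a cut from falling between the two words of an inverted structure, which is the ambiguity the paragraph above warns about.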
[0045] In step five, after the speech extraction model is obtained, the speech data to be processed is input into the quantized and fine-tuned network model. The model's calculations yield the separation result for the target speech, meeting the requirements of subsequent speech recognition. In some implementations, after the speech extraction model is trained, it can be tested and validated to confirm the training effect. Test and validation sample data can be drawn from the mixed speech sample data once it has been obtained.
[0046] In step six, the speech separation network construction module is used to construct the speech separation network. The trained speech extraction model is then used to extract a test target speech signal from the test sample data, and the extracted test target speech signal is compared with the verification sample data. The speech extraction model is optimized based on the comparison results. Analyzing the consistency between the predicted results and the original results allows the prediction accuracy of the model to be judged effectively.
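The comparison between the extracted test signal and the verification data can be illustrated with a consistency score. The invention does not name the metric, so the scale-invariant signal-to-noise ratio (SI-SNR) used here is an assumption; it is a measure commonly used for speech separation. The function name `si_snr_db` is illustrative.

```python
import math
from typing import List, Sequence


def _zero_mean(x: Sequence[float]) -> List[float]:
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_snr_db(estimate: Sequence[float], reference: Sequence[float]) -> float:
    """Scale-invariant SNR in dB between an extracted signal and its reference."""
    if len(estimate) != len(reference) or not reference:
        raise ValueError("signals must be non-empty and of equal length")
    est, ref = _zero_mean(estimate), _zero_mean(reference)
    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0:
        raise ValueError("reference signal has no energy")
    # Project the estimate onto the reference to get the target component.
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [scale * r for r in ref]
    noise_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    target_energy = sum(t * t for t in target)
    if noise_energy == 0:
        return math.inf  # estimate is a scaled copy of the reference
    if target_energy == 0:
        return -math.inf  # estimate shares nothing with the reference
    return 10 * math.log10(target_energy / noise_energy)
```

A higher score means the extracted speech is more consistent with the verification signal. A score computed this way could serve as the basis for the comparison result mentioned above.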
[0047] After obtaining the speech extraction model in step seven, the speech of the target object in a single-channel speech can be extracted accurately and effectively, thus ensuring the subsequent application process.
[0048] Step eight is used to construct a speech separation network, which includes an encoder, a global encoder, a wizard module, a separation module, and a decoder. The encoder and global encoder output the features of the speech signal. The wizard module outputs weight values based on a comparison of the guide speech with the mixed speech sample data. The separation module obtains the high-dimensional mapping of the target speech. The decoder decodes the data to obtain the target speech.
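The role of the wizard module, which turns a comparison into weight values, can be illustrated with a toy sketch. The invention does not disclose how the comparison is computed, so the cosine similarity between a guide-speech feature vector and each frame of mixed-speech features is purely an assumption, as are the names `cosine` and `guide_weights`. A trained network would learn this comparison rather than use a fixed formula.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def guide_weights(
    guide_embedding: Sequence[float],
    frame_features: Sequence[Sequence[float]],
) -> List[float]:
    """Map each frame's similarity to the guide speech into a weight in [0, 1]."""
    return [(cosine(guide_embedding, f) + 1.0) / 2.0 for f in frame_features]
```

Frames that resemble the guide speech receive weights near 1 and dissimilar frames receive weights near 0, which a separation module could use to emphasize the target speaker.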
[0049] In step nine, the single-channel speech signal is extracted, which can then be used for speech recognition and other functions in subsequent processes, thus improving the user experience.
[0050] When using this Zhuang language audio extraction method, there can be one or more target objects. To perform speech recognition and other processing on the guide speech in subsequent processes, the guide speech must be separated from the mixed speech sample data. The mixed speech sample data can be a single-channel speech signal, that is, a sound signal collected through only one microphone. When model training is based on supervised learning, the mixed speech sample data can also carry labels that identify the guide speech, so that the guide speech can be used directly for targeted training when the data is processed by the speech separation network. The specific labeling method can be set according to the needs of the actual application and is not limited here. The mixed speech sample data can be constructed as follows. First, at least two human voice signals are mixed within a first signal-to-noise ratio (SNR) range to obtain a human voice mixed speech signal. The human voice signals can be pre-acquired or separated independent speech signals corresponding to human voices. The first SNR range defines the SNR interval of the mixed human voice signals, for example 0 dB to 5 dB. Next, the human voice mixed speech signal is mixed with a noise signal within a second SNR range to obtain a composite speech signal. The noise signal can be an additional signal that interferes with the above speech signal. The second SNR range defines the SNR interval for mixing these two signals, for example -6 dB to 3 dB. Finally, the composite speech signal is processed using a speech signal generation function to obtain the mixed speech sample data. The speech signal generation function generates speech signals from the corresponding data so as to simulate speech in actual applications, thereby helping to construct simulated speech sample data.
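Mixing two signals at a chosen SNR, as in the two mixing stages described above, can be sketched as follows. The function name `mix_at_snr` and the use of mean power over the whole signal are illustrative assumptions. The speech signal generation function mentioned in the paragraph is not specified by the invention and is not modelled here.

```python
import math
from typing import List, Sequence


def mix_at_snr(
    speech: Sequence[float], interference: Sequence[float], snr_db: float
) -> List[float]:
    """Add interference to speech, scaled so the mixture has the requested SNR.

    SNR is defined as 10 * log10(speech power / scaled interference power).
    """
    if len(speech) != len(interference) or not speech:
        raise ValueError("signals must be non-empty and of equal length")
    speech_power = sum(x * x for x in speech) / len(speech)
    interf_power = sum(x * x for x in interference) / len(interference)
    if speech_power == 0 or interf_power == 0:
        raise ValueError("both signals must have non-zero power")
    # Gain that brings the interference power to speech_power / 10^(snr/10).
    gain = math.sqrt(speech_power / (interf_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, interference)]
```

Under these assumptions, a training sample could be built by drawing an SNR from the first range (for example 0 dB to 5 dB) to mix two voices, then drawing an SNR from the second range (for example -6 dB to 3 dB) to add noise to the result.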
[0051] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for extracting Zhuang language audio, characterized in that it includes the following steps: Step 1: Obtain the first text set and Zhuang language speech through video processing. The first text set includes multiple first texts A and corresponding text timelines a. Step 2: Input the video and extract the audio file; perform a Fourier transform on the audio file to obtain the audio spectrum, set a waveform amplitude threshold, and filter out noise, the noise being audio with waveform amplitude below the threshold, to obtain target audio B and target audio timeline b. Step 3: Align the text timeline a with the target audio timeline b, and adjust the length of the text timeline a to be consistent with the length of the target audio timeline b, to obtain multiple second texts C and the timeline c corresponding to the second texts. Step 4: Extract the final audio from the audio file in step 2 based on the timeline c corresponding to the second texts, using the adjusted second texts to extract the final audio. In step two, target audio B includes target audios 1, 2, and 3. Suppose that target audio 1 corresponds to multiple first texts. Target audio 1 then needs to be trimmed. The trimming is performed in step three and includes aligning the start point of first text 1 with the start point of the target audio and inputting the end point of first text 1 to obtain second text 1; aligning the start point of first text 2 with the end point of first text 1 and inputting the end point of first text 2 to obtain second text 2; and aligning the start point of first text n with the end point of first text n-1 and inputting the end point of first text n to obtain second text n. Target audio 1 is then trimmed. In step two, target audio B includes target audios 1, 2, and 3. At least one of target audios 1, 2, and 3 contains inappropriate sentences. The inappropriate sentence types include inverted sentences. The target audio containing inappropriate sentences is cut out separately. Target audios 1, 2, and 3 may all contain inappropriate sentences. If target audio 1 contains inappropriate sentences, target audio 1 is cut out separately and marked.
2. The Zhuang language audio extraction method according to claim 1, characterized in that: In step four, the timeline c corresponding to the second text includes multiple time segments, which are used to extract multiple final audio clips at once.
3. The Zhuang language audio extraction method according to claim 1, characterized in that: an inverted structure speech database is established. In step two, target audio B includes target audios 1, 2, and 3. At least one of target audios 1, 2, and 3 contains an inverted structure. The target audio containing the inverted structure is cut out separately.
4. The Zhuang language audio extraction method according to claim 3, characterized in that: an inverted structure speech database is established, and the target audio containing the inverted structure is compared with the inverted structure speech database to identify the inverted structure.