Speech enhancement method and device, electronic equipment, storage medium and program product
By decoupling the input and output of speech recognition and speech synthesis through a parallel processing mechanism, the speech synthesis delay problem is solved, and the timeliness of communication is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUZHOU KEYUAN SOFTWARE TECH DEV
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-19
AI Technical Summary
Existing speech enhancement solutions suffer from time delays in speech synthesis due to the sequential execution of speech recognition and speech synthesis, which affects the timeliness of communication.
A parallel processing mechanism is adopted, in which the first processing thread and the second processing thread respectively perform streaming speech recognition processing and speech synthesis playback processing. The parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
It effectively reduces the time delay of speech synthesis, ensuring timely communication.
Figure CN122245282A_ABST
Description
Technical Field
[0001] The present invention relates to the field of audio processing technology, and in particular to a speech enhancement method, apparatus, electronic device, storage medium and program product.
Background Technology
[0002] In video communication scenarios, the audio quality of the speaker at the far end directly affects the communication experience. However, due to factors such as transmission environment interference and equipment performance limitations, the audio at the far end often suffers from problems such as distortion and noise interference, which leads to a decrease in the communication quality at the receiving end.
[0003] Existing technologies have proposed various speech enhancement or audio quality improvement schemes, such as signal processing-based methods or deep learning model-based speech enhancement methods. These schemes typically perform noise reduction, demixing, or other processing on the received audio signal to improve speech quality. However, these methods are often limited by the quality of the input speech signal itself. When the original speech signal contains significant distortion or missing information, its enhancement effect is limited, and it is difficult to balance processing performance with low latency requirements in complex real-time scenarios.
[0004] In recent years, with the continuous development of Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) technologies, speech processing methods based on "speech-to-text-to-speech" conversion have gradually attracted attention. By converting speech into a text representation and then re-synthesizing speech from the text content, this type of method can in theory avoid the influence of noise and distortion in the original audio, providing a new technical path for improving speech quality. However, this conversion scheme is sequential: speech is first converted to text and then back to speech, so speech synthesis can only begin after all speech recognition is completed. This often introduces a time delay in speech synthesis, which affects the timeliness of communication.
Summary of the Invention
[0005] This invention provides a speech enhancement method, apparatus, electronic device, storage medium, and program product to solve the problem that existing solutions suffer from time delays in speech synthesis due to the sequential execution of speech recognition and speech synthesis, which in turn affects the timeliness of communication.
[0006] According to one aspect of the present invention, a speech enhancement method is provided, comprising: acquiring current audio data; and, based on the current audio data, adopting a parallel processing mechanism in which a first processing thread and a second processing thread respectively perform streaming speech recognition processing and speech synthesis playback processing, wherein the parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
[0007] According to another aspect of the present invention, a speech enhancement apparatus is provided, comprising: an acquisition module, configured to acquire current audio data; and an execution module, configured to, based on the current audio data and using a parallel processing mechanism, perform streaming speech recognition processing and speech synthesis playback processing through a first processing thread and a second processing thread, respectively, wherein the parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
[0008] According to another aspect of the present invention, an electronic device is provided, the electronic device comprising: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the speech enhancement method according to any embodiment of the present invention.
[0009] According to another aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to execute and implement the speech enhancement method according to any embodiment of the present invention.
[0010] According to another aspect of the present invention, a computer program product is provided, the computer program product comprising a computer program that, when executed by a processor, implements the speech enhancement method described in any embodiment of the present invention.
[0011] The technical solution of this invention adopts a parallel mechanism to decouple the output of the first processing thread from the input of the second processing thread. By executing speech recognition processing and speech synthesis playback processing in parallel through two independent threads, the problem of time delay in speech synthesis is solved, and the beneficial effect of effectively reducing the time delay in speech synthesis and ensuring the timeliness of communication is achieved.
[0012] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of the present invention, nor is it intended to limit the scope of the invention. Other features of the invention will become readily apparent from the following description.
Brief Description of the Drawings
[0013] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0014] Figure 1 is a flowchart of a speech enhancement method provided in Embodiment 1 of the present invention;
Figure 2 is a flowchart of a speech enhancement method provided in Embodiment 2 of the present invention;
Figure 3 is a flowchart of a speech enhancement method provided in Embodiment 3 of the present invention;
Figure 4 is a schematic flowchart of a speech enhancement method provided in Embodiment 4 of the present invention;
Figure 5 is a schematic structural diagram of a speech enhancement apparatus provided in Embodiment 5 of the present invention;
Figure 6 is a schematic diagram of an electronic device implementing a speech enhancement method according to an embodiment of the present invention.
Detailed Implementation
[0015] To enable those skilled in the art to better understand the present invention, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention. It should be understood that the various steps described in the method embodiments of the present invention can be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of the present invention is not limited in this respect.
[0016] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.
[0017] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0018] It should be noted that the terms "a" and "a plurality of" used in this invention are illustrative rather than restrictive. Those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0019] The names of the messages or information exchanged between the multiple devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of these messages or information.
Embodiment 1
[0020] Figure 1 is a flowchart of a speech enhancement method provided in Embodiment 1 of the present invention. This method is applicable to various scenarios such as real-time voice communication, remote conferencing, and voice playback on smart terminals. The method can be executed by a speech enhancement apparatus, which can be implemented in software and / or hardware and is generally integrated into an electronic device. In this embodiment, the electronic device includes, but is not limited to, computer equipment, smartwatches, iPads, and smartphones.
[0021] As shown in Figure 1, the speech enhancement method provided in Embodiment 1 of the present invention includes the following steps:
S110. Acquire the current audio data.
[0022] In this context, current audio data can be understood as audio data at the current moment. For example, in a call scenario, current audio data can be the call voice signal at the current moment; in a music playback scenario, current audio data can be the audio signal being played at the current moment.
[0023] In this embodiment, the electronic device can acquire the current audio data in a variety of ways. One implementation is that the electronic device can first use a microphone to convert the sound into an analog electrical signal, and then use a sound card to convert the continuous analog electrical signal into audio data.
[0024] S120. Based on the current audio data, a parallel processing mechanism is adopted to perform streaming speech recognition processing and speech synthesis playback processing through a first processing thread and a second processing thread, respectively; wherein, the parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
[0025] In this embodiment, a parallel processing mechanism is adopted, in which the first processing thread performs streaming speech recognition processing and the second processing thread performs speech synthesis playback processing, thereby decoupling the speech recognition output from the speech synthesis input.
[0026] In this embodiment, while the first processing thread performs streaming speech recognition processing on the current audio data to obtain the text corresponding to the current audio data, the second processing thread simultaneously performs speech synthesis and playback processing on the text corresponding to any audio data preceding the current audio data. Preferably, the second processing thread works on the text corresponding to the immediately preceding audio data. Likewise, while the second processing thread performs speech synthesis and playback on the text corresponding to the current audio data, the first processing thread simultaneously performs streaming speech recognition processing on the next audio data to obtain the text corresponding to the next audio data.
[0027] The streaming speech recognition processing performed by the first processing thread can be implemented in any of the following four ways:
Method 1: The first processing thread inputs the current audio data into the speech recognition model and outputs the incremental text corresponding to the current audio data.
Method 2: The first processing thread inputs the current audio data into the speech recognition model, outputs the incremental text corresponding to the current audio data, and then performs dynamic filtering on the incremental text to obtain the valid text corresponding to the current audio data.
Method 3: The first processing thread inputs multiple audio blocks into the speech recognition model and outputs the incremental text corresponding to the multiple audio blocks. The multiple audio blocks are audio blocks of a unified audio format obtained by preprocessing the current audio data.
Method 4: The first processing thread inputs multiple audio blocks into the speech recognition model, outputs the incremental text corresponding to the multiple audio blocks, and then performs dynamic filtering on the incremental text to obtain the valid text corresponding to the multiple audio blocks. The multiple audio blocks are audio blocks of a unified audio format obtained by preprocessing the current audio data.
[0028] In this embodiment, in response to the requirements of the speech recognition model for the input audio format, the acquired current audio data is standardized and preprocessed to convert the acquired audio data into mono audio data with a uniform sampling frequency that meets the requirements of speech recognition processing. Then, the audio data is sliced to generate continuous audio blocks.
[0029] The specific preprocessing steps can be determined by the speech recognition model, since different speech recognition models have different requirements for the input audio format. For example, preprocessing includes at least one of the following: channel unification, sampling frequency conversion, amplitude normalization, and slicing processing.
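The preprocessing steps listed above can be sketched as follows. This is a minimal illustration only, not the implementation of this embodiment: the function names, the 16 kHz target rate, the block size, and the use of linear interpolation for sampling frequency conversion are all assumptions made for the example.

```python
from typing import List, Sequence


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Channel unification: average the channels of every frame."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Sampling frequency conversion by linear interpolation.

    Illustrative only: no anti-aliasing filter is applied.
    """
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    last = len(samples) - 1
    out = []
    for k in range(out_len):
        pos = k * src_rate / dst_rate
        i = min(int(pos), last - 1)
        frac = min(pos - i, 1.0)
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
    return out


def normalize_peak(samples: Sequence[float], peak: float = 1.0) -> List[float]:
    """Amplitude normalization: scale so the largest magnitude equals `peak`."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)
    return [s * peak / largest for s in samples]


def slice_blocks(samples: Sequence[float], block_size: int) -> List[List[float]]:
    """Slicing: cut the signal into consecutive blocks (the last may be shorter)."""
    return [list(samples[i:i + block_size]) for i in range(0, len(samples), block_size)]


def preprocess(frames, src_rate: int, dst_rate: int = 16000, block_size: int = 1600):
    """Hypothetical pipeline: mono -> resample -> normalize -> slice."""
    mono = to_mono(frames)
    mono = resample_linear(mono, src_rate, dst_rate)
    mono = normalize_peak(mono)
    return slice_blocks(mono, block_size)
```

Which of these steps are actually needed depends on the input format expected by the chosen speech recognition model.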
[0030] The speech synthesis and playback processing performed by the second processing thread can be implemented in either of the following two ways:
Method 1: The second processing thread splits the text corresponding to the audio data into multiple sub-texts, converts each sub-text into a corresponding audio segment, and plays it in real time. The text is either incremental text or valid text.
Method 2: The second processing thread splits the text corresponding to the audio data into multiple sub-texts, converts each sub-text into a corresponding audio segment, performs continuity optimization processing based on the audio context information of each audio segment, and plays the result in real time. The text is either incremental text or valid text.
[0031] In this embodiment, the first processing thread and the second processing thread can also interact with each other through a text queue. The text obtained by the first processing thread in performing streaming speech recognition processing is pushed to a thread-safe text queue, and the second processing thread retrieves the text from the text queue and performs speech synthesis and playback processing.
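The interaction through a thread-safe text queue can be sketched with Python's standard `queue` and `threading` modules. This is a minimal sketch under stated assumptions: `recognize` and `synthesize_and_play` are hypothetical callables standing in for the speech recognition model and the synthesis and playback stage, and the end-of-stream sentinel is an addition for the example.

```python
import queue
import threading
from typing import Callable, Iterable

_STOP = object()  # sentinel marking the end of the text stream (assumption)


def recognition_worker(audio_blocks: Iterable, recognize: Callable, text_queue: queue.Queue) -> None:
    """First processing thread: recognize each block and push its text."""
    for block in audio_blocks:
        text = recognize(block)
        if text:
            text_queue.put(text)
    text_queue.put(_STOP)


def synthesis_worker(text_queue: queue.Queue, synthesize_and_play: Callable) -> None:
    """Second processing thread: pull text in FIFO order and synthesize it."""
    while True:
        text = text_queue.get()
        if text is _STOP:
            break
        synthesize_and_play(text)


def run_pipeline(audio_blocks: Iterable, recognize: Callable, synthesize_and_play: Callable) -> None:
    """Run both workers in parallel; the queue is their only coupling."""
    text_queue: queue.Queue = queue.Queue()
    t1 = threading.Thread(target=recognition_worker, args=(audio_blocks, recognize, text_queue))
    t2 = threading.Thread(target=synthesis_worker, args=(text_queue, synthesize_and_play))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
```

Because the recognition worker never waits for synthesis to finish, text for later audio can accumulate in the queue while earlier text is still being played.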
[0032] The first embodiment of this invention provides a speech enhancement method. First, current audio data is acquired. Then, based on the current audio data, a parallel processing mechanism is employed, with a first processing thread and a second processing thread respectively executing streaming speech recognition processing and speech synthesis playback processing. The parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread. This method, by employing a parallel mechanism and decoupling the output of the first processing thread from the input of the second processing thread, allows for the parallel execution of speech recognition processing and speech synthesis playback processing by two independent threads. This decoupling effectively reduces the time delay inherent in speech synthesis, ensuring timely communication.
[0033] Based on the above embodiments, modified embodiments of the above embodiments are proposed. It should be noted that, in order to keep the description brief, only the differences from the above embodiments are described in the modified embodiments.
[0034] In one embodiment, the second processing thread is provided with a playback buffer pool, which caches multiple audio segments generated by the second processing thread. If the generation speed of a voice audio segment is slower than the playback speed, then the last cached voice audio segment in the playback buffer pool is played. If the generation speed of a voice audio segment is faster than the playback speed, the voice audio segment is cached in the playback buffer pool and played according to the preset playback strategy.
[0035] In this embodiment, the second processing thread sets up a playback buffer pool to compensate when the speech synthesis speed and the playback speed are inconsistent. When the generation delay of a speech audio segment exceeds a preset threshold, the last cached speech audio segment in the buffer pool is played first to avoid playback stuttering. When a new speech audio segment is generated before the previous speech audio segment has finished playing, the new segment is cached in the buffer pool and played according to a preset playback rhythm, so that the output remains both real-time and natural.
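One possible reading of the playback buffer pool is sketched below. The class and method names are hypothetical, the "preset playback strategy" is modelled as simple first-in, first-out order, and "the last cached segment" is interpreted as the segment most recently taken from the pool; the source does not fix any of these details.

```python
import queue
from typing import Optional


class PlaybackBufferPool:
    """Caches synthesized segments between generation and playback."""

    def __init__(self, max_wait_s: float = 0.2) -> None:
        self._pending: queue.Queue = queue.Queue()
        self._last: Optional[bytes] = None
        self.max_wait_s = max_wait_s  # preset generation-delay threshold (assumed value)

    def put(self, segment: bytes) -> None:
        """Called by the synthesis side whenever a segment is ready."""
        self._pending.put(segment)

    def next_for_playback(self) -> Optional[bytes]:
        """Return the next segment to play.

        If no new segment arrives within the threshold, fall back to the
        last cached segment (None if nothing has been cached yet).
        """
        try:
            segment = self._pending.get(timeout=self.max_wait_s)
        except queue.Empty:
            return self._last
        self._last = segment
        return segment
```

A caller would typically run `next_for_playback` in the playback loop and treat a `None` result as "nothing to play yet".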
Embodiment 2
[0036] Figure 2 is a flowchart of a speech enhancement method according to Embodiment 2 of the present invention. Embodiment 2 is an optimization based on the above embodiments. For details not covered in this embodiment, please refer to Embodiment 1.
[0037] As shown in Figure 2, the speech enhancement method provided in Embodiment 2 of the present invention includes the following steps:
S210. Acquire the current audio data.
[0038] S220. For the current audio data, the first processing thread performs streaming speech recognition processing to obtain the text corresponding to the current audio data, and caches the text corresponding to the current audio data in a text queue. At the same time, the second processing thread retrieves the text corresponding to the audio data before the current audio data from the text queue and performs speech synthesis playback.
[0039] The text corresponding to the current audio block can be either incremental text or valid text, with the valid text being obtained by dynamically filtering the incremental text. The text queue provides text to the second processing thread according to the order of the cached text.
[0040] In this embodiment, the text queue is independent of the first and second processing threads, but data interaction between them can be achieved through the text queue. The text obtained after the first processing thread performs speech recognition processing is cached in the text queue, with the text arranged in the order of generation. The text queue provides text to the second processing thread according to a first-in, first-out (FIFO) principle. Providing text can involve the text queue sending text to the second processing thread, or the second processing thread retrieving text from the text queue.
[0041] The second embodiment of this invention provides a speech enhancement method, which specifically implements a parallel processing mechanism, using a first processing thread and a second processing thread to perform streaming speech recognition processing and speech synthesis playback processing, respectively. This method uses a text queue to decouple the output of the first processing thread from the input of the second processing thread, effectively reducing the time delay in speech synthesis and ensuring timely communication.
Embodiment 3
[0042] Figure 3 is a flowchart of a speech enhancement method according to Embodiment 3 of the present invention. Embodiment 3 is an optimization based on the above embodiments. For details not covered in this embodiment, please refer to Embodiments 1 and 2.
[0043] As shown in Figure 3, the speech enhancement method provided in Embodiment 3 of the present invention includes the following steps:
S310. Acquire the current audio data.
[0044] S320. Preprocess the current audio data to obtain multiple audio blocks with a unified audio format.
[0045] S330. Input multiple audio blocks into the speech recognition model through the first processing thread, and output the incremental text corresponding to the multiple audio blocks.
[0046] In this embodiment, the speech recognition model can perform streaming speech recognition processing on multiple input audio blocks, instantly generating corresponding incremental text information. The speech recognition model then decodes and outputs the incremental text after encoding through a neural network.
[0047] The speech recognition model can be any model with a speech recognition function. Commonly used speech recognition models include mainstream end-to-end ASR models, large pre-trained speech models, and traditional hybrid models.
[0048] S340. The incremental text is dynamically filtered by the first processing thread according to a preset filtering strategy to obtain the valid text corresponding to the multiple audio blocks.
[0049] In this embodiment, to ensure that the incremental text output by the first processing thread is suitable for subsequent speech synthesis processing, the first processing thread can also dynamically filter the incremental text. The filtering strategy is to select incremental text that meets at least one of the following conditions as the valid text corresponding to the audio block:
Condition 1: The length of the incremental text reaches a preset threshold;
Condition 2: The incremental text includes marker information that can serve as a natural semantic boundary;
Condition 3: The incremental text remains unchanged within a preset time window.
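The three conditions can be sketched as a small stateful filter. This is only an illustration: the class name, the threshold values, the injectable clock, and the choice of punctuation marks as the "marker information" are assumptions made for the example.

```python
import time
from typing import Callable

# Assumption: punctuation is treated as the natural semantic boundary marker.
BOUNDARY_MARKS = set(",.!?;，。！？；")


class IncrementalTextFilter:
    """Decides whether an incremental recognition result counts as valid text."""

    def __init__(self, min_length: int = 12, stable_window_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.min_length = min_length
        self.stable_window_s = stable_window_s
        self._clock = clock
        self._last_text = ""
        self._last_change = clock()

    def is_valid(self, text: str) -> bool:
        now = self._clock()
        if text != self._last_text:
            # Remember when the hypothesis last changed (used by condition 3).
            self._last_text = text
            self._last_change = now
        if not text:
            return False
        if len(text) >= self.min_length:               # condition 1: length threshold
            return True
        if any(ch in BOUNDARY_MARKS for ch in text):   # condition 2: boundary marker
            return True
        # condition 3: text unchanged for the whole time window
        return now - self._last_change >= self.stable_window_s
```

Passing a fake `clock` makes the time-window condition easy to exercise without real delays.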
[0050] S350. The second processing thread performs speech synthesis and playback on the text corresponding to the audio data preceding the current audio data. Preferably, while the first processing thread performs streaming speech recognition on the current audio data, the second processing thread simultaneously performs speech synthesis and playback on the text corresponding to the previous audio data.
[0051] Understandably, if the second processing thread has not finished processing the text corresponding to the current audio data for speech synthesis and playback, but the first processing thread has already completed the streaming speech recognition processing for the next audio data, in this case, the first processing thread can cache the text corresponding to the next audio data to a text queue. When the second processing thread finishes processing the text corresponding to the current audio data for speech synthesis and playback, it can retrieve the text corresponding to the next audio data from the text queue for speech synthesis and playback, or the text queue can provide the text corresponding to the next audio data to the second processing thread so that the second processing thread can process the text corresponding to the next audio data for speech synthesis and playback.
[0052] It should be noted that steps S330-S340 are executed concurrently with step S350.
[0053] The third embodiment of this invention provides a speech enhancement method that specifically describes the process of performing streaming speech recognition processing through a first processing thread. On one hand, the method uses a streaming recognition mode through the first processing thread to output the corresponding incremental text result in real time for each received audio data, without relying on an offline correction mechanism based on silence detection, thereby reducing recognition latency. On the other hand, the method uses a preset filtering strategy to select effective text that meets the requirements for readability, thereby improving the accuracy of speech synthesis.
Embodiment 4
[0054] Figure 4 is a schematic flowchart of a speech enhancement method according to Embodiment 4 of the present invention. Embodiment 4 is an optimization based on the above embodiments. For details not covered in this embodiment, please refer to the above embodiments.
[0055] As shown in Figure 4, the speech enhancement method provided in Embodiment 4 of the present invention includes the following steps:
S410. Acquire the current audio data.
[0056] S420. The first processing thread performs streaming speech recognition processing on the current audio data to obtain the text corresponding to the current audio data.
[0057] It should be noted that step S420 is executed concurrently with steps S430-S450.
[0058] S430. The second processing thread splits the text corresponding to the audio data preceding the current audio data into multiple sub-texts.
[0059] In this embodiment, the text can be incremental text or valid text.
[0060] The text is segmented according to rules that depend on its type. General text can be segmented at punctuation marks; instruction-type speech text can be segmented at keywords or trigger words; natural dialogue text can be segmented into short semantic sentences; and long text can be segmented according to a length threshold.
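Two of these rules, punctuation-based splitting and the length threshold, can be sketched as follows. The function name, the punctuation set, and the default threshold are assumptions for illustration; keyword-based and semantic segmentation are not covered by this sketch.

```python
import re
from typing import List

# A run of non-punctuation characters followed by any trailing punctuation.
_SENTENCE = re.compile(r"[^,.!?;，。！？；]+[,.!?;，。！？；]*")


def split_text(text: str, max_len: int = 40) -> List[str]:
    """Split text at punctuation, then cut overlong pieces by length."""
    sub_texts: List[str] = []
    for piece in _SENTENCE.findall(text):
        piece = piece.strip()
        if not piece:
            continue
        # Length threshold: hard-cut pieces longer than max_len characters.
        for start in range(0, len(piece), max_len):
            sub_texts.append(piece[start:start + max_len])
    return sub_texts
```

For example, `split_text("Hello, world. How are you?")` yields three sub-texts, one per punctuation-delimited phrase. A production splitter would likely avoid cutting in the middle of a word when applying the length threshold.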
[0061] S440. The second processing thread converts each sub-text into the corresponding audio, generating the corresponding audio segment.
[0062] S450: The second processing thread performs smooth fusion processing at the splicing position of adjacent speech audio segments, and adaptively performs dynamic smoothing adjustment or silence interval optimization based on the abrupt change in acoustic features at the boundary of the splicing position, and plays the optimized speech audio segment in real time.
[0063] In this embodiment, the second processing thread caches the speech feature parameters of the previous speech audio during the generation of the speech audio segment, and performs splicing processing with reference to the speech feature parameters of the previous speech audio when generating the next speech audio segment, so as to improve the continuity of adjacent audio segments in terms of intonation and rhythm.
[0064] Specifically, the step in which the second processing thread performs smooth fusion processing at the splicing positions of adjacent audio segments, and adaptively performs dynamic smoothing adjustment or silence interval optimization based on abrupt changes in acoustic features at the boundary of the splicing position, includes:
S4501. The second processing thread determines the cross-gradient (crossfade) interval at the splicing position of adjacent audio segments.
[0065] That is, a cross-gradient interval of estimated duration is determined at the splicing position of adjacent audio segments.
[0066] S4502. The audio signals within the cross-gradient interval are weighted and superimposed through the second processing thread. The audio signals include the tail audio signal of the previous audio segment and the head audio signal of the next audio segment.
[0067] The weighted superposition process refers to weighting and superimposing the tail audio signal of the previous audio segment and the head audio signal of the next audio segment within the cross-gradient interval, to eliminate the discontinuity of the audio waveform at the splicing point.
[0068] Specifically, the weighted superposition process includes: applying a decreasing weight to the tail audio signal of the preceding audio segment and an increasing weight to the head audio signal of the following audio segment, with the sum of the decreasing and increasing weights always equal to 1; then adding the weighted tail audio signal and the weighted head audio signal sample by sample to obtain the cross-gradient output at the splicing point. The weighting function can be any one of a linear weighting function, a cosine weighting function, or an exponential weighting function.
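The weighted superposition can be sketched as a crossfade over plain lists of samples. This is an illustrative sketch, not the implementation of this embodiment: the function name and the `shape` parameter are hypothetical, and only the linear and cosine weightings are shown.

```python
import math
from typing import List, Sequence


def crossfade(prev_seg: Sequence[float], next_seg: Sequence[float],
              fade_len: int, shape: str = "linear") -> List[float]:
    """Join two segments, overlapping the tail of one with the head of the other."""
    fade_len = min(fade_len, len(prev_seg), len(next_seg))
    if fade_len == 0:
        return list(prev_seg) + list(next_seg)
    tail = prev_seg[len(prev_seg) - fade_len:]
    head = next_seg[:fade_len]
    mixed = []
    for n in range(fade_len):
        x = n / fade_len
        if shape == "cosine":
            w_inc = 0.5 * (1.0 - math.cos(math.pi * x))  # increasing weight
        else:
            w_inc = x                                    # linear increasing weight
        w_dec = 1.0 - w_inc                              # weights always sum to 1
        mixed.append(w_dec * tail[n] + w_inc * head[n])
    return list(prev_seg[:len(prev_seg) - fade_len]) + mixed + list(next_seg[fade_len:])
```

Because the two weights sum to 1 at every sample, two segments with the same constant level pass through the crossfade unchanged.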
[0069] S4503. If the second processing thread determines that the energy characteristics or fundamental frequency characteristics of the speech audio segments change abruptly at the boundary of the splicing position, the next speech audio segment of the adjacent speech audio segments is dynamically and smoothly adjusted.
[0070] In this embodiment, the second processing thread analyzes the energy and fundamental frequency features of the audio signal of the speech audio segments at the boundary of the splicing position. When an abrupt change in the energy or fundamental frequency feature is detected, the next speech audio segment of the adjacent speech audio segments is dynamically and smoothly adjusted so that the adjacent segments maintain a smooth transition in volume and pitch.
[0071] S4504. If the energy characteristics or fundamental frequency characteristics of the speech audio segment at the splicing position boundary are determined by the second processing thread to be free from abrupt changes, then the silence filling duration between the adjacent speech audio segments is adaptively adjusted according to the silence characteristics at the boundary of the adjacent speech audio segments.
[0072] In this embodiment, if there is no abrupt change in the energy characteristics or fundamental frequency characteristics of the speech audio segment at the boundary of the splicing position, then according to the silence characteristics at the boundary of the adjacent speech audio segment, the silence segment is dynamically trimmed or supplemented to avoid the situation of excessively long or short pauses during speech playback, thereby maintaining the continuity of the overall speech rhythm.
[0073] The silence duration is constrained between a preset minimum silence duration and a preset maximum silence duration.
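The boundary check and the silence constraint can be sketched as below. This is a simplified illustration under several assumptions: all function names and threshold values are hypothetical, only the energy feature is analyzed (fundamental frequency analysis is omitted), and the "dynamic smoothing adjustment" is reduced to a single gain applied to the next segment.

```python
import math
from typing import List, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def energy_jump_db(tail: Sequence[float], head: Sequence[float], eps: float = 1e-9) -> float:
    """Level difference in dB between the head of the next segment and the tail of the previous one."""
    return 20.0 * math.log10((rms(head) + eps) / (rms(tail) + eps))


def smooth_gain(next_seg: Sequence[float], jump_db: float, max_jump_db: float = 6.0) -> List[float]:
    """Scale the next segment so the boundary jump does not exceed max_jump_db."""
    if abs(jump_db) <= max_jump_db:
        return list(next_seg)  # no abrupt change detected
    excess_db = jump_db - math.copysign(max_jump_db, jump_db)
    gain = 10.0 ** (-excess_db / 20.0)
    return [s * gain for s in next_seg]


def clamp_silence(measured_s: float, min_s: float = 0.05, max_s: float = 0.4) -> float:
    """Constrain the silence duration to the preset [min_s, max_s] range."""
    return max(min_s, min(measured_s, max_s))
```

A caller would first compute `energy_jump_db` on the samples around the splice; if the jump is within the threshold, it would instead trim or pad the pause to `clamp_silence(...)` seconds.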
[0074] The following specific embodiment explains in detail how the second processing thread performs audio-level continuity optimization at the splicing position of adjacent speech audio segments. The i-th speech audio segment generated by the second processing thread is $x_i(n)$, $n = 0, 1, \dots, N_i - 1$, where $N_i$ represents the total number of sampling points of the i-th speech audio segment and $n$ represents the sampling point index.
[0075] The (i+1)th speech audio segment generated by the second processing thread is $x_{i+1}(n)$, $n = 0, 1, \dots, N_{i+1} - 1$, where $N_{i+1}$ represents the total number of sampling points of the (i+1)th speech audio segment and $n$ represents the sampling point index.
[0076] The cross-gradient (crossfade) interval is determined at the splicing position of the adjacent speech audio segments $x_i(n)$ and $x_{i+1}(n)$, and the audio within the cross-gradient interval is weighted and superimposed.
[0077] Assuming the cross-gradient duration is $T_c$ and the sampling rate is $f_s$, the length of the cross-gradient interval is $L = T_c \cdot f_s$ sampling points. The tail audio signal of the i-th speech audio segment is extracted as $x_i^{\mathrm{tail}}(n) = x_i(N_i - L + n)$, $n = 0, 1, \dots, L - 1$, where $x_i^{\mathrm{tail}}(n)$ represents the tail audio signal of the i-th speech audio segment, $N_i$ represents the total number of sampling points of the i-th speech audio segment, $L$ represents the length of the cross-gradient interval, and $n$ represents the sampling point index.
[0078] The head audio signal of the (i+1)th speech audio segment is extracted as $x_{i+1}^{\mathrm{head}}(n) = x_{i+1}(n)$, $n = 0, 1, \dots, L - 1$, where $x_{i+1}^{\mathrm{head}}(n)$ represents the head audio signal of the (i+1)th speech audio segment and $x_{i+1}(n)$ represents the (i+1)th speech audio segment.
[0079] Define a decreasing weight function $w_{\mathrm{dec}}(n)$ on $n = 0, 1, \dots, L - 1$, where $L$ represents the length of the cross-gradient interval and $n$ represents the sampling point index. As stated in paragraph [0068], the weighting function may be linear, cosine, or exponential.
[0080] Define an increasing weight function $w_{\mathrm{inc}}(n)$ on $n = 0, 1, \dots, L - 1$, where $L$ represents the length of the cross-gradient interval and $n$ represents the sampling point index.
[0081] The decreasing and increasing weights satisfy the condition $w_{\mathrm{dec}}(n) + w_{\mathrm{inc}}(n) = 1$. The cross-gradient output at the splicing point is $y(n) = w_{\mathrm{dec}}(n)\, x_i^{\mathrm{tail}}(n) + w_{\mathrm{inc}}(n)\, x_{i+1}^{\mathrm{head}}(n)$, $n = 0, 1, \dots, L - 1$, where $y(n)$ represents the cross-gradient output at the splicing point, $w_{\mathrm{dec}}(n)$ represents the decreasing weight function, $w_{\mathrm{inc}}(n)$ represents the increasing weight function, $x_i^{\mathrm{tail}}(n)$ represents the tail audio signal of the i-th speech audio segment, and $x_{i+1}^{\mathrm{head}}(n)$ represents the head audio signal of the (i+1)th speech audio segment.
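The weighted superposition at the splicing point can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `fade_weights` and `crossfade`, the half-sample offset used to build the weights, and the choice of only the linear and cosine variants are assumptions made for illustration. The only property taken from the description is that the decreasing and increasing weights sum to 1 at every sampling point.

```python
import math


def fade_weights(length, kind="linear"):
    """Return (decreasing, increasing) weights whose per-sample sum is 1.

    The half-sample offset (n + 0.5) is an illustrative assumption that
    keeps both weights strictly between 0 and 1.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if kind == "linear":
        inc = [(n + 0.5) / length for n in range(length)]
    elif kind == "cosine":
        inc = [0.5 * (1.0 - math.cos(math.pi * (n + 0.5) / length))
               for n in range(length)]
    else:
        raise ValueError("unsupported weighting function: " + kind)
    dec = [1.0 - w for w in inc]
    return dec, inc


def crossfade(prev_seg, next_seg, length, kind="linear"):
    """Blend the tail of prev_seg with the head of next_seg over `length` samples."""
    if length > min(len(prev_seg), len(next_seg)):
        raise ValueError("cross-gradient interval longer than a segment")
    dec, inc = fade_weights(length, kind)
    split = len(prev_seg) - length
    tail = prev_seg[split:]      # last `length` samples of the preceding segment
    head = next_seg[:length]     # first `length` samples of the following segment
    blended = [d * t + u * h for d, u, t, h in zip(dec, inc, tail, head)]
    return prev_seg[:split] + blended + next_seg[length:]
```

For example, `crossfade([1.0] * 8, [0.0] * 8, 4)` yields 12 samples in which the four blended samples step down from the level of the preceding segment to that of the following one.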
[0082] Determine whether there are abrupt changes in the energy characteristics of the speech audio segment at the boundary of the splicing position.
[0083] Calculate the short-time energy of the adjacent speech audio segments within a window of W sampling points near the splicing position boundary: $E_i = \sum_{n=0}^{W-1} \left[x_i^{W}(n)\right]^2$ and $E_{i+1} = \sum_{n=0}^{W-1} \left[x_{i+1}^{W}(n)\right]^2$, where $E_i$ represents the short-time energy of the i-th speech audio segment within the window W near the splicing position boundary, $E_{i+1}$ represents the short-time energy of the (i+1)th speech audio segment within the window W near the splicing position boundary, $x_i^{W}(n)$ represents the i-th speech audio segment within the window W near the splicing position boundary, and $x_{i+1}^{W}(n)$ represents the (i+1)th speech audio segment within the window W near the splicing position boundary.
[0084] It is then checked whether the short-time energy $E_i$ of the i-th speech audio segment and the short-time energy $E_{i+1}$ of the (i+1)th speech audio segment within the window W near the splicing position boundary satisfy $\left|E_i - E_{i+1}\right| > \theta_E$, where $\theta_E$ represents the energy mutation detection threshold. $\theta_E$ can be set as a proportion of the short-time energy, with a proportionality coefficient ranging from 0.2 to 0.5.
[0085] If the above inequality is satisfied, there is an abrupt change in short-time energy. Energy normalization (i.e., dynamic smoothing adjustment) is therefore performed on the following speech audio segment of the adjacent speech audio segments: $\tilde{x}_{i+1}^{W}(n) = x_{i+1}^{W}(n) \cdot \sqrt{E_i / E_{i+1}}$, where $E_i$ represents the short-time energy of the i-th speech audio segment within the window W near the splicing position boundary, $E_{i+1}$ represents the short-time energy of the (i+1)th speech audio segment within the window W near the splicing position boundary, $x_{i+1}^{W}(n)$ represents the (i+1)th speech audio segment within the window W near the splicing position boundary, and $\tilde{x}_{i+1}^{W}(n)$ represents the (i+1)th speech audio segment within the window W near the splicing position boundary after energy normalization.
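The energy check and normalization described above can be sketched in Python as follows. This is a hedged illustration rather than the patented procedure: the helper names, the choice of the larger of the two energies as the reference for the threshold, the small constant `eps`, and the decision to scale the whole following segment (not only the window) are all assumptions.

```python
import math


def short_time_energy(samples):
    """Sum of squared samples over the given window."""
    return sum(s * s for s in samples)


def _boundary_energies(prev_seg, next_seg, window):
    if not 0 < window <= min(len(prev_seg), len(next_seg)):
        raise ValueError("window must fit inside both segments")
    e_prev = short_time_energy(prev_seg[len(prev_seg) - window:])
    e_next = short_time_energy(next_seg[:window])
    return e_prev, e_next


def has_energy_jump(prev_seg, next_seg, window, ratio=0.3):
    """True if the boundary energies differ by more than ratio * reference.

    Using max(e_prev, e_next) as the reference energy is an assumption;
    the description only gives a ratio between 0.2 and 0.5.
    """
    e_prev, e_next = _boundary_energies(prev_seg, next_seg, window)
    return abs(e_prev - e_next) > ratio * max(e_prev, e_next)


def match_energy(prev_seg, next_seg, window, eps=1e-12):
    """Scale next_seg so its boundary energy matches that of prev_seg.

    eps guards against division by zero for an all-silent window.
    """
    e_prev, e_next = _boundary_energies(prev_seg, next_seg, window)
    gain = math.sqrt((e_prev + eps) / (e_next + eps))
    return [gain * s for s in next_seg]
```

A caller would typically run `has_energy_jump` first and apply `match_energy` only when it returns `True`, leaving segments with similar boundary energy untouched.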
[0086] In another specific embodiment, the energy is expressed in the logarithmic domain, and the decibel energy is calculated as $E_i^{\mathrm{dB}} = 10 \log_{10}(E_i)$ and $E_{i+1}^{\mathrm{dB}} = 10 \log_{10}(E_{i+1})$, where $E_i^{\mathrm{dB}}$ represents the decibel energy of the i-th speech audio segment within the window W near the splicing position boundary, $E_{i+1}^{\mathrm{dB}}$ represents the decibel energy of the (i+1)th speech audio segment within the window W near the splicing position boundary, $E_i$ represents the short-time energy of the i-th speech audio segment within the window W near the splicing position boundary, and $E_{i+1}$ represents the short-time energy of the (i+1)th speech audio segment within the window W near the splicing position boundary.
[0087] If the difference in decibel energy $\left|E_i^{\mathrm{dB}} - E_{i+1}^{\mathrm{dB}}\right|$ between the adjacent speech audio segments within the window W near the splicing position boundary exceeds a preset threshold, which may be set in the range of 2 dB to 6 dB, it is determined that there is an abrupt energy change.
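The logarithmic-domain check can be sketched in a few lines of Python. The function names, the default threshold of 3 dB (chosen from within the 2 dB to 6 dB range), and the constant `eps` that avoids taking the logarithm of zero are illustrative assumptions.

```python
import math


def energy_db(samples, eps=1e-12):
    """Short-time energy of a window expressed in decibels."""
    return 10.0 * math.log10(sum(s * s for s in samples) + eps)


def has_db_jump(prev_tail, next_head, threshold_db=3.0):
    """True if the boundary windows differ by more than threshold_db."""
    return abs(energy_db(prev_tail) - energy_db(next_head)) > threshold_db
```

Halving the amplitude of a window lowers its energy by about 6 dB, so a 3 dB threshold flags that case while tolerating small level differences.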
[0088] If it is determined that there is no abrupt change in the energy characteristics of the speech audio segment at the boundary of the splicing position, then the silence filling duration between adjacent speech audio segments is adaptively adjusted according to the silence characteristics at the boundary of adjacent speech audio segments.
[0089] Silence intervals are detected using a short-time energy threshold $\theta_s$, to obtain the silence duration $T_i^{\mathrm{end}}$ at the end of the preceding speech audio segment of the adjacent speech audio segments and the silence duration $T_{i+1}^{\mathrm{start}}$ at the beginning of the following speech audio segment of the adjacent speech audio segments.
[0090] The target silence duration $T_{\mathrm{target}}$ is assumed to lie in the interval $T_{\min} \le T_{\mathrm{target}} \le T_{\max}$, where $T_{\min}$ represents the minimum silence duration and $T_{\max}$ represents the maximum silence duration.
[0091] The final silence filling duration $T_{\mathrm{fill}}$ is obtained from the silence duration $T_i^{\mathrm{end}}$ of the i-th speech audio segment and the silence duration $T_{i+1}^{\mathrm{start}}$ of the (i+1)th speech audio segment, and is constrained so that $T_{\min} \le T_{\mathrm{fill}} \le T_{\max}$, where $T_{\min}$ represents the minimum silence duration and $T_{\max}$ represents the maximum silence duration.
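One possible reading of the silence adjustment is sketched below in Python. The source does not fully specify how the two boundary silences are combined, so the sketch assumes that the pause at the splice is their sum and that this sum is clamped to the preset minimum and maximum. The function names, the 10 ms frame at 16 kHz, and the threshold value are likewise assumptions.

```python
def frame_energy(frame):
    return sum(s * s for s in frame)


def edge_silence(samples, sample_rate, threshold, frame_len=160, from_end=False):
    """Duration in seconds of the silent run at the start (or end) of a segment.

    A frame counts as silent when its short-time energy is below `threshold`.
    """
    seq = samples[::-1] if from_end else samples
    silent = 0
    for start in range(0, len(seq) - frame_len + 1, frame_len):
        if frame_energy(seq[start:start + frame_len]) >= threshold:
            break
        silent += frame_len
    return silent / sample_rate


def clamp_pause(tail_silence, head_silence, t_min, t_max):
    """Return (target_pause, adjustment) in seconds.

    Assumption: the existing pause is tail_silence + head_silence.
    A positive adjustment means silence must be added, a negative one
    means silence must be trimmed.
    """
    total = tail_silence + head_silence
    target = min(max(total, t_min), t_max)
    return target, target - total
```

With a 30 ms silent tail, a 10 ms silent head, and limits of 100 ms and 400 ms, `clamp_pause` asks for roughly 60 ms of added silence; a 700 ms pause under the same limits would be trimmed by roughly 300 ms.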
[0092] Embodiment 4 of this invention provides a speech enhancement method that clarifies the speech synthesis and playback process. This method utilizes a second processing thread to perform audio-level continuity optimization at the splicing points of adjacent speech audio segments. This maintains the continuity of pitch, speech rate, and prosody between adjacent speech audio segments, thereby reducing the sense of fragmentation caused by text segmentation during speech synthesis.
[0093] Embodiment 5. Figure 5 is a schematic diagram of a speech enhancement device provided in Embodiment 5 of the present invention. The device is applicable to scenarios such as real-time voice communication, remote conferencing, and smart terminal voice playback. The device can be implemented in software and / or hardware and is generally integrated into an electronic device.
[0094] As shown in Figure 5, the device includes an acquisition module 110 and an execution module 120.
[0095] The acquisition module 110 is used to acquire the current audio data. The execution module 120 is used to perform streaming speech recognition processing and speech synthesis playback processing through a first processing thread and a second processing thread, respectively, based on the current audio data and using a parallel processing mechanism; wherein the parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
[0096] In this embodiment, the device first acquires the current audio data through the acquisition module 110; then, the execution module 120, based on the current audio data, employs a parallel processing mechanism to perform streaming speech recognition processing and speech synthesis playback processing through a first processing thread and a second processing thread, respectively; wherein, the parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
[0097] This embodiment provides a speech enhancement device that can effectively reduce the time delay in speech synthesis and ensure timely communication.
[0098] Furthermore, the execution module 120 is specifically used to: for the current audio data, perform streaming speech recognition processing through the first processing thread to obtain the text corresponding to the current audio data, and at the same time, perform speech synthesis and playback on the text corresponding to the audio data preceding the current audio data through the second processing thread.
[0099] Furthermore, the execution module 120 is specifically used to: for the current audio data, perform streaming speech recognition processing through the first processing thread to obtain the text corresponding to the current audio data, cache the text corresponding to the current audio data in the text queue, and at the same time, retrieve the text corresponding to the audio data before the current audio data from the text queue through the second processing thread for speech synthesis and playback; The text queue provides text to the second processing thread according to the order of the cached text.
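The decoupling through a text queue can be sketched with Python's standard `queue` and `threading` modules. This is a minimal illustration, not the device's implementation: `run_pipeline`, the `recognize` and `synthesize` callables, and the `STOP` sentinel are hypothetical stand-ins for the speech recognition model, the speech synthesis step, and end-of-stream handling.

```python
import queue
import threading


def run_pipeline(audio_chunks, recognize, synthesize):
    """Run recognition and synthesis in two threads joined by a FIFO text queue."""
    text_queue = queue.Queue()   # caches recognized text in arrival order
    played = []                  # stands in for real-time playback
    STOP = object()              # sentinel marking the end of the stream

    def recognition_worker():
        # First processing thread: recognize each chunk and cache its text.
        for chunk in audio_chunks:
            text_queue.put(recognize(chunk))
        text_queue.put(STOP)

    def synthesis_worker():
        # Second processing thread: consume earlier text while the first
        # thread keeps recognizing later audio.
        while True:
            text = text_queue.get()
            if text is STOP:
                break
            played.append(synthesize(text))

    first = threading.Thread(target=recognition_worker)
    second = threading.Thread(target=synthesis_worker)
    first.start()
    second.start()
    first.join()
    second.join()
    return played
```

Because the second thread only reads from the queue, it never waits for the first thread to finish the whole utterance; it blocks only when no text has been cached yet.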
[0100] Based on the above optimizations, the execution module 120 includes a speech recognition submodule and a speech synthesis submodule.
[0101] In one embodiment, the speech recognition submodule includes: The first recognition unit is used to input the current audio data into the speech recognition model through the first processing thread and output the incremental text corresponding to the current audio data. The first filtering unit is used to dynamically filter the incremental text according to a preset filtering strategy through the first processing thread to obtain the valid text corresponding to the audio data.
[0102] In one embodiment, the speech recognition submodule includes: The second recognition unit is used to input multiple audio blocks into the speech recognition model through the first processing thread and output incremental text corresponding to the multiple audio blocks. The multiple audio blocks are audio blocks of a unified audio format obtained after preprocessing the current audio data. The second filtering unit is used to dynamically filter the incremental text according to a preset filtering strategy through the first processing thread to obtain valid text corresponding to multiple audio blocks.
[0103] Based on the above technical solution, the speech synthesis submodule includes: The splitting unit is used to split the text corresponding to the audio data preceding the current audio data into multiple sub-texts through the second processing thread, wherein the text is incremental text or valid text. The conversion unit is used to convert each sub-text into the corresponding audio through the second processing thread, generating the corresponding speech audio segment; The optimization unit is used to perform smooth fusion processing at the splicing position of adjacent speech audio segments through the second processing thread, and adaptively perform dynamic smooth adjustment or silence interval optimization according to the abrupt change of acoustic features at the boundary of the splicing position, and play the optimized speech audio segment in real time.
[0104] Based on the above technical solution, the optimization unit includes: a determination subunit, used to determine, through the second processing thread, the cross-gradient interval at the splicing position of adjacent speech audio segments; a superposition subunit, used to perform, through the second processing thread, weighted superposition processing on the audio signals in the cross-gradient interval, the audio signals including the tail audio signal of the preceding speech audio segment and the head audio signal of the following speech audio segment; and an adjustment subunit, used to dynamically and smoothly adjust the following speech audio segment of the adjacent speech audio segments if the second processing thread determines that the energy characteristics or fundamental frequency characteristics of the speech audio segments at the splicing position boundary exhibit an abrupt change. The adjustment subunit is further used to adaptively adjust the silence filling duration between the adjacent speech audio segments, according to the silence characteristics at the boundary of the adjacent speech audio segments, if the second processing thread determines that there is no abrupt change in the energy characteristics or fundamental frequency characteristics of the speech audio segments at the splicing position boundary.
[0105] Furthermore, the second processing thread is equipped with a playback buffer pool, which caches multiple audio segments generated by the second processing thread. If the generation speed of an audio segment is slower than the playback speed, the last cached audio segment in the playback buffer pool is played. If the generation speed of an audio segment is faster than the playback speed, the audio segment is cached in the playback buffer pool and played according to a preset playback strategy.
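The behaviour of the playback buffer pool can be illustrated with a small Python class. The sketch reflects one reading of the description and rests on assumptions: the class and method names are hypothetical, the "preset playback strategy" is taken to be first-in first-out, and an underrun is handled by returning the most recently played segment again.

```python
from collections import deque


class PlaybackBufferPool:
    """Caches synthesized segments between generation and playback."""

    def __init__(self):
        self._segments = deque()
        self._last = None

    def push(self, segment):
        # Generation is ahead of playback: cache the segment for later.
        self._segments.append(segment)

    def next_segment(self):
        # FIFO playback (assumed strategy). On underrun, fall back to the
        # last cached segment so that playback does not stall.
        if self._segments:
            self._last = self._segments.popleft()
        return self._last
```

A playback loop would call `next_segment()` at each playback tick, while the synthesis step calls `push()` whenever a new speech audio segment is ready.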
[0106] The above-described speech enhancement device can execute the speech enhancement method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of executing the method.
[0107] Embodiment 6. Figure 6 shows a schematic diagram of an electronic device 10 that can be used to implement embodiments of the present invention. The electronic device can be a computer device, a smartwatch, an iPad, or a smartphone. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementation of the invention described and / or claimed herein.
[0108] As shown in Figure 6, the electronic device 10 includes at least one processor 11 and a memory, such as a read-only memory (ROM) 12 or a random access memory (RAM) 13, communicatively connected to the at least one processor 11. The memory stores computer programs executable by the at least one processor. The processor 11 can perform various appropriate actions and processes based on the computer program stored in the ROM 12 or loaded from the storage unit 18 into the RAM 13. The RAM 13 can also store various programs and data required for the operation of the electronic device 10. The processor 11, ROM 12, and RAM 13 are interconnected via a bus 14. An input / output (I / O) interface 15 is also connected to the bus 14.
[0109] Multiple components in electronic device 10 are connected to I / O interface 15, including: input unit 16, such as keyboard, mouse, etc.; output unit 17, such as various types of displays, speakers, etc.; storage unit 18, such as disk, optical disk, etc.; and communication unit 19, such as network card, modem, wireless transceiver, etc. Communication unit 19 allows electronic device 10 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.
[0110] Processor 11 can be a variety of general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various processors running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. Processor 11 performs the various methods and processes described above, such as speech enhancement methods.
[0111] In some embodiments, the speech enhancement method may be implemented as a computer program tangibly contained in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and / or installed on electronic device 10 via ROM 12 and / or communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the speech enhancement method described above may be performed. Alternatively, in other embodiments, processor 11 may be configured to perform the speech enhancement method by any other suitable means (e.g., by means of firmware).
[0112] Various implementations of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and / or combinations thereof. These various implementations may include: implementations in one or more computer programs that can be executed and / or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0113] In some embodiments, the speech enhancement method may be implemented as a computer program, which is implicitly included in a computer program product. When executed by a processor, the computer program implements the speech enhancement method of the present invention. The computer program product can be understood as a software product that primarily implements its solution through a computer program. The computer program for implementing the method of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, such that when executed by the processor, the computer program causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The computer program may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0114] In the context of this invention, a computer-readable storage medium can be a tangible medium that may contain or store a computer program for use by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable storage medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination thereof. Alternatively, a computer-readable storage medium may be a machine-readable signal medium. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0115] To provide interaction with a user, the systems and techniques described herein can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the electronic device. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).
[0116] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as data servers), or middleware components (e.g., application servers), or frontend components (e.g., user computers with graphical user interfaces or web browsers through which users can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., communication networks). Examples of communication networks include local area networks (LANs), wide area networks (WANs), blockchain networks, and the Internet.
[0117] A computing system can include clients and servers. Clients and servers are generally located far apart and typically interact through communication networks. The client-server relationship is created by computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also known as a cloud computing server or cloud host, which is a hosting product within the cloud computing service system to address the shortcomings of traditional physical hosts and VPS services, such as high management difficulty and weak business scalability.
[0118] It should be understood that the various forms of processes shown above can be used, with steps reordered, added, or deleted. For example, the steps described in this invention can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution of this invention can be achieved, and this is not limited herein.
[0119] The specific embodiments described above do not constitute a limitation on the scope of protection of this invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this invention should be included within the scope of protection of this invention.
Claims
1. A speech enhancement method, characterized in that the method comprises: acquiring current audio data; and, based on the current audio data, adopting a parallel processing mechanism in which a first processing thread and a second processing thread respectively perform streaming speech recognition processing and speech synthesis playback processing; wherein the parallel processing mechanism decouples the output of the first processing thread from the input of the second processing thread.
2. The method according to claim 1, characterized in that adopting, based on the current audio data, the parallel processing mechanism in which the first processing thread and the second processing thread respectively perform streaming speech recognition processing and speech synthesis playback processing comprises: for the current audio data, performing streaming speech recognition processing through the first processing thread to obtain the text corresponding to the current audio data, while performing, through the second processing thread, speech synthesis and playback on the text corresponding to the audio data preceding the current audio data.
3. The method according to claim 1, characterized in that adopting, based on the current audio data, the parallel processing mechanism in which the first processing thread and the second processing thread respectively perform streaming speech recognition processing and speech synthesis playback processing comprises: for the current audio data, performing streaming speech recognition processing through the first processing thread to obtain the text corresponding to the current audio data, and caching the text corresponding to the current audio data in a text queue, while retrieving, through the second processing thread, the text corresponding to the audio data preceding the current audio data from the text queue for speech synthesis and playback; wherein the text queue provides text to the second processing thread in the order in which the text was cached.
4. The method according to claim 1, 2, or 3, characterized in that, Streaming speech recognition processing is performed through the first processing thread, including: The first processing thread inputs the current audio data into the speech recognition model and outputs the incremental text corresponding to the current audio data. The first processing thread dynamically filters the incremental text according to a preset filtering strategy to obtain the valid text corresponding to the current audio data.
5. The method according to claim 1, 2, or 3, characterized in that, Streaming speech recognition processing is performed through the first processing thread, including: The first processing thread inputs multiple audio blocks into the speech recognition model and outputs incremental text corresponding to the multiple audio blocks. The multiple audio blocks are audio blocks of a unified audio format obtained after preprocessing the current audio data. The first processing thread dynamically filters the incremental text according to a preset filtering strategy to obtain valid text corresponding to multiple audio blocks.
6. The method according to claim 1, 2, or 3, characterized in that, The speech synthesis and playback processing is performed through the second processing thread, including: The second processing thread splits the text corresponding to the audio data preceding the current audio data into multiple sub-texts, where the text is either incremental text or valid text. The second processing thread converts each sub-text into the corresponding audio, generating the corresponding audio segment. The second processing thread performs smooth fusion processing at the splicing position of adjacent audio segments, and adaptively performs dynamic smoothing adjustment or silence interval optimization based on the abrupt changes in acoustic features at the splicing position boundary, and plays the optimized audio segment in real time.
7. The method according to claim 6, characterized in that performing, through the second processing thread, smooth fusion processing at the splicing position of adjacent speech audio segments, and adaptively performing dynamic smoothing adjustment or silence interval optimization according to abrupt changes in acoustic features at the splicing position boundary, comprises: determining, through the second processing thread, the cross-gradient interval at the splicing position of the adjacent speech audio segments; performing, through the second processing thread, weighted superposition processing on the audio signals within the cross-gradient interval, the audio signals including the tail audio signal of the preceding speech audio segment and the head audio signal of the following speech audio segment; if the second processing thread determines that there is an abrupt change in the energy characteristics or fundamental frequency characteristics of the speech audio segments at the splicing position boundary, dynamically and smoothly adjusting the following speech audio segment of the adjacent speech audio segments; and if the second processing thread determines that there is no abrupt change in the energy characteristics or fundamental frequency characteristics of the speech audio segments at the splicing position boundary, adaptively adjusting the silence filling duration between the adjacent speech audio segments according to the silence characteristics at the boundary of the adjacent speech audio segments.
8. The method according to claim 1, 2, or 3, characterized in that, The second processing thread is equipped with a playback buffer pool, which caches multiple audio segments generated by the second processing thread. If the generation speed of a voice audio segment is slower than the playback speed, then the last cached voice audio segment in the playback buffer pool is played. If the generation speed of a voice audio segment is faster than the playback speed, the voice audio segment is cached in the playback buffer pool and played according to the preset playback strategy.
9. An electronic device, characterized in that, The electronic device includes: At least one processor; and a memory communicatively connected to the at least one processor; The memory stores a computer program that can be executed by the at least one processor, which is then executed by the at least one processor to enable the at least one processor to perform the speech enhancement method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores computer instructions that cause a processor to execute the speech enhancement method according to any one of claims 1-7.
11. A computer program product, characterized in that, The computer program product includes a computer program that, when executed by a processor, implements the speech enhancement method according to any one of claims 1-7.