An echo cancellation method based on a room impulse response estimation model

By using a deep learning method based on a room impulse response estimation model, the room impulse response is directly estimated to eliminate echoes, solving the problems of preheating process and environmental sensitivity in existing technologies, and achieving stable echo cancellation effect and improved call quality.

CN117351987B · Active · Publication Date: 2026-06-23 · HUNAN UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2023-11-25
Publication Date: 2026-06-23

Application Information

Patent Timeline

25 Nov 2023

Application

23 Jun 2026

Publication

CN117351987B

IPC: G10L21/0216; G10L19/10; G10L25/57; G10L21/0208

AI Tagging

Application Domain

Speech analysis

Technology Topics

Impulse response; Speech sound

Technical Efficacy Phrases

Consistent performance; Multiple acoustic information


AI Technical Summary

Technical Problem

Existing echo cancellation methods in remote communication require a preheating process and are sensitive to environmental changes, resulting in unstable performance.

Method used

A deep learning model with encoder and decoder layers is built to estimate the room impulse response directly, and the estimate is used to cancel the echo. The method covers dataset collection, model training and double-talk detection, and applies to both single-talk and double-talk scenarios.

Benefits of technology

It effectively eliminates echoes without the need for preheating, has greater interpretability and wide applicability, adapts to different environmental changes, and improves call quality.


Abstract

The application discloses an echo cancellation method based on a room impulse response estimation model, belonging to the technical field of acoustic signal processing. The method comprises the following steps: step S1, a room impulse response estimation model is built; step S2, the room impulse response estimation model is trained, evaluated and verified; step S3, double-talk detection is performed; and step S4, according to the detection result, the signal is classified as no speech, single-talk echo speech or double-talk echo speech, and echo cancellation is performed along the corresponding path. The application mainly targets echo in network audio calls and video conferences; it attenuates or eliminates the echo, improves voice and video call quality, and is suitable for a wide range of applications.


Description

Technical Field

[0001] This invention relates to the field of acoustic signal processing technology, specifically to an echo cancellation method based on a room impulse response estimation model.

Background Technology

[0002] In remote communication environments such as video conferencing and voice calls, the two parties in a conversation can generally be divided into the far end and the near end. The far-end microphone picks up the far-end voice and transmits it to the near-end loudspeaker, which radiates it into the room, where reflections create a reverberant echo. This process can be modeled as the convolution of the far-end voice with the room impulse response (the near-end echo is produced by room reflections, so "reverberant echo" and "echo" are used synonymously hereafter). There are three scenarios. If neither party speaks, no voice signal is transmitted in the system and there is no echo. If only the far-end party speaks, the far-end party receives an echo of their own voice. If both parties speak simultaneously, the signal received by the near-end microphone is the reverberant echo of the far-end voice superimposed on the near-end speaker's voice, so the near-end speaker's voice is mixed with echo and difficult to understand, which degrades call quality. The latter two scenarios require acoustic echo cancellation.
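The signal model described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `simulate_near_end_mic`, the toy impulse response, and the truncation of the echo to the length of the far-end signal are not taken from the patent.

```python
import numpy as np

def simulate_near_end_mic(far_end, rir, near_end=None):
    """Return (echo, mic) as seen by the near-end microphone.

    echo: far-end speech convolved with the room impulse response,
          truncated to the length of the far-end signal (assumption).
    mic:  echo alone (single-talk) or echo plus near-end speech (double-talk).
    """
    far_end = np.asarray(far_end, dtype=float)
    echo = np.convolve(far_end, np.asarray(rir, dtype=float))[: len(far_end)]
    mic = echo.copy()
    if near_end is not None:
        near = np.asarray(near_end, dtype=float)
        n = min(len(mic), len(near))
        mic[:n] += near[:n]  # superimpose the near-end speaker's voice
    return echo, mic

# Toy example: a direct path plus one weaker reflection one sample later.
far = np.arange(1.0, 9.0)
rir = np.array([1.0, 0.5])
echo, mic_single = simulate_near_end_mic(far, rir)
_, mic_double = simulate_near_end_mic(far, rir, near_end=np.ones(8))
```

In the single-talk case the microphone signal equals the echo; in the double-talk case, removing the echo leaves exactly the near-end speech, which is the goal of the cancellation steps that follow.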

[0003] Currently, commonly used methods include traditional signal processing methods and deep learning methods. Signal processing methods include the least mean square algorithm, the normalized least mean square algorithm, and frequency domain adaptive filters; deep learning methods construct networks to predict echo path audio and thus perform echo cancellation.

[0004] The disadvantage of traditional signal processing methods is that an audio segment must be preprocessed before echo cancellation, which amounts to a "warm-up" period during which the algorithm converges. Furthermore, changes in the near-end environment can necessitate a second "warm-up," which degrades echo cancellation. Researchers have therefore recently studied echo cancellation using deep learning methods in depth. This invention belongs to the category of deep learning-based echo cancellation methods. However, whereas other methods use a deep learning model to estimate the echo-path audio directly, the method proposed in this invention estimates the room impulse response that forms the echo and derives the echo from it. It therefore has stronger interpretability and wider adaptability.

Summary of the Invention

[0005] The technical problem to be solved by the present invention is to provide a novel echo cancellation method based on a room impulse response estimation model. By directly estimating the room impulse response, the echo path is estimated, thereby achieving echo cancellation. This method has stronger interpretability and a wider range of applications.

[0006] The technical solution adopted by this invention to solve its technical problem is: an echo cancellation method based on a room impulse response estimation model, comprising the following steps:

[0007] Step S1. Build a room impulse response estimation model;

[0008] Step S2. Train, evaluate, and validate the room impulse response estimation model;

[0009] Step S3. Dual-talk detection;

[0010] Step S4. Based on the detection results, the signal is classified as no speech, single-talk echo speech, or double-talk echo speech, and echo cancellation is performed along the corresponding path.

[0011] Furthermore, in step S1, the process of building the room impulse response estimation model is as follows:

[0012] Step S1-1. Data collection and synthesis. The required datasets are a clean speech dataset (TIMIT) and impulse response datasets (AIR dataset, ACE dataset), which are used to synthesize the reverberant speech dataset. Both are open-source datasets available on the Internet.

[0013] Step S1-2. Build the model, which includes an encoder layer and a decoder layer;

[0014] The encoder layer consists of two residual convolutional layers. Each residual convolutional layer is composed of a 2D convolutional layer, a batch normalization layer, a ReLU activation layer, a 2D convolutional layer, a batch normalization layer, and a ReLU activation layer, followed by an average pooling layer and a fully connected layer. The fully connected layer outputs the features, and the finally extracted features can characterize the basic parameters required for the room impulse response, so as to synthesize the room impulse response.

[0015] The decoder layer consists of a one-dimensional deconvolution layer, a batch normalization layer, a ReLU activation layer, another one-dimensional deconvolution layer, and finally a fully connected layer, which maps the waveform to a complete waveform.

[0016] Furthermore, in step S2, the process of training, evaluating, and validating the room impulse response estimation model is as follows:

[0017] Step S2-1. The loss function used for training is the SDR loss function: the L2 norm of the difference between the actual and predicted impulse responses, divided by the L2 norm of the actual impulse response, expressed in dB.
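One plausible reading of this loss can be written as follows. It is a sketch, not the patent's code: the name `sdr_loss_db`, the `20 * log10` conversion to dB, and the small `eps` guard against division by zero are assumptions. Under this reading, a lower (more negative) value means a better prediction.

```python
import numpy as np

def sdr_loss_db(h_true, h_pred, eps=1e-12):
    """Relative L2 error between actual and predicted impulse responses, in dB.

    Assumed form: 20 * log10(||h_true - h_pred||_2 / ||h_true||_2).
    eps is a hypothetical guard so that a perfect prediction stays finite.
    """
    h_true = np.asarray(h_true, dtype=float)
    h_pred = np.asarray(h_pred, dtype=float)
    err = np.linalg.norm(h_true - h_pred)  # L2 norm of the prediction error
    ref = np.linalg.norm(h_true)           # L2 norm of the actual response
    return 20.0 * np.log10((err + eps) / (ref + eps))

h = np.array([1.0, 0.5, 0.25])
loss_10_percent = sdr_loss_db(h, 0.9 * h)  # a 10% amplitude error, about -20 dB
```

In an actual training loop the same expression would be written with the tensor operations of the chosen deep learning framework so that gradients can flow through it.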

[0018] Step S2-2. The model evaluation method uses root mean square error for evaluation. By calculating the distance between the actual room impulse response and the predicted room impulse response, the accuracy of the prediction can be obtained, thereby judging the effectiveness of the model.

[0019] Step S2-3. During the application process, the only available speech is clear far-end speech and near-end speech with echo, while the room impulse response is unknown. The task is to estimate the room impulse response. Therefore, the application scenario is simulated, and a small amount of real dataset is used, including near-end speech with echo and the corresponding real known room impulse response. The near-end speech with echo is fed into the model to obtain the estimated room impulse response, and it is compared with the corresponding real room impulse response to verify its basic effectiveness. If the waveforms are similar when observed by the naked eye, it can be considered basically effective.

[0020] Step S2-4. After training, evaluation and validation, the model has considerable prediction accuracy, and the allowable root mean square error deviation range needs to be less than 0.01 to ensure the effect of subsequent echo cancellation.
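The evaluation in steps S2-2 and S2-4 might be sketched as below. The helper names are hypothetical, and applying the 0.01 limit to the mean RMSE over a validation set (rather than to each sample) is an assumption, since the text does not say how the errors are aggregated.

```python
import numpy as np

RMSE_LIMIT = 0.01  # allowable root mean square error stated in step S2-4

def rir_rmse(h_true, h_pred):
    """Root mean square error between actual and predicted impulse responses."""
    h_true = np.asarray(h_true, dtype=float)
    h_pred = np.asarray(h_pred, dtype=float)
    return float(np.sqrt(np.mean((h_true - h_pred) ** 2)))

def model_is_acceptable(pairs, limit=RMSE_LIMIT):
    """pairs: iterable of (h_true, h_pred) tuples from a validation set.

    Assumption: the model passes when the mean RMSE is below the limit.
    """
    errors = [rir_rmse(h_true, h_pred) for h_true, h_pred in pairs]
    return float(np.mean(errors)) < limit
```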

[0021] Furthermore, in step S3, the dual-talk detection process is as follows:

[0022] Energy detection is used to determine whether far-end and near-end speech are present. The average energy of the far-end signal and of the near-end signal is calculated; if it exceeds a threshold T_e, speech is considered present at that end. At this point it is known whether speech is present, but not whether the conversation is single-talk or double-talk. This is because in far-end single-talk, the echo at the near end can cause the system to mistakenly identify speech as present at the near end, thus misjudging the situation as double-talk. To resolve this, echo return loss enhancement (ERLE) is used for the judgment. It is calculated by comparing the energy of the far-end signal with the energy of the near-end signal. In far-end single-talk the ERLE is larger, while in double-talk it is smaller. A threshold T_ERLE is therefore set: when the ERLE is greater than the threshold, the signal is judged as single-talk; when it is less than or equal to the threshold, it is judged as double-talk.
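The two-stage decision described in this step could look roughly like the following. The threshold values, the dB form of the far-end to near-end energy ratio, and the extra `near_only` label (a case the text does not discuss) are all assumptions made for illustration.

```python
import numpy as np

def mean_energy(x):
    """Average energy (mean square) of a signal segment."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** 2)) if x.size else 0.0

def classify_segment(far_end, near_mic, t_e=1e-4, t_erle_db=6.0):
    """Return 'no_speech', 'near_only', 'single_talk' or 'double_talk'.

    t_e and t_erle_db are hypothetical values for the two thresholds.
    """
    e_far, e_near = mean_energy(far_end), mean_energy(near_mic)
    far_active, near_active = e_far > t_e, e_near > t_e
    if not far_active:
        # 'near_only' is not one of the text's three cases; there is no echo.
        return "near_only" if near_active else "no_speech"
    if not near_active:
        return "single_talk"  # far end active, echo below the energy threshold
    # Energy comparison used as the ERLE-style measure, following the text:
    # a large far/near ratio means the near-end signal is echo only.
    ratio_db = 10.0 * np.log10(e_far / e_near)
    return "single_talk" if ratio_db > t_erle_db else "double_talk"

far = np.full(160, 0.5)
echo_only = 0.1 * far      # weak echo, no near-end speaker
with_talker = echo_only + 0.5  # echo plus near-end speech
```

With these toy signals, `classify_segment(far, echo_only)` falls on the single-talk side of the ratio test and `classify_segment(far, with_talker)` on the double-talk side.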

[0023] Furthermore, in step S4, based on the detection results, the signal is classified as no speech, single-talk echo speech, or double-talk echo speech, and echo cancellation is performed along the corresponding path;

[0024] Step S4-1. If no voice is transmitted in the system, echo cancellation is obviously not required, so return directly;

[0025] Step S4-2. If the current speech is single-talk echo speech, the near-end speech contains only the reverberant echo of the far-end speech. This speech can be fed to the room impulse response estimation model to estimate the room impulse response. When the near-end speaker and room layout are fixed, there is only one room impulse response. To ensure that the system holds only one room impulse response for the current environment, the first estimate is stored internally as is. If the system already holds a room impulse response for the current environment, the new estimate is averaged with the previously stored one to update the stored impulse response, which keeps the change smooth. After the current room impulse response is obtained, the clean far-end speech is convolved with it; the result is the estimated echo-path audio. This estimated echo is subtracted from the echo-containing near-end speech to obtain the single-talk de-echoed audio, completing the echo cancellation.
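A minimal sketch of the store-and-average update and the subtraction in this step is shown below. The class and function names are hypothetical, the estimates are assumed to have a fixed length, and the equal-weight average is one literal reading of "averaged with the previous estimate".

```python
import numpy as np

class RirStore:
    """Holds a single room impulse response for the current environment."""

    def __init__(self):
        self.rir = None

    def update(self, new_rir):
        new_rir = np.asarray(new_rir, dtype=float)
        if self.rir is None:
            self.rir = new_rir.copy()              # first estimate: store as is
        else:
            self.rir = 0.5 * (self.rir + new_rir)  # average with the stored one
        return self.rir

def cancel_echo(far_end, near_mic, rir):
    """Subtract the estimated echo (far_end convolved with rir) from near_mic.

    Assumes far_end and near_mic are time-aligned and of equal length.
    """
    far_end = np.asarray(far_end, dtype=float)
    near_mic = np.asarray(near_mic, dtype=float)
    echo_hat = np.convolve(far_end, rir)[: len(near_mic)]
    return near_mic - echo_hat

# Toy check: with a perfect estimate, the single-talk residual is zero.
rng = np.random.default_rng(0)
far = rng.standard_normal(400)
true_rir = np.array([1.0, 0.0, 0.5])
mic = np.convolve(far, true_rir)[: len(far)]
store = RirStore()
residual = cancel_echo(far, mic, store.update(true_rir))
```

In a real system the argument to `update` would come from the trained estimation model applied to the near-end signal; here the true response is passed in only to show the arithmetic.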

[0026] Step S4-3. If the current speech is double-talk echo speech, the near-end signal contains both the echo and the speech of the near-end speaker. Since there are two sound sources, and the proposed model can only be used for single-source room impulse response estimation, impulse response estimation and updating are not performed in this scenario. Instead, the room impulse response estimated during single-talk is used directly for the de-echo operation. This is feasible because the room impulse response remains unchanged between single-talk and double-talk as long as the speaker and room layout do not change. In voice and video calls, single-talk occupies the majority of the time, usually with one person speaking and one person listening, so a room impulse response estimated during single-talk can be expected to be available in storage. Finally, the far-end speech is convolved with the stored impulse response to obtain the estimated echo-path audio. Subtracting this estimated echo from the echo-containing near-end speech yields the de-echoed audio of the double-talk speech.
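The three paths of step S4 can be combined into one dispatch function, sketched below. `estimate_rir` is a hypothetical callable standing in for the trained model, the state labels are illustrative, and passing the signal through unchanged when no impulse response has been stored yet is an assumption the text does not address.

```python
import numpy as np

def process_segment(state, far_end, near_mic, stored_rir, estimate_rir=None):
    """Return (output, stored_rir) for one segment.

    state: 'no_speech', 'single_talk' or 'double_talk' (result of step S3).
    estimate_rir: hypothetical stand-in for the trained estimation model.
    """
    near_mic = np.asarray(near_mic, dtype=float)
    if state == "no_speech":
        return near_mic, stored_rir  # S4-1: nothing to cancel
    if state == "single_talk" and estimate_rir is not None:
        new_rir = np.asarray(estimate_rir(near_mic), dtype=float)
        # S4-2: store the first estimate, otherwise average with the old one.
        stored_rir = new_rir if stored_rir is None else 0.5 * (stored_rir + new_rir)
    # S4-3: in double-talk the stored response is reused and never updated.
    if stored_rir is None:
        return near_mic, stored_rir  # assumption: pass through until an RIR exists
    far_end = np.asarray(far_end, dtype=float)
    echo_hat = np.convolve(far_end, stored_rir)[: len(near_mic)]
    return near_mic - echo_hat, stored_rir

# Toy run: single-talk first (estimate stored), then double-talk (estimate reused).
rng = np.random.default_rng(0)
far, talker = rng.standard_normal(300), rng.standard_normal(300)
true_rir = np.array([1.0, 0.3])
echo = np.convolve(far, true_rir)[:300]
_, rir = process_segment("single_talk", far, echo, None, lambda mic: true_rir)
recovered, _ = process_segment("double_talk", far, echo + talker, rir)
```

With a perfect stored response, `recovered` matches the near-end talker's signal, which illustrates why the quality of the single-talk estimate bounds the double-talk performance.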

[0027] Compared with the prior art, the advantages of the present invention are as follows:

[0028] (1) Compared with traditional echo cancellation, this method does not require a "warm-up" action. When receiving single or double talk, echo cancellation can be performed directly by this method, and it has consistent performance. This method does not depend on the environment and signal. The performance of related echo cancellation depends on the performance of the impulse response estimation model.

[0029] (2) Compared with other methods that use deep learning networks for echo cancellation, the advantages are as follows. First, by estimating the room impulse response, the echo path can be estimated completely, which gives better interpretability. Second, estimating the room impulse response provides more acoustic information about the room, which helps to decide when the algorithm should be invoked. For example, when the room reverberation is small, the echo is very weak and the algorithm need not be invoked, saving computing resources.

Attached Figure Description

[0030] Figure 1 is a flowchart of an embodiment of the present invention;

[0031] Figure 2 shows the structure of the room impulse response estimation model in the embodiment shown in Figure 1.

Detailed Implementation

[0032] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.

[0033] Referring to Figure 1, an echo cancellation method based on a room impulse response estimation model includes the following steps:

[0034] Step S1. Build a room impulse response estimation model;

[0035] Step S1-1. Data collection and synthesis: the required datasets are a clean speech dataset (TIMIT) and impulse response datasets (AIR dataset, ACE dataset), which are used to synthesize the reverberant speech dataset. Both are open-source datasets available on the Internet. The reverberant speech dataset is synthesized in about 56 hours, and 80% of it is selected as the training set.
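The synthesis and split described in this step could be sketched as follows, with the audio already loaded into arrays. Loading TIMIT, AIR and ACE is not shown; pairing each utterance with one randomly chosen impulse response, truncating to the utterance length, and the function name are assumptions.

```python
import numpy as np

def synthesize_reverberant_set(clean_utterances, rirs, train_fraction=0.8, seed=0):
    """Build (reverberant_speech, rir) pairs and split them into train/held-out.

    clean_utterances, rirs: lists of 1-D arrays (assumed already loaded).
    """
    rng = np.random.default_rng(seed)
    examples = []
    for speech in clean_utterances:
        rir = rirs[rng.integers(len(rirs))]                    # random pairing
        reverberant = np.convolve(speech, rir)[: len(speech)]  # model input
        examples.append((reverberant, rir))                    # rir is the target
    order = rng.permutation(len(examples))
    n_train = int(train_fraction * len(examples))  # 80% for training by default
    train = [examples[i] for i in order[:n_train]]
    held_out = [examples[i] for i in order[n_train:]]
    return train, held_out
```

The held-out portion would then serve the evaluation and validation described in step S2.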

[0036] Step S1-2. Build the model. As shown in Figure 2, the model includes an encoder layer and a decoder layer;

[0037] The encoder layer consists of two residual convolutional layers. Each residual convolutional layer is composed of a 2D convolutional layer, a batch normalization layer, a ReLU activation layer, another 2D convolutional layer, a batch normalization layer, and a ReLU activation layer, followed by an average pooling layer and a fully connected layer. The parameters of the 2D convolutional layers are all set consistently, with a kernel size of 3*3 and a stride of 1. The kernel parameters of the average pooling layer are set to 3*3. The fully connected layer outputs features in a fixed 512-dimensional form. The final extracted features can characterize the basic parameters required for the room impulse response, so as to synthesize the room impulse response.

[0038] The decoder layer consists of a one-dimensional deconvolution layer with a kernel size of 3*3 and a stride of 2, a batch normalization layer, a ReLU activation layer, followed by another one-dimensional deconvolution layer with the same parameter settings as the previous one-dimensional deconvolution layer, and finally a fully connected layer to map it into a complete waveform.
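To see what the stated kernel sizes and strides imply for tensor sizes, the standard output-length formulas for convolution and transposed convolution (dilation 1, no output padding) can be evaluated directly. The padding values and the idea of treating the 512-dimensional feature vector as a length-512 sequence at the decoder input are assumptions; the text does not specify them.

```python
def conv_out(n, kernel=3, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=3, stride=2, padding=0):
    """Output length of a transposed convolution (deconvolution) along one axis."""
    return (n - 1) * stride - 2 * padding + kernel

def decoder_length(n_features=512):
    """Length after the two stride-2 deconvolution layers of the decoder."""
    return deconv_out(deconv_out(n_features))

# With no padding, each 3*3, stride-1 convolution trims 2 from each axis;
# with padding=1 the size is preserved.
unpadded = conv_out(128)             # 126
padded = conv_out(128, padding=1)    # 128
upsampled = decoder_length(512)      # 512 -> 1025 -> 2051
```

Each stride-2 deconvolution roughly doubles the length, so the final fully connected layer is what fixes the length of the output waveform.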

[0039] Step S2. Train, evaluate, and validate the room impulse response estimation model;

[0040] Step S2-1. The loss function used for training is the SDR loss function: the L2 norm of the difference between the actual and predicted impulse responses, divided by the L2 norm of the actual impulse response, expressed in dB.

[0041] Step S2-2. The model evaluation method uses root mean square error for evaluation. By calculating the distance between the actual room impulse response and the predicted room impulse response, the accuracy of the prediction can be obtained, thereby judging the effectiveness of the model.

[0042] Step S2-3. During the application process, the only available speech is clear far-end speech and near-end speech with echo, while the room impulse response is unknown. The task is to estimate the room impulse response. Therefore, the application scenario is simulated, and a small amount of real dataset (including near-end speech with echo and the corresponding real known room impulse response) is used. The near-end speech with echo is fed into the model to obtain the estimated room impulse response, and it is compared with the corresponding real room impulse response to verify its basic effectiveness. If the waveforms are similar when observed by the naked eye, it can be considered basically effective.

[0043] Step S2-4. After training, evaluation, and validation, the model has considerable prediction accuracy, and the allowable root mean square error deviation range needs to be less than 0.01 to ensure the effectiveness of subsequent echo cancellation.

[0044] Step S3. Dual-talk detection;

[0045] Energy detection is used to determine whether far-end and near-end speech are present. The average energy of the far-end signal and of the near-end signal is calculated; if it exceeds a threshold T_e, speech is considered present at that end. At this point it is known whether speech is present, but not whether the conversation is single-talk or double-talk. This is because in far-end single-talk, the echo at the near end can cause the system to mistakenly identify speech as present at the near end, thus misjudging the situation as double-talk. To resolve this, echo return loss enhancement (ERLE) is used for the judgment. It is calculated by comparing the energy of the far-end signal with the energy of the near-end signal. In far-end single-talk the ERLE is larger, while in double-talk it is smaller. A threshold T_ERLE is therefore set: when the ERLE is greater than the threshold, the signal is judged as single-talk; when it is less than or equal to the threshold, it is judged as double-talk.

[0046] Step S4. Based on the detection results, the signal is classified as no speech, single-talk echo speech, or double-talk echo speech, and echo cancellation is performed along the corresponding path.

[0047] Step S4-1. If no voice is transmitted in the system, echo cancellation is obviously not required, so return directly;

[0048] Step S4-2. If the current speech is single-talk echo speech, the near-end speech contains only the reverberant echo of the far-end speech. This speech can be fed to the room impulse response estimation model to estimate the room impulse response. When the near-end speaker and room layout are fixed, there is only one room impulse response. To ensure that the system holds only one room impulse response for the current environment, the first estimate is stored internally as is. If the system already holds a room impulse response for the current environment, the new estimate is averaged with the previously stored one to update the stored impulse response, which keeps the change smooth. After the current room impulse response is obtained, the clean far-end speech is convolved with it; the result is the estimated echo-path audio. This estimated echo is subtracted from the echo-containing near-end speech to obtain the single-talk de-echoed audio, completing the echo cancellation.

[0049] Step S4-3. If the current speech is double-talk echo speech, the near-end signal contains both the echo and the near-end speaker's speech. Since there are two sound sources, and the proposed model can only be used for single-source room impulse response estimation, impulse response estimation and updating are not performed in this scenario. Instead, the room impulse response estimated during single-talk is used directly for the de-echo operation. This is feasible because the room impulse response remains unchanged between single-talk and double-talk as long as the speaker and room layout do not change. Furthermore, in voice and video calls, single-talk occupies the majority of the time, usually with one person speaking and one person listening, so a room impulse response estimated during single-talk can be expected to be available in storage. Finally, the far-end speech is convolved with the stored impulse response to obtain the estimated echo-path audio. Subtracting this estimated echo from the echo-containing near-end speech yields the double-talk de-echoed audio.

[0050] This invention addresses echo cancellation in voice and video environments. It predicts the room impulse response of the voice environment, thereby predicting the echo path and performing echo cancellation to improve voice quality in these environments. Furthermore, this method has a wide range of applications, is not limited by the environment of the voice signal, has strong interpretability, and exhibits excellent echo cancellation performance.

[0051] Those skilled in the art can make various modifications and variations to this invention. If such modifications and variations are within the scope of the claims of this invention and their equivalents, then such modifications and variations are also within the protection scope of this invention.

[0052] The contents not described in detail in the specification are prior art known to those skilled in the art.

Claims

1. An echo cancellation method based on a room impulse response estimation model, characterized in that it includes the following steps: Step S1. Build a room impulse response estimation model; Step S2. Train, evaluate, and validate the room impulse response estimation model; Step S3. Double-talk detection; Step S4. Based on the detection results, the signal is classified as no speech, single-talk echo speech, or double-talk echo speech, and echo cancellation is performed along the corresponding path. Step S4-1. If no voice is transmitted in the system, echo cancellation is not required, so return directly; Step S4-2. If the current speech is single-talk echo speech, the near-end speech contains only the reverberant echo of the far-end speech. This speech can be fed to the room impulse response estimation model to estimate the room impulse response. When the near-end speaker and room layout are fixed, there is only one room impulse response. To ensure that the system holds only one room impulse response for the current environment, the first estimate is stored internally as is. If the system already holds a room impulse response for the current environment, the new estimate is averaged with the previously stored one to update the stored impulse response, which keeps the change smooth. After the current room impulse response is obtained, the clean far-end speech is convolved with it; the result is the estimated echo-path audio. This estimated echo is subtracted from the echo-containing near-end speech to obtain the single-talk de-echoed audio, completing the echo cancellation. Step S4-3. If the current speech is double-talk echo speech, the near-end signal contains both the echo and the speech of the near-end speaker. Since there are two sound sources, and the proposed model can only be used for single-source room impulse response estimation, impulse response estimation and updating are not performed in this scenario. Instead, the room impulse response estimated during single-talk is used directly for the de-echo operation. This is feasible because the room impulse response remains unchanged between single-talk and double-talk as long as the speaker and room layout do not change. In voice and video calls, single-talk occupies the majority of the time, usually with one person speaking and one person listening, so a room impulse response estimated during single-talk can be expected to be available in storage. Finally, the far-end speech is convolved with the stored impulse response to obtain the estimated echo-path audio. Subtracting this estimated echo from the echo-containing near-end speech yields the de-echoed audio of the double-talk speech.

2. The echo cancellation method based on the room impulse response estimation model according to claim 1, characterized in that, In step S1, the process of building the room impulse response estimation model is as follows: Step S1-1. Data collection and synthesis. The required datasets are clear speech dataset and impulse response dataset, which are used to synthesize reverberant speech dataset. The source of the required clear speech dataset and impulse response dataset is open source datasets on the Internet. Step S1-2. Build the model, which includes an encoder layer and a decoder layer; The encoder layer consists of two residual convolutional layers. Each residual convolutional layer is composed of a two-dimensional convolutional layer, a batch normalization layer, a ReLU activation layer, a two-dimensional convolutional layer, a batch normalization layer, and a ReLU activation layer, followed by an average pooling layer and a fully connected layer. The fully connected layer outputs features, and the extracted features can characterize the basic parameters required for the room impulse response, so as to synthesize the room impulse response; The decoder layer consists of a one-dimensional deconvolution layer, a batch normalization layer, a ReLU activation layer, another one-dimensional deconvolution layer, and finally a fully connected layer, which maps the waveform into a complete waveform.

3. The echo cancellation method based on the room impulse response estimation model according to claim 2, characterized in that, in step S2, the process of training, evaluating, and validating the room impulse response estimation model is as follows: Step S2-1. The loss function used for training is the SDR loss function: the L2 norm of the difference between the actual and predicted impulse responses, divided by the L2 norm of the actual impulse response, expressed in dB. Step S2-2. The model is evaluated using root mean square error. By calculating the distance between the actual room impulse response and the predicted room impulse response, the accuracy of the prediction can be obtained, thereby judging the effectiveness of the model. Step S2-3. During application, the only available speech is clean far-end speech and near-end speech with echo, while the room impulse response is unknown. The task is to estimate the room impulse response. Therefore, the application scenario is simulated, and a small real dataset is used, including near-end speech with echo and the corresponding known real room impulse response. The near-end speech with echo is fed into the model to obtain the estimated room impulse response, which is compared with the corresponding real room impulse response to verify its effectiveness. If the waveforms are similar, it can be considered effective. Step S2-4. After training, evaluation and validation, the model has considerable prediction accuracy, and the allowable root mean square error needs to be less than 0.01 to ensure the effect of subsequent echo cancellation.

4. The echo cancellation method based on the room impulse response estimation model according to claim 3, characterized in that, in step S3, the double-talk detection process is as follows: Energy detection is used to determine whether far-end and near-end speech are present. The average energy of the far-end signal and of the near-end signal is calculated; if it exceeds a threshold T_e, speech is considered present at that end. At this point the presence of speech is known, but not whether the conversation is single-talk or double-talk. This is because in far-end single-talk, the echo at the near end can cause the system to mistakenly identify speech as present at the near end, leading to a misjudgment as double-talk. To resolve this, echo return loss enhancement (ERLE) is used for the judgment. It is calculated by comparing the energy of the far-end signal with the energy of the near-end signal. In far-end single-talk the ERLE is larger, while in double-talk it is smaller. A threshold T_ERLE is therefore set: when the ERLE is greater than the threshold T_ERLE, the signal is judged as single-talk; when it is less than or equal to the threshold T_ERLE, it is judged as double-talk.