An echo cancellation method based on a room impulse response estimation model
A deep learning room impulse response estimation model directly estimates the room impulse response in order to eliminate echoes. This addresses the warm-up requirement and the environmental sensitivity of existing techniques, and yields stable echo cancellation and improved call quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2023-11-25
- Publication Date
- 2026-06-23
AI Technical Summary
Existing echo cancellation methods in remote communication require a warm-up process and are sensitive to environmental changes, resulting in unstable performance.
A room impulse response estimation model, built as a deep learning model with encoder and decoder layers, directly estimates the room impulse response in order to eliminate echo. The method comprises dataset collection, model training, and double-talk detection, and applies to both single-talk and double-talk scenarios.
It eliminates echoes without a warm-up process, offers greater interpretability and wide applicability, adapts to environmental changes, and improves call quality.
Figure CN117351987B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of acoustic signal processing technology, specifically to an echo cancellation method based on a room impulse response estimation model.
Background Technology
[0002] In remote communication environments such as video conferencing and voice calls, the two parties in a conversation are generally referred to as the far end and the near end. The far-end microphone picks up the far-end voice and transmits it to the near-end loudspeaker, which radiates it into the room, creating a reverberant echo. This process can be modeled as the convolution of the far-end voice with the room impulse response (the near-end echo arises from room reflections, so "reverberant echo" and "echo" are used synonymously hereafter). There are three scenarios. If neither party speaks, no voice signal is transmitted in the system and there is no echo. If only the far-end speaker speaks, the far-end speaker hears an echo of their own voice. If both parties speak simultaneously, the signal received by the near-end microphone is the reverberant echo of the far-end voice superimposed on the near-end speaker's voice, so the near-end speech is mixed with echo and hard to understand, which degrades call quality. The latter two scenarios require acoustic echo cancellation.
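The three scenarios can be summarized with a simple signal model. The notation is an assumption introduced here for illustration, since the text does not define symbols: x(n) is the far-end speech, h(n) the room impulse response of length K, s(n) the near-end speaker's speech, d(n) the reverberant echo, and y(n) the near-end microphone signal, with background noise ignored.

```latex
\begin{aligned}
d(n) &= (x * h)(n) = \sum_{k=0}^{K-1} h(k)\, x(n-k) && \text{(reverberant echo)} \\
y(n) &= 0 && \text{(no speech)} \\
y(n) &= d(n) && \text{(far-end single-talk)} \\
y(n) &= d(n) + s(n) && \text{(double-talk)}
\end{aligned}
```

Under this model, an estimate of h(n) is enough to reconstruct d(n) from the known far-end signal and subtract it from y(n).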
[0003] Currently, commonly used methods include traditional signal processing methods and deep learning methods. Signal processing methods include the least mean square algorithm, the normalized least mean square algorithm, and frequency domain adaptive filters; deep learning methods construct networks to predict echo path audio and thus perform echo cancellation.
[0004] The disadvantage of traditional signal processing methods is that an audio segment must be processed before echo cancellation becomes effective; in essence, a "warm-up" period is required for the algorithm to converge. Furthermore, changes in the near-end environment can necessitate a second "warm-up," degrading echo cancellation. For this reason, researchers have recently studied echo cancellation with deep learning methods in depth. This invention belongs to the category of deep learning-based echo cancellation methods. However, unlike other methods that use a deep learning model to estimate the echo-path audio directly, the method proposed in this invention estimates the room impulse response that forms the echo. It therefore has stronger interpretability and wider adaptability.
Summary of the Invention
[0005] The technical problem to be solved by the present invention is to provide a novel echo cancellation method based on a room impulse response estimation model. By directly estimating the room impulse response, the echo path is estimated, thereby achieving echo cancellation. This method has stronger interpretability and a wider range of applications.
[0006] The technical solution adopted by this invention to solve its technical problem is: an echo cancellation method based on a room impulse response estimation model, comprising the following steps:
[0007] Step S1. Build a room impulse response estimation model;
[0008] Step S2. Train, evaluate, and validate the room impulse response estimation model;
[0009] Step S3. Double-talk detection;
[0010] Step S4. Based on the detection results, classify the signal as no speech, single-talk echo speech, or double-talk echo speech, and perform echo cancellation along the corresponding path.
[0011] Furthermore, in step S1, the process of building the room impulse response estimation model is as follows:
[0012] Step S1-1. Data collection and synthesis. A clean speech dataset (TIMIT) and impulse response datasets (AIR dataset, ACE dataset) are required and are used to synthesize a reverberant speech dataset. Both the clean speech dataset and the impulse response datasets are open-source datasets available on the Internet.
[0013] Step S1-2. Build the model, which includes an encoder layer and a decoder layer;
[0014] The encoder layer consists of two residual convolutional layers. Each residual convolutional layer is composed of a 2D convolutional layer, a batch normalization layer, a ReLU activation layer, a 2D convolutional layer, a batch normalization layer, and a ReLU activation layer, followed by an average pooling layer and a fully connected layer. The fully connected layer outputs the features, and the finally extracted features can characterize the basic parameters required for the room impulse response, so as to synthesize the room impulse response.
[0015] The decoder layer consists of a one-dimensional deconvolution layer, a batch normalization layer, a ReLU activation layer, another one-dimensional deconvolution layer, and finally a fully connected layer, which maps the output into a complete waveform.
[0016] Furthermore, in step S2, the process of training, evaluating, and validating the room impulse response estimation model is as follows:
[0017] Step S2-1. The loss function used for training is the SDR loss function, which is calculated by taking the L2 norm of the difference between the actual impulse response and the predicted impulse response, and then dividing by the L2 norm of the actual impulse response. The unit is dB.
[0018] Step S2-2. The model evaluation method uses root mean square error for evaluation. By calculating the distance between the actual room impulse response and the predicted room impulse response, the accuracy of the prediction can be obtained, thereby judging the effectiveness of the model.
[0019] Step S2-3. In application, the only signals available are the clean far-end speech and the near-end speech with echo; the room impulse response is unknown and must be estimated. The application scenario is therefore simulated with a small real dataset consisting of near-end speech with echo and the corresponding known room impulse responses. The near-end speech with echo is fed into the model to obtain the estimated room impulse response, which is compared with the corresponding real room impulse response to verify basic effectiveness. If the two waveforms look similar on visual inspection, the model can be considered basically effective.
[0020] Step S2-4. After training, evaluation, and validation, the model must reach sufficient prediction accuracy: the root mean square error must be less than 0.01 to ensure the effectiveness of subsequent echo cancellation.
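The loss and evaluation metric of steps S2-1 to S2-4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text only says the SDR loss is a ratio of L2 norms expressed in dB, so the exact form `10*log10(||h - h_pred||^2 / ||h||^2)` and the small `eps` guard are assumptions, and the function names are hypothetical.

```python
import math

def sdr_loss_db(h_true, h_pred, eps=1e-12):
    """Assumed SDR-style loss in dB: error energy relative to reference energy.

    Lower (more negative) values mean the prediction is closer to the target.
    """
    err = sum((a - b) ** 2 for a, b in zip(h_true, h_pred))
    ref = sum(a ** 2 for a in h_true)
    return 10.0 * math.log10((err + eps) / (ref + eps))

def rmse(h_true, h_pred):
    """Root mean square error between two equal-length impulse responses."""
    sq = sum((a - b) ** 2 for a, b in zip(h_true, h_pred))
    return math.sqrt(sq / len(h_true))

def meets_accuracy(h_true, h_pred, max_rmse=0.01):
    """Step S2-4 acceptance check: RMSE must be below 0.01."""
    return rmse(h_true, h_pred) < max_rmse

# Toy impulse responses (made-up values, for illustration only).
h_true = [1.0, 0.5, 0.25, 0.125]
h_pred = [0.995, 0.505, 0.245, 0.13]
print(sdr_loss_db(h_true, h_pred), rmse(h_true, h_pred), meets_accuracy(h_true, h_pred))
```

In this toy case every tap is off by 0.005, so the RMSE is about 0.005 and the prediction passes the 0.01 criterion.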
[0021] Furthermore, in step S3, the double-talk detection process is as follows:
[0022] Energy detection is used to determine whether far-end and near-end speech are present. The average energy of the far-end signal and of the near-end signal is calculated; if it exceeds a threshold T_e, speech is considered present at that end. At this point it is known whether speech is present, but not whether the situation is single-talk or double-talk. This is because during far-end single-talk, the echo at the near end can cause the system to detect speech at the near end and thus misjudge the situation as double-talk. In this case, echo return loss enhancement (ERLE) is used for the decision. It is calculated by comparing the energy of the far-end signal with the energy of the near-end signal. During far-end single-talk the ERLE is larger, while during double-talk the ERLE is smaller. A threshold T_ERLE is then set: when the ERLE is greater than the threshold, the situation is judged as single-talk; when it is less than or equal to the threshold, it is judged as double-talk.
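The two-stage decision above can be sketched in a few lines. This is a hedged illustration: the threshold values, the dB form of the far-end to near-end energy ratio, and the names `mean_energy`, `energy_ratio_db`, and `classify_frame` are assumptions. Cases where only one end is active are folded into the no-speech path here, on the assumption that there is then no echo to cancel; the text does not address them.

```python
import math

def mean_energy(frame):
    """Average energy (mean of squared samples) of one signal frame."""
    return sum(v * v for v in frame) / len(frame) if frame else 0.0

def energy_ratio_db(far, near, eps=1e-12):
    """Far-end to near-end energy ratio in dB (the ERLE-style quantity of step S3)."""
    return 10.0 * math.log10((mean_energy(far) + eps) / (mean_energy(near) + eps))

def classify_frame(far, near, t_e, t_erle_db):
    """Return 'no_speech', 'single_talk', or 'double_talk'."""
    far_active = mean_energy(far) > t_e
    near_active = mean_energy(near) > t_e
    if not (far_active and near_active):
        return "no_speech"  # assumption: nothing to cancel unless both ends show energy
    # Echo alone is much weaker than the far-end signal, so the ratio is large.
    if energy_ratio_db(far, near) > t_erle_db:
        return "single_talk"
    return "double_talk"

# Toy frames and hypothetical thresholds (T_e = 1e-3, T_ERLE = 10 dB).
far = [0.5, -0.5, 0.5, -0.5]
echo_only = [0.1 * v for v in far]           # attenuated echo of the far end
echo_plus_talker = [0.45, 0.35, -0.35, -0.45]  # echo plus near-end speech
print(classify_frame(far, echo_only, 1e-3, 10.0))         # single_talk
print(classify_frame(far, echo_plus_talker, 1e-3, 10.0))  # double_talk
```

In practice both thresholds would need tuning to the signal levels of the deployment; the values above only suit the toy frames.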
[0023] Furthermore, in step S4, based on the detection results, the signal is classified as no speech, single-talk echo speech, or double-talk echo speech, and echo cancellation is performed along the corresponding path:
[0024] Step S4-1. If no voice is transmitted in the system, echo cancellation is obviously not required, so return directly;
[0025] Step S4-2. If the current speech is single-talk echo speech, the near-end speech contains only the reverberant echo of the far-end speech, and the room impulse response can be estimated from it using the room impulse response estimation model. When the near-end speaker and room layout are fixed, there is only one room impulse response. To ensure that the system holds a single room impulse response for the current environment, the first estimate is stored directly. If the system already holds a room impulse response for the current environment, the new estimate is averaged with the stored one to update it, which keeps changes smooth. Once the current room impulse response is available, the clean far-end speech is convolved with it; the result is the estimated echo-path audio. This estimated echo is subtracted from the echo-containing near-end speech to obtain the single-talk de-echoed audio, completing the echo cancellation.
[0026] Step S4-3. If the current speech is double-talk echo speech, the near-end speech contains both the echo and the speech of the near-end speaker. Since there are two sound sources and the proposed model can only estimate the room impulse response from a single source, impulse response estimation and updating are not performed in this scenario. Instead, the room impulse response estimated during single-talk is used directly for echo removal. This is feasible because, as long as the speaker and room layout do not change, the room impulse response is the same in single-talk and double-talk. In voice and video calls, single-talk occupies the majority of the time, usually with one person speaking and one person listening, so a room impulse response estimated during single-talk can be expected to be available. Finally, the far-end speech is convolved with the stored impulse response to obtain the estimated echo-path audio, and subtracting this estimated echo from the echo-containing near-end speech yields the de-echoed double-talk audio.
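The three paths of step S4 can be sketched as a small dispatcher. This is an illustration under assumptions, not the patented implementation: `estimate_rir` is a hypothetical stand-in for the trained model, the class and method names are invented, and returning the near-end signal unchanged on the no-speech path (or when no impulse response has been stored yet) is an assumed reading of "return directly".

```python
def convolve(x, h):
    """Linear convolution of x with h, truncated to len(x) samples."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        acc = 0.0
        for k in range(min(n + 1, len(h))):
            acc += h[k] * x[n - k]
        y[n] = acc
    return y

class EchoCanceller:
    def __init__(self, estimate_rir):
        self.estimate_rir = estimate_rir  # hypothetical: near-end audio -> RIR estimate
        self.rir = None                   # the single stored room impulse response

    def _update_rir(self, near):
        new = self.estimate_rir(near)
        if self.rir is None:
            self.rir = list(new)  # first estimate is stored directly
        else:
            # later estimates are averaged with the stored one (step S4-2)
            self.rir = [(a + b) / 2.0 for a, b in zip(self.rir, new)]

    def process(self, far, near, state):
        if state == "no_speech":
            return list(near)             # S4-1: nothing to cancel
        if state == "single_talk":
            self._update_rir(near)        # S4-2: estimate or refresh the RIR
        if self.rir is None:
            return list(near)             # assumption: double-talk before any estimate
        echo = convolve(far, self.rir)    # estimated echo-path audio
        return [y - d for y, d in zip(near, echo)]  # S4-2 / S4-3: subtract echo

# Toy usage with an oracle estimator that returns the true RIR.
true_rir = [0.5, 0.25]
far = [1.0, 0.0, -1.0, 0.5]
echo = convolve(far, true_rir)
ec = EchoCanceller(lambda near: true_rir)
print(ec.process(far, echo, "single_talk"))
talker = [0.1, 0.2, 0.3, 0.4]
print(ec.process(far, [e + s for e, s in zip(echo, talker)], "double_talk"))
```

With the oracle estimator the single-talk output is all zeros and the double-talk output recovers the near-end talker, which is the behavior the text describes when the estimate is accurate.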
[0027] Compared with the prior art, the advantages of the present invention are as follows:
[0028] (1) Compared with traditional echo cancellation, this method does not require a "warm-up" action. When receiving single or double talk, echo cancellation can be performed directly by this method, and it has consistent performance. This method does not depend on the environment and signal. The performance of related echo cancellation depends on the performance of the impulse response estimation model.
[0029] (2) Compared with other methods that use deep learning networks for echo cancellation, the advantages are as follows. First, estimating the room impulse response allows the echo path to be estimated completely, which gives better interpretability. Second, the estimated room impulse response provides more acoustic information about the room, which helps decide when the algorithm should be invoked. For example, when room reverberation is low, the echo is very weak and the algorithm need not be invoked, saving computing resources.
Attached Figure Description
[0030] Figure 1 is a flowchart of an embodiment of the present invention;
[0031] Figure 2 is a structural diagram of the room impulse response estimation model in the embodiment shown in Figure 1.
Detailed Implementation
[0032] The present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0033] Referring to Figure 1, an echo cancellation method based on a room impulse response estimation model includes the following steps:
[0034] Step S1. Build a room impulse response estimation model;
[0035] Step S1-1. Data collection and synthesis: a clean speech dataset (TIMIT) and impulse response datasets (AIR dataset, ACE dataset) are required and are used to synthesize a reverberant speech dataset. Both the clean speech dataset and the impulse response datasets are open-source datasets available on the Internet. About 56 hours of reverberant speech are synthesized, and 80% of it is selected as the training set.
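The synthesis and split in step S1-1 can be sketched with toy data. This is a minimal illustration under assumptions: loading of the TIMIT, AIR, and ACE files is not shown, the random pairing of utterances with impulse responses and the fixed seeds are choices made here, and the function names are hypothetical. The text states only that 80% is used for training; how the remainder is used is not specified.

```python
import random

def convolve_full(x, h):
    """Full linear convolution; output length is len(x) + len(h) - 1."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for k, hv in enumerate(h):
            y[i + k] += xv * hv
    return y

def synthesize_pairs(clean_set, rir_set, seed=0):
    """Pair each clean utterance with a randomly chosen RIR.

    Returns (reverberant_speech, rir) tuples: model input and training target.
    """
    rng = random.Random(seed)
    pairs = []
    for clean in clean_set:
        rir = rng.choice(rir_set)
        pairs.append((convolve_full(clean, rir), rir))
    return pairs

def split_train(pairs, train_fraction=0.8, seed=0):
    """Shuffle and split; the first part is the training set."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for clean utterances and measured impulse responses.
clean_set = [[float(i), 1.0, -1.0, 0.5] for i in range(10)]
rir_set = [[1.0, 0.5], [1.0, 0.3, 0.1]]
train, rest = split_train(synthesize_pairs(clean_set, rir_set))
print(len(train), len(rest))  # 8 2
```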
[0036] Step S1-2. Build the model. As shown in Figure 2, the model includes an encoder layer and a decoder layer;
[0037] The encoder layer consists of two residual convolutional layers. Each residual convolutional layer is composed of a 2D convolutional layer, a batch normalization layer, a ReLU activation layer, another 2D convolutional layer, a batch normalization layer, and a ReLU activation layer, followed by an average pooling layer and a fully connected layer. The parameters of the 2D convolutional layers are all set consistently, with a kernel size of 3*3 and a stride of 1. The kernel parameters of the average pooling layer are set to 3*3. The fully connected layer outputs features in a fixed 512-dimensional form. The final extracted features can characterize the basic parameters required for the room impulse response, so as to synthesize the room impulse response.
[0038] The decoder layer consists of a one-dimensional deconvolution layer with a kernel size of 3*3 and a stride of 2, a batch normalization layer, a ReLU activation layer, followed by another one-dimensional deconvolution layer with the same parameter settings as the previous one-dimensional deconvolution layer, and finally a fully connected layer to map it into a complete waveform.
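To see how the two stride-2 deconvolution layers expand a sequence before the final fully connected layer, the standard output-length formula for a one-dimensional transposed convolution can be applied. This is arithmetic only and rests on assumptions not confirmed by the text: the "3*3" kernel is read as size 3 along the time axis, padding is taken as zero, and the 512 encoder features are treated as a length-512 single-channel sequence purely for illustration.

```python
def deconv1d_out_len(l_in, kernel=3, stride=2, padding=0, output_padding=0, dilation=1):
    """Output length of a 1-D transposed convolution (standard formula)."""
    return (l_in - 1) * stride - 2 * padding + dilation * (kernel - 1) + output_padding + 1

def decoder_out_len(l_in, n_layers=2):
    """Length after stacking n_layers identical deconvolution layers."""
    length = l_in
    for _ in range(n_layers):
        length = deconv1d_out_len(length)
    return length

# Under the assumptions above: 512 -> 1025 -> 2051 samples.
print(decoder_out_len(512))  # 2051
```

Each such layer roughly doubles the length, and the fully connected layer would then map the result to the target impulse response length.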
[0039] Step S2. Train, evaluate, and validate the room impulse response estimation model;
[0040] Step S2-1. The loss function used for training is the SDR loss function, which is calculated by taking the L2 norm of the difference between the actual impulse response and the predicted impulse response, and then dividing by the L2 norm of the actual impulse response. The unit is dB.
[0041] Step S2-2. The model evaluation method uses root mean square error for evaluation. By calculating the distance between the actual room impulse response and the predicted room impulse response, the accuracy of the prediction can be obtained, thereby judging the effectiveness of the model.
[0042] Step S2-3. In application, the only signals available are the clean far-end speech and the near-end speech with echo; the room impulse response is unknown and must be estimated. The application scenario is therefore simulated with a small real dataset consisting of near-end speech with echo and the corresponding known room impulse responses. The near-end speech with echo is fed into the model to obtain the estimated room impulse response, which is compared with the corresponding real room impulse response to verify basic effectiveness. If the two waveforms look similar on visual inspection, the model can be considered basically effective.
[0043] Step S2-4. After training, evaluation, and validation, the model must reach sufficient prediction accuracy: the root mean square error must be less than 0.01 to ensure the effectiveness of subsequent echo cancellation.
[0044] Step S3. Double-talk detection;
[0045] Energy detection is used to determine whether far-end and near-end speech are present. The average energy of the far-end signal and of the near-end signal is calculated; if it exceeds a threshold T_e, speech is considered present at that end. At this point it is known whether speech is present, but not whether the situation is single-talk or double-talk. This is because during far-end single-talk, the echo at the near end can cause the system to detect speech at the near end and thus misjudge the situation as double-talk. In this case, echo return loss enhancement (ERLE) is used for the decision. It is calculated by comparing the energy of the far-end signal with the energy of the near-end signal. During far-end single-talk the ERLE is larger, while during double-talk the ERLE is smaller. A threshold T_ERLE is then set: when the ERLE is greater than the threshold, the situation is judged as single-talk; when it is less than or equal to the threshold, it is judged as double-talk.
[0046] Step S4. Based on the detection results, classify the signal as no speech, single-talk echo speech, or double-talk echo speech, and perform echo cancellation along the corresponding path.
[0047] Step S4-1. If no voice is transmitted in the system, echo cancellation is obviously not required, so return directly;
[0048] Step S4-2. If the current speech is single-talk echo speech, the near-end speech contains only the reverberant echo of the far-end speech, and the room impulse response can be estimated from it using the room impulse response estimation model. When the near-end speaker and room layout are fixed, there is only one room impulse response. To ensure that the system holds a single room impulse response for the current environment, the first estimate is stored directly. If the system already holds a room impulse response for the current environment, the new estimate is averaged with the stored one to update it, which keeps changes smooth. Once the current room impulse response is available, the clean far-end speech is convolved with it; the result is the estimated echo-path audio. This estimated echo is subtracted from the echo-containing near-end speech to obtain the single-talk de-echoed audio, completing the echo cancellation.
[0049] Step S4-3. If the current speech is double-talk echo speech, the near-end speech contains both the echo and the speech of the near-end speaker. Since there are two sound sources and the proposed model can only estimate the room impulse response from a single source, impulse response estimation and updating are not performed in this scenario. Instead, the room impulse response estimated during single-talk is used directly for echo removal. This is feasible because, as long as the speaker and room layout do not change, the room impulse response is the same in single-talk and double-talk. Furthermore, in voice and video calls, single-talk occupies the majority of the time, usually with one person speaking and one person listening, so a room impulse response estimated during single-talk can be expected to be available. Finally, the far-end speech is convolved with the stored impulse response to obtain the estimated echo-path audio, and subtracting this estimated echo from the echo-containing near-end speech yields the de-echoed double-talk audio.
[0050] This invention addresses echo cancellation in voice and video environments. It predicts the room impulse response of the voice environment, thereby predicting the echo path and performing echo cancellation to improve voice quality in these environments. Furthermore, this method has a wide range of applications, is not limited by the environment of the voice signal, has strong interpretability, and exhibits excellent echo cancellation performance.
[0051] Those skilled in the art can make various modifications and variations to this invention. If such modifications and variations are within the scope of the claims of this invention and their equivalents, then such modifications and variations are also within the protection scope of this invention.
[0052] The contents not described in detail in the specification are prior art known to those skilled in the art.
Claims
1. An echo cancellation method based on a room impulse response estimation model, characterized in that it includes the following steps: Step S1. Build a room impulse response estimation model; Step S2. Train, evaluate, and validate the room impulse response estimation model; Step S3. Double-talk detection; Step S4. Based on the detection results, classify the signal as no speech, single-talk echo speech, or double-talk echo speech, and perform echo cancellation along the corresponding path. Step S4-1. If no voice is transmitted in the system, echo cancellation is not required, so return directly; Step S4-2. If the current speech is single-talk echo speech, the near-end speech contains only the reverberant echo of the far-end speech, and the room impulse response can be estimated from it using the room impulse response estimation model. When the near-end speaker and room layout are fixed, there is only one room impulse response. To ensure that the system holds a single room impulse response for the current environment, the first estimate is stored directly. If the system already holds a room impulse response for the current environment, the new estimate is averaged with the stored one to update it, which keeps changes smooth. Once the current room impulse response is available, the clean far-end speech is convolved with it; the result is the estimated echo-path audio. This estimated echo is subtracted from the echo-containing near-end speech to obtain the single-talk de-echoed audio, completing the echo cancellation. Step S4-3. If the current speech is double-talk echo speech, the near-end speech contains both the echo and the speech of the near-end speaker. Since there are two sound sources and the proposed model can only estimate the room impulse response from a single source, impulse response estimation and updating are not performed in this scenario. Instead, the room impulse response estimated during single-talk is used directly for echo removal. This is feasible because, as long as the speaker and room layout do not change, the room impulse response is the same in single-talk and double-talk. In voice and video calls, single-talk occupies the majority of the time, usually with one person speaking and one person listening, so a room impulse response estimated during single-talk can be expected to be available. Finally, the far-end speech is convolved with the stored impulse response to obtain the estimated echo-path audio, and subtracting this estimated echo from the echo-containing near-end speech yields the de-echoed double-talk audio.
2. The echo cancellation method based on the room impulse response estimation model according to claim 1, characterized in that, in step S1, the process of building the room impulse response estimation model is as follows: Step S1-1. Data collection and synthesis. A clean speech dataset and an impulse response dataset are required and are used to synthesize a reverberant speech dataset. Both the clean speech dataset and the impulse response dataset are open-source datasets available on the Internet. Step S1-2. Build the model, which includes an encoder layer and a decoder layer; The encoder layer consists of two residual convolutional layers. Each residual convolutional layer is composed of a two-dimensional convolutional layer, a batch normalization layer, a ReLU activation layer, a two-dimensional convolutional layer, a batch normalization layer, and a ReLU activation layer, followed by an average pooling layer and a fully connected layer. The fully connected layer outputs features, and the extracted features can characterize the basic parameters required for the room impulse response, so as to synthesize the room impulse response; The decoder layer consists of a one-dimensional deconvolution layer, a batch normalization layer, a ReLU activation layer, another one-dimensional deconvolution layer, and finally a fully connected layer, which maps the output into a complete waveform.
3. The echo cancellation method based on the room impulse response estimation model according to claim 2, characterized in that, in step S2, the process of training, evaluating, and validating the room impulse response estimation model is as follows: Step S2-1. The loss function used for training is the SDR loss function, which is calculated by taking the L2 norm of the difference between the actual impulse response and the predicted impulse response, and then dividing by the L2 norm of the actual impulse response. The unit is dB. Step S2-2. The model evaluation method uses root mean square error for evaluation. By calculating the distance between the actual room impulse response and the predicted room impulse response, the accuracy of the prediction can be obtained, thereby judging the effectiveness of the model. Step S2-3. In application, the only signals available are the clean far-end speech and the near-end speech with echo; the room impulse response is unknown and must be estimated. The application scenario is therefore simulated with a small real dataset consisting of near-end speech with echo and the corresponding known room impulse responses. The near-end speech with echo is fed into the model to obtain the estimated room impulse response, which is compared with the corresponding real room impulse response to verify its effectiveness. If the waveforms are similar, the model can be considered effective. Step S2-4. After training, evaluation, and validation, the model must reach sufficient prediction accuracy: the root mean square error must be less than 0.01 to ensure the effectiveness of subsequent echo cancellation.
4. The echo cancellation method based on the room impulse response estimation model according to claim 3, characterized in that, in step S3, the double-talk detection process is as follows: Energy detection is used to determine whether far-end and near-end speech are present. The average energy of the far-end signal and of the near-end signal is calculated; if it exceeds a threshold T_e, speech is considered present at that end. At this point it is known whether speech is present, but not whether the situation is single-talk or double-talk. This is because during far-end single-talk, the echo at the near end can cause the system to detect speech at the near end and thus misjudge the situation as double-talk. In this case, echo return loss enhancement (ERLE) is used for the decision. It is calculated by comparing the energy of the far-end signal with the energy of the near-end signal. During far-end single-talk the ERLE is larger, while during double-talk the ERLE is smaller. A threshold T_ERLE is then set: when the ERLE is greater than the threshold T_ERLE, the situation is judged as single-talk; when it is less than or equal to the threshold T_ERLE, it is judged as double-talk.