Echo cancellation system, method and device based on server-side full reference signal

By combining multi-channel audio processing and silence detection on the server side with a saturation calculation module, efficient and accurate echo cancellation is achieved, solving the problems of high computational complexity and inaccurate cancellation in existing technologies, and improving the user experience of audio conferencing.

CN122245334A · Pending · Publication Date: 2026-06-19 · XIAMEN XINGZONG DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIAMEN XINGZONG DIGITAL TECH CO LTD
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In multi-party audio conferencing systems, existing technologies that use AEC (Acoustic Echo Cancellation) have high computational complexity and are sensitive to network latency, which results in inaccurate echo cancellation, disrupts speaking fluency, and causes auditory fatigue.

Method used

On the server side, a multi-channel audio receiving module, a mixing processing module, a silence detection module, an echo cancellation processing module, and a saturation calculation module are deployed. By receiving and storing the original audio samples from the client as a precise reference signal, a global mixed audio is generated, and echo cancellation processing is performed on the server side to prevent numerical overflow and achieve accurate echo cancellation.

Benefits of technology

It reduces the complexity of echo cancellation, improves the accuracy of echo cancellation, prevents client users from hearing their own voices, reduces CPU computation and memory usage, and improves audio processing speed.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides an echo cancellation system, method, and device based on a complete reference signal on the server side, applicable to the field of audio processing technology. The system of this application uses the original audio sample from the client, stored on the server side, as a precise reference signal. When returning the output audio signal to the client, the server has already subtracted the client's corresponding reference signal from the globally mixed audio, thereby eliminating the client's voice and preventing the client user from hearing their own voice, thus achieving echo cancellation. Compared to existing technologies where echo cancellation is achieved by the client estimating the echo using AEC technology, the system of this application leverages the server's possession of a complete reference signal to achieve echo cancellation, reducing complexity while ensuring the accuracy of echo cancellation.

Description

Technical Field

[0001] This application relates to the field of audio processing technology, and in particular to an echo cancellation system, method and device based on a server-side complete reference signal.

Background Technology

[0002] In traditional multi-party audio conferencing systems, the server mixes the audio of all participants and sends it to each participant. This causes participants to hear their own voice (echo). Due to network latency, their own voice will return hundreds of milliseconds after they speak, thus interfering with the fluency of their speech. Furthermore, continuously hearing one's own voice during long meetings can lead to auditory fatigue.

[0003] In existing technologies, echo cancellation is typically achieved using AEC (Acoustic Echo Cancellation). Its main principle is as follows: using the far-end sound played from the speaker as a reference signal, an adaptive filter is used to simulate the sound wave propagation path from the speaker to the microphone. Then, based on the reference signal and the estimated propagation path, a predicted echo value is generated. The predicted echo is then subtracted from the signal acquired by the microphone. Finally, the remaining echo is further suppressed to achieve echo cancellation. However, this method has high computational complexity and is sensitive to network latency jitter. Therefore, it is prone to errors during echo cancellation, resulting in low accuracy.

Summary of the Invention

[0004] The purpose of this application is to provide an echo cancellation system, method, and device based on a server-side complete reference signal, so as to reduce the complexity of echo cancellation and improve the accuracy of echo cancellation. The specific technical solution is as follows:

In a first aspect of this application, an echo cancellation system based on a server-side complete reference signal is provided for use in multi-party audio conferencing. The system is deployed on the server side and includes:
A multi-channel audio receiving module, used to simultaneously receive audio streams from multiple clients and allocate an independent reference signal buffer for each client to store the client's original audio samples, thereby forming an accurate reference signal for each participant on the server side;
A mixing processing module, connected to the multi-channel audio receiving module and used to accumulate the audio samples of all clients sample by sample to generate a global mixed audio and store it in the mixing buffer;
A silence detection module, connected to the multi-channel audio receiving module and used to detect the silence status of each client in real time;
An echo cancellation processing module, connected to the reference signal buffer, the mixing buffer, and the silence detection module, respectively. For clients detected as non-silent, the module subtracts the client's reference signal from the global mixed audio to obtain the audio signal to be output; for clients detected as silent, the module directly uses the global mixed audio as the audio signal to be output;
A saturation operation module, connected to the echo cancellation processing module and used to perform numerical boundary processing on the audio signal to be output, prevent overflow in 16-bit signed integer operations, and generate the output audio signal;
An output distribution module, connected to the saturation operation module and used to send the output audio signals to the corresponding clients respectively.

[0005] In one possible implementation, the saturation operation module is specifically used for: when the sample value of the audio signal to be output is greater than 32767, the upper limit of a 16-bit signed integer, clamping the output signal to 32767; when the sample value of the audio signal to be output is less than the lower limit of -32768, clamping the output signal to -32768; otherwise, keeping the original value unchanged.

[0006] In one possible implementation, the system supports multi-channel audio processing: The multi-channel audio receiving module establishes an independent reference signal buffer for each channel of each client. The mixing module accumulates and generates independent mixed audio for each channel separately; The echo cancellation processing module performs an independent subtraction operation for each channel; The output distribution module merges the output audio signals of each channel and sends them to the corresponding client.

[0007] In one possible implementation, the system further includes a memory pool management module for: creating a shared mixing buffer to store the global mixed audio; providing clients marked as silent by the silence detection module with a reference to the shared mixing buffer to achieve zero-copy output; and allocating a separate output buffer for each non-silent client to store the output audio signal processed by the echo cancellation processing module and the saturation operation module.

[0008] In one possible implementation, the silence detection module is specifically used for: calculating the audio energy value of each client during the mixing cycle; comparing the energy value with a preset silence threshold; and, if the energy value is lower than the silence threshold, determining that the client is in a silent state; wherein the silence threshold is dynamically adjusted according to the background noise level.

[0009] In one possible implementation, the echo cancellation processing module employs a SIMD instruction set to perform subtraction operations on multiple audio samples in parallel, specifically including: Audio samples in the reference signal buffer and the mixing buffer are stored in a 16-byte aligned manner to meet the memory alignment requirements of SIMD instructions; The _mm256_load_si256 instruction in the AVX2 instruction set is invoked to load 16 16-bit integer samples from an aligned memory address; The _mm256_subs_epi16 instruction is invoked to perform a saturation subtraction operation on the 16 loaded pairs of samples; The _mm256_store_si256 instruction is called to store the calculation result into the aligned output buffer.
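The lane-wise behaviour described above can be modelled in plain Python. This is only a scalar sketch of what a 16-lane saturating subtraction computes, not SIMD code and not the applicant's implementation; the helper names `subs_epi16_model` and `subtract_frame` are hypothetical. It assumes both buffers already hold values within the signed 16-bit range, as the intrinsic operates on 16-bit lanes. Note also that the 256-bit aligned load and store intrinsics are documented as requiring 32-byte-aligned addresses, so buffers aligned only to 16 bytes would call for the unaligned variants such as `_mm256_loadu_si256`.

```python
LANES = 16          # a 256-bit register holds 16 signed 16-bit samples
S_MAX = 32767
S_MIN = -32768


def subs_epi16_model(a, b):
    """Scalar model of a 16-lane saturating subtraction (a - b per lane)."""
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly 16 samples")
    return [max(S_MIN, min(S_MAX, x - y)) for x, y in zip(a, b)]


def subtract_frame(mixed, reference):
    """Process one mixing cycle in chunks of 16 samples."""
    if len(mixed) != len(reference) or len(mixed) % LANES != 0:
        raise ValueError("buffers must have equal length, a multiple of 16")
    out = []
    for start in range(0, len(mixed), LANES):
        out.extend(subs_epi16_model(mixed[start:start + LANES],
                                    reference[start:start + LANES]))
    return out


# A 960-sample cycle takes 960 / 16 = 60 chunk iterations.
frame = subtract_frame([32767] * 960, [-1] * 960)
print(len(frame), frame[0])
```

In this model, 32767 - (-1) saturates to 32767 instead of wrapping around, which is the property the saturating instruction is chosen for.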

[0010] In one possible implementation, the audio stream is a 48kHz sampling rate, 16-bit depth PCM format, with each mixing cycle lasting 20 milliseconds and containing 960 audio samples.

[0011] In a second aspect of this application, an echo cancellation method based on a server-side complete reference signal is provided, applied to a multi-party audio conference, the method comprising:
Simultaneously receiving audio streams from multiple clients, and allocating an independent reference signal buffer for each client to store the client's original audio samples, thereby forming a precise reference signal for each participant on the server side;
Accumulating the audio samples of all clients sample by sample to generate a global mixed audio, and storing it in the mixing buffer;
Detecting the silence status of each client in real time;
For clients detected as non-silent, subtracting the client's reference signal from the global mixed audio to obtain the audio signal to be output; for clients detected as silent, directly using the global mixed audio as the audio signal to be output;
Performing numerical boundary processing on the audio signal to be output to prevent overflow in 16-bit signed integer arithmetic, and generating an output audio signal;
Sending the output audio signals to the corresponding clients respectively.

[0012] In one possible implementation, when the sample value of the audio signal to be output is greater than 32767, the upper limit of a 16-bit signed integer, the output signal is clamped to 32767; when the sample value of the audio signal to be output is less than the lower limit of -32768, the output signal is clamped to -32768; otherwise, the original value is kept unchanged.

[0013] In one possible implementation, the audio stream is a multi-channel audio stream, in which case:
The step of simultaneously receiving audio streams from multiple clients and allocating an independent reference signal buffer for each client includes: establishing an independent reference signal buffer for each channel of each client;
The step of accumulating audio samples from all clients sample by sample to generate a global mixed audio includes: accumulating the audio samples of each channel separately to generate independent mixed audio for each channel;
The step of subtracting the client's reference signal from the global mixed audio to obtain the audio signal to be output includes: for each channel, subtracting the client's reference signal from the global mixed audio of that channel to obtain the audio signal to be output;
The step of sending the output audio signals to the corresponding clients includes: combining the output audio signals of each channel and sending them to the corresponding client.

[0014] In one possible implementation, the method further includes: creating a shared mixing buffer to store the global mixed audio; providing clients marked as silent with a reference to the shared mixing buffer to achieve zero-copy output; and allocating a separate output buffer for each non-silent client to store the output audio signal obtained after echo cancellation and saturation processing.

[0015] In one possible implementation, the real-time detection of the silence status of each client includes: calculating the audio energy value of each client during the mixing cycle; comparing the energy value with a preset silence threshold; and, if the energy value is lower than the silence threshold, determining that the client is in a silent state; wherein the silence threshold is dynamically adjusted according to the background noise level.

[0016] In one possible implementation, the audio samples in the reference signal buffer and the mixing buffer are stored in a 16-byte aligned manner to meet the memory alignment requirements of SIMD instructions; the step of subtracting the client's reference signal from the global mixed audio to obtain the output audio signal includes: The _mm256_load_si256 instruction in the AVX2 instruction set is invoked to load 16 16-bit integer samples from an aligned memory address; The _mm256_subs_epi16 instruction is invoked to perform a saturation subtraction operation on the 16 loaded pairs of samples; The _mm256_store_si256 instruction is called to store the calculation result into the aligned output buffer.

[0017] In one possible implementation, the audio stream is a 48kHz sampling rate, 16-bit depth PCM format, with each mixing cycle lasting 20 milliseconds and containing 960 audio samples.

[0018] In a third aspect of this application, an electronic device is provided, comprising: a memory, used to store a computer program; and a processor which, when executing the program stored in the memory, implements the method described in the second aspect of the embodiments of this application.

[0019] In a fourth aspect of the present application, a computer-readable storage medium is provided, wherein a computer program is stored therein, and the computer program, when executed by a processor, implements the method described in the second aspect of the present application.

[0020] Compared with the prior art, the embodiments of this application have at least the following technical effects: This application provides an echo cancellation system, method, and device based on a server-side complete reference signal. The system, deployed on a server side, includes: a multi-channel audio receiving module for simultaneously receiving audio streams from multiple clients and allocating an independent reference signal buffer for each client to store its original audio samples, thereby forming a precise reference signal for each participant on the server side; a mixing processing module connected to the multi-channel audio receiving module for sample-by-sample accumulation of audio samples from all clients to generate a globally mixed audio stream, which is then stored in the mixing buffer; and a silence detection module connected to the multi-channel audio receiving module for real-time detection of the silence status of each client. The echo cancellation processing module, connected to the reference signal buffer, the mixing buffer, and the silence detection module, is used to subtract the reference signal of a client detected as non-silent from the global mixed audio to obtain the audio signal to be output; for a client detected as silent, the global mixed audio is used directly as the audio signal to be output; the saturation operation module, connected to the echo cancellation processing module, is used to perform numerical boundary processing on the audio signal to be output to prevent overflow in 16-bit signed integer operations and generate the output audio signal; the output distribution module, connected to the saturation operation module, is used to send the output audio signal to the corresponding client.

[0021] The system applying this application's embodiments uses the server to store the client's original audio samples as a precise reference signal. When returning the output audio signal to the client, the server has already subtracted the client's corresponding reference signal from the globally mixed audio, thereby eliminating the client's voice and preventing the client user from hearing their own voice, thus achieving echo cancellation. Compared to the prior art where the client estimates and cancels echoes using AEC technology, the system of this application's embodiments leverages the server's advantage of possessing a complete reference signal to achieve echo cancellation, reducing complexity while ensuring the accuracy of echo cancellation.

[0022] Of course, implementing any method of the embodiments of this application does not necessarily require achieving all of the advantages described above at the same time.

Attached Figure Description

[0023] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings.

[0024] Figure 1 is a schematic diagram of an echo cancellation system based on a server-side complete reference signal provided in an embodiment of this application;
Figure 2 is a schematic diagram illustrating a saturation operation provided in an embodiment of this application;
Figure 3 is a schematic diagram illustrating echo cancellation in the system provided in an embodiment of this application;
Figure 4 is a flowchart illustrating an echo cancellation method based on a server-side complete reference signal provided in an embodiment of this application;
Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0025] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art based on this application are within the scope of protection of this application.

[0026] In traditional multi-party audio conferencing systems, the server mixes the audio of all participants and sends it to each participant. This causes participants to hear their own voice (echo), resulting in the following problems:
1. Auditory interference: participants hearing their own voices can disrupt their speech fluency;
2. Delayed echo: due to network latency, a participant's voice returns several hundred milliseconds after they speak;
3. Acoustic echo: if a loudspeaker is used, it will produce an acoustic echo;
4. Auditory fatigue: continuously hearing one's own voice during long meetings can lead to auditory fatigue.

[0027] The inventors discovered that in existing technologies, echo cancellation is usually performed on the client side using AEC technology or Subband Acoustic Echo Cancellation (Subband AEC). However, AEC technology has high computational complexity and is sensitive to network latency jitter. Although Subband AEC is more efficient in frequency domain processing, it still requires client-side implementation.

[0028] To address at least one of the aforementioned problems, in a first aspect, this application provides an echo cancellation system based on a server-side complete reference signal, applied to multi-party audio conferencing. The system is deployed on a server side, which can be a computer, distributed device, mobile phone, or other terminal device. As shown in Figure 1, the system includes: The multi-channel audio receiving module 101, which is used to simultaneously receive audio streams from multiple clients and allocate an independent reference signal buffer for each client to store the client's original audio samples, thereby forming a precise reference signal for each participant on the server side.

[0029] In this context, "multiple clients" refers to the client devices used by each participant in a multi-party audio conference. These clients can be deployed on electronic devices such as mobile phones, computers, and tablets. A multi-party audio conference is a meeting that allows multiple participants to communicate via audio. Examples include initiating a multi-person voice chat and establishing an online meeting through relevant conferencing software.

[0030] In practical applications, when processing the audio stream received from the client, it can be done by receiving audio streams sent from various clients within a mixing cycle. The audio stream is audio data transmitted as a continuous byte stream. To facilitate audio data processing, upon receiving the audio stream, it can be stored in PCM (Pulse Code Modulation) format with a 48kHz sampling rate and 16-bit depth.

[0031] A 48kHz sampling rate allows frequency components up to 24kHz to be reproduced accurately, fully covering the range of human hearing. A 16-bit depth means that each sample's amplitude is represented by 16 binary digits, distinguishing 65,536 different amplitude levels, which ensures that the processed audio data meets everyday listening needs. Storing the data in uncompressed PCM format avoids any further loss after the analog signal has been digitized, preserving the original form of the audio stream data.

[0032] A mixing cycle refers to the size and latency of the audio buffer during periodic audio processing. In practical applications, after audio input, N audio samples accumulate in the buffer. The engine then processes these N audio samples all at once before outputting them to the sound card. This process is called a mixing cycle. The mixing cycle can be set according to actual needs, achieving a balance between latency and system stability. In one example, the mixing cycle latency can be 2-5 milliseconds, and the audio buffer contains 128-256 audio samples. In another example, the mixing cycle can be set to 20 milliseconds, containing 960 audio samples.
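The relationship between sampling rate, cycle length, and buffer size in the two examples above is simple arithmetic, sketched below. The function names are hypothetical and introduced only for illustration.

```python
def samples_per_cycle(sample_rate_hz, cycle_length_ms):
    """Samples per channel that accumulate during one mixing cycle."""
    return round(sample_rate_hz * cycle_length_ms / 1000)


def cycle_duration_ms(sample_rate_hz, n_samples):
    """Duration in milliseconds covered by n_samples at the given rate."""
    return 1000 * n_samples / sample_rate_hz


# 20 ms at 48 kHz gives the 960-sample cycle used in this application.
print(samples_per_cycle(48_000, 20))

# Buffers of 128 and 256 samples at 48 kHz last about 2.67 ms and 5.33 ms.
print(round(cycle_duration_ms(48_000, 128), 2),
      round(cycle_duration_ms(48_000, 256), 2))
```

A shorter cycle lowers latency but leaves less time to process each buffer, which is the latency versus stability trade-off mentioned above.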

[0033] After receiving the audio stream from the client, it is stored in a reference buffer as a reference signal for subsequent echo cancellation. In this embodiment, the server can allocate an independent reference signal buffer (our_buf[]) for each client to store the client's original audio samples (i.e., audio streams stored in 48kHz, 16-bit PCM format), making them a precise reference signal for each participant. In one example, for the i-th (i=1,2,...,N, where N represents the number of participants in the conference) client channel, its corresponding reference signal buffer contains a reference signal R_i[n], where n is the sample index (n=0,1,...,S-1), and S is the number of samples in the mixing cycle.

[0034] The mixing processing module 102 is connected to the multi-channel audio receiving module and is used to accumulate the audio samples of all clients sample by sample to generate a global mixed audio and store it in the mixing buffer.

[0035] To enable participants to hear each other's voices, the audio streams received from each client need to be mixed to generate a global mixed audio M[n]. In one example, the global mixed audio can be calculated using the following formula: $M[n] = \sum_{i=1}^{N} R_i[n]$; where M[n] represents the global mixed audio, and R_i[n] represents the audio sample of the i-th client (i.e. the reference signal of the i-th participant).

[0036] In practical applications, after generating the global mixed audio, the global mixed audio can also be stored in the mixing buffer (mixed_audio[]) so that the global mixed audio can be quickly obtained when performing echo cancellation later.
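A minimal sketch of the sample-by-sample accumulation follows. It assumes every client delivers the same number of samples per mixing cycle and that the sum is held in an integer wider than 16 bits (Python integers are unbounded, so this holds automatically here); the function name `mix` is hypothetical.

```python
def mix(references):
    """Return M[n], the per-sample sum of all client reference signals."""
    if not references:
        return []
    length = len(references[0])
    if any(len(ref) != length for ref in references):
        raise ValueError("all clients must supply the same number of samples")
    # zip(*references) yields, for each n, the tuple (R_1[n], ..., R_N[n]).
    return [sum(column) for column in zip(*references)]


r_a = [1, 2, 3]
r_b = [10, 20, 30]
r_c = [-5, 0, 5]
mixed_audio = mix([r_a, r_b, r_c])
print(mixed_audio)  # [6, 22, 38]
```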

[0037] The silence detection module 103 is connected to the multi-channel audio receiving module and is used to detect the silence status of each client in real time.

[0038] In one possible implementation, the client's silent state can be detected through the following steps:
Step 1: Calculate the audio energy value of each client during the mixing cycle;
Step 2: Compare the audio energy value with the preset silence threshold;
Step 3: If the audio energy value is lower than the silence threshold, the client is determined to be in a silent state.

[0039] In one example, for a specific client, the audio stream of that client during the mixing cycle is captured, converted to PCM format, and its corresponding RMS (Root Mean Square) is calculated using the following formula: $\mathrm{RMS} = \sqrt{\frac{1}{M} \sum_{k=1}^{M} x_k^2}$; where M represents the number of audio samples in the mixing cycle, and $x_k$ represents the k-th audio sample (with the audio stream normalized to [-1.0, 1.0]).

[0040] After calculating the RMS for this mixing cycle, it is converted into an audio energy value in dBFS (Decibels Full Scale) using the following formula: $\mathrm{RMS}_{\mathrm{dBFS}} = 20 \times \log_{10}(\mathrm{RMS})$; where $\mathrm{RMS}_{\mathrm{dBFS}}$ represents the audio energy value, i.e. the RMS expressed in dBFS.
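The two steps above, RMS over one mixing cycle and conversion to dBFS, can be sketched as follows. Dividing 16-bit samples by 32768 to normalize them and returning negative infinity for an all-zero frame are assumptions of this sketch, not details stated in the application; the function names are hypothetical.

```python
import math


def normalize_pcm16(samples):
    """Map signed 16-bit samples to floats in [-1.0, 1.0) (assumed scaling)."""
    return [s / 32768 for s in samples]


def rms(samples):
    """Root mean square of normalized samples for one mixing cycle."""
    if not samples:
        return 0.0
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def to_dbfs(rms_value):
    """20 * log10(RMS); an all-zero frame maps to -inf (assumed convention)."""
    if rms_value <= 0:
        return float("-inf")
    return 20 * math.log10(rms_value)


frame = normalize_pcm16([16384, -16384, 16384, -16384])
level = to_dbfs(rms(frame))
print(rms(frame), round(level, 2))  # 0.5 -6.02
```

A full-scale signal with RMS 1.0 sits at 0 dBFS, and each halving of the RMS lowers the level by about 6 dB.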

[0041] The silence threshold is dynamically adjusted based on the background noise level. In one example, the silence threshold can be calculated using the following formula: $T = N_0 + \Delta$; where T represents the silence threshold; $N_0$ represents the estimated noise floor (in dBFS) of the environment in which the client is currently located, which is updated continuously based on the RMS values of silent frames; and Δ represents a safety offset, which is used to avoid misjudging faint sounds from the user (such as breathing or soft speech). In one example, Δ can be set in the range of [+6dB, +12dB].

[0042] The noise floor $N_0$ can be updated using the following formula: $N_{0,\mathrm{new}} = \alpha \cdot \mathrm{RMS}_{\mathrm{current}} + (1 - \alpha) \cdot N_{0,\mathrm{old}}$; where $N_{0,\mathrm{new}}$ represents $N_0$ after the update; $N_{0,\mathrm{old}}$ represents $N_0$ before the update; $\mathrm{RMS}_{\mathrm{current}}$ represents the RMS value of the current audio frame; and α represents the learning rate, which can be increased automatically when the noise changes suddenly. In one example, α can be set in the range of [0.05, 0.1].
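The threshold and noise-floor formulas above can be combined into a small detector, sketched below under several assumptions that the application does not spell out: frame levels are supplied in dBFS, the noise floor is updated only on frames classified as silent, α is kept fixed rather than adapted, and levels are floored at -96 dBFS so that a digitally silent frame cannot drive the estimate to negative infinity. The class name and the initial noise floor of -60 dBFS are hypothetical.

```python
MIN_LEVEL_DBFS = -96.0  # assumed lower bound for frame levels


class AdaptiveSilenceDetector:
    """Tracks the noise floor N0 and applies the threshold T = N0 + delta."""

    def __init__(self, noise_floor_dbfs=-60.0, alpha=0.05, delta_db=6.0):
        self.noise_floor_dbfs = noise_floor_dbfs
        self.alpha = alpha
        self.delta_db = delta_db

    @property
    def threshold_dbfs(self):
        return self.noise_floor_dbfs + self.delta_db

    def is_silent(self, frame_dbfs):
        level = max(frame_dbfs, MIN_LEVEL_DBFS)
        silent = level < self.threshold_dbfs
        if silent:
            # N0_new = alpha * RMS_current + (1 - alpha) * N0_old
            self.noise_floor_dbfs = (self.alpha * level
                                     + (1 - self.alpha) * self.noise_floor_dbfs)
        return silent


detector = AdaptiveSilenceDetector(noise_floor_dbfs=-60.0, alpha=0.1, delta_db=6.0)
quiet = detector.is_silent(-58.0)   # below T = -54 dBFS, so the floor adapts
speech = detector.is_silent(-20.0)  # above T, so the floor is left unchanged
print(quiet, speech, round(detector.noise_floor_dbfs, 1))
```

With these inputs the floor moves from -60 dBFS to -59.8 dBFS after the quiet frame, so the threshold follows slow changes in background noise without reacting to speech.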

[0043] In another possible implementation, users can mute the client by turning off their microphone or disabling microphone access for the client app during a meeting. In this case, the client can send a mute request to the server. Upon receiving the mute request, the silence detection module marks the client as being in a silent state.

[0044] The echo cancellation processing module 104 is connected to the reference signal buffer, the mixing buffer, and the silence detection module, respectively. For clients detected as non-silent, it subtracts the client's reference signal from the global mixed audio to obtain the audio signal to be output; for clients detected as silent, it directly uses the global mixed audio as the audio signal to be output.

[0045] Since the global mixed audio includes the reference signals of every client, to prevent client users from receiving their own voices and perceiving an echo, this embodiment subtracts the client's reference signal from the global mixed audio to obtain the audio signal to be output. In one example, the audio signal to be output can be calculated using the following formula: $O_i[n] = M[n] - R_i[n] = \sum_{j \neq i} R_j[n]$; where O_i[n] represents the audio signal to be output for the i-th participant, j indexes the participants other than the i-th participant, and R_j[n] represents the reference signal corresponding to the j-th participant.

[0046] For silent clients, since the participants themselves are not speaking, the global mixed audio can be directly used as the output audio signal, thereby reducing the computational load on the CPU (Central Processing Unit) and improving processing speed.
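The identity in the formula above, that subtracting a client's own reference from the global mix equals summing everyone else's signals, can be checked with a short sketch. The function names are hypothetical, and saturation is left out here so that only the subtraction itself is shown.

```python
def output_by_subtraction(mixed, reference):
    """O_i[n] = M[n] - R_i[n]: one subtraction per sample."""
    return [m - r for m, r in zip(mixed, reference)]


def output_by_resumming(references, i):
    """O_i[n] as the sum of R_j[n] over all j != i, recomputed from scratch."""
    return [sum(s for j, s in enumerate(column) if j != i)
            for column in zip(*references)]


references = [[100, 200], [10, 20], [1, 2]]  # participants A, B, C
mixed = [sum(column) for column in zip(*references)]

out_a = output_by_subtraction(mixed, references[0])
print(out_a)                                     # [11, 22]
print(out_a == output_by_resumming(references, 0))  # True
```

Mixing once and subtracting once per client costs on the order of N·S operations for N clients and S samples, whereas building a separate mix for every client costs on the order of N²·S, which is where the reduced server load comes from.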

[0047] In one possible implementation, the system of this application embodiment may further include a memory pool management module, used for: creating a shared mixing buffer to store the global mixed audio; providing clients marked as silent by the silence detection module with a reference to the shared mixing buffer to achieve zero-copy output; and allocating a separate output buffer for each non-silent client to store the output audio signal processed by the echo cancellation processing module and the saturation operation module.

[0048] By creating a shared mixing buffer to store the global mixed audio, the system of the embodiments of this application avoids copying the global mixed audio into the final output buffer (final_buf) of each channel of each client, thereby reducing the amount of CPU computation and improving the audio processing speed.
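The buffer-sharing idea can be illustrated with object references in Python: silent clients receive the very same buffer object, while non-silent clients get a freshly allocated one. This is only an analogy for the memory pool described above, with hypothetical names; the shared buffer is a tuple here so that no consumer can modify it, which reflects the assumption that a shared buffer must be treated as read-only.

```python
def build_outputs(shared_mix, references, silent_flags):
    """Map each client id to its output buffer for one mixing cycle."""
    outputs = {}
    for client_id, reference in references.items():
        if silent_flags[client_id]:
            # Zero-copy path: hand out a reference to the shared buffer.
            outputs[client_id] = shared_mix
        else:
            # Separate buffer holding the mix minus the client's own signal.
            outputs[client_id] = [m - r for m, r in zip(shared_mix, reference)]
    return outputs


references = {"A": [100, 200], "B": [10, 20], "C": [0, 0]}
silent_flags = {"A": False, "B": False, "C": True}
shared_mix = tuple(sum(column) for column in zip(*references.values()))

outputs = build_outputs(shared_mix, references, silent_flags)
print(outputs["A"], outputs["B"])
print(outputs["C"] is shared_mix)  # True: no copy was made for the silent client
```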

[0049] The saturation operation module 105 is connected to the echo cancellation processing module and is used to perform numerical boundary processing on the to-be-output audio signal to prevent overflow in 16-bit signed integer operations and generate the output audio signal.

[0050] In order to prevent distortion when the audio signal is output, it is necessary to perform numerical boundary processing on the to-be-output audio signal so that it is within the numerical range that the system can process. In a possible implementation, when the sample value of the to-be-output audio signal is greater than the upper limit value 32767 of the 16-bit signed integer, the output signal is clamped to 32767; when the sample value of the to-be-output audio signal is less than the lower limit value -32768, the output signal is clamped to -32768; otherwise, the original value remains unchanged.

[0051] Among them, the sample value represents the specific value of each discrete sampling point in the to-be-output audio signal. In one example, numerical overflow can be prevented by the following method: If M[n] - R_i[n] > S_MAX, then O_i[n] = S_MAX; If M[n] - R_i[n] < S_MIN, then O_i[n] = S_MIN; Otherwise, O_i[n] = M[n] - R_i[n].

[0052] Among them, S_MAX represents the maximum value of the audio sample, that is, 32767; S_MIN represents the minimum value of the audio sample, that is, -32768.
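The three-way rule above can be sketched directly. The sketch assumes the difference M[n] - R_i[n] is computed in an integer wider than 16 bits before clamping (automatic in Python); the function names are hypothetical, while S_MAX and S_MIN follow the definitions given above.

```python
S_MAX = 32767
S_MIN = -32768


def saturate(value):
    """Clamp a sample to the signed 16-bit range."""
    if value > S_MAX:
        return S_MAX
    if value < S_MIN:
        return S_MIN
    return value


def saturated_output(mixed, reference):
    """O_i[n] = saturate(M[n] - R_i[n]) for every sample n."""
    return [saturate(m - r) for m, r in zip(mixed, reference)]


result = saturated_output([60000, -70000, 100], [10000, -10000, 50])
print(result)  # [32767, -32768, 50]
```

Without clamping, a value such as 50000 stored into a 16-bit signed integer would wrap around to a negative number, which is heard as a loud click rather than as mild clipping.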

[0053] In another example, as shown in Figure 2, the to-be-output audio signal is calculated first, and it is then judged whether the to-be-output audio signal is within the range of [-32768, 32767]. If it is greater than 32767, it is clamped to 32767; if it is less than -32768, it is clamped to -32768. Otherwise (that is, within the range of [-32768, 32767]), its original value remains unchanged. Finally, the output audio signal is generated and sent to the participant.

[0054] With the system of the embodiments of the present application, clamping the output audio signal to the range of [-32768, 32767] prevents the distortion that data overflow would otherwise cause. Compared with a floating-point anti-overflow scheme, the anti-overflow mechanism of the present application reduces CPU and memory usage; compared with a scheme that ignores overflow, it produces less distortion.

[0055] The output distribution module 106 is connected to the saturation operation module and is used to send the output audio signal to the corresponding client respectively.

[0056] In one example, Figure 3 is a schematic diagram of echo cancellation performed by the system according to an embodiment of this application. First, the clients corresponding to participants A, B, and C each input their own audio to the server's audio receiving module. The server buffers each reference signal in 48kHz PCM format, with 960 samples per mixing cycle. Then, the mixing processing module calculates the global mixed audio, and the echo cancellation processing module calculates the to-be-output audio signal. The saturation operation module clamps the to-be-output audio signal to the preset value range to generate the output audio signal. Finally, the audio sending module sends each output audio signal to its respective client. For participant A, the received audio signal includes the voices of participants B and C, but not participant A's own voice.

[0057] The system applying this application's embodiments uses the server to store the client's original audio samples as a precise reference signal. When returning the output audio signal to the client, the server has already subtracted the client's corresponding reference signal from the globally mixed audio, thereby eliminating the client's voice and preventing the client user from hearing their own voice, thus achieving echo cancellation. Compared to the prior art where the client estimates and cancels echoes using AEC technology, the system of this application's embodiments leverages the server's advantage of possessing a complete reference signal to achieve echo cancellation, reducing complexity while ensuring the accuracy of echo cancellation.

[0058] In one possible implementation, the system of this application embodiment supports multi-channel audio processing, then: The multi-channel audio receiving module can establish an independent reference signal buffer for each channel of each client; The mixing module can accumulate and generate independent mixed audio for each channel separately; The echo cancellation processing module can perform independent subtraction operations for each channel; The output distribution module can merge the output audio signals of each channel and send them to the corresponding client.

[0059] In one example, when storing the reference signal for the client, a left channel reference signal buffer can be created for the left channel to store the reference signal R_L_i[n] (the left channel reference signal of the i-th client); and a right channel reference signal buffer can be created for the right channel to store the reference signal R_R_i[n] (the right channel reference signal of the i-th client).

[0060] Similarly, when performing global audio mixing calculations, for the left channel, the audio samples corresponding to all clients in the left channel can be accumulated one by one to generate the global audio mixing corresponding to the left channel; for the right channel, the audio samples corresponding to all clients in the right channel can be accumulated one by one to generate the global audio mixing corresponding to the right channel.

[0061] That is: M_L[n] = Σ(i=1 to N) R_L_i[n]; M_R[n] = Σ(i=1 to N) R_R_i[n]; where M_L[n] represents the global mixed audio corresponding to the left channel, and M_R[n] represents the global mixed audio corresponding to the right channel.

[0062] Accordingly, during echo cancellation, for the left channel, the client's left channel reference signal can be subtracted from the global mixed audio corresponding to the left channel to obtain the audio signal to be output for the left channel; for the right channel, the client's right channel reference signal can be subtracted from the global mixed audio corresponding to the right channel to obtain the audio signal to be output for the right channel. That is: O_L_i[n]=M_L[n]-R_L_i[n]=Σ(j≠i)R_L_j[n]; O_R_i[n]=M_R[n]-R_R_i[n]=Σ(j≠i)R_R_j[n]; Where O_L_i[n] represents the audio signal to be output corresponding to the left channel; O_R_i[n] represents the audio signal to be output corresponding to the right channel.

[0063] For the i-th client, merge O_L_i[n] and O_R_i[n] and send them to the i-th client.
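The per-channel formulas above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `process_stereo` is hypothetical, "merging" is assumed to mean interleaving left and right samples (the application does not specify the merged layout), and the saturation step is omitted here to keep the focus on the per-channel arithmetic.

```python
def process_stereo(left_refs, right_refs):
    """left_refs[i] / right_refs[i] are the R_L_i and R_R_i frames of client i."""
    # M_L[n] and M_R[n]: one global mix per channel.
    m_l = [sum(col) for col in zip(*left_refs)]
    m_r = [sum(col) for col in zip(*right_refs)]

    outputs = []
    for r_l, r_r in zip(left_refs, right_refs):
        # O_L_i[n] = M_L[n] - R_L_i[n];  O_R_i[n] = M_R[n] - R_R_i[n]
        o_l = [m - r for m, r in zip(m_l, r_l)]
        o_r = [m - r for m, r in zip(m_r, r_r)]
        # Assumed merge step: interleave as L0, R0, L1, R1, ...
        outputs.append([s for pair in zip(o_l, o_r) for s in pair])
    return outputs
```

Each entry of the returned list is the merged stream for one client, containing only the other clients' audio in each channel.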

[0064] Compared with mixing the audio signals of all channels together, the system of the embodiments of this application performs audio processing and echo cancellation on each channel separately, which reduces distortion of the output audio signal and improves its fidelity.

[0065] In one possible implementation, the echo cancellation processing module of the system of this application uses the SIMD (Single Instruction Multiple Data) instruction set to perform subtraction on multiple audio samples in parallel, specifically including: audio samples in the reference signal buffer and the mixing buffer are stored in a 16-byte aligned manner to meet the memory alignment requirements of SIMD instructions; the _mm256_load_si256 instruction of the AVX2 (Advanced Vector Extensions 2) instruction set, which loads integer data from memory into an AVX2 register, is invoked to load 16 16-bit integer samples from an aligned memory address; the _mm256_subs_epi16 instruction, an AVX2 parallel vector instruction that performs saturated subtraction on 16-bit signed integers, is invoked to perform saturated subtraction on the 16 loaded pairs of samples; and the _mm256_store_si256 instruction, an AVX2 store instruction that writes 256-bit integer data from a register back to 32-byte aligned memory, is invoked to store the result in the aligned output buffer.

[0066] The system of the embodiments of this application employs the SIMD instruction set for parallel processing, which improves audio processing speed compared with processing each sample serially. In one example, for 960 audio samples, serial processing requires 960 loop iterations, while parallel processing with the SIMD instruction set, at 16 samples per iteration, requires only 60, thereby improving the system's operating performance.
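The 16-samples-per-iteration structure can be modelled in plain Python. This sketch does not execute AVX2 instructions; the hypothetical helper `subs_epi16` only mimics, lane by lane, the saturated 16-bit subtraction that `_mm256_subs_epi16` computes, and inputs are assumed to already lie in the 16-bit signed range. In a real C or C++ implementation, note that `_mm256_load_si256` and `_mm256_store_si256` are documented as requiring 32-byte-aligned addresses.

```python
LANES = 16  # a 256-bit register holds sixteen 16-bit samples


def subs_epi16(a, b):
    """Model of lane-wise saturated subtraction on 16-bit signed values."""
    return [max(-32768, min(32767, x - y)) for x, y in zip(a, b)]


def subtract_frame(mixed, ref):
    """Process a frame in 16-sample chunks; return (output, iteration count)."""
    if len(mixed) != len(ref) or len(mixed) % LANES != 0:
        raise ValueError("frames must match and be a multiple of 16 samples")
    out = []
    iterations = 0
    for i in range(0, len(mixed), LANES):
        # One iteration stands in for one load / subtract / store sequence.
        out.extend(subs_epi16(mixed[i:i + LANES], ref[i:i + LANES]))
        iterations += 1
    return out, iterations
```

For a 960-sample frame the loop body runs 60 times, matching the count given above.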

[0067] In a second aspect of this application, an echo cancellation method based on a server-side complete reference signal is provided, applied to multi-party audio conferencing. As shown in Figure 4, the method comprises the following steps: Step S401: Simultaneously receive audio streams from multiple clients and allocate an independent reference signal buffer for each client to store the client's original audio samples, thereby forming a precise reference signal for each participant on the server side.

[0068] Step S402: Accumulate the audio samples from all clients sample by sample to generate a global mixed audio and store it in the mixing buffer.

[0069] Step S403: Detect the silence status of each client in real time.

[0070] Step S404: For clients detected as not silent, subtract the client's reference signal from the global mixed audio to obtain the audio signal to be output; for clients detected as silent, directly use the global mixed audio as the audio signal to be output.

[0071] Step S405: Perform numerical boundary processing on the audio signal to be output to prevent overflow in 16-bit signed integer arithmetic and generate the output audio signal.

[0072] Step S406: Send the output audio signals to the corresponding clients respectively.
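Steps S401 to S406 can be sketched as one function per mixing cycle. This is an illustrative outline only: the name `process_cycle` is hypothetical, the silence flags from step S403 are assumed to be computed elsewhere and passed in, the mix is assumed to be held in integers wider than 16 bits, and the actual network send of step S406 is left to the caller.

```python
def process_cycle(frames, silent):
    """frames: client id -> reference frame (S401); silent: client id -> bool (S403)."""
    # S402: accumulate all clients sample by sample into the global mix.
    mixed = [sum(col) for col in zip(*frames.values())]

    outputs = {}
    for client, ref in frames.items():
        if silent[client]:
            # S404 (silent client): use the global mix directly.
            pending = mixed
        else:
            # S404 (non-silent client): remove the client's own signal.
            pending = [m - r for m, r in zip(mixed, ref)]
        # S405: clamp to the 16-bit signed range.
        outputs[client] = [max(-32768, min(32767, s)) for s in pending]
    # S406: the caller sends outputs[client] to each client.
    return outputs
```

The dictionary returned for one cycle holds, per client, the frame that would be sent back in step S406.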

[0073] In one possible implementation, when the sample value of the audio signal to be output is greater than the upper limit of 32767 for a 16-bit signed integer, the output signal is clamped to 32767. When the sample value of the audio signal to be output is less than the lower limit of -32768, the output signal will be clamped to -32768. Otherwise, keep the original value unchanged.

[0074] In one possible implementation, the audio stream is a multi-channel audio stream, in which case: the step of simultaneously receiving audio streams from multiple clients and allocating an independent reference signal buffer for each client includes establishing an independent reference signal buffer for each channel of each client; the step of accumulating audio samples from all clients sample by sample to generate a global mixed audio includes accumulating the audio samples of each channel separately to generate independent mixed audio; the step of subtracting the client's reference signal from the global mixed audio to obtain the audio signal to be output includes, for each channel, subtracting the client's reference signal from the global mixed audio of that channel; and the step of sending the output audio signals to the corresponding clients includes combining the output audio signals of each channel and sending them to the corresponding client.

[0075] In one possible implementation, the method further includes: creating a shared mixing buffer to store the global mixed audio; for clients marked as silent by the silence detection module, providing a reference to the shared mixing buffer to achieve zero-copy output; and allocating a separate output buffer for each non-silent client to store the output audio signal processed by the echo cancellation processing module and the saturation operation module.
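The buffer-sharing idea can be sketched with Python's `array` and `memoryview` types. This is only an analogy for the memory layout described here: the function name `build_outputs` is hypothetical, and the shared mixing buffer is assumed to already hold samples in the 16-bit signed range so that it can be handed to silent clients as-is.

```python
from array import array


def build_outputs(mixed, refs, silent):
    """mixed: array('h') shared mix; refs: client id -> array('h'); silent: id -> bool."""
    shared_view = memoryview(mixed)  # a reference to the buffer, not a copy
    outputs = {}
    for client, ref in refs.items():
        if silent[client]:
            # Silent client: hand out the shared buffer (zero-copy).
            outputs[client] = shared_view
        else:
            # Non-silent client: separate buffer for the subtracted, clamped signal.
            out = array("h", [0] * len(mixed))
            for n in range(len(mixed)):
                out[n] = max(-32768, min(32767, mixed[n] - ref[n]))
            outputs[client] = out
    return outputs
```

All silent clients receive the same view object, so no per-client copy of the mix is made for them; only non-silent clients cost an extra buffer.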

[0076] In one possible implementation, the real-time detection of the silence status of each client includes: calculating the audio energy value of each client during the mixing cycle; comparing the energy value with a preset silence threshold; and, if the energy value is lower than the silence threshold, determining that the client is in a silent state; wherein the silence threshold is dynamically adjusted according to the background noise level.
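An energy-based detector with an adaptive threshold might look like the sketch below. The application does not state how the threshold follows the background noise, so everything about the adaptation is an assumption made for illustration: the class name `SilenceDetector`, mean-square energy as the energy measure, a noise floor tracked by exponential smoothing over silent frames only, and the parameters `margin`, `smoothing`, `initial_floor`, and `min_floor`.

```python
class SilenceDetector:
    """Hypothetical per-client detector; one instance per client is assumed."""

    def __init__(self, margin=4.0, smoothing=0.05, initial_floor=100.0, min_floor=1.0):
        self.margin = margin            # threshold = noise floor * margin (assumed rule)
        self.smoothing = smoothing      # exponential smoothing factor (assumed)
        self.noise_floor = initial_floor
        self.min_floor = min_floor      # keeps the threshold above zero

    @staticmethod
    def energy(frame):
        """Mean-square energy of one mixing-cycle frame."""
        return sum(s * s for s in frame) / len(frame)

    def is_silent(self, frame):
        e = self.energy(frame)
        threshold = max(self.noise_floor, self.min_floor) * self.margin
        silent = e < threshold
        if silent:
            # Track the background noise level only while the client is silent.
            self.noise_floor += self.smoothing * (e - self.noise_floor)
        return silent
```

Because the floor is updated only on silent frames, a sudden lasting rise in background noise would not be tracked by this sketch; a production detector would need a policy for that case.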

[0077] In one possible implementation, the audio samples in the reference signal buffer and the mixing buffer are stored in a 16-byte aligned manner to meet the memory alignment requirements of SIMD instructions; the step of subtracting the client's reference signal from the global mixed audio to obtain the audio signal to be output includes: invoking the _mm256_load_si256 instruction of the AVX2 instruction set to load 16 16-bit integer samples from an aligned memory address; invoking the _mm256_subs_epi16 instruction to perform saturated subtraction on the 16 loaded pairs of samples; and invoking the _mm256_store_si256 instruction to store the result in the aligned output buffer.

[0078] In one possible implementation, the audio stream is a 48kHz sampling rate, 16-bit depth PCM format, with each mixing cycle lasting 20 milliseconds and containing 960 audio samples.
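The figures in this paragraph are related by simple arithmetic, shown below as a quick check. The constant names are illustrative, and the byte count assumes one channel of 16-bit samples.

```python
SAMPLE_RATE_HZ = 48_000   # 48 kHz sampling rate
CYCLE_MS = 20             # duration of one mixing cycle
BYTES_PER_SAMPLE = 2      # 16-bit depth

# 48000 samples/s * 0.020 s = 960 samples per mixing cycle.
samples_per_cycle = SAMPLE_RATE_HZ * CYCLE_MS // 1000

# Buffer size for one channel of one client per cycle.
bytes_per_cycle = samples_per_cycle * BYTES_PER_SAMPLE

# Number of 16-sample SIMD iterations needed to cover one cycle.
simd_iterations = samples_per_cycle // 16
```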

[0079] The method described in this application first stores the client's original audio sample on the server side, using it as a precise reference signal. When returning the output audio signal to the client, the server has already subtracted the client's corresponding reference signal from the globally mixed audio, thereby eliminating the client's voice and preventing the client user from hearing their own voice, thus achieving echo cancellation. Compared to the prior art where the client estimates and cancels echoes using AEC technology, the method in this application leverages the server's advantage of having a complete reference signal to achieve echo cancellation, reducing complexity while ensuring the accuracy of echo cancellation.

[0080] In another aspect of the embodiments of this application, an electronic device is also provided. Referring to Figure 5, it includes: a memory 501, used to store computer programs; and a processor 502 which, when executing a program stored in the memory, implements the following: simultaneously receiving audio streams from multiple clients and allocating an independent reference signal buffer for each client to store the client's original audio samples, thereby forming a precise reference signal for each participant on the server side; accumulating the audio samples from all clients sample by sample to generate a global mixed audio and storing it in the mixing buffer; detecting the silence status of each client in real time; for clients detected as not silent, subtracting the client's reference signal from the global mixed audio to obtain the audio signal to be output, and for clients detected as silent, directly using the global mixed audio as the audio signal to be output; performing numerical boundary processing on the audio signal to be output to prevent overflow in 16-bit signed integer arithmetic and generating the output audio signal; and sending the output audio signals to the corresponding clients.

[0081] The communication bus mentioned in the above electronic devices can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This communication bus can be divided into address bus, data bus, control bus, etc. For ease of illustration, only one thick line is used to represent it in the diagram, but this does not mean that there is only one bus or one type of bus.

[0082] The communication interface is used for communication between the aforementioned electronic devices and other devices.

[0083] The memory may include random access memory (RAM) or non-volatile memory (NVM), such as at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.

[0084] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.

[0085] In another embodiment provided in this application, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements any of the above-described echo cancellation methods based on a server-side complete reference signal.

[0086] In another embodiment provided in this application, a computer program product containing instructions is also provided, which, when run on a computer, causes the computer to implement any of the above-described echo cancellation methods based on a server-side complete reference signal.

[0087] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a solid-state drive (SSD), etc.

[0088] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0089] The various embodiments in this specification are described in a related manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. Related parts can be found in the descriptions of the system embodiments.

[0090] The above description is merely a preferred embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application are included within the scope of protection of this application.

Claims

1. An echo cancellation system based on a server-side complete reference signal, applied to multi-party audio conferencing, characterized in that the system is deployed on the server side and includes: a multi-channel audio receiving module, used to simultaneously receive audio streams from multiple clients and allocate an independent reference signal buffer for each client to store the client's original audio samples, thereby forming a precise reference signal for each participant on the server side; a mixing processing module, connected to the multi-channel audio receiving module and used to accumulate audio samples from all clients sample by sample to generate a global mixed audio and store it in the mixing buffer; a silence detection module, connected to the multi-channel audio receiving module and used to detect the silence status of each client in real time; an echo cancellation processing module, connected to the reference signal buffer, the mixing buffer, and the silence detection module respectively, which, for clients detected as non-silent, subtracts the client's reference signal from the global mixed audio to obtain the audio signal to be output and, for clients detected as silent, directly uses the global mixed audio as the audio signal to be output; a saturation operation module, connected to the echo cancellation processing module and used to perform numerical boundary processing on the audio signal to be output, prevent overflow in 16-bit signed integer operations, and generate the output audio signal; and an output distribution module, connected to the saturation operation module and used to send the output audio signals to the corresponding clients respectively.

2. The system according to claim 1, characterized in that, The saturation operation module is specifically used for: When the sample value of the audio signal to be output is greater than the upper limit of 32767 for a 16-bit signed integer, the output signal will be clamped to 32767. When the sample value of the audio signal to be output is less than the lower limit of -32768, the output signal will be clamped to -32768. Otherwise, keep the original value unchanged.

3. The system according to claim 1, characterized in that the system supports multi-channel audio processing: the multi-channel audio receiving module establishes an independent reference signal buffer for each channel of each client; the mixing processing module accumulates the audio samples of each channel separately to generate independent mixed audio; the echo cancellation processing module performs an independent subtraction operation for each channel; and the output distribution module merges the output audio signals of each channel and sends them to the corresponding client.

4. The system according to claim 1, characterized in that the system also includes a memory pool management module, used for: creating a shared mixing buffer to store the global mixed audio; for clients marked as silent by the silence detection module, providing a reference to the shared mixing buffer to achieve zero-copy output; and allocating a separate output buffer for each non-silent client to store the output audio signal processed by the echo cancellation processing module and the saturation operation module.

5. The system according to claim 1, characterized in that the silence detection module is specifically used for: calculating the audio energy value of each client during the mixing cycle; comparing the audio energy value with a preset silence threshold; and, if the audio energy value is lower than the silence threshold, determining that the client is in a silent state; wherein the silence threshold is dynamically adjusted according to the background noise level.

6. The system according to claim 1, characterized in that, The echo cancellation processing module employs the SIMD instruction set to perform subtraction operations on multiple audio samples in parallel, specifically including: Audio samples in the reference signal buffer and the mixing buffer are stored in a 16-byte aligned manner to meet the memory alignment requirements of SIMD instructions; The _mm256_load_si256 instruction in the AVX2 instruction set is invoked to load 16 16-bit integer samples from an aligned memory address; The _mm256_subs_epi16 instruction is invoked to perform a saturation subtraction operation on the 16 loaded pairs of samples; The _mm256_store_si256 instruction is called to store the calculation result into the aligned output buffer.

7. The system according to claim 1, characterized in that, The audio stream is in PCM format with a 48kHz sampling rate and 16-bit depth. Each mixing cycle lasts 20 milliseconds and contains 960 audio samples.

8. An echo cancellation method based on a server-side complete reference signal, applied to multi-party audio conferencing, characterized in that, The method includes: Simultaneously, audio streams from multiple clients are received, and an independent reference signal buffer is allocated to each client to store the client's original audio samples, thereby forming a precise reference signal for each participant on the server side; The audio samples from all clients are accumulated one by one to generate a global mixed audio, which is then stored in the mixing buffer. Real-time monitoring of the mute status of each client; For clients detected as not silent, the reference signal of the client is subtracted from the global mixed audio to obtain the audio signal to be output; for clients detected as silent, the global mixed audio is directly used as the audio signal to be output. Numerical boundary processing is performed on the audio signal to be output to prevent overflow in 16-bit signed integer arithmetic, and an output audio signal is generated. The output audio signals are sent to the corresponding clients respectively.

9. An electronic device, characterized in that it includes: a memory, used to store computer programs; and a processor which, when executing a program stored in the memory, implements the method of claim 8.

10. A computer-readable storage medium, characterized in that, The computer-readable storage medium contains a computer program that, when executed by a processor, implements the method of claim 8.