Audio data processing method, system, device and medium
By acquiring spatial impulse response signals to calculate reverberation time, reflection energy ratio, and background noise density sequence, and dynamically adjusting audio processing parameters, the problem that existing audio processing solutions cannot adapt to changes in the acoustic environment is solved, achieving clear and balanced audio signal output and enhancing immersion.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHENGDU XIAOCHANG TECH CO LTD
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-12
AI Technical Summary
Existing audio processing solutions cannot adapt to dynamic changes in the acoustic environment of fixed or semi-fixed performance spaces such as KTVs, home theaters, and live streaming rooms. The result is mismatched reverberation, ineffective equalization, and noise interference, all of which degrade timbre and immersion.
By acquiring the spatial impulse response signal, calculating the reverberation time, reflection energy ratio, and background noise density sequence, the parameters of the digital reverberator and multi-band equalizer are dynamically adjusted. Combined with the singing mode, the blending weights of vocals and accompaniment are adjusted to achieve adaptive audio processing.
It achieves clear, balanced, and spatially coordinated audio signal output in dynamic acoustic environments, enhancing the immersive experience and auditory adaptability of singing.
Figure CN122201236A_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and specifically to an audio data processing method, system, device, and medium.
Background Technology
[0002] In fixed or semi-fixed performance spaces such as KTVs, home theaters, and live streaming rooms, existing audio processing solutions apply audio effects (such as reverberation and equalization) to optimize the listening experience, which are usually based on static or preset assessments of the acoustic environment. However, the actual acoustic environment is determined by multiple time variables such as room physical characteristics, background noise, and human activities. This leads to a mismatch between preset processing parameters and the real-time state of the environment, which in turn triggers a series of technical defects.
[0003] Specifically, first, fixed reverberation parameters cannot adapt to dynamic changes in the environmental reverberation time (such as changes in sound absorption as the number of people increases or decreases). The result is either excessive reverberation, which makes voices muddy and pronunciation unclear, or insufficient reverberation, which makes the sound flat and lacking in spatial feel. Second, equalization compensation based on an ideal reflection model cannot respond to real-time changes in the early/late reflection energy ratio caused by object movement or the opening and closing of doors and windows. This leads to ineffective or excessive timbre compensation (especially for mid-to-high frequency clarity and presence), introducing harshness or masking effects. Third, fluctuations in background noise level (such as air conditioning starting or stopping and outdoor noise intrusion) alter the perceived signal-to-noise ratio and frequency response of the main sound signal. Existing mechanisms cannot quantify noise interference as a correction factor for processing parameters, so voices are easily masked when noise rises and effects sound abrupt when noise falls. Finally, in the real-time mixing of vocals and accompaniment, balancing with a fixed gain ratio cannot follow the natural changes in the singer's volume over the course of a song, so the vocals are sometimes drowned out by the accompaniment and sometimes stand out abruptly above it, disrupting the overall harmony and immersiveness of the music.
Summary of the Invention
[0004] In response to the technical problems mentioned in the background art, the present invention provides an audio data processing method, system, device and medium, which constructs a closed-loop control based on real-time, multi-dimensional acoustic environment perception to collaboratively and adaptively adjust multiple effect parameters and fusion strategies, thereby continuously outputting clear, balanced and spatially coordinated audio signals in a dynamically changing physical environment.
[0005] An audio data processing method includes: acquiring a spatial impulse response signal within a processing period prior to a target time; acquiring a reverberation time sequence, a reflection energy ratio sequence, and a background noise density sequence based on the spatial impulse response signal; acquiring operating parameters of a digital reverberator and a multi-band equalizer based on the reverberation time sequence, the reflection energy ratio sequence, and the background noise density sequence; processing a human voice audio signal at the target time based on the operating parameters and outputting a first energy value; processing an accompaniment audio signal at the target time based on the operating parameters and outputting a second energy value; calculating the short-time energy of the first energy value and the second energy value; calculating an accompaniment blending weight and a human voice blending weight based on the short-time energy and the target human voice proportion under a preset singing mode; and generating a target audio signal based on the accompaniment blending weight and the human voice blending weight.
[0006] Optionally, obtaining the operating parameters of the digital reverberator and the multi-band equalizer based on the reverberation time series, the reflection energy ratio series, and the background noise density series includes: obtaining the excess number of background noise density data exceeding the preset noise density data in the background noise density series, dividing the excess number by the number of background noise density data in the background noise density series to obtain a correction index; obtaining a standard reverberation time range and obtaining a reverberation time dynamic range based on the correction index and the standard reverberation time range; obtaining a standard reflection energy ratio range and obtaining a reflection energy ratio dynamic range based on the correction index and the standard reflection energy ratio range; obtaining the proportion of all reverberation time data in the reverberation time series that are within the reverberation time dynamic range as a first effective proportion, and obtaining the proportion of all reflection energy ratio data in the reflection energy ratio series that are within the reflection energy ratio dynamic range as a second effective proportion; mapping the operating parameters of the digital reverberator based on the first effective proportion, and mapping the operating parameters of the multi-band equalizer based on the second effective proportion.
[0007] Optionally, processing the human voice audio signal at the target time based on the operating parameters and outputting a first energy value includes: inputting the human voice audio signal into the digital reverberator, controlling the reverberation tail decay rate using the operating parameters of the digital reverberator, and obtaining a reverberated human voice audio signal; inputting the reverberated human voice audio signal into the multi-band equalizer, adjusting the gain of each frequency band using the operating parameters of the multi-band equalizer, and obtaining an equalized human voice audio signal; squaring the equalized human voice audio signal and integrating it within a 50-millisecond sliding window to obtain the first energy value.
[0008] Optionally, processing the accompaniment audio signal at the target time based on the operating parameters and outputting a second energy value includes: inputting the accompaniment audio signal into the digital reverberator, controlling the reverberation tail decay rate using the operating parameters of the digital reverberator, and obtaining a reverberated accompaniment audio signal; inputting the reverberated accompaniment audio signal into the multi-band equalizer, adjusting the gain of each frequency band using the operating parameters of the multi-band equalizer, and obtaining an equalized accompaniment audio signal; squaring the equalized accompaniment audio signal and integrating it within a 50-millisecond sliding window to obtain the second energy value.
[0009] Optionally, calculating the short-time energy of the first energy value and the second energy value includes: applying Hanning window weighting to the first energy value and the second energy value respectively, framing them with a 200-millisecond frame length and 50% overlap, and calculating the arithmetic mean of the energy values in each frame to obtain the short-time energy sequence of the human voice and the short-time energy sequence of the accompaniment.
[0010] Optionally, calculating the accompaniment blending weight and vocal blending weight based on short-time energy and the target vocal proportion in a preset singing mode includes: obtaining the target vocal proportion corresponding to the current singing mode, and calculating the sum of the short-time energy of the vocal and the short-time energy of the accompaniment in each frame; if the sum of a certain frame is greater than zero, then the target vocal proportion multiplied by the ratio of the short-time energy of the vocal in that frame to the sum is used as the vocal blending weight of that frame, and the remaining part is used as the accompaniment blending weight of that frame; if the sum of a certain frame is equal to zero, then both the vocal blending weight and the accompaniment blending weight of that frame are set to 0.5.
[0011] Optionally, generating the target audio signal based on the accompaniment blending weight and the vocal blending weight includes: multiplying the reverberated and equalized accompaniment audio signal and the vocal audio signal frame by the corresponding accompaniment blending weight and vocal blending weight respectively, and adding the two weighted signals sample by sample to generate the target audio signal.
[0012] An audio data processing system is also provided, which implements the above-mentioned audio data processing method. The system includes: an acoustic perception module, used to acquire the spatial impulse response signal of the time period to be processed before the target time through a microphone array deployed in the singing space, and to acquire the reverberation time sequence, reflection energy ratio sequence and background noise density sequence based on the spatial impulse response signal; a parameter configuration module, used to acquire the decay time coefficient of the digital reverberator and the gain offset of the multi-band equalizer at each center frequency point based on the reverberation time sequence, reflection energy ratio sequence and background noise density sequence; a signal processing module, used to process the human voice audio signal and the accompaniment audio signal based on the decay time coefficient and the gain offset respectively; and a fusion control module, used to calculate the short-time energy of the processed human voice audio signal and the accompaniment audio signal, generate frame-by-frame fusion weights based on the target human voice proportion in the preset singing mode, and generate the target audio signal based on the frame-by-frame fusion weights.
[0013] An electronic device is also provided, comprising: a memory having a computer program stored thereon; and a processor for executing the computer program in the memory to implement the above-mentioned audio data processing method.
[0014] A non-transitory computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the above-mentioned audio data processing method.
[0015] The beneficial effects of this invention are reflected in: In the entire audio data processing method, firstly, by periodically extracting the spatial impulse response and calculating the reverberation time series, the decay time coefficient of the digital reverberator is mapped in real time using the dynamic range and effective ratio analysis of background noise correction. This allows for automatic adjustment of the artificial reverberation decay rate based on changes in sound absorption caused by changes in the number of people, avoiding the muddiness caused by excessive reverberation or the dryness caused by insufficient reverberation, thus ensuring harmonious integration of artificial reverberation with the room's acoustic characteristics. Furthermore, by analyzing the early and late reflection energy ratio sequences of the impulse response and evaluating its stability in conjunction with noise correction indicators, the gain offset of a multi-band equalizer at each key frequency point is dynamically generated. This allows for response to reflection changes caused by the opening and closing of doors and windows or the movement of objects, compensating for easily affected frequency bands such as mid-high frequencies, mitigating the dull timbre caused by compensation failure or the harshness caused by overcompensation. Furthermore, the proportion of background noise density sequence exceeding the standard is quantified as a correction index. This index not only expands the reasonable judgment range of reverberation and reflection ratio to enhance robustness under noise, but its implicit noise intensity information also indirectly affects the final reverberation attenuation coefficient and equalization gain through an effective proportion. 
This makes it tend to shorten reverberation and potentially increase the gain of key frequency bands to counteract masking when noise increases, and restore a more natural processing amplitude when noise decreases, thus improving the signal-to-noise ratio and listening adaptability under different noise environments. Furthermore, by calculating the short-time energy of the vocal and accompaniment signals after applying the same environmental adaptive processing (same reverberation and equalization), and combining it with the target vocal proportion preset in the singing mode, the fusion weight is dynamically calculated frame by frame. This strategy can follow the natural fluctuations of the singer's volume, relatively increasing its gain weight when the vocal is weak to prevent it from being drowned out by the accompaniment, and moderately suppressing it when the vocal is too strong to maintain overall balance, thereby achieving adaptive adjustment between vocal prominence and musical harmony, and improving the immersive experience of the performance.
Attached Figure Description
[0016] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.
[0017] Figure 1 is a schematic diagram illustrating the steps of the audio data processing method of the present invention; Figure 2 is a schematic diagram of part of step S3 in the audio data processing method of the present invention; Figure 3 is a schematic diagram of another part of step S3 in the audio data processing method of the present invention; Figure 4 is a schematic diagram of part of step S4 in the audio data processing method of the present invention; Figure 5 is a block diagram illustrating an electronic device according to an embodiment of the present invention.
[0018] Figure labels: 700 - Electronic device; 701 - Processor; 702 - Memory; 703 - Multimedia component; 704 - I/O interface; 705 - Communication component.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0020] Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.
[0021] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, the terms "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0022] As shown in Figure 1, an audio data processing method is provided. In one embodiment, the method includes: S1. Obtain the spatial impulse response signal during the processing period before the target time, and obtain the reverberation time sequence, reflection energy ratio sequence, and background noise density sequence based on the spatial impulse response signal; S2. Obtain the operating parameters of the digital reverberator and multi-band equalizer based on the reverberation time sequence, reflection energy ratio sequence, and background noise density sequence; S3. Process the human voice audio signal at the target time based on the operating parameters and output a first energy value; process the accompaniment audio signal at the target time based on the operating parameters and output a second energy value; S4. Calculate the short-time energy of the first energy value and the second energy value, calculate the accompaniment fusion weight and the vocal fusion weight based on the short-time energy and the target vocal proportion in the preset singing mode, and generate the target audio signal based on the accompaniment fusion weight and the vocal fusion weight.
[0023] In this embodiment, it should be noted that in S1, the input is the original sound wave of the physical space, and the output is three quantifiable time series, which respectively characterize the reverberation characteristics, reflection structure and noise background of the environment.
[0024] In practice, during the 30 seconds before the user starts singing (e.g., before pressing the "start" button), a 0.6-meter diameter circular array of eight omnidirectional microphones periodically collects the room's response to a known test signal. This test signal is a 1-second composite signal with a 48kHz sampling rate, consisting of a logarithmically swept sine wave from 20Hz to 20kHz superimposed with uniformly distributed white noise. Its wide bandwidth and good autocorrelation facilitate the separation of the pure spatial impulse response from the recording. For example, the test signal is played every 5 seconds through the room's main speakers, and the microphone array records synchronously. Subsequently, each recording is denoised and temporally aligned, and a maximum length sequence deconvolution algorithm is used to extract the spatial impulse response signal from the sound source to each microphone position. This signal is essentially a time-domain representation of the acoustic transmission path: the slope of its energy decay curve reflects the reverberation time, and the relationship between early (0-50ms) and late (50-300ms) reflection energy reflects the room's reflection characteristics.
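The 1-second test signal described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_test_signal`, the noise amplitude `noise_level`, and the fixed random seed are hypothetical, because the text does not specify the level at which the noise is superimposed.

```python
import math
import random


def make_test_signal(fs=48000, duration=1.0, f_start=20.0, f_end=20000.0,
                     noise_level=0.1, seed=0):
    """Logarithmic sine sweep plus uniformly distributed white noise.

    noise_level and seed are illustrative assumptions; the text only says
    that uniform white noise is superimposed on the sweep.
    """
    rng = random.Random(seed)
    n = int(fs * duration)
    k = math.log(f_end / f_start)  # natural log of the frequency ratio
    signal = []
    for i in range(n):
        t = i / fs
        # Phase of an exponential (logarithmic) sweep from f_start to f_end
        phase = (2.0 * math.pi * f_start * duration / k
                 * (math.exp(t / duration * k) - 1.0))
        sweep = math.sin(phase)
        noise = rng.uniform(-noise_level, noise_level)
        # Scale the sweep so the sum stays within [-1, 1]
        signal.append((1.0 - noise_level) * sweep + noise)
    return signal


test_signal = make_test_signal()
```

With the default arguments the signal has 48000 samples, matching the 1-second, 48kHz description.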
[0025] Subsequently, the 30-second impulse response data stream is divided into non-overlapping 500-millisecond segments for analysis. For each segment, the Schroeder integral method is used to calculate its energy decay curve, and a linear fit of that curve gives the time required for a 60dB decay, i.e., a T60 reverberation time value. This forms a reverberation time sequence containing approximately 60 data points. At the same time, within each impulse response segment, the arrival time of the direct-sound peak is identified, and the root mean square (RMS) values of the signal within the subsequent 0-50ms and 50-300ms windows are calculated. The latter is divided by the former to obtain the reflection energy ratio, forming a synchronized reflection energy ratio sequence. Furthermore, during the intervals between test signal playbacks, ambient background noise is collected using the same microphone array, and its average power spectral density in the 1kHz to 4kHz band, to which the human ear is most sensitive, is calculated using the short-time Fourier transform, forming a background noise density sequence. These three sequences together constitute a dynamic acoustic fingerprint: the reverberation time sequence reveals changes in the overall sound absorption of the room (such as changes in the number of people), the reflection energy ratio sequence reflects changes in local reflective surfaces (such as furniture movement), and the background noise density sequence quantifies the intensity fluctuations of steady-state interference sources such as air conditioning and ventilation.
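The two per-segment measurements can be illustrated with a small sketch. It assumes the impulse response segment is available as a list of samples; the function names and the fitting range of -5dB to -25dB (with the slope extrapolated to 60dB) are assumptions chosen for the illustration, not values given in the text.

```python
import math


def schroeder_t60(h, fs, fit_db=(-5.0, -25.0)):
    """Estimate T60 from an impulse response via Schroeder backward integration."""
    # Backward-integrated energy: edc[i] = sum of h[k]^2 for all k >= i
    edc = [0.0] * len(h)
    acc = 0.0
    for i in range(len(h) - 1, -1, -1):
        acc += h[i] * h[i]
        edc[i] = acc
    total = edc[0]
    # Keep the (time, level) points that fall inside the fitting range
    pts = []
    for i, e in enumerate(edc):
        if e <= 0.0:
            break
        db = 10.0 * math.log10(e / total)
        if fit_db[1] <= db <= fit_db[0]:
            pts.append((i / fs, db))
    if len(pts) < 2:
        raise ValueError("not enough decay range for a linear fit")
    # Least-squares slope of level (dB) against time (s)
    mt = sum(t for t, _ in pts) / len(pts)
    md = sum(d for _, d in pts) / len(pts)
    slope = (sum((t - mt) * (d - md) for t, d in pts)
             / sum((t - mt) ** 2 for t, _ in pts))
    return -60.0 / slope  # time needed for a 60 dB decay


def reflection_energy_ratio(h, fs, early_ms=50.0, late_ms=300.0):
    """RMS of the 50-300 ms window divided by RMS of the 0-50 ms window."""
    peak = max(range(len(h)), key=lambda i: abs(h[i]))
    n_early = int(fs * early_ms / 1000.0)
    n_late = int(fs * late_ms / 1000.0)
    early = h[peak:peak + n_early]
    late = h[peak + n_early:peak + n_late]
    if not early or not late:
        raise ValueError("impulse response too short after the peak")

    def rms(seg):
        return math.sqrt(sum(v * v for v in seg) / len(seg))

    return rms(late) / rms(early)


# Synthetic check: an ideal exponential decay with a known T60 of 1.0 s
fs = 8000
h = [math.exp(-6.9078 * (i / fs) / 1.0) for i in range(int(1.5 * fs))]
t60_estimate = schroeder_t60(h, fs)
rer_estimate = reflection_energy_ratio(h, fs)
```

On the synthetic decay the estimate lands close to the nominal 1.0 s, which is a quick way to sanity-check the fit before applying it to measured segments.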
[0026] In S2, the perceptual data obtained in S1 is transformed into specific control parameters that can drive the audio processing unit. This stage first processes the background noise density sequence and generates a global correction index η through a threshold comparison mechanism.
[0027] For example, suppose the preset noise density threshold is -40 dBFS and 12 of the 60 data points in the sequence exceed it. The excess number is then 12, and the correction index is η = 12 / 60 = 0.2. This η value (between 0 and 1) quantifies how far the environmental noise deviates from ideal listening conditions.
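The threshold count can be written in a few lines. The sketch below assumes the noise density sequence is a list of levels in dBFS; the function name `correction_index` and the sample values are hypothetical.

```python
def correction_index(noise_density_db, threshold_db=-40.0):
    """Fraction of noise density points that exceed the preset threshold."""
    if not noise_density_db:
        raise ValueError("noise density sequence is empty")
    excess = sum(1 for v in noise_density_db if v > threshold_db)
    return excess / len(noise_density_db)


# 60 points, 12 of them above the -40 dBFS threshold (illustrative values)
sequence = [-35.0] * 12 + [-50.0] * 48
eta = correction_index(sequence)
```

For this input `eta` equals 0.2, reproducing the worked example.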
[0028] Next, η is used to dynamically relax the criteria for "normal" acoustic characteristics. A standard reverberation time range [T60_min, T60_max] = [0.8s, 1.5s] and a standard reflection energy ratio range [RER_min, RER_max] = [0.3, 0.7] are defined. A preset scaling factor α = β = 0.5 is used. When η = 0.2, the lower limit of the reverberation time dynamic range is calculated as T60_low = 0.8 - 0.5 * 0.8 * 0.2 = 0.72s, and the upper limit is T60_high = 1.5 + 0.5 * 1.5 * 0.2 = 1.65s. Similarly, the lower limit of the reflection energy ratio dynamic range is RER_low = 0.3 - 0.5 * 0.3 * 0.2 = 0.27, and the upper limit is RER_high = 0.7 + 0.5 * 0.7 * 0.2 = 0.77. The greater the noise, the wider the tolerance range for "normal" reverberation time and reflected energy ratio, because noise can mask acoustic details, and overly strict judgments may lead to frequent parameter jumps.
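The range widening follows one pattern for both quantities: the lower limit shrinks and the upper limit grows in proportion to η. A minimal sketch, assuming a single helper named `dynamic_range` (a hypothetical name) with the scaling factor passed as `scale`:

```python
def dynamic_range(std_min, std_max, eta, scale=0.5):
    """Widen a standard range in proportion to the correction index eta."""
    low = std_min - scale * std_min * eta
    high = std_max + scale * std_max * eta
    return low, high


eta = 0.2
t60_low, t60_high = dynamic_range(0.8, 1.5, eta)  # reverberation time, seconds
rer_low, rer_high = dynamic_range(0.3, 0.7, eta)  # reflection energy ratio
```

With η = 0.2 this gives the intervals used in the example, approximately [0.72, 1.65] and [0.27, 0.77]; with η = 0 the standard ranges are returned unchanged.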
[0029] Then, the proportion of data points in the reverberation time series falling within the interval [0.72s, 1.65s] is calculated as the first effective proportion P1; the proportion of data points in the reflection energy ratio series falling within the interval [0.27, 0.77] is calculated as the second effective proportion P2. Assume P1 = 0.9 and P2 = 0.7.
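The effective proportion is a simple in-range count. The sketch below treats both interval limits as inclusive, which is an assumption because the text does not say how boundary values are handled; the function name and sample data are hypothetical.

```python
def effective_proportion(values, low, high):
    """Fraction of values that fall within [low, high] (inclusive bounds assumed)."""
    if not values:
        raise ValueError("sequence is empty")
    inside = sum(1 for v in values if low <= v <= high)
    return inside / len(values)


# Ten illustrative T60 values, nine of which lie within [0.72, 1.65]
t60_sequence = [0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 0.8, 1.9]
p1 = effective_proportion(t60_sequence, 0.72, 1.65)
```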
[0030] Finally, P1 and P2 are converted into processing parameters using a preset mapping function. For a digital reverb, the key parameter is the decay time coefficient τ (1.0 represents the standard decay time). An example mapping is τ = 0.5 + 0.5 * P1. When P1 = 0.9, τ = 0.95, meaning the reverb decay time is set to approximately 95% of the standard value.
[0031] For a multi-band equalizer, the parameter is the gain offset ΔG at each center frequency (e.g., 1kHz, 2kHz, 4kHz). An example mapping is: when P2 is high, ΔG is close to 0; when P2 is low, positive compensation is applied at the mid-to-high frequencies, for example, ΔG_2kHz = (1-P2)*4dB. When P2 = 0.7, ΔG_2kHz = (1-0.7)*4 = 1.2dB. Thus, the output of S2 is τ and a set of ΔG values, which control the subsequent audio processing and are adjusted according to the stability and noise level of the real-time acoustic environment.
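Both example mappings are linear and can be sketched directly. The text gives the 4dB factor only for the 2kHz band, so applying the same factor to every listed center frequency is an assumption made here for illustration; the function names are hypothetical.

```python
def map_decay_coefficient(p1):
    """Example mapping from the first effective proportion to tau."""
    return 0.5 + 0.5 * p1


def map_gain_offsets(p2, centers_hz=(1000, 2000, 4000), max_boost_db=4.0):
    """Example mapping from the second effective proportion to gain offsets.

    Using the same max_boost_db for every band is an assumption; the text
    only states the 2 kHz case.
    """
    return {f: (1.0 - p2) * max_boost_db for f in centers_hz}


tau = map_decay_coefficient(0.9)
offsets_db = map_gain_offsets(0.7)
```

For P1 = 0.9 and P2 = 0.7 this yields τ of about 0.95 and an offset of about 1.2dB at 2kHz, as in the example.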
[0032] In S3, the control parameters generated in S2 are applied to perform consistent spatialization and spectral shaping on the two original signals of vocals and accompaniment, and the instantaneous energy of the processed signals is calculated.
[0033] For the human voice signal, the time-domain signal x_vocal(t) captured by the main microphone at a sampling rate of 48kHz is first fed into a digital reverb unit. This reverb unit employs an algorithmic structure based on a feedback delay network, containing multiple delay lines. The gain of the feedback loops collectively determines the decay rate at the reverberation tail. The decay time coefficient τ directly controls these feedback gains. When τ = 0.95 (from the S2 example), the feedback gain is set to cause the reverberation tail energy to decay according to e^(-13.8*t / T60), where T60 is the result of τ multiplied by a reference reverberation time (e.g., 1.2s), meaning the actual decay time is approximately 1.14s.
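One common way to turn a target T60 into feedback gains in a feedback delay network is to give each delay line a gain that attenuates the signal by 60dB over T60 seconds. The sketch below uses that relation as an assumption, since the text does not state how τ is mapped to the gains; the delay lengths are hypothetical values.

```python
import math


def fdn_feedback_gains(tau, delays_samples, fs=48000, reference_t60=1.2):
    """Per-delay-line gains giving a 60 dB amplitude decay after T60 seconds."""
    t60 = tau * reference_t60
    # A line of d samples is traversed fs * t60 / d times in T60 seconds,
    # so each pass must attenuate by 60 * d / (fs * t60) dB.
    gains = [10.0 ** (-3.0 * d / (fs * t60)) for d in delays_samples]
    return t60, gains


# Hypothetical delay-line lengths in samples
delays = [1433, 1601, 1867, 2053]
t60, gains = fdn_feedback_gains(0.95, delays)
```

Squaring such a gain gives an energy decay of 10^(-6*d/(fs*T60)) per pass, which is the same rate as the e^(-13.8*t / T60) expression above, and τ = 0.95 gives a T60 of about 1.14s.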
[0034] The processed signal x_vocal_rev(t) is then fed into an 8-band parametric equalizer. Each band has a fixed center frequency and Q value (e.g., Q=1.414 for the 2kHz band), and its gain is set to the factory preset value (possibly 0dB) plus the offset ΔG calculated by S2. Following the example above, the gain is increased by 1.2dB at 2kHz. This is intended to compensate for mid-to-high frequency losses that may be caused by an inadequate room reflection structure (reflected by a low P2 value). The equalized signal x_vocal_proc(t) then enters the energy calculation unit. This unit squares each sample value of the signal (x^2) and then accumulates (integrates) it within a sliding window of 50 milliseconds (corresponding to 2400 samples) to obtain the first energy value E1(t). This is a continuous instantaneous power approximation. For example, at a certain time t, if the sum of squares of the samples of x_vocal_proc(t) within the window is 120, then E1(t) = 120.
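The sliding-window energy can be computed with a running sum so that each new sample costs constant time. This sketch assumes a trailing (causal) window that ends at the current sample; the text does not say how the window is positioned, and the function name is hypothetical.

```python
def sliding_energy(samples, fs=48000, window_ms=50.0):
    """Sum of squared samples over a trailing window of window_ms milliseconds."""
    win = int(fs * window_ms / 1000.0)  # 2400 samples at 48 kHz and 50 ms
    energy = []
    acc = 0.0
    for i, x in enumerate(samples):
        acc += x * x
        if i >= win:
            old = samples[i - win]
            acc -= old * old  # drop the sample that left the window
        energy.append(max(acc, 0.0))  # guard against tiny negative rounding
    return energy


# A constant unit signal: once the window is full, the energy equals its length
e1 = sliding_energy([1.0] * 3000)
```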
[0035] The accompaniment signal x_accomp(t) undergoes the same processing chain: using the same τ value (0.95) through the same or identically configured digital reverb, using the same ΔG values for each frequency band through the same or identically configured multi-band equalizer, x_accomp_proc(t) is generated, and the second energy value E2(t) is obtained through the same 50-millisecond sliding window integration. Using the same processing parameters for both vocals and accompaniment ensures consistency in spatial sense (reverberation) and timbre balance (equalization), avoiding inconsistencies between artificial effects and the natural acoustic environment. The calculated E1(t) and E2(t) provide objective data based on the loudness of the processed vocal and accompaniment signals for the next stage of intelligent fusion.
[0036] In S4, the mixing ratio of the processed vocals and accompaniment is dynamically adjusted based on the real-time energy relationship and the user's selected artistic intent (singing mode) to generate the final output signal. Its inputs are the continuous energy values E1(t) and E2(t) output from S3, and the preset target vocal proportion ρ (e.g., ρ=0.65 in "solo mode").
[0037] First, E1(t) and E2(t) are smoothed and normalized on a time scale. Specifically, a Hanning window weighted framing process with a frame length of 200 milliseconds and a frame shift of 100 milliseconds (50% overlap) is used, and the arithmetic mean of all energy values within each frame is calculated. For example, for the first frame starting from t=0, the E1(t) values contained therein are averaged after window weighting to obtain the short-time energy of human voice E1_frame[0]. Assume E1_frame[0]=100 and E2_frame[0]=70. This yields two discrete short-time energy sequences, which are more stable than the original instantaneous energy values and better reflect loudness changes at the syllable or phrase level.
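The framing step can be sketched as follows, with frame length and hop given in samples (9600 and 4800 at 48kHz for 200ms frames with 50% overlap). The symmetric Hann window and the choice to divide the weighted sum by the frame length are assumptions about details the text leaves open; the function name is hypothetical.

```python
import math


def short_time_energy(energy, frame_len, hop):
    """Hann-weighted mean of an energy sequence over overlapping frames."""
    if frame_len < 2 or hop < 1:
        raise ValueError("frame_len must be >= 2 and hop >= 1")
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    start = 0
    while start + frame_len <= len(energy):
        segment = energy[start:start + frame_len]
        weighted_sum = sum(w * e for w, e in zip(window, segment))
        frames.append(weighted_sum / frame_len)  # arithmetic mean of weighted values
        start += hop
    return frames


# Constant energy of 2.0 over 400 samples, frames of 200 with a hop of 100
frames = short_time_energy([2.0] * 400, frame_len=200, hop=100)
```

Because the window tapers to zero at both ends, a constant input is scaled by the window's mean (close to 0.5), so absolute values depend on the window while ratios between vocal and accompaniment frames are unaffected.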
[0038] Next, for each frame n, the total energy of that frame is calculated as S[n] = E1_frame[n] + E2_frame[n]. The fusion weights depend on whether this total energy is zero. If S[n] > 0, the vocal fusion weight is w_v[n] = ρ * (E1_frame[n] / S[n]), and the accompaniment fusion weight is w_a[n] = 1 - w_v[n]. Substituting the above values: S[0] = 170, w_v[0] = 0.65 * (100 / 170) ≈ 0.382, w_a[0] = 1 - 0.382 = 0.618. The design logic of this formula is: taking the target proportion ρ as the expectation, scaling it according to the actual proportion of the vocals in the current total energy. If the actual proportion of the vocals is lower than the target, its weight will be appropriately increased to approach the target; otherwise, it will be suppressed. If S[n] = 0 (a silent frame), then w_v[n] = w_a[n] = 0.5.
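The per-frame weight rule, including the silent-frame case, can be written as a short function. The name `fusion_weights` is hypothetical; the formula is the one stated above.

```python
def fusion_weights(e_vocal, e_accomp, rho):
    """Return (vocal weight, accompaniment weight) for one frame."""
    total = e_vocal + e_accomp
    if total > 0:
        w_v = rho * (e_vocal / total)
        return w_v, 1.0 - w_v
    # Silent frame: split evenly
    return 0.5, 0.5


w_v0, w_a0 = fusion_weights(100.0, 70.0, rho=0.65)
```

For the example frame this gives roughly 0.382 and 0.618. Since E1_frame[n] / S[n] is at most 1, the vocal weight produced by this formula never exceeds ρ.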
[0039] Finally, the target audio signal y(t) is generated. The two signals x_vocal_proc(t) and x_accomp_proc(t) processed by S3 are also divided into the same 200-millisecond frames (synchronized with the energy frames). For each audio sample in the nth frame time period, the vocal sample is multiplied by w_v[n], the accompaniment sample is multiplied by w_a[n], and then the two are added together.
[0040] Continuing the previous example, within the time interval of frame 0, each processed accompaniment sample value is multiplied by 0.618, and the vocal sample value is multiplied by 0.382, and then the results are summed and output. In this way, the final mixed signal maintains a dynamic balance within each frame: it respects the natural dynamics of the singer's volume fluctuations (reflected by E1_frame[n]) while tending to maintain the preset vocal prominence (defined by ρ), thereby achieving an adaptive and coherent balance between the accompaniment background and the vocal foreground.
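The frame-wise mixing can be sketched as below. Because the frames overlap by 50%, each sample belongs to two frames; this sketch assigns each sample the weight of the frame that starts in its hop interval, which is an assumption about a detail the text does not specify. The function name is hypothetical.

```python
def mix_frames(vocal, accomp, w_v, w_a, hop):
    """Weight and sum two processed signals using per-frame fusion weights."""
    if len(vocal) != len(accomp):
        raise ValueError("signals must have the same length")
    if not w_v or len(w_v) != len(w_a):
        raise ValueError("weight sequences must be non-empty and equal in length")
    out = []
    last = len(w_v) - 1
    for i, (v, a) in enumerate(zip(vocal, accomp)):
        n = min(i // hop, last)  # frame index for this sample (assumed mapping)
        out.append(w_v[n] * v + w_a[n] * a)
    return out


# Two short frames with a hop of 2 samples (illustrative values)
mixed = mix_frames([1.0] * 4, [2.0] * 4,
                   w_v=[0.382, 0.5], w_a=[0.618, 0.5], hop=2)
```

In a real-time implementation the weights would likely be interpolated between frames to avoid audible steps at frame boundaries; that smoothing is not described in the text and is omitted here.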
[0041] In summary, the audio data processing method first periodically extracts the spatial impulse response and calculates the reverberation time sequence. Through the noise-corrected dynamic range and the effective ratio analysis, the decay time coefficient of the digital reverberator is mapped in real time. The artificial reverberation decay rate is thus adjusted automatically as sound absorption changes with the number of people in the room, avoiding the muddiness caused by excessive reverberation and the dryness caused by insufficient reverberation, so that the artificial reverberation blends with the room's acoustic characteristics. Furthermore, by analyzing the early-to-late reflection energy ratio sequence of the impulse response and evaluating its stability against the noise-corrected range, the gain offsets of the multi-band equalizer at key frequency points are generated dynamically. This responds to reflection changes caused by opening and closing doors and windows or by moving objects, compensates easily affected bands such as the mid-high frequencies, and mitigates both the dull timbre caused by insufficient compensation and the harshness caused by overcompensation. In addition, the proportion of the background noise density sequence that exceeds the preset threshold is quantified as the correction index. This index widens the acceptable ranges of reverberation time and reflection energy ratio to improve robustness under noise, and the noise intensity it carries also indirectly affects the final decay time coefficient and equalization gain through the effective ratios.
As a result, the processing tends to shorten the reverberation and may raise the gain of key frequency bands to counteract masking when noise increases, and returns to a more natural processing amplitude when noise decreases, thus improving the signal-to-noise ratio and listening adaptability under different noise environments. Finally, the short-time energy of the vocal and accompaniment signals is calculated after both have received the same environment-adaptive processing (same reverberation and equalization), and it is combined with the target vocal proportion preset in the singing mode to calculate the fusion weights dynamically, frame by frame. This strategy can follow the natural fluctuations of the singer's volume, relatively increasing the vocal gain weight when the vocal is weak to prevent it from being drowned out by the accompaniment, and moderately suppressing it when the vocal is too strong to maintain overall balance, thereby achieving adaptive adjustment between vocal prominence and musical harmony and improving the immersive experience of the performance.
[0042] In one implementation, obtaining the operating parameters of the digital reverberator and the multi-band equalizer in S2 based on the reverberation time sequence, reflection energy ratio sequence, and background noise density sequence includes: S21. Count the background noise density data in the background noise density sequence that exceed the preset noise density threshold, and divide this excess count by the total number of background noise density data in the sequence to obtain the correction index. S22. Obtain the standard reverberation time range and derive the reverberation time dynamic range from the correction index and the standard reverberation time range; obtain the standard reflection energy ratio range and derive the reflection energy ratio dynamic range from the correction index and the standard reflection energy ratio range. S23. Take the proportion of reverberation time data in the reverberation time sequence that fall within the reverberation time dynamic range as the first effective ratio, and take the proportion of reflection energy ratio data in the reflection energy ratio sequence that fall within the reflection energy ratio dynamic range as the second effective ratio. S24. Obtain the operating parameters of the digital reverberator by mapping the first effective ratio, and obtain the operating parameters of the multi-band equalizer by mapping the second effective ratio.
[0043] In this embodiment, it should be noted that in S21, the background noise density sequence obtained in S1 is quantified into a single, global correction index η to characterize the interference intensity of environmental noise on audio processing.
[0044] First, a noise density threshold representing a relatively quiet environment is preset, such as -40 dBFS (decibels relative to full scale). The threshold is determined from the background noise level that is acceptable under typical listening conditions. It is usually calibrated by referring to the full-scale level of the audio system and to the background noise level acceptable to the human ear in a quiet room (approximately 30-40 dBA, sound pressure level). During the system design phase, the threshold is determined by measuring the static background noise of multiple standard rooms and taking a statistical upper limit (such as the 95th percentile). For example, during the design verification phase, the background noise was measured at night in multiple empty KTV rooms (air conditioning off), and the A-weighted sound pressure level distribution was found to be 32-38 dBA. For the 24-bit audio interface used by the system (full scale approximately +24 dBu), the corresponding noise density was calculated to be approximately -42 dBFS to -38 dBFS. To leave a margin, the threshold was set to -40 dBFS. During operation, a data point in the sequence with a value of, for example, -38 dBFS is judged to exceed the threshold.
[0045] Furthermore, the number of data points in the background noise density sequence that exceed the preset threshold is counted, iterating through each data point. Assuming a 30-second sensing phase, 60 background noise density data points are generated at 500-millisecond intervals. By comparing each point, 15 data points are found to exceed -40 dBFS. The excess number N_excess is then 15. The correction index η is obtained by dividing N_excess by the total number of data points in the sequence, N_total (60 in this case), i.e., η = 15 / 60 = 0.25. This η value, between 0 and 1, indicates that the environmental noise level exceeded the preset "quiet" baseline for 25% of the recent observation period. It is not merely a statistical value, but also the logical basis for subsequent dynamic parameter adjustments. A higher η value indicates that the environmental noise is more persistent or intense, requiring a greater degree of modification to the processing strategy. For example, it may be necessary to shorten the reverberation to avoid masking the sound and relax the criteria for "normal" fluctuations in other acoustic parameters, since noise itself reduces the ability of hearing to distinguish subtle acoustic characteristics.
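The counting step in S21 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the invention: the function name `correction_index` and the synthetic readings are assumptions, and "exceed" is taken to mean strictly greater than the threshold.

```python
def correction_index(noise_dbfs, threshold_dbfs=-40.0):
    """Return eta: the fraction of readings strictly above the threshold."""
    if not noise_dbfs:
        raise ValueError("noise sequence must not be empty")
    # N_excess: number of data points that exceed the preset threshold.
    n_excess = sum(1 for value in noise_dbfs if value > threshold_dbfs)
    # eta = N_excess / N_total, always between 0 and 1.
    return n_excess / len(noise_dbfs)


# Synthetic data: 60 readings (30 s at 500 ms intervals), 15 above -40 dBFS.
readings = [-38.0] * 15 + [-45.0] * 45
eta = correction_index(readings)
print(eta)  # 0.25
```

Because η is a simple proportion, it reflects how often the threshold was exceeded, not by how much.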
[0046] In S22, the correction index η calculated in S21 is received, and the acceptable ranges of the reverberation time and the reflection energy ratio are dynamically defined from it. The technical purpose is to let the judgment criteria adapt to different noise environments, so that normal fluctuations of the environmental parameters under noisy conditions are not misjudged as abnormal, which would cause frequent jumps in the processing parameters.
[0047] This step requires setting two reference ranges: a standard reverberation time range (e.g., [T60_min, T60_max] = [0.8s, 1.5s]) and a standard reflection energy ratio range (e.g., [RER_min, RER_max] = [0.3, 0.7]).
[0048] The standard reverberation time range is determined based on the typical volume, purpose, and acoustic design specifications of the target space (such as a KTV room). This range aims to balance speech intelligibility (requiring shorter reverberation) and musical fullness / spatiality (requiring longer reverberation). It is based on acoustic design manuals (such as recommended values for small entertainment rooms) and is obtained by averaging and taking a reasonable fluctuation range through actual measurements in a limited number of acoustically well-equipped reference rooms. For example, for a KTV room with a volume of approximately 50 cubic meters, industry design guidelines recommend a reverberation time (mid-frequency 500Hz-1kHz) of 1.0±0.3 seconds. Through impulse response measurements of 10 well-regarded sample rooms, the T60 values are distributed between 0.85s and 1.45s. Therefore, the standard range is set to [0.8s, 1.5s].
[0049] The standard range of reflected energy ratios is determined based on research into the impact of the energy ratio of early reflected sound (0-50ms, affecting clarity and localization) to late reflected sound (50-300ms, affecting spatial perception and fullness) on subjective listening experience in acoustic psychology. Subjective listening tests were conducted in rooms with different acoustic characteristics to statistically determine the energy ratio distribution range corresponding to rooms perceived as "natural" or "good." For example, in an acoustic laboratory, simulated environments with various reflection characteristics were constructed by adjusting reflectors and sound-absorbing materials, and listeners were organized to rate the reproduced speech. Statistical analysis revealed that the highest proportion of clear and moderately spatially perceptible evaluations were obtained when the ratio of early reflected sound energy to late reflected sound energy (RER) was between 0.35 and 0.65. This range was extended to 0.3-0.7 to cover reasonable fluctuations and was therefore set as the standard range.
[0050] Meanwhile, two scaling factors, α and β, are preset. The values of α and β are determined through a finite number of simulations and experiments to provide sufficient robustness when noise changes, while avoiding excessive expansion of the range that would lead to a decrease in parameter sensitivity. For example, in system development, simulated data with different noise levels (η from 0 to 1) and different acoustic stability (P1, P2 changes) are used to test the system performance. When α=β=0.5, the parameters τ and ΔG can produce smooth and audibly acceptable change curves at different noise levels, and the system response is not excessive when noise suddenly occurs. The values of α and β are obtained in this way.
[0051] Taking η=0.25 as an example, the dynamic range of reverberation time is first calculated: lower limit T60_low=0.8-0.5*0.8*0.25=0.8-0.1=0.7s; upper limit T60_high=1.5+0.5*1.5*0.25=1.5+0.1875=1.6875s. Therefore, the dynamic range is extended to [0.7s, 1.6875s].
[0052] Similarly, calculate the dynamic range of the reflected energy ratio: lower limit RER_low=0.3-0.5*0.3*0.25=0.3-0.0375=0.2625; upper limit RER_high=0.7+0.5*0.7*0.25=0.7+0.0875=0.7875, i.e. [0.2625, 0.7875].
[0053] Calculations show that when noise interference increases (η=0.25), the width of the tolerance range for normal reverberation time grows from 0.7 seconds (1.5-0.8) to nearly 1.0 second (1.6875-0.7=0.9875), and the width of the tolerance range for the reflection energy ratio grows accordingly (from 0.4 to 0.525). This design is based on the masking effect principle in psychoacoustics. In a high-noise background, the human ear's sensitivity to reverberation length and reflection details decreases. Therefore, a more lenient criterion is adopted to enhance robustness and reduce parameter oscillations caused by measurement fluctuations.
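The range expansion used in the two worked examples can be written as a small helper. The formula below is inferred from the arithmetic in the examples (each bound is moved outward by scale × bound × η). The text does not say whether α and β belong to the two quantities or to the two bounds, so the sketch takes a single hypothetical `scale` argument, which is equivalent when α=β=0.5.

```python
def dynamic_range(std_min, std_max, eta, scale=0.5):
    """Widen [std_min, std_max] in proportion to the correction index eta."""
    low = std_min - scale * std_min * eta    # lower bound moves down
    high = std_max + scale * std_max * eta   # upper bound moves up
    return low, high


eta = 0.25
t60_low, t60_high = dynamic_range(0.8, 1.5, eta)
rer_low, rer_high = dynamic_range(0.3, 0.7, eta)
print(round(t60_low, 4), round(t60_high, 4))  # 0.7 1.6875
print(round(rer_low, 4), round(rer_high, 4))  # 0.2625 0.7875
```

With η=0 the function returns the standard range unchanged, which matches the intent that a quiet environment uses the strict criterion.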
[0054] In S23, the matching degree between the current real-time acoustic environment and the noise-corrected "dynamic reasonable range" defined in S22 is evaluated, and the result is represented by two effective ratios, P1 and P2. The reverberation time series and reflection energy ratio series obtained in S1 are used as inputs, and the dynamic range calculated in the previous step is used as a scale for measurement.
[0055] First, iterate through each T60 data point in the reverberation time series and determine whether it falls within the dynamic range [T60_low, T60_high] (i.e., [0.7s, 1.6875s]). Assuming there are 60 T60 data points in the series, after comparison, 51 points are found to be within this range, so the first effective proportion P1 = 51 / 60 = 0.85. Similarly, iterate through each RER data point in the reflection energy ratio series and determine whether it falls within the dynamic range [RER_low, RER_high] (i.e., [0.2625, 0.7875]). Assuming 42 points are within this range, the second effective proportion P2 = 42 / 60 = 0.7.
[0056] A P1=0.85 indicates that, under the current noise conditions, the measured reverberation time falls within the considered "reasonable" extended range 85% of the time, suggesting that the room's reverberation characteristics are relatively stable and well-suited to the noise environment. A P2=0.7 indicates that the reflected energy structure largely conforms to the extended expectations, but still exhibits a 30% fluctuation or deviation. These two ratios quantify the "consistency" or "stability" of the environmental acoustic characteristics, serving as the direct basis for generating specific control parameters in the next stage. A higher ratio indicates a more "ideal" or stable environment, allowing for smaller corrections to the original signal.
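The effective ratios P1 and P2 are in-range proportions, which can be sketched as follows. The function name `effective_ratio` and the synthetic sequences are illustrative assumptions; the bounds are treated as inclusive, which the text does not specify.

```python
def effective_ratio(values, low, high):
    """Return the fraction of values that fall inside [low, high]."""
    if not values:
        raise ValueError("sequence must not be empty")
    in_range = sum(1 for v in values if low <= v <= high)
    return in_range / len(values)


# Synthetic sequences reproducing the counts in the example:
# 51 of 60 T60 values and 42 of 60 RER values lie inside the dynamic ranges.
t60_series = [1.0] * 51 + [2.0] * 9
rer_series = [0.5] * 42 + [0.9] * 18
p1 = effective_ratio(t60_series, 0.7, 1.6875)
p2 = effective_ratio(rer_series, 0.2625, 0.7875)
print(p1, p2)  # 0.85 0.7
```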
[0057] In S24, the abstract effective ratios P1 and P2 calculated in S23 are converted into specific operating parameters that can directly control the audio processing hardware (digital reverb unit and multi-band equalizer) through a deterministic mapping relationship.
[0058] For digital reverb units, the key control parameter is the decay time coefficient τ (a dimensionless multiplier, where 1.0 represents using a preset standard decay rate). A typical mapping function is: τ = 0.5 + 0.5 * P1. In the example above, P1 = 0.85, then τ = 0.5 + 0.5 * 0.85 = 0.925. This means that the actual decay time of the digital reverb unit will be set to 92.5% of the standard value. When P1 is high, τ is close to 1, and the reverb decays slowly to preserve the sense of space; if P1 is very low (e.g., 0.2), then τ = 0.6, and the reverb decays faster to improve clarity in unstable or poor acoustic conditions.
[0059] For a multi-band equalizer, the parameter is the gain offset ΔG_i (in dB) at each center frequency. Mapping is applied to the mid-high frequency range (e.g., 2kHz, 4kHz) because these bands are more important for vocal clarity and presence. An example mapping is: ΔG_i = k * (1 - P2), where k is a maximum compensation coefficient (e.g., 4dB). When P2 = 0.7, ΔG_i = 4 * (1 - 0.7) = 1.2dB. This means that a gain of +1.2dB will be applied to the corresponding mid-high frequency band (e.g., 2kHz). The logic behind this mapping is: the lower the uniformity of the reflected energy ratio (smaller P2), the more likely the room's reflection structure is absorbing or dispersing the mid-high frequencies excessively, requiring greater equalization compensation to reproduce the timbre; conversely, a higher P2 results in less compensation.
[0060] At this point, the S2 stage outputs τ and a set of ΔG_i, which are precise processing parameters adapted to the current acoustic environment based on a comprehensive assessment of noise level and environmental stability.
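The two example mappings of S24 are linear and can be checked directly. The sketch below uses the example constants from the text (offset and slope of 0.5 for τ, k=4dB for the equalizer); the function names are hypothetical.

```python
def decay_coefficient(p1):
    """Map the first effective ratio P1 to the decay time coefficient tau."""
    return 0.5 + 0.5 * p1


def gain_offset_db(p2, k_db=4.0):
    """Map the second effective ratio P2 to a gain offset in dB."""
    return k_db * (1.0 - p2)


print(round(decay_coefficient(0.85), 3))  # 0.925
print(round(decay_coefficient(0.2), 3))   # 0.6
print(round(gain_offset_db(0.7), 3))      # 1.2
```

For P1 and P2 in [0, 1], τ stays within [0.5, 1.0] and ΔG_i within [0, k] dB, so neither parameter can leave its intended range.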
[0061] As shown in Figure 2, in one embodiment, S3, processing the human voice audio signal at the target time based on the operating parameters and outputting a first energy value includes: S31. Input the human voice audio signal into the digital reverberator, use the operating parameters of the digital reverberator to control the reverberation tail decay rate, and obtain the reverberated human voice audio signal. S32. Input the reverberated human voice audio signal into the multi-band equalizer, use the operating parameters of the multi-band equalizer to adjust the gain of each frequency band, and obtain the reverberated and equalized human voice audio signal. S33. Square the reverberated and equalized human voice audio signal and integrate it within a 50-millisecond sliding window to obtain the first energy value.
[0062] In this embodiment, it should be noted that in S31, the original time-domain signal x_vocal(t) of the human voice picked up at the target time is input to the digital reverberator, and the decay time coefficient τ (e.g., τ=0.92) generated in step S24 is applied to control the decay rate of the reverberation tail. The digital reverberator typically adopts a feedback delay network (FDN) structure, the core of which consists of multiple parallel delay lines (e.g., 8 lines, with coprime delay times) and a corresponding feedback matrix. The decay time coefficient τ is directly related to the desired target reverberation time T60_desired, with the relationship T60_desired=τ*T60_base, where T60_base is a preset reference reverberation time (e.g., 1.2 seconds).
[0063] In the FDN, the feedback loop gain g_i for each delay line is set to g_i = 10^(-3*d_i / (T60_desired*fs)), where d_i is the length of the delay line (number of sampling points) and fs is the sampling rate (48kHz). Therefore, when τ = 0.92, T60_desired ≈ 1.1 seconds. For a delay line with a length of 2400 sampling points (50 milliseconds), its feedback gain is g_i ≈ 10^(-3*2400 / (1.1*48000)) = 10^(-0.136) ≈ 0.73. This gain value determines the proportion of signal retained each time it passes through the delay line. The larger τ is (the longer T60_desired), the closer g_i is to 1, the slower the signal decays, and the longer the reverberation tail lasts; conversely, the decay is faster.
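The feedback gain formula can be evaluated numerically as a sanity check. The sketch below is illustrative only; the function name `fdn_feedback_gain` is an assumption. Evaluating the formula for the 2400-sample line with T60_desired=1.1s gives roughly 0.73, and the second print confirms the defining property: after T60_desired seconds of recirculation the signal has dropped by 60 dB, i.e. to 10^-3 of its amplitude.

```python
def fdn_feedback_gain(delay_samples, t60_seconds, fs=48000):
    """Per-pass gain so that a delay line decays by 60 dB in t60_seconds."""
    return 10.0 ** (-3.0 * delay_samples / (t60_seconds * fs))


g = fdn_feedback_gain(2400, 1.1)
print(round(g, 2))  # 0.73

# Number of passes through the 50 ms line during T60, and the resulting level.
passes = 1.1 * 48000 / 2400
print(g ** passes)  # approximately 0.001, i.e. -60 dB
```

Shorter delay lines receive gains closer to 1 under the same formula, so all lines reach -60 dB at the same time.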
[0064] Inside the parameterized FDN, the original human voice signal is delayed and attenuated along the different delay lines, and these copies are superimposed with the direct input sound. The output signal x_vocal_rev(t) contains the controlled artificial reverberation effect, and its reverberation characteristics are consistent with the decay rate specified by τ.
[0065] In step S32, the reverberated vocal signal x_vocal_rev(t) is input to the multi-band equalizer. This equalizer is parametric: for each of N (e.g., 8) frequency bands it has a preset center frequency f_i, a bandwidth (determined by the Q value, e.g., Q=1.414 corresponds to a one-octave bandwidth), and an initial gain G_i_base (possibly 0dB). In step S24, a gain offset ΔG_i is calculated for each center frequency f_i (e.g., for the 2kHz band, ΔG_2k=+1.2dB). The actual operating gain of the equalizer is G_i_actual=G_i_base+ΔG_i. For digital filter implementations (e.g., using a biquad, i.e., second-order, filter structure), this gain value is directly used to calculate the filter coefficients.
[0066] Taking the 2kHz band as an example, when a peak gain of +1.2dB is required, the corresponding gain parameter in the filter transfer function is set to a coefficient factor of 10^(1.2 / 20)≈1.15. After the signal x_vocal_rev(t) flows through all the parallel filter bands updated according to this parameter, its spectral shape is reshaped: in bands where ΔG_i is positive (such as the mid-high frequencies in the previous example), the signal energy is boosted; in bands where ΔG_i is zero or negative, it is maintained or attenuated. The output signal x_vocal_proc(t) then completes the predetermined spectral compensation. This operation aims to specifically correct the insufficient perception of specific frequency bands of human voice caused by room reflection structures (via P2 mapping) or noise masking effects.
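To make the link between the gain in dB and the filter coefficients concrete, the sketch below builds one peaking band from the widely used Audio EQ Cookbook (Robert Bristow-Johnson) formulas and checks its response at the center frequency. The text does not state which coefficient design is used, so this choice, as well as the names `peaking_biquad` and `magnitude_db`, is an assumption made for illustration.

```python
import cmath
import math


def peaking_biquad(f0_hz, gain_db, q, fs_hz=48000.0):
    """Return normalized (b, a) coefficients of a peaking EQ biquad."""
    a_lin = 10.0 ** (gain_db / 40.0)      # square root of the linear peak gain
    w0 = 2.0 * math.pi * f0_hz / fs_hz
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha / a_lin              # used to normalize so that a[0] == 1
    b = [(1.0 + alpha * a_lin) / a0, -2.0 * cos_w0 / a0, (1.0 - alpha * a_lin) / a0]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha / a_lin) / a0]
    return b, a


def magnitude_db(b, a, f_hz, fs_hz=48000.0):
    """Evaluate the filter magnitude response in dB at frequency f_hz."""
    z_inv = cmath.exp(-2j * math.pi * f_hz / fs_hz)
    num = b[0] + b[1] * z_inv + b[2] * z_inv ** 2
    den = a[0] + a[1] * z_inv + a[2] * z_inv ** 2
    return 20.0 * math.log10(abs(num / den))


b, a = peaking_biquad(2000.0, 1.2, 1.414)
print(round(magnitude_db(b, a, 2000.0), 3))  # 1.2
```

At the center frequency the linear gain equals 10^(1.2 / 20)≈1.15, the factor quoted in the text, while the response returns to 0 dB far from the band.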
[0067] In S33, the energy of the final equalized human voice signal x_vocal_proc(t) is estimated in real time, and the first energy value E1(t) is output. Specifically, for each discrete sampling time n (corresponding to time t=n / fs), the current sample value x_vocal_proc[n] is taken, and its instantaneous power is approximated by squaring it to obtain x_vocal_proc[n]^2. This squared value is then incorporated into a sliding window of length L=50 milliseconds for accumulation and integration. Since the sampling rate fs=48kHz, the window length corresponds to L_samples=0.05*48000=2400 sampling points.
[0068] Therefore, E1[n] at time n (i.e., the discrete form of E1(t)) is calculated as the sum of the squares of samples n-L_samples+1 to sample n (a total of 2400 samples): E1[n]=Σ_{k=n-2399}^{n}(x_vocal_proc[k]^2). This operation is essentially a short-time energy calculation, with the window length (50 milliseconds) chosen to balance instantaneous response (capable of reflecting the onset of syllables or notes) and smoothness (avoiding drastic fluctuations caused by extreme values of individual samples).
[0069] For example, assuming that within a 2400-sample window, the sum of the squares of the sample values of x_vocal_proc[n] is 3250, then the output E1(t) at this moment is 3250 (dimensionless, but proportional to the audio power). This value reflects the instantaneous loudness level of the human voice signal after spatialization and spectral shaping.
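The sliding-window sum E1[n] can be computed with a running total instead of re-summing 2400 squares at every sample. The following is a minimal sketch under two assumptions not stated in the text: samples before the start of the signal count as zero, and the helper name `short_time_energy` is hypothetical.

```python
from collections import deque


def short_time_energy(samples, window=2400):
    """Return E[n] = sum of squared samples over the last `window` samples."""
    squares = deque()
    running = 0.0
    energies = []
    for x in samples:
        sq = x * x
        squares.append(sq)
        running += sq                      # add the newest squared sample
        if len(squares) > window:
            running -= squares.popleft()   # drop the one leaving the window
        energies.append(running)
    return energies


# Tiny demonstration with a 2-sample window instead of 2400.
print(short_time_energy([1.0, 2.0, 3.0, 4.0], window=2))  # [1.0, 5.0, 13.0, 25.0]
```

On long streams a running sum of floating-point values can accumulate rounding error, so a practical implementation might recompute the sum from the buffer periodically.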
[0070] As shown in Figure 3, in one embodiment, S3, processing the accompaniment audio signal at the target time based on the operating parameters and outputting the second energy value includes: S34. Input the accompaniment audio signal into the digital reverberator, use the operating parameters of the digital reverberator to control the reverberation tail decay rate, and obtain the reverberated accompaniment audio signal. S35. Input the reverberated accompaniment audio signal into the multi-band equalizer, adjust the gain of each frequency band using the operating parameters of the multi-band equalizer, and obtain the reverberated and equalized accompaniment audio signal. S36. Square the reverberated and equalized accompaniment audio signal and integrate it within a 50-millisecond sliding window to obtain the second energy value.
[0071] In this embodiment, it should be noted that in S34, the accompaniment audio signal x_accomp(t) undergoes the exact same digital reverb processing as in step S31. This means that the accompaniment signal is fed into another digital reverb instance with the same algorithmic structure (such as FDN), or shares the same reverb with the vocal processing but undergoes time-division multiplexing. Crucially, this reverb uses the exact same decay time coefficient τ (0.92 in the previous example) as when processing vocals to configure its internal parameters (feedback gain, etc.).
[0072] Therefore, the reverberation decay time characteristics experienced by the accompaniment signal (T60_desired≈1.1 seconds) are strictly consistent with the vocal signal, avoiding the spatial perception tearing caused by the vocal having a reverberant effect while the accompaniment is dry or has other inconsistent reverberation characteristics. The processed accompaniment signal x_accomp_rev(t) matches the vocal signal x_vocal_rev(t) in the time constant of the reverberation tail decay.
[0073] In step S35, the reverberated accompaniment signal x_accomp_rev(t) is input to a multi-band equalizer, and the same gain offset {ΔG_i} generated in step S24 is applied, identical to that in step S32. That is, the gain adjustment ΔG_i applied to each frequency band (e.g., 2kHz, 4kHz, etc.) of the equalizer through which the accompaniment signal passes is exactly the same as that applied during vocal processing (e.g., an additional +1.2dB at 2kHz). This ensures that the accompaniment signal and the vocal signal undergo completely identical spectral shaping strategies.
[0074] This processing is not intended to alter the timbre balance of the accompaniment itself, but rather to achieve two key objectives: First, to compensate for potential energy loss in the same frequency bands due to room acoustic deficiencies, ensuring spectral integrity; second, and more importantly, to ensure that when vocals and accompaniment are weighted and mixed in subsequent steps, their relative energy relationships across frequency bands are not distorted by the room acoustic deficiencies compensation operation because both have undergone the same pre-equalization processing, guaranteeing the harmony of the mixed timbre. The output signal is x_accomp_proc(t).
[0075] In S36, the same algorithm and parameters as in S33 are used to perform short-time energy calculation on the processed accompaniment signal x_accomp_proc(t). That is, each sample is squared and integrated and summed within a sliding window of 50 milliseconds (2400 sampling points) to obtain the discrete sequence E2[n] of the second energy value E2(t).
[0076] For example, at the same time n, assuming that the sum of the squares of the 2400 samples of x_accomp_proc ending at sample n is 1800, then E2[n] = 1800. Thus, two parallel energy streams E1(t) and E2(t) based on the same time window and calculation method are obtained.
[0077] As shown in Figure 4, in one embodiment, calculating the short-time energy of the first energy value and the second energy value in S4 includes: S41. Apply Hanning window weighting to the first and second energy values respectively, divide them into frames with a 200-millisecond frame length and 50% overlap, and calculate the arithmetic mean of the energy values in each frame to obtain the short-time energy sequence of the human voice and the short-time energy sequence of the accompaniment.
[0078] In this embodiment, it should be noted that in S41, the continuous energy values E1(t) and E2(t) output by S33 and S36 are further time-normalized to obtain a vocal short-time energy sequence and an accompaniment short-time energy sequence that are more suitable for rhythm / phrase level energy analysis.
[0079] The specific operation consists of three sub-steps: First, Hanning window weighting is applied to the discrete sequences of E1(t) and E2(t), respectively. The window function is w[m]=0.5*(1-cos(2πm / (M-1))), where M is the number of samples corresponding to the window length. This weighting aims to reduce spectral leakage caused by discontinuities at frame boundaries during subsequent framing, making the transition between frames smoother.
[0080] Secondly, the weighted energy sequence is divided with a frame length of 200 milliseconds (corresponding to 9600 sampling points) and a frame shift of 100 milliseconds (50% overlap, i.e., 4800 points). Overlapping framing improves temporal resolution.
[0081] Finally, the arithmetic mean is calculated over all energy values within each frame (e.g., the nth frame), i.e., all E1[n] or E2[n] values falling in that frame's time range. Assuming the average vocal energy within the nth frame is 105 and that of the accompaniment is 65, then E1_frame[n] = 105 and E2_frame[n] = 65. Through this processing, the original instantaneous energy with high temporal resolution is transformed into a frame-level energy sequence with a lower rate (10Hz in this example, since the frame shift is 100 milliseconds) but greater stability, which better matches human perception of the average loudness over the duration of a syllable or short musical phrase.
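The three sub-steps of S41 (window weighting, overlapped framing, per-frame mean) can be sketched as follows. This is one possible reading of the text: the Hanning weights are applied inside each frame before the arithmetic mean is taken, and incomplete trailing frames are dropped. Both choices, and the function names, are assumptions.

```python
import math


def hann(length):
    """Hanning window w[m] = 0.5 * (1 - cos(2*pi*m / (M - 1)))."""
    if length < 2:
        raise ValueError("window length must be at least 2")
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * m / (length - 1)))
            for m in range(length)]


def frame_energies(energy, frame_len=9600, hop=4800):
    """Return the Hanning-weighted mean energy of each overlapping frame."""
    window = hann(frame_len)
    frames = []
    # Frames start every `hop` samples; only complete frames are used.
    for start in range(0, len(energy) - frame_len + 1, hop):
        segment = energy[start:start + frame_len]
        weighted_sum = sum(w * e for w, e in zip(window, segment))
        frames.append(weighted_sum / frame_len)
    return frames


# Small demonstration: constant energy, 8-sample frames, 50% overlap.
print(frame_energies([1.0] * 16, frame_len=8, hop=4))
```

Note that the window scales a constant energy level by about one half. Since the vocal and accompaniment sequences are weighted identically, the ratio used later for the fusion weights is not affected.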
[0082] As shown in Figure 4, in one embodiment, S4, calculating the accompaniment fusion weight and vocal fusion weight based on short-time energy and the target vocal proportion under a preset singing mode includes: S42. Obtain the target vocal proportion corresponding to the current singing mode, and calculate the sum of the short-time energy of the vocals and the short-time energy of the accompaniment in each frame. S43. If the sum of a frame is greater than zero, the product of the target vocal proportion and the ratio of the vocal short-time energy of that frame to the sum is used as the vocal fusion weight of that frame, and the remainder (one minus the vocal fusion weight) is used as the accompaniment fusion weight of that frame. S44. If the sum of a frame is equal to zero, the vocal fusion weight and accompaniment fusion weight of that frame are both set to 0.5.
[0083] In this embodiment, it should be noted that in S42, the artistic target benchmark for the current audio mixing is determined, namely, the target vocal proportion ρ. This value is not fixed but is bound to the "singing mode" selected by the user. An internal mapping table is preset, for example: "solo mode" is mapped to ρ=0.65, "chorus mode" is mapped to ρ=0.55, and "practice mode" (focusing on hearing one's own voice) is mapped to ρ=0.70. These ρ values (range 0.4-0.7) represent the target proportion of vocal energy to total energy (voice + accompaniment) in the ideal mixed output. This setting is based on psychological research on human ear preferences for the balance between foreground vocals and background accompaniment in different contexts.
[0084] For example, in solo mode, ρ=0.65 means that the vocal energy is expected to account for approximately 65% and the accompaniment approximately 35% in the final output, to highlight the singer. This step is performed before the fusion calculation for each frame, reading the ρ value corresponding to the current activation mode as a constant reference point for subsequent weight calculations. It embodies an interface that allows users to intentionally intervene in the adaptive processing flow.
[0085] In S43, the fusion weights for vocals and accompaniment are calculated in a regular audio frame (total energy greater than zero). For the nth frame, the inputs are the short-time vocal energy E1_frame[n] and short-time accompaniment energy E2_frame[n] calculated in S41, and the target vocal proportion ρ obtained in S42. First, the total energy S[n] = E1_frame[n] + E2_frame[n] is calculated. The formula for calculating the vocal fusion weight w_v[n] is: w_v[n] = ρ * (E1_frame[n] / S[n]). The accompaniment fusion weight is w_a[n] = 1 - w_v[n].
[0086] Taking the aforementioned data as an example (E1_frame[n]=105, E2_frame[n]=65, ρ=0.65), S[n]=170 and the actual energy proportion of the human voice is 105 / 170≈0.6176. Then w_v[n]=0.65*0.6176≈0.4014 and w_a[n]=1-0.4014=0.5986. The formula uses the target proportion ρ as a scaling factor applied to the current actual proportion of the human voice, so w_v[n] follows the singer's frame-level energy and never exceeds ρ. In this example the actual proportion of the human voice (0.6176) is below the target (0.65); the weights are recomputed for every frame so that the mix tracks the singer's level while remaining anchored to the target proportion.
[0087] S44 is a special processing branch of S43, used to handle the extreme case where the total energy S[n] within a frame is zero. In audio signal processing, S[n]=0 means that for a frame of up to 200 milliseconds, the processed signal is numerically considered completely silent (or with negligible energy). This can happen during song interludes, singer pauses, etc. In this case, the calculation formula in S43 will have a division-by-zero error, and the weighting based on energy ratios will become meaningless. Therefore, S44 sets a default, neutral weighting scheme: both the vocal blending weight w_v[n] and the accompaniment blending weight w_a[n] of the current frame are set to 0.5. This means that in silent or near-silent frames, no energy-based biased gain adjustment is applied to the two signals; instead, they are blended with equal weights. This ensures that the weights do not jump unpredictably when transitioning from silent to spoken segments, maintaining the continuity and stability of the processing.
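The two branches S43 and S44 fit in one small function. The sketch below follows the formulas given in the text; the function name `fusion_weights` is hypothetical, and because energies are non-negative the zero-energy branch is tested with `<=` as a defensive choice.

```python
def fusion_weights(e_vocal, e_accomp, rho):
    """Return (w_v, w_a) for one frame from frame energies and target rho."""
    total = e_vocal + e_accomp
    if total <= 0.0:
        # S44: silent frame, avoid division by zero with neutral weights.
        return 0.5, 0.5
    # S43: target proportion scaled by the actual vocal energy share.
    w_v = rho * (e_vocal / total)
    return w_v, 1.0 - w_v


w_v, w_a = fusion_weights(105.0, 65.0, 0.65)
print(w_v, w_a)  # roughly 0.401 and 0.599
print(fusion_weights(0.0, 0.0, 0.65))  # (0.5, 0.5)
```

In the S43 branch the two weights always sum to 1, and w_v reaches its maximum value ρ only when the accompaniment energy of the frame is zero.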
[0088] As shown in Figure 4, in one embodiment, generating the target audio signal based on the accompaniment fusion weight and the vocal fusion weight in step S4 includes: S45. Frame by frame, multiply the reverberated and equalized accompaniment audio signal and human voice audio signal by the corresponding accompaniment fusion weight and vocal fusion weight respectively, and add the two weighted signals sample by sample to generate the target audio signal.
[0089] In this embodiment, it should be noted that S45 is the final output synthesis stage. Its inputs include: the vocal signal x_vocal_proc(t) and accompaniment signal x_accomp_proc(t) output from S32 and S35, which have undergone environmental adaptive processing (reverb + equalization); and the frame-divided fusion weight sequences w_v[n] and w_a[n] calculated by S43 or S44.
[0090] First, the continuous audio signals x_vocal_proc(t) and x_accomp_proc(t) need to be segmented into frames according to the same timing reference as S41 (i.e., the same 200-millisecond frame length and 50% overlapping frame start points). For the time interval T_n of the nth frame (e.g., from time t_n to t_n+200ms), all vocal and accompaniment samples within this interval are extracted.
[0091] Then, the fusion weights w_v[n] and w_a[n] corresponding to the frame are used as multipliers and applied to all corresponding signal samples within the frame. That is, for each time point t in the T_n interval, the target audio signal y(t) is generated according to the following formula: y(t)=w_v[n]*x_vocal_proc(t)+w_a[n]*x_accomp_proc(t).
[0092] Taking the aforementioned weights (w_v[n]≈0.4014, w_a[n]≈0.5986) as an example, within the corresponding 200 milliseconds, each vocal sample is multiplied by 0.4014, and each accompaniment sample is multiplied by 0.5986, and then they are added point by point to obtain the output sample. This operation realizes real-time dynamic adjustment based on the energy after signal processing and the user's target, and the final y(t) is the target audio signal.
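The per-frame synthesis y(t)=w_v[n]*x_vocal_proc(t)+w_a[n]*x_accomp_proc(t) can be illustrated for a single frame. The sketch is deliberately limited to one frame: the text does not specify how samples covered by two overlapping frames are reconciled, so that part is left out. The function name `mix_frame` and the sample values are assumptions.

```python
def mix_frame(vocal, accomp, w_v, w_a):
    """Weight and sum the vocal and accompaniment samples of one frame."""
    if len(vocal) != len(accomp):
        raise ValueError("vocal and accompaniment frames must have equal length")
    # Sample-by-sample weighted addition within the frame.
    return [w_v * v + w_a * a for v, a in zip(vocal, accomp)]


# Two-sample toy frame with the weights 0.4 and 0.6.
print(mix_frame([1.0, 0.0], [0.0, 1.0], 0.4, 0.6))  # [0.4, 0.6]
```

Because the weights are constant within a frame and change only at frame boundaries, a practical implementation might interpolate between consecutive weights to avoid audible steps; that refinement is not described in the text.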
[0093] An audio data processing system is also provided, which is used to implement the audio data processing method in any of the above embodiments. The system includes: The acoustic perception module is used to acquire the spatial impulse response signal of the time period to be processed before the target time through a microphone array deployed in the singing space, and to acquire the reverberation time sequence, reflection energy ratio sequence and background noise density sequence based on the spatial impulse response signal; The parameter configuration module is used to obtain the decay time coefficient of the digital reverberator and the gain offset of the multi-band equalizer at each center frequency point based on the reverberation time sequence, reflection energy ratio sequence and background noise density sequence. The signal processing module is used to process the human voice audio signal and the accompaniment audio signal based on the decay time coefficient and the gain offset, respectively. The fusion control module is used to calculate the short-time energy of the processed human voice audio signal and the accompaniment audio signal, combine the target human voice proportion in the preset singing mode to generate frame-by-frame fusion weights, and generate the target audio signal based on the frame-by-frame fusion weights.
[0094] In this embodiment, the audio data processing system is deployed in a typical KTV singing room. The room measures 5 meters × 4 meters × 2.8 meters. The walls are made of plasterboard, wood veneer, and some sound-absorbing cotton. The floor is carpeted, and the ceiling has a suspended structure. A circular microphone array consisting of eight omnidirectional condenser microphones is installed in this space. The microphones are evenly distributed on a circle with a diameter of 0.6 meters, with the center point located 1.5 meters above the ground. This center point also houses the main microphone for voice pickup. The microphone array is connected to a central processing unit (CPU) via a multi-channel audio interface card. The CPU is an industrial computer equipped with an Intel Core i7-12700 processor, 32GB of memory, and a dedicated audio DSP chip. It internally stores the computer program for implementing the method of this invention.
[0095] After system startup, the acoustic environment perception phase is executed first. The central processing unit controls the microphone array to continuously acquire environmental audio data during a processing period (set to the most recent 30 seconds) before the target time (e.g., the moment the user presses the "Start Singing" button). During this period, the system periodically plays a test signal with known characteristics. This test signal is a composite excitation signal with a length of 1 second, a sampling rate of 48kHz, and containing a logarithmically swept sine wave from 20Hz to 20kHz superimposed with white noise, played through a full-range speaker installed in the middle of the front wall of the room. The microphone array synchronously records the response of this test signal in space, obtaining raw impulse response data. Subsequently, the system performs noise reduction, alignment, and windowing processing on the response signals acquired by each microphone, and uses the maximum length sequence (MLS) deconvolution algorithm or inverse filtering technique to extract the spatial impulse response signal from the raw response. This spatial impulse response signal characterizes the transmission path characteristics between the sound source and each microphone, containing complete information such as direct sound, early reflections, and reverberation tails.
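The composite excitation described above can be sketched as an exponential (logarithmic) sine sweep with optional additive noise. This is a hedged illustration, not the system's actual signal generator: the function names, the uniform noise distribution, the `noise_level` parameter, and the fixed seed are all assumptions.

```python
import math
import random


def log_sweep(f_start=20.0, f_end=20000.0, duration=1.0, fs=48000,
              noise_level=0.0, seed=0):
    """Exponential sine sweep from f_start to f_end with optional white noise."""
    rate = math.log(f_end / f_start)
    scale = 2.0 * math.pi * f_start * duration / rate
    rng = random.Random(seed)           # seeded so the excitation is repeatable
    n = int(duration * fs)
    samples = []
    for i in range(n):
        t = i / fs
        # Phase whose derivative is the exponentially rising frequency.
        phase = scale * (math.exp(t / duration * rate) - 1.0)
        noise = noise_level * rng.uniform(-1.0, 1.0)
        samples.append(math.sin(phase) + noise)
    return samples


def sweep_frequency(t, f_start=20.0, f_end=20000.0, duration=1.0):
    """Instantaneous frequency of the sweep at time t (seconds)."""
    return f_start * math.exp(t / duration * math.log(f_end / f_start))


sweep = log_sweep()
```

Because the excitation is known exactly, the recorded response can later be deconvolved against it to recover the impulse response.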
[0096] Based on the acquired spatial impulse response signal, the system further calculates three key acoustic feature sequences: reverberation time sequence, reflection energy ratio sequence, and background noise density sequence. The reverberation time sequence is obtained as follows: the spatial impulse response signal is divided into continuous 500-millisecond non-overlapping segments along the time axis; for each segment, the energy decay curve is calculated using the Schroeder integral method, and the T20 or T30 reverberation time value is fitted to form the reverberation time sequence. The reflection energy ratio sequence is obtained as follows: in the spatial impulse response signal, the arrival time of the direct sound is used as a reference, with the subsequent 50 milliseconds defined as the early reflection window and the period from 50 milliseconds to 300 milliseconds defined as the late reflection window; the root mean square value of the signal energy within each window is calculated, and their ratio is taken as the reflection energy ratio, also forming a sequence according to time segments. The method for obtaining the background noise density sequence is as follows: during the period when the test signal is not played, pure ambient noise is collected using a microphone array, and it is converted to the frequency domain by short-time Fourier transform (STFT). The power spectral density of each frequency band (such as 1 / 3 octave) is calculated, and the entire frequency band is weighted averaged or the average value of a specific sensitive frequency band (such as 1kHz–4kHz) is taken to form the background noise density sequence.
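The reverberation time estimate can be illustrated with a small sketch of Schroeder backward integration followed by a T20-style line fit. It processes one impulse response segment, fits between -5dB and -25dB (an assumed fitting range), and extrapolates to 60dB of decay. The function name and the synthetic test response are assumptions for illustration only.

```python
import math


def schroeder_t60(h, fs, db_start=-5.0, db_end=-25.0):
    """Estimate T60 from an impulse response via Schroeder backward integration."""
    # Backward cumulative sum of squared samples gives the energy decay curve.
    edc = [0.0] * len(h)
    acc = 0.0
    for i in range(len(h) - 1, -1, -1):
        acc += h[i] * h[i]
        edc[i] = acc
    total = edc[0] if edc else 0.0
    if total <= 0.0:
        raise ValueError("impulse response has no energy")
    # Collect (time, level in dB) points inside the fitting range.
    pts = []
    for i, e in enumerate(edc):
        if e <= 0.0:
            break
        level = 10.0 * math.log10(e / total)
        if db_end <= level <= db_start:
            pts.append((i / fs, level))
    if len(pts) < 2:
        raise ValueError("not enough decay range for a fit")
    # Ordinary least-squares slope in dB per second.
    mt = sum(t for t, _ in pts) / len(pts)
    ml = sum(l for _, l in pts) / len(pts)
    slope = (sum((t - mt) * (l - ml) for t, l in pts)
             / sum((t - mt) ** 2 for t, _ in pts))
    return -60.0 / slope


# Synthetic exponentially decaying response with a known T60 of 0.5 s.
fs = 8000
k = 3.0 * math.log(10.0) / 0.5      # amplitude decay rate for 60 dB in 0.5 s
h = [math.exp(-k * n / fs) for n in range(fs)]
t60 = schroeder_t60(h, fs)
```

Applying the function to each 500-millisecond segment in turn would yield the reverberation time sequence described above.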
[0097] After obtaining the three sequences mentioned above, the system enters the parameter configuration phase. First, the number of data points in the background noise density sequence that exceed a preset noise density threshold (e.g., -40dBFS) is calculated and denoted as the excess number N_excess; then, it is divided by the total sequence length N_total to obtain the correction index η = N_excess / N_total. This correction index is used to quantify the degree of interference of the current ambient noise on audio processing. Next, the system reads the preset standard reverberation time range [T60_min_std, T60_max_std], for example [0.8s, 1.5s], and calculates the reverberation time dynamic range according to the correction index η and the constant α = 0.5: lower limit T60_low = T60_min_std - α × T60_min_std × η, upper limit T60_high = T60_max_std + α × T60_max_std × η. Similarly, read the standard reflectance energy ratio range [RER_min_std, RER_max_std] (e.g., [0.3, 0.7]), and calculate the dynamic range of reflectance energy ratio based on β=0.5: lower limit RER_low=RER_min_std-β×RER_min_std×η, upper limit RER_high=RER_max_std+β×RER_max_std×η.
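The correction index and the two dynamic ranges follow directly from the formulas above. The sketch below uses hypothetical function names and a made-up four-point noise sequence; only the threshold, the standard ranges, and the coefficient 0.5 come from the text.

```python
def correction_index(noise_db, threshold_db=-40.0):
    """Fraction of noise-density points above the threshold (eta)."""
    if not noise_db:
        raise ValueError("empty noise density sequence")
    return sum(1 for v in noise_db if v > threshold_db) / len(noise_db)


def dynamic_range(std_min, std_max, eta, coeff=0.5):
    """Widen [std_min, std_max] on both sides in proportion to eta."""
    return (std_min - coeff * std_min * eta,
            std_max + coeff * std_max * eta)


# Illustrative noise sequence in dBFS: two of four points exceed -40 dBFS.
eta = correction_index([-45.0, -38.0, -42.0, -35.0])
t60_low, t60_high = dynamic_range(0.8, 1.5, eta)    # alpha = 0.5
rer_low, rer_high = dynamic_range(0.3, 0.7, eta)    # beta = 0.5
```

For this input η = 0.5, so the reverberation interval widens from [0.8s, 1.5s] to [0.6s, 1.875s] and the reflection energy ratio interval from [0.3, 0.7] to [0.225, 0.875].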
[0098] Subsequently, the system statistically analyzes the proportion of data points in the reverberation time series falling within the [T60_low, T60_high] interval, which is taken as the first effective proportion P1; similarly, it statistically analyzes the proportion of data points in the reflection energy ratio (RER) sequence falling within the [RER_low, RER_high] interval, which is taken as the second effective proportion P2. These two proportions reflect the degree of consistency of the current acoustic environment within a reasonable range after dynamic adjustment. The system's built-in mapping function f1(P1) converts P1 into the decay time coefficient τ of the digital reverberation unit. For example, when P1=1, τ=1.0 (i.e., using the standard reverberation time), and when P1=0, τ=0.5 (significantly shortening the reverberation tail). Another mapping function f2(P2) converts P2 into the gain offset ΔG_i of the multi-band equalizer at each center frequency (such as 100Hz, 250Hz, 500Hz, 1kHz, 2kHz, 4kHz, 8kHz, 16kHz). For example, the higher P2 is, the closer ΔG_i is to 0, indicating a smaller equalization compensation. Conversely, a lower P2 applies positive gain in the mid-to-high frequency band to improve clarity.
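A minimal sketch of the effective proportion and of the mapping f1 follows. The text fixes only the endpoints of f1 (τ=0.5 at P1=0 and τ=1.0 at P1=1); the linear interpolation between them, and the function names, are assumptions.

```python
def effective_proportion(values, low, high):
    """Share of values inside the closed interval [low, high]."""
    if not values:
        raise ValueError("empty sequence")
    return sum(1 for v in values if low <= v <= high) / len(values)


def decay_coefficient(p1, tau_min=0.5, tau_max=1.0):
    """Assumed linear f1: p1 = 0 gives tau_min, p1 = 1 gives tau_max."""
    p1 = min(max(p1, 0.0), 1.0)
    return tau_min + (tau_max - tau_min) * p1


# Illustrative reverberation time sequence checked against [0.6 s, 1.875 s].
p1 = effective_proportion([0.7, 0.9, 1.2, 2.1], 0.6, 1.875)
tau = decay_coefficient(p1)
```

Here three of the four values lie inside the interval, so P1 = 0.75 and the assumed linear mapping gives τ = 0.875.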
[0099] After parameter configuration, the system enters the signal processing stage. At this point, the user begins singing, with the vocal audio signal input through the main microphone at a 48kHz sampling rate, and the accompaniment audio signal input synchronously from a local media player or cloud streaming service. For the vocal audio signal, it is first fed into a digital reverb module. This module employs an algorithm structure based on a feedback delay network (FDN), where the internal delay line length and feedback coefficient are dynamically adjusted by the decay time coefficient τ, thereby controlling the decay rate at the end of the reverb. The reverb-processed signal is then fed into a multi-band equalizer, an 8-band parametric equalizer with a fixed center frequency and preset Q value. The gain value is obtained by superimposing ΔG_i onto the original preset value, completing spectrum shaping. The processed vocal audio signal x_vocal_proc(t) is then fed into an energy calculation unit. This unit squares the signal sample by sample and integrates (sums) it within a 50-millisecond sliding window (i.e., 2400 sampling points), outputting the first energy value E1(t). Simultaneously, the accompaniment audio signal undergoes the same processing as the vocal audio signal to obtain the second energy value E2(t).
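The energy calculation unit can be sketched as a running sum of squared samples. The sketch assumes a trailing (causal) window, which the text does not specify, and `sliding_energy` is a hypothetical name.

```python
def sliding_energy(x, window):
    """Sum of squared samples over the most recent `window` samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    acc = 0.0
    for i, s in enumerate(x):
        acc += s * s                      # sample entering the window
        if i >= window:
            acc -= x[i - window] ** 2     # sample leaving the window
        out.append(acc)
    return out


fs = 48000
window = round(0.050 * fs)                # 50 ms at 48 kHz
# Tiny demonstration with a 2-sample window so the values are easy to check.
e1 = sliding_energy([1.0, 2.0, 3.0, 4.0], window=2)
```

The running-sum form costs one addition and one subtraction per sample. Over long streams the accumulator can drift slightly in floating point, so a real implementation might recompute the sum periodically.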
[0100] Next, the system enters the fusion control phase. The system applies Hanning window weighting to E1(t) and E2(t) respectively, and performs frame-by-frame processing with a frame length of 200 milliseconds (9600 sampling points) and 50% frame overlap (i.e., one frame is output every 100 milliseconds). The system then calculates the arithmetic mean of all energy values within each frame to obtain the short-time energy sequence of the vocals {E1_frame[n]} and the short-time energy sequence of the accompaniment {E2_frame[n]}. Simultaneously, the system reads the target vocal proportion ρ corresponding to the current singing mode (e.g., "solo mode," "chorus mode," "practice mode"). This value is preset in the system configuration file; for example, ρ=0.65 for solo mode and ρ=0.55 for practice mode, with a range limited to 0.4 to 0.7. For each frame n, the total energy S[n] = E1_frame[n] + E2_frame[n] is calculated. If S[n]>0, then the vocal blending weight w_v[n]=ρ×(E1_frame[n] / S[n]), and the accompaniment blending weight w_a[n]=1-w_v[n]. If S[n]=0 (i.e., silent state), then set w_v[n]=w_a[n]=0.5 to avoid division by zero error.
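The framing and weight rules above can be sketched as follows. The symmetric Hann window definition and the function names are assumptions; the weight rule itself, including the 0.5/0.5 fallback for silent frames, is taken from the text.

```python
import math


def frame_energies(e, frame_len, hop):
    """Hann-weight each frame of an energy sequence and take its mean."""
    if frame_len < 2 or hop < 1:
        raise ValueError("frame_len must be >= 2 and hop >= 1")
    win = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (frame_len - 1))
           for i in range(frame_len)]
    frames = []
    for start in range(0, len(e) - frame_len + 1, hop):
        seg = e[start:start + frame_len]
        frames.append(sum(w * v for w, v in zip(win, seg)) / frame_len)
    return frames


def blend_weights(e1_frames, e2_frames, rho):
    """Per-frame (w_v, w_a): w_v = rho * E1 / (E1 + E2), w_a = 1 - w_v."""
    weights = []
    for e1, e2 in zip(e1_frames, e2_frames):
        total = e1 + e2
        if total > 0.0:
            w_v = rho * (e1 / total)
            weights.append((w_v, 1.0 - w_v))
        else:
            weights.append((0.5, 0.5))    # silent frame, avoids division by zero
    return weights


demo_frames = frame_energies([2.0] * 8, frame_len=4, hop=2)
w = blend_weights([3.0, 0.0], [1.0, 0.0], rho=0.6)
```

At 48kHz the text's parameters correspond to `frame_len=9600` and `hop=4800`. Because the same window is applied to both energy sequences, its scaling cancels in the ratio E1_frame[n] / S[n].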
[0101] Finally, the system generates the target audio signal. The processed vocal audio signal x_vocal_proc(t) and the accompaniment audio signal x_accomp_proc(t) are divided into frames, with each frame corresponding to 200 milliseconds of audio data. For the nth frame, all samples of x_vocal_proc(t) in that frame are multiplied by w_v[n], and all samples of x_accomp_proc(t) in that frame are multiplied by w_a[n]. Then, the two weighted signals are added sample by sample to obtain the target audio signal y(t) = w_v[n] × x_vocal_proc(t) + w_a[n] × x_accomp_proc(t), where t belongs to the time interval of the nth frame. This target audio signal is sent to the room speaker system or headphone output through the audio output interface for users to monitor in real time.
[0102] Throughout the processing flow, the acoustic sensing module, parameter configuration module, signal processing module, and fusion control module are all implemented by the same program thread or multiple collaborative threads within the central processing unit (CPU). Data is transferred between modules via a shared memory buffer. The microphone array is physically connected to the main microphone, speakers, and audio interface card via analog or digital audio cables. The audio interface card is connected to the CPU via a PCIe or USB 3.0 bus. All signal processing is performed at a 48kHz sampling rate and 24-bit quantization precision to ensure audio quality. The system updates the acoustic feature sequence every 5 seconds and recalculates operating parameters to adapt to environmental changes (such as changes in acoustic characteristics caused by people moving around or doors and windows opening and closing).
[0103] Without accompanying diagrams, the logical connections of the above modules are as follows: the microphone array output is connected to the acoustic sensing module input; the acoustic sensing module output is connected to the parameter configuration module input; the parameter configuration module output is connected to the parameter input ports of the digital reverb unit and multi-band equalizer in the signal processing module; the vocal and accompaniment signal inputs are connected to the signal inputs of the signal processing module; the two outputs of the signal processing module are connected to the energy calculation unit input of the fusion control module; the weight output of the fusion control module is connected to the control terminal of the weighted superposition unit; the two signal inputs of the weighted superposition unit receive the processed vocal and accompaniment signals, and its output is the target audio signal output.
[0104] To enable those skilled in the art to fully understand and implement this invention, the specific implementation principles of this invention are further supplemented below with a specific application scenario.
[0105] Inside the karaoke room, when a user enters and prepares to sing, the system first uses a ring microphone array deployed 1.5 meters above the ground to continuously collect ambient audio data for 30 seconds before the user presses the "Start Singing" button. During this period, the system periodically plays a composite test signal consisting of a 20Hz to 20kHz logarithmic sweep sine wave superimposed with white noise through a full-range speaker in the center of the front wall. This signal has a wide bandwidth energy distribution and good autocorrelation characteristics, facilitating subsequent deconvolution processing. The microphone array synchronously receives the response signal formed after the test signal propagates in space. Because the eight omnidirectional condenser microphones are evenly distributed in a ring, they can effectively capture spatial reflection path information from different directions, thereby improving the robustness of impulse response extraction. Subsequently, the system preprocesses each response signal, including denoising using spectral subtraction or Wiener filtering, aligning the arrival time of direct sound in each channel based on the cross-correlation function, and applying a Hanning window to reduce spectral leakage. Then, the spatial impulse response signal is deconvoluted from the convolution relationship between the excitation signal and the response signal using the maximum length sequence (MLS) deconvolution algorithm. This signal completely preserves the acoustic transmission path characteristics from the loudspeaker to each microphone position, including the energy distribution and time delay structure of direct sound, early reflections (within 50ms), and reverberation tail (after 50ms).
[0106] Based on the extracted spatial impulse response signal, the system performs sliding analysis in 500-millisecond non-overlapping segments. For each segment, the Schroeder integral method is used to calculate its energy decay curve, and the T60 reverberation time value is extrapolated by linear fitting of T20 (the time required for 20dB decay), forming a reverberation time series. Simultaneously, within each impulse response segment, using the direct peak sound as a reference, a 0–50ms early reflection window and a 50–300ms late reflection window are defined. The root mean square energy of the signal within each window is calculated, and their ratio is used to obtain the reflection energy ratio, forming a reflection energy ratio series. Furthermore, during silent periods when no test signal is played, the system uses the same microphone array to collect pure ambient noise. The time-domain noise is converted to the frequency domain using a short-time Fourier transform (STFT), and frequency bands are divided into 1 / 3 octave bands. The power spectral density of each band is calculated, and a weighted average is applied to the 1kHz–4kHz human-sensitive frequency band to form a background noise density series. These three series together constitute a multi-dimensional quantitative characterization of the current acoustic environment.
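The reflection energy ratio for one impulse response segment might be computed along these lines. Two points are assumptions because the text leaves them open: the ratio is taken as early RMS over late RMS, and the early window starts at the direct-sound peak itself. The function name is hypothetical.

```python
import math


def reflection_energy_ratio(h, fs, early_ms=50.0, late_ms=300.0):
    """RMS of the early window divided by RMS of the late window.

    Windows are measured from the direct-sound peak (largest |h|).
    """
    if not h:
        raise ValueError("empty impulse response")
    peak = max(range(len(h)), key=lambda i: abs(h[i]))
    early_end = peak + round(early_ms * fs / 1000.0)
    late_end = peak + round(late_ms * fs / 1000.0)
    early = h[peak:early_end]
    late = h[early_end:late_end]
    if not early or not late:
        raise ValueError("impulse response too short for both windows")

    def rms(seg):
        return math.sqrt(sum(s * s for s in seg) / len(seg))

    late_rms = rms(late)
    if late_rms == 0.0:
        raise ValueError("late window is silent")
    return rms(early) / late_rms


# Toy response: 10 ms of silence, 50 ms at amplitude 1.0, 250 ms at amplitude 0.5.
fs = 1000
h = [0.0] * 10 + [1.0] * 50 + [0.5] * 250
rer = reflection_energy_ratio(h, fs)
```

If the intended definition is late over early, the returned value would simply be inverted.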
[0107] During the parameter configuration phase, the system first counts the number of data points N_excess exceeding the -40 dBFS threshold in the background noise density sequence and divides this number by the total number of points N_total to obtain the correction index η. This index reflects the intensity of environmental noise interference with auditory perception. A larger η indicates stronger background noise, requiring the system to compress the reverberation time to avoid masking human voices, while simultaneously expanding the tolerance range of the reflection energy ratio to accommodate the uncertainty of the reflection path. Therefore, taking the standard reverberation time range [0.8s, 1.5s] as a basis, the system dynamically extends the reverberation time determination interval using T60_low=0.8-0.5×0.8×η and T60_high=1.5+0.5×1.5×η; similarly, the standard reflection energy ratio range [0.3, 0.7] is adjusted using RER_low=0.3-0.5×0.3×η and RER_high=0.7+0.5×0.7×η. The system then calculates the proportion P1 of the reverberation time series falling within the [T60_low, T60_high] interval. If P1 is high, it indicates that the current reverberation characteristics are stable and close to the ideal range. In this case, the mapping function f1(P1) outputs a decay time coefficient τ close to 1.0, allowing the digital reverberator to maintain a longer reverberation tail to enhance the sense of space. If P1 is low, τ decreases to around 0.5, forcibly shortening the reverberation to improve speech clarity. Similarly, the second effective proportion P2 reflects the rationality of the reflection energy structure. When P2 is low, it indicates insufficient early reflection or excessive late reflection. The system then applies a positive gain offset ΔG_i (e.g., +2dB to +4dB) to the mid-high frequency bands such as 2kHz and 4kHz of the multi-band equalizer through the mapping function f2(P2) to compensate for the high-frequency attenuation caused by reflection imbalance, thereby restoring the presence and penetration of the human voice.
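One possible shape for the mapping f2 is sketched below. The text only states the trend (offsets near 0 for high P2, a positive mid-high boost for low P2), so the linear form, the 4dB ceiling, and the choice to boost only the 2kHz and 4kHz bands are all assumptions, as is the function name.

```python
CENTER_FREQS_HZ = [100, 250, 500, 1000, 2000, 4000, 8000, 16000]


def gain_offsets(p2, max_boost_db=4.0, boosted=(2000, 4000)):
    """Assumed linear f2: no offset at p2 = 1, max_boost_db at p2 = 0."""
    p2 = min(max(p2, 0.0), 1.0)
    boost = max_boost_db * (1.0 - p2)
    # Only the assumed mid-high bands receive the boost; the rest stay at 0 dB.
    return {f: (boost if f in boosted else 0.0) for f in CENTER_FREQS_HZ}


offsets = gain_offsets(0.25)
```

With P2 = 0.25 this assumed mapping yields +3dB at 2kHz and 4kHz, inside the +2dB to +4dB range given as an example.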
[0108] After entering the signal processing stage, the user's vocals are picked up by the main microphone and first enter a digital reverb unit based on a Feedback Delay Network (FDN). This reverb unit has a closed-loop feedback structure consisting of eight or more delay lines, and its overall decay rate is dynamically adjusted by τ: the larger τ is, the closer the feedback coefficient is to 1, and the slower the reverb decays at the tail; conversely, the smaller τ is, the faster the decay. The reverb output signal is then sent to an 8-band parametric equalizer, each with a fixed center frequency (e.g., 100Hz, 250Hz…16kHz), a preset Q value of 1.414, and a gain value that is updated in real-time by adding ΔG_i to the factory preset value, achieving spectral shaping to address current acoustic defects. The processed vocal signal is squared and integrated within a 50-millisecond (2400 sampling points) sliding window to output a first energy value E1(t), which reflects the instantaneous loudness of the vocals. Simultaneously, the accompaniment signal, after being input from the media source, is processed by the system in the same way, and also outputs a second energy value E2(t) after energy integration.
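The relationship between τ and the feedback coefficients can be illustrated with the common rule that a delay line of d samples needs a gain of 10^(-3d/(fs×T60)) for the loop to decay by 60dB in T60 seconds. The sketch assumes τ scales a nominal reverberation time and keeps the delay lengths fixed, whereas the text also mentions adjusting delay length. The function name, the nominal T60 of 1.2s, and the delay lengths are illustrative assumptions.

```python
import math


def fdn_feedback_gains(delays, fs, tau, t60_nominal=1.2):
    """Per-delay-line gains giving a 60 dB decay in tau * t60_nominal seconds."""
    t60 = tau * t60_nominal
    if t60 <= 0.0:
        raise ValueError("target reverberation time must be positive")
    return [10.0 ** (-3.0 * d / (fs * t60)) for d in delays]


# Eight illustrative delay lengths in samples (mutually distinct primes).
delays = [1031, 1327, 1523, 1871, 2053, 2311, 2617, 2903]
long_tail = fdn_feedback_gains(delays, 48000, tau=1.0)
short_tail = fdn_feedback_gains(delays, 48000, tau=0.5)
```

Consistent with the description, a larger τ moves every gain closer to 1 and slows the decay, while τ=0.5 halves the target decay time.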
[0109] During the fusion control phase, the system divides E1(t) and E2(t) into frames with a frame length of 200 milliseconds and a 50% overlap, applies a Hanning window to smooth the frame boundaries, and then averages all samples within each frame to obtain discrete short-time energy sequences {E1_frame[n]} and {E2_frame[n]}. The system reads the target vocal proportion ρ=0.65 based on the currently selected "solo mode," which represents vocals accounting for 65% of the total energy under ideal mixing conditions. For each frame n, the total energy S[n]=E1_frame[n]+E2_frame[n] is calculated. If S[n]>0, the vocal fusion weight w_v[n]=0.65×(E1_frame[n] / S[n]). This design ensures that when the vocal energy is low (e.g., soft singing), the system automatically increases its relative gain to maintain the target proportion and avoids being drowned out by the accompaniment; when the vocal energy is too high, it is moderately suppressed to prevent popping sounds. Finally, the system aligns the processed vocal and accompaniment signals in 200-millisecond frames, multiplies all samples within each frame by w_v[n] and w_a[n]=1-w_v[n] respectively, and adds the results sample by sample to generate the target audio signal y(t). This signal maintains the dominance of the human voice while taking into account the integrity and spatial coordination of the accompaniment, significantly improving the naturalness and immersion of the singing experience.
[0110] Throughout the process, the system re-executes acoustic sensing and parameter updates every 5 seconds to cope with acoustic changes caused by people moving around, opening and closing doors and windows, etc., to ensure that audio processing is always adapted to the current environment.
[0111] Figure 5 is a block diagram of an electronic device illustrating an audio data processing method according to an exemplary embodiment. As shown in Figure 5, the electronic device 700 may include: a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an I / O interface 704 (input / output interface), and a communication component 705.
[0112] The processor 701 controls the overall operation of the electronic device 700 to complete all or part of the steps in the aforementioned audio data processing method. The memory 702 stores various types of data to support the operation of the electronic device 700. This data may include, for example, instructions for any application or method operating on the electronic device 700, and application-related data such as contact data, sent and received messages, pictures, audio, video, etc. The memory 702 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The multimedia component 703 may include a screen and audio components. The screen may be, for example, a touchscreen, and the audio component is used to output and / or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in memory 702 or transmitted via communication component 705. The audio component also includes at least one speaker for outputting audio signals. I / O interface 704 provides an interface between processor 701 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual or physical buttons. Communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. Wireless communication, such as Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, or other 5G technologies, or a combination thereof, is not limited here. Therefore, the corresponding communication component 705 may include: a Wi-Fi module, a Bluetooth module, an NFC module, etc.
[0113] In one exemplary embodiment, the electronic device 700 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the audio data processing method described above.
[0114] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, which, when executed by a processor, implement the steps of the audio data processing method described above. For example, the computer-readable storage medium may be the memory 702 including the program instructions described above, which may be executed by the processor 701 of the electronic device 700 to complete the audio data processing method described above.
[0115] The preferred embodiments of this disclosure have been described in detail above with reference to the accompanying drawings. However, this disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of this disclosure, various simple modifications can be made to the technical solutions of this disclosure, and these simple modifications all fall within the protection scope of this disclosure.
[0116] It should also be noted that the various specific technical features described in the above embodiments can be combined in any suitable manner without contradiction. To avoid unnecessary repetition, this disclosure will not describe the various possible combinations separately.
[0117] Furthermore, the various embodiments of this disclosure can be combined in any way; as long as such combinations do not violate the spirit of this disclosure, they should also be regarded as content disclosed herein.
[0118] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention, and they should all be covered within the scope of the claims and specification of the present invention.
Claims
1. An audio data processing method, characterized in that the method includes: Acquire the spatial impulse response signal during the processing period before the target time, and obtain the reverberation time sequence, reflection energy ratio sequence and background noise density sequence based on the spatial impulse response signal; The operating parameters of the digital reverberator and multi-band equalizer are obtained based on the reverberation time sequence, reflection energy ratio sequence, and background noise density sequence. Based on the operating parameters, the human voice audio signal at the target time is processed and a first energy value is output; based on the operating parameters, the accompaniment audio signal at the target time is processed and a second energy value is output. Calculate the short-time energy of the first energy signal and the second energy signal, and calculate the accompaniment fusion weight and the vocal fusion weight based on the short-time energy and the target vocal proportion in the preset singing mode, and generate the target audio signal based on the accompaniment fusion weight and the vocal fusion weight.
2. The audio data processing method according to claim 1, characterized in that, The process of obtaining the operating parameters of the digital reverberator and multi-band equalizer based on the reverberation time sequence, reflection energy ratio sequence, and background noise density sequence includes: Obtain the excess number of background noise density data exceeding the preset noise density data in the background noise density sequence, divide the excess number by the number of background noise density data in the background noise density sequence, and obtain the correction index. Obtain the standard reverberation time range and, based on the correction index and the standard reverberation time range, obtain the reverberation time dynamic range; obtain the standard reflection energy ratio range and, based on the correction index and the standard reflection energy ratio range, obtain the reflection energy ratio dynamic range. The proportion of all reverberation time data in the reverberation time series that are within the dynamic range of the reverberation time is obtained and used as the first effective proportion; the proportion of all reflection energy ratio data in the reflection energy ratio series that are within the dynamic range of the reflection energy ratio is obtained and used as the second effective proportion. The operating parameters of the digital reverb are obtained based on the first effective ratio mapping, and the operating parameters of the multi-band equalizer are obtained based on the second effective ratio mapping.
3. The audio data processing method according to claim 1, characterized in that, The process of processing the human voice audio signal at the target time based on the operating parameters and outputting the first energy value includes: The human voice audio signal is input into a digital reverb unit, and the operating parameters of the digital reverb unit are used to control the reverb tail decay rate to obtain the reverb-processed human voice audio signal. The reverberated human voice audio signal is input into a multi-band equalizer. The gain of each frequency band is adjusted using the operating parameters of the multi-band equalizer, and the equalized human voice audio signal is obtained. The first energy value is obtained by squaring the equalized human voice audio signal and integrating it within a 50-millisecond sliding window.
4. The audio data processing method according to claim 1, characterized in that, The process of processing the accompaniment audio signal at the target time based on the operating parameters and outputting the second energy value includes: The accompaniment audio signal is input into a digital reverb unit. The operating parameters of the digital reverb unit are used to control the decay rate of the reverb tail, and the reverb-processed accompaniment audio signal is obtained. The reverberated accompaniment audio signal is input into a multi-band equalizer. The gain of each frequency band is adjusted using the operating parameters of the multi-band equalizer, and the equalized accompaniment audio signal is obtained. The second energy value is obtained by squaring the equalized accompaniment audio signal and integrating it within a 50-millisecond sliding window.
5. The audio data processing method according to claim 1, characterized in that, The calculation of the short-time energy of the first energy signal and the second energy signal includes: The first and second energy values are weighted by Hanning window, and the frames are divided under the conditions of 200 millisecond frame length and 50% overlap. The arithmetic mean of the energy values in each frame is calculated to obtain the short-time energy sequence of the human voice and the short-time energy sequence of the accompaniment.
6. The audio data processing method according to claim 5, characterized in that, The calculation of accompaniment blending weight and vocal blending weight based on short-time energy and the target vocal proportion in a preset singing mode includes: Obtain the target vocal proportion corresponding to the current singing mode, and calculate the sum of the short-time energy of the vocals and the short-time energy of the accompaniment in each frame; If the sum of a certain frame is greater than zero, the target voice proportion multiplied by the ratio of the short-time energy of the voice in that frame to the sum is used as the voice fusion weight of that frame, and the remaining part is used as the accompaniment fusion weight of that frame. If the sum of the values in a certain frame is equal to zero, then the vocal blending weight and the accompaniment blending weight in that frame are both set to 0.5.
7. The audio data processing method according to claim 1, characterized in that generating the target audio signal based on the accompaniment blending weights and the vocal blending weights includes: multiplying the reverberated and equalized accompaniment audio signal and the vocal audio signal frame by frame by the corresponding accompaniment blending weight and vocal blending weight, respectively, and adding the two weighted signals sample by sample to generate the target audio signal.
8. An audio data processing system, characterized in that the system is used to implement the audio data processing method as described in any one of claims 1 to 7, the system comprising: an acoustic perception module, used to acquire the spatial impulse response signal of the time period to be processed before the target time through a microphone array deployed in the singing space, and to acquire the reverberation time sequence, the reflection energy ratio sequence, and the background noise density sequence based on the spatial impulse response signal; a parameter configuration module, used to obtain the decay time coefficient of the digital reverberator and the gain offset of the multi-band equalizer at each center frequency point based on the reverberation time sequence, the reflection energy ratio sequence, and the background noise density sequence; a signal processing module, used to process the vocal audio signal and the accompaniment audio signal based on the decay time coefficient and the gain offset, respectively; and a fusion control module, used to calculate the short-time energy of the processed vocal audio signal and accompaniment audio signal, to generate frame-by-frame blending weights in combination with the target vocal proportion in the preset singing mode, and to generate the target audio signal based on the frame-by-frame blending weights.
9. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1 to 7.
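The energy and blending steps recited in claims 4 to 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. All function names (`sliding_energy`, `short_time_energy`, `blend_weights`, `mix`) are hypothetical, and several details the claims leave open are filled in by assumption: window and frame lengths are given in samples, the Hann-weighted mean is normalized by the frame length, the "remaining part" in claim 6 is taken to be one minus the vocal weight, and each sample is mixed with the weight of the frame starting at or before it.

```python
import math

def sliding_energy(signal, win):
    """Claim 4 (sketch): sum of squared samples over a trailing window of `win` samples."""
    out, acc = [], 0.0
    for i, x in enumerate(signal):
        acc += x * x
        if i >= win:
            acc -= signal[i - win] ** 2  # drop the sample that left the window
        out.append(acc)
    return out

def hann(n):
    """Symmetric Hann window of length n (n >= 2)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def short_time_energy(energy, frame_len):
    """Claim 5 (sketch): Hann-weighted frames with 50% overlap, arithmetic mean per frame."""
    hop = frame_len // 2  # 50% overlap
    w = hann(frame_len)
    frames = []
    for start in range(0, len(energy) - frame_len + 1, hop):
        frame = energy[start:start + frame_len]
        # Assumption: the mean is taken over the frame length, not the window sum.
        frames.append(sum(e * wi for e, wi in zip(frame, w)) / frame_len)
    return frames

def blend_weights(e_vocal, e_accomp, target_ratio):
    """Claim 6 (sketch): per-frame (vocal, accompaniment) blending weights."""
    weights = []
    for ev, ea in zip(e_vocal, e_accomp):
        total = ev + ea
        if total > 0:
            wv = target_ratio * ev / total
            # Assumption: "the remaining part" means 1 - vocal weight.
            weights.append((wv, 1.0 - wv))
        else:
            weights.append((0.5, 0.5))  # silent frame: equal weights
    return weights

def mix(vocal, accomp, weights, hop):
    """Claim 7 (sketch): weight each signal per frame, then add sample by sample.

    Assumption: sample i uses the weights of frame i // hop, clamped to the
    last frame. `weights` must be non-empty.
    """
    out = []
    for i, (v, a) in enumerate(zip(vocal, accomp)):
        wv, wa = weights[min(i // hop, len(weights) - 1)]
        out.append(wv * v + wa * a)
    return out
```

At an assumed sampling rate of 48 kHz, the 50-millisecond window of claim 4 corresponds to `win = 2400` samples and the 200-millisecond frame of claim 5 to `frame_len = 9600` samples, giving a hop of 4800 samples for `mix`.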