Dynamic threshold based sound field adjustment method, system, device and medium
By using a dynamic threshold-based sound field adjustment method, the system comprehensively collects and processes human voice, accompaniment, and environmental signals to generate a unified sound field state factor D, which drives a parallel audio processing unit. This solves the problems of fixed parameters and single-dimensional adaptation, and achieves stability and adaptive optimization of human voice quality and listening experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHENGDU XIAOCHANG TECH CO LTD
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-12
AI Technical Summary
In the real-time processing of human voice audio, existing technologies cannot adapt to the real-time changes of the singer, the dynamic range of the accompaniment music, and the environmental acoustic conditions with fixed parameters. This results in unstable human voice quality, and single-dimensional adaptive decision-making is prone to errors in complex scenarios. Cascaded processing leads to signal distortion and coupling distortion.
A sound field adjustment method based on dynamic thresholds is adopted. By synchronously collecting human voice, accompaniment and environmental signals, multi-dimensional acoustic features are extracted to generate a comprehensive sound field state factor D, which drives the parallel processing unit to process human voice audio signals in parallel, and an output quality feedback mechanism is introduced for self-correction.
It achieves stable vocal quality and consistent listening experience in different scenarios, avoids coupling distortion from cascaded processing, can respond to changes in the sound field in real time and perform self-optimization, and improves adaptability to different room acoustics and accompaniment.
Figure CN122201235A_ABST
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and specifically to a sound field adjustment method, system, device, and medium based on dynamic thresholds.
Background Technology
[0002] In current technologies, when performing real-time effects processing (such as reverb, equalization, and compression) on human voice audio to enhance its spatiality, clarity, and integration with the accompaniment, existing technologies employ fixed parameter processing strategies. The preset reverb time, equalization curve, and compression threshold cannot adaptively adjust to the singer's real-time changes in volume and pitch, different styles of accompaniment music (with huge differences in dynamic range and spectral distribution), and the acoustic conditions of the playback environment (room reverberation characteristics). This results in unstable vocal quality in different application scenarios, easily leading to dry, blurry, or disconnected vocals from the background music.
[0003] Secondly, some improvement schemes have introduced adaptive mechanisms based on a single signal dimension (such as only the volume of human voice). However, since the sound field perception is actually the result of a complex interaction between human voice, accompaniment and environment, this decision-making method that relies solely on a single feature (such as energy) is prone to making adjustments that are inappropriate in scenarios with complex accompaniment or changing environment. For example, increasing the volume of human voice in a noisy accompaniment may result in an overall harsh sound or the human voice being masked.
[0004] Furthermore, in common cascaded audio processing architectures (where the signal passes through compression, equalization, and reverb modules in sequence), each later processing unit receives a signal whose characteristics have already been altered by the previous stage. Decisions made on that altered input can be wrong, producing unnatural stacked effects (for example, over-compression degrading the subsequent reverb and equalization). This is often referred to in the industry as processing coupling distortion.
Summary of the Invention
[0005] In response to the technical problems mentioned in the background art, such as fixed parameter mismatch, one-dimensional adaptive bias, cascaded processing coupling, and lack of long-term self-optimization capability, this invention provides a sound field adjustment method, system, device, and medium based on dynamic thresholds. It can comprehensively sense multi-source acoustic information, collaboratively drive multiple parallel processing units based on unified decision logic, and achieve self-correction through output quality feedback.
[0006] A sound field adjustment method based on dynamic thresholds includes: acquiring, in real time, a human voice audio signal, an accompaniment audio signal, and an environmental acoustic reference signal; extracting a first acoustic feature set from the human voice audio signal, a second acoustic feature set from the accompaniment audio signal, and a third acoustic feature set from the environmental acoustic reference signal; normalizing each feature value in the first, second, and third acoustic feature sets, and performing linear weighted fusion of the normalized feature values through a configurable weight model to generate a one-dimensional comprehensive sound field state factor D; inputting the comprehensive sound field state factor D into a set of predefined and mutually independent parameter mapping functions to generate, in parallel, a reverberation trigger threshold, a target reverberation time, a high-frequency equalization gain, and a compressor threshold, wherein a reverberation processing unit, a frequency domain equalization processing unit, and a dynamic compression processing unit each take the human voice audio signal as input and process it in parallel according to the reverberation trigger threshold and target reverberation time, the high-frequency equalization gain, and the compressor threshold, respectively; obtaining, from the parallel-processed and synthesized human voice output signal, quality evaluation indicators including transient distortion and spectral flatness change; and correcting the weight coefficients of the weight model and / or the mapping coefficients of the parameter mapping functions based on these quality evaluation indicators.
[0007] Optionally, the first acoustic feature set includes short-time energy and fundamental frequency stability indices for human voices; the second acoustic feature set includes accompaniment dynamic range parameters and accompaniment spectral centroid; and the third acoustic feature set includes ambient reverberation time.
[0008] Optionally, the parameter mapping function includes: a first mapping function for calculating the reverberation trigger threshold and target reverberation time based on the comprehensive sound field state factor D; a second mapping function for calculating the high-frequency equalization gain based on the comprehensive sound field state factor D; and a third mapping function for calculating the compressor threshold based on the comprehensive sound field state factor D.
[0009] Optionally, the first mapping function is configured as follows: when the comprehensive sound field state factor D is lower than a first activation threshold D_r_on, the reverberation trigger threshold is set to an invalid value that disables the reverberation processing unit; when D is greater than or equal to D_r_on, the reverberation trigger threshold decreases linearly as D increases, and the target reverberation time increases linearly as D increases. The second mapping function is configured as follows: when D is lower than a second activation threshold D_eq_on, the high-frequency equalization gain is set to zero; when D is greater than or equal to D_eq_on, the high-frequency equalization gain increases linearly as D increases. The third mapping function is configured as follows: when D is lower than a third activation threshold D_c_on, the compressor threshold is set to a default value; when D is greater than or equal to D_c_on, the compressor threshold decreases linearly as D increases.
[0010] Optionally, obtaining quality evaluation indicators including transient distortion based on the parallel-processed and synthesized human voice output signal includes: acquiring the original transient region of the human voice audio signal, and acquiring the transient response signals of the reverberation processing unit, the frequency domain equalization processing unit, and the dynamic compression processing unit in that region; and calculating the transient distortion from the original transient region, the transient response signals of each processing unit, and weighting coefficients associated with the target reverberation time, high-frequency equalization gain, and compressor threshold, respectively. The calculation logic of the transient distortion is: for each sampling point in the original transient region, calculate a first absolute difference between the transient response signal of the reverberation processing unit and the original transient signal, a second absolute difference between the transient response signal of the frequency domain equalization processing unit and the original transient signal, and a third absolute difference between the transient response signal of the dynamic compression processing unit and the original transient signal; divide the first absolute difference by the first weighting coefficient, the second by the second weighting coefficient, and the third by the third weighting coefficient; and add the three results to obtain the transient distortion component of that sampling point. The transient distortion components of all sampling points are squared and averaged, and the square root of the average is taken as the final transient distortion. The first, second, and third weighting coefficients are associated with the current target reverberation time, high-frequency equalization gain, and compressor threshold, respectively, and are all positive.
[0011] Optionally, the quality evaluation index, including the change in spectral flatness, is obtained based on the parallel-processed and synthesized human voice output signal. This includes: calculating the first spectral flatness of the human voice audio signal within a preset time window, and the second spectral flatness of the human voice output signal within the same time window; calculating the change in spectral flatness based on the first spectral flatness, the second spectral flatness, the current normalized accompaniment spectrum centroid, and the current normalized accompaniment dynamic range parameter; wherein, the calculation logic for the change in spectral flatness is: calculating the absolute value of the difference between the second spectral flatness and the first spectral flatness, and multiplying this absolute value by a modulation factor; the modulation factor is configured such that its value is equal to 1 plus a preset sensitivity adjustment coefficient multiplied by the ratio of the current normalized accompaniment spectrum centroid to the sum of the current normalized accompaniment spectrum centroid and the accompaniment dynamic range parameter.
[0012] Optionally, modifying the parameter mapping function includes: when the transient distortion in the quality evaluation index exceeds the first safety threshold, reducing the slope of the high-frequency equalization gain as a function of D in the second mapping function, and / or increasing the baseline value of the compressor threshold in the third mapping function; and providing feedback modification to the weighting model includes: when the change in spectral flatness in the quality evaluation index exceeds the second safety threshold, reducing the weighting coefficients in the weighting model corresponding to the features in the second acoustic feature set.
[0013] A sound field adjustment system based on dynamic thresholds is also provided to implement the sound field adjustment method based on dynamic thresholds. The system includes: a multi-source signal acquisition module, including analog-to-digital converters connected to a human voice microphone interface, an accompaniment audio input interface, and an environmental pickup, for real-time acquisition of human voice audio signals, accompaniment audio signals, and environmental acoustic reference signals; a feature extraction module, for extracting a first acoustic feature set, a second acoustic feature set, and a third acoustic feature set from the three signals respectively; a fusion calculation module, for performing weighted summation on the normalized feature values to generate a comprehensive sound field state factor D; a parameter mapping module, for generating the reverberation trigger threshold Th_r, target reverberation time T60, high-frequency equalization gain G_hf, and compressor threshold Th_c based on D; a parallel processing module, including a reverberation processing unit, a frequency domain equalization processing unit, and a dynamic compression processing unit, all three of which receive the original human voice audio signal as input and are controlled by Th_r / T60, G_hf, and Th_c, respectively; and a feedback analysis module, for calculating the transient distortion and spectral flatness change of the output signal of the parallel processing module and transmitting the results to the fusion calculation module and the parameter mapping module to update the weighting coefficients and mapping parameters.
[0014] An electronic device is also provided, comprising: a memory having a computer program stored thereon; and a processor for executing the computer program in the memory for implementing a sound field adjustment method based on dynamic thresholds.
[0015] A non-transitory computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements a sound field adjustment method based on dynamic thresholds.
[0016] The beneficial effects of this invention are reflected in: In the entire dynamic threshold-based sound field adjustment method, firstly, key acoustic features from three dimensions—voice, accompaniment, and environment—are simultaneously acquired and extracted, and then integrated into a unified comprehensive sound field state factor D. This provides a comprehensive and consistent decision-making basis for all subsequent audio processing, solving the problems of fixed parameters being unable to adapt to changing scenarios and the one-sidedness of single-dimensional adaptive decision-making. This allows for a comprehensive consideration of the singer's state, accompaniment characteristics, and room acoustics, making an overall judgment that better meets the actual sound field requirements. Secondly, based on the unified factor D, a set of independent mapping functions drives the reverberation, equalization, and compression processing units in parallel, with each unit using the original vocal signal as input for parallel processing. This physical architecture eliminates the processing coupling distortion caused by the step-by-step transmission and distortion of signal features in existing cascaded processing, ensuring that each audio effect can be applied independently and accurately based on pure, original signal features. Finally, a closed-loop feedback mechanism based on output signal quality is introduced. By calculating and processing the transient distortion and spectral flatness changes that are dynamically correlated with intensity and accompaniment context, and by slowly and continuously fine-tuning the internal weighting model and mapping function coefficients, this mechanism can not only respond to changes in the external sound field in real time, but also perform self-checking and self-correction based on its own processing results. 
This allows for gradual optimization of its internal model during use, improving its long-term adaptability and stability to different room acoustics, user singing habits, and diverse accompaniments.
Attached Figure Description
[0017] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the accompanying drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. In all the drawings, similar elements or parts are generally identified by similar reference numerals. In the drawings, the elements or parts are not necessarily drawn to scale.
[0018] Figure 1 is a schematic diagram illustrating the steps of the sound field adjustment method based on dynamic thresholds of the present invention; Figure 2 is a partial flowchart of one embodiment of the method; Figure 3 is a flowchart of another part of the method in one embodiment; Figure 4 is a block diagram illustrating an electronic device according to an embodiment of the present invention.
[0019] Reference numerals: 700 - Electronic device; 701 - Processor; 702 - Memory; 703 - Multimedia component; 704 - I / O interface; 705 - Communication component.
Detailed Implementation
[0020] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0021] Therefore, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are within the scope of protection of the invention.
[0022] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, the terms "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0023] As shown in Figures 1 to 3, a sound field adjustment method based on dynamic thresholds is provided. In one embodiment, the method includes:
S1. Acquire a human voice audio signal, an accompaniment audio signal, and an environmental acoustic reference signal in real time; extract a first acoustic feature set from the human voice audio signal, a second acoustic feature set from the accompaniment audio signal, and a third acoustic feature set from the environmental acoustic reference signal.
S2. Normalize the feature values in the first, second, and third acoustic feature sets, and then perform linear weighted fusion of the normalized feature values through a configurable weight model to generate a one-dimensional comprehensive sound field state factor D.
S3. Input the comprehensive sound field state factor D into a set of predefined and mutually independent parameter mapping functions to generate the reverberation trigger threshold, target reverberation time, high-frequency equalization gain, and compressor threshold in parallel. A reverberation processing unit, a frequency domain equalization processing unit, and a dynamic compression processing unit each take the human voice audio signal as input and process it in parallel according to the reverberation trigger threshold and target reverberation time, the high-frequency equalization gain, and the compressor threshold, respectively.
S4. Obtain quality evaluation indicators, including transient distortion and spectral flatness change, from the parallel-processed and synthesized human voice output signal, and correct the weight coefficients of the weight model and / or the mapping coefficients of the parameter mapping functions based on the quality evaluation indicators.
[0024] In this embodiment, it should be noted that in S1, the basic physical quantities of the sound field state are perceived. Three key signals are simultaneously acquired through independent hardware channels: the vocal audio signal from the singer's microphone, the accompaniment audio signal from the sound source device, and the environmental acoustic reference signal from the environmental reference microphone placed in the listening area. The sampling rate is uniformly set to 48kHz, and the bit depth is 24bit to ensure time-domain alignment and high precision.
[0025] Furthermore, feature extraction is performed for each signal. For vocal audio signals, the short-time energy (in dB) is calculated every 20 milliseconds to reflect the singing intensity; simultaneously, the fundamental frequency is estimated and its short-time standard deviation is calculated as a fundamental frequency stability index (in Hz) to reflect the degree of pitch fluctuation in the singing. For accompaniment audio signals, their dynamic range parameter (in dB) is calculated to describe the amplitude span of the accompaniment from the softest to the loudest; and their spectral centroid (in Hz) is calculated to characterize the distribution tendency of the accompaniment energy in the frequency dimension. For the environmental acoustic reference signal, during startup or between quiet intervals, the test signal is played and acquired, and the room reverberation time (RT60, in seconds) is calculated using the Schroeder integral method to quantify the acoustic characteristics of the space.
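As an illustration of the first vocal feature only, the following is a minimal sketch of a per-frame short-time energy measurement at the 48kHz sampling rate and 20 millisecond interval given above. The function name, the non-overlapping framing, the use of full scale 1.0 as the dB reference, and the -120dB floor for silent frames are assumptions made for the example; the source does not specify them.

```python
import math

SAMPLE_RATE = 48_000                          # Hz, as stated in the embodiment
FRAME_MS = 20                                 # analysis interval, as stated
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 960 samples per frame


def short_time_energy_db(samples, floor_db=-120.0):
    """Return one energy value (dB re full scale 1.0) per complete 20 ms frame.

    Non-overlapping frames and the silence floor are assumptions of this sketch.
    """
    energies = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        mean_square = sum(s * s for s in frame) / FRAME_LEN
        if mean_square <= 0.0:
            energies.append(floor_db)
        else:
            energies.append(max(floor_db, 10.0 * math.log10(mean_square)))
    return energies


# Hypothetical test input: two frames of a 500 Hz sine at half of full scale.
tone = [0.5 * math.sin(2 * math.pi * 500 * n / SAMPLE_RATE)
        for n in range(2 * FRAME_LEN)]
energies = short_time_energy_db(tone)
print([round(e, 2) for e in energies])
```

The fundamental frequency stability, accompaniment dynamic range, spectral centroid, and RT60 estimates would each need their own estimator and are not sketched here.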
[0026] For example, in a medium-sized private room setting, the short-time energy of the human voice might be measured as -18dB, with a fundamental frequency stability of 5Hz; the dynamic range of the accompaniment might be 22dB, with a spectral centroid of 2800Hz; and the ambient reverberation time might be 0.8 seconds. These five characteristics constitute a digital snapshot of the current state of the human voice, the accompaniment, and the environment, providing a heterogeneous data foundation for subsequent fusion decisions.
[0027] In S2, the multiple features with different dimensions and numerical ranges extracted from S1 are transformed into a unitless, comparable scalar decision benchmark—the comprehensive sound field state factor D.
[0028] First, each feature is normalized by minimum-maximum scaling, which maps the original value to the interval [0, 1]. For any feature value F_i, its normalized value is F_i_norm = (F_i - F_i_min) / (F_i_max - F_i_min), where F_i_min and F_i_max are preset range limits. The ranges and their rationale are as follows.
Short-time energy of the human voice, [E_min, E_max] = [-60dB, -6dB]: -60dB roughly corresponds to the background noise level or a very weak voice, and -6dB is close to the headroom commonly kept in digital systems to avoid clipping, so the range covers the typical energy dynamics from almost no sound to full singing. For example, if the raw energy E_raw = -18dB is measured, then E_norm = (-18-(-60)) / (-6-(-60)) = 42 / 54 ≈ 0.78.
Fundamental frequency stability, [F0std_min, F0std_max] = [0Hz, 50Hz]: 0Hz indicates completely stable pitch, while a standard deviation of 50Hz corresponds to very large pitch fluctuations, far beyond normal singing (including vibrato), and can be treated as the boundary of extremely poor stability. For example, if F0std_raw = 5Hz is measured, then F0std_norm = (5-0) / (50-0) = 5 / 50 = 0.1.
Accompaniment dynamic range, [DR_min, DR_max] = [0dB, 30dB]: 0dB indicates that the maximum and minimum amplitudes within the accompaniment frame are almost the same (the dynamics are extremely compressed), while 30dB is a relatively large dynamic range that may occur in high-quality recordings. For example, if DR_raw = 22dB is measured, then DR_norm = (22-0) / (30-0) = 22 / 30 ≈ 0.73.
Accompaniment spectral centroid, [SC_min, SC_max] = [200Hz, 8000Hz]: 200Hz corresponds to accompaniment with energy concentrated in very low frequencies (such as heavy bass), while 8000Hz corresponds to accompaniment with energy concentrated in high frequencies (such as sharp electronic effects), so the range covers the main part of the audible spectrum. For example, if SC_raw = 2800Hz is measured, then SC_norm = (2800-200) / (8000-200) = 2600 / 7800 ≈ 0.33.
Environmental reverberation time, [RT60_min, RT60_max] = [0.2s, 3.0s]: 0.2s represents an anechoic chamber or a room with excellent acoustic treatment, and 3.0s represents a very reverberant space (such as a large church or an untreated warehouse), so the range covers common room types from extremely dry to extremely wet. For example, if RT60_raw = 0.8s is measured, then RT60_norm = (0.8-0.2) / (3.0-0.2) = 0.6 / 2.8 ≈ 0.21.
[0029] In summary, with the preset vocal energy range [-60dB, -6dB], the -18dB measurement normalizes to approximately 0.78; with the fundamental frequency stability range [0Hz, 50Hz], 5Hz normalizes to 0.1; with the accompaniment dynamic range [0dB, 30dB], 22dB normalizes to approximately 0.73; with the spectral centroid range [200Hz, 8000Hz], 2800Hz normalizes to approximately 0.33; and with the ambient reverberation time range [0.2s, 3.0s], 0.8s normalizes to approximately 0.21.
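The normalization arithmetic above can be reproduced with a short sketch. The ranges and raw values are those given in the description; the dictionary keys, the function name, and the clamping of out-of-range values to [0, 1] are assumptions added for the example.

```python
# Preset (min, max) ranges from the description; key names are hypothetical.
FEATURE_RANGES = {
    "vocal_energy_db": (-60.0, -6.0),
    "f0_std_hz": (0.0, 50.0),
    "accomp_dynamic_range_db": (0.0, 30.0),
    "accomp_spectral_centroid_hz": (200.0, 8000.0),
    "room_rt60_s": (0.2, 3.0),
}


def min_max_normalize(value, lo, hi):
    """Map value from [lo, hi] to [0, 1]; clamping is an assumption of this sketch."""
    scaled = (value - lo) / (hi - lo)
    return min(1.0, max(0.0, scaled))


# Raw measurements from the medium-sized private room example.
snapshot = {
    "vocal_energy_db": -18.0,
    "f0_std_hz": 5.0,
    "accomp_dynamic_range_db": 22.0,
    "accomp_spectral_centroid_hz": 2800.0,
    "room_rt60_s": 0.8,
}

normalized = {
    name: min_max_normalize(value, *FEATURE_RANGES[name])
    for name, value in snapshot.items()
}
for name, value in normalized.items():
    print(f"{name}: {value:.2f}")
```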
[0030] Subsequently, the normalized feature values are linearly weighted and summed using a pre-defined weighting model. The initial weight vector W_initial=[w1, w2, w3, w4, w5]=[0.30, 0.20, 0.15, 0.20, 0.15] is chosen based on the following: the vocal energy (w1=0.30) and fundamental frequency stability (w2=0.20) are given higher weights because the singer's state is central; the accompaniment features (w3, w4) and environmental features (w5) are used as moderating factors and have slightly lower weights. The sum of all weights is 1 to ensure consistent D-value scaling. This initial value can be determined through regression analysis using data from a limited number of typical scenarios. For example, continuing from the previous example, using the aforementioned normalized value to calculate D: D = 0.30*0.78 + 0.20*0.1 + 0.15*0.73 + 0.20*0.33 + 0.15*0.21 ≈ 0.234 + 0.02 + 0.1095 + 0.066 + 0.0315 ≈ 0.461. The value of D varies continuously between 0 and 1. An increase in its value comprehensively reflects the overall trend of high vocal intensity, stable pitch, large dynamic range of the accompaniment with a high frequency bias, and a more active environment, and vice versa. This single comprehensive sound field state factor D serves as the consensus for the entire system and will be used to coordinate the behavior of all subsequent processing units, avoiding the problem of conflicting decisions made by each unit based on different characteristics.
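The weighted sum in this paragraph can be checked with a few lines. This is a sketch only: the function name is hypothetical, and the inputs are the rounded normalized values quoted in the text, so the result matches the 0.461 figure rather than the slightly different value obtained from unrounded inputs.

```python
# Initial weights w1..w5 from the description (vocal energy, F0 stability,
# accompaniment dynamic range, accompaniment spectral centroid, room RT60).
W_INITIAL = [0.30, 0.20, 0.15, 0.20, 0.15]


def fuse_state_factor(normalized_features, weights):
    """Linear weighted fusion of normalized features into the scalar D."""
    if len(normalized_features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    return sum(w * f for w, f in zip(weights, normalized_features))


# Rounded normalized values quoted in the worked example.
features = [0.78, 0.10, 0.73, 0.33, 0.21]
D = fuse_state_factor(features, W_INITIAL)
print(f"D = {D:.3f}")
```

Because the weights sum to 1 and every input lies in [0, 1], D stays in [0, 1] as the paragraph states.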
[0031] In S3, the unified decision factor D obtained in the previous step is transformed into specific control parameters for multiple audio processing units, and they are driven to process the original human voice in parallel.
[0032] Furthermore, three mutually independent mapping functions are predefined. The first mapping function is responsible for reverberation control, with the activation threshold set to D_r_on = 0.4. Taking D = 0.45 as an example, since D ≥ D_r_on, the reverberation trigger threshold Th_r and the target reverberation time T60 are calculated according to linear relationships, e.g., Th_r = -15 * 0.45 + 16 = 9.25dBFS and T60 = 1.0 * 0.45 + 0.3 = 0.75 seconds.
[0033] The second mapping function is responsible for high-frequency equalization control: set the activation threshold D_eq_on = 0.5. Since the current D = 0.45 < D_eq_on, the high-frequency equalization gain G_hf is set to zero.
[0034] The third mapping function is responsible for compression control: set the activation threshold D_c_on = 0.6. Since D = 0.45 < D_c_on, the compressor threshold Th_c remains at the default value of -10dBFS.
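The three mapping functions can be sketched as piecewise-linear functions of D. The activation thresholds, the reverberation formulas, and the -10dBFS compressor default come from the text above. Everything else is an assumption made for illustration: the use of None as the "invalid value" that disables reverberation, the form G_hf = k_eq * (D - D_eq_on) with k_eq = 10, and the compressor slope k_c = 10, none of which the source specifies in this form.

```python
def map_reverb(d, d_r_on=0.4):
    """Return (Th_r in dBFS, T60 in seconds), or (None, None) when disabled.

    None as the disabling value is an assumption of this sketch.
    """
    if d < d_r_on:
        return None, None
    th_r = -15.0 * d + 16.0   # example linear relation from the description
    t60 = 1.0 * d + 0.3       # example linear relation from the description
    return th_r, t60


def map_eq_gain(d, d_eq_on=0.5, k_eq=10.0):
    """Return the high-frequency gain G_hf in dB (assumed linear form)."""
    if d < d_eq_on:
        return 0.0
    return k_eq * (d - d_eq_on)


def map_compressor_threshold(d, d_c_on=0.6, th_c0=-10.0, k_c=10.0):
    """Return the compressor threshold Th_c in dBFS (k_c is a hypothetical slope)."""
    if d < d_c_on:
        return th_c0
    return th_c0 - k_c * (d - d_c_on)


d = 0.45
th_r, t60 = map_reverb(d)
g_hf = map_eq_gain(d)
th_c = map_compressor_threshold(d)
print(th_r, t60, g_hf, th_c)
```

For D = 0.45 only the reverberation branch is active, matching the worked example: equalization stays at 0dB and the compressor keeps its default threshold.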
[0035] Subsequently, the original human voice audio signal is sent to three independent working units simultaneously: the reverberation processing unit determines whether to add reverberation according to Th_r (if the human voice energy exceeds 9.25dBFS, add it, and the reverberation time is 0.75 seconds), the frequency-domain equalization processing unit applies a gain of G_hf = 0dB (i.e., no change), and the dynamic compression processing unit works at a threshold of -10dBFS. These three units run in parallel, and their outputs are linearly superimposed after time alignment to generate a preliminarily processed signal. This architecture of homologous decision-making and parallel processing ensures that the processing basis for each effect is the original human voice signal without being contaminated by other effects.
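The fan-out structure described here, in which every unit reads the same unprocessed vocal signal and the outputs are summed, can be sketched as follows. The three unit functions are placeholders that only stand in for real reverberation, equalization, and compression, and the equal mixing gains are an assumption; the source says only that the outputs are linearly superimposed after time alignment.

```python
def process_in_parallel(vocal, units, mix_gains=None):
    """Feed the same original signal to every unit and sum the outputs.

    Equal mix gains are an assumption of this sketch.
    """
    if mix_gains is None:
        mix_gains = [1.0 / len(units)] * len(units)
    # Each unit gets its own copy, so no unit sees another unit's output.
    outputs = [unit(list(vocal)) for unit in units]
    return [
        sum(g * out[n] for g, out in zip(mix_gains, outputs))
        for n in range(len(vocal))
    ]


# Placeholder units (hypothetical, not real audio effects).
def fake_reverb(x):
    return [0.5 * s for s in x]


def fake_eq(x):
    return list(x)


def fake_compressor(x):
    return [max(-0.5, min(0.5, s)) for s in x]


vocal = [0.0, 0.9, -0.3]
mixed = process_in_parallel(vocal, [fake_reverb, fake_eq, fake_compressor])
print(mixed)
```

In a cascaded design the compressor would instead receive the equalizer's output; passing a fresh copy of the input to each unit is what keeps the units decoupled in this sketch.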
[0036] In S4, the processing results are monitored and the internal parameters are fine-tuned to improve long-term adaptability and stability.
[0037] First, perform real-time quality analysis on the final human voice signal output by S3 and calculate two key metrics.
[0038] The first is the transient distortion degree (TD), and its calculation focuses on the transient parts such as the attack of the human voice: separately obtain the original human voice transient region and the transient response signals of the reverberation, equalization, and compression units after separately processing this region; then, calculate the absolute value of the difference between the original signal and the response signals of each unit, and divide them by the weighting coefficients α, β, γ associated with the current T60, G_hf, Th_c values respectively (these weighting coefficients are configured to increase as the corresponding processing parameters are enhanced, for example, the larger T60 is, the larger the α value is, which means a higher tolerance for transient changes caused by strong reverberation processing and a relatively lower contribution weight in the distortion evaluation), and then perform a comprehensive operation to obtain the TD value.
[0039] Among them, the association rules for the weighted coefficients of transient distortion are as follows. Base values and slopes, such as base_α and k_α, are used: for example, α is set to 1.0 + 0.625 * T60. When T60 = 0.8s, α = 1.0 + 0.625 * 0.8 = 1.5. Similarly, β = 1.0 + 0.1 * G_hf (G_hf in dB), when G_hf = 3dB, β = 1.3; γ = 1.0 + 0.2 * (Th_c_default - Th_c), when Th_c_default = -10dBFS and Th_c = -12dBFS, then γ = 1.0 + 0.2 * 2 = 1.4. These association rules are fitted experimentally to make the weighted coefficients increase with increasing processing intensity, achieving adaptive tolerance.
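A minimal sketch of the transient distortion calculation follows, combining the weighting rules above with the per-sample computation described earlier (absolute differences divided by α, β, γ, summed, then root-mean-square over the transient region). Function names and the sample data are hypothetical, and detection of the transient region itself is not covered.

```python
import math


def tolerance_weights(t60, g_hf_db, th_c, th_c_default=-10.0):
    """Weighting coefficients alpha, beta, gamma from the example rules."""
    alpha = 1.0 + 0.625 * t60
    beta = 1.0 + 0.1 * g_hf_db
    gamma = 1.0 + 0.2 * (th_c_default - th_c)
    return alpha, beta, gamma


def transient_distortion(original, reverb_out, eq_out, comp_out,
                         alpha, beta, gamma):
    """RMS over the region of the weighted per-sample deviation sum."""
    total = 0.0
    for x, r, e, c in zip(original, reverb_out, eq_out, comp_out):
        component = abs(r - x) / alpha + abs(e - x) / beta + abs(c - x) / gamma
        total += component * component
    return math.sqrt(total / len(original))


alpha, beta, gamma = tolerance_weights(t60=0.8, g_hf_db=3.0, th_c=-12.0)

# Hypothetical two-sample transient region: only the reverb unit deviates.
original = [1.0, 1.0]
td = transient_distortion(original, [1.15, 1.15], original, original,
                          alpha, beta, gamma)
print(alpha, beta, gamma, td)
```

Dividing by a coefficient that grows with processing intensity is what lowers the contribution of a strongly driven unit, which is the tolerance behaviour the description calls for.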
[0040] Spectral flatness variation modulation factor parameter. Sensitivity adjustment coefficient λ=0.5: used to control the degree to which ΔSFM is amplified by the accompaniment features. This value is tuned experimentally to balance sensitivity to the accompaniment against overreaction.
[0041] Feedback correction coefficients. Slope decay factor α = 0.9 (for k_eq; distinct from the transient weighting coefficient α): When TD exceeds the limit, multiply the current k_eq by 0.9. For example, if the original k_eq = 10, the corrected k_eq' = 9. This factor is less than 1 but close to 1, ensuring a slow and stable correction. Baseline offset Δ = +1dB (for Th_c0): When TD exceeds the limit, increase Th_c0 by 1dB. For example, if the original Th_c0 = -10dBFS, the corrected Th_c0' = -9dBFS. Weight decay factor β = 0.85 (for w3, w4; distinct from the transient weighting coefficient β): When ΔSFM exceeds the limit, multiply w3 and w4 by 0.85. For example, if the original w4 = 0.20, the decayed w4 is 0.17. The weight vector W needs to be renormalized after correction.
[0042] The second is the spectral flatness change (ΔSFM), which calculates the difference in spectral flatness of the entire vocal frame before and after processing, and multiplies it by a modulation factor that includes the spectral center of gravity information (SC_norm) of the current accompaniment. For example, if the current accompaniment is rich in high frequencies (higher SC_norm), this factor will increase the calculated ΔSFM value accordingly, thus paying more attention to the changes in the vocal spectrum that may be caused by the accompaniment masking effect during evaluation.
[0043] Furthermore, preset safety thresholds are set, such as δ_td=0.12 and δ_sf=0.2. If the analysis finds that TD exceeds 0.12, it is determined that the current high-frequency equalization or compression settings may cause excessive transient distortion. Therefore, the slope k_eq of G_hf as D increases in the second mapping function is automatically reduced (e.g., adjusted from 10 to 9), and / or the baseline value of Th_c in the third mapping function is increased (e.g., adjusted from -10dBFS to -9dBFS), making the subsequent equalization enhancement more conservative and the compression start later.
[0044] If ΔSFM exceeds 0.2, it is determined that the current dependence on accompaniment features (especially the spectral centroid) may cause the processing strategy to deviate from the optimal. Therefore, the weight coefficients w3 and w4 corresponding to the accompaniment dynamic range and spectral centroid in the weight model are automatically reduced (for example, by multiplying them by a 0.85 attenuation factor and then renormalizing), thereby appropriately reducing the influence of accompaniment features in subsequent decisions.
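The two correction rules can be sketched as follows. The thresholds and factors are the example values given in the text. The `ModelState` container, the function name `apply_feedback`, and the initial weight vector (other than w4 = 0.20) are hypothetical, and the sketch applies both TD corrections together although the text allows "and / or".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelState:
    k_eq: float = 10.0    # slope of G_hf versus D
    th_c0: float = -10.0  # baseline compressor threshold, dBFS
    # Hypothetical weights w1..w5; only w4 = 0.20 appears in the text.
    weights: List[float] = field(
        default_factory=lambda: [0.30, 0.15, 0.15, 0.20, 0.20])


def apply_feedback(state: ModelState, td: float, delta_sfm: float,
                   delta_td: float = 0.12, delta_sf: float = 0.2) -> ModelState:
    """Slowly correct the mapping coefficients and the weight model."""
    if td > delta_td:
        state.k_eq *= 0.9    # more conservative equalization slope
        state.th_c0 += 1.0   # compression starts later
    if delta_sfm > delta_sf:
        state.weights[2] *= 0.85  # w3: accompaniment dynamic range
        state.weights[3] *= 0.85  # w4: accompaniment spectral centroid
        total = sum(state.weights)
        state.weights = [w / total for w in state.weights]  # renormalize W
    return state
```

Renormalizing after the decay keeps the weights summing to 1, so the reduced accompaniment weights shift influence to the vocal and environmental features rather than shrinking D overall.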
[0045] This slow, output-based coefficient correction allows for continuous fine-tuning of its internal model during use, better adapting to specific room acoustics, user singing habits, and accompaniment types, achieving a degree of self-optimization.
[0046] In summary, the dynamic threshold-based sound field adjustment method firstly acquires and extracts key acoustic features from three dimensions—voice, accompaniment, and environment—and integrates them into a unified comprehensive sound field state factor D. This provides a comprehensive and consistent decision-making basis for all subsequent audio processing, solving the problems of fixed parameters being unable to adapt to changing scenarios and the one-sidedness of single-dimensional adaptive decision-making. It enables a comprehensive consideration of the singer's state, accompaniment characteristics, and room acoustics to make an overall judgment that better meets the actual sound field requirements. Secondly, based on the unified factor D, a set of independent mapping functions drives the reverberation, equalization, and compression processing units in parallel. Each unit uses the original vocal signal as input for parallel processing. This physical architecture eliminates the processing coupling distortion caused by the step-by-step transmission and distortion of signal features in existing cascaded processing, ensuring that each audio effect can be applied independently and accurately based on pure, original signal features. Finally, a closed-loop feedback mechanism based on output signal quality is introduced. By calculating and processing the transient distortion and spectral flatness changes that are dynamically correlated with intensity and accompaniment context, and by slowly and continuously fine-tuning the internal weighting model and mapping function coefficients, this mechanism can not only respond to changes in the external sound field in real time, but also perform self-checking and self-correction based on its own processing results. This allows for gradual optimization of its internal model during use, improving its long-term adaptability and stability to different room acoustics, user singing habits, and diverse accompaniments.
[0047] In one implementation, the first acoustic feature set in S1 includes the short-time human voice energy and a fundamental frequency stability index; the second acoustic feature set includes the accompaniment dynamic range parameter and the accompaniment spectral centroid; and the third acoustic feature set includes the ambient reverberation time.
[0048] In this embodiment, it should be noted that S1 specifically defines the acoustic features extracted from the three signals. The sound field perception is the result of the physical interaction between the human voice, the accompaniment, and the environment. Therefore, it is necessary to select measurable physical quantities that can directly reflect the core state from these three dimensions.
[0049] For vocal audio signals, short-time vocal energy is selected as the primary feature because it directly quantifies the singer's vocal intensity and is the fundamental basis for determining whether the vocal needs enhancement, whether it can penetrate the accompaniment, and how much processing should be applied. For example, the gain, compression, and reverb strategies required for a weak vocal of -30dB are drastically different from those required for a loud vocal of -6dB. Simultaneously, fundamental frequency stability is selected as a supplementary feature because pitch fluctuations (such as vibrato or out-of-tune notes) are themselves a form of expression but can also affect the perceived sound of the processed audio. A stable fundamental frequency (e.g., a standard deviation of 2Hz) may indicate a flat performance, suitable for cleaner processing, while larger fluctuations (e.g., a standard deviation of 15Hz) may correspond to an emotionally charged performance, potentially requiring more relaxed reverb or dynamic control to accommodate these fluctuations and avoid a harshness in the processing.
[0050] Regarding the accompaniment audio signal, the dynamic range parameter is selected because it describes the fluctuation of the accompaniment itself. An accompaniment with a large dynamic range (such as classical music, which may have a dynamic range of up to 25dB) has drastic loudness variations, and its masking effect on the vocals varies in strength. Therefore, vocal processing (especially compressors) is required to have more sensitive tracking. On the other hand, an accompaniment with a small dynamic range (such as compressed pop music, which may only have a dynamic range of 10dB) will always provide a relatively uniform masking effect on the vocals. The spectral centroid of the accompaniment is selected to characterize the frequency distribution tendency of the accompaniment energy. An accompaniment with a high spectral centroid (such as rock music dominated by cymbals and electric guitars, where the centroid may be above 4000Hz) means that its high-frequency energy is rich, which will have a strong masking effect on the high-frequency parts of the vocals (such as sibilance and details). In this case, it may be necessary to intelligently boost the high frequencies of the vocals to maintain clarity, but excessive boosting should be avoided to prevent harshness.
[0051] Regarding the environmental acoustic reference signal, the ambient reverberation time (RT60) was selected as the core feature because it fundamentally defines the attenuation characteristics of sound in space and serves as the absolute reference benchmark for determining the amount of added artificial reverberation. In a dry room with an RT60 of 0.3 seconds, even if the singer wishes to achieve a sense of space, the added artificial reverberation time should be significantly shorter than that used in a reverberation room with an RT60 of 2.0 seconds to mask the excessively long reverberation of the original room. Combining these five features provides a relatively complete characterization of the objective state of the sound field at the current moment from five aspects: energy, pitch stability, dynamic contrast, spectral distribution, and spatial attenuation, laying a multi-dimensional data foundation for subsequent intelligent decision-making.
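Two of the five features can be illustrated with standard definitions. This is a sketch under stated assumptions, not the extraction method of the source: the energy is taken as the mean square of a frame relative to a full-scale amplitude of 1.0, the spectral centroid is the power-weighted mean frequency of a one-sided power spectrum, and both function names are hypothetical.

```python
import math
from typing import Sequence


def short_time_energy_dbfs(frame: Sequence[float]) -> float:
    """Mean-square level of one frame in dB relative to amplitude 1.0."""
    mean_square = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(mean_square, 1e-12))  # floor avoids log(0)


def spectral_centroid_hz(power: Sequence[float], sample_rate: float) -> float:
    """Power-weighted mean frequency; bins span 0 Hz to sample_rate / 2."""
    n_bins = len(power)
    total = sum(power)
    if n_bins < 2 or total <= 0.0:
        return 0.0
    bin_hz = (sample_rate / 2.0) / (n_bins - 1)
    return sum(k * bin_hz * p for k, p in enumerate(power)) / total
```

The fundamental frequency stability index, the accompaniment dynamic range, and RT60 are not sketched here because the text does not specify how they are measured.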
[0052] In one implementation, the parameter mapping functions in S3 include: a first mapping function, used to calculate the reverberation trigger threshold and the target reverberation time based on the comprehensive sound field state factor D; a second mapping function, used to calculate the high-frequency equalization gain based on the comprehensive sound field state factor D; and a third mapping function, used to calculate the compressor threshold based on the comprehensive sound field state factor D.
[0053] In this embodiment, it should be noted that in S3, this embodiment defines a specific execution architecture for transforming the unified decision factor D into control parameters for each processing unit, namely a set of parameter mapping functions. The key to this design lies in dedicated functions for specific purposes and input from the same source.
[0054] The first mapping function is specifically responsible for generating two parameters related to the reverb effect: the reverb trigger threshold (Th_r) and the target reverb time (T60). Th_r determines the intensity of the voice required to trigger the reverb effect, preventing weak speech or gaps from being contaminated by unnecessary reverb tails; T60 sets the length of the reverb tail.
[0055] The second mapping function is specifically responsible for generating the high-frequency equalization gain (G_hf), which controls the amount of boost to the high-frequency range of the original vocals (e.g., a preset center frequency of 5kHz) to adjust the brightness and penetration of the vocals.
[0056] The third mapping function is specifically responsible for generating the compressor threshold (Th_c), which is used to set the level threshold at which the dynamic compressor starts working, in order to control the dynamic range of the human voice, prevent peak overload, and improve the average loudness.
[0057] All three functions take the same integrated sound field state factor D as input. For example, when the current value of D is calculated to be 0.7, this value will be input to these three functions simultaneously and in parallel.
[0058] The first function might calculate Th_r = -15*0.7 + 16 = 5.5 dBFS and T60 = 1.0*0.7 + 0.3 = 1.0 seconds; the second function calculates G_hf = 10*(0.7 - 0.5) = 2.0 dB; and the third function calculates Th_c = -8*(0.7 - 0.6) - 10 = -10.8 dBFS. This design ensures that the parameter changes of reverberation, equalization, and compression are logically consistent, all responding to the same comprehensive assessment of the sound field state (increased D value). It avoids the contradictions that may occur in existing schemes: for example, one algorithm may determine that strong compression (lowering the threshold) is needed based solely on vocal energy, while another algorithm may determine that weakening processing (reducing equalization gain) is needed based on the accompaniment spectrum, leading to conflicting effects.
[0059] In one implementation, the calculation logic of the first mapping function in S3 is configured as follows: when the integrated sound field state factor D is lower than the first activation threshold D_r_on, the reverberation trigger threshold is set to an invalid value that makes the reverberation processing unit inactive; when D is greater than or equal to D_r_on, the reverberation trigger threshold decreases linearly with the increase of D, and the target reverberation time increases linearly with the increase of D. The calculation logic of the second mapping function in S3 is configured as follows: when the integrated sound field state factor D is lower than the second activation threshold D_eq_on, the high-frequency equalization gain is set to zero; when D is greater than or equal to D_eq_on, the high-frequency equalization gain increases linearly with the increase of D. The calculation logic of the third mapping function in S3 is configured as follows: when the integrated sound field state factor D is lower than the third activation threshold D_c_on, the compressor threshold is set to a default value; when D is greater than or equal to D_c_on, the compressor threshold decreases linearly as D increases.
[0060] In this embodiment, it should be noted that S3 specifies the specific calculation logic inside each parameter mapping function. Its core is a piecewise linear mapping strategy, which provides clear and implementable rules for the conversion of D value to control parameter.
[0061] Taking the first mapping function as an example, it sets the activation threshold D_r_on for the reverberation effect, preferably D_r_on = 0.4. When D is below this value, the current sound field is considered to have a low demand for artificial reverberation. The linear coefficients a_r = -15 and b_r = 16 (for Th_r = a_r * D + b_r) are designed so that when D = 0.4, Th_r = -15 * 0.4 + 16 = 10 dBFS (a relatively high threshold), and when D = 1.0, Th_r = 1 dBFS (a relatively low threshold). The linear coefficients a_t = 1.0 and b_t = 0.3 (for T60 = a_t * D + b_t) are designed so that the reverberation time increases linearly from 0.7s when D = 0.4 to 1.3s when D = 1.0. For example, when D=0.65, Th_r=-15*0.65+16=6.25dBFS and T60=1.0*0.65+0.3=0.95s.
[0062] Furthermore, when the overall assessment indicates that the current sound field demand is low (D < 0.4), such as in a quiet environment, with soft singing and simple accompaniment, it is determined that no artificial reverb needs to be added. In this case, the reverb trigger threshold Th_r is set to an extremely high invalid value (e.g., +20dBFS), far exceeding the normal volume of human voice, thus ensuring that the reverb processing unit is not triggered and the human voice remains dry. Once the D value reaches or exceeds 0.4, indicating that the sound field state is becoming more active (perhaps the human voice is louder, the accompaniment is more complex, or the environment is more reverberant), reverb is activated. At this point, Th_r decreases linearly with increasing D (e.g., Th_r = -15*D + 16), meaning that the higher the demand (the larger D), the lower the human voice energy at which reverb is triggered, so that reverb is applied more frequently and more noticeably. Simultaneously, the target reverb time T60 increases linearly with D (e.g., T60 = 1.0*D + 0.3), meaning that with higher demand, the added reverb tail will be longer to create a stronger sense of space.
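The first mapping function can be sketched as a piecewise linear rule using the example coefficients above. The function name `map_reverb` is hypothetical, and because the text does not state a T60 value for the inactive case, the sketch returns `None` there as a placeholder.

```python
from typing import Optional, Tuple


def map_reverb(d: float,
               d_r_on: float = 0.4,
               a_r: float = -15.0, b_r: float = 16.0,
               a_t: float = 1.0, b_t: float = 0.3,
               th_r_invalid: float = 20.0) -> Tuple[float, Optional[float]]:
    """Return (Th_r in dBFS, T60 in seconds) for a sound field state factor D."""
    if d < d_r_on:
        # Invalid threshold keeps the reverberation unit inactive;
        # T60 is unspecified in this case (assumption: None).
        return th_r_invalid, None
    th_r = a_r * d + b_r  # trigger threshold falls as D rises
    t60 = a_t * d + b_t   # reverberation tail lengthens as D rises
    return th_r, t60


th_r, t60 = map_reverb(0.65)
```

For D = 0.65 this gives Th_r = 6.25 dBFS and T60 = 0.95 s, matching the worked example.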
[0063] The second and third mapping functions follow similar logic but are designed for different processing units. The second function sets an activation threshold D_eq_on=0.5, slightly higher than the reverberation threshold, indicating that high-frequency compensation is only activated when a stronger sound field is required. The linear slope k_eq_initial=10 (unit: dB / (unit D)) determines the rate at which G_hf increases with D. The initial value is based on listening experiments, causing G_hf to produce a typical gain change of 0dB to +5dB in the range of D=0.5 to D=1.0; for example, when D=0.75, G_hf=10*(0.75-0.5)=2.5dB.
[0064] When D < 0.5, the high-frequency equalization gain G_hf remains at 0dB and no processing is performed. When D ≥ 0.5, G_hf begins to increase linearly with D, providing more high-frequency boost to enhance penetration when demand is high.
[0065] The third function sets the compression activation threshold D_c_on=0.6, the highest setting, indicating that more aggressive compression is only enabled when the sound field demand is very high. The default threshold Th_c_default=-10dBFS, a relatively lenient compression starting point suitable for general cases. The linear slope k_c=-8 (unit: dB / (unit D)) determines the rate at which Th_c decreases with D. It is designed so that when D increases from 0.6 to 1.0, Th_c decreases from -10dBFS to approximately -13.2dBFS; for example, when D=0.8, Th_c=-8*(0.8-0.6)-10=-1.6-10=-11.6dBFS.
[0066] When D < 0.6, the compressor threshold Th_c maintains a relatively lenient default value (e.g., -10dBFS), resulting in a lighter compression. When D ≥ 0.6, Th_c decreases linearly with D (e.g., Th_c = -8*(D-0.6)-10), which means that when demand is high, the compressor will intervene earlier (at a lower threshold), providing more aggressive dynamic control and helping to improve the stability and loudness of vocals.
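The second and third mapping functions follow the same piecewise linear pattern and can be sketched with the example values above. The function names `map_eq_gain` and `map_comp_threshold` are hypothetical.

```python
def map_eq_gain(d: float, d_eq_on: float = 0.5, k_eq: float = 10.0) -> float:
    """High-frequency equalization gain G_hf in dB."""
    if d < d_eq_on:
        return 0.0               # no high-frequency boost below activation
    return k_eq * (d - d_eq_on)  # rises linearly above activation


def map_comp_threshold(d: float, d_c_on: float = 0.6, k_c: float = -8.0,
                       th_c_default: float = -10.0) -> float:
    """Compressor threshold Th_c in dBFS."""
    if d < d_c_on:
        return th_c_default                  # lenient default threshold
    return k_c * (d - d_c_on) + th_c_default  # falls linearly above activation


g_hf = map_eq_gain(0.75)
th_c = map_comp_threshold(0.8)
```

These calls reproduce the worked values G_hf = 2.5 dB at D = 0.75 and Th_c = -11.6 dBFS at D = 0.8. The slope `k_eq` is kept as a parameter because the feedback step in S4 may reduce it.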
[0067] This piecewise linear mapping design makes the behavior clear and predictable, and the threshold values of the three functions (0.4, 0.5, 0.6) constitute a progressive response sequence: reverberation is considered first, followed by equalization, and finally more aggressive compression. This is consistent with the logical hierarchy in sound processing, which usually involves shaping space and timbre first, and then finely controlling the dynamics.
[0068] In one implementation, obtaining quality evaluation metrics, including transient distortion, from the parallel-processed and synthesized human voice output signal in step S4 includes: The original transient region of the human voice audio signal is obtained, and the transient response signals of the reverberation processing unit, frequency domain equalization processing unit and dynamic compression processing unit in the original transient region are obtained respectively. The transient distortion is calculated based on the original transient region, the transient response signals of each processing unit, and the weighting coefficients associated with the target reverberation time, high-frequency equalization gain, and compressor threshold, respectively. The calculation logic for transient distortion is as follows: For each sampling point in the original transient region, calculate the absolute value of the first difference between the transient response signal of the reverberation processing unit and the original transient signal, the absolute value of the second difference between the transient response signal of the frequency domain equalization processing unit and the original transient signal, and the absolute value of the third difference between the transient response signal of the dynamic compression processing unit and the original transient signal; divide the absolute value of the first difference by the first weighting coefficient, divide the absolute value of the second difference by the second weighting coefficient, divide the absolute value of the third difference by the third weighting coefficient, and add the three results to obtain the transient distortion component of that sampling point; after squaring, summing, and averaging the transient distortion components of all sampling points, calculate the square root as the final transient distortion. 
Among them, the first weighting coefficient, the second weighting coefficient, and the third weighting coefficient are respectively associated with the current target reverberation time, the high-frequency equalization gain, and the compressor threshold, and are all positive numbers (so that the contribution weight of each processing unit to the transient distortion is dynamically adjusted according to its processing intensity).
[0069] Furthermore, the transient distortion TD is calculated according to the following formula: $TD=\sqrt{\frac{1}{N}\sum_{n=1}^{N}\left[\frac{|y_{rev}(n)-x(n)|}{\alpha}+\frac{|y_{eq}(n)-x(n)|}{\beta}+\frac{|y_{comp}(n)-x(n)|}{\gamma}\right]^{2}}$, where x(n) represents the normalized amplitude of the human voice audio signal at the nth sampling point in the original transient region; y_rev(n), y_eq(n), and y_comp(n) represent the normalized amplitude of the nth sampling point in the transient response signal output by the reverberation processing unit, the frequency domain equalization processing unit, and the dynamic compression processing unit, respectively; N represents the total number of sampling points in the original transient region; and α, β, and γ are weighting coefficients associated with the current target reverberation time T60, the high-frequency equalization gain G_hf, and the compressor threshold Th_c, respectively, all of which are greater than zero.
[0070] In this embodiment, it should be noted that in S4, the specific calculation method of the quality evaluation index transient distortion (TD) is defined in detail. Its core design lies in separately evaluating the impact of each independent processing unit on the transient characteristics of the original signal under the parallel processing architecture, and integrating it with a weighting coefficient that is adaptive to the current processing intensity, thereby achieving accurate and reasonable quantification of the transient distortion caused by processing.
[0071] First, it is necessary to identify and acquire the original transient region. This is usually achieved by detecting the fast rising edge of the amplitude in the original human voice audio signal. For example, when the amplitude difference between two consecutive sampling points exceeds a certain threshold (such as 5% of the full scale), it is determined to be the transient start point, and the signal within a certain time window (such as a total of 10 milliseconds) before and after the point is extracted as the analysis object.
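A minimal sketch of this detection step follows. It assumes samples normalized to a full scale of 1.0, interprets the "fast rising edge" as an increase in sample magnitude between consecutive points, and centers the analysis window on the detected point; these choices and the name `find_transient_regions` are assumptions rather than details given in the text.

```python
from typing import List, Sequence, Tuple


def find_transient_regions(x: Sequence[float], sample_rate: int,
                           jump: float = 0.05,
                           window_ms: float = 10.0) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of windows around detected onsets."""
    n_win = max(1, int(round(sample_rate * window_ms / 1000.0)))
    half = n_win // 2
    regions: List[Tuple[int, int]] = []
    i = 1
    while i < len(x):
        # Rising edge: magnitude grows by more than `jump` in one sample.
        if abs(x[i]) - abs(x[i - 1]) > jump:
            start = max(0, i - half)
            end = min(len(x), start + n_win)
            regions.append((start, end))
            i = end  # continue after this window so regions do not overlap
        else:
            i += 1
    return regions
```

At a 48kHz sampling rate the default 10 millisecond window is 480 samples long.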
[0072] Subsequently, instead of directly analyzing the final synthesized output signal, the transient response signals generated by the reverberation processing unit, frequency domain equalization processing unit, and dynamic compression processing unit, respectively, after processing the original transient region signal individually, are obtained. This means that the same original transient signal needs to be sent to three processing units (the parameters of which are set in real time by step S3) and their independent outputs recorded.
[0073] For example, suppose the parameters set in the current S3 step are: target reverberation time T60 = 0.8 seconds, high-frequency equalization gain G_hf = +3dB, and compressor threshold Th_c = -12dBFS. The original transient region is a signal segment with a length of N = 480 sampling points (corresponding to 10 milliseconds at a 48kHz sampling rate), and its normalized amplitude sequence is denoted as x(n). After this signal segment is fed into the three processing units with the parameters as above, three output sequences are obtained: the transient response y_rev(n) of the reverberation unit (which may include early reflections of reverberation), the transient response y_eq(n) of the equalization unit (high frequencies are boosted), and the transient response y_comp(n) of the compression unit (the portion greater than -12dBFS is attenuated). The purpose of this step is to observe what changes each processing unit makes to the transient signal individually, laying the foundation for subsequent distortion decomposition and evaluation.
[0074] After obtaining the above signals, the core transient distortion calculation is performed. The calculation formula is: $TD=\sqrt{\frac{1}{N}\sum_{n=1}^{N}\left[\frac{|y_{rev}(n)-x(n)|}{\alpha}+\frac{|y_{eq}(n)-x(n)|}{\beta}+\frac{|y_{comp}(n)-x(n)|}{\gamma}\right]^{2}}$.
[0075] The calculation process is as follows: For each sampling point n within the transient region, calculate the absolute value of the difference between the original signal x(n) and the response signals y_rev(n), y_eq(n), and y_comp(n) of the three processing units. These three absolute differences represent the instantaneous changes caused by reverberation, equalization, and compression processing at that sampling point.
[0076] Furthermore, dynamic weighting coefficients α, β, and γ are introduced. These coefficients are not fixed values but are associated with the control parameters generated in real time in step S3: target reverberation time T60, high-frequency equalization gain G_hf, and compressor threshold Th_c. When the parameters of a processing unit indicate a high processing intensity, the resulting transient signal changes are somewhat expected and should therefore be partially tolerated when evaluating overall distortion; that is, a larger coefficient is assigned to reduce the weight of that path difference in the sum. For example, α can be defined as base_α + k_α * T60, where base_α and k_α are positive constants. Assuming T60 = 0.8 seconds, α = 1.5 is calculated. Similarly, β = base_β + k_β * G_hf is defined, and when G_hf = +3dB, β = 1.3 is calculated. Define γ = base_γ + k_γ * (Th_c_default - Th_c), because the lower Th_c is, the more aggressive the compression. Assuming Th_c_default = -10dBFS and the current Th_c = -12dBFS, then we can calculate γ = 1.4.
[0077] Then, the three absolute differences of each sampling point are divided by the corresponding α, β, and γ, respectively, and then summed to obtain the instantaneous distortion component of that sampling point. The instantaneous distortion components of all N sampling points are squared, summed, divided by N, and then the square root is taken to finally obtain the transient distortion degree TD. Taking a specific sampling point as an example, assuming |x(n)|=0.5, |y_rev(n)|=0.52, |y_eq(n)|=0.58, |y_comp(n)|=0.48, the unweighted sum of differences is |0.02|+|0.08|+|0.02|=0.12. After weighting, the contribution of this point becomes 0.02 / 1.5+0.08 / 1.3+0.02 / 1.4≈0.0133+0.0615+0.0143≈0.0891.
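The calculation described above can be sketched as a single function. The name `transient_distortion` is hypothetical; the inputs are the normalized amplitude sequences of the original transient region and of the three per-unit responses.

```python
import math
from typing import Sequence


def transient_distortion(x: Sequence[float],
                         y_rev: Sequence[float],
                         y_eq: Sequence[float],
                         y_comp: Sequence[float],
                         alpha: float, beta: float, gamma: float) -> float:
    """Root mean square of the per-sample weighted distortion components."""
    n = len(x)
    if n == 0 or not (len(y_rev) == len(y_eq) == len(y_comp) == n):
        raise ValueError("all sequences must be non-empty and equally long")
    if min(alpha, beta, gamma) <= 0.0:
        raise ValueError("weighting coefficients must be positive")
    acc = 0.0
    for xi, r, e, c in zip(x, y_rev, y_eq, y_comp):
        # Each path difference is divided by its tolerance coefficient.
        component = abs(r - xi) / alpha + abs(e - xi) / beta + abs(c - xi) / gamma
        acc += component * component
    return math.sqrt(acc / n)


td_point = transient_distortion([0.5], [0.52], [0.58], [0.48], 1.5, 1.3, 1.4)
```

For a region consisting only of the sample point from the example, the result is about 0.089, equal to that point's weighted component.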
[0078] As can be seen, because the equalization gain G_hf is large, its change (0.08) is partially discounted by the coefficient β=1.3, and its absolute contribution falls from 0.08 to approximately 0.0615. Its share of the point's total nevertheless moves from 66.7% to approximately 69%, because the reverberation and compression differences are divided by the even larger coefficients α=1.5 and γ=1.4. By performing this type of calculation and statistics on all points, the final TD value (e.g., 0.15) can more reasonably reflect the degree of transient distortion caused by parallel processing that exceeds the expected tolerance range under the current specific processing intensity setting, providing a more accurate basis for subsequent feedback correction.
[0079] In one implementation, the quality evaluation index, including the change in spectral flatness, obtained from the parallel-processed and synthesized human voice output signal in step S4 includes: Calculate the first spectral flatness of the human voice audio signal within a preset time window, and the second spectral flatness of the human voice output signal within the same time window; The change in spectral flatness is calculated based on the first spectral flatness, the second spectral flatness, the centroid of the current normalized accompaniment spectrum, and the dynamic range parameter of the current normalized accompaniment. The calculation logic for the change in spectral flatness is as follows: calculate the absolute value of the difference between the second spectral flatness and the first spectral flatness, and multiply the absolute value by a modulation factor. The modulation factor is configured such that its value is equal to 1 plus a preset sensitivity adjustment coefficient multiplied by the ratio of the current normalized accompaniment spectrum centroid to the sum of the current normalized accompaniment spectrum centroid and the accompaniment dynamic range parameter (so that the change in spectral flatness is amplified when the accompaniment has rich high-frequency components, so as to improve the sensitivity to changes in the spectral uniformity of the sound field).
[0080] Furthermore, the spectral flatness variation ΔSFM is calculated according to the following formula: $\Delta SFM=\left|SFM_{out}-SFM_{in}\right|\cdot\left(1+\lambda\cdot\frac{SC_{norm}}{SC_{norm}+DR_{norm}}\right)$, where $SFM_{in}$ is the first spectral flatness and $SFM_{out}$ is the second spectral flatness; $SC_{norm}$ and $DR_{norm}$ are the normalized accompaniment spectrum centroid and accompaniment dynamic range parameter at the current time, respectively; λ is the preset sensitivity adjustment coefficient, and λ≥0.
[0081] In this embodiment, it should be noted that in S4, a method for calculating the quality evaluation index of spectral flatness change (ΔSFM) is specifically defined. Its core innovation lies in combining objective signal spectrum change measurement with contextual awareness of the current accompaniment acoustic characteristics, enabling a more intelligent judgment on whether the changes in vocal timbre caused by processing are reasonable.
[0082] First, the spectral flatness (SFM) of the human voice audio before and after processing needs to be calculated. Spectral flatness is an indicator of how uniformly the power spectrum of a signal is distributed over frequency. It is defined as the ratio of the geometric mean to the arithmetic mean of the power spectrum, so its value lies between 0 and 1 (or at most 0 dB once the logarithm is taken). A signal whose energy is spread uniformly across frequency (such as white noise) has an SFM close to 1 (0 dB), while signals with energy concentrated in specific frequency bands (such as human voice or musical notes) have a lower SFM value.
[0083] In the specific calculation, the same preset time window (e.g., a frame of 20 milliseconds) is taken for both the human voice output signal from step S3 and the original human voice audio signal. A short-time Fourier transform is performed on each frame to obtain the power spectrum P_in(k) and P_out(k). The first spectral flatness SFM_in and the second spectral flatness SFM_out are calculated using the formula: SFM=exp((1 / K)*Σln(P(k))) / ((1 / K)*ΣP(k)), where K is the number of frequency points, usually expressed in dB after taking the logarithm. For example, in a scenario, the SFM_in of the original human voice frame is calculated to be -5.5dB, while the SFM_out of the processed output frame is -7.2dB. This indicates that the processed human voice spectrum becomes relatively less flat (lower value), possibly because high-frequency equalization boosting or reverberation / compression alters the spectral distribution. Simply calculating the difference |ΔSFM|=|-7.2-(-5.5)|=1.7dB can quantify the magnitude of the change, but it cannot determine whether such a change is acceptable in the current sound field environment.
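The SFM formula above can be sketched as follows for a power spectrum that has already been computed for one frame. The small `eps` guard against zero-valued bins and the use of 10·log10 for the dB conversion are assumptions, as are the function names.

```python
import math
from typing import Sequence


def spectral_flatness(power: Sequence[float], eps: float = 1e-12) -> float:
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    k = len(power)
    log_mean = sum(math.log(p + eps) for p in power) / k  # (1/K) * sum ln P(k)
    arith_mean = sum(power) / k + eps                     # (1/K) * sum P(k)
    return math.exp(log_mean) / arith_mean


def flatness_db(power: Sequence[float]) -> float:
    """Spectral flatness in dB (assumed 10 * log10 of the ratio)."""
    return 10.0 * math.log10(spectral_flatness(power))
```

A flat spectrum yields a value near 1 (0 dB), while a spectrum dominated by a single bin yields a value close to 0 and a strongly negative dB figure.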
[0084] Furthermore, to endow the aforementioned objective measurements with contextualized judgment capabilities, this invention introduces a modulation factor dynamically correlated with the accompaniment characteristics. This modulation factor is defined as: Modulation Factor = 1 + λ * (SC_norm / (SC_norm + DR_norm)). Here, SC_norm and DR_norm are the normalized accompaniment spectral centroid and dynamic range parameters from step S2 at the current time. λ is a preset sensitivity adjustment coefficient, for example, set to 0.5. The technical logic of this design is that the spectral centroid (SC_norm) of the accompaniment reflects the frequency distribution tendency of the accompaniment energy.
[0085] When the SC_norm value is high, it means that the accompaniment contains more high-frequency components (such as cymbals and the high range of strings). These high-frequency energies will have a strong masking effect on the mid-to-high frequency parts of the human voice (such as sibilance and details).
[0086] In this situation, in order to maintain the clarity and intelligibility of the human voice in the mixture, it may be a necessary and reasonable processing strategy to moderately boost the high frequencies of the human voice through equalization. This will naturally reduce the spectral flatness of the human voice signal (making its spectrum more uneven).
[0087] Therefore, when a high SC_norm is detected, the modulation factor amplifies the calculated value of ΔSFM, which in effect increases the sensitivity of the spectral variation indicator. This prompts the feedback mechanism to intervene earlier and more proactively, so that over-reliance on the accompaniment's spectral characteristics does not keep producing decisions that lead to timbre imbalance.
[0088] For example, suppose the current accompaniment is an electronic dance track. The S2 calculations yield SC_norm = 0.8 (rich high frequencies) and DR_norm = 0.3 (moderate dynamics), and λ is set to 0.5. The modulation factor = 1 + 0.5 * (0.8 / (0.8 + 0.3)) ≈ 1 + 0.5 * 0.727 ≈ 1.364. If |SFM_out - SFM_in| = 1.7 dB, then the final ΔSFM = 1.7 dB * 1.364 ≈ 2.32 dB. The preset second safety threshold δ_sf is 0.2 dB, so 2.32 dB far exceeds the threshold. At this point, an excessive spectral change is not simply assumed to be bad. Instead, taking the accompaniment context into account, the threshold-exceeding signal is interpreted as follows: in the current high-frequency-rich accompaniment environment, the processing strategy guided by the current weighting model (which may assign a high weight to SC_norm) has caused a sufficiently significant change in vocal timbre, so the decision-making criteria need to be reviewed and adjusted.
[0089] Therefore, a correction to the weighting model is triggered, namely, reducing the weight coefficients w3 (dynamic range) and w4 (spectral centroid) in the weight vector W corresponding to the second acoustic feature set (accompaniment features), for example, by multiplying each by an attenuation factor of 0.9.
[0090] After correction, the influence of accompaniment features on the subsequent calculation of the integrated sound field state factor D is reduced, and the decision-making will rely more on the characteristics of the human voice itself and environmental features. This helps to find a more balanced spectral processing strategy in subsequent processing. This mechanism tightly couples quality assessment with the real-time operating state and input environment, achieving more context-aware adaptive optimization.
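The worked example in paragraphs [0084] to [0088] can be reproduced with a short sketch. The function names are illustrative; returning a neutral factor of 1 when SC_norm + DR_norm is zero is an assumption, since the source does not say how that case is handled.

```python
def modulation_factor(sc_norm, dr_norm, lam=0.5):
    """Modulation Factor = 1 + lambda * (SC_norm / (SC_norm + DR_norm))."""
    denom = sc_norm + dr_norm
    if denom <= 0.0:
        # Assumption: fall back to a neutral factor when both features are zero.
        return 1.0
    return 1.0 + lam * (sc_norm / denom)

def modulated_delta_sfm(sfm_in_db, sfm_out_db, sc_norm, dr_norm, lam=0.5):
    """Final delta-SFM: absolute flatness difference times the modulation factor."""
    return abs(sfm_out_db - sfm_in_db) * modulation_factor(sc_norm, dr_norm, lam)

def exceeds_second_safety_threshold(delta_sfm_db, delta_sf_db=0.2):
    """True when the weighting model correction should be triggered."""
    return delta_sfm_db > delta_sf_db

# Values from the electronic dance track example.
factor = modulation_factor(0.8, 0.3, lam=0.5)
delta = modulated_delta_sfm(-5.5, -7.2, 0.8, 0.3, lam=0.5)
print(round(factor, 4), round(delta, 2), exceeds_second_safety_threshold(delta))
```

Because the factor lies between 1 and 1 + λ, the accompaniment context can only raise the sensitivity of the indicator, never lower it below the raw difference.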
[0091] In one implementation, modifying the parameter mapping function in S4 includes: when the transient distortion in the quality evaluation index exceeds the first safety threshold, reducing the slope of the high-frequency equalization gain as a function of D in the second mapping function, and / or increasing the reference value of the compressor threshold in the third mapping function. The feedback correction of the weighting model in S4 includes: when the change in spectral flatness in the quality evaluation index exceeds the second safety threshold, reducing the weight coefficients in the weighting model corresponding to the features in the second acoustic feature set.
[0092] In this embodiment, it should be noted that in S4, specific logic is defined for correcting the parameter mapping function when the transient distortion (TD) exceeds a preset first safety threshold (δ_td). The first safety threshold is a transient distortion safety threshold δ_td = 0.12, determined through a limited number of listening tests. When the calculated TD value exceeds this threshold, most listeners can perceive significant transient degradation.
[0093] The goal of this correction mechanism is to address the transient signal degradation that may be caused by excessive high-frequency equalization or dynamic compression processing. The correction does not directly and abruptly disable an effect, but rather fine-tunes the rules for generating control parameters to make subsequent behavior more conservative.
[0094] The technical reasoning process is as follows: an abnormal increase in transient distortion (TD) indicates that, when processing rapidly changing parts of the human voice such as attacks and plosives, the current processing (especially equalization boosting and compression) has led to perceptible, unacceptable nonlinear distortion or loss of detail.
[0095] For example, in a musical phrase where the singing intensity suddenly increases, a higher overall sound field state factor D (assuming D = 0.75) generates stronger processing parameters: the high-frequency equalization gain G_hf might reach +4 dB, and the compressor threshold Th_c might be as low as -14 dBFS. If the subsequently calculated TD value reaches 0.18, exceeding δ_td = 0.12, then the relationship between processing intensity and the state factor D currently defined by the second mapping function (equalization) and the third mapping function (compression) might be too aggressive and needs to be moderately weakened.
[0096] The correction involves two aspects. First, the slope k_eq of the high-frequency equalization gain G_hf as a function of D in the second mapping function is reduced. Assume the original mapping function is G_hf = k_eq * (D - D_eq_on), where the initial value of k_eq is 10 dB / (unit D) and D_eq_on = 0.5. When TD exceeds the limit, k_eq is multiplied by an attenuation factor less than 1, such as 0.9, resulting in a new slope k_eq' of 9 dB / (unit D). This means that for the same future D value (e.g., 0.75), the calculated G_hf will decrease from the original 2.5 dB (10 * (0.75 - 0.5)) to 2.25 dB (9 * (0.75 - 0.5)), making the equalization boost slightly more moderate.
[0097] Second, the baseline value Th_c0 of the compressor threshold Th_c in the third mapping function is increased. Assume the original mapping function is Th_c = k_c * (D - D_c_on) + Th_c0 when D ≥ D_c_on, where Th_c0 has an initial value of -10 dBFS, k_c is a negative slope (e.g., -8 dB / (unit D)), and D_c_on = 0.6. During the correction, a positive offset Δ, for example +1 dB, is added to Th_c0, making the new baseline value Th_c0' equal to -9 dBFS. Thus, for the same future D value (0.75), the calculated Th_c will change from -11.2 dBFS (-8 * (0.75 - 0.6) - 10) to -10.2 dBFS (-8 * (0.75 - 0.6) - 9), raising the compressor activation threshold by 1 dB and making the compression less aggressive. This coordinated correction of reducing the slope and raising the baseline allows a more restrained equalization and compression strategy to be adopted automatically when similar high sound field demands (high D values) are encountered in the future. This helps to reduce the risk of transient distortion while retaining the basic ability to make adaptive adjustments based on the D value.
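The two corrections described in paragraphs [0096] and [0097] can be sketched as follows. The activation thresholds D_eq_on = 0.5 and D_c_on = 0.6 are inferred from the arithmetic in the examples; using Th_c0 as the default compressor threshold below D_c_on, and the class and function names, are assumptions made for illustration. The safety threshold is passed in as a parameter rather than fixed.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class MappingParams:
    k_eq: float = 10.0     # dB per unit D, slope of the second mapping function
    d_eq_on: float = 0.5   # activation threshold inferred from the example
    k_c: float = -8.0      # dB per unit D, slope of the third mapping function
    d_c_on: float = 0.6    # activation threshold inferred from the example
    th_c0: float = -10.0   # dBFS, compressor threshold baseline

def hf_gain_db(d, p):
    """Second mapping function: zero below D_eq_on, linear in D above it."""
    return p.k_eq * (d - p.d_eq_on) if d >= p.d_eq_on else 0.0

def compressor_threshold_dbfs(d, p):
    """Third mapping function; below D_c_on the baseline is used (assumption)."""
    return p.k_c * (d - p.d_c_on) + p.th_c0 if d >= p.d_c_on else p.th_c0

def correct_for_transient_distortion(p, td, delta_td, attenuation=0.9, offset_db=1.0):
    """Soften the mappings when TD exceeds the first safety threshold."""
    if td <= delta_td:
        return p
    return replace(p, k_eq=p.k_eq * attenuation, th_c0=p.th_c0 + offset_db)

before = MappingParams()
after = correct_for_transient_distortion(before, td=0.18, delta_td=0.12)
print(hf_gain_db(0.75, before), hf_gain_db(0.75, after))
print(compressor_threshold_dbfs(0.75, before), compressor_threshold_dbfs(0.75, after))
```

Because only the slope and the baseline change, the mappings remain functions of D, so the adaptive behavior is kept while the same D value now yields gentler settings.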
[0098] Furthermore, this implementation defines the specific logic for correcting the weighting model when the change in spectral flatness (ΔSFM) exceeds a preset second safety threshold (δ_sf). The second safety threshold is a spectral flatness change safety threshold δ_sf = 0.20 dB. It is derived from the assumption that, for a limited number of human voice samples subjected to reasonable processing, the change in SFM falls within this range; exceeding this threshold may indicate that the spectral balance has been excessively disrupted.
[0099] This correction mechanism aims to address the problem of timbre processing decisions that may deviate from the optimal solution due to over-reliance on accompaniment acoustic features. The core of the correction is to adjust the weights of different features in the fusion calculation, thereby changing the focus of the decision-making process.
[0100] The technical inference process is based on the interpretation of ΔSFM values: an abnormal increase in ΔSFM, especially in scenarios with rich high-frequency components in the accompaniment (high SC_norm), indicates that current processing strategies guided by features such as the accompaniment's spectral centroid (e.g., significantly boosting high frequencies in vocals to combat high-frequency masking) have caused significant changes in the spectral distribution (spectral flatness) of the vocals themselves. This change may have approached or exceeded the boundary of maintaining a natural listening experience.
[0101] For example, in the current weighting model W=[w1, w2, w3, w4, w5]=[0.30, 0.20, 0.15, 0.20, 0.15], the accompaniment's spectral centroid SC_norm is very high (e.g., 0.85), and its corresponding weight w4=0.20, making the contribution of SC_norm to the overall sound field state factor D significant, thus driving the generation of a large high-frequency equalization gain. Assume that the calculated ΔSFM value is 0.25dB, exceeding δ_sf=0.20dB. This does not directly negate the strategy of boosting high frequencies to penetrate the accompaniment, but rather infers that the current weighting allocation may make it overly sensitive to the accompaniment's spectral characteristics, leading to an over-amplification of the influence of these characteristics in the processing decision, potentially causing unnecessary tonal imbalance.
[0102] Therefore, the correction involves reducing the weight coefficients in the weight model corresponding to features within the second acoustic feature set (i.e., accompaniment features). Specifically, this applies to w3, corresponding to the accompaniment dynamic range parameter, and w4, corresponding to the accompaniment spectral centroid. These two coefficients are multiplied by a decay factor β less than 1, for example, β = 0.85. Thus, w3 changes from 0.15 to 0.15 * 0.85 = 0.1275, and w4 changes from 0.20 to 0.20 * 0.85 = 0.17. To keep the sum of all weight coefficients at 1, the entire weight vector needs to be renormalized. Assuming the other weights w1, w2, and w5 remain unchanged for now, the new unnormalized vector is [0.30, 0.20, 0.1275, 0.17, 0.15], with a sum of 0.9475. After normalization, the new weight vector W' is approximately [0.317, 0.211, 0.135, 0.179, 0.158].
[0103] As can be seen, the absolute values and relative proportions of w3 and w4 have decreased, while the weights of vocal features (w1, w2) and environmental features (w5) have relatively increased. This means that in subsequent fusion calculations, the influence of accompaniment features on the final decision factor D is weakened, and processing decisions will be made more based on the singer's own state and the room's acoustic conditions. This slow, effect-based weight adjustment allows for adaptive learning and adjustment of attention allocation during use, reducing reliance on input features that may cause problems, thus evolving towards a more robust and personalized sound field adjustment strategy.
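The weight correction in paragraphs [0102] and [0103] amounts to scaling the accompaniment weights and renormalizing. A minimal sketch, assuming the weights are held in a plain list ordered [w1, w2, w3, w4, w5] and that the function name is illustrative:

```python
def attenuate_and_renormalize(weights, indices, beta=0.85):
    """Scale the weights at `indices` by beta, then renormalize to sum to 1."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    scaled = [w * beta if i in indices else w for i, w in enumerate(weights)]
    total = sum(scaled)
    return [w / total for w in scaled]

# w3 and w4 (zero-based positions 2 and 3) are the accompaniment weights.
w = [0.30, 0.20, 0.15, 0.20, 0.15]
w_new = attenuate_and_renormalize(w, indices={2, 3}, beta=0.85)
print([round(x, 3) for x in w_new])
```

Renormalization is what raises w1, w2, and w5 even though they are not modified directly: the total shrinks to 0.9475, so every unchanged weight is divided by a number smaller than 1.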
[0104] A sound field adjustment system based on dynamic thresholds is also provided, used to implement the sound field adjustment method based on dynamic thresholds in any of the above embodiments. The system includes: a multi-source signal acquisition module, which includes analog-to-digital converters connected to a human voice microphone interface, an accompaniment audio input interface, and an environmental microphone, respectively, for real-time acquisition of human voice audio signals, accompaniment audio signals, and environmental acoustic reference signals; a feature extraction module, used to extract the first acoustic feature set, the second acoustic feature set, and the third acoustic feature set from the three signals, respectively; a fusion calculation module, used to perform weighted summation on the normalized feature values to generate the comprehensive sound field state factor D; a parameter mapping module, used to generate the reverberation trigger threshold Th_r, target reverberation time T60, high-frequency equalization gain G_hf, and compressor threshold Th_c based on D; a parallel processing module, which includes a reverberation processing unit, a frequency domain equalization processing unit, and a dynamic compression processing unit, all three of which receive the original human voice audio signal as input and are controlled by Th_r / T60, G_hf, and Th_c, respectively; and a feedback analysis module, used to calculate the transient distortion and spectral flatness changes of the output signal of the parallel processing module, and to transmit the results to the fusion calculation module and the parameter mapping module to update the weight coefficients and mapping parameters.
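The role of the fusion calculation module can be illustrated with a small sketch of the weighted summation. The feature order follows the weight vector [w1, ..., w5] used in the earlier examples (two vocal features, two accompaniment features, one environmental feature); the feature values below and the function name are hypothetical, chosen only for illustration.

```python
# Order assumed to match [w1, w2, w3, w4, w5] from the weighting model.
FEATURE_ORDER = (
    "vocal_short_time_energy",
    "vocal_f0_stability",
    "accompaniment_dynamic_range",
    "accompaniment_spectral_centroid",
    "ambient_reverberation_time",
)

def fuse_state_factor(features, weights):
    """Linear weighted fusion of normalized features into the state factor D."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features))

weights = [0.30, 0.20, 0.15, 0.20, 0.15]
features = [0.5, 0.5, 0.3, 0.8, 0.4]   # hypothetical normalized values in [0, 1]
d = fuse_state_factor(features, weights)
print(round(d, 3))
```

With weights that sum to 1 and features normalized to [0, 1], D also stays in [0, 1], which is what lets the parameter mapping module compare it against fixed activation thresholds.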
[0105] In this embodiment, it should be noted that the specific manner in which the above-mentioned sound field adjustment system, electronic device and non-transitory computer-readable storage medium based on dynamic threshold are performed has been described in detail in the embodiments of the sound field adjustment method based on dynamic threshold, and will not be elaborated here.
[0106] Figure 4 is a block diagram of an electronic device for implementing a sound field adjustment method based on a dynamic threshold, according to an exemplary embodiment. As shown in Figure 4, the electronic device 700 may include a processor 701 and a memory 702. The electronic device 700 may also include one or more of a multimedia component 703, an I / O interface 704 (input / output interface), and a communication component 705.
[0107] The processor 701 controls the overall operation of the electronic device 700 to complete all or part of the steps in the dynamic threshold-based sound field adjustment method described above. The memory 702 stores various types of data to support the operation of the electronic device 700. This data may include, for example, instructions for any application or method operating on the electronic device 700, and application-related data such as contact data, sent and received messages, pictures, audio, video, etc. The memory 702 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The multimedia component 703 may include a screen and audio components. The screen may be, for example, a touchscreen, and the audio component is used to output and / or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 702 or transmitted via the communication component 705. The audio component also includes at least one speaker for outputting audio signals. The I / O interface 704 provides an interface between the processor 701 and other interface modules, such as a keyboard, mouse, buttons, etc. These buttons may be virtual or physical buttons. The communication component 705 is used for wired or wireless communication between the electronic device 700 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, or other 5G technologies, or a combination thereof, and is not limited here. Accordingly, the communication component 705 may include a Wi-Fi module, a Bluetooth module, an NFC module, etc.
[0108] In one exemplary embodiment, the electronic device 700 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the above-described dynamic threshold-based sound field adjustment method.
[0109] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided, which, when executed by a processor, implement the steps of the dynamic threshold-based sound field adjustment method described above. For example, the computer-readable storage medium may be the memory 702 including the program instructions described above, which may be executed by the processor 701 of the electronic device 700 to complete the dynamic threshold-based sound field adjustment method described above.
[0110] In another exemplary embodiment, a computer program product is also provided, the computer program product comprising a computer program executable by a programmable device, the computer program having a code portion for performing the above-described dynamic threshold-based sound field adjustment method when executed by the programmable device.
[0111] The preferred embodiments of this disclosure have been described in detail above with reference to the accompanying drawings. However, this disclosure is not limited to the specific details of the above embodiments. Within the scope of the technical concept of this disclosure, various simple modifications can be made to the technical solutions of this disclosure, and these simple modifications all fall within the protection scope of this disclosure.
[0112] It should also be noted that the various specific technical features described in the above embodiments can be combined in any suitable manner without contradiction. To avoid unnecessary repetition, this disclosure will not describe the various possible combinations separately.
[0113] Furthermore, the various embodiments of this disclosure can be combined in any way. As long as such combinations do not violate the spirit of this disclosure, they should also be regarded as content disclosed herein.
[0114] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention, and they should all be covered within the scope of the claims and specification of the present invention.
Claims
1. A sound field adjustment method based on dynamic threshold, characterized in that it comprises: acquiring, in real time, a human voice audio signal, an accompaniment audio signal and an environmental acoustic reference signal; extracting a first acoustic feature set from the human voice audio signal; extracting a second acoustic feature set from the accompaniment audio signal; extracting a third acoustic feature set from the environmental acoustic reference signal; normalizing the feature values in the first acoustic feature set, the second acoustic feature set, and the third acoustic feature set, and linearly weighting and fusing the normalized feature values through a configurable weight model to generate a one-dimensional comprehensive sound field state factor D; inputting the integrated sound field state factor D to a set of predefined and independent parameter mapping functions to generate a reverberation trigger threshold, a target reverberation time, a high-frequency equalization gain, and a compressor threshold in parallel; acquiring a reverberation processing unit, a frequency domain equalization processing unit, and a dynamic compression processing unit, all of which use the human voice audio signal as input, and performing parallel audio processing of the human voice audio signal based on the reverberation trigger threshold, target reverberation time, high-frequency equalization gain, and compressor threshold, respectively; and obtaining, based on the human voice output signal after parallel processing and synthesis, a quality evaluation index including transient distortion and spectral flatness variation, and correcting the weight coefficients in the weight model and / or the mapping coefficients of the parameter mapping function according to the quality evaluation index.
2. The sound field adjustment method based on dynamic threshold according to claim 1, characterized in that, The first acoustic feature set includes short-time energy and fundamental frequency stability indices of human voice; The second acoustic feature set includes the accompaniment dynamic range parameters and the accompaniment spectral centroid; The third acoustic feature set includes the ambient reverberation time.
3. The sound field adjustment method based on dynamic threshold according to claim 1, characterized in that, The parameter mapping function includes: The first mapping function is used to calculate the reverberation trigger threshold and the target reverberation time based on the integrated sound field state factor D. The second mapping function is used to calculate the high-frequency equalization gain based on the integrated sound field state factor D. The third mapping function is used to calculate the compressor threshold based on the integrated sound field state factor D.
4. The sound field adjustment method based on dynamic threshold according to claim 3, characterized in that, The calculation logic of the first mapping function is configured as follows: when the integrated sound field state factor D is lower than the first activation threshold D_r_on, the reverberation trigger threshold is set to an invalid value that makes the reverberation processing unit inactive; when D is greater than or equal to D_r_on, the reverberation trigger threshold decreases linearly with the increase of D, and the target reverberation time increases linearly with the increase of D. The calculation logic of the second mapping function is configured as follows: when the integrated sound field state factor D is lower than the second activation threshold D_eq_on, the high-frequency equalization gain is set to zero; when D is greater than or equal to D_eq_on, the high-frequency equalization gain increases linearly with the increase of D. The calculation logic of the third mapping function is configured as follows: when the integrated sound field state factor D is lower than the third activation threshold D_c_on, the compressor threshold is set to a default value; when D is greater than or equal to D_c_on, the compressor threshold decreases linearly as D increases.
5. The sound field adjustment method based on dynamic threshold according to claim 1, characterized in that, The quality evaluation indicators, including transient distortion, obtained from the parallel-processed and synthesized human voice output signal include: The original transient region of the human voice audio signal is obtained, and the transient response signals of the reverberation processing unit, the frequency domain equalization processing unit and the dynamic compression processing unit in the original transient region are obtained respectively. The transient distortion is calculated based on the original transient region, the transient response signals of each processing unit, and the weighting coefficients associated with the target reverberation time, high-frequency equalization gain, and compressor threshold, respectively. The calculation logic for the transient distortion is as follows: For each sampling point in the original transient region, calculate the absolute value of the first difference between the transient response signal of the reverberation processing unit and the original transient signal, the absolute value of the second difference between the transient response signal of the frequency domain equalization processing unit and the original transient signal, and the absolute value of the third difference between the transient response signal of the dynamic compression processing unit and the original transient signal; divide the absolute value of the first difference by a first weighting coefficient, divide the absolute value of the second difference by a second weighting coefficient, divide the absolute value of the third difference by a third weighting coefficient, and add the three results to obtain the transient distortion component of that sampling point; after squaring, summing, and averaging the transient distortion components of all sampling points, calculate the square root as the final transient distortion. The first weighting coefficient, the second weighting coefficient, and the third weighting coefficient are respectively associated with the current target reverberation time, the high-frequency equalization gain, and the compressor threshold, and all of them are positive numbers.
6. The sound field adjustment method based on dynamic threshold according to claim 1, characterized in that, The quality evaluation index, including the spectral flatness variation, obtained from the parallel-processed and synthesized human voice output signal includes: Calculate the first spectral flatness of the human voice audio signal within a preset time window, and the second spectral flatness of the human voice output signal within the same time window; The change in spectral flatness is calculated based on the first spectral flatness, the second spectral flatness, the centroid of the current normalized accompaniment spectrum, and the dynamic range parameter of the current normalized accompaniment. The calculation logic for the change in spectral flatness is as follows: calculate the absolute value of the difference between the second spectral flatness and the first spectral flatness, and multiply the absolute value by a modulation factor; The modulation factor is configured such that its value is equal to 1 plus a preset sensitivity adjustment coefficient multiplied by the ratio of the current normalized accompaniment spectrum centroid to the sum of the current normalized accompaniment spectrum centroid and the accompaniment dynamic range parameter.
7. The sound field adjustment method based on dynamic threshold according to claim 1, characterized in that, The modification of the parameter mapping function includes: when the transient distortion in the quality evaluation index exceeds the first safety threshold, reducing the slope of the high-frequency equalization gain as a function of D in the second mapping function, and / or increasing the reference value of the compressor threshold in the third mapping function; Feedback correction of the weighting model includes: when the change in spectral flatness in the quality evaluation index exceeds the second safety threshold, reducing the weight coefficients in the weighting model corresponding to the features in the second acoustic feature set.
8. A sound field adjustment system based on dynamic threshold, characterized in that, The system is used to implement the sound field adjustment method based on dynamic threshold according to any one of claims 1 to 7, the system comprising: The multi-source signal acquisition module includes analog-to-digital converters connected to a human voice microphone interface, an accompaniment audio input interface, and an environmental microphone, respectively, for real-time acquisition of human voice audio signals, accompaniment audio signals, and environmental acoustic reference signals; The feature extraction module is used to extract the first acoustic feature set, the second acoustic feature set, and the third acoustic feature set from the three signals, respectively. The fusion calculation module is used to perform weighted summation on the normalized eigenvalues to generate the comprehensive sound field state factor D; The parameter mapping module is used to generate the reverberation trigger threshold Th_r, target reverberation time T60, high-frequency equalization gain G_hf, and compressor threshold Th_c based on D. The parallel processing module includes a reverberation processing unit, a frequency domain equalization processing unit, and a dynamic compression processing unit. All three receive the original human voice audio signal as input and are controlled by Th_r / T60, G_hf, and Th_c, respectively. The feedback analysis module is used to calculate the transient distortion and spectral flatness changes of the output signal of the parallel processing module, and transmit the results to the fusion calculation module and the parameter mapping module to update the weight coefficients and mapping parameters.
9. An electronic device, characterized in that it comprises: a memory on which a computer program is stored; and a processor for executing the computer program in the memory to implement the sound field adjustment method based on dynamic threshold as described in any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, When executed by a processor, the program implements the sound field adjustment method based on dynamic thresholds as described in any one of claims 1 to 7.