Audio processing method and apparatus
By acquiring audio processing requests, matching template parameters and features in the audio template library, and performing parameter compensation based on feature differences, the problem of adapting audio effects to individual differences and diverse scenarios is solved, achieving high-quality, low-threshold audio processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU SEASUN ENTERTAINMENT NETWORK TECHCO
- Filing Date
- 2026-04-02
- Publication Date
- 2026-06-19
Figure CN122245341A_ABST
Abstract
Description
Technical Field
[0001] This specification relates to the field of software technology for creating digital cultural products, and in particular to an audio processing method.
Background Technology
[0002] With the widespread application of virtual products, voice effects processors, as core tools for optimizing voice timbre and enhancing auditory performance, have seen their application scenarios expand significantly from traditional broadcast and film post-production to diverse fields such as live interactive broadcasting, real-time game communication, intelligent voice assistants, and virtual human-driven applications. Correspondingly, users' demands for real-time performance, intelligence, and personalized timbre customization in voice processing are growing daily. Audio processing technology must therefore adapt to massive amounts of heterogeneous voice data and maintain good listening quality in complex and changing acoustic environments, which makes it a key link in driving innovation in the audio ecosystem.
[0003] However, existing audio effects technology generally adopts a matching mode based on fixed preset values, that is, several sets of equalization, compression and reverb parameters optimized for regular human voices are pre-set for users to choose from. This static processing mechanism is difficult to cope with the significant individual differences and scene diversity in actual applications, lacks targeted feature adaptation capabilities, and can hardly achieve the ideal effect through simple fine-tuning, which seriously limits the creative space and application boundaries of special sound effects.
[0004] Therefore, there is an urgent need to develop a speech processing method that can automatically and specifically process audio, in order to break through the limitations of existing preset modes and achieve high-quality, low-threshold, and widely adaptable intelligent audio processing.
Summary of the Invention
[0005] In view of the above, embodiments of this specification provide an audio processing method. One or more embodiments of this specification also relate to an audio processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, to address the technical deficiencies existing in the prior art.
[0006] According to a first aspect of the embodiments of this specification, an audio processing method is provided, comprising:
obtaining an audio processing request for the initial audio, wherein the audio processing request carries the initial audio signal, initial audio features, and a request description of the audio processing request;
matching, from the audio template library, the template audio parameters and template audio features corresponding to the request description;
determining the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features;
performing parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain the target processing parameters; and
performing signal processing on the initial audio signal based on the target processing parameters to obtain the target audio.
[0007] According to a second aspect of the embodiments of this specification, an audio processing apparatus is provided, comprising:
a request acquisition module, configured to acquire an audio processing request for the initial audio, wherein the audio processing request carries the initial audio signal, initial audio features, and a request description of the audio processing request;
a template matching module, configured to match the template audio parameters and template audio features corresponding to the request description from the audio template library;
a compensation amount determination module, configured to determine the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features;
a parameter compensation module, configured to perform parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain the target processing parameters; and
a signal processing module, configured to perform signal processing on the initial audio signal based on the target processing parameters to obtain the target audio.
[0008] According to a third aspect of the embodiments of this specification, a computing device is provided, comprising a memory and a processor. The memory is used to store a computer program / instructions, and the processor is used to execute the computer program / instructions, which, when executed by the processor, implement the steps of the above-described audio processing method.
[0009] According to a fourth aspect of the embodiments of this specification, a computer-readable storage medium is provided that stores a computer program / instructions that, when executed by a processor, implement the steps of the above-described audio processing method.
[0010] According to a fifth aspect of the embodiments of this specification, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described audio processing method.
[0011] The audio processing method provided in one or more embodiments of this specification includes: obtaining an audio processing request for initial audio, wherein the audio processing request carries an initial audio signal, initial audio features, and a request description of the audio processing request; matching template audio parameters and template audio features corresponding to the request description from an audio template library; determining an audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features; performing parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain target processing parameters; and performing signal processing on the initial audio signal based on the target processing parameters to obtain target audio. By acquiring an audio processing request carrying the initial audio signal, initial audio features, and a request description, the system first uses the request description to accurately match the corresponding template audio parameters and features from the audio template library, establishing a processing benchmark. Then, by calculating the feature difference between the template audio features and the initial audio features, the system quantifies the specific differences between the target speech and the ideal template in timbre, frequency band distribution, and other features, and determines the audio parameter compensation amount accordingly, achieving dynamic adaptation of the processing scheme. Subsequently, the compensation amount is used to perform targeted parameter compensation on the template audio parameters to obtain target processing parameters that eliminate the influence of individual differences. Finally, the initial audio signal is processed based on these target processing parameters to obtain the target audio. 
This solves the problem of the inapplicability of general preset values caused by individual speaker differences or special role speech without relying on manual fine-tuning. It effectively avoids phenomena such as low-frequency redundancy, high-frequency distortion, and dynamic imbalance, significantly improving the consistency and processing efficiency of batch processing multi-role speech, lowering the professional threshold, and expanding the application boundaries of speech effects in special synthesized speech scenarios.
Attached Figure Description
[0012] Figure 1 is a flowchart of an audio processing method provided in one embodiment of this specification;
Figure 2 is a flowchart of the processing procedure of an audio processing method provided in one embodiment of this specification;
Figure 3 is a flowchart of the dynamic adaptation process of an audio processing method provided in one embodiment of this specification;
Figure 4 is a schematic structural diagram of an audio processing apparatus provided in one embodiment of this specification;
Figure 5 is a structural block diagram of a computing device provided in one embodiment of this specification.
Detailed Implementation
[0013] Many specific details are set forth in the following description to provide a full understanding of this specification. However, this specification can be implemented in many other ways than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this specification. Therefore, this specification is not limited to the specific implementations disclosed below.
[0014] The terminology used in one or more embodiments of this specification is for the purpose of describing particular embodiments only and is not intended to limit the scope of the one or more embodiments of this specification. The singular forms “a,” “said,” and “the” used in one or more embodiments of this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used in one or more embodiments of this specification refers to and includes any or all possible combinations of one or more associated listed items. The term “at least one” in one or more embodiments of this specification means “one or more,” and “a plurality of” means “two or more.” The term “comprising” is an open-ended description and should be understood as “including but not limited to,” and may include other content in addition to what has been described.
[0015] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this specification, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this specification, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to a determination."
[0016] First, the terms and concepts used in one or more embodiments of this specification will be explained.
[0017] Analog-to-digital conversion (ADC) of audio is the core process of digitizing a continuous analog audio signal. It follows three steps: sampling, quantization, and encoding. First, a sample-and-hold circuit samples the analog signal at a fixed rate that satisfies the Nyquist sampling theorem, that is, a rate above twice the highest frequency in the signal, to avoid aliasing. Then, each sampled amplitude is quantized to a set number of bits, which maps it to one of a finite set of digital levels. Finally, the quantized levels are encoded as binary code words. The number of quantization bits directly determines the dynamic range and signal-to-noise ratio. ADC is widely used in recording, playback, and speech processing scenarios.
[0018] The Fast Fourier Transform (FFT) is a highly efficient algorithm that quickly converts a time-domain signal (a waveform that changes over time, such as audio, vibration, or current) into a frequency-domain signal, visually displaying which frequencies are present in the signal and the amplitude of each frequency component. In audio, FFT is commonly used for spectrum analysis, equalization, noise reduction, and voiceprint recognition to examine the frequency distribution and harmonic characteristics of sound, making it a core tool in digital signal processing.
[0019] Linear Predictive Coding (LPC) is a technique in speech / audio signal processing used to extract the spectral envelope of the vocal tract, achieve efficient compression, and perform parametric modeling. Its core is an autoregressive (AR) model, which predicts each sample as a linear combination of the previous p samples. The p-order linear prediction coefficients are solved using the least squares criterion, and the outputs are the prediction residuals and the LPC parameters.
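As an illustration of the linear prediction described above, the following minimal sketch estimates p-order coefficients with the autocorrelation method and the Levinson-Durbin recursion. This is one common way to solve the least squares problem; the specification does not say which solver is used, and the function names `autocorrelation` and `lpc` are illustrative assumptions.

```python
def autocorrelation(signal, max_lag):
    """Return r[0..max_lag], where r[lag] = sum over i of signal[i] * signal[i + lag]."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i + lag] for i in range(n - lag))
        for lag in range(max_lag + 1)
    ]


def lpc(signal, order):
    """Estimate coefficients a[1..p] of the model x[n] ~ sum_k a[k] * x[n - k].

    Assumption: autocorrelation method solved by Levinson-Durbin recursion.
    Returns (coefficients, prediction residual).
    """
    r = autocorrelation(signal, order)
    if r[0] == 0:
        raise ValueError("signal has zero energy")
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]             # prediction error energy so far
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error      # reflection coefficient for this order
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    coeffs = a[1:]
    # Residual: what the p previous samples fail to predict.
    residual = [
        signal[n] - sum(coeffs[k] * signal[n - 1 - k] for k in range(order))
        for n in range(order, len(signal))
    ]
    return coeffs, residual
```

For a signal generated by a known second-order recursion, `lpc(signal, 2)` should return coefficients close to those of the recursion, with a residual near zero after the first samples.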
[0020] It should be noted that the audio processing method provided in this specification can be applied to various industries or scenarios, including virtual reality processing software, digital cultural product production software, digital cultural creative software, digital cultural creative design, education, news, cultural content industry software, digital publishing software, digital music development and production, and digital mobile multimedia development and production.
[0021] This specification provides an audio processing method, and also relates to an audio processing apparatus, a computing device, a computer-readable storage medium, and a computer program product, which will be described in detail in the following embodiments.
[0022] Referring to Figure 1, which shows a flowchart of an audio processing method provided in one embodiment of this specification, the method specifically includes the following steps:
Step 102: Obtain an audio processing request for the initial audio, wherein the audio processing request carries the initial audio signal, initial audio features, and a request description of the audio processing request.
[0023] The initial audio is a set of raw sound data that has not yet been subjected to the target processing parameters. It is used as the basic input and processing object for subsequent extraction of initial audio features and execution of signal processing procedures. For example, it includes real-time voice recordings of speakers during live interactive sessions, multi-character dialogues recorded in real-time game communication scenarios, user command audio received during intelligent voice assistant interaction, synthesized speech waveforms generated in virtual human-driven processes, or pure audio materials imported into post-production of broadcasting and film that have not undergone equalization and compression processing.
[0024] An audio processing request is a control instruction or data packet that triggers the execution of the audio processing method. It carries the initial audio signal, initial audio features, and a request description so that the system can match parameters from the audio template library and calculate compensation based on this information. Examples include task instructions generated after a user clicks to confirm through a graphical interface, processing commands sent by an application through an interface call, or real-time sound effect optimization requests dynamically initiated during the operation of a virtual product.
[0025] The initial audio signal is the specific data stream of the initial audio. It can be included in the audio processing request and represents the sound vibration waveform in digital form. It serves as the physical carrier for matching, feature difference judgment, and the application of target processing parameters to finally convert it into target audio. Examples include pulse code modulation data obtained after analog signal to digital conversion, a clean speech sequence that has been preprocessed to remove background interference, a set of time-domain sampling points directly acquired and quantized from a microphone array, or raw recording file data read from a storage medium without any timbre optimization, etc.
[0026] The initial audio features are a multi-dimensional data set that quantifies the sound attributes of the initial audio. They are extracted by performing spectral analysis or model calculation on the initial audio signal, and they are aligned with the template audio features so that the difference between the two can be calculated to determine the audio parameter compensation amount. Examples include: frequency band distribution features, namely the ratio of each band's sub-energy to the total energy, calculated according to preset frequency band division thresholds; timbre features, namely spectral envelope and harmonic distribution data obtained with a linear predictive coding model; gender features, namely a gender category identifier determined by comparing the fundamental frequency value against a gender fundamental frequency threshold; and average energy features, namely a value representing the overall loudness level of the sound.
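The frequency band distribution and average energy features described above can be sketched as follows. This is a minimal illustration that assumes NumPy is available, that the whole signal is analysed in a single FFT frame, and that the band edges are supplied by the caller; the function names and the example band edges are hypothetical and are not taken from the specification.

```python
import numpy as np


def band_energy_ratios(signal, sample_rate, band_edges_hz):
    """Return, for each band, its share of the total spectral energy.

    band_edges_hz lists the preset band division thresholds in ascending
    order; consecutive pairs define the bands [low, high).
    """
    x = np.asarray(signal, dtype=float)
    power = np.abs(np.fft.rfft(x)) ** 2                  # energy per FFT bin
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)  # bin centre frequencies
    total = power.sum()
    ratios = []
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        in_band = (freqs >= low) & (freqs < high)
        ratios.append(float(power[in_band].sum() / total) if total > 0 else 0.0)
    return ratios


def average_energy(signal):
    """Mean squared amplitude, used here as a simple loudness indicator."""
    x = np.asarray(signal, dtype=float)
    return float(np.mean(x ** 2))


# Hypothetical usage: one second of a low tone plus a weaker high tone.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
example = np.sin(2 * np.pi * 200 * t) + 0.5 * np.sin(2 * np.pi * 2000 * t)
example_ratios = band_energy_ratios(example, sample_rate, [0, 500, 4000])
example_energy = average_energy(example)
```

In practice a longer recording would usually be split into short frames and the ratios averaged over frames; the single-frame form is kept here only to show the calculation.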
[0027] The request description is descriptive information used to express the user's desired auditory effect or style transfer intention for the initial audio. It can be carried in the audio processing request and used to retrieve and accurately match the corresponding template audio parameters and template audio features in the audio template library to establish a processing benchmark. Examples include style transfer requirement descriptions for converting a deep male voice into a clear female voice, scene effect instructions for adding monster roars or fantasy character traits to ordinary human voices, timbre optimization text specifying specific emotional colors, or parameter configuration instructions for defining reverberation effects for specific acoustic environments, etc.
[0028] There are several ways to obtain audio processing requests for the initial audio. One possible way is to obtain audio processing requests for the initial audio based on a preset audio library.
[0029] Another possible approach is to obtain the user's audio processing request for the initial audio through the client.
[0030] This step establishes a complete data foundation and execution basis for subsequent processing by acquiring an audio processing request carrying the initial audio signal, initial audio features, and request description. This enables the system to accurately locate the matching template audio parameters and template audio features from the audio template library based on the request description. Thus, without relying on manual intervention, it achieves automated adaptation to individual differences of different speakers, different acoustic environments, and diverse style requirements. This lays the necessary input conditions for generating target processing parameters that eliminate the influence of individual differences and ultimately obtaining high-quality target audio.
[0031] Step 104: From the audio template library, match the template audio parameters and template audio features corresponding to the request description.
[0032] The audio template library is a collection that stores, for various request descriptions, template audio parameters and their corresponding template audio features. When an audio processing request is received, the library is searched according to the request description to find the matching benchmark data, which provides a reference standard for calculating the difference between the initial audio features and the template audio features and for determining the audio parameter compensation amount. For example, it may contain parameter sets of different gender timbre standards, feature data groups containing various emotional style settings, index tables recording reverberation configurations for specific acoustic environments, or databases covering various virtual character sound effects, etc.
[0033] Template audio parameters are audio processing configurations stored in the audio template library. They can be matched with a specific request description and configured with corresponding template audio features to anchor the processing reference object. Template audio parameters can serve as a reference value for calculating audio parameter compensation amounts, and together with the compensation amounts, generate the target processing parameters finally applied to the initial audio signal to correct the initial audio signal. These parameters include preset equalizer band gain values, compressor threshold and ratio settings, reverb effect time and decay parameters, or fundamental frequency offset and formant adjustment coefficients in voice changing processing, etc.
[0034] Template audio features are multi-dimensional data sets stored in the audio template library that represent idealized or standardized sound attributes. They are aligned dimension by dimension with the initial audio features, and the difference between the two is calculated to determine the audio parameter compensation amount, thereby eliminating the attribute differences between the template audio parameters and the processing parameters corresponding to the desired effect. Examples include the spectral envelope data of standard male or female timbres, the harmonic distribution coefficients under specific emotional states, the frequency band energy distribution ratio under ideal acoustic environments, or the fundamental frequency range and timbre brightness values set by the target virtual character, etc.
[0035] The above can be understood as follows: the audio template library stores template audio signals, each corresponding to template audio features and template audio parameters that adjust the template audio signals based on preset requirements. These template audio parameters can adjust the template audio signals to achieve the desired audio performance. In other words, the template audio parameters can process the template audio signals into audio signals with the target performance. Therefore, when there is an audio processing request for an initial audio signal that requires the desired performance, the template audio parameters can be adjusted based on the differences between the initial audio signal and the template audio signal to obtain the target audio.
[0036] There are several ways to match the template audio parameters and features corresponding to the request description from the audio template library. One possible way is to segment the request description and extract keywords, and then perform string matching and hit statistics with the pre-labeled tags in the audio template library. The template with the highest keyword overlap is used to match the template audio parameters and features corresponding to the request description.
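The keyword-based matching described above can be sketched as follows. This is only an illustration: splitting on letters and digits stands in for word segmentation, and the library layout (a dictionary with `tags`, `parameters`, and `features` entries) and all example values are hypothetical.

```python
import re


def extract_keywords(text):
    """Very simple segmentation: lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def match_template_by_keywords(request_description, template_library):
    """Return (template name, hit count) for the template whose tags
    overlap most with the keywords of the request description."""
    keywords = extract_keywords(request_description)
    best_name, best_hits = None, 0
    for name, template in template_library.items():
        hits = len(keywords & set(template["tags"]))  # hit statistics
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name, best_hits


# Hypothetical template library with pre-labeled tags.
library = {
    "clear_female": {
        "tags": ["female", "clear", "bright"],
        "parameters": {"eq_high_gain_db": 3.0},
        "features": {"fundamental_hz": 220.0},
    },
    "monster": {
        "tags": ["monster", "roar", "deep"],
        "parameters": {"pitch_shift_semitones": -7.0},
        "features": {"fundamental_hz": 80.0},
    },
}
matched = match_template_by_keywords(
    "Convert a deep male voice into a clear female voice", library
)
```

When no tag is hit, the sketch returns `None` as the template name, so a caller can fall back to another matching method.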
[0037] Another possible approach is to convert the request description and the labeled text of each audio file in the template library into feature vectors using a semantic encoding model, calculate the cosine similarity between the request vector and each template text vector, select the template with the highest similarity from the audio template library, and match the template audio parameters and template audio features corresponding to the request description.
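The similarity-based matching described above can be sketched as follows. The specification does not name a semantic encoding model, so a bag-of-words count vector is used here purely as a stand-in; the function names and the label texts are hypothetical.

```python
import math
from collections import Counter


def encode(text):
    """Stand-in for a semantic encoding model (assumption): a sparse
    bag-of-words count vector keyed by lowercase token."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_template_by_similarity(request_description, template_labels):
    """Return (template name, similarity) for the most similar label text."""
    query = encode(request_description)
    scores = {
        name: cosine_similarity(query, encode(label))
        for name, label in template_labels.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical labeled text for two templates.
labels = {
    "clear_female": "clear bright female voice",
    "monster": "deep monster roar",
}
best_name, best_score = match_template_by_similarity("clear female voice", labels)
```

With a real sentence encoder, only `encode` would change; the similarity calculation and the selection of the highest score stay the same.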
[0038] This step establishes a benchmark reference system for transforming the initial audio into the target audio by accurately matching the corresponding template audio parameters and features from the audio template library based on the request description. This allows the system to quantify the attribute differences between the initial audio and the desired effect using the template audio features as a benchmark. Then, based on these differences, the template audio parameters, which serve as the starting point for adjustment, are specifically modified to generate target processing parameters that are adapted to the current initial audio signal. This enables the flexible migration and adaptation of general or standardized preset sound effect configurations to diverse and personalized actual input scenarios, effectively ensuring that the final generated target audio can not only meet the user's specified style or environmental requirements, but also eliminate sound quality deviations caused by individual speaker differences or different recording environments.
[0039] Step 106: Determine the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features.
[0040] The feature difference is a set of numerical differences obtained after aligning the initial audio features and the template audio features in the corresponding dimensions. It quantifies the degree of deviation between the current input sound attributes and the expected target sound attributes, and provides a direct basis for calculating the audio parameter compensation amount. Examples include the offset of the fundamental frequency value, the difference in the energy distribution ratio of the spectral envelope in each frequency band, the deviation of the harmonic structure coefficients, or the difference in average loudness level, etc.
[0041] Audio parameter compensation is an adjustment value or curve used to correct template audio parameters. It can be calculated based on feature differences and is used to eliminate attribute differences between the initial audio features and template audio features. This adapts the general template audio parameters to the target processing parameters for the current initial audio signal. Examples include frequency band corrections superimposed on preset gain values, voice coefficient adjustment values calculated based on fundamental frequency offset, environmental compensation factors set for reverberation time, or dynamic range compression ratio fine-tuning values generated based on energy distribution differences.
[0042] There are several ways to determine the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features. One possible way is to determine the audio parameter compensation amount based on the Euclidean distance between the features of the template audio features and the initial audio features.
[0043] Another possible approach is to determine the audio parameter compensation amount based on the vector difference between the template audio features and the initial audio features.
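The two options above, Euclidean distance and vector difference, can be sketched as follows. The feature vectors are assumed to be already aligned lists of equal length. The sign convention (template minus initial) and the proportional mapping with a `sensitivity` factor are illustrative assumptions, since the specification does not define how a difference is converted into a compensation amount.

```python
import math


def vector_difference(template_features, initial_features):
    """Signed per-dimension difference (template minus initial)."""
    return [t - x for t, x in zip(template_features, initial_features)]


def euclidean_distance(template_features, initial_features):
    """Single scalar measuring how far the input is from the template."""
    diff = vector_difference(template_features, initial_features)
    return math.sqrt(sum(d * d for d in diff))


def compensation_from_difference(template_features, initial_features, sensitivity=1.0):
    """Hypothetical mapping: one compensation value per dimension,
    proportional to the signed feature difference."""
    diff = vector_difference(template_features, initial_features)
    return [sensitivity * d for d in diff]


# Hypothetical band energy ratios for a template and an input.
template = [0.5, 0.3, 0.2]
initial = [0.2, 0.3, 0.6]
distance = euclidean_distance(template, initial)
compensation = compensation_from_difference(template, initial, sensitivity=2.0)
```

The distance gives one overall magnitude, which could drive the strength of a correction, while the vector difference keeps the direction of the deviation in each dimension.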
[0044] This step utilizes the deviation between the initial audio and the desired target, quantified by the feature difference, to drive the generation of targeted audio parameter compensation. This dynamically corrects the standardized template audio parameters to target processing parameters that adapt to the characteristics of the current input signal. In this way, while preserving the style or environmental characteristics indicated by the request description, it effectively offsets the deviations introduced by individual speaker differences, recording condition fluctuations, or different original timbre bases. This ensures that the final processing strategy applied to the initial audio signal can accurately reproduce the expected auditory effect while maintaining the naturalness and consistency of the sound, avoiding the problems of insufficient adaptability or distortion caused by directly applying fixed parameters.
[0045] Step 108: Based on the audio parameter compensation amount, perform parameter compensation on the template audio parameters to obtain the target processing parameters.
[0046] Parameter compensation is a process of numerical correction or curve adjustment of template audio parameters based on feature differences. It is used to eliminate the attribute differences between the initial audio features and the template audio features, thereby transforming the general benchmark configuration into target processing parameters that adapt to the characteristics of the current input signal. Examples include frequency band fine-tuning of preset equalizer gain values, updating of voice coefficients calculated based on fundamental frequency deviation, environmental adaptive correction for reverberation time, or dynamic compression ratio adjustment based on energy distribution differences.
[0047] The target processing parameters are template audio parameters corrected by audio parameter compensation. They are used as the configuration basis for performing the final audio processing operation on the initial audio signal to adapt to the specific attributes of the current input sound and reproduce the desired auditory effect. Examples include the corrected frequency band gain curve, the adjusted fundamental frequency transformation coefficient, the reverberation time setting adapted to the current environment, or the dynamic range compression threshold finely adjusted based on energy distribution differences.
[0048] This step, through the execution of parameter compensation, applies the audio parameter compensation amount to a general template audio parameter, realizing the dynamic transformation from a standardized benchmark configuration to target processing parameters adapted to the characteristics of the current initial audio signal. This effectively offsets deviations introduced by individual speaker differences, recording condition fluctuations, or different original timbre bases while preserving the desired sound style or environmental characteristics. It ensures that the final generated frequency band gain curve, fundamental frequency conversion coefficient, reverberation time setting, or dynamic range compression threshold can accurately match the physical properties of the current input signal. This avoids sound quality distortion or style incongruity caused by directly applying a fixed template, and improves the naturalness and consistency of the audio processing results.
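A minimal sketch of the parameter compensation step is given below. It assumes that compensation is additive, that parameters are named scalar values, and that optional limits keep the result in a usable range; the parameter names, values, and limits are hypothetical.

```python
def compensate_parameters(template_parameters, compensation, limits=None):
    """Return target processing parameters.

    Assumption: target = template + compensation, per named parameter.
    Parameters without a compensation entry are kept unchanged, and
    parameters with an entry in `limits` are clamped to (low, high).
    """
    limits = limits or {}
    target = {}
    for name, base in template_parameters.items():
        value = base + compensation.get(name, 0.0)
        if name in limits:
            low, high = limits[name]
            value = min(max(value, low), high)
        target[name] = value
    return target


# Hypothetical template parameters and compensation amounts.
template_parameters = {
    "eq_low_gain_db": 3.0,
    "compressor_ratio": 4.0,
    "reverb_time_s": 1.2,
}
compensation = {"eq_low_gain_db": -4.5, "compressor_ratio": 0.5}
limits = {"eq_low_gain_db": (-1.0, 12.0)}
target_parameters = compensate_parameters(template_parameters, compensation, limits)
```

A multiplicative or curve-based correction would fit the description equally well; the additive form is used here only because it is the simplest to read.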
[0049] Step 110: Based on the target processing parameters, perform signal processing on the initial audio signal to obtain the target audio.
[0050] Signal processing involves applying computational operations based on target processing parameters to an initial audio signal. This is used to modify the physical properties of the signal according to a configuration that adapts to the characteristics of the current input sound in order to reproduce the desired auditory effect. Examples include spectral energy adjustment based on a modified frequency band gain curve, pitch shifting performed according to updated fundamental frequency conversion coefficients, spatial rendering using an environmental reverberation time setting, or amplitude control based on a finely tuned dynamic range compression threshold.
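As one example of such signal processing, the sketch below applies a frequency band gain curve to a signal in the frequency domain. It assumes NumPy is available and processes the whole buffer with a single FFT using hard band edges, which is a simplification; a real-time equalizer would more likely use filters or overlapping frames. The function name and the example bands are hypothetical.

```python
import numpy as np


def apply_band_gains(signal, sample_rate, band_edges_hz, gains_db):
    """Scale each band [low, high) of the spectrum by its gain in decibels
    and return the processed time-domain signal."""
    x = np.asarray(signal, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
    for low, high, gain_db in zip(band_edges_hz[:-1], band_edges_hz[1:], gains_db):
        in_band = (freqs >= low) & (freqs < high)
        spectrum[in_band] *= 10.0 ** (gain_db / 20.0)  # dB to linear amplitude
    return np.fft.irfft(spectrum, n=len(x))


# Hypothetical usage: halve the amplitude of the low band, keep the high band.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
low_tone = np.sin(2 * np.pi * 200 * t)
high_tone = np.sin(2 * np.pi * 2000 * t)
half_gain_db = 20.0 * np.log10(0.5)
processed = apply_band_gains(
    low_tone + high_tone, sample_rate, [0, 500, 4000], [half_gain_db, 0.0]
)
```

Here the gains would come from the target processing parameters, for example a corrected equalizer gain curve.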
[0051] The target audio is the output data generated after signal processing of the initial audio signal. It is used to transform the processing logic of the target processing parameters into a final acoustic performance that meets the requirements, presenting the auditory content corrected by the target processing parameters to achieve the desired sound style or characteristics described in the request. Examples include speech segments with spectral energy adjustment, singing signals after pitch shifting, environmental recordings with spatial rendering, dynamic range compression results after amplitude control, clear vocals without background noise interference, or synthesized sound effects incorporating specific instrument timbre characteristics. The target audio can be used for subsequent playback, storage, transmission, or further analysis to adapt to the playback characteristics of different terminal devices or meet diverse application scenario needs.
[0052] This specification's embodiments acquire an audio processing request carrying an initial audio signal, initial audio features, and a request description. First, using the request description, it accurately matches the corresponding template audio parameters and features from an audio template library, establishing a processing benchmark. Then, by calculating the feature difference between the template audio features and the initial audio features, it quantifies the specific differences between the target speech and the ideal template in features such as timbre and frequency band distribution, and determines the audio parameter compensation amount accordingly, achieving dynamic adaptation of the processing scheme. Subsequently, it uses this compensation amount to perform targeted parameter compensation on the template audio parameters, obtaining target processing parameters that eliminate the influence of individual differences. Finally, based on these target processing parameters, it processes the initial audio signal to obtain the target audio. This solves the problem of inapplicability of general preset values caused by individual speaker differences or special role speech without relying on manual fine-tuning. It effectively avoids phenomena such as low-frequency redundancy, high-frequency distortion, and dynamic imbalance, significantly improving the consistency and processing efficiency of batch processing multi-role speech, lowering the professional threshold, and expanding the application boundaries of speech effects in special synthesized speech scenarios.
[0053] In one optional embodiment of this specification, the initial audio features have multiple dimensions, and the template audio features have multiple dimensions. Determining the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features includes: aligning the multi-dimensional template audio features with the multi-dimensional initial audio features to determine the feature differences in the multiple dimensions; weighting the feature differences in the multiple dimensions respectively, based on the preset weights of the multiple dimensions, to obtain the comprehensive parameter compensation amount; and generating the audio parameter compensation amount based on the comprehensive parameter compensation amount.
[0054] The multiple dimensions of the initial audio features are multiple attribute dimensions of the initial audio features, which may include frequency band distribution feature dimension, timbre feature dimension, gender feature dimension, average energy feature dimension, etc., used to construct a multi-dimensional data space describing the physical state of the input sound, so as to support subsequent comparison analysis with template features or derivation calculation of processing parameters.
[0055] The multiple dimensions of template audio features are multiple attribute dimensions of template audio features, which can also include frequency band distribution feature dimension, timbre feature dimension, gender feature dimension, average energy feature dimension, etc., to construct a multi-dimensional data space describing the physical state of the reference sound, so as to support subsequent comparison analysis with the initial audio features or derivation calculation of processing parameters.
[0056] Alignment is the process of comparing initial audio features with template audio features to establish the correspondence between the two. It can be point-by-point alignment or segment-by-segment alignment. It is used to associate and identify the matching status of two feature sequences on the time axis or feature space, thereby providing a benchmark for subsequent difference quantification or parameter adjustment. Examples include amplitude comparison for single-dimensional differences in frequency band distribution features, instantaneous intensity comparison based on peak values of speech signals, and spectral structure comparison based on harmonic distribution features.
[0057] Multiple dimension weights are numerical proportional coefficients assigned to each attribute dimension in the initial audio features or template audio features. They are used to adjust the contribution of different feature dimensions to the final result during the derivation and calculation of processing parameters. For example, they can be configured according to the weights of timbre 40%-50%, frequency band distribution 30%-40%, and gender 10%-20%, or set corresponding dynamic proportional coefficients based on the average energy feature dimension, frequency band distribution feature, and other attribute dimensions.
[0058] Weighting is a process of applying weights from multiple dimensions to each attribute dimension to generate a comprehensive value or a comprehensive vector. It is used to reflect the degree of difference in contribution of different feature dimensions to the final result in comparative analysis or parameter derivation calculations. For example, it can be used to perform linear superposition calculations based on preset proportional coefficients of dimensions such as timbre, frequency band distribution, and gender.
[0059] The comprehensive parameter compensation amount is a numerical increment or correction vector used to correct the initial audio features or adjust the synthesis parameters. It can be generated based on the feature differences of multiple dimensions after weighted processing. It is used to comprehensively consider the influence of multi-dimensional deviations between the initial state and the target state in audio conversion, style transfer and other processing to determine the overall adjustment range.
[0060] The audio parameter compensation amount is a numerical increment or correction vector used to correct specific audio attributes or adjust synthesis control variables. It can be generated from the comparative analysis of the initial audio features and the template audio features. It is used to convert the template audio parameters into the target processing parameters in audio signal processing, speech conversion, or style transfer, so that the target processing parameters match the actual correction requirements. Examples include a pitch offset calculated from fundamental frequency trajectory differences, a formant frequency adjustment value generated for a spectral envelope mismatch, and a dynamic gain correction value derived from energy distribution characteristics.
[0061] This specification describes an embodiment that constructs multi-dimensional initial audio features and template audio features, performs feature alignment to establish a matching benchmark for feature dimensions, and then uses preset multiple dimension weights to weight the feature differences of each dimension to generate a comprehensive parameter compensation amount. Finally, based on this, the audio parameter compensation amount required to generate the target processing parameters is derived. This compensation amount is used to quantify and offset the multi-dimensional deviation between the source data and the target state in audio conversion, style transfer, or speech synthesis scenarios, thereby achieving fine-grained correction of multiple attributes. This allows the output target processing parameters to adaptively match actual correction requirements and optimize the auditory consistency and naturalness of the final audio signal.
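The alignment and weighting steps above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: it assumes each dimension has already been reduced to one normalized scalar, the names `WEIGHTS`, `feature_differences`, and `comprehensive_compensation` are hypothetical, and the weight values are simply picked from within the ranges quoted above.

```python
from typing import Dict

# Hypothetical weights chosen inside the ranges quoted in the text
# (timbre 40%-50%, frequency band distribution 30%-40%, gender 10%-20%).
WEIGHTS: Dict[str, float] = {"timbre": 0.45, "band": 0.35, "gender": 0.20}


def feature_differences(template: Dict[str, float],
                        initial: Dict[str, float]) -> Dict[str, float]:
    """Align features by dimension name and return template minus initial."""
    shared = template.keys() & initial.keys()
    return {name: template[name] - initial[name] for name in shared}


def comprehensive_compensation(diffs: Dict[str, float],
                               weights: Dict[str, float]) -> float:
    """Linear superposition of the weighted per-dimension differences."""
    return sum(weights.get(name, 0.0) * diff for name, diff in diffs.items())


# Assumed normalized scalar features, for illustration only.
template_features = {"timbre": 0.8, "band": 0.5, "gender": 1.0}
initial_features = {"timbre": 0.6, "band": 0.7, "gender": 1.0}

diffs = feature_differences(template_features, initial_features)
total = comprehensive_compensation(diffs, WEIGHTS)
```

In practice each dimension may be a vector rather than a scalar, in which case the same superposition would be applied per component.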
[0062] This specification provides an optional embodiment in which the multiple dimensions include a frequency band distribution feature dimension, and the audio parameter compensation amount includes an equalization curve offset. Generating the audio parameter compensation amount based on the comprehensive parameter compensation amount includes: determining the energy adjustment direction of each frequency band based on the feature difference of the frequency band distribution feature dimension; determining the basic curve offset of each frequency band, under the constraint of the energy adjustment direction, based on the comprehensive parameter compensation amount and the weight corresponding to the frequency band distribution feature dimension; calibrating the basic curve offset of each frequency band based on the difference component of each frequency band in the feature difference of the frequency band distribution feature dimension, to obtain the single-band curve offset of each frequency band; and obtaining the equalization curve offset based on the single-band curve offsets of the frequency bands.
[0063] Frequency band distribution feature dimension is an attribute variable that describes the energy distribution state or spectral structure of an audio signal in different frequency ranges. It is used to characterize the frequency domain physical properties of the sound signal in a multi-dimensional feature space to support comparative analysis or parameter derivation calculations with other audio features, such as the spectral envelope vector generated based on the energy proportions of low-frequency, mid-frequency and high-frequency regions.
[0064] The difference components of each frequency band are quantized data units generated by comparing the initial audio features and the template audio features in a specific frequency range. They are used to characterize the degree of deviation between the two in terms of energy intensity, spectral shape or distribution state in that frequency band. They also provide a basis for differentiated correction of frequency bands in the process of multi-dimensional feature analysis or parameter compensation derivation, so as to support targeted adjustment of the equalization curve offset at the frequency band level, such as the difference vector of energy amplitude in the low frequency band, the deviation of the formant frequency position in the mid frequency band, etc.
[0065] The energy adjustment direction of each frequency band is a polarity identifier determined from the energy difference between the template audio features and the initial audio features in a specific frequency range. It indicates whether the gain in that band should be boosted or attenuated, and guides directional amplitude calibration of each frequency band during audio signal processing, spectrum correction, or style transfer so that the result approaches the expected spectral envelope shape. Examples include a positive (boost) direction for insufficient energy in the low frequency band and a negative (cut) direction for redundant noise in the mid and high frequency bands.
[0066] The basic curve offset is a numerical displacement used to correct the function trajectory or adjust the trend of parameter changes. It can be calculated from the feature differences between the initial audio features and the template audio features. It determines the overall adjustment range of the frequency curve of the template audio parameters during audio signal processing, establishing a baseline value for adjusting the template audio parameters. This improves the fit of the template audio parameters to the audio processing request and enables automated generation of the target processing parameters. The basic curve offset can include frequency curve adjustments for frequency bands such as low, mid, and high frequency, or for frequency bands divided according to other standards.
[0067] Calibrating the basic curve offset of each frequency band based on the difference component of each frequency band in the feature difference of the frequency band distribution feature dimension, to obtain the single-band curve offset of each frequency band, can be implemented as follows: the basic curve offset of each frequency band is calibrated based on the energy ratio of the difference component of each frequency band in the feature difference of the frequency band distribution feature dimension, to obtain the single-band curve offset of each frequency band.
[0068] That is, in addition to obtaining the difference component of each single frequency band, the method also obtains, for the initial audio features, the proportion of the energy of each single frequency band to the total energy of all single frequency bands, giving the energy ratio of each single frequency band for the initial audio features; and, for the template audio features, the proportion of the energy of each single frequency band to the total energy of all single frequency bands, giving the energy ratio of each single frequency band for the template audio features. The basic curve offset of each frequency band is then calibrated based on the difference between the two energy ratios of that band, to obtain the single-band curve offset of each frequency band.
[0069] Each single frequency band can be split according to the actual situation. In one possible case, the frequency band can be split into a low frequency range, a mid frequency range, and a high frequency range. For example, the low frequency range is 100-500Hz, the mid frequency range is 500Hz-2kHz, and the high frequency range is 2-8kHz. Accordingly, each single frequency band includes a low frequency band, a mid frequency band, and a high frequency band.
[0070] In another possible scenario, the frequency band can be divided into multiple consecutive sub-bands based on the center frequency, such as independent frequency bands with center frequencies of 125Hz, 250Hz, 500Hz, 1kHz, 2kHz, 4kHz, 8kHz, etc. Correspondingly, each single frequency band includes the frequency band corresponding to each center frequency based on the standard octave band division, etc.
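For the center-frequency division above, the band boundaries could be derived as in the following Python sketch. It assumes the common base-2 convention in which a one-octave band spans from the center frequency divided by the square root of 2 to the center frequency multiplied by the square root of 2; the text itself only lists the center frequencies, and the names `CENTER_FREQUENCIES_HZ` and `octave_band_edges` are hypothetical.

```python
import math
from typing import List, Tuple

# Center frequencies listed in the text, in Hz.
CENTER_FREQUENCIES_HZ: List[float] = [125.0, 250.0, 500.0, 1000.0,
                                      2000.0, 4000.0, 8000.0]


def octave_band_edges(center_hz: float) -> Tuple[float, float]:
    """Lower and upper edge of a one-octave band around center_hz
    (assumed base-2 convention: center / sqrt(2) to center * sqrt(2))."""
    return center_hz / math.sqrt(2.0), center_hz * math.sqrt(2.0)


BAND_EDGES = [octave_band_edges(fc) for fc in CENTER_FREQUENCIES_HZ]
```

With this convention the upper edge of one band coincides with the lower edge of the next, so the sub-bands are consecutive without gaps.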
[0071] The single-band curve offset is the numerical displacement applied to the curve within one frequency band. It is obtained by calibrating that band's basic curve offset for the template audio parameters. Examples include an amplitude correction based on the energy distribution difference in the low-frequency band, or a frequency shift value derived from the formant position deviation in the mid-frequency band.
[0072] Equalization curve offset is a numerical shift sequence used to correct the gain distribution pattern across the entire frequency band or adjust the spectral balance. It is generated based on the difference in overall frequency response characteristics between the target audio signal and the reference audio signal. It is used to eliminate the systematic deviation between the initial frequency response curve of the template processing parameters and the expected frequency response curve of the target processing parameters in order to optimize the overall listening experience or signal consistency. Examples include the longitudinal shift of the average gain across the entire frequency band, the tilt correction value of the transition slope from low frequency to high frequency, the overall boost or attenuation amplitude of a specific formant region, or the overall misalignment of the frequency response profile derived from the reflection characteristics of different listening environments, etc.
[0073] There are several ways to obtain the equalization curve offset based on the single-band curve offset of each frequency band. One possible way is to stitch together the single-band curve offsets to obtain the equalization curve offset.
[0074] Another possible approach is to superimpose the offsets of each single-band curve to obtain the offset of the equalization curve.
[0075] This specification's embodiments introduce the frequency band distribution feature dimension, refining the comprehensive parameter compensation amount into energy adjustment direction and basic curve offset for each single frequency band. It then uses the energy ratio difference between the initial audio and template audio in each frequency band to calibrate the basic curve offset to generate single-frequency band curve offsets, ultimately synthesizing the equalization curve offset. This achieves a transition from overall adjustment across the entire frequency band to refined correction by frequency band, enabling the generated audio parameter compensation amount to more accurately match the energy distribution and spectral structure of the target audio in different frequency ranges. This effectively eliminates systematic deviations between the initial signal and the expected signal in each single frequency band, improving the fidelity of spectral balance and auditory consistency during audio processing.
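One way the per-band steps above might fit together is sketched below in Python. The specification does not give formulas, so the arithmetic here is an assumption made for illustration: the direction is the sign of the energy-ratio difference, the basic offset scales a hypothetical `max_shift_db` by the comprehensive compensation amount and the band weight, and calibration multiplies by each band's share of the total absolute difference. The function name `equalization_curve_offset` is also hypothetical.

```python
from typing import Dict


def equalization_curve_offset(initial_ratio: Dict[str, float],
                              template_ratio: Dict[str, float],
                              comprehensive: float,
                              band_weight: float,
                              max_shift_db: float = 6.0) -> Dict[str, float]:
    """Return an assumed per-band offset in dB, keyed by band name."""
    # Difference component of each band: template ratio minus initial ratio.
    diffs = {b: template_ratio[b] - initial_ratio[b] for b in template_ratio}
    total_abs = sum(abs(d) for d in diffs.values()) or 1.0
    offsets = {}
    for band, diff in diffs.items():
        # Energy adjustment direction: +1 boost, -1 cut, 0 leave unchanged.
        direction = (diff > 0) - (diff < 0)
        # Basic curve offset, constrained to the adjustment direction.
        base = direction * abs(comprehensive) * band_weight * max_shift_db
        # Calibration by this band's share of the total difference.
        offsets[band] = base * (abs(diff) / total_abs)
    # "Stitching": the dict of single-band offsets is the equalization curve offset.
    return offsets


initial = {"low": 0.5, "mid": 0.3, "high": 0.2}
template = {"low": 0.3, "mid": 0.4, "high": 0.3}
offsets = equalization_curve_offset(initial, template,
                                    comprehensive=0.5, band_weight=0.35)
```

In this example the low band, which holds more energy than the template, receives a negative offset, while the mid and high bands receive smaller positive offsets.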
[0076] This specification provides an optional embodiment in which the multiple dimensions include a timbre feature dimension and a gender feature dimension, and the audio parameter compensation amount includes a compression threshold adjustment value. Generating the audio parameter compensation amount based on the comprehensive parameter compensation amount includes: determining the adjustment direction of the compression threshold adjustment value based on the feature difference of the timbre feature dimension; determining the initial compression threshold adjustment value, under the constraint of the adjustment direction, based on the comprehensive parameter compensation amount and the weight corresponding to the timbre feature dimension; and calibrating the initial compression threshold adjustment value based on the feature difference of the gender feature dimension to obtain the compression threshold adjustment value.
[0077] The timbre feature dimension describes the attributes of an audio signal in the timbre dimension, specifically manifested in differences in harmonic structure, spectral envelope shape, or transient response characteristics. It is used to characterize the texture, color, or distinctiveness of sound in a multi-dimensional feature space to support comparative analysis or style transfer operations between different sound sources. Examples include harmonic richness values generated based on the ratio of fundamental frequency to overtone energy, brightness values calculated based on the spectral centroid position, impact indices determined by the energy rise rate during the onset phase, and warmth coefficients derived from the width of formant distribution. Correspondingly, the difference in the timbre feature dimension represents the difference in a sound signal in the timbre dimension, manifested in harmonic structure, spectral envelope shape, or transient response characteristics.
[0078] The compression threshold adjustment value is a numerical displacement used to correct the dynamic processor's start level. It is used to flexibly set the amplitude threshold at which the signal begins to be attenuated or limited during audio dynamic control to adapt to the transient characteristics or listening requirements of different materials. For example, it can be the reference level offset derived from the average level fluctuation of human voice dialogue, the instantaneous trigger threshold correction value set for the transient peak intensity of percussion, the static noise threshold adjustment amount determined in combination with the background noise floor level, and so on.
[0079] The direction of the compression threshold adjustment value is represented by the positive or negative value, indicating the trend of the level threshold change: when the adjustment value is positive, it means that the level threshold for compression to start is increased, so that more signals are in an uncompressed state to preserve the dynamic range; when the adjustment value is negative, it means that the level threshold for compression to start is decreased, so that more signals enter the compression process to suppress peaks or increase the overall loudness.
[0080] The initial compression threshold adjustment value is a numerical shift unit generated by the preliminary comparison to establish the dynamic processor start-up level reference correction amount. It is used to provide an initial threshold calibration basis for the overall dynamic range of the signal at the beginning of the audio processing flow to support subsequent iteration of fine parameters.
[0081] The gender characteristic dimension is an attribute variable used to distinguish the audio signals corresponding to voice sources of different biological sexes. It is usually based on the fundamental frequency distribution range, formant frequency position or vocal cord vibration mode. It is used in sound analysis to characterize the speaker's gender orientation to support role classification, speech conversion or personalized timbre generation. For example, the low-frequency and high-frequency interval vectors are divided according to the fundamental frequency average, or the sound brightness coefficient is distinguished according to the spectral tilt, etc.
[0082] Based on the difference values of gender characteristics, the initial compression threshold adjustment value is calibrated to obtain the compression threshold adjustment value. For example, male deep voices have higher low-frequency peaks, so the adjustment amplitude can be increased by an additional 10%-20%; for special voices with large dynamics such as monster roars, the adjustment amplitude can be increased by an additional 30%-50%, thus obtaining the compression threshold adjustment value.
[0083] This specification's embodiments utilize the difference values of the timbre feature dimension to lock the adjustment direction, ensuring that the dynamic processing strategy matches the harmonic structure, spectral envelope, or transient characteristics of the sound; establish an initial compression threshold adjustment value under weight constraints to provide basic calibration; introduce the difference values of the gender feature dimension for amplitude calibration, adaptively correcting the adjustment amplitude for scenarios such as enhanced low-frequency control for deep male voices or significantly increased suppression for special voices with large dynamic ranges, such as monster roars, thereby effectively adapting to the dynamic range requirements of different physiological genders and special sound sources while preserving sound quality and recognizability. It constructs a collaborative processing mechanism of the timbre feature dimension and the gender feature dimension, realizing the refined generation of the compression threshold adjustment value, and improving the targeting and naturalness of audio dynamic control.
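The direction, initial value, and calibration steps for the compression threshold could be sketched in Python as follows. This is an illustration under stated assumptions: the scale `scale_db`, the category labels, and the function name `compression_threshold_adjustment` are hypothetical, the mapping from the gender feature difference to a category is assumed to have been done beforehand, and the extra amplitudes (15% and 40%) are simply taken from inside the 10%-20% and 30%-50% ranges quoted above.

```python
from typing import Optional

# Hypothetical extra amplitudes picked inside the ranges quoted in the text.
EXTRA_AMPLITUDE = {"deep_male": 0.15, "large_dynamic": 0.40}


def compression_threshold_adjustment(timbre_diff: float,
                                     comprehensive: float,
                                     timbre_weight: float,
                                     voice_category: Optional[str] = None,
                                     scale_db: float = 6.0) -> float:
    """Return an assumed threshold adjustment in dB (negative lowers the threshold)."""
    # Adjustment direction from the sign of the timbre feature difference.
    direction = (timbre_diff > 0) - (timbre_diff < 0)
    # Initial adjustment value under the direction constraint.
    initial = direction * abs(comprehensive) * timbre_weight * scale_db
    # Calibration: enlarge the amplitude for the given voice category.
    extra = EXTRA_AMPLITUDE.get(voice_category, 0.0)
    return initial * (1.0 + extra)


plain = compression_threshold_adjustment(-0.3, 0.5, 0.45)
deep_male = compression_threshold_adjustment(-0.3, 0.5, 0.45, "deep_male")
```

A negative result lowers the level at which compression starts, so more of the signal is compressed, which matches the sign convention described above.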
[0084] This specification includes an optional embodiment in which the multiple dimensions further include an average energy feature dimension, and the audio parameter compensation amount includes a gain correction value. Generating the audio parameter compensation amount based on the comprehensive parameter compensation amount includes: determining the basic curve offset of each frequency band and the compression threshold adjustment value, respectively, based on the comprehensive parameter compensation amount; determining the initial gain adjustment based on the feature difference of the average energy feature dimension; determining the average curve offset based on the single-band curve offset of each frequency band; estimating the gain change based on the current compression threshold of the initial audio signal and the compression threshold adjustment value; predicting the average energy value of the processed audio based on the initial gain adjustment, the average curve offset, and the gain change; and obtaining the gain correction value based on the difference between the predicted average energy value and the preset target output energy value.
[0085] The average energy feature dimension is an attribute variable that characterizes the average energy distribution of an audio signal within a preset time window. It can be determined based on the statistical distribution of the root mean square (RMS) level and is used to quantify the overall loudness intensity or energy density of the sound signal during dynamic analysis to support gain standardization, dynamic range assessment, or noise threshold setting. Examples include the moving average energy value calculated based on a short frame sequence, the global root mean square level derived for a long recording segment, and so on.
[0086] The initial gain adjustment is a numerical displacement unit generated based on a preliminary comparison between the input signal level and the target loudness reference. It is used to establish the starting point for signal amplitude correction and provides a basic calibration basis for overall volume balance at the beginning of the audio processing link to support subsequent dynamic range control or fine-tuning of tone.
[0087] The average curve offset is a statistical value obtained by weighting the equalization offset values of each frequency band. It is used to characterize the overall tilt trend of the spectrum or the displacement of the center of gravity of the energy distribution. It is used to provide a macroscopic calibration reference for the timbre balance state during frequency domain processing to support the prediction of gain correction values.
[0088] The gain change is a difference variable that characterizes the shift in signal amplitude within a preset time interval or processing stage. It quantifies the impact of the compression threshold adjustment on gain, to support the prediction of the gain correction value.
[0089] The average curve offset reflects the macroscopic change in signal energy caused by the overall tilt of the spectrum, while the gain change characterizes the amplitude shift caused by adjusting the compression threshold. Together, these two variables affect the final energy state of the audio signal and therefore the predicted average energy value of the processed audio. Since the gain correction value is calculated from the difference between the predicted average energy value and the preset target output energy value, any change in the average curve offset or the gain change alters the predicted energy value and leads to a corresponding adjustment of the gain correction value, thereby achieving closed-loop calibration of the final output energy.
[0090] The embodiments in this specification predict the average energy value of the processed audio by comprehensively considering the initial gain adjustment, the average curve offset, and the gain change, and then derive the gain correction value in reverse. This can effectively offset the energy deviation introduced by the equalization curve offset and the compression threshold adjustment, so that the final output audio signal energy converges more accurately to the preset target output energy value, and improves the accuracy and stability of volume balance in multi-dimensional parameter collaborative adjustment scenarios.
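The prediction and back-calculation described above can be illustrated with the following Python sketch, working in decibels. It rests on several assumptions not stated in the specification: levels add linearly in dB, the average curve offset is an equally weighted mean of the single-band offsets, and the gain change is estimated from a static compressor curve with a hypothetical ratio, for a signal that stays above the threshold (where moving the threshold by d dB moves the output by d × (1 − 1/ratio) dB). All function and parameter names are hypothetical.

```python
from typing import Sequence


def estimate_gain_change_db(threshold_adjust_db: float, ratio: float = 4.0) -> float:
    """Output level change for a signal above threshold when the threshold moves.

    Static curve: out = T + (x - T) / ratio, so d(out)/dT = 1 - 1/ratio.
    """
    return threshold_adjust_db * (1.0 - 1.0 / ratio)


def gain_correction_db(input_level_db: float,
                       initial_gain_db: float,
                       band_offsets_db: Sequence[float],
                       threshold_adjust_db: float,
                       target_level_db: float,
                       ratio: float = 4.0) -> float:
    """Gain still needed so the predicted output level reaches the target."""
    # Average curve offset: equally weighted mean of the band offsets (assumed).
    average_curve_offset = sum(band_offsets_db) / len(band_offsets_db)
    gain_change = estimate_gain_change_db(threshold_adjust_db, ratio)
    # Predicted average level of the processed audio.
    predicted = input_level_db + initial_gain_db + average_curve_offset + gain_change
    return target_level_db - predicted


correction = gain_correction_db(input_level_db=-20.0, initial_gain_db=3.0,
                                band_offsets_db=[-1.5, 0.5, 1.0],
                                threshold_adjust_db=-2.0, target_level_db=-16.0)
```

Here the predicted level is -18.5 dB, so a further 2.5 dB of gain would be needed to reach the -16 dB target.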
[0091] In one optional embodiment of this specification, obtaining the initial audio features includes: acquiring the initial audio signal; and performing spectral analysis on the initial audio signal to extract the initial audio features.
[0092] Spectrum analysis is a signal processing procedure that converts a time-domain audio signal to the frequency domain to analyze the amplitude and phase distribution of its frequency components. It is used to reveal the internal frequency composition characteristics of a sound signal to support equalization adjustment, noise suppression, or timbre recognition. Examples include instantaneous frequency energy distribution maps generated based on fast Fourier transform, time-frequency energy evolution matrices constructed based on short-time Fourier transform, power spectral density curves calculated by energy integrals for a specific frequency band, or multi-resolution frequency band coefficient sequences obtained by wavelet transform decomposition, etc.
[0093] The embodiments in this specification perform spectral analysis on the initial audio signal, transforming the time-domain waveform into frequency-domain data containing amplitude and phase distribution information. This enables more precise extraction of initial audio features that reflect the internal frequency composition characteristics of the sound, providing a rich and accurate data foundation for subsequent processing steps such as equalization adjustment, noise suppression, or timbre recognition based on frequency-domain characteristics, thereby improving the dimensionality and targeting of audio feature extraction.
[0094] In one optional embodiment of this specification, performing spectral analysis on the initial audio signal to extract the initial audio features includes: dividing the spectrum of the initial audio signal into frequency bands according to the preset threshold values for each frequency band; calculating the total energy of the initial audio signal and the sub-energy of each frequency band; and determining the initial audio features of the frequency band distribution feature dimension based on the ratio of the sub-energy of each frequency band to the total energy.
[0095] The preset threshold values for each frequency band are numerical limits of the boundaries of different pre-divided frequency regions. These are used to divide the audio spectrum into several independent intervals so that the audio features of each frequency band can be extracted and characterized separately. Examples include low-to-mid frequency transition points set according to the range of human hearing, and frequency band segmentation values customized for specific application scenarios.
[0096] The total energy of the initial audio signal is a numerical indicator that characterizes the cumulative amplitude intensity of the initial audio signal in the time or frequency domain. It is used as a basic reference quantity to evaluate the overall loudness, dynamic range, or signal-to-noise ratio of the signal to support gain calculation and energy ratio calculation of each frequency band. The total energy can be the instantaneous power integral value calculated based on the sum of squares of time-domain sampling points, the sum of frequency domain energy obtained by summing the squares of the amplitudes of spectral components, the total value of the entire frequency band after accumulating the energy of each sub-band for low, medium, and high frequencies, etc.
[0097] The sub-energy of each frequency band is the cumulative amplitude intensity of the initial audio signal within a specific frequency range. It is used to characterize and distinguish the unique audio features of each frequency band to support frequency band equalization adjustment, dynamic compression control, etc. For example, the low-frequency energy value separated from the total energy based on the frequency band division threshold, the mid-frequency power integral, the high-frequency energy component extracted for a specific frequency, etc.
[0098] The embodiments in this specification divide the spectrum of the initial audio signal into several independent intervals according to the preset threshold values for each frequency band, calculate the total energy and the sub-energy of each frequency band, and then determine the initial audio features of the frequency band distribution feature dimension based on the ratio of sub-energy to total energy. This quantifies the share of each frequency region in the overall signal energy, supporting fine calibration of the curve offset and producing an equalization curve offset that better fits the initial audio features and the audio processing request.
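The band-splitting and energy-ratio steps can be sketched in Python as follows. This is a minimal illustration that uses a direct discrete Fourier transform from the standard library so that it stays self-contained; a real implementation would normally use an FFT routine. The band limits are the example values quoted earlier (100-500 Hz, 500 Hz-2 kHz, 2-8 kHz), the total energy is taken as the sum of the sub-band energies, and the names `BANDS`, `power_spectrum`, and `band_energy_ratios` are hypothetical.

```python
import cmath
import math
from typing import Dict, List, Sequence, Tuple

# Example band limits from the text, in Hz: [lower, upper).
BANDS: Dict[str, Tuple[float, float]] = {
    "low": (100.0, 500.0), "mid": (500.0, 2000.0), "high": (2000.0, 8000.0)}


def power_spectrum(samples: Sequence[float],
                   sample_rate: float) -> List[Tuple[float, float]]:
    """Return (frequency in Hz, squared magnitude) for bins 0..n/2 via a direct DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(samples))
        spectrum.append((k * sample_rate / n, abs(coeff) ** 2))
    return spectrum


def band_energy_ratios(samples: Sequence[float], sample_rate: float,
                       bands: Dict[str, Tuple[float, float]] = BANDS) -> Dict[str, float]:
    """Sub-energy of each band divided by the total energy over all bands."""
    spectrum = power_spectrum(samples, sample_rate)
    sub = {name: sum(p for f, p in spectrum if lo <= f < hi)
           for name, (lo, hi) in bands.items()}
    total = sum(sub.values())
    if total == 0.0:
        return {name: 0.0 for name in bands}
    return {name: energy / total for name, energy in sub.items()}


# Test signal: a 250 Hz tone plus a half-amplitude 1 kHz tone at 16 kHz sampling.
RATE = 16000.0
signal = [math.sin(2 * math.pi * 250 * i / RATE)
          + 0.5 * math.sin(2 * math.pi * 1000 * i / RATE) for i in range(320)]
ratios = band_energy_ratios(signal, RATE)
```

Because energy scales with the square of amplitude, the half-amplitude 1 kHz tone carries a quarter of the energy of the 250 Hz tone, so the low band holds about 80% of the total and the mid band about 20%.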
[0099] In one optional embodiment of this specification, performing spectral analysis on the initial audio signal to extract the initial audio features includes: calculating linear prediction coefficients from the initial audio signal using a linear predictive coding model; determining the spectral envelope of the initial audio signal based on the linear prediction coefficients; extracting the harmonic distribution data of the initial audio signal; and determining the initial audio features of the timbre feature dimension based on the spectral envelope and the harmonic distribution data.
[0100] Linear predictive coding is a signal processing architecture based on the fact that the current audio sample value can be approximated by a linear combination of several past sample values. It is used to characterize the spectral envelope characteristics of audio signals to support speech compression coding, pitch period detection, or formant frequency extraction.
[0101] Linear prediction coefficients are numerical parameters that characterize the linear weighted relationship between the current audio sample value and several past sample values. They are used to construct linear prediction models that approximate the spectral envelope or the vocal tract transfer characteristics of audio signals.
[0102] Harmonic distribution data is the information on the arrangement and intensity of the fundamental frequency and its integer multiples of frequency components in terms of amplitude, phase, or energy in an audio signal. It is used to characterize the timbre texture, pitch stability, or harmonic structure features of a sound to support speech synthesis, sound quality enhancement processing, and so on.
[0103] The spectral envelope is a smooth curve characterizing the overall frequency response trend of a signal. It is used to define the relative energy strength and formant positions of different frequency bands, such as low, mid, and high frequencies, thereby determining the macroscopic auditory attributes of sound, such as brightness, fullness, or sharpness. Harmonic distribution data, on the other hand, provides detailed information characterizing the specific amplitude, phase, or energy arrangement of the fundamental frequency and its integer multiples. It is used to depict the internal texture of sound, the richness of overtones, and the mixing ratio of harmonics and noise, thus distinguishing sound sources with similar envelopes but different sound-producing mechanisms. Combining the two, the former provides the outline of the sound, while the latter fills in the details and texture, thus forming a complete representation of timbre.
[0104] The embodiments in this specification utilize a linear predictive coding model to calculate linear prediction coefficients to determine the spectral envelope, and combine this with extracted harmonic distribution data to construct initial audio features in the timbre dimension. This method can integrate the macroscopic contours characterizing the overall frequency response trend of the signal with the microscopic textures depicting the arrangement details of the fundamental frequency and integer multiple frequency components, thereby comprehensively reflecting multiple attributes of the sound, such as brightness, fullness, overtone richness, and the mixing ratio of harmonics and noise. This enables effective differentiation of sound sources with similar spectral shapes but different sound production mechanisms, improving the completeness and accuracy of the timbre characteristics characterization of the initial audio signal.
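The linear prediction step described above can be sketched as follows. This is a minimal illustration, assuming the autocorrelation method with the Levinson-Durbin recursion, which the specification does not name. The function names, the prediction order, and the way the envelope is sampled are all hypothetical, and harmonic distribution extraction is not covered.

```python
import cmath
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(frame, order):
    """Levinson-Durbin recursion (an assumed solver, not named in the text).

    Returns (a, err) where a[0] == 1 and the prediction-error filter is
    A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order.
    """
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    if err == 0.0:  # silent frame: nothing to predict
        return a, 0.0
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def spectral_envelope(a, gain, num_bins):
    """Sample |gain / A(e^{jw})| at num_bins (>= 2) frequencies from 0 to pi."""
    envelope = []
    for k in range(num_bins):
        w = math.pi * k / (num_bins - 1)
        denom = sum(c * cmath.exp(-1j * w * m) for m, c in enumerate(a))
        envelope.append(gain / abs(denom))
    return envelope
```

For a decaying signal such as `[0.9 ** n for n in range(200)]`, a first-order fit gives `a[1]` close to -0.9, and the resulting envelope is higher at low frequencies than at high frequencies.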
[0105] In one optional embodiment of this specification, performing spectral analysis on the initial audio signal to extract the initial audio features includes: extracting the fundamental frequency value of the initial audio signal; matching the fundamental frequency value against a preset gender fundamental frequency threshold to determine the gender category corresponding to the initial audio signal; and extracting the initial audio features of the gender feature dimension based on the gender category.
[0106] The fundamental frequency is the lowest frequency component in the periodic vibration of an audio signal. It is used to characterize the basic pitch of a sound, distinguish the range of different sound sources, or serve as a reference for generating harmonic sequences. It can also determine the user's perception of the pitch of an audio signal. The fundamental frequency can be the lowest significant spectral peak frequency identified through short-time Fourier transform spectral analysis, the frequency value corresponding to the fundamental period calculated according to cepstral analysis, and so on.
[0107] The preset gender fundamental frequency threshold is a pre-set reference limit used to distinguish the fundamental frequency range of different gender voices. It is used to help determine the gender category of the audio signal, filter voice data of a specific gender, or serve as a basis for gender conversion processing. The preset gender fundamental frequency threshold can be set based on the average fundamental frequency range of adult males, refer to the average fundamental frequency range of adult females, be divided based on the differences in voice between children and adults, or be determined according to the statistical characteristics of different age groups, etc.
[0108] Gender categories are identity classification identifiers for audio sound sources. They can be determined based on the physiological characteristics or acoustic properties of the sound-producing body. They are used to distinguish different gender groups corresponding to audio signals, guide timbre selection in speech synthesis, or serve as auxiliary constraints for speech recognition systems. Examples include male and female categories based on fundamental frequency distribution statistics, child and adult categories identified by combining formant features, neutral categories covering non-binary gender expressions, or multi-dimensional gender labels customized according to specific application scenarios.
[0109] The embodiments in this specification extract the fundamental frequency value of the initial audio signal and match it against a preset gender fundamental frequency threshold to determine the gender category, and then extract the initial audio features of the gender feature dimension based on that category. Using the fundamental frequency, a basic acoustic parameter, as the discrimination criterion allows the method to adapt flexibly to the frequency distribution differences of speakers of different age groups or physiological characteristics, to distinguish effectively between the gender groups corresponding to audio signals, and to provide a classification basis for timbre selection in subsequent speech synthesis, for auxiliary constraints in speech recognition systems, or for screening speech data of a specific gender. This improves the pertinence and applicability of audio feature extraction in the gender dimension.
[0110] In one optional embodiment of this specification, the audio parameter compensation amount includes at least one of an equalization curve offset, a compression threshold adjustment value, and a gain correction value. Performing parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain the target processing parameters includes at least one of the following: performing frequency band compensation adjustment on the equalizer response curve in the template audio parameters based on the equalization curve offset to obtain the target processing parameters; adjusting the dynamic compressor threshold in the template audio parameters based on the compression threshold adjustment value to obtain the target processing parameters; and correcting the overall gain value in the template audio parameters based on the gain correction value to obtain the target processing parameters.
[0111] An equalizer response curve is a graph that describes the relationship between the gain or attenuation of an audio signal at different frequency points. It is used to characterize the audio processing system's adjustment strategy for the energy of each frequency band, correct frequency response defects in the acoustic environment, or shape a specific sound signature, such as a bass boost and treble attenuation curve set according to user preferences.
[0112] Frequency band compensation adjustment is an operation process that performs gain correction or attenuation processing on the equalizer response curve in the template audio parameters based on the equalizer curve offset. It is used to transform general template configurations into target processing parameters that adapt to the current acoustic environment or equipment characteristics, eliminate frequency response differences between the reference model and the actual scene, or optimize the timbre balance of the final output audio, such as boosting and correcting the mid-to-high frequency bands, dynamically applying spectral characteristic calibration according to different audio content types, etc.
[0113] The dynamic compressor threshold is the reference level at which the dynamic range processor starts to apply gain control. It is used to distinguish low-level signals that need to be boosted from high-level signals that need to be suppressed, to control the upper and lower limits of the audio dynamic range, or to prevent signal distortion caused by instantaneous peaks. Examples include a level threshold set to avoid triggering on plosives in vocal processing, a start level set to keep average loudness consistent in broadcast transmission, a peak limiting trigger point set for the transient characteristics of percussion, or an automatic gain control start value adjusted according to the dynamic characteristics of the music style.
[0114] The dynamic compressor threshold adjustment is a process of raising or lowering the dynamic compressor threshold in the template audio parameters based on the compression threshold adjustment value. It is used to correct the deviation between the reference threshold and the actual requirements, narrowing the gap between the processing effect of the dynamic compressor threshold in the template processing parameters and the actual processing requirements of the initial audio signal.
[0115] The overall gain value is a linear adjustment parameter that amplifies or attenuates the amplitude of the audio signal across the entire frequency band in a uniform manner. It is used to adjust the absolute level strength of the output signal, match the input and output sensitivity between different devices, or compensate for energy loss caused by pre-processing, such as restoring the overall loudness after equalization processing, or increasing the volume according to the background noise level of the playback environment.
[0116] The embodiments in this specification compensate for template audio parameters based on at least one of equalization curve offset, compression threshold adjustment value, and gain correction value. This enables the general template configuration to be adaptively transformed into target processing parameters that fit the initial audio. Thus, precise calibration and optimization of template processing parameters can be achieved without manually constructing a complete parameter set. Furthermore, dynamic adaptation to audio processing requests can be achieved, improving the actual audio processing effect while expanding the types of audio that the system can process.
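As a rough illustration of the compensation described above, the sketch below applies whichever compensation amounts are present to a copy of the template parameters. The class and field names are hypothetical, and treating every adjustment as an additive offset in dB is an assumption; the specification does not state how the values are combined.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AudioParams:
    """Hypothetical container for template or target processing parameters."""
    eq_gains_db: Dict[str, float] = field(default_factory=dict)  # per-band EQ gain
    compressor_threshold_db: float = -18.0
    overall_gain_db: float = 0.0


@dataclass
class Compensation:
    """At least one field is expected to be set; None means 'not provided'."""
    eq_offsets_db: Optional[Dict[str, float]] = None
    threshold_adjust_db: Optional[float] = None
    gain_correction_db: Optional[float] = None


def compensate(template: AudioParams, comp: Compensation) -> AudioParams:
    """Return target parameters; the template itself is left unchanged."""
    eq = dict(template.eq_gains_db)
    if comp.eq_offsets_db is not None:
        for band, offset in comp.eq_offsets_db.items():
            eq[band] = eq.get(band, 0.0) + offset  # assumed additive, in dB
    threshold = template.compressor_threshold_db
    if comp.threshold_adjust_db is not None:
        threshold += comp.threshold_adjust_db
    gain = template.overall_gain_db
    if comp.gain_correction_db is not None:
        gain += comp.gain_correction_db
    return AudioParams(eq, threshold, gain)
```

Returning a new object rather than mutating the template keeps the template library reusable across requests.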
[0117] In one optional embodiment of this specification, the steps for obtaining the initial audio signal include: acquiring an analog speech signal; and preprocessing the analog speech signal to obtain the initial audio signal, wherein the preprocessing includes at least one of digital conversion and noise cancellation.
[0118] Analog speech signals are electrical signals that characterize the changes in sound wave amplitude and frequency over time using continuously varying physical quantities such as voltage or current. They are used in the analog domain as input sources to directly carry human voice information before analog-to-digital conversion. Examples include the induced electromotive force generated by microphone diaphragm vibration, the continuous current waveform transmitted in traditional telephone lines, and the changes in magnetic flux recorded on magnetic tape recording media. Analog speech signals can be acquired by using a microphone array as a speech acquisition terminal.
[0119] Digital conversion (analog-to-digital conversion) is the process of generating a digital signal consisting of a binary numerical sequence from a continuously changing analog signal. Specifically, it involves discretizing the analog signal in the time dimension and quantizing and encoding it in the amplitude dimension. It is used to enable the storage, transmission, editing, or anti-interference processing of analog information in digital systems. Examples include converting the analog voice waveform captured by a microphone into a pulse code modulation data stream, converting an analog video signal into a digital video file, and digitizing the continuous voltage values output by a sensor so that a microprocessor can read them.
[0120] Noise cancellation is a signal processing mechanism that suppresses non-target sound source components. It is usually achieved by generating an inverse signal that is opposite in phase to the interfering noise or by using a statistical model to separate and suppress it. It is used to reduce the impact of background ambient noise on the clarity of the target speech, improve the signal-to-noise ratio, or improve the listening quality of the audio receiver.
[0121] The embodiments of this specification acquire analog speech signals and preprocess them, including at least one of digital conversion and noise cancellation, which can transform the original sound waves, characterized by continuously changing physical quantities, into discrete data sequences adapted for digital system processing. At the same time, it effectively suppresses non-target sound source components to reduce background ambient sound interference, thereby improving the signal-to-noise ratio and audio clarity while preserving the integrity of human voice information, and providing high-quality initial audio signals for subsequent audio analysis or processing steps.
[0122] The audio processing method is further illustrated below with reference to Figure 2, taking a practical application of the audio processing method provided in this specification as an example. Figure 2 is a flowchart of the processing steps of an audio processing method provided in one embodiment of this specification, which includes the following steps: Step 202: Voice acquisition: microphone array acquisition, ADC conversion, noise reduction, and normalization.
[0123] The microphone array is an acoustic signal pickup device composed of multiple spatially distributed microphone units. It is used to synchronously capture the acoustic wave vibrations to be processed in the analog domain and output multi-channel analog electrical signals. The sampling rate of the microphone array can be 44.1kHz.
[0124] An ADC (Analog-to-Digital Converter) is a hardware circuit or processing module that discretely samples continuously changing analog voltage waveforms in the time dimension and quantizes and encodes them in the amplitude dimension to generate binary digital sequences. It is used to map analog information to digital systems and can improve the signal-to-noise ratio through oversampling.
[0125] Noise reduction and normalization is a preprocessing mechanism for digital speech signals. The noise reduction unit uses an adaptive noise cancellation algorithm to filter out steady-state noise components in the environment, and the normalization unit adjusts the amplitude of the digital speech signal to a preset value range to avoid abnormal signal amplitude affecting subsequent analysis.
[0126] Meanwhile, the initial audio signal obtained from voice acquisition can be transmitted to the spectrum analysis module through a high-speed serial interface. The high-speed serial interface is a communication link used to transmit data between the preprocessing module and the spectrum analysis module, and is used to complete the transmission of the preprocessed digital voice signal within a limited time delay.
[0127] Step 204: Spectrum analysis: framing, FFT spectrum analysis, LPC timbre extraction, gender determination from the fundamental frequency, and calculation of frequency band energy proportions.
[0128] This step is mainly implemented through the spectrum analysis module, which has built-in FFT and LPC operation units and uses hardware acceleration to improve computational efficiency.
[0129] The FFT operation unit is a hardware acceleration module that performs Fast Fourier Transform (FFT). It is used to perform frame-based processing on the input digital speech signal based on FFT and calculate the spectral amplitude and phase information of each frame to obtain the frequency band energy distribution. Optionally, the frame length is set to 1024 points and the frame shift is set to 512 points.
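The framing described above can be sketched as follows, using the optional frame length of 1024 points and frame shift of 512 points, so that consecutive frames overlap by half. The function name is hypothetical, and dropping a trailing partial frame is an assumption. At the 44.1kHz sampling rate mentioned earlier, a 1024-point frame spans roughly 23 ms.

```python
def split_frames(samples, frame_length=1024, frame_shift=512):
    """Split a sample sequence into overlapping frames.

    A trailing partial frame is dropped (an assumption; zero-padding it
    would be an equally reasonable choice).
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames
```

Each returned frame would then be passed to the FFT operation unit; a window function is commonly applied first, although the specification does not mention one.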
[0130] The LPC (Linear Predictive Coding) unit is a computational component based on a linear prediction model. It is used to calculate the prediction coefficients for each frame of the speech signal and then derive the spectral envelope from those coefficients. It also extracts harmonic distribution data and generates timbre features by combining the spectral envelope with the harmonic distribution.
[0131] The fundamental frequency detection unit is a processing module that uses the autocorrelation method to detect the fundamental frequency value of each frame of speech and compares the detected value with preset ranges. Optionally, the fundamental frequency determination range for males is 85-180Hz, and that for females is 165-255Hz. If the value falls outside both ranges, the speech is determined to be special synthesized speech.
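A minimal sketch of the detection and comparison described above is given below. The two ranges come from the text; because they overlap between 165Hz and 180Hz, the sketch resolves the overlap with a split point at 172.5Hz, which is purely an assumption. The function names and the 60-400Hz search range are also hypothetical.

```python
import math

MALE_RANGE = (85.0, 180.0)     # Hz, optional values from the text
FEMALE_RANGE = (165.0, 255.0)  # Hz, optional values from the text
OVERLAP_SPLIT = 172.5          # assumption: midpoint of the 165-180 Hz overlap


def estimate_f0(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Autocorrelation method: pick the lag with the largest correlation."""
    n = len(frame)
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(n - 1, int(sample_rate / f_min))
    best_lag, best_value = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else 0.0


def classify_gender(f0):
    """Map a fundamental frequency to a category label."""
    in_male = MALE_RANGE[0] <= f0 <= MALE_RANGE[1]
    in_female = FEMALE_RANGE[0] <= f0 <= FEMALE_RANGE[1]
    if in_male and in_female:
        return "male" if f0 < OVERLAP_SPLIT else "female"
    if in_male:
        return "male"
    if in_female:
        return "female"
    return "special_synthesized"
```

In practice the per-frame estimates would likely be averaged over voiced frames before classification; that step is omitted here.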
[0132] The frequency band distribution feature is formed by dividing the signal into three frequency bands (low, mid, and high) and calculating the proportion of the total energy that falls in each band. Optionally, the three frequency bands are: low frequency band (100-500Hz), mid frequency band (500Hz-2kHz), and high frequency band (2-8kHz).
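The band energy proportions can be sketched as below, using the three optional band ranges from the text. A naive DFT stands in for the hardware FFT unit so that the example needs only the standard library. Taking the "total energy" as the sum over the three bands, rather than over the whole spectrum, is an assumption, and the function names are hypothetical.

```python
import cmath
import math

# Band edges in Hz, taken from the optional values in the text.
BANDS = {"low": (100.0, 500.0), "mid": (500.0, 2000.0), "high": (2000.0, 8000.0)}


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (O(N^2), demo only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def band_energy_ratios(frame, sample_rate):
    """Return each band's share of the energy summed over the three bands."""
    n = len(frame)
    energies = {name: 0.0 for name in BANDS}
    for k, power in enumerate(power_spectrum(frame)):
        freq = k * sample_rate / n
        for name, (low_edge, high_edge) in BANDS.items():
            if low_edge <= freq < high_edge:
                energies[name] += power
    total = sum(energies.values())
    if total == 0.0:  # silent frame
        return {name: 0.0 for name in BANDS}
    return {name: energy / total for name, energy in energies.items()}
```

A 1kHz sine sampled at 16kHz, for example, places essentially all of its energy in the mid band.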
[0133] Finally, the system packages the timbre features, gender features, and frequency band distribution features into a feature parameter set, and transmits it to the preset compensation module to support subsequent processing.
[0134] Step 206: Preset Compensation: Call the basic preset parameter library.
[0135] The basic preset parameter library is a data set built based on different speech processing needs, such as clarity enhancement, warmth enhancement, and penetration enhancement. It is used to store the standard preset parameters and standard speech features corresponding to each type, corresponding to the audio template library mentioned above.
[0136] Standard preset parameters are baseline adjustment values set for specific speech optimization goals, used to guide the direction and amplitude of subsequent signal compensation. Standard speech features are reference data models characterizing the ideal speech state, covering the standard spectral envelope, standard fundamental frequency, and standard frequency band energy proportion, used as a comparison benchmark to evaluate the difference between the current speech signal and the target effect, thus supporting targeted audio compensation processing. Standard preset parameters correspond to the template audio parameters mentioned above, and standard speech features correspond to the template audio features mentioned above.
[0137] Step 208: Sub-process: Feature comparison, weight calculation, generation of compensation parameters, and correction of preset parameters.
[0138] This step is mainly implemented through a feature comparison unit, a weight configuration unit, and a comprehensive compensation calculation mechanism.
[0139] The feature comparison unit is a logic component built into the processing module. It is used to compare the feature parameter set input by the spectrum analysis module with the standard speech features of the selected type dimension by dimension, and to use the Euclidean distance algorithm to calculate the difference value of each feature dimension to quantify the degree of deviation between the current signal and the target standard.
[0140] The weight configuration unit is a parameter allocation module that supports user-defined or default configurations. The default settings are 40%-50% for timbre feature weight, 30%-40% for frequency band distribution feature weight, and 10%-20% for gender feature weight. The timbre feature weight can also be changed by importing parameters or manually inputting them, and is used to determine the contribution ratio of each feature dimension in the final decision.
[0141] The comprehensive compensation calculation mechanism is based on the difference values of each feature dimension and their corresponding weights. The processing logic is to obtain a single numerical index through weighted summation, which represents the adjustment intensity required for the overall speech signal.
[0142] The compensation parameters based on the comprehensive compensation output include the equalization curve offset value, the compression threshold adjustment value, and the gain correction value, and the adjustment range of each parameter is positively correlated with the comprehensive compensation amount.
[0143] The preset parameter correction mechanism is a process that uses the generated compensation parameters to fine-tune the baseline data in the basic preset parameter library point by point. Its output is the target processing parameters that adapt to the characteristics of the speech to be processed, and it is transmitted to the effect processing module through the interface to perform the final audio rendering.
[0144] In one possible implementation, the core calculation logic for the comprehensive compensation amount is: "single-dimensional difference value quantification → weight allocation → weighted summation → amplitude limiting calibration". The complete implementation and calculation details are as follows. Step 1: Single-dimensional feature difference value quantification (Euclidean distance algorithm). For each of the three core feature dimensions, the difference value between the features of the speech to be processed and those of the selected standard speech is calculated, so that the difference is quantified. 1. Timbre feature difference value D1: The timbre features are composed of the spectral envelope and the harmonic distribution. The Euclidean distance between each of these sequences and the corresponding standard sequence is calculated, and the two distances are averaged to obtain the final difference value.
[0145] The terms of this calculation are: the spectral envelope sequence of the speech to be processed; the standard spectral envelope sequence; the harmonic distribution sequence of the speech to be processed; the standard harmonic distribution sequence; and the number of sample points in each sequence, which corresponds to the number of FFT frame points.
[0146] 2. Frequency band distribution feature difference value
[0147] Based on the energy proportions of the low, mid, and high frequency bands, the Euclidean distance from the standard frequency band proportions is calculated.
[0148] The terms of this calculation are the energy proportions of the three frequency bands of the speech to be processed and the energy proportions of the three frequency bands of the standard speech.
[0149] 3. Gender feature difference value
[0150] The relative difference is calculated from the fundamental frequency values and normalized.
[0151] Here, the fundamental frequency term for the speech to be processed is its average fundamental frequency value. The denominator is the reference fundamental frequency value of the standard speech; if the speech is determined to be special synthesized speech, the denominator is replaced with the preset reference frequency value for that type of speech.
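The three difference values of Step 1 can be written out as follows. This is one plausible reading of the verbal descriptions, not a reproduction of the original formulas. All symbols other than D1 are assumptions introduced here: D2 and D3 name the second and third difference values, E and H are the spectral envelope and harmonic distribution sequences, P denotes the band energy proportions, F0 is the average fundamental frequency, the subscript x marks the speech to be processed, the sub- or superscript s marks the standard speech, and N is the number of sample points.

```latex
\[
D_1 = \frac{1}{2}\left(
  \sqrt{\sum_{i=1}^{N} \bigl(E_x(i) - E_s(i)\bigr)^2}
  + \sqrt{\sum_{i=1}^{N} \bigl(H_x(i) - H_s(i)\bigr)^2}
\right)
\]
\[
D_2 = \sqrt{\bigl(P_L - P_L^{s}\bigr)^2 + \bigl(P_M - P_M^{s}\bigr)^2 + \bigl(P_H - P_H^{s}\bigr)^2}
\]
\[
D_3 = \frac{\bigl\lvert F_0 - F_0^{s} \bigr\rvert}{F_0^{s}}
\]
```

For special synthesized speech, the denominator of the third expression would be the preset reference frequency value for that type of speech.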
[0152] Step 2: Feature weight assignment. Weight coefficients are assigned to the three dimensions according to the weight ranges defined in this document, subject to the weight constraint.
[0153] - Timbre feature weight: default 45%, adjustable range 40%-50%, user customization supported; - Frequency band distribution feature weight: default 35%, adjustable range 30%-40%, user customization supported; - Gender feature weight: default 20%, adjustable range 10%-20%, user customization supported.
[0154] Step 3: Calculate the initial comprehensive compensation amount by weighted summation. The difference value of each dimension is multiplied by its corresponding weight, and the products are summed to obtain the initial comprehensive compensation amount.
[0155] Step 4: Normalization and amplitude limiting calibration to obtain the final comprehensive compensation amount. To avoid audio distortion due to excessive compensation, a final calibration is performed on the initial value: 1. First, the initial comprehensive compensation amount is normalized to the [0, 1] interval, with the preset maximum difference threshold as the normalization benchmark; 2. Next, upper and lower limits are set. The limit range for standard human voice processing is [0, 6 dB]; for special synthesized speech, the upper limit can be extended according to the scenario. The result is a final comprehensive compensation amount that can be used directly for parameter generation.
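Steps 2 to 4 can be sketched as below. The default weights and ranges come from the text. Several points are assumptions: that the weights must sum to 1 (the defaults do, but the constraint itself is not legible in the source), that normalization divides by the maximum difference threshold and clamps to [0, 1], and that the normalized value is then scaled onto the [0, 6 dB] limit range. The function names and the `max_difference` default are hypothetical.

```python
import math

# Defaults and adjustable ranges from the text: timbre, band distribution, gender.
DEFAULT_WEIGHTS = (0.45, 0.35, 0.20)
WEIGHT_RANGES = ((0.40, 0.50), (0.30, 0.40), (0.10, 0.20))


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences (Step 1 helper)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def comprehensive_compensation(d1, d2, d3, weights=DEFAULT_WEIGHTS,
                               max_difference=1.0, limit_db=6.0):
    """Weighted sum of the difference values, normalized and limited."""
    for weight, (low, high) in zip(weights, WEIGHT_RANGES):
        if not low <= weight <= high:
            raise ValueError("weight outside its adjustable range")
    if abs(sum(weights) - 1.0) > 1e-9:  # assumed constraint
        raise ValueError("weights must sum to 1")
    initial = weights[0] * d1 + weights[1] * d2 + weights[2] * d3
    normalized = min(max(initial / max_difference, 0.0), 1.0)
    return normalized * limit_db  # assumed mapping onto [0, limit_db]
```

With the default weights and difference values of 0.2, 0.1, and 0.05, the weighted sum is 0.135, which this mapping turns into 0.81 dB. For special synthesized speech, a larger `limit_db` could be passed to extend the upper limit.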
[0156] Step 210: Effects processing: equalization adjustment, dynamic compression, gain control.
[0157] This step is implemented through an algorithm unit, which is configured as a serial processing architecture integrating an equalization adjustment unit, a dynamic compression unit, and a gain control unit. It supports multi-channel parallel processing, with each channel corresponding to a segment of speech signal, meeting the needs of batch processing scenarios such as multi-character speech in games and virtual interactions.
[0158] The equalization adjustment unit is a functional module that receives the equalization curve offset value in the target processing parameters and performs frequency band gain fine-tuning. It uses a digital filter with a second-order infinite impulse response (IIR) structure to achieve accurate reshaping of speech timbre.
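One way to realize a second-order IIR band adjustment is a peaking biquad. The sketch below uses the widely published "Audio EQ Cookbook" peaking-filter coefficients, which is an assumption; the specification only states that a second-order IIR structure is used. The function names, the centre frequency, and the Q value are hypothetical.

```python
import cmath
import math


def peaking_biquad(center_hz, gain_db, q, sample_rate):
    """Peaking EQ coefficients (b, a), normalized so that a[0] == 1."""
    amp = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * amp, -2.0 * math.cos(w0), 1.0 - alpha * amp]
    a = [1.0 + alpha / amp, -2.0 * math.cos(w0), 1.0 - alpha / amp]
    return [c / a[0] for c in b], [c / a[0] for c in a]


def biquad_filter(samples, b, a):
    """Direct form I difference equation for one second-order section."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def magnitude_db(b, a, freq_hz, sample_rate):
    """Magnitude response of the section at one frequency, in dB."""
    z = cmath.exp(-2j * math.pi * freq_hz / sample_rate)
    h = (b[0] + b[1] * z + b[2] * z * z) / (a[0] + a[1] * z + a[2] * z * z)
    return 20.0 * math.log10(abs(h))
```

An equalization curve offset of +1 dB for the mid band could then be applied with, for example, `peaking_biquad(1000.0, 1.0, 1.0, 44100.0)`; one such section per band would be chained in series.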
[0159] The dynamic compression unit is a processing component that sets the compression threshold and compression ratio based on the compression threshold adjustment value in the target processing parameters. It uses a soft-knee compression algorithm to smoothly compress the dynamic range of the speech signal, aiming to suppress drastic fluctuations in signal amplitude and improve auditory stability.
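The smooth compression behaviour can be illustrated with a static soft-knee level curve. The quadratic knee below is a common textbook formulation and is an assumption; the specification does not give the compression law. The function name and the parameter values in the usage note are hypothetical, and attack and release smoothing are omitted.

```python
def compressed_level_db(level_db, threshold_db, ratio, knee_db):
    """Static compressor curve: output level (dB) for an input level (dB).

    Below the knee the level is unchanged, above it the excess over the
    threshold is divided by the ratio, and inside the knee a quadratic
    segment joins the two regions smoothly.
    """
    over = level_db - threshold_db
    if 2.0 * over < -knee_db:
        return level_db
    if 2.0 * over > knee_db:
        return threshold_db + over / ratio
    if knee_db == 0.0:  # hard knee, input exactly at the threshold
        return level_db
    return level_db + (1.0 / ratio - 1.0) * (over + knee_db / 2.0) ** 2 / (2.0 * knee_db)
```

With a threshold of -18 dB, a ratio of 4, and a 6 dB knee, an input at -30 dB passes unchanged, while an input at 0 dB comes out at -13.5 dB. A compression threshold adjustment value would simply shift `threshold_db` before the curve is evaluated.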
[0160] The gain control unit is an amplitude calibration module located at the end of the processing link. It adjusts the overall level of the compressed signal according to the gain correction value in the target processing parameters to ensure that the amplitude of the final output signal strictly conforms to the preset standard.
[0161] Step 212: Signal output: Output to a digital interface or save as a WAV / MP3 file.
[0162] This step is implemented through a digital signal interface and a file storage unit.
[0163] The digital signal interface is a data transmission structure equipped with USB and SPI interfaces, which is used to directly transmit the processed digital audio signal to the digital signal input port of the audio playback device to achieve low-latency real-time playback.
[0164] The file storage unit is a built-in file storage component that supports encapsulating and saving processed digital voice signals as standard audio file formats such as WAV and MP3. It also provides configurable bitrate options during the saving process to adapt to storage space limitations and sound quality requirements in different scenarios.
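Saving in WAV format can be sketched with Python's standard `wave` module, as below. The function name is hypothetical, and the choice of mono 16-bit PCM at 44.1kHz mirrors the ADC settings given as an option in the text rather than a stated file format requirement. MP3 encoding is not covered because the standard library provides no encoder for it.

```python
import io
import struct
import wave


def encode_wav(samples, sample_rate=44100):
    """Encode mono float samples in [-1, 1] as a 16-bit PCM WAV byte string."""
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overflow
        pcm += struct.pack("<h", int(round(clipped * 32767)))
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # bytes per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(pcm))
    return buffer.getvalue()
```

The returned bytes can be written to disk with an ordinary binary file write, or passed to whatever storage layer the file storage unit uses.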
[0165] In one implementation, the above can be achieved as follows: 1. Voice acquisition step: The microphone array is activated to acquire the analog speech signal to be processed, which is then converted into a digital speech signal by a 16-bit, 44.1kHz ADC. The noise reduction unit uses an adaptive noise cancellation algorithm to filter out steady-state noise, and the normalization unit adjusts the signal amplitude to a preset range. The preprocessed digital speech signal is then transmitted to the spectrum analysis module through a high-speed serial interface.
[0166] 2. Spectrum analysis step: The spectrum analysis module processes the digital speech signal in frames, with a frame length of 1024 points and a frame shift of 512 points. The FFT operation unit performs spectrum analysis on each frame to obtain the frequency band energy distribution. The LPC operation unit extracts the spectral envelope and harmonic distribution to form the timbre features. The fundamental frequency detection unit detects the fundamental frequency using the autocorrelation method to determine the gender features. The energy proportions of the low, mid, and high frequency bands are calculated to form the frequency band distribution features. The features are then assembled into a feature parameter set and transmitted to the preset compensation module.
[0167] 3. Preset compensation step: Based on the user's voice processing needs, the system retrieves the corresponding standard preset parameters and standard speech features from the basic preset parameter library. The feature comparison unit uses the Euclidean distance algorithm to calculate the difference between the feature parameter set and the standard speech features. The system then calculates the comprehensive compensation amount by weighting the timbre (40%-50%), frequency band distribution (30%-40%), and gender (10%-20%) differences. Based on the comprehensive compensation amount, the system generates the equalization curve offset value, the compression threshold adjustment value, and the gain correction value. The standard preset parameters are then corrected to obtain the target processing parameters, which are transmitted to the effects processing module.
[0168] 4. Effects processing step: The equalization adjustment unit of the effects processing module adjusts the gain of each frequency band according to the equalization curve offset value through a second-order infinite impulse response (IIR) filter; the dynamic compression unit adjusts the dynamic range of the signal according to the compression threshold adjustment value using a soft-knee compression algorithm; and the gain control unit adjusts the overall amplitude of the signal according to the gain correction value to complete the audio optimization processing.
[0169] This specification's embodiments acquire an audio processing request carrying an initial audio signal, initial audio features, and a request description. First, using the request description, it accurately matches the corresponding template audio parameters and features from an audio template library, establishing a processing benchmark. Then, by calculating the feature difference between the template audio features and the initial audio features, it quantifies the specific differences between the target speech and the ideal template in features such as timbre and frequency band distribution, and determines the audio parameter compensation amount accordingly, achieving dynamic adaptation of the processing scheme. Subsequently, it uses this compensation amount to perform targeted parameter compensation on the template audio parameters, obtaining target processing parameters that eliminate the influence of individual differences. Finally, based on these target processing parameters, it processes the initial audio signal to obtain the target audio. This solves the problem of inapplicability of general preset values caused by individual speaker differences or special role speech without relying on manual fine-tuning. It effectively avoids phenomena such as low-frequency redundancy, high-frequency distortion, and dynamic imbalance, significantly improving the consistency and processing efficiency of batch processing multi-role speech, lowering the professional threshold, and expanding the application boundaries of speech effects in special synthesized speech scenarios.
[0170] The audio processing method is further explained below with reference to Figure 3. Figure 3 is a flowchart of the dynamic adaptation process of an audio processing method provided in one embodiment of this specification, which includes the following: The entire intelligent speech processing flow begins with the weighted analysis and difference quantification of multidimensional features. The system first receives timbre features (weight 40%-50%), frequency band distribution features (weight 30%-40%), and gender features (weight 10%-20%), and inputs these feature sets, with their specific weights, into the Euclidean distance algorithm module to calculate the difference values.
[0171] In this stage, the algorithm compares the features of the speech to be processed with the features of the standard speech in each dimension, accurately calculating the degree of deviation of each feature. Subsequently, the system integrates the difference values of each dimension into a core indicator—the comprehensive compensation amount (C)—through a weighted summation operation. This value intuitively represents the overall adjustment intensity required for the current speech signal.
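As an illustration of the weighted analysis described above, the following minimal Python sketch computes a Euclidean distance per feature dimension and combines the distances by weighted summation into the comprehensive compensation amount C. The concrete weights (0.45, 0.40, 0.15), the feature vectors, and the function names are assumptions chosen for illustration within the stated ranges; they are not taken from this specification.

```python
import math

# Hypothetical weights picked inside the ranges stated in the text
# (timbre 40%-50%, band distribution 30%-40%, gender 10%-20%).
WEIGHTS = {"timbre": 0.45, "band": 0.40, "gender": 0.15}


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def comprehensive_compensation(initial, template, weights=WEIGHTS):
    """Return (C, per-dimension differences) as a weighted sum of distances."""
    diffs = {dim: euclidean_distance(initial[dim], template[dim]) for dim in weights}
    c = sum(weights[dim] * diffs[dim] for dim in weights)
    return c, diffs


# Illustrative feature vectors (assumed values, not measured data).
initial = {"timbre": [0.2, 0.5, 0.1], "band": [0.5, 0.3, 0.2], "gender": [1.0]}
template = {"timbre": [0.2, 0.5, 0.5], "band": [0.2, 0.7, 0.2], "gender": [1.0]}
c, diffs = comprehensive_compensation(initial, template)
```

With these example vectors, the timbre distance is 0.4, the band distribution distance is 0.5, the gender distance is 0, and C is approximately 0.38.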
[0172] Based on this comprehensive compensation amount, the system enters the stage of generating equalization, compression, and gain compensation parameters, and dynamically calculates three key execution parameters, whose definitions and calculation logic are as follows. Equalization curve offset value: The equalization curve offset value is a gain adjustment (in dB) set for each of multiple individual frequency bands. It corrects the band energy difference between the speech to be processed and the standard speech, so as to achieve precise timbre optimization.
[0173] Calculation logic: First, determine the adjustment direction from the difference in the frequency band distribution feature (boost the energy if it is too low, cut it if it is too high). Then, taking the comprehensive compensation amount C as the baseline and combining it with the weight of the frequency band distribution feature and an empirical industry adjustment coefficient (default value 2 for standard vocals), calculate the single-band base offset (formula: single-band base offset = C × frequency band distribution feature weight × adjustment coefficient). Finally, perform a fine-tuning calibration based on the actual difference ratio of the energy in each frequency band to obtain the final equalization curve offset value for each band. For example, if the mid-frequency energy of a male hero's voice is 10% lower than the standard value and the calculated base offset is 1 dB, the final mid-frequency equalization curve offset value is determined to be +1 dB.
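The calculation logic for the equalization curve offset can be sketched as follows. The base offset follows the product form given in the text (C × band distribution weight × adjustment coefficient). The specification does not spell out the fine-tuning calibration rule, so the proportional rule used here (full base offset at a 10% deviation, scaled down linearly below that) is purely an assumption, as are the function and parameter names.

```python
def eq_band_offsets_db(band_diff_ratio, c, band_weight=0.4, coeff=2.0,
                       full_scale_ratio=0.10):
    """Return a per-band equalization offset in dB.

    band_diff_ratio maps a band name to (initial - template) / template energy,
    so a negative ratio means the band is too weak and should be boosted.
    """
    base_offset_db = c * band_weight * coeff  # single-band base offset
    offsets = {}
    for band, ratio in band_diff_ratio.items():
        if ratio == 0:
            offsets[band] = 0.0
            continue
        direction = 1.0 if ratio < 0 else -1.0  # boost if too low, cut if too high
        # Assumed calibration: scale the base offset by the size of the deviation.
        scale = min(1.0, abs(ratio) / full_scale_ratio)
        offsets[band] = direction * base_offset_db * scale
    return offsets


# Assumed inputs: C = 1.25 gives a base offset of 1 dB with the defaults.
offsets = eq_band_offsets_db({"low": 0.05, "mid": -0.10, "high": 0.0}, c=1.25)
```

With these inputs the base offset is 1 dB, so the mid band (10% too low) receives +1 dB, which agrees with the worked example in the text, and the low band (5% too high) receives -0.5 dB under the assumed calibration rule.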
[0174] Compression threshold adjustment value: The compression threshold adjustment value is a correction amount for the compression threshold (in dBFS) of the dynamic compression unit. A positive value represents increasing the threshold (reducing compression), and a negative value represents decreasing the threshold (increasing compression). Its core function is to adapt to the dynamic range differences of the speech being processed, avoid signal peak distortion, and ensure loudness stability.
[0175] Calculation logic: First, determine the dynamic range from the peak value and harmonic distribution of the speech signal (if the dynamic range is too large, lower the threshold). Then, taking the comprehensive compensation amount C as the baseline and combining it with the timbre feature weight and an empirical coefficient (default value 3 for standard vocals), calculate the base adjustment value (formula: base adjustment value = −(C × timbre feature weight × adjustment coefficient), where the negative sign reflects the logic that the greater the dynamic range, the stronger the compression). Finally, perform the final calibration using the fundamental frequency characteristics. For example, the adjustment magnitude can be increased by 10%-20% for deep male voices and by 30%-50% for voices with a large dynamic range, such as monster roars, to obtain the final compression threshold adjustment value.
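A minimal sketch of this threshold calculation is given below. It covers only the case described by the formula, in which the threshold is lowered. The calibration factors 1.15 and 1.40 are hypothetical values picked inside the 10%-20% and 30%-50% ranges mentioned in the text, and the voice type labels and function name are likewise assumptions.

```python
# Hypothetical calibration factors chosen inside the ranges given in the text.
CALIBRATION = {
    "standard": 1.00,
    "deep_male": 1.15,      # +10% to +20% for deep male voices
    "large_dynamic": 1.40,  # +30% to +50% for voices such as monster roars
}


def compression_threshold_adjustment_db(c, timbre_weight=0.45, coeff=3.0,
                                        voice_type="standard"):
    """Return the compression threshold adjustment in dB (negative lowers it)."""
    base_adjustment_db = -(c * timbre_weight * coeff)
    return base_adjustment_db * CALIBRATION[voice_type]


standard_adj = compression_threshold_adjustment_db(1.0)
deep_male_adj = compression_threshold_adjustment_db(1.0, voice_type="deep_male")
roar_adj = compression_threshold_adjustment_db(1.0, voice_type="large_dynamic")
```

For C = 1.0 and the assumed timbre weight of 0.45, the base adjustment is -1.35 dB, which becomes about -1.55 dB for the deep male case and -1.89 dB for the large dynamic range case.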
[0176] Gain correction value: The gain correction value is a uniform gain adjustment (in dB) of the overall amplitude of the speech signal, which aims to complete the final loudness normalization, ensure that the processed speech meets the preset output standard, and avoid uneven volume.
[0177] Calculation logic: First, the root mean square (RMS) value of the preprocessed speech is detected and compared with the target RMS value to calculate the initial gain difference. Then, the RMS change of the signal after equalization adjustment and dynamic compression is estimated (formula: estimated post-processing RMS = original signal RMS + weighted average of the equalization offset values of the frequency bands + gain change caused by compression). Finally, the calibration is completed against the target loudness (formula: gain correction value = target output RMS − estimated post-processing RMS).
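The gain correction logic can be illustrated with the short sketch below, which works with levels expressed in dB so that the contributions add. The band weights used for the weighted average, the example levels, and all function names are assumptions for illustration only.

```python
import math


def rms_dbfs(samples):
    """RMS level in dBFS of samples normalized to the range [-1, 1]."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def gain_correction_db(original_rms_db, eq_offsets_db, band_weights,
                       compression_gain_change_db, target_rms_db):
    """Return (gain correction, estimated post-processing RMS), both in dB."""
    total_weight = sum(band_weights[b] for b in eq_offsets_db)
    eq_average_db = sum(eq_offsets_db[b] * band_weights[b]
                        for b in eq_offsets_db) / total_weight
    estimated_rms_db = original_rms_db + eq_average_db + compression_gain_change_db
    return target_rms_db - estimated_rms_db, estimated_rms_db


# Assumed example values, not measurements from the specification.
correction_db, estimated_db = gain_correction_db(
    original_rms_db=-24.0,
    eq_offsets_db={"low": -0.5, "mid": 1.0, "high": 0.0},
    band_weights={"low": 0.25, "mid": 0.5, "high": 0.25},
    compression_gain_change_db=-1.0,
    target_rms_db=-20.0,
)
square_wave_level = rms_dbfs([0.5, -0.5, 0.5, -0.5])
```

In this example the weighted average equalization offset is 0.375 dB, the estimated post-processing level is -24.625 dB, and the resulting gain correction is +4.625 dB.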
[0178] The final stage of the process is the synthesis of target processing parameters. The system applies the compensation parameters (equalization offset, compression threshold adjustment, gain correction) generated above to the basic preset parameters. By progressively correcting and superimposing the compensation parameters on the basic preset parameters, the system finally generates target processing parameters that adapt to the characteristics of the speech being processed. These parameters are then transmitted to the effects processing module, driving the digital filters, dynamic compressor, and gain controller to perform specific audio rendering, thereby achieving a closed-loop process from feature analysis to sound optimization.
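The synthesis step described above amounts to superimposing the three compensation values on the template (preset) parameters. The sketch below shows one way this might look; the `ProcessingParams` container, its field names, and the example numbers are hypothetical and are not defined by this specification.

```python
from dataclasses import dataclass


@dataclass
class ProcessingParams:
    """Hypothetical container for the parameters passed to the effects module."""
    eq_gains_db: dict                 # per-band equalizer gain, in dB
    compressor_threshold_dbfs: float  # dynamic compressor threshold
    output_gain_db: float             # overall gain


def apply_compensation(template, eq_offsets_db, threshold_adjust_db,
                       gain_correction_db):
    """Superimpose the compensation values on the template parameters."""
    eq = {band: gain + eq_offsets_db.get(band, 0.0)
          for band, gain in template.eq_gains_db.items()}
    return ProcessingParams(
        eq_gains_db=eq,
        compressor_threshold_dbfs=template.compressor_threshold_dbfs + threshold_adjust_db,
        output_gain_db=template.output_gain_db + gain_correction_db,
    )


# Assumed template preset and compensation values.
template = ProcessingParams({"low": 2.0, "mid": 0.0, "high": 1.0}, -18.0, 0.0)
target = apply_compensation(template, {"low": -0.5, "mid": 1.0}, -1.5, 4.625)
```

Returning a new object rather than mutating the template keeps the preset in the template library unchanged, so it can be reused for the next request.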
[0179] Corresponding to the above method embodiments, this specification also provides embodiments of an audio processing apparatus. Figure 4 is a schematic diagram of an audio processing apparatus provided in one embodiment of this specification. As shown in Figure 4, the apparatus includes: a request acquisition module 402, configured to acquire an audio processing request for the initial audio, wherein the audio processing request carries the initial audio signal, initial audio features, and a request description of the audio processing request; a template matching module 404, configured to match the template audio parameters and template audio features corresponding to the request description from the audio template library; a compensation amount determination module 406, configured to determine the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features; a parameter compensation module 408, configured to perform parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain the target processing parameters; and a signal processing module 410, configured to perform signal processing on the initial audio signal based on the target processing parameters to obtain the target audio.
[0180] The compensation amount determination module 406 is further configured to align the multi-dimensional template audio features with the multi-dimensional initial audio features to determine the feature differences in the multi-dimensional features; based on the preset multi-dimensional weights, the feature differences in the multi-dimensional features are weighted respectively to obtain the comprehensive parameter compensation amount; and based on the comprehensive parameter compensation amount, the audio parameter compensation amount is generated.
[0181] The compensation amount determination module 406 is further configured to determine the energy adjustment direction of each frequency band based on the difference components of each frequency band in the feature difference of the frequency band distribution feature dimension; determine the basic curve offset of each frequency band based on the comprehensive parameter compensation amount and the weights corresponding to the frequency band distribution feature dimension, under the constraint of the energy adjustment direction; calibrate the basic curve offset of each frequency band based on the difference components of each frequency band in the feature difference of the frequency band distribution feature dimension to obtain the single-band curve offset of each frequency band; and obtain the equalization curve offset based on the single-band curve offset of each frequency band.
[0182] The compensation amount determination module 406 is further configured to determine the adjustment direction of the compression threshold adjustment value based on the difference value of the timbre feature dimension; determine the initial compression threshold adjustment value under the constraint of the adjustment direction based on the weights corresponding to the comprehensive parameter compensation amount and the timbre feature dimension; and perform amplitude calibration on the initial compression threshold adjustment value based on the difference value of the gender feature dimension to obtain the compression threshold adjustment value.
[0183] The compensation amount determination module 406 is further configured to determine the basic curve offset and compression threshold adjustment value for each frequency band based on the comprehensive parameter compensation amount; determine the initial gain adjustment amount based on the difference value of the average energy characteristic dimension; determine the average curve offset based on the single-band curve offset of each frequency band; estimate the gain change amount based on the current compression threshold and compression threshold adjustment value of the initial audio signal; predict the processed audio average energy value based on the initial gain adjustment amount, average curve offset, and gain change amount; and obtain the gain correction value based on the difference between the predicted average energy value and the preset target output energy value.
[0184] The request acquisition module 402 is further configured to acquire the initial audio signal; perform spectrum analysis on the initial audio signal, and extract the initial audio features of the acquired initial audio signal.
[0185] The request acquisition module 402 is further configured to divide the spectrum of the initial audio signal into various frequency bands according to preset threshold values for each frequency band; calculate the total energy of the initial audio signal and the sub-energy of each frequency band; and determine the initial audio features of the frequency band distribution feature dimension based on the ratio of the sub-energy of each frequency band to the total energy.
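The band distribution feature described above can be sketched as follows. The example uses a naive discrete Fourier transform from the standard library so that it stays self-contained; a real implementation would more likely use an FFT routine. The band edges (300 Hz and 2000 Hz), the sample rate, and the test signal are assumptions for illustration.

```python
import cmath
import math


def power_spectrum(samples, sample_rate):
    """One-sided power spectrum as a list of (frequency_hz, power), naive DFT."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append((k * sample_rate / n, abs(acc) ** 2))
    return spectrum


def band_energy_ratios(samples, sample_rate, band_edges_hz):
    """Ratio of each band's sub-energy to the total energy."""
    spectrum = power_spectrum(samples, sample_rate)
    total = sum(power for _, power in spectrum)
    if total == 0:
        return {name: 0.0 for name in band_edges_hz}
    return {
        name: sum(power for freq, power in spectrum if lo <= freq < hi) / total
        for name, (lo, hi) in band_edges_hz.items()
    }


# Hypothetical band thresholds and a synthetic two-tone test signal.
SAMPLE_RATE = 8000
BANDS = {"low": (0.0, 300.0), "mid": (300.0, 2000.0), "high": (2000.0, 4001.0)}
signal = [math.sin(2 * math.pi * 125 * t / SAMPLE_RATE)
          + 0.5 * math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE)
          for t in range(256)]
ratios = band_energy_ratios(signal, SAMPLE_RATE, BANDS)
```

Because the 125 Hz tone has twice the amplitude of the 1000 Hz tone, it carries four times the energy, so the low band holds about 80% of the total energy and the mid band about 20%.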
[0186] The request acquisition module 402 is further configured to calculate the initial audio signal using a linear predictive coding model to obtain linear prediction coefficients; determine the spectral envelope of the initial audio signal based on the linear prediction coefficients; extract the harmonic distribution data of the initial audio signal; and determine the initial audio features of the timbre feature dimension based on the spectral envelope and harmonic distribution data.
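To make the linear predictive coding step more concrete, the sketch below estimates linear prediction coefficients with the autocorrelation method and the Levinson-Durbin recursion, then evaluates the spectral envelope from those coefficients. This is one common way to compute LPC and is offered as an assumption; the specification does not state which LPC algorithm is used, and the harmonic distribution extraction is not covered here. The test signal and function names are illustrative.

```python
import cmath
import math


def autocorrelation(x, max_lag):
    """Autocorrelation values r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a1*z^-1 + ... ; returns (coefficients, error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a, err


def spectral_envelope_db(a, err, num_points=64):
    """Envelope err / |A(e^jw)|^2 in dB, sampled from w = 0 to w = pi."""
    envelope = []
    for m in range(num_points):
        w = math.pi * m / (num_points - 1)
        denom = abs(sum(c * cmath.exp(-1j * w * j) for j, c in enumerate(a))) ** 2
        envelope.append(10.0 * math.log10(err / denom))
    return envelope


# Synthetic test signal: impulse response of a one-pole filter with pole 0.9.
signal = [0.9 ** n for n in range(200)]
a, err = levinson_durbin(autocorrelation(signal, 2), order=2)
envelope = spectral_envelope_db(a, err)
```

For this synthetic signal the recursion recovers a first coefficient close to -0.9 and a second coefficient close to 0, and the envelope is highest at low frequencies, as expected for a low-pass resonance.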
[0187] The request acquisition module 402 is further configured to extract the fundamental frequency value of the initial audio signal; match the fundamental frequency value with a preset gender fundamental frequency threshold to determine the gender category corresponding to the initial audio signal; and extract the initial audio features of the gender feature dimension based on the gender category.
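The fundamental frequency matching can be sketched as below. The pitch estimator is a simple autocorrelation peak search, used here as a stand-in because the specification does not name an extraction algorithm. The 165 Hz threshold, the 60-400 Hz search range, the category labels, and the one-element feature encoding are all hypothetical.

```python
import math

GENDER_F0_THRESHOLD_HZ = 165.0  # hypothetical preset threshold


def estimate_f0(samples, sample_rate, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency from the autocorrelation peak."""
    n = len(samples)
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return None if best_lag is None else sample_rate / best_lag


def gender_feature(samples, sample_rate, threshold_hz=GENDER_F0_THRESHOLD_HZ):
    """Return (f0, category, feature vector) for the gender feature dimension."""
    f0 = estimate_f0(samples, sample_rate)
    if f0 is None:
        return None, "unknown", [0.5]
    category = "male" if f0 < threshold_hz else "female"
    return f0, category, [0.0] if category == "male" else [1.0]


def tone(freq_hz, sample_rate=8000, length=800):
    """Synthetic sine tone used only to exercise the sketch."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(length)]


f0_low, category_low, vector_low = gender_feature(tone(120.0), 8000)
f0_high, category_high, vector_high = gender_feature(tone(220.0), 8000)
```

A single fixed threshold is a coarse rule, so an implementation following this approach would probably need additional handling for unvoiced or very noisy input, which this sketch reports only as "unknown" when no positive autocorrelation peak is found.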
[0188] The parameter compensation module 408 is further configured to perform frequency band compensation adjustment on the equalizer response curve in the template audio parameters based on the equalizer curve offset to obtain the target processing parameters; adjust the dynamic compressor threshold in the template audio parameters based on the compression threshold adjustment value to obtain the target processing parameters; and correct the overall gain value in the template audio parameters based on the gain correction value to obtain the target processing parameters.
[0189] The request acquisition module 402 is further configured to acquire an analog speech signal; preprocess the analog speech signal to obtain an initial audio signal, wherein the preprocessing includes at least one of digital conversion and noise cancellation.
[0190] In the embodiments of this specification, an audio processing request carrying an initial audio signal, initial audio features, and a request description is acquired. First, the request description is used to match the corresponding template audio parameters and template audio features from an audio template library, which establishes a processing benchmark. Then, the feature difference between the template audio features and the initial audio features is calculated, which quantifies how the target speech differs from the ideal template in features such as timbre and frequency band distribution, and the audio parameter compensation amount is determined accordingly, so that the processing scheme adapts dynamically. Next, this compensation amount is used to compensate the template audio parameters, yielding target processing parameters that offset the influence of individual differences. Finally, the initial audio signal is processed based on these target processing parameters to obtain the target audio. This addresses the problem that general preset values do not suit individual speaker differences or special character voices, without relying on manual fine-tuning. It helps avoid low-frequency redundancy, high-frequency distortion, and dynamic imbalance, improves the consistency and efficiency of batch processing of multi-character speech, lowers the expertise required, and extends the use of speech effects to special synthesized speech scenarios.
[0191] The above is an illustrative scheme of an audio processing apparatus according to this embodiment. It should be noted that the technical solution of this audio processing apparatus and the technical solution of the above-described audio processing method belong to the same concept. For details not described in detail in the technical solution of the audio processing apparatus, please refer to the description of the technical solution of the above-described audio processing method.
[0192] It should be noted that the audio processing method provided in this specification can be applied to various industries or scenarios, including virtual reality processing software, home entertainment product software, digital cultural product production software, digital cultural creative software, digital cultural creative design, education, news, cultural content industry software, digital publishing software, game and animation software, digital music development and production, and digital mobile multimedia development and production. In some cases, it can also be applied to animation and game production engine software and development systems, animation and game digital content services, digital film and television development and production, digital performance development and production, and the like.
[0193] Figure 5 is a structural block diagram of a computing device according to one embodiment of this specification. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. The processor 520 is connected to the memory 510 via a bus 530, and a database 540 is used to store data.
[0194] The computing device 500 also includes an access device 540, which enables the computing device 500 to communicate via one or more networks 560. Examples of these networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 540 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a WiMAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.
[0195] In one embodiment of this specification, the above-described components of the computing device 500 and other components not shown in Figure 5 can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 5 is for illustrative purposes only and is not intended to limit the scope of this specification. Those skilled in the art can add or replace other components as needed.
[0196] The computing device 500 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 500 can also be a mobile or stationary server.
[0197] The processor 520 is used to execute computer programs / instructions, which, when executed by the processor, implement the steps of the above-described audio processing method.
[0198] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the audio processing method described above belong to the same concept. For details not described in detail in the technical solution of the computing device, please refer to the description of the technical solution of the audio processing method described above.
[0199] An embodiment of this specification also provides a computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the above-described audio processing method.
[0200] The above is an illustrative embodiment of a computer-readable storage medium. It should be noted that the technical solution of this storage medium and the technical solution of the aforementioned audio processing method belong to the same concept. Details not described in detail in the technical solution of the storage medium can be found in the description of the technical solution of the aforementioned audio processing method.
[0201] An embodiment of this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described audio processing method.
[0202] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product and the technical solution of the above-described audio processing method belong to the same concept. For details not described in detail in the technical solution of the computer program product, please refer to the description of the technical solution of the above-described audio processing method.
[0203] The foregoing has described specific embodiments of this specification. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0204] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
[0205] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.
[0206] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0207] The preferred embodiments disclosed above are merely illustrative of this specification. Optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the embodiments described herein. These embodiments are selected and specifically described in this specification to better explain the principles and practical applications of the embodiments, thereby enabling those skilled in the art to better understand and utilize this specification. This specification is limited only by the claims and all their equivalents.
Claims
1. An audio processing method, characterized in that it comprises: obtaining an audio processing request for initial audio, wherein the audio processing request carries an initial audio signal, initial audio features, and a request description of the audio processing request; matching, from an audio template library, template audio parameters and template audio features corresponding to the request description; determining an audio parameter compensation amount based on a feature difference between the template audio features and the initial audio features; performing parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain target processing parameters; and processing the initial audio signal based on the target processing parameters to obtain target audio.
2. The method according to claim 1, characterized in that the initial audio features have multiple dimensions and the template audio features also have multiple dimensions; and determining the audio parameter compensation amount based on the feature difference between the template audio features and the initial audio features comprises: aligning the multi-dimensional template audio features with the multi-dimensional initial audio features to determine the feature differences across the multiple dimensions; weighting the feature differences of the multiple dimensions respectively based on preset weights of the multiple dimensions to obtain a comprehensive parameter compensation amount; and generating the audio parameter compensation amount based on the comprehensive parameter compensation amount.
3. The method according to claim 2, characterized in that the multiple dimensions include a frequency band distribution feature dimension, and the audio parameter compensation amount includes an equalization curve offset; and generating the audio parameter compensation amount based on the comprehensive parameter compensation amount comprises: determining an energy adjustment direction of each frequency band based on the difference component of each frequency band in the feature difference of the frequency band distribution feature dimension; determining a basic curve offset of each frequency band, under the constraint of the energy adjustment direction, based on the comprehensive parameter compensation amount and the weight corresponding to the frequency band distribution feature dimension; calibrating the basic curve offset of each frequency band based on the difference component of each frequency band in the feature difference of the frequency band distribution feature dimension to obtain a single-band curve offset of each frequency band; and obtaining the equalization curve offset based on the single-band curve offset of each frequency band.
4. The method according to claim 2, characterized in that the multiple dimensions include a timbre feature dimension and a gender feature dimension, and the audio parameter compensation amount includes a compression threshold adjustment value; and generating the audio parameter compensation amount based on the comprehensive parameter compensation amount comprises: determining an adjustment direction of the compression threshold adjustment value based on the difference value of the timbre feature dimension; determining an initial compression threshold adjustment value, under the constraint of the adjustment direction, based on the comprehensive parameter compensation amount and the weight corresponding to the timbre feature dimension; and calibrating the initial compression threshold adjustment value based on the difference value of the gender feature dimension to obtain the compression threshold adjustment value.
5. The method according to claim 2, characterized in that the multiple dimensions further include an average energy characteristic dimension, and the audio parameter compensation amount includes a gain correction value; and generating the audio parameter compensation amount based on the comprehensive parameter compensation amount comprises: determining the basic curve offset of each frequency band and the compression threshold adjustment value respectively based on the comprehensive parameter compensation amount; determining an initial gain adjustment amount based on the difference value of the average energy characteristic dimension; determining an average curve offset based on the single-band curve offset of each frequency band; estimating a gain change amount based on the current compression threshold of the initial audio signal and the compression threshold adjustment value; predicting an average energy value of the processed audio based on the initial gain adjustment amount, the average curve offset, and the gain change amount; and obtaining the gain correction value based on the difference between the predicted average energy value and a preset target output energy value.
6. The method according to any one of claims 1-5, characterized in that obtaining the initial audio features comprises: acquiring the initial audio signal; and performing spectral analysis on the initial audio signal to extract the initial audio features of the initial audio signal.
7. The method according to claim 6, characterized in that performing spectral analysis on the initial audio signal to extract the initial audio features of the initial audio signal comprises: dividing the spectrum of the initial audio signal into frequency bands according to preset threshold values for each frequency band; calculating the total energy of the initial audio signal and the sub-energy of each frequency band; and determining the initial audio features of the frequency band distribution feature dimension based on the ratio of the sub-energy of each frequency band to the total energy.
8. The method according to claim 6, characterized in that performing spectral analysis on the initial audio signal to extract the initial audio features of the initial audio signal comprises: processing the initial audio signal with a linear predictive coding model to obtain linear prediction coefficients; determining the spectral envelope of the initial audio signal based on the linear prediction coefficients; extracting harmonic distribution data of the initial audio signal; and determining the initial audio features of the timbre feature dimension based on the spectral envelope and the harmonic distribution data.
9. The method according to claim 6, characterized in that performing spectral analysis on the initial audio signal to extract the initial audio features of the initial audio signal comprises: extracting the fundamental frequency value of the initial audio signal; matching the fundamental frequency value against a preset gender fundamental frequency threshold to determine the gender category corresponding to the initial audio signal; and extracting the initial audio features of the gender feature dimension based on the gender category.
10. The method according to claim 1, characterized in that the audio parameter compensation amount includes at least one of an equalization curve offset, a compression threshold adjustment value, and a gain correction value; and performing parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain the target processing parameters comprises at least one of the following: performing frequency band compensation adjustment on the equalizer response curve in the template audio parameters based on the equalization curve offset to obtain the target processing parameters; adjusting the dynamic compressor threshold in the template audio parameters based on the compression threshold adjustment value to obtain the target processing parameters; and correcting the overall gain value in the template audio parameters based on the gain correction value to obtain the target processing parameters.
11. The method according to claim 1, characterized in that acquiring the initial audio signal comprises: acquiring an analog speech signal; and preprocessing the analog speech signal to obtain the initial audio signal, wherein the preprocessing includes at least one of digital conversion and noise cancellation.
12. An audio processing apparatus, characterized in that it comprises: a request acquisition module, configured to acquire an audio processing request for initial audio, wherein the audio processing request carries an initial audio signal, initial audio features, and a request description of the audio processing request; a template matching module, configured to match, from an audio template library, template audio parameters and template audio features corresponding to the request description; a compensation amount determination module, configured to determine an audio parameter compensation amount based on a feature difference between the template audio features and the initial audio features; a parameter compensation module, configured to perform parameter compensation on the template audio parameters based on the audio parameter compensation amount to obtain target processing parameters; and a signal processing module, configured to perform signal processing on the initial audio signal based on the target processing parameters to obtain target audio.
13. A computing device, characterized in that it comprises: a memory and a processor; wherein the memory is configured to store a computer program / instructions, and the processor is configured to execute the computer program / instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that it stores a computer program / instructions that, when executed by a processor, implement the steps of the method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program / instructions, characterized in that the computer program / instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 11.