A live volume adaptive control method, system, device and storage medium

Scene information and signal energy of the target audio are acquired and the energy is smoothed; the signal gain is then estimated and the volume is adjusted. This addresses the poor stability of volume control in live streaming and improves the dynamic range of the audio and the user experience.

CN116614668B · Active · Publication Date: 2026-06-12 · BIGO TECH PTE LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BIGO TECH PTE LTD
Filing Date: 2023-04-07
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing volume control solutions cannot effectively adapt to volume differences in different scenarios during live streaming, resulting in poor volume control stability in voice and music scenarios, which affects the user's listening experience.

Method used

The method acquires scene information of the target audio, calculates the signal energy and smooths it, estimates the signal gain, and adjusts the volume to suit the needs of different scenes. Signal detection, signal tracking, and gain estimation modules are used for adaptive volume control.

Benefits of technology

It enables precise and stable volume control during live streaming, improves the clarity of audio signals in voice scenarios and the dynamic range in music scenarios, and optimizes the user's listening experience.



Abstract

The embodiment of the application discloses a live broadcast volume adaptive control method, system, device and storage medium. The technical scheme provided by the embodiment of the application comprises the following steps: obtaining a target audio, determining scene information of the target audio; then calculating signal energy of the target audio, performing smoothing processing on the signal energy based on the scene information to obtain a smoothed signal of the target audio; then estimating signal gain of the target audio according to the scene information, the signal energy and the smoothed signal, and adjusting the volume of the target audio based on the signal gain. By using the above technical means, the target audio is smoothed by combining the scene information, and the smooth representation of the audio signal can be realized for different scenes. Then, the signal gain is estimated according to the smoothed signal, and the volume of the audio is adjusted, so that the audio signal of the voice scene is clearer and more stable, the audio signal of the music scene is smoother, the volume gain control demand in different scenes is met, and the listening experience of the user is improved.

Description

Technical Field

[0001] This application relates to the field of audio processing technology, and in particular to an adaptive control method, system, device and storage medium for live streaming volume.

Background Technology

[0002] Currently, in online live streaming scenarios, volume control is necessary to accommodate the volume differences in recordings by different broadcasters and to maintain a relatively stable volume across the entire platform. This volume control involves real-time tracking of the audio signal and applying dynamic gain to adjust the broadcaster's voice volume, ensuring that the peak volume of the broadcaster's voice remains stable at a consistent amplitude.

[0003] However, existing volume control schemes are only suitable for voice scenarios. In order to keep the broadcaster's volume peak at a stable amplitude, the dynamic range of the processed audio signal is reduced. For music scenarios, voice mixed with music, and other scenarios, the stability is relatively poor, which will cause dynamic damage to the music signal and result in a poor listening experience for the user.

Summary of the Invention

[0004] This application provides an adaptive control method, system, device, and storage medium for live streaming volume, which can improve the stability and scene adaptability of volume control and solve the technical problem of poor volume control stability in different scenarios.

[0005] In a first aspect, embodiments of this application provide an adaptive control method for live streaming volume, comprising:

[0006] Acquire the target audio and determine the scene information of the target audio;

[0007] Calculate the signal energy of the target audio, and smooth the signal energy based on scene information to obtain a smoothed signal of the target audio.

[0008] The signal gain of the target audio is estimated based on the scene information, the signal energy, and the smoothed signal, and the volume of the target audio is adjusted based on the signal gain.

[0009] In a second aspect, embodiments of this application provide an adaptive control system for live streaming volume, comprising:

[0010] The signal detection module is configured to acquire the target audio and determine the scene information of the target audio.

[0011] The signal tracking module is configured to calculate the signal energy of the target audio and smooth the signal energy based on scene information to obtain a smoothed signal of the target audio.

[0012] The gain estimation module is configured to estimate the signal gain of the target audio based on scene information, signal energy, and smoothed signal.

[0013] The gain processing module is configured to adjust the volume of the target audio based on the signal gain.

[0014] In a third aspect, embodiments of this application provide an adaptive control device for live streaming volume, comprising:

[0015] Memory and one or more processors;

[0016] The memory is configured to store one or more programs;

[0017] When the one or more programs are executed by the one or more processors, the one or more processors implement the adaptive control method for live volume as described in the first aspect.

[0018] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions configured, when executed by a computer processor, to perform the adaptive control method for live volume as described in the first aspect.

[0019] In a fifth aspect, embodiments of this application provide a computer program product containing instructions that, when executed on a computer or processor, cause the computer or processor to perform the adaptive control method for live volume as described in the first aspect.

[0020] This application embodiment acquires target audio and determines its scene information; then calculates the signal energy of the target audio, and smooths the signal energy based on the scene information to obtain a smoothed signal of the target audio; subsequently, it estimates the signal gain of the target audio based on the scene information, signal energy, and smoothed signal, and adjusts the volume of the target audio based on the signal gain. By employing the above technical means and combining scene information to smooth the target audio, a smooth representation of the audio signal can be achieved for different scenes. Furthermore, by estimating the signal gain based on the smoothed signal and adjusting the audio volume, the audio signal in speech scenes becomes clearer and more stable, while the audio signal in music scenes retains its dynamic range and has a more appropriate volume, meeting the volume gain control requirements of different scenes, optimizing volume control results, and improving the user's listening experience.

Attached Figure Description

[0021] Figure 1 is a flowchart of an adaptive control method for live streaming volume provided in an embodiment of this application;

[0022] Figure 2 is a flowchart of the target audio processing in an embodiment of this application;

[0023] Figure 3 is a schematic diagram of signal smoothing processing in an embodiment of this application;

[0024] Figure 4 is a schematic diagram of signal energy conversion in an embodiment of this application;

[0025] Figure 5 is a schematic diagram of smoothed signal conversion in an embodiment of this application;

[0026] Figure 6 is a flowchart of the signal gain calculation in an embodiment of this application;

[0027] Figure 7 is a schematic diagram of signal gain estimation in an embodiment of this application;

[0028] Figure 8 is a schematic diagram illustrating the target gain value query in an embodiment of this application;

[0029] Figure 9 is a schematic diagram of the structure of an adaptive control system for live streaming volume provided in an embodiment of this application;

[0030] Figure 10 is a schematic diagram of the structure of an adaptive control device for live streaming volume provided in an embodiment of this application.

Detailed Implementation

[0031] To make the objectives, technical solutions, and advantages of this application clearer, specific embodiments of this application will be described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are merely for explaining this application and not for limiting it. It should also be noted that, for ease of description, only the parts relevant to this application are shown in the drawings, not all of them. Before discussing exemplary embodiments in more detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although the flowcharts describe operations (or steps) as sequential processes, many of these operations can be performed in parallel, concurrently, or simultaneously. Furthermore, the order of the operations can be rearranged. The process can be terminated when its operation is completed, but may also have additional steps not included in the drawings. The process can correspond to a method, function, procedure, subroutine, subprogram, etc.

[0032] The adaptive volume control method for live streaming provided in this application aims to smooth the target audio by combining scene information, estimate the signal gain based on the smoothed signal, and adjust the audio volume to meet the volume gain control requirements in different scenarios, thereby achieving adaptive volume control in different scenarios and improving the user's listening experience.

[0033] Generally, in audio and video service scenarios, users prefer a stable, smooth, and clear audio experience, and volume is a crucial factor affecting this experience. Excessive volume can lead to distortion and a harsh, jarring sound, while insufficient volume makes the audio difficult to listen to and less clear. Different audio types require different volume levels. In live streaming, a good volume experience means maintaining a stable volume for continuous audio signals, providing a dynamic feel to continuous music signals, and preventing excessive amplification of noise signals, all while ensuring the appropriate volume. However, on live streaming platforms, the equipment used by broadcasters, the streaming environment, and the content they record result in significant volume variations among different broadcasters. Therefore, platforms need to use volume control algorithms to adjust and maintain a relatively stable volume across the entire platform. Since people have different volume requirements for audio, music, and noise environments, traditional volume control technologies primarily develop different algorithm kernels for specific scenarios. If different technologies are used haphazardly in live streaming, algorithm conflicts or unnatural transitions between scenarios may occur, degrading the overall sound quality. For volume control in scenarios such as voice, music, and voice-music mix, targeting only one type of audio signal with volume control will affect the volume processing of other types of audio signals, resulting in relatively poor stability.

[0034] Therefore, volume control methods suitable for live streaming should provide users with intelligent volume adjustment functions according to the needs of different environments. Based on this, this application provides an adaptive volume control method for live streaming to solve the technical problem of poor volume control stability in different scenarios.

[0035] Example:

[0036] Figure 1 shows a flowchart of an adaptive control method for live streaming volume provided in an embodiment of this application. The adaptive control method for live streaming volume provided in this embodiment can be executed by an adaptive control device for live streaming volume. This adaptive control device for live streaming volume can be implemented by software and/or hardware. The adaptive control device for live streaming volume can consist of two or more physical entities, or it can consist of a single physical entity. Generally, the adaptive control device for live streaming volume can be an audio processing server, a live streaming terminal device, a computer, a mobile phone, a tablet, or other processing devices.

[0037] The following description takes the adaptive control device for live streaming volume as the entity that executes the adaptive control method for live streaming volume. Referring to Figure 1, the adaptive control method for live streaming volume specifically includes:

[0038] S110. Obtain the target audio and determine the scene information of the target audio;

[0039] S120. Calculate the signal energy of the target audio, and smooth the signal energy based on scene information to obtain a smoothed signal of the target audio.

[0040] S130. Estimate the signal gain of the target audio based on the scene information, the signal energy, and the smoothed signal, and adjust the volume of the target audio based on the signal gain.

[0041] In this embodiment of the application, when controlling the live streaming volume, the scene information of the input audio signal is determined so that the volume can be adaptively controlled based on the scene information. The audio signal input to the adaptive control device for live streaming volume is defined as the target audio. By identifying the scene information of the target audio, corresponding strategy parameters are provided for the smoothing processing and signal gain estimation of the target audio based on the scene information, thereby adjusting the volume of the target audio in real time.

[0042] Generally, in order to reduce the volume difference between different live streaming rooms and optimize the volume experience within a live streaming room, volume control methods are usually introduced to provide signal compression and amplification capabilities, making the sound more stable and louder. Based on this, this application embodiment provides a scenario-based audio processing logic to achieve precise and stable volume control in live streaming scenarios.

[0043] Figure 2 shows a flowchart of the target audio processing in an embodiment of this application. Referring to Figure 2, the volume of the target audio input to the device is controlled through the aforementioned scene-based audio processing logic. Specifically, for the corresponding input signal x[n] (i.e., the target audio), the signal gain is calculated and used to adjust the audio. When calculating the signal gain, a signal detection module is first used to detect the scene information of the target audio. The signal detection module is a composite model that can provide scene information of the input signal. The signal detection module uses SED (Signal Environment Detection) technology to classify the input signal into different scenes such as speech, music, speech mixed with music, and noisy signals, and then outputs the corresponding scene information.

[0044] SED technology provides frame-level scene classification, but frequent and rapid scene switching is clearly not suitable for live streaming scenarios. Therefore, this application further performs post-processing operations on SED, enabling the signal detection module to provide second-level scene information.

[0045] The input signal is further tracked using a signal tracking module to obtain a smoothed signal of the target audio. The signal tracking module calculates the frame size of each frame of the target audio signal to obtain the signal energy of the target audio. Then, based on this signal energy and relevant parameter information determined according to scene information, the target audio is converted from an instantaneously changing amplitude to a smoothly changing envelope form amplitude, i.e., a smoothed signal. The signal tracking module designs different near-exponential smoothing equations to smooth the signal between consecutive frames, achieving accurate and fast signal tracking in speech scenarios, and accurate, smooth, and stable tracking in music scenarios.

[0046] Next, based on the smoothed signal, the gain estimation module estimates the amount of gain required for the target audio, i.e., the signal gain. The gain estimation module calculates the gain value of the smoothed signal and the original signal energy of the target audio frame, combining parameters provided by the scene information, thus obtaining the signal gain of the target audio. This signal gain is then passed to the gain processing module, which applies a gain to the target audio based on this signal gain, thereby allowing volume adjustment of the target audio using the signal gain.

[0047] Furthermore, this application also utilizes a control module to provide corresponding calculation parameters to the signal tracking module and gain estimation module based on scene information, thereby achieving adaptive volume adjustment under different scenarios. The control module receives scene information from the signal detection module, and when the scene information changes relative to the previous input signal, the control module adjusts the parameters and volume control strategies of the signal tracking module and gain estimation module. Optionally, to ensure the naturalness and smoothness of scene transitions, different switching strategies can be designed; that is, the strategies for switching from scene A to scene B and from scene B to scene A are different, satisfying adaptive volume control under different scenarios.

[0048] By acquiring scene information based on the characteristics of sound signals in different scenarios, and using the scene information to provide influencing parameters and additional mechanisms to control the real-time signal tracking and gain estimation process, the resulting signal gain is used for volume control, making voice signal tracking more accurate and faster, and music signal tracking and estimation smoother and more controllable. Furthermore, through additional gain compensation and gain control modules, the signal gain requirements in different scenarios are met, thereby obtaining better volume control results in live streaming and improving the user's listening experience.

[0049] Specifically, when determining the scene information of the target audio, the signal detection module includes:

[0050] S1101. Perform speech recognition on the target audio to obtain speech information and noise information;

[0051] S1102. Determine the scene label of the target audio based on the signal composition of the target audio;

[0052] S1103. Use the voice information, noise information, and scene label as the scene information.

[0053] Considering the potential for misjudgment by a single signal detection model, the signal detection module in this embodiment uses two types of models for signal detection, primarily including an audio signal detection model (SED) and an audio recognition model (VAD). The scene label of the target audio is determined by the audio signal detection model, while the audio recognition model provides additional calibration information, adding conditions for judging noisy scenes. The audio signal detection model is a signal classification model implemented using a neural network, while the audio recognition model is a Gaussian mixture model based on energy features. For the target audio, 200 frames of signal are input to the two models of the signal detection module each time, with each frame having a length of 10ms. The audio recognition model, through speech recognition, outputs the number of speech signal frames and the number of noise signal frames in the 200 frames, i.e., speech information and noise information, to assist in scene classification. The audio signal detection model, through signal detection, outputs three arrays of length 200, where the elements of the arrays range from 0 to 1, representing the probability that each frame of the 200 frames is speech, music, or noise.

[0054] Furthermore, the audio signal detection model performs post-processing based on the aforementioned probability information to output scene labels for the target audio. Specifically, by determining the probability values of each signal type in the target audio, the scene label is determined based on these probability values. Signal types include speech, music, and noise. The audio signal detection model determines the probability of a signal being speech, music, or noise by statistically analyzing 200 frames of signal. Different threshold parameters α, β, and θ are set based on the model's classification bias and offline experimental results. The specific post-processing equations are as follows:

[0055]

[0056]

[0057]

[0058] Here, the three probability terms in the above equations represent the probability of a speech signal, the probability of a music signal, and the probability of a noise signal, respectively. The signal probabilities are compared frame-by-frame with the corresponding threshold parameters α, β, and θ. If the signal probability is greater than the threshold parameter, the corresponding cumulative signal value is incremented by 1. This yields the cumulative values of the speech, music, and noise signals. Based on the cumulative values of the three signals, the corresponding scene label is output:
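
The frame-by-frame threshold comparison described above can be sketched as follows. This is only an illustration, not the patent's own equations (which are not reproduced in the text): the function names are hypothetical, and pairing α with speech, β with music, and θ with noise is an assumption based on the order in which they are listed.

```python
from typing import Sequence, Tuple


def count_frames_above(probs: Sequence[float], threshold: float) -> int:
    """Count the frames whose probability is strictly greater than the threshold."""
    return sum(1 for p in probs if p > threshold)


def cumulative_counts(
    p_speech: Sequence[float],
    p_music: Sequence[float],
    p_noise: Sequence[float],
    alpha: float,
    beta: float,
    theta: float,
) -> Tuple[int, int, int]:
    """Return the cumulative speech, music, and noise counts for one block.

    Assumption: alpha, beta, and theta are the thresholds for speech, music,
    and noise respectively; the text does not state the pairing explicitly.
    """
    return (
        count_frames_above(p_speech, alpha),
        count_frames_above(p_music, beta),
        count_frames_above(p_noise, theta),
    )
```

In the embodiment each array would hold 200 per-frame probabilities; the threshold values themselves are tuned offline and are not given in the text.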

[0059]

[0060] Finally, based on the aforementioned speech information, noise information, and scene labels, the scene information of the target audio can be obtained, indicating the scene in which the target audio exists. It should be noted that a scene label is output every 200 frames of the target audio signal. Based on multiple scene labels, the audio scene in which the target audio exists can be determined. There are a total of 7 audio scenes, including speech scenes, music scenes, mixed scenes, noisy scenes corresponding to these three scenes, and pure noise scenes. Specifically, if 5 out of 10 consecutive scene labels are the same speech scene label, the target audio is considered to be in a speech scene; if the number of speech scene and music scene labels is greater than or equal to 3 out of 10 consecutive scene labels, the target audio is considered to be in a mixed scene; if there are more than 2 noise scene labels out of 10 consecutive scene labels, noise is added to the scene determination; the initial scene is set to a music scene, and in other cases, the original scene information is maintained without switching. The scene labels determine the scene in which the target audio exists, while the speech and noise information provide auxiliary reference information that can be used to correct the scene determined by the scene labels. This allows for the precise setting of appropriate parameters for subsequent signal tracking and gain estimation. Optionally, different parameters can be pre-set for different audio scenarios. Furthermore, parameter settings can differ when switching from different scenarios to the same scenario. For example, switching from a speech or music scenario to a mixed scenario requires different parameter settings to ensure the stability and reliability of volume control during scenario transitions. Additionally, parameters can be set in conjunction with corresponding speech and noise information. 
Specific parameter settings can be pre-defined according to actual signal tracking and gain estimation needs and are not fixed here.
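
One possible reading of the ten-label decision rule is sketched below. Several points are assumptions rather than statements from the text: the label strings are hypothetical, "speech and music labels greater than or equal to 3" is read as each count being at least 3, the mixed rule is checked before the speech rule (the text gives no precedence), and no explicit rule for entering a music scene is modeled because none is stated.

```python
from collections import Counter
from typing import Sequence, Tuple


def decide_scene(last_labels: Sequence[str], previous_scene: str = "music") -> Tuple[str, bool]:
    """Decide the audio scene from the 10 most recent scene labels.

    Returns (scene, noisy). The default previous_scene of "music" mirrors the
    initial scene named in the text. Label strings are hypothetical.
    """
    counts = Counter(last_labels)
    noisy = counts["noise"] > 2  # more than 2 noise labels adds noise to the decision
    if counts["speech"] >= 3 and counts["music"] >= 3:
        scene = "mixed"  # assumed precedence over the speech rule
    elif counts["speech"] >= 5:
        scene = "speech"
    else:
        scene = previous_scene  # other cases: keep the current scene
    return scene, noisy
```

For example, seven speech labels and three noise labels would give a noisy speech scene under these assumptions.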

[0061] Next, the smoothed signal of the target audio is determined by the signal tracking module. Referring to Figure 3, the signal tracking module takes 10 ms audio frames as input and first calculates the RMS signal energy. The RMS signal energy represents the magnitude of the signal; the RMS (root mean square) value, also known as the effective value, is the square root of the mean of the squared sample amplitudes and characterizes the energy level of the signal. The formula for calculating signal energy is as follows:

[0062] $$\mathrm{cur\_RMS}[n] = \sqrt{\frac{1}{k}\sum_{i=1}^{k} x_i^{2}}$$

[0063] Where k represents the number of sampling points in a frame of signal, cur_RMS[n] represents the RMS energy of the nth frame signal, and x_i is the amplitude at the i-th sampling point of that frame. The signal energy of the input signal is obtained by converting the input signal amplitude using the above signal energy calculation formula. Referring to Figure 4, the input signal amplitude in the upper part is converted to obtain the corresponding signal energy in the lower part.
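
As a minimal sketch of the per-frame RMS calculation, assuming the standard root-mean-square definition and samples normalized to [-1, 1]; the helper names and the sample rate used for framing are hypothetical:

```python
import math
from typing import List, Sequence


def split_frames(signal: Sequence[float], sample_rate: int, frame_ms: int = 10) -> List[Sequence[float]]:
    """Split a signal into consecutive non-overlapping frames (10 ms by default).

    A trailing partial frame is dropped.
    """
    frame_len = sample_rate * frame_ms // 1000
    return [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len + 1, frame_len)]


def frame_rms(samples: Sequence[float]) -> float:
    """Root mean square of one frame: sqrt of the mean of the squared samples."""
    if not samples:
        raise ValueError("frame must contain at least one sample")
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

At a 16 kHz sample rate, for instance, each 10 ms frame would contain k = 160 samples.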

[0064] Then, based on scene information, the signal energy is smoothed to obtain a smoothed signal of the target audio. Specifically, the smoothing parameters are determined according to the scene information, and the smoothing parameters and signal energy are substituted into the set smoothing formula to obtain the smoothed signal of the target audio.

[0065] Signal energy smoothing transforms instantaneous signal changes into smooth envelope-like changes. Through near-exponential smoothing, let the input signal of the current frame be x[n], and the smoothed signal of the current frame be y[n]. Based on this input signal x[n], the smoothed signal is calculated using the corresponding smoothing parameters.

[0066] Prior to this, by obtaining historical smoothing parameters, the weighting coefficients of the current smoothing parameter and historical smoothing parameters are determined based on scene information. The current smoothing parameter is then calculated based on these weighting coefficients and the historical smoothing parameters. The formula for calculating the smoothing parameter is as follows:

[0067] Where a0, a1, a2, b1, and b2 are the parameters provided based on scene information, i.e., the weight coefficients, and the remaining symbols in the formula denote the current smoothing parameters and the historical smoothing parameters. The weight coefficients differ for different audio scenarios.

[0068] Based on the above smoothing parameters, the formula for calculating the smoothed signal is as follows:

[0069]

[0070] Here, y[n-1] denotes the smoothed signal of the previous frame, and the smoothing reference for the current frame is the value obtained from y[n-1] by the smoothing calculation. If the actual value of the current frame is greater than the smoothing reference, the signal of the current frame is rising.

[0071] The smoothed signal for the current frame is:

[0072]

[0073] When the actual value of the current frame is less than the smoothing reference, the smoothing signal for the current frame is:

[0074]

[0075] The smoothed signal for the current frame is determined based on the current frame signal, the current frame smoothing reference, and the rising or falling state of the current frame signal. Different smoothing parameters produce different smoothed-signal envelopes with different emphases on accuracy and smoothness (e.g., fast signal tracking in speech scenarios, smooth signal tracking in music scenarios), thereby enabling more precise and adaptive signal control. The signal smoothing process is shown in Figure 5: the smoothed signal in the lower part is obtained by smoothing the signal energy in the upper part.
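
The patent's own smoothing formulas are not reproduced in the text, so the sketch below substitutes a common technique: a one-pole envelope follower with separate coefficients for rising and falling signals. It only illustrates the idea of switching behavior on the rising or falling state; the function name, the coefficient names, and the per-scene values are all hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-scene (attack, release) coefficients; larger means smoother.
SCENE_PARAMS: Dict[str, Tuple[float, float]] = {
    "speech": (0.3, 0.8),   # faster tracking
    "music": (0.9, 0.98),   # smoother, more stable tracking
}


def smooth_envelope(energies: Sequence[float], attack: float, release: float) -> List[float]:
    """Smooth a sequence of frame energies with an attack/release follower.

    This is a stand-in for the patent's near-exponential smoothing, not its
    actual formula. The previous output y[n-1] serves as the reference.
    """
    y_prev = 0.0
    smoothed = []
    for x in energies:
        # Rising signal uses the attack coefficient, falling uses release.
        coeff = attack if x > y_prev else release
        y_prev = coeff * y_prev + (1.0 - coeff) * x
        smoothed.append(y_prev)
    return smoothed
```

A call such as `smooth_envelope(energies, *SCENE_PARAMS["speech"])` would follow level changes more quickly than the music preset, at the cost of a less smooth envelope.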

[0076] Then, based on the above scene information, signal energy, and smoothed signal, the signal gain of the target audio is estimated. Referring to Figure 6, the signal gain calculation process includes:

[0077] S1301. Determine the first target gain value of the signal energy and the second target gain value of the smoothed signal;

[0078] S1302. Determine the influence parameters of the signal gain based on the scene information, and calculate the signal gain based on the influence parameters, the first target gain value, and the second target gain value.

[0079] Specifically, corresponding to the target audio signal energy and smoothed signal obtained above, the target gain of the target audio is determined by determining the target gain value required for the original signal energy and the smoothed signal after smoothing. Determining the first target gain value of the signal energy and the second target gain value of the smoothed signal includes: converting the signal amplitude of the signal energy into a first decibel value and converting the smoothed signal into a second decibel value; determining the first target gain value by querying a mapping relationship based on the first decibel value; and determining the second target gain value by querying a mapping relationship based on the second decibel value. The mapping relationship is pre-constructed based on different decibel values and their corresponding target gain values.

[0080] Referring to Figure 7, based on the signal energy and smoothed signal of the target audio, the signal energy (i.e., the current signal) and the smoothed signal are first converted from linear values to decibel values xdB[n]. Then, according to the set mapping relationship gain_map, the control quantity required for the current frame signal x[n] is determined, that is, the target gain value gaindB[n], i.e., gaindB[n] = gain_map(xdB[n]), where different gain_map() methods exist for different scenarios. Afterwards, gaindB[n] is converted into a linear signal gain gain[n], completing the signal gain calculation.

[0081] When converting a signal from its linear domain representation to the dB domain to obtain the corresponding decibel value, the conversion form is as follows:

[0082] XdB[n]= max( 10 * log10( X[n]) , -90)

[0083] The current signal decibel value is cur_RMS_dB = max( 10 * log10( cur_RMS[n] ), -90).

[0084] The decibel value of the smoothed signal is smooth_RMS_dB = max( 10 * log10( smooth_RMS[n] ) , -90)
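
The conversion with its -90 dB floor can be sketched as below. The function name is hypothetical, and the guard for non-positive inputs is an added assumption, since log10 is undefined there and the text does not say how silence is handled.

```python
import math


def to_db(value: float, floor_db: float = -90.0) -> float:
    """Convert a linear value to decibels as max(10 * log10(value), floor_db).

    Assumption: non-positive inputs (e.g. digital silence) map to the floor.
    """
    if value <= 0.0:
        return floor_db
    return max(10.0 * math.log10(value), floor_db)
```

Here `to_db(cur_rms)` and `to_db(smooth_rms)` would correspond to cur_RMS_dB and smooth_RMS_dB in the formulas above.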

[0085] The mapping relationship gain_map is shown in Figure 8, where the diagonal line represents the decibel value of the input signal and the curve represents the target gain value of the input signal. The coordinates [-38, -22.431] indicate that the input signal of the current frame is -38dB, and the target output signal of the algorithm should be -22.431dB. Its calculation form is: gaindB[n]=gain_map(XdB[n]).
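
A lookup of this kind could be implemented as piecewise-linear interpolation over a table of points. In the sketch below only the point (-38, -22.431) comes from the text; every other point on the curve is hypothetical, as is the choice of linear interpolation and of clamping outside the table. The function returns the mapped value as the text describes it (a target output level for a given input level); whether a further subtraction of the input level is applied to obtain a gain is not modeled here.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical curve; only (-38.0, -22.431) is taken from the text.
CURVE: List[Tuple[float, float]] = [
    (-90.0, -90.0),
    (-60.0, -60.0),
    (-38.0, -22.431),
    (0.0, -6.0),
]


def gain_map(x_db: float, curve: List[Tuple[float, float]] = CURVE) -> float:
    """Map an input level in dB to its target value by linear interpolation."""
    xs = [point[0] for point in curve]
    if x_db <= xs[0]:
        return curve[0][1]   # clamp below the table
    if x_db >= xs[-1]:
        return curve[-1][1]  # clamp above the table
    i = bisect_right(xs, x_db)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (x_db - x0) / (x1 - x0)
```

Since the text states that different scenarios use different gain_map() methods, one table per scene would be a natural extension of this sketch.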

[0086] Referring to the above mapping relationship, first calculate the target gain values gain_1[n] and gain_2[n] corresponding to cur_RMS_dB[n] and smooth_RMS_dB[n], respectively. Then, based on the signal gain influence parameters determined by the control module according to the scene information, calculate the signal gain based on the influence parameters, the first target gain value, and the second target gain value. The influence parameters are represented as make_up_gain and coff_gain, where coff_gain has a value range of [0, 1] and make_up_gain has a value range of [-3, 6]. The formula for calculating the signal gain gain[n] is as follows:

[0087] gain[n]=((1-coff_gain)*gain_1[n]+coff_gain*gain_2[n])+make_up_gain

[0088] gain_1[n]=gain_map(cur_RMS_dB[n])

[0089] gain_2[n]=gain_map(smooth_RMS_dB[n])

[0090] Based on the above formula for calculating signal gain gain[n], the signal gain can be obtained by adaptively calculating it according to scene information. Then, the gain processing module adjusts the target audio volume according to this signal gain.
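
The weighted combination and the conversion back to a linear gain can be sketched as follows. The function names are hypothetical. The range checks mirror the stated ranges of coff_gain and make_up_gain, and the dB-to-linear conversion is assumed to be the inverse of the 10 * log10 convention used in the formulas above.

```python
def blend_gain_db(gain_1: float, gain_2: float, coff_gain: float, make_up_gain: float) -> float:
    """Combine the two target gain values (in dB) and add the make-up gain.

    gain_1 comes from the current signal energy, gain_2 from the smoothed signal.
    """
    if not 0.0 <= coff_gain <= 1.0:
        raise ValueError("coff_gain must lie in [0, 1]")
    if not -3.0 <= make_up_gain <= 6.0:
        raise ValueError("make_up_gain must lie in [-3, 6]")
    return (1.0 - coff_gain) * gain_1 + coff_gain * gain_2 + make_up_gain


def db_to_linear(gain_db: float) -> float:
    """Invert dB = 10 * log10(linear); assumes the text's dB convention."""
    return 10.0 ** (gain_db / 10.0)
```

A coff_gain close to 1 weights the smoothed signal more heavily, which would suit the smoother tracking described for music scenes; a value close to 0 follows the instantaneous energy instead.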

[0091] The gain processing module adjusts the volume of the target audio based on the signal gain, including:

[0092] S1303. Determine the adjusted audio after the signal gain is applied to the target audio;

[0093] S1304. When the adjusted audio reaches the set peak clipping threshold, adjust the signal gain based on the set reference signal, and adjust the volume of the target audio using the adjusted signal gain;

[0094] S1305. If the adjusted audio does not reach the set peak clipping threshold, use the signal gain to adjust the volume of the target audio.

[0095] The gain processing module calculates the final linear gain gain_final[n] applied to the target audio based on the signal gain gain[n]. To prevent peak clipping in the output signal after gain[n] is applied to the target audio input signal x[n], this application combines the signal y_pre after volume control processing of the previous frame signal to correct gain[n] and obtain the true linear gain gain_final[n] applied to the target audio. By correcting gain[n], peak clipping is prevented after the signal gain gain[n] is applied to the input signal x[n]. Since the signal tracking and gain estimation modules perform multiple smoothing operations, there may be a situation where gain[n]*x[n]>1, so it is necessary to avoid peak clipping and correct the signal gain gain[n].

[0096] If there is no risk of peak clipping, where 1 is the peak clipping threshold:

[0097]

[0098] If there is a risk of peak clipping, that is:

[0099]

[0100]

[0101]

[0102] Based on the above correction formula, the corrected signal gain gain_final[n] is obtained. Applying gain_final[n] to the target audio signal yields the amplified signal y[n] and the reference signal y_pre used for peak clipping prevention in the next frame:

[0103] y[n] = x[n] * gain_final[n]

[0104]
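The control flow of steps S1303 to S1305 can be sketched as follows. The correction used in this application relies on the previous frame's output y_pre, and its formula is not reproduced in this text, so the sketch substitutes a plain peak-normalizing correction as a stand-in. The function name, the frame representation (a list of samples in [-1, 1]), and the correction rule are all assumptions.

```python
from typing import List, Sequence, Tuple


def apply_gain_with_clip_guard(frame: Sequence[float], gain: float,
                               threshold: float = 1.0
                               ) -> Tuple[List[float], float]:
    """Apply a linear gain, reducing it if the output peak would exceed threshold."""
    peak = max((abs(sample) for sample in frame), default=0.0)
    if peak * gain > threshold:
        # Stand-in correction (assumption): scale so the peak lands on the threshold.
        gain_final = threshold / peak
    else:
        # No clipping risk: use the estimated gain unchanged.
        gain_final = gain
    return [sample * gain_final for sample in frame], gain_final


y, g = apply_gain_with_clip_guard([0.5, -0.8], gain=2.0)  # gain reduced to about 1.25
```

A per-frame hard correction like this can cause audible gain steps, which is presumably why the application ties the correction to the previous frame's output.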

[0105] Thus, through the aforementioned signal detection, signal tracking, gain estimation, and gain processing procedures, scene-based adaptive volume control can be achieved, effectively addressing the issues of fluctuating volume, popping, and loss of sound caused by the diversity of devices, scenes, and frequent scene and live stream switching in online live streaming. By mapping the input signal to a specified dynamic range, while ensuring that larger signals have appropriate dynamic margins to protect signal amplitude, small and medium signals are appropriately amplified, and extremely small signals or background noise are eliminated. This results in stable, sufficiently loud, and dynamically variable signal volume in voice scenarios. In music scenarios, the original dynamic changes of the signal are maintained while the volume is sufficiently loud. Furthermore, noisy background sounds are not excessively amplified in all scenarios. Simultaneously, the volume difference between different live streams is small. This effectively solves the problem that single-scene technologies and strategies cannot meet the needs of online business scenarios in online live streaming.

[0106] The above describes a process where the target audio is acquired and its scene information is determined. The signal energy of the target audio is then calculated, and the signal energy is smoothed based on the scene information to obtain a smoothed signal. Subsequently, the signal gain of the target audio is estimated based on the scene information, signal energy, and the smoothed signal, and the volume of the target audio is adjusted based on this gain. By combining scene information with the smoothed signal for target audio smoothing, a smooth representation of the audio signal can be achieved for different scenes. Furthermore, by estimating the signal gain based on the smoothed signal and adjusting the audio volume, the audio signal in speech scenes becomes clearer and more stable, while the audio signal in music scenes retains its dynamic range and has a more appropriate volume, meeting the volume gain control requirements of different scenes, optimizing volume control results, and improving the user's listening experience.

[0107] Based on the above embodiments, Figure 9 is a schematic diagram of the structure of an adaptive control system for live streaming volume provided in this application. Referring to Figure 9, the adaptive control system for live streaming volume provided in this embodiment specifically includes: a signal detection module 21, a signal tracking module 22, a gain estimation module 23, and a gain processing module 24.

[0108] The signal detection module 21 is configured to acquire the target audio and determine the scene information of the target audio.

[0109] The signal tracking module 22 is configured to calculate the signal energy of the target audio and smooth the signal energy based on scene information to obtain a smoothed signal of the target audio.

[0110] The gain estimation module 23 is configured to estimate the signal gain of the target audio based on scene information, signal energy, and smoothing signal.

[0111] The gain processing module 24 is configured to adjust the volume of the target audio based on the signal gain.
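The division of work among the four modules can be pictured as a small pipeline. The sketch below is only a structural illustration: the class name VolumeController, the callable signatures, and the stub behaviors passed in at the bottom are hypothetical and stand in for the actual module implementations.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Frame = Sequence[float]
SceneInfo = Dict[str, object]


@dataclass
class VolumeController:
    detect: Callable[[Frame], SceneInfo]                      # module 21
    track: Callable[[Frame, SceneInfo], Tuple[float, float]]  # module 22
    estimate: Callable[[SceneInfo, float, float], float]      # module 23
    process: Callable[[Frame, float], List[float]]            # module 24

    def run(self, frame: Frame) -> List[float]:
        scene = self.detect(frame)                     # scene information
        energy, smoothed = self.track(frame, scene)    # energy and smoothed signal
        gain = self.estimate(scene, energy, smoothed)  # signal gain
        return self.process(frame, gain)               # volume-adjusted frame


def mean_square(frame: Frame) -> float:
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


# Stub modules, for illustration only.
controller = VolumeController(
    detect=lambda frame: {"label": "speech"},
    track=lambda frame, scene: (mean_square(frame), mean_square(frame)),
    estimate=lambda scene, energy, smoothed: 2.0,
    process=lambda frame, gain: [s * gain for s in frame],
)
adjusted = controller.run([0.1, -0.2])
```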

[0112] Specifically, determining the scene information of the target audio includes:

[0113] Speech recognition is performed on the target audio to obtain speech information and noise information;

[0114] The scene label of the target audio is determined based on the signal composition of the target audio.

[0115] The speech information, the noise information, and the scene label together are used as the scene information of the target audio.

[0116] Determining the scene label of the target audio based on the signal composition of the target audio includes:

[0117] Determine the probability value of each signal type in the target audio, and determine the scene label of the target audio based on each probability value. The signal types include speech signals, music signals and noise signals.
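A simple way to turn per-type probabilities into a scene label is to pick the most probable signal type, as sketched below. The text only says the label is determined based on each probability value, so the argmax rule, the function name scene_label, and the dictionary representation are assumptions for illustration.

```python
from typing import Dict

SIGNAL_TYPES = ("speech", "music", "noise")


def scene_label(probabilities: Dict[str, float]) -> str:
    """Return the signal type with the highest probability (assumed rule)."""
    missing = [t for t in SIGNAL_TYPES if t not in probabilities]
    if missing:
        raise ValueError(f"missing probabilities for: {missing}")
    return max(SIGNAL_TYPES, key=lambda t: probabilities[t])


label = scene_label({"speech": 0.7, "music": 0.2, "noise": 0.1})
```

A production classifier would likely add hysteresis or a minimum-confidence rule so that the label does not flip on every frame.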

[0118] Specifically, the smoothed signal of the target audio is obtained by smoothing the signal energy based on scene information, including:

[0119] Determine the smoothing parameters based on the scene information, substitute the smoothing parameters and signal energy into the set smoothing processing formula, and obtain the smoothed signal of the target audio.

[0120] Determining the smoothing parameters based on the scene information includes:

[0121] Obtain historical smoothing parameters, determine the weight coefficients of the current smoothing parameters and historical smoothing parameters based on scene information, and calculate the current smoothing parameters based on the weight coefficients and historical smoothing parameters.
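The idea of scene-dependent smoothing can be sketched with a first-order stand-in. The smoothing formula set in this application is not reproduced in this section, so the recursions below, the per-scene values in SCENE_ALPHA, and the mixing weight are all hypothetical and serve only to show how a smoothing parameter derived from its own history and the scene can drive the energy smoothing.

```python
# Hypothetical per-scene target smoothing parameters (closer to 1 = slower).
SCENE_ALPHA = {"speech": 0.90, "music": 0.98, "noise": 0.95}


def update_alpha(prev_alpha: float, scene: str, weight: float = 0.5) -> float:
    """Mix the scene's target parameter with the historical parameter."""
    return weight * SCENE_ALPHA[scene] + (1.0 - weight) * prev_alpha


def smooth_energy(prev_smooth: float, energy: float, alpha: float) -> float:
    """First-order recursive smoothing of the per-frame signal energy."""
    return alpha * prev_smooth + (1.0 - alpha) * energy


alpha, smooth = 0.95, 0.0
for energy in [0.04, 0.05, 0.03]:
    alpha = update_alpha(alpha, "speech")        # parameter tracks the scene
    smooth = smooth_energy(smooth, energy, alpha)
```

Updating the parameter gradually rather than switching it outright avoids abrupt changes in tracking speed when the scene label changes.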

[0122] Specifically, the signal gain of the target audio is estimated based on scene information, signal energy, and smoothed signal, including:

[0123] Determine the first target gain value for the signal energy and the second target gain value for the smoothed signal;

[0124] The influence parameters of the signal gain are determined based on the scene information, and the signal gain is calculated based on the influence parameters, the first target gain value, and the second target gain value.

[0125] Determining the first target gain value of the signal energy and the second target gain value of the smoothed signal includes:

[0126] The signal amplitude of the signal energy is converted into a first decibel value, and the smoothed signal is converted into a second decibel value;

[0127] The first target gain value is determined by querying the mapping relationship based on the first decibel value, and the second target gain value is determined by querying the mapping relationship based on the second decibel value. The mapping relationship is pre-constructed based on different decibel values and their corresponding target gain values.

[0128] Specifically, adjusting the volume of the target audio based on the signal gain includes:

[0129] Determine the adjusted audio after the signal gain is applied to the target audio;

[0130] When the adjusted audio reaches the set peak clipping threshold, adjust the signal gain based on the set reference signal, and adjust the volume of the target audio using the adjusted signal gain;

[0131] If the adjusted audio does not reach the set peak clipping threshold, use the signal gain to adjust the volume of the target audio.

[0132] The above describes a process where the target audio is acquired and its scene information is determined. The signal energy of the target audio is then calculated, and the signal energy is smoothed based on the scene information to obtain a smoothed signal. Subsequently, the signal gain of the target audio is estimated based on the scene information, signal energy, and the smoothed signal, and the volume of the target audio is adjusted based on this gain. By combining scene information with the smoothed signal for target audio smoothing, a smooth representation of the audio signal can be achieved for different scenes. Furthermore, by estimating the signal gain based on the smoothed signal and adjusting the audio volume, the audio signal in speech scenes becomes clearer and more stable, while the audio signal in music scenes retains its dynamic range and has a more appropriate volume, meeting the volume gain control requirements of different scenes, optimizing volume control results, and improving the user's listening experience.

[0133] The live streaming volume adaptive control system provided in this application embodiment can be configured to execute the live streaming volume adaptive control method provided in the above embodiment, and has corresponding functions and beneficial effects.

[0134] Based on the above embodiments, this application also provides an adaptive control device for live streaming volume. Referring to Figure 10, the adaptive control device for live streaming volume includes a processor 31, a memory 32, a communication module 33, an input device 34, and an output device 35. The memory 32, as a computer-readable storage medium, can be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the adaptive control method for live streaming volume described in any embodiment of this application (e.g., the signal detection module, signal tracking module, gain estimation module, and gain processing module in the adaptive control system for live streaming volume). The communication module 33 is configured to perform data transmission. The processor 31 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 32, thereby implementing the aforementioned adaptive control method for live streaming volume. The input device 34 can be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 35 may include a display screen or other display device. The aforementioned adaptive control device for live streaming volume can be configured to execute the adaptive control method for live streaming volume provided in the above embodiments, and has the corresponding functions and beneficial effects.

[0135] Based on the above embodiments, this application also provides a computer-readable storage medium storing computer-executable instructions which, when executed by a computer processor, are configured to perform an adaptive control method for live streaming volume. The storage medium can be any type of memory device or storage device. Of course, the computer-executable instructions stored on the computer-readable storage medium provided in this application are not limited to the adaptive control method for live streaming volume described above; they can also perform related operations of the adaptive control method for live streaming volume provided in any embodiment of this application.

[0136] Based on the above embodiments, this application also provides a computer program product. The technical solution of this application in essence, or the part that contributes over the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer program product is stored in a storage medium and includes several instructions that cause a computer device, a mobile terminal, or a processor therein to execute all or part of the steps of the adaptive control method for live streaming volume described in the various embodiments of this application.

Claims

1. An adaptive control method for live streaming volume, characterized in that it comprises: acquiring target audio, performing speech recognition on the target audio to obtain speech information and noise information, determining the scene label of the target audio based on the signal composition of the target audio, and using the speech information, the noise information and the scene label as the scene information of the target audio; calculating the signal energy of the target audio, determining the smoothing parameters based on the scene information, and substituting the smoothing parameters and the signal energy into a set smoothing formula to obtain the smoothed signal of the target audio; wherein determining the smoothing parameters based on the scene information includes: obtaining historical smoothing parameters, determining the weight coefficients of the current smoothing parameter and the historical smoothing parameters based on the scene information, and calculating the current smoothing parameter based on the weight coefficients and the historical smoothing parameters; the formula for calculating the current smoothing parameter is as follows: [n]=a0* [n]+a1* [n-1]+a2* [n-2]-b1* [n-1]-b2* [n-2]; [n]=a0* [n]+a1* [n-1]+a2* [n-2]-b1* [n-1]-b2* [n-2]; where a0, a1, a2, b1, and b2 are weighting coefficients provided based on the scene information, [n] and [n] represent the current smoothing parameters, and [n-1], [n-2], [n-1] and [n-2] represent the historical smoothing parameters, with different weighting coefficients for different audio scenes; the smoothing formula is as follows: where [n] and [n] represent the current smoothing parameters, and the remaining terms represent the smoothed signal of the previous frame, the signal energy, and the smoothing reference of the current frame; estimating the signal gain of the target audio based on the scene information, the signal energy, and the smoothed signal, and adjusting the volume of the target audio based on the signal gain; wherein estimating the signal gain of the target audio based on the scene information, the signal energy, and the smoothed signal includes: determining a first target gain value of the signal energy and a second target gain value of the smoothed signal, determining an influence parameter of the signal gain based on the scene information, and calculating the signal gain based on the influence parameter, the first target gain value, and the second target gain value.

2. The adaptive control method for live streaming volume according to claim 1, characterized in that determining the scene label of the target audio based on the signal composition of the target audio includes: determining the probability value of each signal type in the target audio, and determining the scene label of the target audio based on each probability value, wherein the signal types include speech signals, music signals, and noise signals.

3. The adaptive control method for live streaming volume according to claim 1, characterized in that determining the smoothing parameters based on the scene information includes: obtaining historical smoothing parameters, determining the weight coefficients of the current smoothing parameters and the historical smoothing parameters based on the scene information, and calculating the current smoothing parameters based on the weight coefficients and the historical smoothing parameters.

4. The adaptive control method for live streaming volume according to claim 1, characterized in that determining the first target gain value of the signal energy and the second target gain value of the smoothed signal includes: converting the signal amplitude of the signal energy into a first decibel value, and converting the smoothed signal into a second decibel value; and determining the first target gain value by querying the set mapping relationship based on the first decibel value, and determining the second target gain value by querying the set mapping relationship based on the second decibel value, wherein the set mapping relationship is pre-constructed based on different decibel values and their corresponding target gain values.

5. The adaptive control method for live streaming volume according to claim 1, characterized in that adjusting the volume of the target audio based on the signal gain includes: determining the adjusted audio after the signal gain is applied to the target audio; when the adjusted audio reaches the set peak clipping threshold, adjusting the signal gain based on the set reference signal, and adjusting the volume of the target audio using the adjusted signal gain; and if the adjusted audio does not reach the set peak clipping threshold, adjusting the volume of the target audio using the signal gain.

6. An adaptive control system for live streaming volume, characterized in that it comprises: a signal detection module configured to acquire target audio, perform speech recognition on the target audio to obtain speech information and noise information, determine the scene label of the target audio based on the signal composition of the target audio, and use the speech information, the noise information and the scene label as the scene information of the target audio; a signal tracking module configured to calculate the signal energy of the target audio, determine smoothing parameters based on the scene information, and substitute the smoothing parameters and the signal energy into a set smoothing formula to obtain a smoothed signal of the target audio; wherein determining the smoothing parameters based on the scene information includes: obtaining historical smoothing parameters, determining the weight coefficients of the current smoothing parameter and the historical smoothing parameters based on the scene information, and calculating the current smoothing parameter based on the weight coefficients and the historical smoothing parameters; the formula for calculating the current smoothing parameter is as follows: [n]=a0* [n]+a1* [n-1]+a2* [n-2]-b1* [n-1]-b2* [n-2]; [n]=a0* [n]+a1* [n-1]+a2* [n-2]-b1* [n-1]-b2* [n-2]; where a0, a1, a2, b1, and b2 are weighting coefficients provided based on the scene information, [n] and [n] represent the current smoothing parameters, and [n-1], [n-2], [n-1] and [n-2] represent the historical smoothing parameters, with different weighting coefficients for different audio scenes; the smoothing formula is as follows: where [n] and [n] represent the current smoothing parameters, and the remaining terms represent the smoothed signal of the previous frame, the signal energy, and the smoothing reference of the current frame; a gain estimation module configured to estimate the signal gain of the target audio based on the scene information, the signal energy, and the smoothed signal, and to adjust the volume of the target audio based on the signal gain; wherein estimating the signal gain of the target audio based on the scene information, the signal energy, and the smoothed signal includes: determining a first target gain value for the signal energy and a second target gain value for the smoothed signal, determining an influence parameter for the signal gain based on the scene information, and calculating the signal gain based on the influence parameter, the first target gain value, and the second target gain value; and a gain processing module configured to adjust the volume of the target audio based on the signal gain.

7. An adaptive control device for live streaming volume, characterized in that it comprises: a memory and one or more processors; the memory is configured to store one or more programs; and when the one or more programs are executed by the one or more processors, the one or more processors implement the adaptive control method for live streaming volume according to any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions that, when executed by a computer processor, are configured to perform the adaptive control method for live streaming volume according to any one of claims 1-5.

9. A computer program product, characterized in that the computer program product includes instructions that, when executed on a computer or processor, cause the computer or processor to perform the adaptive control method for live streaming volume according to any one of claims 1-5.