Audio generation method, computer device and computer-readable storage medium

By performing decorrelation processing and adjusting the sound image position of stereo audio, the method addresses insufficient separation in multi-channel music generation, converting stereo to surround sound efficiently and improving audio generation efficiency and quality.

CN117119369B · Active · Publication Date: 2026-06-12 · TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT MUSIC ENTERTAINMENT TECH (SHENZHEN) CO LTD
Filing Date
2023-08-07
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies, especially in the conversion of stereo audio to surround sound audio, suffer from insufficient separation, resulting in excessive correlation between channels and an inability to create a good sense of spatial surround sound, leading to poor audio quality.

Method used

By extracting the original sound source signals of multiple target sound source objects from the original audio, performing decorrelation processing, generating derived sound source signals, adjusting the gain according to the sound image position, and outputting the target audio, the number of channels is increased to improve the surround sound effect.

Benefits of technology

Even with a limited number of tracks, stereo sound sources can be quickly converted into multi-channel surround sound output through decorrelation processing and sound image repositioning, improving audio generation efficiency and quality.



Abstract

The application relates to an audio generation method, a computer device and a computer readable storage medium. The method comprises the following steps: extracting original sound source signals corresponding to a plurality of target sound source objects in original audio; obtaining derived sound source signals corresponding to the target sound source objects by performing a decorrelation process on the original sound source signals corresponding to the target sound source objects; for any target sound source object, assigning a sound image position corresponding to the target sound source object according to the original sound source signal corresponding to the target sound source object and the derived sound source signal corresponding to the target sound source object; adjusting the gain of each target sound source object based on the sound image positions corresponding to the target sound source objects, and outputting target audio corresponding to the original audio; the number of sound channels of the target audio is greater than that of the original audio. The method can quickly convert a stereo sound source into a multi-channel surround sound output, ensures accurate positioning of a sound image, and effectively improves audio generation efficiency and audio effect.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to an audio generation method, a computer device, and a computer-readable storage medium.

Background Technology

[0002] In multi-channel music generation, such as creating multi-channel surround sound audio from stereo audio, traditional audio processing methods usually separate the accompaniment and simply assign it to the left and right surround channels. Because the separation does not meet the required standard, the correlation between channels is too high, making it impossible to create a good sense of spatial surround sound and resulting in poor audio quality.

Summary of the Invention

[0003] Therefore, it is necessary to provide an audio generation method, computer device, and computer-readable storage medium that can improve the surround sound effect of audio, in order to address the above-mentioned technical problems.

[0004] Firstly, this application provides an audio generation method. The method includes:

[0005] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0006] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0007] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0008] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0009] In one embodiment, the step of obtaining derived sound source signals corresponding to each of the target sound source objects by decorrelation processing of the original sound source signals corresponding to each of the target sound source objects includes:

[0010] According to the first time delay processing method, the original sound source signal corresponding to any target sound source object is subjected to decorrelation processing to obtain the first time delay result of the original sound source signal corresponding to any target sound source object;

[0011] According to the second time delay processing method, the first time delay result is subjected to decorrelation processing to generate multiple decorrelation signals for any target sound source object, which serve as the derived sound source signals corresponding to any target sound source object.

[0012] In one embodiment, the step of performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first time delay processing method to obtain the first time delay result of the original sound source signal corresponding to the any target sound source object includes:

[0013] The original sound source signal corresponding to any target sound source object is input into an all-pass filter. Based on the impulse response information of the all-pass filter, the output signal of the original sound source signal corresponding to any target sound source object is obtained, which is used as the first time delay result.

[0014] In one embodiment, the step of performing decorrelation processing on the first time delay result according to the second time delay processing method to generate multiple decorrelation signals for any target sound source object includes:

[0015] Obtain preset sampling information; the preset sampling information includes the number of sampling points;

[0016] The first delay result is subjected to sampling delay processing based on the number of sampling points to obtain the plurality of decorrelation signals.

[0017] In one embodiment, the step of assigning the sound image position corresponding to any target sound source object based on the original sound source signal and the derived sound source signal corresponding to the target sound source object includes:

[0018] By placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object into different channel positions, sound pressure difference information is obtained.

[0019] Based on the sound pressure difference information, locate the sound image position corresponding to any target sound source object.

[0020] In one embodiment, obtaining sound pressure difference information by placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object into different channel positions includes:

[0021] Determine the signal placement method for any of the target sound source objects; the signal placement method includes the position of the first channel and the position of the second channel.

[0022] Place the original sound source signal corresponding to any target sound source object into the first channel position, and place the derived sound source signal corresponding to any target sound source object into the second channel position;

[0023] The sound pressure difference information is determined based on the sound pressure difference between the sound sources at the first channel position and the second channel position.

[0024] In one embodiment, extracting the original sound source signals corresponding to multiple target sound source objects in the original audio includes:

[0025] Obtain the original audio that contains multiple sound source objects;

[0026] The original audio is input into the sound source separation network to obtain the original sound source signal corresponding to each sound source object;

[0027] The sound source objects to be decorrelated are taken as the target sound source objects, and the original sound source signals corresponding to the multiple target sound source objects are obtained;

[0028] The method further includes:

[0029] Loudness scaling is performed on the original sound source signal and the original audio corresponding to each of the sound source objects to determine the loudness ratio information of each of the sound source objects in the original audio.
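The loudness ratio described in [0029] can be illustrated with a minimal sketch. The text does not say how loudness is measured, so this example assumes plain RMS level as a stand-in; the function names `rms` and `loudness_ratios` and the toy stem data are hypothetical.

```python
import math

def rms(samples):
    """Root-mean-square level of a mono signal given as a list of floats."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def loudness_ratios(stems, mix):
    """Return each stem's RMS level divided by the RMS level of the full mix.

    Assumption: RMS is used as the loudness measure; the source does not
    specify one.
    """
    mix_level = rms(mix)
    if mix_level == 0.0:
        return {name: 0.0 for name in stems}
    return {name: rms(signal) / mix_level for name, signal in stems.items()}

# Toy data: two hypothetical stems and their sample-wise sum as the mix.
stems = {
    "vocals": [0.5, -0.5, 0.5, -0.5],
    "guitar": [0.25, 0.25, -0.25, -0.25],
}
mix = [sum(values) for values in zip(*stems.values())]
ratios = loudness_ratios(stems, mix)
```

A perceptual loudness measure could replace `rms` without changing the structure of the calculation.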

[0030] In one embodiment, adjusting the gain of each target sound source object based on the sound image position corresponding to each target sound source object, and outputting the target audio corresponding to the original audio, includes:

[0031] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted to obtain the gain allocation information of each of the target sound source objects, and the gain configuration information of sound source objects other than the target sound source objects is obtained.

[0032] By combining the gain allocation information of each target sound source object, the gain configuration information of sound source objects other than the target sound source objects, and the loudness ratio information of each sound source object in the original audio, the converted audio corresponding to the original audio is synthesized.

[0033] The converted audio is rendered according to preset standardized processing information to obtain the target audio.

[0034] Secondly, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to perform the following steps:

[0035] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0036] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0037] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0038] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0039] Thirdly, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program thereon, which, when executed by a processor, performs the following steps:

[0040] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0041] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0042] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0043] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0044] Fourthly, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, performs the following steps:

[0045] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0046] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0047] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0048] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0049] In the above-described audio generation method, computer device, and computer-readable storage medium, original sound source signals corresponding to multiple target sound source objects are extracted from the original audio. Decorrelation processing is then performed on the original sound source signals corresponding to each target sound source object to obtain derived sound source signals corresponding to each target sound source object. For any target sound source object, a sound image position corresponding to that target sound source object is assigned based on both the original and derived sound source signals. Furthermore, based on the sound image positions of the target sound source objects, the gain of each target sound source object is adjusted to output the target audio corresponding to the original audio, and the number of channels of the target audio is greater than the number of channels of the original audio. In this way, a large number of extended musical elements can be obtained through decorrelation processing even when only a small number of tracks is available. The redistribution of sound image positions ensures accurate sound image positioning, enabling rapid conversion of stereo sound sources into multi-channel surround sound output and effectively improving audio generation efficiency and audio quality.

Attached Figure Description

[0050] Figure 1 is a flowchart illustrating an audio generation method in one embodiment;

[0051] Figure 2 is a schematic diagram of a signal placement method in one embodiment;

[0052] Figure 3 is a schematic diagram of a type of azimuth modulation in one embodiment;

[0053] Figure 4 is a flowchart illustrating another audio generation method in one embodiment;

[0054] Figure 5 is a structural block diagram of an audio generation device in one embodiment;

[0055] Figure 6 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0056] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0057] In one embodiment, as shown in Figure 1, an audio generation method is provided. This embodiment is described by taking the application of the method to a terminal as an example. It is understood that the method can also be applied to a server, or to a system including both a terminal and a server, and can be implemented through the interaction between the terminal and the server. In this embodiment, the method includes the following steps.

[0058] In step S101, the original sound source signals corresponding to multiple target sound source objects in the original audio are extracted;

[0059] As an example, the original audio can be stereo audio, which may correspond to two channels, such as the left channel and the right channel.

[0060] The target sound source object can be a sound source object to be decorrelated. Based on the sound source characteristics of the sound source object, it can be determined whether the sound source object needs to be further decorrelated and the sound image position redistributed. For example, the sound source object can include musical instruments such as guitar, piano, and bass, as well as human voices.

[0061] In practical applications, the original audio containing multiple sound source objects can be obtained. Then, the original audio can be input into the sound source separation network to obtain the original sound source signal corresponding to each sound source object. Then, the sound source object to be decorrelated can be used as the target sound source object to obtain the original sound source signal corresponding to multiple target sound source objects for further decorrelation processing and sound image position redistribution.

[0062] In one example, stereo audio (i.e., the original audio) can be input into a sound source separation network, which can separate the sound source signals (i.e., the original sound source signals) of multiple sound source objects, such as guitar, piano, bass, and human voice. For each sound source object, multiple sound source signals can be separated, that is, the original sound source signals can include multiple signals, such as the sound source signal of the sound source object in the left channel and the sound source signal of the sound source object in the right channel.

[0063] In another example, taking the separation of guitar, piano, bass, and vocals as an example, since the bass is a low-frequency instrument, it can be determined that it does not need to undergo subsequent decorrelation processing and sound image position redistribution. Therefore, the sound source objects that need to be decorrelated, such as guitar, piano, and vocals, can be used as target sound source objects. The sound source objects to be decorrelated can also be determined according to other sound source characteristics or audio production requirements; no specific restrictions are made in this embodiment.
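The selection in the example above can be sketched as a simple filter over the separated stems. The function name `select_target_objects`, the `skip` parameter, and the toy stem data are hypothetical; the only rule taken from the text is that the bass is left out of decorrelation.

```python
def select_target_objects(stems, skip=("bass",)):
    """Split separated stems into those to decorrelate and those left as-is.

    Assumption: stems are keyed by a sound source name, and the names in
    `skip` are the low-frequency sources that bypass decorrelation.
    """
    targets = {name: sig for name, sig in stems.items() if name not in skip}
    others = {name: sig for name, sig in stems.items() if name in skip}
    return targets, others

# Toy stems standing in for the output of a sound source separation network.
stems = {"guitar": [0.1], "piano": [0.2], "bass": [0.3], "vocals": [0.4]}
targets, others = select_target_objects(stems)
```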

[0064] In step S102, the derived sound source signals corresponding to each target sound source object are obtained by decorrelation processing of the original sound source signals corresponding to each target sound source object.

[0065] In practical implementation, the original sound source signals corresponding to each extracted target sound source object can be decorrelated. For example, an all-pass filter can be combined with a delay operation to derive multiple decorrelated signals, that is, to obtain the derived sound source signals corresponding to each target sound source object.

[0066] In an optional embodiment, taking stereo audio as an example, since the original sound source signal in the left channel and the original sound source signal in the right channel can be separated for each sound source object, the original sound source signal in the left channel can be used for decorrelation processing to obtain the corresponding derived sound source signal in the left direction range, and the original sound source signal in the right channel can be used for decorrelation processing to obtain the corresponding derived sound source signal in the right direction range.

[0067] In step S103, for any target sound source object, the sound image position corresponding to any target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to any target sound source object.

[0068] After obtaining the derived sound source signal of any target sound source object, the original sound source signal obtained by separation and the derived sound source signal obtained by decorrelation processing can be used for directional modulation. Specifically, taking a guitar as an example, the original sound source signal of the guitar and the derived sound source signal of the guitar can be placed at different locations of the listening position. Then, the sound image position of the guitar relative to the listening position can be located by the sound pressure difference of the sound source at the placement position.

[0069] For example, taking stereo audio as the original audio source, the original sound source signal of the guitar can be placed in front of the listening position, and the derived sound source signal of the guitar can be placed behind the listening position. Then, by using the sound pressure difference between the front and rear sound sources, the position of the guitar behind the listening position, i.e., the sound image position, can be determined.

[0070] In step S104, based on the sound image position corresponding to each target sound source object, the gain of each target sound source object is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0071] As an example, the target audio can be surround sound audio, which can correspond to six or eight channels, such as 5.1 channel surround sound or 7.1 channel surround sound.

[0072] After obtaining the sound image position corresponding to each target sound source object, the gain of each target sound source object can be adjusted to create a surround sound effect. Then, the surround sound can be rendered to output the target audio, such as surround sound audio.
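As a rough illustration of the front/rear placement described in [0069] combined with the gain adjustment above, the sketch below routes an original stereo stem to the front channels and its derived signal to the rear channels of a six-channel layout. The channel labels, their order, the function name `place_front_rear`, and the gain values are all assumptions made for illustration; the text does not specify them.

```python
# Assumed channel labels for a 5.1 layout; the order is illustrative only.
CHANNELS_5_1 = ("L", "R", "C", "LFE", "Ls", "Rs")

def place_front_rear(original_lr, derived_lr, front_gain=1.0, rear_gain=0.5):
    """Route an original (left, right) pair to the front channels and a
    derived (left, right) pair to the rear channels, scaling each by a gain.

    All four input lists are assumed to have the same length.
    """
    left, right = original_lr
    d_left, d_right = derived_lr
    n = len(left)
    layout = {name: [0.0] * n for name in CHANNELS_5_1}
    layout["L"] = [front_gain * s for s in left]
    layout["R"] = [front_gain * s for s in right]
    layout["Ls"] = [rear_gain * s for s in d_left]
    layout["Rs"] = [rear_gain * s for s in d_right]
    return layout

# Toy guitar stem and a toy derived signal (two samples per channel).
guitar = ([0.2, 0.4], [0.1, 0.3])
guitar_derived = ([0.0, 0.2], [0.0, 0.1])
surround = place_front_rear(guitar, guitar_derived)
```

Changing the ratio between `front_gain` and `rear_gain` changes the level difference between the front and rear copies, which is the quantity the text uses to position the sound image.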

[0073] In one example, immersive sound (surround sound), also known as spatial audio, not only offers more possibilities for music streaming but also greatly enhances the user's auditory experience. Since most audio sources in the music market are recorded and produced in stereo, songs that have already been released as stereo masters cannot be quickly converted into spatial audio. Furthermore, creating immersive sound requires an immersive listening environment, such as a 5.1 surround sound system, making the production cost extremely high. Therefore, the audio generation method of this application can quickly convert a large number of existing stereo sound sources into corresponding multi-channel surround sound versions, thus adapting to different playback scenarios according to different surround sound output requirements.

[0074] For example, the audio generation method of this application can be based on deep learning for sound source separation, realizing the conversion of stereo sound sources into 5.1 channels. By using deep neural networks to extract different music signals in the left and right channels, such as including but not limited to human voices, drums, bass, piano, guitar and other sound sources, the separated signals can be processed by signal decorrelation, thereby automatically generating audio encoding formats that conform to 5.1 and 7.1 channel surround sound.

[0075] Traditional methods can only separate the accompaniment in a simple way, so the separation is low and the correlation between channels is excessive; they also cannot redistribute sound image positions, which leads to inaccurate sound image positioning and prevents a good surround sound effect. In contrast, the technical solution of this embodiment uses deep learning to separate stereo audio into independent vocal tracks and multiple independent instrument tracks, and can formulate the allocation of surround sound channels on this basis, providing a better multi-channel music generation solution. Building on deep learning-based sound source separation, a large number of extended musical elements can be obtained from a small number of tracks through decorrelation processing, and the surround sound effect of positioned instruments can be achieved through the sound pressure difference in directional modulation. This allows stereo-recorded sound sources to be converted rapidly into multi-channel surround sound output, which helps broaden the application range of music libraries, for example to in-vehicle surround sound systems and panoramic sound support for live streaming.

[0076] In the above audio generation method, the original sound source signals corresponding to multiple target sound source objects are extracted from the original audio. Then, the original sound source signals corresponding to each target sound source object are decorrelated to obtain the derived sound source signals corresponding to each target sound source object. For any target sound source object, the sound image position corresponding to any target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to any target sound source object. Then, based on the sound image position corresponding to each target sound source object, the gain of each target sound source object is adjusted to output the target audio corresponding to the original audio. This method achieves the production of a large number of extended music elements through decorrelation processing when the number of tracks is small. By redistributing the sound image positions, accurate sound image positioning is ensured. This method can quickly convert stereo sound sources into multi-channel surround sound output, effectively improving audio generation efficiency and audio effect.

[0077] In one embodiment, step S102, by performing decorrelation processing on the original sound source signals corresponding to each target sound source object to obtain the derived sound source signals corresponding to each target sound source object, may include the following steps:

[0078] According to the first time delay processing method, the original sound source signal corresponding to any target sound source object is subjected to decorrelation processing to obtain the first time delay result of the original sound source signal corresponding to any target sound source object; according to the second time delay processing method, the first time delay result is subjected to decorrelation processing to generate multiple decorrelation signals for any target sound source object, which serve as the derived sound source signal corresponding to any target sound source object.

[0079] In one example, to address the problem that too few sound sources cannot create a good surround sound effect, the original sound source signal corresponding to any target sound source object is decorrelated. For example, by combining an all-pass filter (i.e., the first time delay processing method) with a delay operation (i.e., the second time delay processing method), multiple decorrelated signals can be derived. Thus, instead of performing directional processing on each type of instrument separately, the signal is copied and decorrelated, and the copies are given different placement settings to create a surround sound effect. At the same time, the main sound image is created by adjusting the gain.

[0080] In this embodiment, by performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first time delay processing method, a first time delay result of the original sound source signal corresponding to any target sound source object is obtained. Then, by performing decorrelation processing on the first time delay result according to the second time delay processing method, multiple decorrelation signals for any target sound source object are generated as derived sound source signals corresponding to any target sound source object. This allows for the generation of a large number of extended music elements through decorrelation processing when the number of tracks is relatively small, providing rich materials for audio production.

[0081] In one embodiment, performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first time delay processing method to obtain the first time delay result of the original sound source signal corresponding to any target sound source object may include the following steps:

[0082] The original sound source signal corresponding to any target sound source object is input into an all-pass filter. Based on the impulse response information of the all-pass filter, the output signal of the original sound source signal corresponding to any target sound source object is obtained, which is used as the first time delay result.

[0083] In practical applications, taking a guitar as an example, the original sound source signal of the guitar can be input into an all-pass filter. Without changing the amplitude response, the phase changes to different degrees with the frequency. That is, the group delay characteristic of the all-pass filter can be used to cause the waveform envelope of different frequency components to be delayed in the time domain, thereby achieving the purpose of decorrelation.

[0084] For example, the transfer function of a first-order all-pass filter can be expressed as follows:

[0085]

[0086] Where w is the normalized cutoff frequency, ranging from [0, 1].
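For orientation only, one common textbook parameterization of a first-order all-pass filter in terms of a normalized cutoff frequency is sketched below. The coefficient symbol c and the mapping from w to c are assumptions and may differ from the expression in the original filing.

```latex
H(z) = \frac{c + z^{-1}}{1 + c\,z^{-1}}, \qquad
c = \frac{\tan(\pi w / 2) - 1}{\tan(\pi w / 2) + 1}, \qquad 0 < w < 1
```

In this form the magnitude response is 1 at every frequency, and only the phase (and hence the group delay) varies with frequency.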

[0087] In one example, taking human voice as an example, the signal of human voice (i.e., the original sound source signal) is x[n], the impulse response of the all-pass filter (i.e., the impulse response information) is h[n], and the output signal after passing through the all-pass filter is y[n]. The following relationship can be obtained:

[0088] y[n] = x[n] * h[n]

[0089] In this embodiment, by inputting the original sound source signal corresponding to any target sound source object into an all-pass filter, and obtaining the output signal of the original sound source signal corresponding to any target sound source object based on the impulse response information of the all-pass filter, the signal decorrelation can be achieved based on the group delay characteristics of the all-pass filter.
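A minimal sketch of the all-pass stage follows. It assumes a common textbook first-order form, H(z) = (c + z^-1) / (1 + c z^-1), with c derived from a normalized cutoff w; the function names and the choice w = 0.3 are hypothetical and not taken from the text. The impulse response h computed at the end plays the role of h[n] in y[n] = x[n] * h[n].

```python
import math

def allpass_coefficient(w):
    """Map a normalized cutoff w in (0, 1) to a first-order all-pass
    coefficient (assumed textbook mapping, not taken from the source)."""
    t = math.tan(math.pi * w / 2.0)
    return (t - 1.0) / (t + 1.0)

def first_order_allpass(x, w):
    """Filter the list x with y[n] = c*x[n] + x[n-1] - c*y[n-1]."""
    c = allpass_coefficient(w)
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for sample in x:
        out = c * sample + x_prev - c * y_prev
        y.append(out)
        x_prev, y_prev = sample, out
    return y

# Impulse response of the filter: the input is a unit impulse.
impulse = [1.0] + [0.0] * 255
h = first_order_allpass(impulse, w=0.3)

# An all-pass filter leaves signal energy unchanged, so this is close to 1.
energy = sum(v * v for v in h)
```

Because the filter only reshapes phase, the output keeps the same spectral balance as the input while its waveform differs, which is what makes it usable for decorrelation.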

[0090] In one embodiment, decorrelation processing is performed on the first time delay result according to the second time delay processing method to generate multiple decorrelation signals for any target sound source object, which may include the following steps:

[0091] Obtain preset sampling information; the preset sampling information includes the number of sampling points; perform sampling delay processing on the first time delay result according to the number of sampling points to obtain multiple decorrelation signals.

[0092] In practical implementation, due to the precedence effect, when two sounds arrive within 35 milliseconds of each other, the listener cannot distinguish them as two different sound sources. Utilizing this characteristic, the signal y[n] (i.e., the first time delay result) output by the all-pass filter can be delayed by n sampling points (i.e., the number of sampling points; here n denotes the delay length rather than the time index of y[n]). For example, at an audio sampling rate of 44100 Hz, n < 44100 × 0.035 ≈ 1543. This further enhances the decorrelation effect while maintaining consistent timbre. After completing the decorrelation operation combining the all-pass filter and the delay, multiple (e.g., 3-6) uncorrelated signals can be derived from the same type of instrument signal, i.e., multiple decorrelation signals for any target sound source object.

[0093] In this embodiment, by acquiring preset sampling information and then performing sampling delay processing on the first time delay result according to the number of sampling points, multiple decorrelation signals are obtained. A large number of extended music elements can be obtained through decorrelation processing, providing data support for multi-channel music generation.
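
The sampling delay step can be sketched as follows. This is a minimal illustration, assuming that each derived signal uses a different delay length; the application does not state how the multiple decorrelation signals differ from one another, and all function names are hypothetical. Integer arithmetic is used for the 35 ms bound to avoid floating-point rounding.

```python
def max_delay_samples(sample_rate_hz, window_ms=35):
    """Largest integer n with n < sample_rate_hz * window_ms / 1000."""
    return (sample_rate_hz * window_ms - 1) // 1000


def delay_signal(signal, n):
    """Delay a signal by n samples by prepending n zeros."""
    if n < 0:
        raise ValueError("delay must be non-negative")
    return [0.0] * n + list(signal)


def derive_signals(first_result, delays, sample_rate_hz=44100):
    """Produce one delayed copy of the first time delay result per delay.

    Each delay must stay inside the 35 ms window (an assumption drawn from
    the precedence-effect bound given in the text).
    """
    limit = max_delay_samples(sample_rate_hz)
    for n in delays:
        if not 0 < n <= limit:
            raise ValueError(f"delay {n} outside (0, {limit}]")
    return [delay_signal(first_result, n) for n in delays]


# Two derived signals with delays of 1 and 3 samples.
derived = derive_signals([1.0, -1.0], [1, 3])
```

At 44100 Hz the bound evaluates to 1543 samples, so any delay of 1 to 1543 samples is accepted by this sketch.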

[0094] In one embodiment, step S103, for any target sound source object, assigning the sound image position corresponding to any target sound source object based on the original sound source signal and the derived sound source signal corresponding to any target sound source object, may include the following steps:

[0095] By placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object into different channel positions, sound pressure difference information is obtained; based on the sound pressure difference information, the sound image position corresponding to any target sound source object is located.

[0096] In one example, the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object can be placed at different locations around the listening position, i.e., at different channel positions. Figure 2 shows an example of such a placement method. The sound pressure difference of the sound sources at the placement locations (i.e., the sound pressure difference information), such as the sound pressure difference between the front and rear sound sources in Figure 2, can then be used to locate the sound image position of any target sound source object relative to the listening position.

[0097] In this embodiment, by placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object into different channel positions, sound pressure difference information is obtained. Then, based on the sound pressure difference information, the sound image position corresponding to any target sound source object is located, which can realize the redistribution of sound image position and improve the accuracy of sound image positioning.

[0098] In one embodiment, obtaining sound pressure difference information by placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object into different channel positions may include the following steps:

[0099] Determine the signal placement method for any target sound source object; the signal placement method includes the first channel position and the second channel position; place the original sound source signal corresponding to any target sound source object in the first channel position, and place the derived sound source signal corresponding to any target sound source object in the second channel position; determine the sound pressure difference information based on the sound pressure difference between the first channel position and the second channel position.

[0100] In practical applications, after deriving multiple decorrelation signals from the same type of musical instrument signal, static azimuth modulation can be performed based on the original signal and the derived signals. Taking a guitar and a 5.1-channel layout as an example, as shown in Figure 2, the original guitar left and right channel signals can be defined as G_l and G_r, and the guitar-derived signals obtained through the decorrelation operation can be defined as G_(new-l) and G_(new-r). The original guitar signals (i.e., the original sound source signals) and the derived guitar signals (i.e., the derived sound source signals) can be placed in front of and behind the listening position, respectively (i.e., at the first channel position and the second channel position). This ensures that direct sound of the same element reaches the listening position from all directions. The amplitudes of the four signals can then be adjusted, and the guitar can be positioned, for example, behind the listening position by means of the sound pressure difference between the front and rear sound sources.

[0101] In this embodiment, by determining the signal placement method for any target sound source object, the original sound source signal corresponding to any target sound source object is placed in the first channel position, and the derived sound source signal corresponding to any target sound source object is placed in the second channel position. Then, based on the sound pressure difference between the sound sources in the first channel position and the second channel position, the sound pressure difference information is determined, thereby realizing the redistribution of the sound image position, ensuring accurate positioning of the sound image, and effectively improving the audio effect.
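
The front/rear placement can be illustrated with a minimal sketch. The application does not define how the sound pressure difference is measured, so the RMS level ratio in decibels used here is an assumption, as are the function names and gain values.

```python
import math


def rms(signal):
    """Root-mean-square level of a finite signal."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def place_front_rear(original, derived, front_gain, rear_gain):
    """Put the original signal at the front (first) channel position and the
    derived signal at the rear (second) channel position, each with a gain."""
    front = [front_gain * s for s in original]
    rear = [rear_gain * s for s in derived]
    return front, rear


def level_difference_db(front, rear):
    """Rear level relative to front level in dB (assumed stand-in for the
    sound pressure difference information)."""
    return 20.0 * math.log10(rms(rear) / rms(front))


original = [0.5, -0.5, 0.5, -0.5]
derived = [-0.5, 0.5, 0.5, -0.5]  # same level, different waveform
front, rear = place_front_rear(original, derived, front_gain=0.5, rear_gain=1.0)
difference = level_difference_db(front, rear)
```

Under this assumption, a positive difference means the rear signal is stronger, which would pull the perceived sound image toward the rear of the listening position.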

[0102] In one embodiment, step S101, extracting the original sound source signals corresponding to multiple target sound source objects in the original audio, may include the following steps:

[0103] The process involves: acquiring the original audio containing multiple sound source objects; inputting the original audio into a sound source separation network to obtain the original sound source signal corresponding to each sound source object; and using the sound source object to be decorrelated as the target sound source object to obtain the original sound source signal corresponding to multiple target sound source objects.

[0104] The sound source separation network can be a neural network, which can be based on a neural network structure to separate different musical elements; other model structures that can achieve the same separation effect can also be used.

[0105] In one example, taking stereo audio as the original audio, the stereo audio can be input into a sound source separation network. This sound source separation network can separate the sound source signals (i.e., the original sound source signals) of multiple sound source objects, such as guitar, piano, bass, and human voice. Then, the sound source objects that need to be decorrelated can be identified as target sound source objects, and further decorrelation and azimuth modulation processing can be performed based on the original sound source signals corresponding to the target sound source objects.

[0106] The method further includes:

[0107] Loudness scaling is performed on the original sound source signal and original audio corresponding to each sound source object to determine the loudness ratio information of each sound source object in the original audio.

[0108] In a specific implementation, taking stereo audio as the original audio, after the sound source signals of multiple sound source objects are separated, each instrument signal (i.e., the original sound source signal) can be loudness-scaled against the original stereo audio so as to retain the loudness ratio (volume ratio) of each signal in the original work, i.e., the loudness ratio information, which can be used as a basis for restoration in the subsequent panoramic sound synthesis.

[0109] In this embodiment, by acquiring the original audio containing multiple sound source objects and then inputting the original audio into the sound source separation network, the original sound source signals corresponding to each sound source object are obtained. Then, the sound source object to be decorrelated is taken as the target sound source object, and the original sound source signals corresponding to multiple target sound source objects are obtained. The stereo can be separated into independent human voice tracks and multiple independent instrument tracks using deep learning methods, thus achieving effective sound source separation.
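
The loudness ratio information can be sketched as a per-stem level relative to the original mix. The application does not name a loudness measure, so plain RMS is used here as a simple stand-in (a production system might use a perceptual loudness measure instead); the function and stem names are hypothetical.

```python
import math


def rms(signal):
    """Root-mean-square level of a finite signal."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def loudness_ratios(stems, mix):
    """Ratio of each separated stem's RMS level to that of the original audio."""
    mix_level = rms(mix)
    if mix_level == 0.0:
        raise ValueError("original audio is silent")
    return {name: rms(signal) / mix_level for name, signal in stems.items()}


stems = {
    "vocal": [0.4, -0.4, 0.4, -0.4],
    "guitar": [0.1, -0.1, 0.1, -0.1],
}
# Toy mix: the sample-wise sum of the stems.
mix = [sum(samples) for samples in zip(*stems.values())]
ratios = loudness_ratios(stems, mix)
```

The resulting ratios can be stored alongside each stem and reapplied when the multi-channel output is synthesized, so that the balance of the original work is preserved.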

[0110] In one embodiment, step S104, adjusting the gain of each target sound source object based on the sound image position corresponding to each target sound source object, and outputting the target audio corresponding to the original audio, may include the following steps:

[0111] Based on the sound image position corresponding to each target sound source object, the gain of each target sound source object is adjusted to obtain the gain allocation information of each target sound source object, and the gain configuration information of sound source objects other than the target sound source objects is obtained; combined with the gain allocation information of each target sound source object, the gain configuration information of sound source objects other than the target sound source objects, and the loudness ratio information of each sound source object in the original audio, the converted audio corresponding to the original audio is synthesized; the converted audio is rendered according to the preset standardization processing information to obtain the target audio.

[0112] In one example, taking 5.1 channels as an example, Figure 3 shows the directional modulation of multiple instrument signals and four reverb signals, where the thickness of each line indicates the send level of the instrument in the surround sound field. For example, the vocal signal can be sent directly to the center channel, while the instrument signals are evenly distributed across the five channels at different gain levels (e.g., the center, front left, front right, left surround, and right surround channels in Figure 3). In addition, different instruments can each select a different channel to which a gain exceeding the average distribution is sent, thereby determining the main positioning of the sound image. This selection can involve multiple channels and depends on the direction; for example, if the sound image is located on the right, the signal can be sent through the right front channel and the right rear channel. The channel at the position of a thick line in Figure 3 serves a locating function for the corresponding instrument. For example, in Figure 3 the guitar, piano, drums, vocals, and other elements each have their corresponding thick lines.
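
The idea of an even baseline distribution plus an above-average send to the positioning channels can be sketched as follows. The 6 dB boost, the constant-power normalization, and the channel labels are assumptions made for illustration; the application does not specify gain values.

```python
import math

# The five full-range channels of a 5.1 layout (labels are illustrative).
CHANNELS = ["L", "R", "C", "Ls", "Rs"]


def allocate_gains(main_channels, boost_db=6.0, channels=CHANNELS):
    """Give every channel the same baseline gain, boost the channels that
    position the sound image, then scale so the squared gains sum to 1."""
    boost = 10.0 ** (boost_db / 20.0)
    raw = {ch: (boost if ch in main_channels else 1.0) for ch in channels}
    norm = math.sqrt(sum(g * g for g in raw.values()))
    return {ch: g / norm for ch, g in raw.items()}


# A sound image on the right: boost the right front and right rear channels.
gains = allocate_gains({"R", "Rs"})
```

With this sketch, an instrument placed on the right still reaches every channel, but the right front and right surround channels dominate and therefore determine where the image is heard.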

[0113] In one alternative embodiment, the bass, as a low-frequency instrument, can be evenly distributed across the channels other than the LFE (Low-Frequency Effects) channel. Since the LFE channel primarily enhances low frequencies, the bass can also be sent to the LFE channel with an above-average gain. The LFE channel may further include the signal obtained by linearly summing all signals except the bass and then applying a low-pass filter with a cutoff frequency of 120 Hz.
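
A minimal sketch of such an LFE feed is given below. The application does not specify the low-pass filter design, so a simple one-pole low-pass is swapped in here purely for illustration (its -3 dB point is only approximately at the requested cutoff); `bass_gain` and the stem names are hypothetical.

```python
import math


def one_pole_lowpass(signal, cutoff_hz, sample_rate_hz):
    """One-pole low-pass: y[n] = (1 - a)*x[n] + a*y[n-1]."""
    a = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    y, prev = [], 0.0
    for s in signal:
        prev = (1.0 - a) * s + a * prev
        y.append(prev)
    return y


def lfe_feed(stems, bass_name="bass", bass_gain=1.5,
             cutoff_hz=120.0, sample_rate_hz=44100):
    """Bass at an above-average gain plus the low-passed linear sum of all
    other stems. All stems are assumed to have the same length."""
    length = len(stems[bass_name])
    others = [
        sum(sig[i] for name, sig in stems.items() if name != bass_name)
        for i in range(length)
    ]
    low = one_pole_lowpass(others, cutoff_hz, sample_rate_hz)
    return [bass_gain * b + l for b, l in zip(stems[bass_name], low)]


feed = lfe_feed({"bass": [1.0] * 10, "guitar": [0.0] * 10})
```

A steeper filter would normally be preferred for a real LFE channel; the one-pole section only demonstrates the signal routing.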

[0114] In yet another example, taking 5.1 channels as an example, as shown in Figure 3, to reproduce surround sound more realistically, the reverberation signals can be directionally modulated toward the left and right surround directions to simulate the surround reverberation produced after the sound source is reflected by the room. The directional modulation can use HOA (Higher Order Ambisonics), and the result can then be decoded into a six-channel signal.

[0115] In practical applications, for surround sound rendering output, a compression module can be added before the final signal output to perform standardized protection processing. This involves rendering the converted audio according to preset standardized processing information, thereby preventing clipping distortion and signal distortion caused by signal overload. This allows for the rapid conversion of stereo signals into multi-channel surround sound signals, adapting to different playback scenarios based on output requirements.
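
The protection step can be illustrated with a deliberately simple stand-in. Instead of a dynamic compressor, the sketch below applies one static scale factor to all channels whenever the peak exceeds a ceiling; the ceiling value and function name are assumptions, and a real compression module would act on the signal over time.

```python
def protect_from_clipping(channels, ceiling=0.98):
    """Scale every channel by one common factor if any sample exceeds the
    ceiling, so the balance between channels is preserved."""
    peak = max((abs(s) for ch in channels for s in ch), default=0.0)
    if peak <= ceiling:
        return [list(ch) for ch in channels]
    scale = ceiling / peak
    return [[s * scale for s in ch] for ch in channels]


# Two channels; the first one overloads at -2.0.
protected = protect_from_clipping([[0.5, -2.0], [1.0, 0.25]])
```

Using a single factor for all channels keeps the relative gains assigned during sound image positioning intact.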

[0116] For example, in a 5.1 car audio playback scenario, the stereo sound source can be output as a six-channel signal, which can be transmitted to the 5.1 speaker array for playback through a sound card-type playback device, thus achieving a better surround sound effect. For 7.1 and other surround sound formats, it can also make the music library better adapted to the playback device, thus expanding the application scenarios of streaming media.

[0117] The equalizers, reverb units, compressors, and other effects units involved in the technical solution of this embodiment are not limited to a specific algorithm, as long as they have similar functions; the directional modulation and channel allocation mechanisms are not limited to methods such as VBAP and HOA.

[0118] In this embodiment, the gain of each target sound source object is adjusted based on the sound image position corresponding to each target sound source object to obtain the gain allocation information of each target sound source object. The gain configuration information of sound source objects other than the target sound source objects is also obtained. Then, the gain allocation information of each target sound source object, the gain configuration information of sound source objects other than the target sound source objects, and the loudness ratio information of each sound source object in the original audio are combined to synthesize the converted audio corresponding to the original audio. Then, the converted audio is rendered according to the preset standardized processing information to obtain the target audio. This realizes the rapid conversion of stereo sound source into multi-channel surround sound output, effectively improving audio generation efficiency and audio effect.

[0119] In one embodiment, as shown in Figure 4, a flowchart of another audio generation method is provided.

[0120] In this embodiment, the method includes the following steps:

[0121] In step 401, the original sound source signals corresponding to multiple target sound source objects are extracted from the original audio. In step 402, the original sound source signal corresponding to any target sound source object is input to an all-pass filter, and based on the impulse response information of the all-pass filter, the output signal of the original sound source signal corresponding to any target sound source object is obtained as the first time delay result. In step 403, preset sampling information is obtained, and the first time delay result is subjected to sampling delay processing according to the number of sampling points to obtain multiple decorrelation signals, which are used as the derived sound source signals corresponding to any target sound source object. In step 404, the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object are placed in different channel positions to obtain sound pressure difference information. In step 405, the sound image position corresponding to any target sound source object is located based on the sound pressure difference information. In step 406, based on the sound image position corresponding to each target sound source object, the gain of each target sound source object is adjusted to obtain the gain allocation information of each target sound source object, and the gain configuration information of sound source objects other than the target sound source objects is obtained. In step 407, the converted audio corresponding to the original audio is synthesized by combining the gain allocation information of each target sound source object, the gain configuration information of sound source objects other than the target sound source objects, and the loudness ratio information of each sound source object in the original audio. In step 408, the converted audio is rendered according to preset standardized processing information to obtain the target audio. It should be noted that, for the specific limitations of the above steps, reference may be made to the limitations of the audio generation method described above, which will not be repeated here.
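
Steps 402 to 406 can be sketched end to end for a single, already separated stereo stem. This is only an illustration under stated assumptions: the all-pass coefficient `c`, the 441-sample delay (10 ms at 44100 Hz), the front and rear gains, and the channel labels are all hypothetical, and source separation (step 401), synthesis with loudness ratios (step 407), and rendering (step 408) are omitted.

```python
def allpass(x, c):
    """First-order all-pass section: y[n] = c*x[n] + x[n-1] - c*y[n-1]."""
    y, x_prev, y_prev = [], 0.0, 0.0
    for s in x:
        out = c * s + x_prev - c * y_prev
        y.append(out)
        x_prev, y_prev = s, out
    return y


def delay(x, n):
    """Delay by n samples by prepending zeros."""
    return [0.0] * n + list(x)


def scaled(x, gain, length):
    """Apply a gain and zero-pad to the given length."""
    return [gain * s for s in x] + [0.0] * (length - len(x))


def upmix_stem(left, right, c=-0.5, delay_n=441,
               front_gain=0.5, rear_gain=1.0):
    """Return six channel signals for one stem: originals in front,
    decorrelated copies at the rear, center and LFE left silent."""
    derived_l = delay(allpass(left, c), delay_n)   # steps 402-403
    derived_r = delay(allpass(right, c), delay_n)
    length = max(len(derived_l), len(derived_r))
    return {                                       # steps 404-406
        "L": scaled(left, front_gain, length),
        "R": scaled(right, front_gain, length),
        "C": [0.0] * length,
        "LFE": [0.0] * length,
        "Ls": scaled(derived_l, rear_gain, length),
        "Rs": scaled(derived_r, rear_gain, length),
    }


out = upmix_stem([1.0, 0.0], [0.0, 1.0], delay_n=3)
```

With a rear gain larger than the front gain, as in the defaults here, the stem would be weighted toward the rear channel positions.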

[0122] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0123] Based on the same inventive concept, this application also provides an audio generation apparatus for implementing the audio generation method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more audio generation apparatus embodiments provided below can be found in the limitations of the audio generation method described above, and will not be repeated here.

[0124] In one embodiment, as shown in Figure 5, an audio generation apparatus is provided, comprising:

[0125] The original sound source signal extraction module 501 is used to extract the original sound source signals corresponding to multiple target sound source objects in the original audio.

[0126] The derived sound source signal acquisition module 502 is used to obtain the derived sound source signal corresponding to each of the target sound source objects by performing decorrelation processing on the original sound source signal corresponding to each of the target sound source objects.

[0127] The sound image position allocation module 503 is used to allocate the sound image position corresponding to any target sound source object based on the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0128] The target audio acquisition module 504 is used to adjust the gain of each target sound source object based on the sound image position corresponding to each target sound source object, and output the target audio corresponding to the original audio; the number of channels of the target audio is greater than the number of channels of the original audio.

[0129] In one embodiment, the derived sound source signal acquisition module 502 includes:

[0130] The first decorrelation submodule is used to perform decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first time delay processing method, so as to obtain the first time delay result of the original sound source signal corresponding to the target sound source object.

[0131] The second decorrelation submodule is used to decorrelate the first time delay result according to the second time delay processing method, and generate multiple decorrelation signals for any target sound source object, which serve as the derived sound source signals corresponding to any target sound source object.

[0132] In one embodiment, the first decorrelation submodule includes:

[0133] The all-pass filter processing unit is used to input the original sound source signal corresponding to any target sound source object into the all-pass filter, and obtain the output signal of the original sound source signal corresponding to any target sound source object based on the impulse response information of the all-pass filter, which is used as the first time delay result.

[0134] In one embodiment, the second decorrelation submodule includes:

[0135] A sampling information acquisition unit is used to acquire preset sampling information; the preset sampling information includes the number of sampling points.

[0136] The decorrelation signal obtaining unit is used to perform sampling delay processing on the first delay result according to the number of sampling points to obtain the plurality of decorrelation signals.

[0137] In one embodiment, the sound image position allocation module 503 includes:

[0138] The sound pressure difference information acquisition submodule is used to obtain sound pressure difference information by placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to any target sound source object into different channel positions respectively.

[0139] The sound image locating submodule is used to locate the sound image position corresponding to any target sound source object based on the sound pressure difference information.

[0140] In one embodiment, the sound pressure difference information acquisition submodule includes:

[0141] A signal placement method determination unit is used to determine the signal placement method for any target sound source object; the signal placement method includes the position of the first channel and the position of the second channel.

[0142] A signal placement unit is used to place the original sound source signal corresponding to any target sound source object into the first channel position, and to place the derived sound source signal corresponding to any target sound source object into the second channel position.

[0143] The sound pressure difference determination unit is used to determine the sound pressure difference information based on the sound pressure difference of the sound source between the first channel position and the second channel position.

[0144] In one embodiment, the original sound source signal extraction module 501 includes:

[0145] The original audio acquisition submodule is used to acquire the original audio containing multiple sound source objects;

[0146] The sound source separation submodule is used to input the original audio into the sound source separation network to obtain the original sound source signal corresponding to each sound source object;

[0147] The target sound source object determination submodule is used to take the sound source object to be decorrelated as the target sound source object and obtain the original sound source signals corresponding to multiple target sound source objects.

[0148] In one embodiment, the apparatus further includes:

[0149] The loudness ratio information determination module is used to determine the loudness ratio information of each sound source object in the original audio by performing loudness scaling on the original sound source signal and the original audio corresponding to each sound source object.

[0150] In one embodiment, the target audio acquisition module 504 includes:

[0151] The gain information determination submodule is used to adjust the gain of each target sound source object based on the sound image position corresponding to each target sound source object, to obtain the gain allocation information of each target sound source object, and to obtain the gain configuration information of sound source objects other than the target sound source objects;

[0152] The audio conversion submodule is used to combine the gain allocation information of each of the target sound source objects, the gain configuration information of sound source objects other than the target sound source objects, and the loudness ratio information of each of the sound source objects in the original audio to synthesize the converted audio corresponding to the original audio.

[0153] The audio rendering submodule is used to render the converted audio according to preset standardized processing information to obtain the target audio.

[0154] Each module in the aforementioned audio generation apparatus can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0155] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected via a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage medium. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements an audio generation method.

[0156] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0157] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0158] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0159] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0160] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0161] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0162] In one embodiment, the processor also performs the steps described in the other embodiments when executing the computer program.

[0163] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor:

[0164] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0165] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0166] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0167] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0168] In one embodiment, the computer program, when executed by a processor, also implements the steps described in the other embodiments above.

[0169] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, performs the following steps:

[0170] Extract the original sound source signals corresponding to multiple target sound source objects from the original audio;

[0171] By performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects, the derived sound source signals corresponding to each of the target sound source objects are obtained.

[0172] For any target sound source object, the sound image position corresponding to the target sound source object is assigned according to the original sound source signal and the derived sound source signal corresponding to the target sound source object.

[0173] Based on the sound image position corresponding to each of the target sound source objects, the gain of each of the target sound source objects is adjusted, and the target audio corresponding to the original audio is output; the number of channels of the target audio is greater than the number of channels of the original audio.

[0174] In one embodiment, the computer program, when executed by a processor, also implements the steps described in the other embodiments above.

[0175] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0176] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0177] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0178] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An audio generation method, characterized in that the method comprises: extracting original sound source signals corresponding to a plurality of target sound source objects from original audio; obtaining derived sound source signals corresponding to each of the target sound source objects by performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects; for any target sound source object, obtaining sound pressure difference information by placing the original sound source signal corresponding to the target sound source object and the derived sound source signal corresponding to the target sound source object into different channel positions, and locating the sound image position corresponding to the target sound source object based on the sound pressure difference information; and adjusting the gain of each of the target sound source objects based on the sound image position corresponding to each of the target sound source objects, and outputting target audio corresponding to the original audio; wherein the number of channels of the target audio is greater than the number of channels of the original audio.

2. The method according to claim 1, characterized in that the step of obtaining derived sound source signals corresponding to each of the target sound source objects by performing decorrelation processing on the original sound source signals corresponding to each of the target sound source objects comprises: performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to a first time delay processing method, to obtain a first time delay result of the original sound source signal corresponding to the target sound source object; and performing decorrelation processing on the first time delay result according to a second time delay processing method, to generate a plurality of decorrelation signals for the target sound source object, which serve as the derived sound source signals corresponding to the target sound source object.

3. The method according to claim 2, characterized in that the step of performing decorrelation processing on the original sound source signal corresponding to any target sound source object according to the first time delay processing method, to obtain the first time delay result of the original sound source signal corresponding to the target sound source object, comprises: inputting the original sound source signal corresponding to the target sound source object into an all-pass filter; and obtaining, based on impulse response information of the all-pass filter, an output signal for the original sound source signal corresponding to the target sound source object, the output signal serving as the first time delay result.

4. The method according to claim 2, characterized in that the step of performing decorrelation processing on the first time delay result according to the second time delay processing method, to generate a plurality of decorrelation signals for any target sound source object, comprises: obtaining preset sampling information, the preset sampling information including a number of sampling points; and performing sampling delay processing on the first time delay result based on the number of sampling points, to obtain the plurality of decorrelation signals.

5. The method according to claim 1, characterized in that the step of obtaining sound pressure difference information by placing the original sound source signal corresponding to any target sound source object and the derived sound source signal corresponding to the target sound source object into different channel positions comprises: determining a signal placement method for the target sound source object, the signal placement method including a first channel position and a second channel position; placing the original sound source signal corresponding to the target sound source object into the first channel position, and placing the derived sound source signal corresponding to the target sound source object into the second channel position; and determining the sound pressure difference information based on the sound pressure difference between the sound sources at the first channel position and the second channel position.

6. The method according to claim 1, characterized in that the step of extracting the original sound source signals corresponding to a plurality of target sound source objects from the original audio comprises: acquiring original audio containing a plurality of sound source objects; inputting the original audio into a sound source separation network to obtain the original sound source signal corresponding to each sound source object; and taking the sound source objects that are to undergo decorrelation processing as the target sound source objects, and obtaining the original sound source signals corresponding to the target sound source objects; and the method further comprises: performing loudness scaling on the original sound source signal corresponding to each sound source object and on the original audio, to determine loudness ratio information of each sound source object in the original audio.

7. The method according to claim 6, characterized in that the step of adjusting the gain of each of the target sound source objects based on the sound image position corresponding to each of the target sound source objects, and outputting the target audio corresponding to the original audio, comprises: adjusting the gain of each of the target sound source objects based on the sound image position corresponding to each of the target sound source objects to obtain gain allocation information of each of the target sound source objects, and obtaining gain configuration information of sound source objects other than the target sound source objects; synthesizing converted audio corresponding to the original audio by combining the gain allocation information of each of the target sound source objects, the gain configuration information of the sound source objects other than the target sound source objects, and the loudness ratio information of each sound source object in the original audio; and rendering the converted audio according to preset standardized processing information to obtain the target audio.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the steps of the method according to any one of claims 1 to 7 are implemented.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.
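
By way of non-limiting illustration only, the two-stage decorrelation of claims 2 to 4 and the level comparison of claim 5 might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the Schroeder-type all-pass structure, the function names, the delay and gain values, and the use of an RMS level difference in decibels as a stand-in for the "sound pressure difference information" are all hypothetical choices that the claims do not specify.

```python
import math
import random


def schroeder_allpass(x, delay, gain):
    """First stage (assumed form): Schroeder all-pass filter,
    y[n] = -g*x[n] + x[n-D] + g*y[n-D], with delay D >= 1."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + x_d + gain * y_d
    return y


def delay_samples(x, num_samples):
    """Second stage: delay by an integer number of sampling points,
    zero-padding the start and keeping the original length."""
    if num_samples <= 0:
        return list(x)
    return ([0.0] * num_samples + list(x))[:len(x)]


def derive_signals(x, allpass_delay=113, allpass_gain=0.5,
                   sample_delays=(7, 19, 31)):
    """Produce several derived signals from one source signal.
    All parameter values here are illustrative assumptions."""
    first_result = schroeder_allpass(x, allpass_delay, allpass_gain)
    return [delay_samples(first_result, d) for d in sample_delays]


def normalized_correlation(a, b):
    """Zero-lag normalized cross-correlation, in the range [-1, 1]."""
    num = sum(p * q for p, q in zip(a, b))
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return num / den if den > 0 else 0.0


def rms(x):
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def level_difference_db(first_channel, second_channel):
    """Assumed proxy for the sound pressure difference between two
    channel positions: RMS level difference in dB.
    Both signals are assumed to be non-silent."""
    return 20.0 * math.log10(rms(first_channel) / rms(second_channel))


if __name__ == "__main__":
    rng = random.Random(0)
    source = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in source
    for derived in derive_signals(source):
        print(normalized_correlation(source, derived),
              level_difference_db(source, derived))
```

In this sketch the all-pass stage alters phase while leaving the magnitude spectrum unchanged, and the sample delays then yield several mutually offset copies, so the derived signals correlate only weakly with the original at zero lag while staying close to it in level. How the resulting level difference would be mapped to a sound image position is left open here, as the claims do not fix a particular mapping.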