Audio processing method and device, computer device and storage medium

By preprocessing and feature extraction of stereo audio signals, and combining three-dimensional sound source location coordinates and convolutional attention neural networks, the problem of high correlation of channel signals in stereo music upmixing technology is solved, realizing the conversion from stereo to three-dimensional channels and improving sound effects.

CN116320962B · Active · Publication Date: 2026-06-12 · NIO TECH ANHUI CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NIO TECH ANHUI CO LTD
Filing Date: 2023-01-03
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Most existing stereo upmixing techniques can only upmix to a two-dimensional 5.1 or 7.1 channel format, resulting in high correlation of channel signals, information redundancy, insufficient separation of sound sources and spatial sense, which affects the sound effect.

Method used

By acquiring stereo audio signals, configuring three-dimensional sound source position coordinates, preprocessing and feature extraction of the audio signals, and upmixing the time and frequency domain features using the three-dimensional sound source position coordinates, a target audio signal with three-dimensional channels is generated. A convolutional attention neural network is used for model training to improve processing efficiency.

Benefits of technology

It enables the conversion of stereo audio signals into three-dimensional channel format, improving sound quality and enhancing the spatial sense and separation of sound effects.



Abstract

The application relates to an audio processing method and device, computer equipment, a storage medium and a computer program product. The method comprises the following steps: obtaining a to-be-processed audio signal and a configured three-dimensional sound source position coordinate, pre-processing the to-be-processed audio signal to obtain a pre-processed standard audio signal, then extracting features from the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features, and finally performing upmix processing on the time-domain audio features and the frequency-domain audio features according to the three-dimensional sound source position coordinate to obtain a processed target audio signal with three-dimensional sound channels. Thus, the to-be-processed audio signal in a stereo sound format is converted into the target audio signal with three-dimensional sound channels, so that the sound quality effect is improved.

Description

Technical Field

[0001] This application relates to the field of audio technology, and in particular to an audio processing method, apparatus, computer equipment, storage medium, and computer program product.

Background Technology

[0002] With the development of audio technology, various surround sound formats have emerged to achieve a better music experience, such as stereo, 5.1 channel, 7.1 channel, and 7.1.4 channel. Among these, 5.1, 7.1, and 7.1.4 are ways of expressing the number of channels output by a music system. The first number represents the sum of the left and right channels, surround channels, and center channel; the second number represents the number of subwoofer channels; and the third number represents the number of overhead channels. If not specified, overhead channels are not configured. Generally, the more channels there are, the more finely the spatial positioning of the music is divided during playback, resulting in a more immersive surround sound experience.
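The channel notation described above can be illustrated with a short sketch. The function name and the returned field names are illustrative assumptions and do not come from the application; the sketch only reads the label the way paragraph [0002] describes it.

```python
def parse_channel_layout(layout: str) -> dict:
    """Split a layout label such as "7.1.4" into its channel counts.

    Hypothetical helper: the first number counts left/right, surround and
    center channels, the second counts subwoofer channels, and the optional
    third counts overhead channels.
    """
    parts = [int(p) for p in layout.split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 'a.b' or 'a.b.c', got {layout!r}")
    main, subwoofer = parts[0], parts[1]
    # When the third number is omitted, no overhead channels are configured.
    overhead = parts[2] if len(parts) == 3 else 0
    return {
        "main": main,
        "subwoofer": subwoofer,
        "overhead": overhead,
        "total": main + subwoofer + overhead,
    }
```

Under this reading, a 7.1.4 layout has 12 output channels in total, while a 5.1 layout has 6 and no overhead channels.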

[0003] In traditional techniques, limited by equipment and production difficulty, over 90% of music on the market is produced in stereo format, i.e., two-channel playback, resulting in a lack of spatial surround sound. Therefore, upmixing techniques for stereo music have been proposed. These techniques mainly include two categories: one is an upmixing scheme based on empirical rules, which expands the number of channels by applying gain, delay, crossover filtering, and correlation processing to the left and right channels; the other is an upmixing scheme based on primary ambient extraction (PAE), which expands the number of channels by repositioning the sound image of the extracted two types of sound source signals.

[0004] However, most of the above-mentioned upmixing techniques for stereo music can only upmix into a two-dimensional 5.1 or 7.1 channel format, and the channel signals produced by upmixing have high correlation, resulting in information redundancy between channels. This leads to insufficient separation and spatial sense of the sound source, thus affecting the sound effect.

Summary of the Invention

[0005] Therefore, it is necessary to provide an audio processing method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve sound effects in response to the above-mentioned technical problems.

[0006] In a first aspect, this application provides an audio processing method. The method includes:

[0007] The audio signal to be processed and the configured three-dimensional sound source position coordinates are obtained, wherein the audio signal to be processed is a stereo audio signal;

[0008] The audio signal to be processed is preprocessed to obtain a preprocessed standard audio signal;

[0009] Feature extraction is performed on the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features;

[0010] The time-domain audio features and the frequency-domain audio features are upmixed based on the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels.

[0011] In one embodiment, the preprocessing of the audio signal to be processed to obtain a preprocessed standard audio signal includes: performing frame segmentation processing on the audio signal to be processed according to a preset frame length and frame shift to obtain multiple audio signal frames; performing channel averaging processing on each audio signal frame to obtain a processed average audio signal frame; obtaining the audio sampling rate of the average audio signal frame, converting the audio sampling rate into a target audio sampling rate, and obtaining the standard audio signal of the average audio signal frame.

[0012] In one embodiment, the step of extracting features from the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features includes: sampling the standard audio signal according to the target audio sampling rate to obtain sampled time-domain audio features; and obtaining transformed frequency-domain audio features by performing a short-time Fourier transform on the time-domain audio features.

[0013] In one embodiment, the step of upmixing the time-domain audio features and the frequency-domain audio features according to the three-dimensional sound source location coordinates to obtain a processed target audio signal with three-dimensional channels includes: normalizing the time-domain audio features and the frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features; performing feature synthesis transformation on the normalized time-domain features and the frequency-domain features to obtain synthesized latent space features; and parsing, separating, and recombining the latent space features according to the three-dimensional sound source location coordinates to obtain a target audio signal with three-dimensional channels.

[0014] In one embodiment, the step of upmixing the time-domain audio features and the frequency-domain audio features based on the three-dimensional sound source location coordinates to obtain a processed target audio signal with three-dimensional channels includes: inputting the three-dimensional sound source location coordinates, the time-domain audio features, and the frequency-domain audio features into a pre-trained audio upmixing model for upmixing to obtain a target audio signal with three-dimensional channels output by the audio upmixing model.

[0015] In one embodiment, the method for generating the audio upmixing model includes: constructing a training dataset based on the original audio track signal, the training dataset including the original audio track signal, sampled three-dimensional sound source position coordinate samples, and rendered three-dimensional channel sample audio signals; inputting the original audio track signal and the three-dimensional sound source position coordinate samples into a convolutional attention neural network for upmixing processing to obtain a processed predicted audio signal with three-dimensional channels; training the convolutional attention neural network based on the sample audio signal, the predicted audio signal, and a preset mean squared error loss function until convergence to obtain the trained audio upmixing model.

[0016] In one embodiment, constructing a training dataset based on the original audio track signals includes: acquiring multiple original audio track signals of different sounds; uniformly sampling a set space according to the three coordinate axes of the three-dimensional sound channel with a set step size to obtain sampled three-dimensional sound source position coordinate samples; performing layout rendering on the multiple original audio track signals based on the three-dimensional sound source position coordinate samples to obtain sample audio signals of the rendered three-dimensional sound channel; and constructing the training dataset based on the multiple original audio track signals, the three-dimensional sound source position coordinate samples, and the sample audio signals.
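The uniform sampling of a set space along three coordinate axes with a set step size, as described in the preceding embodiment, might be sketched as follows. The cube bounds and the step used in the usage note are purely hypothetical values; the application does not state them.

```python
from itertools import product


def sample_positions(lo: float, hi: float, step: float) -> list:
    """Return every (x, y, z) on a uniform grid covering [lo, hi] on each axis.

    Assumption: the "set space" is a cube with the same range on all three
    axes; the application does not specify its shape or extent.
    """
    if step <= 0 or hi < lo:
        raise ValueError("step must be positive and hi must not be below lo")
    # Number of grid points per axis; the small epsilon guards against
    # floating-point error when (hi - lo) is an exact multiple of step.
    n = int((hi - lo) / step + 1e-9) + 1
    axis = [lo + i * step for i in range(n)]
    return list(product(axis, repeat=3))
```

For example, sample_positions(-1.0, 1.0, 0.5) yields 5 values per axis and therefore 125 candidate sound source positions, each of which could be paired with an original audio track signal for rendering.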

[0017] In a second aspect, this application also provides an audio processing apparatus. The apparatus includes:

[0018] The data acquisition module is used to acquire the audio signal to be processed and the configured three-dimensional sound source position coordinates, wherein the audio signal to be processed is a stereo audio signal;

[0019] The preprocessing module is used to preprocess the audio signal to be processed to obtain a preprocessed standard audio signal;

[0020] The feature extraction module is used to extract features from the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features;

[0021] The processing module is used to perform upmixing processing on the time-domain audio features and the frequency-domain audio features according to the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels.

[0022] In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the method described in the first aspect above.

[0023] In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the method described in the first aspect above.

[0024] In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the steps of the method described in the first aspect above.

[0025] In the aforementioned audio processing method, apparatus, computer device, storage medium, and computer program product, a terminal acquires the audio signal to be processed and the configured three-dimensional sound source position coordinates. The terminal preprocesses the audio signal to obtain a preprocessed standard audio signal, then extracts features from the standard audio signal to obtain extracted time-domain and frequency-domain audio features. Finally, based on the three-dimensional sound source position coordinates, the time-domain and frequency-domain audio features are upmixed to obtain a processed target audio signal with three-dimensional channels. This converts the stereo format audio signal to be processed into a target audio signal with three-dimensional channels, thereby improving sound quality.

Description of the Drawings

[0026] Figure 1 is a flowchart of an audio processing method in one embodiment;

[0027] Figure 2 is a flowchart of the preprocessing steps in one embodiment;

[0028] Figure 3 is a flowchart of the feature extraction steps in one embodiment;

[0029] Figure 4 is a flowchart of the upmixing steps in one embodiment;

[0030] Figure 5 is a flowchart of the steps for generating the audio upmixing model in one embodiment;

[0031] Figure 6 is a schematic diagram of the structure of a convolutional attention neural network in one embodiment;

[0032] Figure 7 is a flowchart of the steps for constructing a training dataset in one embodiment;

[0033] Figure 8 is a structural block diagram of an audio processing device in one embodiment;

[0034] Figure 9 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

[0035] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0036] In one embodiment, as shown in Figure 1, an audio processing method is provided. This embodiment illustrates the method applied to a terminal. It is understood that this method can also be applied to a server, and further to a system including both a terminal and a server, and implemented through interaction between the terminal and the server. In this embodiment, the method may include the following steps:

[0037] Step 102: Obtain the audio signal to be processed and the configured three-dimensional sound source position coordinates.

[0038] The audio signal to be processed is a stereo audio signal. The three-dimensional sound source position coordinates can be the three-dimensional position coordinates of the sound source object configured based on the audio format conversion. Specifically, the three-dimensional sound source position coordinates can be configured according to actual needs.

[0039] In this embodiment, to obtain better sound effects, the stereo audio signal can be converted into a three-dimensional (e.g., 7.1.4 channel) audio signal. Before conversion, the terminal first needs to obtain the stereo audio signal to be processed and the configured three-dimensional sound source position coordinates, and then process it through subsequent steps to obtain the three-dimensional audio signal.

[0040] Step 104: Preprocess the audio signal to be processed to obtain a preprocessed standard audio signal.

[0041] Preprocessing refers to the preparatory process before final processing and refinement. In this embodiment, preprocessing can be a standardization process performed on the audio signal to be processed. A standard audio signal can be an audio signal with a unified standard obtained after preprocessing the audio signal to be processed. Specifically, the unified standard can be unified frame segmentation processing, channel processing, and audio sampling rate processing, etc.

[0042] In this embodiment, the terminal preprocesses the acquired audio signal to obtain a preprocessed standard audio signal.

[0043] Step 106: Extract features from the standard audio signal to obtain the extracted time-domain audio features and frequency-domain audio features.

[0044] Specifically, the terminal extracts time-domain and frequency-domain audio features from the standard audio signal by performing feature extraction on the standard audio signal.

[0045] Step 108: Upmix the time-domain audio features and frequency-domain audio features according to the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels.

[0046] The target audio signal refers to an audio signal with three-dimensional channels (e.g., 7.1.4 channels). The upmixing process includes the parsing, separation, and reconstruction of features. Specifically, during the reconstruction process, features can be reconstructed based on the three-dimensional sound source position coordinates configured above.

[0047] In this embodiment, the terminal can perform upmixing on the extracted time-domain audio features and frequency-domain audio features based on the three-dimensional sound source location coordinates, thereby obtaining the processed target audio signal with three-dimensional channels.

[0048] In the aforementioned audio processing method, the terminal acquires the audio signal to be processed and the configured three-dimensional sound source position coordinates, preprocesses the audio signal to be processed to obtain a preprocessed standard audio signal, then extracts features from the standard audio signal to obtain extracted time-domain and frequency-domain audio features, and finally upmixes the time-domain and frequency-domain audio features according to the three-dimensional sound source position coordinates to obtain a processed target audio signal with three-dimensional channels. This achieves the conversion of the stereo format audio signal to be processed into a target audio signal with three-dimensional channels, thereby improving sound quality.

[0049] In one embodiment, as shown in Figure 2, preprocessing the audio signal to be processed in step 104 to obtain a preprocessed standard audio signal may specifically include:

[0050] Step 202: The audio signal to be processed is divided into frames according to the preset frame length and frame shift to obtain multiple audio signal frames.

[0051] Frame segmentation refers to dividing the audio signal into segments of a specified length (such as a time interval or a number of samples), each segment being one audio signal frame. The frame length refers to the length of each audio signal frame and can be a time interval, for example, 20 ms. Frame shift refers to the overlap between adjacent audio signal frames after segmentation, which ensures signal continuity. It has the same unit as the frame length, for example, 5 ms. It can be understood that when the frame length is expressed in units of time, the frame shift should also be in units of time; conversely, when the frame length is expressed in samples, the frame shift should also be in samples.

[0052] In this embodiment, the terminal can perform frame-segmentation of the audio signal to be processed according to a preset frame length and frame shift, thereby obtaining multiple audio signal frames after segmentation.
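A minimal sketch of the frame segmentation in step 202 is given below. It assumes the frame shift is the step between the starts of consecutive frames, which is the common signal-processing convention; if the frame shift is instead read as the overlap, the step would be the frame length minus the frame shift. Dropping a trailing partial frame is also an assumption, as are the function names.

```python
def ms_to_samples(duration_ms: float, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(round(duration_ms * sample_rate / 1000))


def split_into_frames(samples: list, frame_len: int, frame_shift: int) -> list:
    """Cut a sample sequence into fixed-length, possibly overlapping frames.

    `samples` may hold mono values or (left, right) pairs; both are sliced
    the same way. A trailing partial frame is discarded (an assumption).
    """
    if frame_len <= 0 or frame_shift <= 0:
        raise ValueError("frame_len and frame_shift must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift  # advance by the hop, not by the full frame
    return frames
```

With the example values from the text and a 48 kHz signal, ms_to_samples(20, 48000) gives a 960-sample frame length and ms_to_samples(5, 48000) gives a 240-sample frame shift.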

[0053] Step 204: For each audio signal frame, perform channel averaging processing to obtain the processed average audio signal frame.

[0054] The channel averaging process involves averaging the signals from the left and right channels of the audio signal frame. Since the audio signal frame is a stereo format audio signal, it includes left and right channel signals. By averaging the left and right channel signals, the processed average audio signal frame is obtained.

[0055] In this embodiment, the terminal performs average processing on the signals of the left and right channels of the audio signal frame, thereby erasing the sound source location information in the original format, and thus treating different sound source objects more evenly in subsequent processing.
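The channel averaging of step 204 can be sketched as below, assuming each stereo frame is represented as a list of (left, right) sample pairs. That representation and the function name are illustrative choices, not part of the application.

```python
def average_channels(frame: list) -> list:
    """Average each (left, right) pair of a stereo frame into one mono sample."""
    return [(left + right) / 2.0 for left, right in frame]
```

A source panned fully left, such as [(1.0, 0.0)], and the same source panned fully right, [(0.0, 1.0)], both average to [0.5], which illustrates how the original left/right position information is erased.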

[0056] Step 206: Obtain the audio sampling rate of the average audio signal frame, convert the audio sampling rate to the target audio sampling rate, and obtain the standard audio signal of the average audio signal frame.

[0057] The audio sampling rate refers to the number of times an analog signal is sampled per unit time. Generally, the higher the sampling rate, the more faithfully the sampled waveform reproduces the original sound wave. Specifically, the audio sampling rate refers to the actual sampling rate of the audio signal to be processed, while the target audio sampling rate can be a preset desired sampling rate or a standard sampling rate.

[0058] In this embodiment, the terminal obtains the audio sampling rate of the average audio signal frame and converts the audio sampling rate into a target audio sampling rate. That is, the audio sampling rate of the average audio signal frame is standardized to a fixed value, such as 16 kHz, 24 kHz, 36 kHz or 48 kHz, thereby obtaining a standard audio signal with the target audio sampling rate.
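The sampling rate conversion of step 206 could be sketched with plain linear interpolation as below. This is a deliberately simple stand-in: the application does not say which resampling method is used, and a practical system would more likely use a band-limited (for example polyphase) resampler to avoid aliasing. The function name is hypothetical.

```python
def resample_linear(samples: list, src_rate: int, dst_rate: int) -> list:
    """Resample a mono signal from src_rate to dst_rate by linear interpolation.

    Assumption: linear interpolation is used only for illustration; it does
    not apply the anti-aliasing filter a production resampler would.
    """
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # position in the source signal
        j = int(pos)                       # sample at or before that position
        frac = pos - j                     # distance past sample j
        nxt = samples[min(j + 1, len(samples) - 1)]  # clamp at the last sample
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

For instance, 441 samples recorded at 44.1 kHz become 480 samples at a 48 kHz target rate, covering the same 10 ms of audio.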

[0059] In the above embodiments, the terminal performs frame-by-frame processing on the audio signal to be processed according to a preset frame length and frame shift, thereby obtaining multiple audio signal frames with a uniform scale. For each audio signal frame, channel averaging is performed to obtain a processed average audio signal frame. This process removes the sound source location information from the original format, allowing for more balanced treatment of different sound source objects in subsequent processing. By obtaining the audio sampling rate of the average audio signal frame, the audio sampling rate is converted to a target audio sampling rate, resulting in a standard audio signal with the target audio sampling rate, ensuring that the standard audio signal has a uniform standard scale.

[0060] In one embodiment, as shown in Figure 3, performing feature extraction on the standard audio signal in step 106 to obtain extracted time-domain audio features and frequency-domain audio features may specifically include:

[0061] Step 302: Sample the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features.

[0062] Specifically, the terminal can sample the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features. For example, if the frame length during frame segmentation is 20 ms, then each frame of the processed standard audio signal is 20 ms long. If the target audio sampling rate is 48 kHz, the terminal can sample the standard audio signal at this rate to obtain 960 sampled values per frame (0.020 s × 48,000 samples/s), which are the sampled time-domain audio features.

[0063] Step 304: The transformed frequency domain audio features are obtained by performing a short-time Fourier transform on the time-domain audio features.

[0064] In this embodiment, the terminal performs a short-time Fourier transform (STFT) on the time-domain audio features of each frame of standard audio signal obtained above to obtain the transformed frequency-domain audio features.

[0065] Specifically, performing a short-time Fourier transform on the time-domain audio features of each frame of the standard audio signal yields frequency-domain features with real and imaginary parts. In this embodiment, to reduce the feature dimension and the computational load of the algorithm, the conjugate-symmetric part of the spectrum can be removed, and only the real-part and imaginary-part features of the remaining bins are retained as the frequency-domain audio features.
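The per-frame transform and the removal of the conjugate-symmetric half can be sketched as follows. The Hann window and the function names are assumptions, since the application does not name a window, and the direct O(N²) DFT is used only to keep the sketch dependency-free; an FFT routine would normally be used instead.

```python
import cmath
import math


def hann(n: int) -> list:
    """Periodic Hann window of length n (window choice is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def one_sided_spectrum(frame: list) -> list:
    """Windowed DFT of one real frame, keeping bins 0..N/2 as (real, imag).

    For real input, bins above N/2 are complex conjugates of the bins below
    it, so they carry no extra information and are dropped.
    """
    n = len(frame)
    window = hann(n)
    bins = []
    for k in range(n // 2 + 1):
        acc = sum(
            frame[t] * window[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        bins.append((acc.real, acc.imag))
    return bins


def stft(frames: list) -> list:
    """Apply the one-sided transform to every frame of the standard signal."""
    return [one_sided_spectrum(frame) for frame in frames]
```

Under these assumptions a 960-sample frame yields 481 retained bins rather than 960, which is the dimensionality reduction the paragraph above refers to.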

[0066] In this embodiment, the terminal samples the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features. Then, by performing a short-time Fourier transform on the time-domain audio features, the transformed frequency-domain audio features are obtained. This provides both time-domain and frequency-domain audio features, facilitating subsequent upmixing processing.

[0067] In one embodiment, as shown in Figure 4, upmixing the time-domain audio features and frequency-domain audio features in step 108 according to the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels may specifically include:

[0068] Step 402: Normalize the time-domain audio features and frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features.

[0069] Normalization involves transforming dimensional parameters into dimensionless parameters. In this embodiment, the terminal normalizes the time-domain audio features to obtain normalized time-domain features, and normalizes the frequency-domain audio features to obtain normalized frequency-domain features.
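One possible form of the normalization in step 402 is zero-mean, unit-variance scaling, sketched below. The application does not specify which normalization is applied, so this particular choice, the function name, and the small eps constant are all assumptions.

```python
import math


def normalize(values: list, eps: float = 1e-8) -> list:
    """Scale a feature vector to zero mean and (approximately) unit variance.

    Assumption: standard-score normalization; eps avoids division by zero
    for a constant (for example silent) frame.
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]
```

The same function could be applied separately to the time-domain samples of a frame and to its real and imaginary frequency-domain features, so that both feature sets are dimensionless and on a comparable scale before they are combined.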

[0070] Step 404: Perform feature synthesis transformation on the normalized time-domain features and frequency-domain features to obtain the latent space features after synthesis transformation.

[0071] The synthesis transformation can be a combination of time-domain features and frequency-domain features in a certain way. In this embodiment, the terminal performs a feature synthesis transformation on the normalized time-domain features and frequency-domain features to obtain the high-dimensional latent space features after the synthesis transformation.

[0072] Step 406: Analyze, separate and reconstruct the latent space features based on the three-dimensional sound source location coordinates to obtain the target audio signal with three-dimensional sound channels.

[0073] Specifically, the terminal can analyze, separate, and reassemble the latent space features obtained in the above steps based on the three-dimensional sound source location coordinates, thereby obtaining a target audio signal with three-dimensional channels.

[0074] In this embodiment, the terminal normalizes the time-domain audio features and frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features. Then, it performs feature synthesis transformation on the normalized time-domain features and frequency-domain features to obtain the synthesized latent space features. Subsequently, it analyzes, separates and reassembles the latent space features according to the three-dimensional sound source position coordinates to obtain the target audio signal with three-dimensional channels. This realizes the conversion of stereo format audio signals into three-dimensional channel audio signals, thereby improving the sound effect of the audio signal.

[0075] In one embodiment, in step 108, the time-domain audio features and frequency-domain audio features are upmixed according to the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels. Specifically, it may also include: inputting the three-dimensional sound source location coordinates, time-domain audio features and frequency-domain audio features into a pre-trained audio upmixing model for upmixing to obtain the target audio signal with three-dimensional channels output by the audio upmixing model.

[0076] The audio upmixing model can be a pre-trained neural network model that upmixes the time-domain audio features and frequency-domain audio features according to the three-dimensional sound source position coordinates to obtain a target audio signal with three-dimensional channels. Because this embodiment uses an audio upmixing model to upmix the audio signal, it can improve processing efficiency.

[0077] In one embodiment, as shown in Figure 5, the method for generating the audio upmixing model may include the following steps:

[0078] Step 502: Construct a training dataset based on the original audio track signal.

[0079] The training dataset includes the original audio track signal, sampled 3D sound source position coordinates, and rendered 3D channel sample audio signals. Specifically, the original audio track signal refers to the unprocessed raw audio with track attributes, such as vocals, drum sounds, guitar sounds, piano sounds, bass notes, and accompaniment sounds. The sampled 3D sound source position coordinates are samples of the 3D position coordinates of sound source objects determined by sampling a specific space based on certain sampling rules. The sample audio signals are the 3D channel audio signals obtained by rendering the original audio track signal based on the 3D sound source position coordinate samples.

[0080] In this embodiment, the training dataset is a dataset used for model training. Specifically, before model training, the terminal can first construct a training dataset based on the original audio track signal.

[0081] Step 504: Input the original audio track signal and the three-dimensional sound source position coordinate samples into the convolutional attention neural network for upmixing to obtain the processed predicted audio signal with three-dimensional channels.

[0082] The convolutional attention neural network is the base network used for model training. Specifically, before the original audio track signal is input into the convolutional attention neural network, it needs to undergo the preprocessing shown in Figure 2 and the feature extraction shown in Figure 3. The extracted time-domain audio features, frequency-domain audio features, and three-dimensional sound source position coordinate samples are then input into the convolutional attention neural network for upmixing, thereby obtaining a processed predicted audio signal with three-dimensional channels.

[0083] In this embodiment, the structure of the convolutional attention neural network is shown in Figure 6. From top to bottom, it includes a normalization layer, a first one-dimensional convolutional layer, a pooling layer, a first two-dimensional convolutional layer, a second two-dimensional convolutional layer, an activation layer, a fully connected layer, and a second one-dimensional convolutional layer. The first two-dimensional convolutional layer can be a 3×3 two-dimensional convolutional layer, and the second two-dimensional convolutional layer can be a 7×7 two-dimensional convolutional layer.

[0084] Specifically, the normalization layer normalizes the input time-domain and frequency-domain audio features. The first one-dimensional convolutional layer performs feature synthesis transformation on the normalized time-domain and frequency-domain features to obtain high-dimensional latent space features, thus avoiding the limitations of manually extracted feature modeling. The pooling layer, the first two-dimensional convolutional layer, the second two-dimensional convolutional layer, the activation layer, and the fully connected layer form the convolutional attention module, which performs feature parsing, separation, and recombination on the transformed latent space features. Finally, the second one-dimensional convolutional layer outputs a predicted audio signal with three-dimensional channels. The pooling layer can perform max pooling with a 1×2 window along the time dimension, thereby filtering out the main feature components, and each pooling operation reduces the computational load by 50%. Feature parsing is accomplished through the 3×3 and 7×7 two-dimensional convolutional layers. The activation layer enables attention control over features. Feature rendering and reconstruction are controlled by the input three-dimensional sound source position coordinate samples, thereby achieving end-to-end upmixing processing and completing the stereo to three-dimensional channel conversion.
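The 1×2 max pooling along the time dimension can be illustrated with a small sketch. It assumes the feature map is stored as a list of rows, one per feature channel, with time running along each row; this layout and the function name are illustrative only and do not reflect how the trained network stores its tensors.

```python
def max_pool_time(feature_map: list, window: int = 2) -> list:
    """Max-pool each row of a feature map along time with a non-overlapping window.

    With window=2, every pair of adjacent time steps is replaced by its larger
    value, so the time dimension (and the work of later layers) is halved.
    A trailing step that does not fill a window is dropped (an assumption).
    """
    pooled = []
    for row in feature_map:
        pooled.append([
            max(row[i:i + window])
            for i in range(0, len(row) - window + 1, window)
        ])
    return pooled
```

Pooling a map with 4 time steps per row therefore leaves 2 time steps per row, consistent with the 50% reduction stated above.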

[0085] Step 506: Train the convolutional attention neural network based on the sample audio signal, the predicted audio signal, and the preset mean squared error loss function until convergence to obtain the trained audio upmixing model.

[0086] Since the sample audio signal is a three-dimensional audio signal obtained by rendering the original audio track signal based on three-dimensional sound source position coordinate samples, this signal can be used as the learning target of the convolutional attention neural network. Specifically, based on the sample audio signal, the predicted audio signal, and a preset mean squared error loss function, the prediction loss of the convolutional attention neural network can be calculated, and the convolutional attention neural network can be trained according to this loss until convergence (i.e., when the loss is minimized), resulting in the trained audio upmixing model. The preset mean squared error loss function can be used to calculate the loss on the time-domain audio features.
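The training criterion can be illustrated with a small Python sketch. The function `mse_loss` computes the mean squared error between a predicted signal and a sample (target) signal. The function `fit_gain` is a deliberately tiny stand-in for the network: it learns a single scalar gain by gradient descent and stops when the loss no longer changes. The learning rate, tolerance, and the one-parameter "model" are assumptions for illustration and are not taken from the specification.

```python
from typing import Sequence, Tuple


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sample sequences."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def fit_gain(source: Sequence[float], target: Sequence[float],
             lr: float = 0.1, tol: float = 1e-12,
             max_steps: int = 10_000) -> Tuple[float, float]:
    """Toy training loop: learn one scalar gain g so that g * source ~ target."""
    gain = 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_steps):
        predicted = [gain * s for s in source]
        loss = mse_loss(predicted, target)
        if abs(prev_loss - loss) < tol:  # treat an unchanged loss as convergence
            break
        # d(loss)/d(gain) = (2 / N) * sum((g * s - t) * s)
        grad = 2.0 * sum((p - t) * s
                         for p, t, s in zip(predicted, target, source)) / len(source)
        gain -= lr * grad
        prev_loss = loss
    return gain, loss


source = [0.0, 0.5, -0.5, 1.0, -1.0]
target = [0.8 * s for s in source]  # hypothetical target signal
gain, final_loss = fit_gain(source, target)
```

In the real model the same loss would be minimized over all network weights rather than one gain, typically with a deep-learning framework's optimizer.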

[0087] It can be understood that the structure of the trained audio upmixing model is similar to that shown in Figure 6, and its working principle is also similar, so it is not described in detail again in this embodiment.

[0088] In one embodiment, as shown in Figure 7, constructing the training dataset based on the original audio track signal in step 502 may specifically include:

[0089] Step 702: Obtain the original audio track signals of multiple different sounds.

[0090] Here, the original audio track signals of different sounds refer to the original audio with different audio track attributes. In this embodiment, the original audio track signals of multiple different sounds can include six original audio track signals, such as human voice, drum sound, guitar sound, piano sound, bass sound, and accompaniment sound.

[0091] In this embodiment, in order to construct a training dataset, the terminal can acquire multiple original audio track signals of different sounds and construct the training dataset through subsequent steps.

[0092] Step 704: Based on the three coordinate axes of the three-dimensional sound channel, uniformly sample the set space with a set step size to obtain the sampled three-dimensional sound source position coordinate samples.

[0093] The set space can be the space in which the original audio track signal is rendered, such as a car cabin, an airplane cabin, or another specific space. In this embodiment, to improve the rendered sound effects and increase the robustness of the model, the original audio track signal can be rendered in the set space. Therefore, in this embodiment, the terminal can uniformly sample the set space along the three coordinate axes of the three-dimensional sound channel at a set step size to obtain the three-dimensional sound source position coordinate samples required for the layout rendering of the original audio track signal. Specifically, the set step size can be a preset sampling step size. The three coordinate axes can be the X-axis, Y-axis, and Z-axis. The terminal can uniformly sample each of the three coordinate axes at the set step size and randomly construct the three-dimensional position coordinates of a sound source object from the sampled points, thus obtaining the sampled three-dimensional sound source position coordinate samples.
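The sampling procedure of step 704 can be sketched in Python as follows: each axis is sampled uniformly at the set step size, and a source position is then built by picking one sampled point per axis at random. The bounds (a cabin-sized box in metres), the step of 0.5, the fixed random seed, and the function names are hypothetical values chosen for illustration.

```python
import random
from typing import List, Sequence, Tuple

Position = Tuple[float, float, float]


def axis_points(lo: float, hi: float, step: float) -> List[float]:
    """Uniformly spaced points on one axis, starting at lo and not exceeding hi."""
    count = int((hi - lo) / step + 1e-9) + 1
    return [round(lo + i * step, 6) for i in range(count)]


def sample_positions(bounds: Sequence[Tuple[float, float]], step: float,
                     n: int, seed: int = 0) -> List[Position]:
    """Draw n source positions by picking one sampled point per axis at random."""
    rng = random.Random(seed)
    xs, ys, zs = (axis_points(lo, hi, step) for lo, hi in bounds)
    return [(rng.choice(xs), rng.choice(ys), rng.choice(zs)) for _ in range(n)]


# Hypothetical set space: (min, max) extents for the X, Y, and Z axes.
bounds = [(-1.0, 1.0), (-2.0, 2.0), (0.0, 1.5)]
positions = sample_positions(bounds, step=0.5, n=4)
for p in positions:
    print(p)
```

Every returned coordinate lies on the sampling grid, so the set of possible source positions is finite and evenly covers the space.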

[0094] Step 706: Perform layout rendering on the multiple original audio track signals based on the three-dimensional sound source position coordinate samples to obtain the rendered sample audio signals of the three-dimensional channels.

[0095] Specifically, Pro Tools (a workstation software system) can be used for layout rendering. For example, the terminal inputs the multiple original audio track signals into Pro Tools together with the three-dimensional sound source position coordinate samples, thereby obtaining the sample audio signals of the three-dimensional channels output by Pro Tools after layout rendering.

[0096] Step 708: Construct a training dataset based on multiple original audio track signals, three-dimensional sound source position coordinate samples, and sample audio signals.

[0097] Specifically, the terminal can construct a training dataset based on multiple original audio track signals, three-dimensional sound source location coordinate samples, and sample audio signals. The sample audio signals are the target of model learning, while the multiple original audio track signals and three-dimensional sound source location coordinate samples are used for model prediction. By minimizing the difference between the predicted results and the sample audio signals, the model parameters are optimized, thereby improving the model's accuracy.
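One possible in-memory layout for such a training dataset is sketched below in Python. Everything here is an assumption for illustration: the `TrainingExample` fields, the choice of one position per track, and in particular `placeholder_render`, which is a trivial stand-in and does not reproduce the layout rendering that the embodiment obtains from Pro Tools.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float, float]
Track = List[float]


@dataclass
class TrainingExample:
    tracks: Dict[str, Track]        # original audio track signals (model input)
    positions: Dict[str, Position]  # 3D source position sample per track (model input)
    target: List[Track]             # rendered 3D-channel sample audio (learning target)


def build_dataset(
    tracks: Dict[str, Track],
    position_sets: List[Dict[str, Position]],
    render: Callable[[Dict[str, Track], Dict[str, Position]], List[Track]],
) -> List[TrainingExample]:
    """Pair every sampled position layout with its rendered multichannel target."""
    return [TrainingExample(tracks, pos, render(tracks, pos)) for pos in position_sets]


def placeholder_render(tracks: Dict[str, Track],
                       positions: Dict[str, Position],
                       n_channels: int = 4) -> List[Track]:
    """Stand-in renderer (NOT Pro Tools): copies the plain mix to every channel."""
    length = len(next(iter(tracks.values())))
    mix = [sum(t[i] for t in tracks.values()) for i in range(length)]
    return [list(mix) for _ in range(n_channels)]


tracks = {"vocals": [0.1, 0.2, 0.3], "drums": [0.0, -0.1, 0.1]}
position_sets = [
    {"vocals": (0.0, 1.0, 0.5), "drums": (-0.5, -1.0, 0.0)},
    {"vocals": (0.5, 0.0, 1.0), "drums": (1.0, 2.0, 1.5)},
]
dataset = build_dataset(tracks, position_sets, placeholder_render)
```

Passing the renderer in as a callable keeps the dataset code independent of whichever tool actually produces the three-dimensional channel targets.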

[0098] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0099] Based on the same inventive concept, this application also provides an audio processing apparatus for implementing the audio processing method described above. The solution provided by this apparatus is similar to the implementation described in the above method; therefore, the specific limitations in one or more audio processing apparatus embodiments provided below can be found in the limitations of the audio processing method described above, and will not be repeated here.

[0100] In one embodiment, as shown in Figure 8, an audio processing device is provided, including: a data acquisition module 802, a preprocessing module 804, a feature extraction module 806, and a processing module 808, wherein:

[0101] The data acquisition module 802 is used to acquire the audio signal to be processed and the configured three-dimensional sound source position coordinates, wherein the audio signal to be processed is a stereo audio signal;

[0102] Preprocessing module 804 is used to preprocess the audio signal to be processed to obtain a preprocessed standard audio signal;

[0103] The feature extraction module 806 is used to extract features from the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features;

[0104] The processing module 808 is used to perform upmixing processing on the time-domain audio features and the frequency-domain audio features according to the three-dimensional sound source position coordinates to obtain the processed target audio signal with three-dimensional channels.

[0105] In one embodiment, the preprocessing module is specifically used to: perform frame segmentation processing on the audio signal to be processed according to a preset frame length and frame shift to obtain multiple audio signal frames; perform channel averaging processing on each audio signal frame to obtain a processed average audio signal frame; obtain the audio sampling rate of the average audio signal frame, convert the audio sampling rate into a target audio sampling rate, and obtain the standard audio signal of the average audio signal frame.
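The three preprocessing operations can be sketched in Python in the order stated above: framing, channel averaging, and sampling-rate conversion. The frame length, frame shift, and sampling rates in the example are hypothetical, and `resample_linear` is a naive linear-interpolation resampler used only as an assumption; the specification does not say which conversion method is used, and a production resampler would include anti-alias filtering.

```python
from typing import List

Stereo = List[List[float]]  # [left, right], each a list of samples


def split_frames(stereo: Stereo, frame_len: int, frame_shift: int) -> List[Stereo]:
    """Cut a stereo signal into frames; a tail shorter than frame_len is dropped."""
    n = len(stereo[0])
    return [
        [channel[start:start + frame_len] for channel in stereo]
        for start in range(0, n - frame_len + 1, frame_shift)
    ]


def average_channels(frame: Stereo) -> List[float]:
    """Average the channels sample by sample to get one mono frame."""
    return [sum(samples) / len(samples) for samples in zip(*frame)]


def resample_linear(frame: List[float], src_rate: int, dst_rate: int) -> List[float]:
    """Naive linear-interpolation resampler (assumption; no anti-alias filter)."""
    n_out = int(len(frame) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        lo = int(pos)
        hi = min(lo + 1, len(frame) - 1)
        frac = pos - lo
        out.append(frame[lo] * (1.0 - frac) + frame[hi] * frac)
    return out


def preprocess(stereo: Stereo, src_rate: int, dst_rate: int,
               frame_len: int, frame_shift: int) -> List[List[float]]:
    """Frame, average channels, then convert each frame to the target rate."""
    return [
        resample_linear(average_channels(f), src_rate, dst_rate)
        for f in split_frames(stereo, frame_len, frame_shift)
    ]


left = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
right = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
frames = preprocess([left, right], src_rate=48000, dst_rate=24000,
                    frame_len=4, frame_shift=2)
print(frames)  # [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]]
```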

[0106] In one embodiment, the feature extraction module is specifically used to: sample the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features; and obtain the transformed frequency-domain audio features by performing a short-time Fourier transform on the time-domain audio features.
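For illustration, a short-time Fourier transform can be written in plain Python with a Hann window and a direct DFT. The frame length of 32, hop of 16, the Hann window, and the test sinusoid are assumptions of this sketch; the specification names only the transform itself. A direct DFT is slow and is used here only to keep the example dependency-free.

```python
import cmath
import math
from typing import List


def stft(signal: List[float], frame_len: int, hop: int) -> List[List[complex]]:
    """Short-time Fourier transform using a Hann window and a direct DFT.

    Returns one spectrum per frame, each with frame_len // 2 + 1
    non-negative frequency bins.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectra.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ])
    return spectra


# Hypothetical time-domain feature: a sinusoid centred on DFT bin 4.
frame_len, hop = 32, 16
time_feature = [math.sin(2 * math.pi * 4 * n / frame_len) for n in range(96)]
freq_feature = stft(time_feature, frame_len, hop)

magnitudes = [abs(c) for c in freq_feature[0]]
peak_bin = magnitudes.index(max(magnitudes))
print(len(freq_feature), len(freq_feature[0]), peak_bin)  # 5 17 4
```

The magnitude peak lands on bin 4, matching the frequency of the input sinusoid, which is a quick sanity check that the frequency-domain features reflect the time-domain content.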

[0107] In one embodiment, the processing module is specifically used to: normalize the time-domain audio features and the frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features; perform feature synthesis transformation on the normalized time-domain features and the frequency-domain features to obtain synthesized latent space features; and analyze, separate, and reconstruct the latent space features according to the three-dimensional sound source location coordinates to obtain a target audio signal with three-dimensional channels.
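The first two operations, normalization and feature synthesis, can be sketched in Python as follows. Zero-mean, unit-variance standardization is assumed because the specification does not name a normalization scheme, and the feature synthesis transformation is approximated by a single one-dimensional convolution over the concatenated features with a fixed, hypothetical kernel. In the trained model these weights would be learned, and the subsequent analysis, separation, and reconstruction are left to the network.

```python
import math
from typing import List


def standardize(values: List[float], eps: float = 1e-8) -> List[float]:
    """Scale a feature vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def conv1d(values: List[float], kernel: List[float]) -> List[float]:
    """'Valid' 1-D cross-correlation, the operation neural networks call convolution."""
    k = len(kernel)
    return [sum(values[i + j] * kernel[j] for j in range(k))
            for i in range(len(values) - k + 1)]


# Hypothetical features on very different scales.
time_feature = [0.2, 0.4, 0.6, 0.8]
freq_magnitude = [10.0, 30.0, 20.0, 40.0]

norm_time = standardize(time_feature)
norm_freq = standardize(freq_magnitude)

# Hypothetical fixed kernel; a trained network would learn these weights.
latent = conv1d(norm_time + norm_freq, kernel=[0.25, 0.5, 0.25])
```

Normalizing each feature type separately keeps the larger-valued spectral magnitudes from dominating the time-domain samples when the two are combined.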

[0108] In one embodiment, the processing module is further configured to: input the three-dimensional sound source location coordinates, the time-domain audio features, and the frequency-domain audio features into a pre-trained audio upmixing model for upmixing processing, to obtain a target audio signal with three-dimensional channels output by the audio upmixing model.

[0109] In one embodiment, the processing module is further configured to: construct a training dataset based on the original audio track signal, the training dataset including the original audio track signal, sampled three-dimensional sound source position coordinate samples, and rendered three-dimensional channel sample audio signals; input the original audio track signal and the three-dimensional sound source position coordinate samples into a convolutional attention neural network for upmixing processing to obtain a processed predicted audio signal with three-dimensional channels; and train the convolutional attention neural network based on the sample audio signal, the predicted audio signal, and a preset mean squared error loss function until convergence to obtain the trained audio upmixing model.

[0110] In one embodiment, the processing module is further configured to: acquire multiple original audio track signals of different sounds; uniformly sample a set space according to the three coordinate axes of the three-dimensional sound channel with a set step size to obtain sampled three-dimensional sound source position coordinate samples; perform layout rendering on the multiple original audio track signals according to the three-dimensional sound source position coordinate samples to obtain sample audio signals of the rendered three-dimensional sound channel; and construct the training dataset according to the multiple original audio track signals, the three-dimensional sound source position coordinate samples, and the sample audio signals.

[0111] Each module in the aforementioned audio processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in hardware form in, or be independent of, the processor of a computer device, or stored in software form in the memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.

[0112] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 9. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements an audio processing method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen, or buttons, trackballs, or touchpads set on the casing of the computer device, or external keyboards, touchpads, or mice, etc.

[0113] Those skilled in the art will understand that the structure shown in Figure 9 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0114] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0115] The audio signal to be processed and the configured three-dimensional sound source position coordinates are obtained, wherein the audio signal to be processed is a stereo audio signal;

[0116] The audio signal to be processed is preprocessed to obtain a preprocessed standard audio signal;

[0117] Feature extraction is performed on the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features;

[0118] The time-domain audio features and the frequency-domain audio features are upmixed based on the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels.

[0119] In one embodiment, when the processor executes the computer program, it further performs the following steps: performing frame processing on the audio signal to be processed according to a preset frame length and frame shift to obtain multiple audio signal frames; performing channel averaging processing on each audio signal frame to obtain a processed average audio signal frame; obtaining the audio sampling rate of the average audio signal frame, converting the audio sampling rate into a target audio sampling rate, and obtaining the standard audio signal of the average audio signal frame.

[0120] In one embodiment, when the processor executes the computer program, it further performs the following steps: sampling the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features; and obtaining the transformed frequency-domain audio features by performing a short-time Fourier transform on the time-domain audio features.

[0121] In one embodiment, when the processor executes the computer program, it further performs the following steps: normalizing the time-domain audio features and the frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features; performing feature synthesis transformation on the normalized time-domain features and the frequency-domain features to obtain synthesized latent space features; and analyzing, separating, and recombining the latent space features according to the three-dimensional sound source location coordinates to obtain a target audio signal with three-dimensional channels.

[0122] In one embodiment, when the processor executes the computer program, it further performs the following steps: inputting the three-dimensional sound source location coordinates, the time-domain audio features, and the frequency-domain audio features into a pre-trained audio upmixing model for upmixing processing, thereby obtaining a target audio signal with three-dimensional channels output by the audio upmixing model.

[0123] In one embodiment, when the processor executes the computer program, it further performs the following steps: constructing a training dataset based on the original audio track signal, the training dataset including the original audio track signal, sampled three-dimensional sound source position coordinate samples, and rendered three-dimensional channel sample audio signals; inputting the original audio track signal and the three-dimensional sound source position coordinate samples into a convolutional attention neural network for upmixing processing to obtain a processed predicted audio signal with three-dimensional channels; training the convolutional attention neural network based on the sample audio signal, the predicted audio signal, and a preset mean squared error loss function until convergence to obtain the trained audio upmixing model.

[0124] In one embodiment, when the processor executes the computer program, it further performs the following steps: acquiring multiple original audio track signals of different sounds; uniformly sampling a set space according to the three coordinate axes of the three-dimensional sound channel with a set step size to obtain sampled three-dimensional sound source position coordinate samples; performing layout rendering on the multiple original audio track signals according to the three-dimensional sound source position coordinate samples to obtain sample audio signals of the rendered three-dimensional sound channel; and constructing the training dataset according to the multiple original audio track signals, the three-dimensional sound source position coordinate samples, and the sample audio signals.

[0125] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor:

[0126] The audio signal to be processed and the configured three-dimensional sound source position coordinates are obtained, wherein the audio signal to be processed is a stereo audio signal;

[0127] The audio signal to be processed is preprocessed to obtain a preprocessed standard audio signal;

[0128] Feature extraction is performed on the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features;

[0129] The time-domain audio features and the frequency-domain audio features are upmixed based on the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels.

[0130] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: performing frame processing on the audio signal to be processed according to a preset frame length and frame shift to obtain multiple audio signal frames; performing channel averaging processing on each audio signal frame to obtain a processed average audio signal frame; obtaining the audio sampling rate of the average audio signal frame, converting the audio sampling rate into a target audio sampling rate, and obtaining the standard audio signal of the average audio signal frame.

[0131] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: sampling the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features; and obtaining the transformed frequency-domain audio features by performing a short-time Fourier transform on the time-domain audio features.

[0132] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: normalizing the time-domain audio features and the frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features; performing feature synthesis transformation on the normalized time-domain features and the frequency-domain features to obtain synthesized latent space features; and analyzing, separating, and recombining the latent space features according to the three-dimensional sound source location coordinates to obtain a target audio signal with three-dimensional channels.

[0133] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: inputting the three-dimensional sound source location coordinates, the time-domain audio features, and the frequency-domain audio features into a pre-trained audio upmixing model for upmixing processing, thereby obtaining a target audio signal with three-dimensional channels output by the audio upmixing model.

[0134] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: constructing a training dataset based on the original audio track signal, the training dataset including the original audio track signal, sampled three-dimensional sound source position coordinate samples, and rendered three-dimensional channel sample audio signals; inputting the original audio track signal and the three-dimensional sound source position coordinate samples into a convolutional attention neural network for upmixing processing to obtain a processed predicted audio signal with three-dimensional channels; training the convolutional attention neural network based on the sample audio signal, the predicted audio signal, and a preset mean squared error loss function until convergence to obtain the trained audio upmixing model.

[0135] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: acquiring multiple original audio track signals of different sounds; uniformly sampling a set space according to the three coordinate axes of the three-dimensional sound channel with a set step size to obtain sampled three-dimensional sound source position coordinate samples; performing layout rendering on the multiple original audio track signals according to the three-dimensional sound source position coordinate samples to obtain sample audio signals of the rendered three-dimensional sound channel; and constructing the training dataset according to the multiple original audio track signals, the three-dimensional sound source position coordinate samples, and the sample audio signals.

[0136] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, performs the following steps:

[0137] The audio signal to be processed and the configured three-dimensional sound source position coordinates are obtained, wherein the audio signal to be processed is a stereo audio signal;

[0138] The audio signal to be processed is preprocessed to obtain a preprocessed standard audio signal;

[0139] Feature extraction is performed on the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features;

[0140] The time-domain audio features and the frequency-domain audio features are upmixed based on the three-dimensional sound source location coordinates to obtain the processed target audio signal with three-dimensional channels.

[0141] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: performing frame processing on the audio signal to be processed according to a preset frame length and frame shift to obtain multiple audio signal frames; performing channel averaging processing on each audio signal frame to obtain a processed average audio signal frame; obtaining the audio sampling rate of the average audio signal frame, converting the audio sampling rate into a target audio sampling rate, and obtaining the standard audio signal of the average audio signal frame.

[0142] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: sampling the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features; and obtaining the transformed frequency-domain audio features by performing a short-time Fourier transform on the time-domain audio features.

[0143] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: normalizing the time-domain audio features and the frequency-domain audio features respectively to obtain normalized time-domain features and frequency-domain features; performing feature synthesis transformation on the normalized time-domain features and the frequency-domain features to obtain synthesized latent space features; and analyzing, separating, and recombining the latent space features according to the three-dimensional sound source location coordinates to obtain a target audio signal with three-dimensional channels.

[0144] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: inputting the three-dimensional sound source location coordinates, the time-domain audio features, and the frequency-domain audio features into a pre-trained audio upmixing model for upmixing processing, thereby obtaining a target audio signal with three-dimensional channels output by the audio upmixing model.

[0145] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: constructing a training dataset based on the original audio track signal, the training dataset including the original audio track signal, sampled three-dimensional sound source position coordinate samples, and rendered three-dimensional channel sample audio signals; inputting the original audio track signal and the three-dimensional sound source position coordinate samples into a convolutional attention neural network for upmixing processing to obtain a processed predicted audio signal with three-dimensional channels; training the convolutional attention neural network based on the sample audio signal, the predicted audio signal, and a preset mean squared error loss function until convergence to obtain the trained audio upmixing model.

[0146] In one embodiment, when the computer program is executed by the processor, it further performs the following steps: acquiring multiple original audio track signals of different sounds; uniformly sampling a set space according to the three coordinate axes of the three-dimensional sound channel with a set step size to obtain sampled three-dimensional sound source position coordinate samples; performing layout rendering on the multiple original audio track signals according to the three-dimensional sound source position coordinate samples to obtain sample audio signals of the rendered three-dimensional sound channel; and constructing the training dataset according to the multiple original audio track signals, the three-dimensional sound source position coordinate samples, and the sample audio signals.

[0147] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0148] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.

[0149] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0150] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. An audio processing method, characterized in that the method includes: obtaining an audio signal to be processed and configured three-dimensional sound source position coordinates, wherein the audio signal to be processed is a stereo audio signal; preprocessing the audio signal to be processed to obtain a preprocessed standard audio signal; performing feature extraction on the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features; and upmixing the time-domain audio features and the frequency-domain audio features based on the three-dimensional sound source position coordinates to obtain a processed target audio signal with three-dimensional channels.

2. The method according to claim 1, characterized in that preprocessing the audio signal to be processed to obtain a preprocessed standard audio signal includes: dividing the audio signal to be processed into frames according to a preset frame length and frame shift to obtain multiple audio signal frames; performing channel averaging on each of the audio signal frames to obtain a processed average audio signal frame; and obtaining the audio sampling rate of the average audio signal frame and converting the audio sampling rate into a target audio sampling rate to obtain the standard audio signal of the average audio signal frame.

3. The method according to claim 2, characterized in that extracting features from the standard audio signal to obtain extracted time-domain audio features and frequency-domain audio features includes: sampling the standard audio signal according to the target audio sampling rate to obtain the sampled time-domain audio features; and performing a short-time Fourier transform on the time-domain audio features to obtain the transformed frequency-domain audio features.

4. The method according to any one of claims 1 to 3, characterized in that upmixing the time-domain audio features and the frequency-domain audio features based on the three-dimensional sound source position coordinates to obtain a processed target audio signal with three-dimensional sound channels comprises: normalizing the time-domain audio features and the frequency-domain audio features respectively to obtain normalized time-domain features and normalized frequency-domain features; performing a feature synthesis transformation on the normalized time-domain features and the normalized frequency-domain features to obtain synthesized latent space features; and analyzing, separating, and recombining the latent space features based on the three-dimensional sound source position coordinates to obtain the target audio signal with three-dimensional sound channels.

5. The method according to any one of claims 1 to 3, characterized in that upmixing the time-domain audio features and the frequency-domain audio features based on the three-dimensional sound source position coordinates to obtain a processed target audio signal with three-dimensional sound channels comprises: inputting the three-dimensional sound source position coordinates, the time-domain audio features, and the frequency-domain audio features into a pre-trained audio upmixing model for upmixing, to obtain the target audio signal with three-dimensional sound channels output by the audio upmixing model.

6. The method according to claim 5, characterized in that the audio upmixing model is generated by: constructing a training dataset based on original audio track signals, the training dataset comprising the original audio track signals, sampled three-dimensional sound source position coordinate samples, and rendered sample audio signals with three-dimensional sound channels; inputting the original audio track signals and the three-dimensional sound source position coordinate samples into a convolutional attention neural network for upmixing to obtain a predicted audio signal with three-dimensional sound channels; and training the convolutional attention neural network based on the sample audio signals, the predicted audio signal, and a preset mean squared error loss function until convergence, to obtain the trained audio upmixing model.

7. The method according to claim 6, characterized in that constructing the training dataset based on the original audio track signals comprises: acquiring original audio track signals of a plurality of different sound sources; uniformly sampling a preset space with a preset step size along the three coordinate axes of the three-dimensional sound channels to obtain the three-dimensional sound source position coordinate samples; performing layout rendering on the plurality of original audio track signals based on the three-dimensional sound source position coordinate samples to obtain the rendered sample audio signals with three-dimensional sound channels; and constructing the training dataset based on the plurality of original audio track signals, the three-dimensional sound source position coordinate samples, and the sample audio signals.

8. An audio processing apparatus, characterized in that the apparatus comprises: a data acquisition module, configured to acquire an audio signal to be processed and configured three-dimensional sound source position coordinates, wherein the audio signal to be processed is a stereo audio signal; a preprocessing module, configured to preprocess the audio signal to be processed to obtain a preprocessed standard audio signal; a feature extraction module, configured to perform feature extraction on the standard audio signal to obtain time-domain audio features and frequency-domain audio features; and a processing module, configured to upmix the time-domain audio features and the frequency-domain audio features according to the three-dimensional sound source position coordinates to obtain a processed target audio signal with three-dimensional sound channels.

9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
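
By way of illustration only, the preprocessing recited in claim 2, the feature extraction recited in claim 3, and the coordinate sampling recited in claim 7 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of this application: the function names, the use of linear interpolation for sampling-rate conversion, the Hann window, and the cubic sampling space are all hypothetical choices that the text does not specify.

```python
import cmath
import math


def frame_signal(stereo, frame_len, frame_shift):
    """Split a list of (left, right) samples into frames of frame_len samples,
    advancing by frame_shift samples each time (claim 2, framing step)."""
    return [stereo[i:i + frame_len]
            for i in range(0, len(stereo) - frame_len + 1, frame_shift)]


def channel_average(frame):
    """Average the two channels of one frame (claim 2, channel averaging)."""
    return [(left + right) / 2.0 for left, right in frame]


def resample_linear(samples, src_rate, dst_rate):
    """Convert the sampling rate by linear interpolation.
    Assumption: the source text does not name a resampling method."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for k in range(n_out):
        pos = k * src_rate / dst_rate          # position in the input signal
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(samples[i] * (1.0 - frac) + nxt * frac)
    return out


def dft_magnitude(frame):
    """Magnitude spectrum of one Hann-windowed frame, i.e. one column of a
    short-time Fourier transform (claim 3). The window is an assumption."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * t / n))
                for t, x in enumerate(frame)]
    return [abs(sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def extract_features(stereo, src_rate, dst_rate, frame_len, frame_shift):
    """Run framing, channel averaging, rate conversion, and the per-frame
    transform; returns (time-domain features, frequency-domain features)."""
    time_feats = [resample_linear(channel_average(f), src_rate, dst_rate)
                  for f in frame_signal(stereo, frame_len, frame_shift)]
    freq_feats = [dft_magnitude(f) for f in time_feats]
    return time_feats, freq_feats


def sample_positions(lo, hi, step):
    """Uniformly sample the cube [lo, hi]^3 with a fixed step along each of
    the three axes (claim 7). The cubic space is an assumption."""
    count = int(round((hi - lo) / step)) + 1
    axis = [lo + i * step for i in range(count)]
    return [(x, y, z) for x in axis for y in axis for z in axis]
```

For example, `extract_features(stereo, 48000, 16000, 1024, 512)` would return one list of time-domain frames and one list of magnitude spectra for a stereo input given as `(left, right)` pairs, and `sample_positions(-1.0, 1.0, 0.5)` would return the grid of candidate sound source positions. In practice a library resampler and a vectorized short-time Fourier transform would likely replace the plain-Python loops shown here.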