A binaural audio generation method and system that fuses position and audio general representations

By integrating position and universal audio representation into a binaural audio generation method, and utilizing a relative position information extractor and a universal audio representation extractor, the problem of converting single-channel audio into binaural stereo audio is solved, thereby improving the audio immersion and accuracy in virtual reality and augmented reality.

CN117789692B · Active · Publication Date: 2026-06-23 · XIAMEN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIAMEN UNIV
Filing Date
2024-01-08
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively convert single-channel audio into binaural stereo audio, resulting in insufficient audio immersion in virtual reality and augmented reality. Furthermore, existing methods cannot accurately capture the temporal patterns and dynamic characteristics of audio signals.

Method used

A binaural audio generation method that integrates location and general audio representation is adopted. The binaural audio reconstruction model is trained and optimized by combining video frames and audio data through a relative location information extractor, a general audio representation extractor, and a mask generation module.

Benefits of technology

It improves the accuracy and immersion of binaural audio generation, enhances the auditory experience in virtual reality and augmented reality, and ensures that the temporal patterns and dynamic characteristics of audio signals are effectively captured.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN117789692B_ABST
Patent Text Reader

Abstract

The application discloses a binaural audio generation method and system that fuse position and general audio representations. The method comprises: S1, creating a video frame dataset and an audio dataset; S2, applying a short-time Fourier transform and further calculation to the audio dataset to obtain the corresponding complex spectrogram, amplitude spectrogram and phase spectrogram; S3, inputting the video frame dataset, the audio dataset and the corresponding spectrograms into a binaural audio reconstruction model comprising a relative position information extractor, a general audio representation extractor and a mask generation module, for training and optimization; S4, performing binaural audio reconstruction with the trained binaural audio reconstruction model. The proposed network model can effectively extract the relative position information of sound sources in video frames and obtain more effective general audio representations to guide the generation of binaural audio, thereby improving system performance.

Description

Technical Field

[0001] This invention relates to the field of audio restoration, specifically to a binaural audio generation method and system that integrates location and general audio representation.

Background Technology

[0002] When a person receives sound input through both ears, the brain accurately determines the direction of the sound source by comparing information such as interaural level difference (ILD) and interaural time difference (ITD). This auditory direction perception ability can be used to precisely locate the direction of a sound source in the real world. In virtual reality (VR) and augmented reality (AR) applications, videos with binaural stereo sound can deeply immerse users in the virtual environment visually and aurally, enhancing the sense of immersion. However, due to limitations in recording equipment, most of the acquired audio and video data usually contains single-channel audio. Therefore, how to effectively convert single-channel audio into binaural stereo audio has become an important research direction.

[0003] With the advancement of spatial audio generation technology, synchronous video frame information and single-channel audio information are combined to generate dual-channel stereo audio, providing viewers with a more realistic, immersive, and interactive stereo audiovisual experience.

[0004] However, in audio streaming, previous methods primarily focused on extracting and reconstructing the complex spectrum features of audio. The complex spectrum is a complex frequency domain representation that shows the amplitude and phase information of a signal in the frequency domain. It provides a static representation of the entire audio segment and cannot fully capture the temporal patterns and dynamic characteristics of the audio signal. Deep learning models can automatically learn more abstract and complex features from audio data and generate general representations. These general audio representations can consider the continuity and dynamic changes of the audio signal over time, helping to more accurately capture the temporal patterns and evolutionary trends in audio.

[0005] Binaural audio can be used to infer the location of a sound source, and sound sources at different locations produce different binaural auditory perceptions. In the mono-to-binaural task accompanied by video frames, it is not feasible to directly use monaural audio to predict binaural audio, because neural networks have difficulty learning effective mask information. To solve this problem, this invention uses monaural audio to predict the difference between the two binaural channels, and reconstructs the binaural audio from the monaural audio and the predicted difference.

Summary of the Invention

[0006] To address the technical problem that existing technologies have difficulty predicting binaural audio effectively from generated mask information, this application proposes a binaural audio generation method and system that integrates location and general audio representation.

[0007] According to one aspect of the present invention, a binaural audio generation method that fuses location and general audio representation is proposed, comprising:

[0008] S1, create video frame datasets and audio datasets;

[0009] S2, perform a short-time Fourier transform and calculation on the audio dataset to obtain the corresponding complex spectrum, amplitude spectrum, and phase spectrum;

[0010] S3, input the video frame dataset, audio dataset and their corresponding spectrograms into the binaural audio reconstruction model containing the relative position information extractor, the general audio representation extractor and the mask generation module for training and optimization;

[0011] S4, Perform binaural audio reconstruction based on the trained binaural audio reconstruction model.

[0012] Preferably, S1 specifically includes acquiring multiple audio segments with a length of one second, and selecting a video frame from the corresponding video segments as the input video frame to create a video frame dataset.

[0013] Preferably, the audio dataset includes a mixed audio dataset and a differential audio dataset. The left and right channel audio data $x_l$ and $x_r$ of the binaural audio corresponding to each audio segment are obtained. Adding them gives the mixed audio data $x_{mono} = x_l + x_r$, which forms the mixed audio dataset; subtracting them gives the differential audio data $x_{diff} = x_l - x_r$, which forms the differential audio dataset.
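The construction of the two signals can be sketched sample by sample as follows. This is a minimal illustration only: the function name `mix_and_diff` and the toy sample values are assumptions introduced here, not part of the patent.

```python
def mix_and_diff(x_left, x_right):
    """Build the mixed (sum) and differential (difference) signals.

    x_left, x_right: equal-length sequences of samples for the two channels.
    Returns (x_mono, x_diff) with x_mono = x_l + x_r and x_diff = x_l - x_r.
    """
    if len(x_left) != len(x_right):
        raise ValueError("left and right channels must have the same length")
    x_mono = [l + r for l, r in zip(x_left, x_right)]
    x_diff = [l - r for l, r in zip(x_left, x_right)]
    return x_mono, x_diff


# Toy three-sample example (values chosen only for illustration).
x_mono, x_diff = mix_and_diff([0.5, -0.25, 0.0], [0.25, 0.25, -0.5])
```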

[0014] Further preferably, the mixed audio data and the differential audio data are subjected to a short-time Fourier transform and calculation to obtain the corresponding complex spectra $S_{mono}$ and $S_{diff}$. The amplitude spectrum of the differential audio data $x_{diff}$ is calculated as:

[0015] $S_{mag} = \|S_{diff}\|_2$

[0016] The phase spectrum of the differential audio data $x_{diff}$ is calculated as:

[0017] $S_{phs} = \arctan\dfrac{\mathrm{Imag}(S_{diff})}{\mathrm{Real}(S_{diff})}$

[0018] where the Real() operator takes the real part of $S_{diff}$ and the Imag() operator takes the imaginary part of $S_{diff}$.
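As a hedged illustration of the amplitude and phase calculation, the sketch below splits a small complex spectrogram into the two quantities. It assumes the phase is taken with the four-quadrant arctangent of the imaginary and real parts (which agrees with the plain arctangent of their ratio whenever the real part is positive); the function name and the toy values are hypothetical.

```python
import cmath


def amplitude_and_phase(spec):
    """Split a complex spectrogram into amplitude and phase spectrograms.

    spec: 2-D list (frequency bins x time frames) of complex STFT values.
    Assumption: phase = atan2(Imag, Real), i.e. the four-quadrant arctangent.
    """
    s_mag = [[abs(c) for c in row] for row in spec]
    s_phs = [[cmath.phase(c) for c in row] for row in spec]
    return s_mag, s_phs


# Toy 2 x 2 "spectrogram" (illustrative values only).
s_diff = [[3 + 4j, 1 + 1j], [0 + 2j, -1 + 0j]]
s_mag, s_phs = amplitude_and_phase(s_diff)
```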

[0019] More preferably, the relative position information extractor described in S3 is a pre-trained visual image model composed of a multi-layer convolutional neural network, which extracts potential position information from video frames to obtain visual feature vectors.

[0020] The general audio representation extractor consists of a speech pre-trained model and a multi-layer downsampling convolutional neural network, which extracts the general audio representation of mixed audio data.

[0021] The mask generation module consists of two parts: an encoder and a decoder. The complex spectrogram of the mixed audio data is input into the encoder for encoding to obtain the encoded features of the mixed audio data. The encoded features, audio general representations, and visual feature vectors are concatenated according to the channel dimension and input into the decoder to obtain the mask.

[0022] Preferably, the steps described in S3 specifically include:

[0023] S31, the video frame data are input into the relative position information extractor to extract the visual feature vector $F_V$;

[0024] S32, the mixed audio data $x_{mono}$ are input into the general audio representation extractor to extract the general audio representation $F_A$;

[0025] S33, the complex spectrogram $S_{mono}$ of the mixed audio data is input into the encoder of the mask generation module and processed through N downsampling operations. The feature map before each downsampling is retained, indexed by i in ascending order. The number of channels is doubled at each downsampling, and the output $F_{en}$ after the last downsampling serves as the encoded feature, which is concatenated with the visual feature vector $F_V$ and the general audio representation $F_A$ along the channel dimension as $(F_{en}, F_A, F_V)$;

[0026] S34, the concatenation result is input into the decoder and N upsampling operations are performed. The feature maps after upsampling are indexed by i in descending order; each is concatenated along the channel dimension with the corresponding encoder feature map before the next upsampling is performed. The number of channels is halved after each upsampling, and the output of the last upsampling is used as the generated mask;

[0027] S35, the generated mask is multiplied by the corresponding input $S_{mono}$ to obtain the complex spectrogram $S_{pre}$ of the predicted differential audio data.
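Step S35 can be sketched as an element-wise product. The sketch assumes the mask is complex-valued and has the same shape as the mixed-audio spectrogram, which the text does not state explicitly; `apply_mask` and the toy values are hypothetical.

```python
def apply_mask(mask, s_mono):
    """Element-wise product of a predicted mask and the mixed-audio spectrogram.

    Both arguments are 2-D lists (frequency bins x time frames) of complex
    numbers. Assumption: the mask is complex-valued and shape-matched.
    """
    if len(mask) != len(s_mono) or any(
        len(m_row) != len(s_row) for m_row, s_row in zip(mask, s_mono)
    ):
        raise ValueError("mask and spectrogram must have the same shape")
    return [
        [m * s for m, s in zip(m_row, s_row)]
        for m_row, s_row in zip(mask, s_mono)
    ]


# Toy 2 x 2 example (illustrative values only).
s_mono = [[1 + 1j, 2 + 0j], [0 + 1j, -1 - 1j]]
mask = [[0.5 + 0j, 0 + 1j], [1 + 0j, 0.5 + 0.5j]]
s_pre = apply_mask(mask, s_mono)
```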

[0028] Preferably, the loss function used in the binaural audio reconstruction model is $L = \alpha_1 L_{stft} + \alpha_2 L_{mag} + \alpha_3 L_{phs} + \alpha_4 L_{rec}$, where $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are set empirically.

[0029]

[0030]

[0031]

[0032]

[0033] where $\|\cdot\|_2^2$ denotes the sum of squared Euclidean distance errors, also known as the L2 loss. The predicted amplitude and phase spectra are obtained from the network prediction $S_{pre}$. The formulas for $S_{rec}$ and its predicted counterpart are as follows:

[0034] $S_{rec} = [S_{mag}\cos(S_{phs}),\ S_{mag}\sin(S_{phs})]$

[0035]

[0036] $S_{rec}$ consists of the real and imaginary parts of the complex spectrum reconstructed from $S_{mag}$ and $S_{phs}$; likewise, its predicted counterpart consists of the real and imaginary parts reconstructed from the predicted amplitude and phase spectra.
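The reconstruction in [0034] amounts to converting polar coordinates back to rectangular ones. A minimal sketch, with a hypothetical function name and toy values:

```python
import math


def reconstruct_real_imag(s_mag, s_phs):
    """Rebuild (real, imag) parts from amplitude and phase spectrograms.

    real = S_mag * cos(S_phs), imag = S_mag * sin(S_phs), element-wise.
    """
    real = [
        [m * math.cos(p) for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(s_mag, s_phs)
    ]
    imag = [
        [m * math.sin(p) for m, p in zip(m_row, p_row)]
        for m_row, p_row in zip(s_mag, s_phs)
    ]
    return real, imag


# Toy 1 x 2 example: amplitude 2 at phase 0, amplitude 1 at phase pi/2.
real, imag = reconstruct_real_imag([[2.0, 1.0]], [[0.0, math.pi / 2]])
```

Applying the same function to the predicted amplitude and phase spectra would give the predicted counterpart used in the reconstruction loss.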

[0037] More preferably, S4 specifically includes: after the complex spectrogram $S_{pre}$ of the predicted differential audio data is obtained, the predicted differential audio data $x_{pre}$ is obtained through an inverse short-time Fourier transform, and $x_{pre}$ and $x_{mono}$ are used to restore the left and right channel audio.
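Assuming the predicted difference $x_{pre}$ approximates $x_l - x_r$, solving the two definitions $x_{mono} = x_l + x_r$ and $x_{diff} = x_l - x_r$ gives $x_l = (x_{mono} + x_{pre})/2$ and $x_r = (x_{mono} - x_{pre})/2$. The sketch below applies that algebra; the function name and sample values are hypothetical.

```python
def recover_channels(x_mono, x_pre):
    """Invert x_mono = x_l + x_r and x_diff = x_l - x_r.

    x_pre is the predicted differential signal standing in for x_diff.
    Returns (x_left, x_right).
    """
    if len(x_mono) != len(x_pre):
        raise ValueError("signals must have the same length")
    x_left = [(m + d) / 2 for m, d in zip(x_mono, x_pre)]
    x_right = [(m - d) / 2 for m, d in zip(x_mono, x_pre)]
    return x_left, x_right


# Toy example: a perfect prediction recovers the original channels exactly.
x_left, x_right = recover_channels([0.75, 0.0, -0.5], [0.25, -0.5, 0.5])
```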

[0038] According to a second aspect of the present invention, a binaural audio generation system that fuses location and general audio representations is proposed, comprising the following modules:

[0039] Data collection module: used to create video frame datasets and audio datasets;

[0040] Data processing module: used to perform short-time Fourier transform and calculation on audio datasets to obtain the corresponding complex spectrum, amplitude spectrum and phase spectrum;

[0041] Model training module: Used to train and optimize a binaural audio reconstruction model that includes a relative position information extractor, a general audio representation extractor, and a mask generation module by inputting video frame datasets, audio datasets, and their corresponding spectrograms.

[0042] Binaural audio prediction module: Performs binaural audio reconstruction based on the trained binaural audio reconstruction model.

[0043] Thirdly, embodiments of the present invention provide a computer-readable medium having a computer program stored thereon, which, when executed by a processor, performs the method as described in any of the embodiments of the first aspect.

[0044] Fourthly, embodiments of the present invention provide a computing system including a processor and a memory, the processor being configured to perform the method as described in any of the embodiments of the first aspect.

[0045] Compared with the prior art, the beneficial results of the present invention are as follows:

[0046] 1. A relative position information extractor is created in the visual stream to extract the relative position information of potential objects in video frames. It has only one branch, which reduces the complexity of the visual stream network while ensuring significant performance improvement.

[0047] 2. In the audio stream, this invention adds a general audio representation extractor to extract general audio representations. Combined with the original complex spectrum features of the audio, this further improves system performance. In summary, the network model proposed in this invention can effectively extract the relative position information of sound sources in video frames, obtain more effective general audio representations, and guide the generation of binaural audio, thereby improving system performance.

Attached Figure Description

[0048] The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and, together with the description, serve to explain the principles of the invention. Other embodiments and many anticipated advantages of the embodiments will be readily recognized as they become better understood through reference to the following detailed description. Elements in the drawings are not necessarily to scale. The same reference numerals refer to corresponding similar parts.

[0049] Figure 1 shows a flowchart of a binaural audio generation method that integrates location and general audio representation according to an embodiment of the present invention.

[0050] Figure 2 shows a schematic diagram of the structure of a binaural audio generation model that integrates location and general audio representation according to the present invention.

[0051] Figure 3 shows a schematic diagram of the mask generation module according to the present invention.

[0052] Figure 4 shows a schematic diagram of a binaural audio generation system that integrates location and general audio representation according to the present invention.

[0053] Figure 5 is a schematic diagram of the structure of a computer system suitable for implementing electronic devices according to embodiments of the present invention.

Detailed Implementation

[0054] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and not intended to limit it. Furthermore, it should be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0055] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0056] Figure 1 shows a schematic flowchart of a binaural audio generation method that fuses location and general audio representation according to an embodiment of the present invention. As shown in Figure 1, the specific steps include:

[0057] S1, Create video frame datasets and audio datasets;

[0058] S2, perform short-time Fourier transform and calculation on the audio dataset to obtain the corresponding complex spectrum, amplitude spectrum and phase spectrum;

[0059] S3, input the video frame dataset, audio dataset and their corresponding spectrograms into the binaural audio reconstruction model containing the relative position information extractor, the general audio representation extractor and the mask generation module for training and optimization;

[0060] S4, Perform binaural audio reconstruction based on the trained binaural audio reconstruction model.

[0061] In a specific embodiment, S1 specifically includes acquiring multiple audio segments with a length of 0.63s, and selecting video frames at the midpoint of the corresponding video segments as input video frames to create a video frame dataset.
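The segment and frame selection can be sketched as index arithmetic. The sampling rate, frame rate and clip length below are assumptions used only to make the example concrete, and `sample_segment` is a hypothetical helper.

```python
import random


def sample_segment(clip_seconds=10.0, segment_seconds=0.63,
                   sample_rate=16000, fps=10, rng=random):
    """Pick a random audio segment and the video frame at its midpoint.

    Assumptions (not stated at this point in the text): sample_rate, fps and
    clip_seconds. Returns (start_sample, end_sample, frame_index).
    """
    n_clip = int(round(clip_seconds * sample_rate))
    n_seg = int(round(segment_seconds * sample_rate))
    start = rng.randint(0, n_clip - n_seg)          # inclusive bounds
    mid_time = (start + n_seg / 2) / sample_rate    # seconds into the clip
    total_frames = int(clip_seconds * fps)
    frame_index = min(int(mid_time * fps), total_frames - 1)
    return start, start + n_seg, frame_index


# Seeded generator so the draw is repeatable.
start, end, frame_index = sample_segment(rng=random.Random(0))
```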

[0062] Preferably, the audio dataset includes a mixed audio dataset and a differential audio dataset. The left and right channel audio data $x_l$ and $x_r$ of the binaural audio corresponding to each audio segment are obtained. Adding them gives the mixed audio data $x_{mono} = x_l + x_r$, which forms the mixed audio dataset; subtracting them gives the differential audio data $x_{diff} = x_l - x_r$, which forms the differential audio dataset.

[0063] Further preferably, the mixed audio data and the differential audio data are subjected to a short-time Fourier transform and calculation to obtain the corresponding complex spectra $S_{mono}$ and $S_{diff}$. The amplitude spectrum of the differential audio data $x_{diff}$ is calculated as:

[0064] $S_{mag} = \|S_{diff}\|_2$

[0065] The phase spectrum of the differential audio data $x_{diff}$ is calculated as:

[0066] $S_{phs} = \arctan\dfrac{\mathrm{Imag}(S_{diff})}{\mathrm{Real}(S_{diff})}$

[0067] where the Real() operator takes the real part of $S_{diff}$ and the Imag() operator takes the imaginary part of $S_{diff}$.

[0068] In a specific embodiment, as shown in Figure 2, the binaural audio reconstruction model in S3 includes a relative position information extractor, a general audio representation extractor, and a mask generation module. STFT denotes the short-time Fourier transform, and ISTFT denotes the inverse short-time Fourier transform.

[0069] The relative position information extractor described in S3 is a pre-trained visual image model that includes a multi-layer convolutional neural network. It extracts potential position information from video frames to obtain visual feature vectors. This relative position information extractor is pre-trained on tasks such as object detection and image classification.

[0070] The general audio representation extractor consists of a speech pre-trained model and a multi-layer downsampling convolutional neural network, which extracts the general audio representation of mixed audio data.

[0071] The mask generation module consists of two parts: an encoder and a decoder. The complex spectrogram of the mixed audio data is input into the encoder for encoding to obtain the encoded features of the mixed audio data. The encoded features, audio general representations, and visual feature vectors are concatenated according to the channel dimension and input into the decoder to obtain the mask.

[0072] In a preferred embodiment, the relative position information extractor uses the ResNet-101 pre-trained object detection model as the backbone feature extractor, and then performs dimensionality reduction on the extracted features through 1×1 convolution to extract visual feature vectors. The mask generation module uses a U-Net neural network as the backbone structure, which is divided into an encoder and a decoder. The audio general representation extractor consists of the speech pre-trained model WavLM and U-net_encoder. The U-net_encoder model structure is the same as the U-Net encoder in the mask generation module, but the training parameters are not shared. The output of the seventh hidden layer of WavLM is used as the input of U-net_encoder after passing through two 1×1 convolutions.
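A 1×1 convolution reduces dimensionality by mixing channels independently at each spatial location. The following pure-Python sketch shows that operation without bias; the function name, weights and feature values are hypothetical and chosen only for illustration.

```python
def conv1x1(features, weight):
    """Apply a bias-free 1x1 convolution to a C_in x H x W feature map.

    weight: C_out x C_in matrix. Each output pixel is a linear mix of the
    input channels at the same (y, x) location.
    """
    c_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                sum(w_row[c] * features[c][y][x] for c in range(c_in))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for w_row in weight
    ]


# Toy example: reduce 3 channels to 1 on a 1 x 2 feature map.
features = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
weight = [[1.0, 0.0, -1.0]]
reduced = conv1x1(features, weight)
```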

[0073] The ResNet-101 model is a general-purpose deep residual network model, consisting of five convolutional modules, pooling modules, and fully connected layers. The WavLM model comprises a convolutional encoder (CNN Encoder) and a Transformer encoder. The CNN Encoder has seven layers, each containing a temporal convolutional layer, a layer normalization layer, and a GELU activation function layer. The Transformer encoder consists of multiple identical stacked layers, each with two sub-layers: the first is a multi-head self-attention layer, and the second is a position-wise feed-forward network. Each sub-layer employs residual connections.

[0074] U-Net consists of a symmetrical encoder and decoder. The encoder has four downsampling modules, each consisting of two convolutional layers with 3×3 kernels and ReLU activation, followed by a 2×2 max-pooling layer. The decoder has four upsampling modules, each consisting of a deconvolutional layer with 2×2 kernels, a feature concatenation layer, and two convolutional layers with 3×3 kernels and ReLU activation.
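The effect of the four downsampling modules on tensor shapes can be traced with simple arithmetic. This sketch assumes that the convolutions preserve spatial size, that each 2×2 max-pool halves height and width, and that channels double at each downsampling; the starting channel count and input size are hypothetical.

```python
def trace_encoder(channels, height, width, levels=4):
    """Trace (channels, height, width) through successive downsamplings.

    Returns the list of feature-map shapes kept before each downsampling
    (usable later as skip connections) and the shape after the last one.
    """
    skips = []
    for _ in range(levels):
        skips.append((channels, height, width))  # kept before this downsampling
        channels, height, width = channels * 2, height // 2, width // 2
    return skips, (channels, height, width)


# Hypothetical starting point: 64 channels on a 256 x 64 spectrogram map.
skips, f_en_shape = trace_encoder(channels=64, height=256, width=64)
```

With these assumed values the encoded feature is 16 times smaller than the input along each spatial axis, which is the resolution at which the audio and visual features are concatenated.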

[0075] This invention proposes a binaural audio generation model that integrates positional and general audio representations, based on the WavLM, U-Net, and ResNet-101 models. This model, in addition to extracting relative positional features, adds a general audio representation extractor to extract general audio representations. These general audio representations are the high-dimensional audio features extracted from the audio file through a neural network. By combining original complex spectral features, high-dimensional audio depth features, and positional features, the generation of binaural audio is better guided.

[0076] As shown in Figure 3, the rectangular blocks in the diagram represent the feature maps generated after processing by convolutional layers, pooling layers, or concatenation layers, and the numbers on the feature maps represent the corresponding number of channels.

[0077] The steps described in S3 specifically include:

[0078] S31, the video frame data are input into the relative position information extractor to extract the visual feature vector $F_V$;

[0079] S32, the mixed audio data $x_{mono}$ are input into the general audio representation extractor to extract the general audio representation $F_A$;

[0080] S33, the complex spectrogram $S_{mono}$ of the mixed audio data is input into the encoder of the mask generation module and processed through four downsampling operations. The feature map before each downsampling is retained, indexed by i in ascending order. The number of channels is doubled at each downsampling, and the output $F_{en}$ after the last downsampling serves as the encoded feature, which is concatenated with the visual feature vector $F_V$ and the general audio representation $F_A$ along the channel dimension as $(F_{en}, F_A, F_V)$;

[0081] S34, the concatenation result is input into the decoder and N upsampling operations are performed, with N=4 in this embodiment. The feature maps after upsampling are indexed by i in descending order; each is concatenated along the channel dimension with the corresponding encoder feature map. Fusing low-level and high-level features in this way enhances the transfer of feature information between layers and lets the model perceive richer feature information. The next upsampling is then performed; the number of channels is halved after each upsampling, and the output of the last upsampling is used as the generated mask.

[0082] In a specific embodiment, the loss function used in the binaural audio reconstruction model is $L = \alpha_1 L_{stft} + \alpha_2 L_{mag} + \alpha_3 L_{phs} + \alpha_4 L_{rec}$, where $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$ are set empirically.

[0083]

[0084]

[0085]

[0086]

[0087] where $\|\cdot\|_2^2$ denotes the sum of squared Euclidean distance errors, also known as the L2 loss. The predicted amplitude and phase spectra are obtained from the network prediction $S_{pre}$. The formulas for $S_{rec}$ and its predicted counterpart are as follows:

[0088] $S_{rec} = [S_{mag}\cos(S_{phs}),\ S_{mag}\sin(S_{phs})]$

[0089]

[0090] $S_{rec}$ consists of the real and imaginary parts of the complex spectrum reconstructed from $S_{mag}$ and $S_{phs}$; likewise, its predicted counterpart consists of the real and imaginary parts reconstructed from the predicted amplitude and phase spectra.
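How the four terms combine can be sketched as a weighted sum of squared L2 distances. This is an assumption-laden illustration: it takes each term to be a squared L2 distance between a predicted quantity and its ground truth, and the weights, names and toy values are hypothetical rather than the values used in the invention.

```python
def squared_l2(pred, target):
    """Sum of squared differences between two equal-length flat sequences."""
    return sum(abs(p - t) ** 2 for p, t in zip(pred, target))


def total_loss(terms, weights):
    """Weighted sum L = a1*L_stft + a2*L_mag + a3*L_phs + a4*L_rec.

    terms: dict mapping each name to a (prediction, target) pair.
    weights: dict mapping each name to its coefficient (set empirically).
    """
    names = ("stft", "mag", "phs", "rec")
    return sum(weights[n] * squared_l2(*terms[n]) for n in names)


# Toy (prediction, target) pairs; all values are illustrative.
terms = {
    "stft": ([1 + 1j, 0j], [1 + 0j, 0j]),
    "mag": ([1.0, 2.0], [1.0, 1.0]),
    "phs": ([0.5], [0.0]),
    "rec": ([1.0, 0.0], [0.0, 0.0]),
}
weights = {"stft": 1.0, "mag": 0.5, "phs": 2.0, "rec": 1.0}
loss = total_loss(terms, weights)
```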

[0091] S4 specifically includes: after the complex spectrogram $S_{pre}$ of the predicted differential audio data is obtained, the predicted differential audio data $x_{pre}$ is obtained through an inverse short-time Fourier transform. $x_{pre}$ and $x_{mono}$ are then used to restore the left and right channels, giving the predicted audio data for the left channel and the predicted audio data for the right channel.

[0092] In a specific embodiment, real single-channel audio data are input into the binaural audio reconstruction model for prediction, and the mono audio is restored to left and right binaural channel outputs.

[0093] In a specific embodiment, during the training phase, an audio segment of 0.63s in length is randomly extracted from each 10s audio-visual segment, and a video frame is selected from the midpoint of the corresponding video segment. A Hann window is used for the STFT, with a window length of 400, an FFT size of 512, and a frame shift of 160. In the video stream, each video frame is resized to a resolution of 480×240 before normalization. The initial learning rate is set to 1e-4, and a linear decay learning rate strategy is employed.
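The STFT settings above determine the spectrogram size. The sketch below works through that arithmetic under two assumptions that the text does not state: a 16 kHz sampling rate, and the common library convention of centering frames by padding the signal on both sides. The helper name is hypothetical.

```python
def stft_shape(num_samples, n_fft=512, hop=160, center=True):
    """Return (frequency bins, frames) for a one-sided STFT.

    With center=True the signal is assumed to be padded by n_fft // 2 on both
    sides; otherwise only windows that fit entirely inside the signal count.
    """
    freq_bins = n_fft // 2 + 1
    if center:
        frames = 1 + num_samples // hop
    else:
        frames = 1 + (num_samples - n_fft) // hop
    return freq_bins, frames


SAMPLE_RATE = 16000  # assumption: the sampling rate is not given in the text
num_samples = round(0.63 * SAMPLE_RATE)
shape = stft_shape(num_samples)
```

Under these assumptions a 0.63s segment yields 257 frequency bins and 64 frames, a size that divides cleanly through four 2×2 pooling stages along the time axis.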

[0094] Table 1

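[0095]

| Serial Number | Complex Spectrum | WavLM | ResNet101 | STFT distance↓ | ENV distance↓ | Mag↓ | Phs(rad)↓ | SNR(dB)↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | √ | × | × | 1.3991 | 0.171 | 2.7983 | 1.5832 | 4.6743 |
| 2 | √ | √ | × | 1.3031 | 0.1623 | 2.5369 | 1.5603 | 5.0689 |
| 3 | √ | × | √ | 1.1936 | 0.1568 | 2.3971 | 1.4968 | 5.3891 |
| 4 | √ | √ | √ | 1.165 | 0.1555 | 2.33 | 1.4604 | 5.4604 |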

[0096] Table 1 compares the binaural audio reconstruction model of the present invention with a model that retains only the mask generation module, a model that retains the general audio representation extractor and the mask generation module, and a model that retains the relative position information extractor and the mask generation module. Downward arrows indicate that smaller values are better; upward arrows indicate that larger values are better. The metrics are: the STFT distance between the predicted and ground-truth binaural audio (the L2 difference after short-time Fourier transform); the ENV distance (the L2 difference between the envelopes of the time-domain audio signals, obtained through the Hilbert transform); the amplitude deviation (Mag) and phase deviation in radians (Phs), which are the L2 difference in amplitude and the L1 difference in phase between the complex spectra of the predicted and ground-truth audio; and the signal-to-noise ratio (SNR) between the two binaural signals.
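As one illustration of these metrics, the signal-to-noise ratio can be sketched as the ratio of reference energy to error energy in decibels. The exact definition used for Table 1 is not spelled out, so this common form is an assumption, and the function name and sample values are hypothetical.

```python
import math


def snr_db(reference, estimate):
    """Signal-to-noise ratio in dB: 10 * log10(signal energy / error energy).

    reference: ground-truth samples; estimate: predicted samples.
    Returns infinity when the estimate matches the reference exactly.
    """
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0:
        return math.inf
    return 10.0 * math.log10(signal / noise)


# Toy example: every sample is off by 0.1 on a unit-amplitude signal.
value = snr_db([1.0, -1.0, 1.0, -1.0], [0.9, -1.1, 0.9, -1.1])
```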

[0097] The table shows that:

[0098] (1) Comparing the binaural audio reconstruction model of the present invention (row 4) with the baseline model that includes neither the relative position information extractor nor the general audio representation extractor (row 1) shows that the full model performs better on every metric, indicating that combining the spectral features of the audio stream with visual stream information helps in the generation and reconstruction of binaural audio.

[0099] (2) Comparing the model without the general audio representation extractor (row 3) with the model of the present invention (row 4) shows that adding the general audio representation extractor lets the binaural audio reconstruction model predict and reconstruct binaural audio more accurately by exploiting high-dimensional audio features, and the results match the real binaural audio more closely.

[0100] (3) Comparing the model without the relative position information extractor (row 2) with the model of the present invention (row 4) shows that using the relative position information extractor to mine visual feature vectors and exploit the potential position information of objects makes the binaural audio more realistic.

[0101] Secondly, embodiments of the present invention also disclose a binaural audio generation system that integrates location and general audio representations, including: a data collection module 41, a data processing module 42, a model training module 43, and a binaural channel audio prediction module 44.

[0102] Data collection module 41: Used to create video frame datasets and audio datasets;

[0103] Data processing module 42: used to perform short-time Fourier transform and calculation on the audio dataset to obtain the corresponding complex spectrum, amplitude spectrum and phase spectrum;

[0104] Model training module 43: used to train and optimize a binaural audio reconstruction model that includes a relative position information extractor, a general audio representation extractor, and a mask generation module by inputting the video frame dataset, audio dataset, and their corresponding spectrograms into the model.

[0105] Binaural channel audio prediction module 44: Performs binaural audio reconstruction based on the trained binaural audio reconstruction model.

[0106] Reference is now made to Figure 5, which shows a schematic diagram of the structure of a computer system 500 suitable for implementing electronic devices according to embodiments of the present application. The electronic device shown in Figure 5 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0107] As shown in Figure 5, the computer system 500 includes a central processing unit (CPU) 501, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 502 or programs loaded from storage section 509 into random access memory (RAM) 504. The RAM 504 also stores various programs and data required for the operation of the system 500. The CPU 501, ROM 502, and RAM 504 are interconnected via a bus 505. An input / output (I / O) interface 506 is also connected to the bus 505.

[0108] The following components are connected to I / O interface 506: an input section 507 including a keyboard, mouse, etc.; an output section 508 including a liquid crystal display (LCD) and speakers, etc.; a storage section 509 including a hard disk, etc.; and a communication section 510 including a network interface card such as a LAN card and a modem, etc. The communication section 510 performs communication processing via a network such as the Internet. A drive 511 is also connected to I / O interface 506 as needed. A removable medium 512, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on drive 511 as needed so that computer programs read from it can be installed into storage section 509 as needed.

[0109] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 510, and / or installed from removable medium 512. When the computer program is executed by central processing unit (CPU) 501, it performs the functions defined in the methods of this application.

[0110] It should be noted that the computer-readable medium of this application can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.

[0111] Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof. Programming languages include object-oriented programming languages, such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0112] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than that indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0113] The modules described in the embodiments of this application can be implemented in software or in hardware.

[0114] In another aspect, this application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments, or may exist independently without being assembled into the electronic device. The aforementioned computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: create a video frame dataset and an audio dataset; perform a short-time Fourier transform and calculation on the audio dataset to obtain the corresponding complex spectrogram, amplitude spectrogram, and phase spectrogram; input the video frame dataset, the audio dataset, and their corresponding spectrograms into a binaural audio reconstruction model containing a relative position information extractor, a general audio representation extractor, and a mask generation module for training and optimization; and perform binaural audio reconstruction based on the trained binaural audio reconstruction model.
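The first two program steps (building the mixed and differential audio, then applying a short-time Fourier transform) can be sketched as follows. This is a minimal illustration only: the sample rate, FFT size, hop length, Hann window, and all function and variable names are assumptions that do not come from this application, and random data stands in for real audio.

```python
import numpy as np

def stft(signal, n_fft=512, hop=160):
    """Naive STFT: Hann-windowed frames followed by a one-sided FFT.

    Returns a complex array of shape (n_fft // 2 + 1, n_frames).
    n_fft and hop are hypothetical values chosen for illustration.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T

# Hypothetical one-second stereo clip at 16 kHz (random stand-in data).
rng = np.random.default_rng(0)
left = rng.standard_normal(16000)
right = rng.standard_normal(16000)

mixed = left + right   # mixed audio: sum of the two channels
diff = left - right    # differential audio: difference of the two channels

spec_mixed = stft(mixed)  # complex spectrogram of the mixed audio
spec_diff = stft(diff)    # complex spectrogram of the differential audio
```

Because the transform is linear, the spectrogram of the mixed audio equals the sum of the left and right spectrograms, which is why the model can work on the mixed and differential signals instead of the raw channels.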

[0115] The above description is merely a preferred embodiment of this application and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described inventive concept, for example, technical solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) this application.

Claims

1. A binaural audio generation method that integrates location and general audio representations, characterized in that it includes the following steps: S1, create a video frame dataset and an audio dataset, wherein the audio dataset includes a mixed audio dataset and a differential audio dataset: obtain the left-channel and right-channel audio data of the audio segment corresponding to the audio data, add the two channels to obtain the mixed audio data and create the mixed audio dataset, and subtract the two channels to obtain the differential audio data and create the differential audio dataset; S2, perform a short-time Fourier transform and calculation on the audio dataset to obtain the corresponding complex spectrogram, amplitude spectrogram, and phase spectrogram; S3, input the video frame dataset, the audio dataset, and their corresponding spectrograms into a binaural audio reconstruction model that includes a relative position information extractor, a general audio representation extractor, and a mask generation module, and train and optimize the model, which specifically includes: S31, input the video frame data into the relative position information extractor to extract the visual feature vector; S32, input the mixed audio data into the general audio representation extractor to extract the general audio representation; S33, input the complex spectrogram of the mixed audio data into the encoder of the mask generation module, where it undergoes N downsampling operations, the feature maps before each downsampling being indexed in ascending order and the number of channels being doubled at each downsampling, and use the output of the last downsampling as the encoded feature, which is concatenated with the visual feature vector and the general audio representation along the channel dimension; S34, input the concatenation result into the decoder of the mask generation module and perform N upsampling operations, the feature maps after each upsampling being indexed in descending order, wherein each upsampled feature map is concatenated with the corresponding encoder feature map along the channel dimension before the next upsampling is performed, the number of channels is halved after each upsampling, and the output of the last upsampling is used as the generated mask; S35, multiply the generated mask by the corresponding input complex spectrogram to obtain the complex spectrogram of the predicted differential audio data; S4, perform binaural audio reconstruction based on the trained binaural audio reconstruction model.
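The channel bookkeeping in steps S33 and S34 can be traced with a short sketch. It is one possible reading of the claim, not the claimed network itself: the base channel count, the feature dimensions, the two-channel mask, and the function name are all hypothetical, and the last upsampling is assumed to emit the mask directly without a further skip concatenation.

```python
def channel_plan(base_channels=16, n_levels=3, visual_dim=8, audio_dim=4,
                 mask_channels=2):
    """Trace channel counts through an encoder-decoder with skip connections.

    All default sizes are hypothetical; only the doubling, halving, and
    concatenation pattern follows the text of S33 and S34.
    """
    # Encoder: record the feature map before each downsampling (ascending),
    # doubling the channel count at every downsampling.
    encoder_maps = []
    channels = base_channels
    for _ in range(n_levels):
        encoder_maps.append(channels)
        channels *= 2
    encoded = channels  # output of the last downsampling

    # Bottleneck: encoded feature + visual vector + general audio representation,
    # concatenated along the channel dimension.
    decoder_in = encoded + visual_dim + audio_dim

    # Decoder: halve the channels at every upsampling and concatenate the
    # result with the matching encoder map (descending order).
    decoder_steps = []
    out = encoded
    for level in range(n_levels, 0, -1):
        out //= 2
        is_last = level == 1
        if is_last:
            out = mask_channels  # assumed: last upsampling emits the mask
        decoder_steps.append((decoder_in, out))
        if not is_last:
            decoder_in = out + encoder_maps[level - 1]
    return encoder_maps, encoded, decoder_steps

encoder_maps, encoded, decoder_steps = channel_plan()
```

Each tuple in `decoder_steps` gives the input and output channel count of one upsampling layer, which makes it easy to check that every skip concatenation pairs feature maps of the same resolution.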

2. The binaural audio generation method that integrates location and general audio representations according to claim 1, characterized in that S1 specifically includes: acquiring multiple audio segments, each one second long, and selecting a video frame from the corresponding video segment as the input video frame to create the video frame dataset.

3. The binaural audio generation method based on the fusion of location and general audio representation according to claim 2, characterized in that short-time Fourier transforms and calculations are performed on the mixed audio data and the differential audio data to obtain the corresponding complex spectrograms; the amplitude spectrogram of the differential audio is calculated as the L2 norm of its complex spectrogram; and the phase spectrogram of the differential audio is calculated from the real and imaginary parts of its complex spectrogram, where the Real() operator takes the real part and the Imag() operator takes the imaginary part.
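As an illustration of the amplitude and phase calculation in claim 3, the sketch below computes both from a complex spectrogram. The amplitude follows the L2 norm named in the claim; the exact phase formula is not legible in the text above, so the four-quadrant arctangent used here is an assumption, as are the function name and the sample values.

```python
import numpy as np

def amplitude_and_phase(spec):
    """Split a complex spectrogram into amplitude and phase spectrograms."""
    real = np.real(spec)   # Real(): real part of each bin
    imag = np.imag(spec)   # Imag(): imaginary part of each bin
    amplitude = np.sqrt(real ** 2 + imag ** 2)  # L2 norm of (real, imag)
    phase = np.arctan2(imag, real)              # assumed phase formula, in radians
    return amplitude, phase

# Tiny hand-made spectrogram standing in for a differential-audio spectrogram.
spec = np.array([[3 + 4j, -1 + 0j],
                 [0 - 2j, 1 + 1j]])
amplitude, phase = amplitude_and_phase(spec)

# Amplitude and phase together recover the original complex spectrogram.
rebuilt = amplitude * np.exp(1j * phase)
```

The round trip through `rebuilt` shows that no information is lost by working with the amplitude and phase spectrograms instead of the complex one.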

4. The binaural audio generation method that integrates location and general audio representations according to claim 1, characterized in that the relative position information extractor described in S3 is a pre-trained visual image model that includes a multi-layer convolutional neural network and extracts latent position information from video frames to obtain visual feature vectors; the general audio representation extractor consists of a pre-trained speech model and a multi-layer downsampling convolutional neural network and extracts the general audio representation of the mixed audio data; and the mask generation module consists of an encoder and a decoder, wherein the complex spectrogram of the mixed audio data is input into the encoder to obtain the encoded features of the mixed audio data, and the encoded features, the general audio representation, and the visual feature vectors are concatenated along the channel dimension and input into the decoder to obtain the mask.

5. The binaural audio generation method according to claim 1, characterized in that the loss function used in the binaural audio reconstruction model is a weighted combination of loss terms whose weights are set empirically, wherein each loss term is the sum of squared errors over the Euclidean distance, also known as the L2 loss; the predicted amplitude spectrogram and the predicted phase spectrogram are obtained from the model prediction; the complex spectrogram is reconstructed from the imaginary and real parts given by the amplitude spectrogram and the phase spectrogram; and, likewise, the predicted complex spectrogram is reconstructed from the imaginary and real parts given by the predicted amplitude spectrogram and the predicted phase spectrogram.
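The general shape of such a loss can be sketched as below. The individual terms and weights of claim 5 are not legible in the text above, so this sketch assumes three L2 terms (complex spectrogram, amplitude, phase) with hypothetical weights; the names `l2_loss`, `combined_loss`, `w_spec`, `w_amp`, and `w_phase` are illustrative and do not come from this application.

```python
import numpy as np

def l2_loss(pred, target):
    """Sum of squared errors; abs() keeps it valid for complex inputs."""
    return float(np.sum(np.abs(pred - target) ** 2))

def combined_loss(pred_spec, true_spec, w_spec=1.0, w_amp=1.0, w_phase=1.0):
    """Weighted sum of L2 terms (assumed structure, hypothetical weights)."""
    terms = {
        "spec": l2_loss(pred_spec, true_spec),
        "amp": l2_loss(np.abs(pred_spec), np.abs(true_spec)),
        "phase": l2_loss(np.angle(pred_spec), np.angle(true_spec)),
    }
    total = (w_spec * terms["spec"]
             + w_amp * terms["amp"]
             + w_phase * terms["phase"])
    return total, terms

# Toy example: the prediction is wrong only in the amplitude of the first bin.
true_spec = np.array([1 + 0j, 0 + 1j])
pred_spec = np.array([2 + 0j, 0 + 1j])
total, terms = combined_loss(pred_spec, true_spec)
```

In the toy example the phase term is zero because both bins keep their angle, while the complex and amplitude terms each contribute one unit of error.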

6. The binaural audio generation method according to claim 1, characterized in that S4 specifically includes: applying an inverse short-time Fourier transform to the complex spectrogram of the predicted differential audio data to obtain the predicted differential audio data, and using the mixed audio data and the predicted differential audio data to restore the left-ear and right-ear channel audio.
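The final restoration step follows directly from the definitions in S1: since the mixed audio is the sum of the two channels and the differential audio is their difference, the channels can be recovered by a half-sum and a half-difference. The sketch below assumes the differential audio is left minus right; the function name and sample values are illustrative.

```python
import numpy as np

def restore_channels(mixed, predicted_diff):
    """Recover left and right channels from mixed (L + R) and differential (L - R) audio."""
    left = (mixed + predicted_diff) / 2.0
    right = (mixed - predicted_diff) / 2.0
    return left, right

# Toy check with a perfect "prediction" of the differential audio.
left_true = np.array([0.5, -0.25, 1.0])
right_true = np.array([0.1, 0.75, -1.0])
mixed = left_true + right_true
diff = left_true - right_true

left, right = restore_channels(mixed, diff)
```

With a perfectly predicted differential signal the original channels are recovered exactly; in practice the quality of the restored channels depends on how close the predicted differential audio is to the true one.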

7. A binaural audio generation system that integrates location and general audio representations, characterized in that it includes the following modules: a data collection module, used to create a video frame dataset and an audio dataset, wherein the audio dataset includes a mixed audio dataset and a differential audio dataset: the left-channel and right-channel audio data of the audio segment corresponding to the audio data are obtained, the two channels are added to obtain the mixed audio data and create the mixed audio dataset, and the two channels are subtracted to obtain the differential audio data and create the differential audio dataset; a data processing module, used to perform a short-time Fourier transform and calculation on the audio dataset to obtain the corresponding complex spectrogram, amplitude spectrogram, and phase spectrogram; a model training module, used to input the video frame dataset, the audio dataset, and their corresponding spectrograms into a binaural audio reconstruction model that includes a relative position information extractor, a general audio representation extractor, and a mask generation module, and to train and optimize the model, which specifically includes: S31, input the video frame data into the relative position information extractor to extract the visual feature vector; S32, input the mixed audio data into the general audio representation extractor to extract the general audio representation; S33, input the complex spectrogram of the mixed audio data into the encoder of the mask generation module, where it undergoes N downsampling operations, the feature maps before each downsampling being indexed in ascending order and the number of channels being doubled at each downsampling, and use the output of the last downsampling as the encoded feature, which is concatenated with the visual feature vector and the general audio representation along the channel dimension; S34, input the concatenation result into the decoder of the mask generation module and perform N upsampling operations, the feature maps after each upsampling being indexed in descending order, wherein each upsampled feature map is concatenated with the corresponding encoder feature map along the channel dimension before the next upsampling is performed, the number of channels is halved after each upsampling, and the output of the last upsampling is used as the generated mask; S35, multiply the generated mask by the corresponding input complex spectrogram to obtain the complex spectrogram of the predicted differential audio data; and a binaural audio prediction module, used to perform binaural audio reconstruction based on the trained binaural audio reconstruction model.
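Step S35 of the model training module amounts to an element-wise complex multiplication. The sketch below assumes the mask is produced as two real-valued channels (real and imaginary parts) and is applied to the complex spectrogram of the mixed audio; both the two-channel layout and the function name are assumptions, not details stated in the claim.

```python
import numpy as np

def apply_complex_mask(mask_real, mask_imag, spec_mixed):
    """Multiply a complex mask, given as two real channels, with a complex spectrogram."""
    mask = mask_real + 1j * mask_imag
    return mask * spec_mixed  # element-wise complex product

# Tiny example: one frequency row, two time frames.
spec_mixed = np.array([[1 + 1j, 2 + 0j]])
mask_real = np.array([[0.0, 0.5]])
mask_imag = np.array([[1.0, 0.0]])

pred_spec_diff = apply_complex_mask(mask_real, mask_imag, spec_mixed)
```

A complex mask can change both the amplitude and the phase of each bin: in the example, the first bin is rotated by 90 degrees and the second is scaled by one half.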

8. A computer-readable medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the method as described in any one of claims 1-6.

9. A computing system comprising a processor and a memory, the processor being configured to perform the method as described in any one of claims 1-6.