Audio processing method and electronic device
By determining the direction of the main sound subject in a panoramic video and applying directional enhancement, the method addresses the noisy sound of panoramic video recordings, makes the main sound subject stand out clearly, and improves the listening experience.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ARASHI VISION INC
- Filing Date
- 2024-12-23
- Publication Date
- 2026-07-02
AI Technical Summary
When recording panoramic video, the ambient sound captured by the microphone is quite noisy, and existing noise reduction algorithms are not very effective, making it difficult to highlight the location of the main sound subject.
Omnidirectional audio data is acquired from the panoramic video data, the directional information of the main sound subject is determined, and a directional enhancement algorithm is applied based on that information. Combined with machine learning and noise reduction strategies, this enhances how the main sound subject is heard.
The method highlights the location of the main sound subject, enhances the user's listening experience, and improves the clarity and quality of the panoramic audio data.
Figure CN2024141400_02072026_PF_FP_ABST
Description
An audio processing method and electronic device
Technical Field
[0001] This application relates to the field of computer technology, and in particular to an audio processing method and an electronic device.
Background Technology
[0002] When recording panoramic video, the microphone captures all the sounds in the environment, resulting in a relatively noisy overall sound. Related technologies usually process the panoramic audio data with fixed noise reduction algorithms, which reduce noise poorly.
Summary of the Invention
[0003] This application provides an audio processing method, an electronic device, and a storage medium.
[0004] The technical solution of this application is implemented as follows:
[0005] This application provides an audio processing method, the method comprising:
[0006] Acquire panoramic video data to be processed, the panoramic video data including: omnidirectional audio data; determine at least one sound subject in the panoramic video data; process the omnidirectional audio data according to the location information of the sound subject to obtain target audio data.
[0007] This application provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the above-described audio processing method.
[0008] At least one embodiment of this application provides an electronic device, including:
[0009] A processor and a memory connected to the processor. The memory stores a computer program which, when executed by the processor, causes the processor to: acquire panoramic video data to be processed, the panoramic video data including omnidirectional audio data; determine at least one sound subject in the panoramic video data; and process the omnidirectional audio data based on the location information of the sound subject to obtain target audio data.
[0010] This application provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the steps of the audio processing method described above.
[0011] The embodiments of this application acquire panoramic video data to be processed, which includes omnidirectional audio data, determine at least one sound subject in the panoramic video data, and process the omnidirectional audio data based on the location information of the sound subject. This better highlights the location of the sound subject and enhances the user's listening experience.
Attached Figure Description
[0012] Figure 1 is a schematic diagram of the implementation flow of an audio processing method provided in an embodiment of this application;
[0013] Figure 2 is a schematic diagram of a user interface for an application (APP) provided in an embodiment of this application;
[0014] Figure 3 is a schematic diagram of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0015] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0016] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.
[0017] Figure 1 is a schematic flowchart illustrating the implementation of an audio processing method according to an embodiment of the present invention. The executing entity of the audio processing method is an electronic device. Referring to Figure 1, the audio processing method may include:
[0018] S101, Obtain panoramic video data to be processed, the panoramic video data including: omnidirectional audio data.
[0019] In one embodiment, the panoramic video data to be processed can be captured by a panoramic camera consisting of two or more fisheye lenses, resulting in raw spherical video. The omnidirectional audio data can be collected by the microphone built into the panoramic camera or by a separate microphone device.
[0020] In another embodiment, the panoramic video data to be processed can be acquired by a panoramic camera on the drone, and the omnidirectional audio data can be acquired by a microphone on the drone. The microphone can be an omnidirectional microphone or a microphone array consisting of multiple microphones.
[0021] S102, determine at least one sound subject in the panoramic video data.
[0022] The sound subject can be a human voice, a motorcycle engine sound, music, ambient sound, etc. The sound subject refers to the sound source corresponding to the sound that the user wants to retain.
[0023] Panoramic video data contains sound targets from multiple directions, and the location information of the sound subject can be determined by analyzing omnidirectional audio data. For example, Voice Activity Detection (VAD) and Direction of Arrival (DOA) algorithms can be used to determine the location information of the sound subject.
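The direction-of-arrival step mentioned above can be illustrated with a minimal sketch. It is not the method of this application: it assumes only two microphones, estimates the inter-microphone delay by brute-force cross-correlation, and converts that delay to an angle. The function names, the 5 cm microphone spacing and the 343 m/s speed of sound are assumptions made for the example, and voice activity detection is omitted.

```python
import math

def estimate_delay(ref, sig, max_lag):
    """Return the lag (in samples) of `sig` relative to `ref` that maximises cross-correlation."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += x * sig[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def delay_to_angle(lag, sample_rate, mic_distance, speed_of_sound=343.0):
    """Convert an inter-microphone delay to an arrival angle in degrees (0 = broadside)."""
    ratio = lag / sample_rate * speed_of_sound / mic_distance
    ratio = max(-1.0, min(1.0, ratio))  # guard against values just outside asin's domain
    return math.degrees(math.asin(ratio))

# Usage: the same pulse reaches the second microphone 3 samples later.
ref = [0.0] * 64
ref[20] = 1.0
sig = [0.0] * 64
sig[23] = 1.0
lag = estimate_delay(ref, sig, max_lag=8)
angle = delay_to_angle(lag, sample_rate=48000, mic_distance=0.05)
```

A real microphone array would typically use more channels and a more robust correlation measure; the sketch only shows how a time delay maps to a direction.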
[0024] One method to determine the sound subject from at least one direction is for the user to select the sound subject themselves. For example, the user can select a target on the panoramic video screen. Suppose the target selected by the user is a "person", then the sound subject is the human voice in the direction of the "person".
[0025] Algorithms can also be used to automatically determine the sound subject in at least one direction. For example, a sound source localization algorithm can be used to estimate the direction of the sound source with the loudest current volume as the location of the sound subject. Then, an image recognition algorithm (such as distance detection, human detection, lip movement detection, etc.) can be used to determine the type of the sound subject in that direction.
[0026] In addition, a sound subject from at least one direction can be identified by training a machine learning model or a neural network model.
[0027] Users can also directly input the location information of the sound subject. Users can specify a location on the panoramic video screen, and that location will be used as the location of the sound subject. The type of sound subject can be determined by image recognition of the user-specified target; for example, if the user-specified target is a person, then the sound subject will be a human voice.
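As an illustration of how a user-specified position on the panoramic picture could be turned into location information, the sketch below maps a pixel to an azimuth and elevation. It assumes the panoramic frame is stored in equirectangular projection, which this application does not state; `pixel_to_direction` is a hypothetical helper name.

```python
def pixel_to_direction(x, y, width, height):
    """Map a pixel in an equirectangular frame to (azimuth, elevation) in degrees."""
    azimuth = (x / width - 0.5) * 360.0     # -180 at the left edge, +180 at the right edge
    elevation = (0.5 - y / height) * 180.0  # +90 at the top edge, -90 at the bottom edge
    return azimuth, elevation

# Usage: a tap at the centre of a 3840x1920 frame points straight ahead.
centre = pixel_to_direction(1920, 960, 3840, 1920)
```

The resulting angles could then be passed to whatever directional processing is applied to the omnidirectional audio data.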
[0028] S103, the omnidirectional audio data is processed according to the directional information of the sound subject to obtain the target audio data.
[0029] For example, based on the location information of the sound subject, the sound subject in that location in the omnidirectional audio data can be enhanced to improve the auditory experience of the sound subject in that location.
[0030] The processing of the omnidirectional audio data here can include noise reduction. Based on the directional information of the sound subject, a directional enhancement algorithm can be used to enhance the sound subject in that direction. Depending on the type of sound subject, an artificial intelligence (AI) noise reduction model trained to preserve that type of target can also be used. For example, if the sound subject is a human voice, the AI noise reduction employs a strategy that preserves the human voice.
[0031] This application embodiment acquires panoramic video data to be processed, which includes omnidirectional audio data, determines at least one sound subject in the panoramic video data, and processes the omnidirectional audio data based on the sound subject's location information, which can better highlight the sound subject's location and enhance the user's listening experience.
[0032] In some embodiments, the method further includes:
[0033] Based on the panoramic video data, a planar video corresponding to the main sound subject is obtained through editing.
[0034] The planar video and the target audio data are combined to obtain target audio-video data.
[0035] The panoramic video data is edited into a planar video corresponding to the main sound subject. Then, the planar video is combined with the target audio data to obtain the target audio-visual data, which can be played by the user.
[0036] In some embodiments, the step of editing the panoramic video data to obtain a planar video corresponding to the sound subject includes:
[0037] Obtain the tracking sequence corresponding to the sound subject, wherein the tracking sequence is obtained by tracking and identifying the sound subject;
[0038] Based on the panoramic video data, a planar video corresponding to the tracking sequence is obtained through editing.
[0039] Here, the sound subject in the panoramic video can be tracked and identified. After selecting the sound subject in the panoramic video, the target tracking algorithm is run on the panoramic video to obtain the tracking sequence corresponding to the sound subject. Then, the panoramic video data is edited based on the tracking sequence.
[0040] In some embodiments, the method further includes:
[0041] Based on the panoramic video data, determine the perspective playback sequence corresponding to the sound subject;
[0042] Associate the perspective playback sequence corresponding to the sound subject with the target audio data, and play the target audio data while playing the perspective playback sequence.
[0043] Here, the playback sequence of the perspective corresponding to the sound subject can be composed of the playback perspective of the sound subject in each panoramic video frame, and the target audio data can be played at the same time as the playback sequence of the perspective corresponding to the sound subject.
[0044] In some embodiments, processing the omnidirectional audio data based on the directional information of the sound subject to obtain target audio data includes:
[0045] The corresponding noise reduction strategy is determined based on the location information of the sound subject;
[0046] The omnidirectional audio data is denoised based on the denoising strategy to obtain the target audio data.
[0047] For example, given the directional information of the sound subject, the corresponding noise reduction strategy could be to use a directional enhancement algorithm to enhance the sound from that direction.
[0048] In some embodiments, determining the corresponding noise reduction strategy based on the location information of the sound subject includes:
[0049] Determine the type information of the sound subject;
[0050] Based on the location and type information of the sound subject, a corresponding noise reduction strategy is determined.
[0051] Different types of sound subjects can correspond to different noise reduction strategies. The correspondence between the types of sound subjects and noise reduction strategies can be stored in advance, and the corresponding noise reduction strategy can be queried according to the sound subject when it is applied.
[0052] For example, when the main sound is a human voice, the corresponding noise reduction strategy can be to enhance the human voice.
[0053] For example, when the main sound is music, the corresponding noise reduction strategy can be to improve the fidelity of the music.
[0054] For example, when the main sound source is ambient sound, the corresponding noise reduction strategy could be to preserve the ambient sound and filter out human voices.
[0055] For example, if the noise reduction strategy is to enhance the human voice, minimum tracking with a fast tracking speed combined with Wiener filtering, or a human voice enhancement method based on deep neural networks, can be used. If the noise reduction strategy is to improve music fidelity, minimum tracking with a slower tracking speed combined with Wiener filtering can be used, without equalizer (EQ) adjustments and with Automatic Gain Control (AGC) using a longer release time, etc.
[0056] This embodiment can intelligently select a noise reduction strategy based on the subject of the sound, and use the noise reduction strategy corresponding to the subject of the sound to process the audio data, thereby making the subject of the sound stand out more.
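The pre-stored correspondence between sound subject types and noise reduction strategies can be pictured as a simple lookup table. The keys, strategy names and fallback value below are hypothetical labels chosen for the example; the application does not prescribe a storage format.

```python
# Hypothetical labels; only the type-to-strategy pairing follows the examples in the text.
STRATEGY_BY_TYPE = {
    "human_voice": "enhance_human_voice",
    "music": "improve_music_fidelity",
    "ambient_sound": "preserve_ambient_filter_voice",
}

def lookup_strategy(subject_type, default="generic_noise_reduction"):
    """Return the stored strategy for a sound subject type, or a fallback if none is stored."""
    return STRATEGY_BY_TYPE.get(subject_type, default)

# Usage
strategy = lookup_strategy("music")
```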
[0057] Sub-denoising strategies corresponding to the location and type information of the sound subject can be determined separately, and denoising processing can be performed using the sub-denoising strategies corresponding to the location and type information of the sound subject. For example, if the sound subject is a human voice, and the sound subject is located in a certain spatial direction, beamforming algorithm can be used for spatial directional sound reception, and a human voice enhancement denoising strategy can be adopted.
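For the spatial directional sound reception mentioned above, a delay-and-sum beamformer is one of the simplest beamforming algorithms. The sketch below is an illustration under simplifying assumptions (steering delays are whole samples and are already known); it is not claimed to be the beamformer used in this application, and `delay_and_sum` is a hypothetical name.

```python
def delay_and_sum(channels, delays):
    """Average the channels after advancing each one by its steering delay (whole samples)."""
    length = len(channels[0])
    out = [0.0] * length
    for samples, delay in zip(channels, delays):
        for n in range(length):
            m = n + delay
            if 0 <= m < length:
                out[n] += samples[m]
    return [value / len(channels) for value in out]

# Usage: a pulse reaches microphone 0 at sample 10 and microphone 1 at sample 12.
mic0 = [0.0] * 32
mic0[10] = 1.0
mic1 = [0.0] * 32
mic1[12] = 1.0
steered = delay_and_sum([mic0, mic1], delays=[0, 2])    # aligned: pulse adds coherently
unsteered = delay_and_sum([mic0, mic1], delays=[0, 0])  # misaligned: pulse is smeared
```

When the delays match the direction of the sound subject, its signal adds coherently while sound from other directions is attenuated by the averaging.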
[0058] In some embodiments, before determining the corresponding noise reduction strategy based on the sound subject, the method further includes:
[0059] Determine the audio scene in the panoramic video data;
[0060] Correspondingly, determining the appropriate noise reduction strategy based on the location and type information of the sound subject includes:
[0061] Based on the location and type information of the sound subject and the audio scene, a corresponding noise reduction strategy is determined.
[0062] The audio data can be input into an audio scene recognition model to obtain the audio scene output by the model. Audio scenes include outdoor scenes and indoor scenes. Outdoor scenes can be further divided into cycling, running, skiing, diving scenes, etc., while indoor scenes can be further divided into no sound, pure human voice, light music, human voice mixed with background music, song, and electronic echo.
[0063] An audio scene recognition model can be pre-trained using a large amount of audio data and training labels, and then used to identify audio scenes.
[0064] Different audio scenes call for different noise reduction strategies. In a cycling scene where the sound subject is a human voice, there will be wind noise, so a strategy of reducing wind noise and enhancing the human voice can be adopted: the low frequencies of the audio (the band where wind noise energy is concentrated) are suppressed, and minimum tracking with a fast tracking speed combined with Wiener filtering, or a human voice enhancement method based on deep neural networks, is used to brighten the human voice.
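Suppressing the low-frequency band where wind noise concentrates can be sketched with a first-order high-pass filter. This is an assumption-laden illustration: the application does not specify a filter type, and the 150 Hz cutoff and the function name `high_pass` are chosen only for the example.

```python
import math

def high_pass(samples, sample_rate, cutoff_hz):
    """First-order high-pass filter: attenuates content below roughly `cutoff_hz`."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_x = prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out

# Usage: a constant offset (the lowest possible frequency) decays away,
# while a 3 kHz tone passes almost unchanged.
offset = high_pass([1.0] * 2000, sample_rate=48000, cutoff_hz=150.0)
tone = [math.sin(2.0 * math.pi * 3000.0 * n / 48000.0) for n in range(2000)]
filtered_tone = high_pass(tone, sample_rate=48000, cutoff_hz=150.0)
```

A steeper filter or a dedicated wind noise detector could be substituted; the point is only that the speech band is left largely intact while the low band is reduced.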
[0065] When the main sound is ambient sound and the scene is quiet (e.g., in a forest in the suburbs), the noise reduction strategy is to use a lower noise reduction intensity or no noise reduction processing.
[0066] When the main sound is human voice and the scene is noisy, the noise reduction strategy is to use a higher noise reduction intensity, such as reducing the minimum signal-to-noise ratio, reducing the minimum gain, and increasing the noise spectrum adjustment coefficient.
[0067] Directional enhancement is not suitable for certain scenes, such as when there is a lot of wind noise.
[0068] In one embodiment, determining the corresponding noise reduction strategy based on the location information and type information of the sound subject and the audio scene includes:
[0069] Determine the first noise reduction strategy corresponding to the location information of the sound subject;
[0070] Determine the second noise reduction strategy corresponding to the type information of the sound subject;
[0071] Determine the third noise reduction strategy corresponding to the audio scene;
[0072] Correspondingly, the noise reduction processing of the audio data based on the noise reduction strategy includes:
[0073] The audio data is subjected to noise reduction processing based on the first noise reduction strategy, the second noise reduction strategy, and the third noise reduction strategy.
[0074] Among them, the first noise reduction strategy corresponding to the directional information of the sound subject can be sound direction enhancement, which can be achieved by using a beamforming algorithm.
[0075] The second noise reduction strategy corresponds to the type information of the sound subject, and different types may correspond to different second noise reduction strategies. If the type is human voice, the second noise reduction strategy is to enhance the human voice. If the type is music, the second noise reduction strategy is to improve music fidelity. If the type is ambient sound, the second noise reduction strategy may be to perform no noise reduction.
[0076] The third noise reduction strategy corresponding to the audio scene is as follows: if the scene is cycling, the third noise reduction strategy can be to reduce wind noise; if the scene is a quiet scene (such as in a forest in the suburbs), the third noise reduction strategy is to use a lower noise reduction intensity or no noise reduction processing; if the scene is a noisy scene, the third noise reduction strategy is to use a higher noise reduction intensity, such as reducing the minimum signal-to-noise ratio, reducing the minimum gain, increasing the noise spectrum adjustment coefficient, etc.
[0077] The first, second, and third noise reduction strategies can be used sequentially in a set order. For example, the first noise reduction strategy can be used to reduce the noise of the audio data first, then the second noise reduction strategy can be used to reduce the noise of the audio data, and finally the third noise reduction strategy can be used to reduce the noise of the audio data.
[0078] In one embodiment, a target noise reduction strategy is obtained by combining the first noise reduction strategy, the second noise reduction strategy, and the third noise reduction strategy. Then, the audio data is noise-reduced based on the target noise reduction strategy. If there is a conflict between the first noise reduction strategy, the second noise reduction strategy, and the third noise reduction strategy, the target noise reduction strategy is determined according to a pre-set noise reduction priority. For example, the priority of the first noise reduction strategy is set to be greater than that of the second noise reduction strategy, and the priority of the second noise reduction strategy is set to be greater than that of the third noise reduction strategy.
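One way to picture the priority-based combination is to treat each strategy as a set of named settings and let the higher-priority strategy win whenever two strategies set the same item. The setting names and values below are hypothetical; only the priority order (first over second over third) follows the example in the text.

```python
def merge_strategies(strategies):
    """Merge setting dicts listed from highest to lowest priority."""
    target = {}
    for settings in strategies:
        for key, value in settings.items():
            target.setdefault(key, value)  # an earlier (higher-priority) value is kept
    return target

# Usage with made-up settings for the three strategies.
first = {"beamforming": True}
second = {"voice_enhancement": True, "noise_reduction_strength": "high"}
third = {"beamforming": False, "wind_noise_reduction": True, "noise_reduction_strength": "low"}
target = merge_strategies([first, second, third])
```

Here the third strategy's conflicting `beamforming` and `noise_reduction_strength` values are dropped, while its non-conflicting `wind_noise_reduction` setting is kept.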
[0079] In one embodiment, determining the audio scene in the panoramic video data includes:
[0080] Determine the audio features and / or image features contained in the panoramic video data;
[0081] The audio scene in the panoramic video data is determined based on the audio features and / or image features.
[0082] First, audio features (such as Mel frequency cepstral coefficients, spectrograms, short-time energy, zero-crossing rate, etc.) can be extracted from audio data, or image features (such as color histograms, grayscale images, edge features, optical flow, etc.) can be extracted from panoramic video data. Then, traditional machine learning methods such as K-nearest neighbors and support vector machines, or methods based on deep neural networks, can be used for scene classification. The input features for machine learning can be audio features, image features, or a combination of both.
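The classification step can be illustrated with a tiny K-nearest-neighbors classifier over precomputed feature vectors. The two-dimensional feature values and labels below are invented for the example and carry no measured meaning; a real system would use richer features and far more training data.

```python
import math
from collections import Counter

def knn_classify(features, training_set, k=3):
    """Return the majority label among the k training examples nearest to `features`."""
    nearest = sorted(training_set, key=lambda item: math.dist(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy (short-time energy, zero-crossing rate) pairs; the numbers are made up.
training_set = [
    ((0.90, 0.10), "outdoor"),
    ((0.80, 0.15), "outdoor"),
    ((0.85, 0.20), "outdoor"),
    ((0.20, 0.40), "indoor"),
    ((0.25, 0.35), "indoor"),
    ((0.15, 0.45), "indoor"),
]
scene = knn_classify((0.30, 0.38), training_set)
```

Audio and image features can be concatenated into one vector before classification if both are available.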
[0083] In one embodiment, the omnidirectional audio data includes audio data from multiple channels, and the noise reduction processing of the audio data based on the noise reduction strategy includes:
[0084] Determine at least one channel corresponding to the noise reduction strategy;
[0085] The audio data of at least one channel is subjected to noise reduction processing based on the noise reduction strategy.
[0086] If the omnidirectional audio data includes audio data from multiple channels, the specific channel(s) to be used depend on the noise reduction strategy employed. For example, if the noise reduction strategy is directional enhancement, the audio data from all channels needs to be combined; alternatively, the audio data from the channel directly facing the target sound source can be selected for noise reduction. If the noise reduction strategy is wind noise reduction, only the audio data from the channel with low wind noise can be processed.
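Choosing the channel with low wind noise could, for instance, be done by comparing low-frequency energy across channels. The sketch below uses a moving average as a crude low-pass filter; this measure, the window length and the function names are assumptions for illustration, not the selection rule of this application.

```python
import math

def low_frequency_energy(samples, window=32):
    """Energy of a moving-average (crudely low-passed) version of the signal."""
    energy, running = 0.0, 0.0
    for n, x in enumerate(samples):
        running += x
        if n >= window:
            running -= samples[n - window]  # keep only the last `window` samples in the sum
        energy += (running / window) ** 2
    return energy

def pick_low_wind_channel(channels):
    """Return the index of the channel with the least low-frequency energy."""
    energies = [low_frequency_energy(channel) for channel in channels]
    return energies.index(min(energies))

# Usage at 16 kHz: channel 0 carries a 1 kHz tone plus a strong 20 Hz rumble,
# channel 1 carries the tone only.
tone = [0.3 * math.sin(2.0 * math.pi * 1000.0 * n / 16000.0) for n in range(1600)]
rumble = [0.8 * math.sin(2.0 * math.pi * 20.0 * n / 16000.0) for n in range(1600)]
windy = [t + r for t, r in zip(tone, rumble)]
chosen = pick_low_wind_channel([windy, tone])
```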
[0087] In one embodiment, determining the location of at least one sound subject in the panoramic video data includes one of the following:
[0088] Obtain the target selected by the user in the panoramic video, and use the target as the sound subject in the at least one direction;
[0089] Determine the audio and / or image features contained in the panoramic video data, and determine the sound subject in the at least one direction based on the audio and / or image features.
[0090] Specifically, audio features may also include Mel-frequency cepstral coefficients, spectrograms, short-time energy, zero-crossing rate, etc. Image features may also include color histograms, grayscale images, edge features, optical flow, etc.
[0091] For example, speech activity detection and sound source localization algorithms can be used to estimate the general direction of the current speaker, and then lip movement detection algorithms can be used to determine the speaker's specific location.
[0092] For example, sound subjects can be identified by training machine learning models or neural network learning models. For instance, Mel Frequency Cepstral Coefficients (MFCC) features can be used as network input features, with the network structure consisting of multiple CNN or LSTM layers stacked together, and the output being a classification of different sound subjects.
[0093] For example, short-time smoothing power and short-time zero-crossing rate can be calculated for audio data, and corresponding thresholds can be set. When the short-time power exceeds the threshold and the short-time zero-crossing rate is below the threshold, it can be determined that human voices are present. The distribution characteristics of the spectrum can also be used to determine the sound scene or the main sound source. For example, when the spectral centroid is very low and the correlation between channels is low, wind noise can be identified. When the spectrum shows frequent overall increases and decreases over time, sudden noise such as collision sounds or drilling sounds can be identified. When the high-frequency proportion of the spectrum is high and the short-time fluctuations are rapid, but the fluctuations at different frequency points are asynchronous, it can be determined that the current environment is noisy, such as in a shopping mall or by the roadside.
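The threshold rule for detecting human voice can be sketched as follows. The thresholds are arbitrary placeholders, and the temporal smoothing of the power is left out to keep the example short, so this is an illustration of the decision rule rather than a tuned detector.

```python
import math

def short_time_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def looks_like_voice(frame, power_threshold=0.01, zcr_threshold=0.25):
    """Apply the rule: power above its threshold and zero-crossing rate below its threshold."""
    return (short_time_power(frame) > power_threshold
            and zero_crossing_rate(frame) < zcr_threshold)

# Usage: a 200 Hz tone at 16 kHz is loud with few zero crossings;
# a sign-alternating sequence is loud but crosses zero at every sample.
voiced_like = [0.5 * math.sin(2.0 * math.pi * 200.0 * n / 16000.0) for n in range(320)]
noise_like = [0.5, -0.5] * 160
```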
[0094] In one embodiment, acquiring a target selected by the user in a panoramic video and using the target as the sound subject in the at least one location includes:
[0095] Identify multiple targets in the panoramic video;
[0096] Display the multiple targets to the user;
[0097] The target selected by the user is determined as the sound subject in at least one direction.
[0098] When there are multiple targets in a panoramic video frame, a target recognition algorithm can identify these targets (people, animals, plants, buildings, vehicles, etc.) and display them all to the user. The user can then select one target as the sound subject. For example, selecting "person" makes the human voice the sound subject, while selecting "animal" makes animal sounds the sound subject.
[0099] For example, in the APP user interface shown in Figure 2, a panoramic video is displayed, and all targets in the panoramic video are identified and displayed. Users can select targets directly in the panoramic video or through the target options at the bottom of the APP user interface. For example, Figure 2 includes target 1, target 2, and target 3. The target selected by the user is designated as the main sound subject.
[0100] In one embodiment, the noise reduction processing of the omnidirectional audio data based on the noise reduction strategy includes:
[0101] Determine the noise reduction algorithm and audio parameters corresponding to the noise reduction strategy;
[0102] The noise reduction algorithm is used to adjust the audio parameters of the omnidirectional audio data.
[0103] For example, if the noise reduction strategy is human voice enhancement, the corresponding noise reduction algorithm could be minimum tracking with a fast tracking speed combined with Wiener filtering, or a human voice enhancement method based on deep neural networks to brighten the voice. Adjustable audio parameters include the smoothing coefficient of the noise spectrum estimation, the coefficient for adjusting the absolute value of the noise spectrum, the maximum noise amplitude, the minimum signal-to-noise ratio, and the minimum gain. Besides noise reduction parameters, other audio processing parameters can also be adjusted, such as equalizer (EQ) parameters and automatic gain control (AGC) or dynamic range compression (DRC) parameters.
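To show how such parameters might enter the computation, the sketch below tracks the noise floor of a single frequency bin as the minimum of recently smoothed power and derives a Wiener-style gain with a minimum signal-to-noise ratio and a minimum gain. The class and function names, default values and the single-bin simplification are assumptions; a practical minimum-tracking estimator involves bias compensation and other details omitted here.

```python
from collections import deque

class MinimumTracker:
    """Track the noise floor of one frequency bin as the minimum of recently smoothed power."""

    def __init__(self, window, smoothing=0.8):
        self.history = deque(maxlen=window)  # a shorter window tracks noise changes faster
        self.smoothing = smoothing           # smoothing coefficient of the power estimate
        self.smoothed = None

    def update(self, power):
        if self.smoothed is None:
            self.smoothed = power
        else:
            self.smoothed = self.smoothing * self.smoothed + (1 - self.smoothing) * power
        self.history.append(self.smoothed)
        return min(self.history)

def wiener_gain(power, noise_power, min_snr=0.05, min_gain=0.1):
    """Wiener-style gain snr / (1 + snr), with floors on the SNR estimate and on the gain."""
    snr = max(power / max(noise_power, 1e-12) - 1.0, min_snr)
    return max(snr / (1.0 + snr), min_gain)

# Usage: the noise estimate stays at the floor when a loud frame first arrives.
tracker = MinimumTracker(window=3)
noise_estimates = [tracker.update(p) for p in (1.0, 1.0, 1.0, 100.0)]
gain_for_loud_frame = wiener_gain(100.0, noise_estimates[-1])
```

Lowering `min_gain` or `min_snr` in this sketch lets the filter suppress more, which corresponds to a higher noise reduction intensity.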
[0104] In one embodiment, determining the location of at least one sound subject in the panoramic video data includes:
[0105] The panoramic video data is divided into multiple segments;
[0106] Identify the sound subject from at least one location in each segment;
[0107] Correspondingly, processing the omnidirectional audio data based on the location information of the sound subject includes:
[0108] The omnidirectional audio data of the corresponding segment is processed based on the sound subject in at least one direction in each segment.
[0109] This embodiment divides the panoramic video data into multiple segments for sequential processing. For example, a 10-second video is divided into 10 ms segments, and the data is processed sequentially from the beginning in 10 ms increments. This allows the audio processing strategy to be adjusted in real time so that it suits the current segment, improving the audio processing effect.
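The segmentation in this example amounts to slicing the audio samples into fixed-length blocks. The sketch below assumes a 16 kHz sample rate purely for illustration; `split_into_segments` is a hypothetical helper.

```python
def split_into_segments(samples, sample_rate, segment_ms=10):
    """Split a sample list into consecutive segments of `segment_ms` milliseconds."""
    step = sample_rate * segment_ms // 1000
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# Usage: 10 seconds of audio at an assumed 16 kHz sample rate.
audio = [0.0] * (16000 * 10)
segments = split_into_segments(audio, sample_rate=16000)
```

With these numbers each segment holds 160 samples and the 10-second recording yields 1000 segments, each of which can be given its own processing strategy.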
[0110] In one embodiment, after processing the omnidirectional audio data of the corresponding segment based on the sound subject with at least one orientation in each segment, the method further includes:
[0111] If the processing strategies for two adjacent segments are different, then the audio parameters of the target audio data of the two adjacent segments are smoothed.
[0112] Since noise reduction is performed segment by segment, signal smoothing can be added when two adjacent segments use different audio processing strategies, so that the listening experience does not change abruptly at the switch. Specific smoothing measures can include: simultaneously calculating the output audio parameters of the two segments before and after the switch, and applying a linear or exponential fade-out to the earlier output and a matching fade-in to the later output.
[0113] For example, in the smoothing phase, the parameters of the previous frame decrease linearly to 0 while the parameters of the next frame increase linearly. If the parameter of the previous frame is a, the parameter of the next frame is b, the smoothing length is N, and the index of the current point within the smoothing phase is i, then the parameter used at the current point is c = a * (N - i) / N + b * i / N. Recursive smoothing, on the other hand, uses a memory coefficient alpha: the initial parameter is c = a, and the parameter is updated as c = c * alpha + b * (1 - alpha). There are many such smoothing methods to choose from, and the embodiments of this application can use any of them.
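Both smoothing rules can be written out directly. The sketch below reads the linear rule as c = a * (N - i) / N + b * i / N for i from 0 to N, and the recursive rule as repeated application of c = c * alpha + b * (1 - alpha) starting from c = a. The function names and example values are chosen for illustration.

```python
def linear_crossfade(a, b, n_steps):
    """Parameter values moving linearly from a (i = 0) to b (i = n_steps)."""
    return [a * (n_steps - i) / n_steps + b * i / n_steps for i in range(n_steps + 1)]

def recursive_smooth(a, b, alpha, n_steps):
    """Parameter values approaching b from a with memory coefficient alpha."""
    c = a
    values = []
    for _ in range(n_steps):
        c = c * alpha + b * (1 - alpha)
        values.append(c)
    return values

# Usage: move a gain parameter from 0.2 to 1.0.
linear = linear_crossfade(0.2, 1.0, 4)
recursive = recursive_smooth(0.0, 1.0, 0.5, 3)
```

The linear form reaches b exactly after N steps, whereas the recursive form only approaches b, faster for smaller alpha.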
[0114] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
[0115] It should be understood that, when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, integers, steps, operations, elements and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or groups thereof.
[0116] It should be noted that the technical solutions described in the embodiments of the present invention can be combined arbitrarily without conflict.
[0117] In addition, in the embodiments of the present invention, "first," "second," etc. are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
[0118] Based on the hardware implementation of the above-described program modules, and in order to implement the method of this application embodiment, this application embodiment also provides an electronic device. Figure 3 is a schematic diagram of the hardware composition structure of the electronic device of this application embodiment. As shown in Figure 3, the electronic device includes:
[0119] A communication interface enables information exchange with other devices, such as network devices.
[0120] The processor is connected to the communication interface to enable information interaction with other devices;
[0121] The memory stores a computer program, which, when executed by the processor, is configured to retrieve panoramic video data to be processed. The panoramic video data includes: omnidirectional audio data; determine at least one sound subject in the panoramic video data; and process the omnidirectional audio data according to the location information of the sound subject to obtain target audio data.
[0122] Furthermore, according to at least one embodiment of this application, the processor is configured to edit the panoramic video data to obtain a planar video corresponding to the sound subject; and to synthesize the planar video with the target audio data to obtain target audio-visual data.
[0123] Furthermore, according to at least one embodiment of this application, the processor is configured to acquire a tracking sequence corresponding to the sound subject, the tracking sequence being obtained by tracking and identifying the sound subject; and to edit a planar video corresponding to the tracking sequence based on the panoramic video data.
[0124] Furthermore, according to at least one embodiment of this application, the processor is configured to determine a perspective playback sequence corresponding to the sound subject based on the panoramic video data; associate the perspective playback sequence corresponding to the sound subject with the target audio data; and play the target audio data while playing the perspective playback sequence.
[0125] Furthermore, according to at least one embodiment of this application, the processor is configured to determine a corresponding noise reduction strategy based on the directional information of the sound subject; and to perform noise reduction processing on the omnidirectional audio data based on the noise reduction strategy to obtain the target audio data.
[0126] Furthermore, according to at least one embodiment of this application, the processor is configured to determine the type information of the sound subject; and to determine a corresponding noise reduction strategy based on the directional information and the type information of the sound subject.
[0127] Furthermore, according to at least one embodiment of this application, the processor is configured to determine an audio scene in the panoramic video data; and to determine a corresponding noise reduction strategy based on the directional information and the type information of the sound subject and the audio scene.
[0128] Furthermore, according to at least one embodiment of this application, the processor is configured to determine audio features and/or image features contained in the panoramic video data; and to determine an audio scene in the panoramic video data based on the audio features and/or the image features.
[0129] Furthermore, according to at least one embodiment of this application, the processor is configured to determine a first noise reduction strategy corresponding to the directional information of the sound subject; determine a second noise reduction strategy corresponding to the type information of the sound subject; determine a third noise reduction strategy corresponding to the audio scene; and perform noise reduction processing on the omnidirectional audio data based on the first noise reduction strategy, the second noise reduction strategy, and the third noise reduction strategy.
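One way to picture the selection of the first, second, and third noise reduction strategies is the sketch below. The `Strategy` class, the lookup tables, and the algorithm names and parameter values in them are hypothetical placeholders, since this application does not enumerate concrete strategies; the sketch only shows that the three strategies are chosen from three independent inputs and then applied together.

```python
from dataclasses import dataclass, field


@dataclass
class Strategy:
    """A hypothetical noise reduction strategy: an algorithm name plus parameters."""
    algorithm: str
    params: dict = field(default_factory=dict)


# Hypothetical lookup tables; real entries would depend on the product.
BY_TYPE = {"speech": Strategy("bandpass", {"low_hz": 300, "high_hz": 3400})}
BY_SCENE = {"street": Strategy("spectral_gate", {"reduction_db": 12})}


def first_strategy(azimuth_deg):
    """Strategy derived from the directional information of the sound subject."""
    return Strategy("directional_gain", {"azimuth_deg": azimuth_deg})


def select_strategies(azimuth_deg, subject_type, scene):
    """Return the first, second and third strategies, skipping unknown keys."""
    chosen = [first_strategy(azimuth_deg)]
    for table, key in ((BY_TYPE, subject_type), (BY_SCENE, scene)):
        if key in table:
            chosen.append(table[key])
    return chosen
```

The returned list can then be applied in order to the omnidirectional audio data, so that the direction-based, type-based, and scene-based processing all contribute to the target audio data.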
[0130] Furthermore, according to at least one embodiment of this application, the omnidirectional audio data includes audio data from multiple channels, and the processor is configured to determine at least one channel corresponding to the noise reduction strategy; and to perform noise reduction processing on the audio data of the at least one channel based on the noise reduction strategy.
[0131] Furthermore, according to at least one embodiment of this application, the processor is configured to acquire a target selected by the user in a panoramic video and use the target as the sound subject in the at least one direction; or, to determine audio features and/or image features contained in the panoramic video data and determine the sound subject in the at least one direction based on the audio features and/or the image features.
[0132] Furthermore, according to at least one embodiment of this application, the processor is configured to determine a plurality of targets in the panoramic video; display the plurality of targets to a user; and determine the target selected by the user as the sound subject in the at least one direction.
[0133] Furthermore, according to at least one embodiment of this application, the audio features include the direction of the sound source and the volume, and the image features include the location information of the sound subject.
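A minimal sketch of combining these audio and image features is given below. It assumes that sound sources are available as (azimuth, volume) pairs and that visually detected targets are available as (label, azimuth) pairs; the matching rule (the target nearest to the loudest source, within a tolerance) and all names are hypothetical and only illustrate one possible approach.

```python
def pick_sound_subject(sound_sources, visual_targets, max_offset_deg=20.0):
    """Match the loudest sound source to the nearest visual target.

    sound_sources: list of (azimuth_deg, volume) pairs.
    visual_targets: list of (label, azimuth_deg) pairs.
    Returns the matching (label, azimuth_deg) pair, or None if nothing matches.
    """
    if not sound_sources or not visual_targets:
        return None
    loudest_azimuth, _ = max(sound_sources, key=lambda source: source[1])

    def offset(target):
        diff = abs(target[1] - loudest_azimuth) % 360.0
        return min(diff, 360.0 - diff)

    best = min(visual_targets, key=offset)
    return best if offset(best) <= max_offset_deg else None
```

If the loudest source has no visual target within the tolerance, the sketch returns `None`, in which case a real system might fall back to a user selection.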
[0134] Furthermore, according to at least one embodiment of this application, the processor is configured to determine the noise reduction algorithm and audio parameters corresponding to the noise reduction strategy; and to adjust the audio parameters of the omnidirectional audio data using the noise reduction algorithm.
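The mapping from a noise reduction strategy to an algorithm and its audio parameters could be sketched as follows. The two algorithms shown (a simple noise gate and a volume adjustment), the registry, and the strategy dictionary layout are assumptions chosen for brevity; this application does not name specific algorithms or parameters.

```python
def noise_gate(samples, threshold, attenuation):
    """Scale samples whose magnitude is below `threshold` by `attenuation`."""
    return [s if abs(s) >= threshold else s * attenuation for s in samples]


def apply_gain(samples, gain_db):
    """Adjust volume by `gain_db` decibels (negative values make it quieter)."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]


# Hypothetical registry mapping algorithm names to implementations.
ALGORITHMS = {"noise_gate": noise_gate, "gain": apply_gain}


def apply_strategy(samples, strategy):
    """Look up the strategy's algorithm and run it with the strategy's parameters."""
    algorithm = ALGORITHMS[strategy["algorithm"]]
    return algorithm(samples, **strategy["params"])
```

A strategy such as `{"algorithm": "noise_gate", "params": {"threshold": 0.05, "attenuation": 0.0}}` would silence low-level samples while leaving louder samples untouched.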
[0135] Furthermore, according to at least one embodiment of this application, the processor is configured to divide the panoramic video data into multiple segments;
[0136] identify a sound subject in at least one direction in each segment; and process the omnidirectional audio data of the corresponding segment based on the sound subject in the at least one direction in each segment.
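The segment-wise processing can be illustrated with the sketch below. It assumes fixed-length segments and a list of sound detections given as (timestamp, azimuth, volume) tuples, and it picks the loudest detection in each segment as that segment's sound subject; the segmentation rule and the names are hypothetical and are used only for illustration.

```python
def split_into_segments(duration_s, segment_s):
    """Return (start, end) pairs in seconds that cover the whole duration."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


def loudest_direction_per_segment(segments, detections):
    """For each segment, return the azimuth of its loudest detection, or None.

    detections: list of (timestamp_s, azimuth_deg, volume) tuples.
    """
    result = []
    for start, end in segments:
        inside = [d for d in detections if start <= d[0] < end]
        result.append(max(inside, key=lambda d: d[2])[1] if inside else None)
    return result
```

Each segment can then be processed with the direction found for it, and segments without any detection can be left unprocessed or handled with a default strategy.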
[0137] Furthermore, according to at least one embodiment of this application, the processor is configured to smooth the audio parameters of the target audio data of the two adjacent segments if the processing strategies corresponding to the two adjacent segments are different.
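The smoothing between adjacent segments might be sketched as a short linear ramp of one audio parameter. The sketch below uses a per-segment gain as the audio parameter and treats a change in gain as a stand-in for a change in processing strategy; the ramp length and the function name are assumptions, and this application does not prescribe a particular smoothing method.

```python
def smoothed_gain_curve(segment_gains, segment_length, ramp):
    """Build a per-sample gain curve with a linear ramp at each changed boundary.

    segment_gains: one gain value per segment.
    segment_length: number of samples in each segment.
    ramp: number of samples over which a changed gain is reached.
    """
    curve = []
    previous = None
    for gain in segment_gains:
        for i in range(segment_length):
            if previous is not None and previous != gain and i < ramp:
                # Move linearly from the previous segment's gain to the new one.
                t = (i + 1) / ramp
                curve.append(previous + (gain - previous) * t)
            else:
                curve.append(gain)
        previous = gain
    return curve
```

For two four-sample segments with gains 1.0 and 0.5 and a two-sample ramp, the second segment starts at 0.75 before settling at 0.5, which avoids an abrupt jump at the boundary.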
[0138] Of course, in practical applications, the various components in an electronic device are coupled together through a bus system. It can be understood that the bus system is used to achieve communication and connection between these components. In addition to the data bus, the bus system also includes a power bus, a control bus, and a status signal bus. However, for clarity, the various buses are all labeled as the bus system in Figure 3.
[0139] The memory in this application embodiment is used to store various types of data to support the operation of the electronic device. Examples of such data include any computer program used to operate on the electronic device.
[0140] It is understood that memory can be volatile or non-volatile, or both. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferroelectric random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of this application are intended to include, but are not limited to, these and any other suitable types of memories.
[0141] The methods disclosed in the embodiments of this application can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by integrated logic circuits in the processor's hardware or by instructions in software form. The processor may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The processor can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor or any conventional processor, etc. The steps of the methods disclosed in the embodiments of this application can be directly manifested as execution by a hardware decoding processor, or execution by a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium, which is located in memory. The processor reads the program from the memory and, in conjunction with its hardware, completes the steps of the aforementioned method.
[0142] Optionally, when the processor executes the program, it implements the corresponding processes implemented by the electronic device in the various methods of the embodiments of this application. For the sake of brevity, these will not be described in detail here.
[0143] In an exemplary embodiment, this application also provides a computer program product, including a computer program that can be executed by a processor of an electronic device to perform the steps described in the method of this application embodiment.
[0144] In an exemplary embodiment, this application also provides a storage medium, namely a computer storage medium, specifically a computer-readable storage medium, such as a first memory storing a computer program, which can be executed by a processor of an electronic device to complete the steps described in the aforementioned method. The computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface memory, optical disc, or CD-ROM.
[0145] In the several embodiments provided in this application, it should be understood that the disclosed apparatus, electronic devices, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods, such as: multiple units or components may be combined, or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the various components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.
[0146] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected to achieve the purpose of this embodiment according to actual needs.
[0147] In addition, each functional unit in the various embodiments of this application can be integrated into one processing unit, or each unit can be a separate unit, or two or more units can be integrated into one unit; the integrated unit can be implemented in hardware or in the form of hardware plus software functional units.
[0148] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.
[0149] Alternatively, if the integrated units described above are implemented as software functional modules and sold or used as independent products, they can also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, or the parts that contribute to related technologies, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as mobile storage devices, ROM, RAM, magnetic disks, or optical disks.
[0150] It should be noted that the technical solutions described in the embodiments of this application can be combined arbitrarily without conflict.
[0151] The descriptions of the various embodiments above tend to emphasize the differences between them. For parts that are the same or similar, the embodiments may be referred to one another, and for the sake of brevity, they will not be repeated here.
[0152] The methods disclosed in the various method embodiments provided in this application can be arbitrarily combined to obtain new method embodiments without conflict.
[0153] The features disclosed in the various product embodiments provided in this application can be arbitrarily combined without conflict to obtain new product embodiments.
[0154] The features disclosed in the various method or device embodiments provided in this application can be arbitrarily combined without conflict to obtain new method or device embodiments.
[0155] In addition, in the embodiments of this application, terms such as "first" and "second" are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence.
[0156] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
Claims
1. An audio processing method, comprising: acquiring panoramic video data to be processed, wherein the panoramic video data includes omnidirectional audio data; determining a sound subject in at least one direction in the panoramic video data; and processing the omnidirectional audio data based on directional information of the sound subject to obtain target audio data.
2. The method of claim 1, wherein the method further comprises: editing, based on the panoramic video data, to obtain a planar video corresponding to the sound subject; and synthesizing the planar video with the target audio data to obtain target audio-visual data.
3. The method of claim 2, wherein editing, based on the panoramic video data, to obtain the planar video corresponding to the sound subject comprises: acquiring a tracking sequence corresponding to the sound subject, wherein the tracking sequence is obtained by tracking and identifying the sound subject; and editing, based on the panoramic video data, to obtain a planar video corresponding to the tracking sequence.
4. The method of claim 1, wherein the method further comprises: determining, based on the panoramic video data, a perspective playback sequence corresponding to the sound subject; and associating the perspective playback sequence corresponding to the sound subject with the target audio data, and playing the target audio data while playing the perspective playback sequence.
5. The method of claim 1, wherein processing the omnidirectional audio data based on the directional information of the sound subject to obtain the target audio data comprises: determining a corresponding noise reduction strategy based on the directional information of the sound subject; and performing noise reduction processing on the omnidirectional audio data based on the noise reduction strategy to obtain the target audio data.
6. The method of claim 5, wherein determining the corresponding noise reduction strategy based on the directional information of the sound subject comprises: determining type information of the sound subject; and determining the corresponding noise reduction strategy based on the directional information and the type information of the sound subject.
7. The method of claim 6, wherein the method further comprises: determining an audio scene in the panoramic video data; and correspondingly, determining the corresponding noise reduction strategy based on the directional information and the type information of the sound subject comprises: determining the corresponding noise reduction strategy based on the directional information and the type information of the sound subject and the audio scene.
8. The method of claim 7, wherein determining the audio scene in the panoramic video data comprises: determining audio features and/or image features contained in the panoramic video data; and determining the audio scene in the panoramic video data based on the audio features and/or the image features.
9. The method of claim 7, wherein determining the corresponding noise reduction strategy based on the directional information and the type information of the sound subject and the audio scene comprises: determining a first noise reduction strategy corresponding to the directional information of the sound subject; determining a second noise reduction strategy corresponding to the type information of the sound subject; and determining a third noise reduction strategy corresponding to the audio scene; and correspondingly, performing noise reduction processing on the omnidirectional audio data based on the noise reduction strategy comprises: performing noise reduction processing on the omnidirectional audio data based on the first noise reduction strategy, the second noise reduction strategy, and the third noise reduction strategy.
10. The method of claim 5, wherein the omnidirectional audio data includes audio data of multiple channels, and performing noise reduction processing on the omnidirectional audio data based on the noise reduction strategy comprises: determining at least one channel corresponding to the noise reduction strategy; and performing noise reduction processing on the audio data of the at least one channel based on the noise reduction strategy.
11. The method of claim 1, wherein determining the sound subject in at least one direction in the panoramic video data comprises one of the following: acquiring a target selected by a user in the panoramic video, and using the target as the sound subject in the at least one direction; or determining audio features and/or image features contained in the panoramic video data, and determining the sound subject in the at least one direction based on the audio features and/or the image features.
12. The method of claim 11, wherein acquiring the target selected by the user in the panoramic video, and using the target as the sound subject in the at least one direction, comprises: determining multiple targets in the panoramic video; displaying the multiple targets to the user; and determining the target selected by the user as the sound subject in the at least one direction.
13. The method of claim 11, wherein the audio features include the direction of the sound source and the volume, and the image features include the location information of the sound subject.
14. The method of claim 5, wherein performing noise reduction processing on the omnidirectional audio data based on the noise reduction strategy comprises: determining a noise reduction algorithm and audio parameters corresponding to the noise reduction strategy; and adjusting the audio parameters of the omnidirectional audio data using the noise reduction algorithm.
15. The method of claim 1, wherein determining the sound subject in at least one direction in the panoramic video data comprises: dividing the panoramic video data into multiple segments; and determining a sound subject in at least one direction in each segment; and correspondingly, processing the omnidirectional audio data based on the directional information of the sound subject comprises: processing the omnidirectional audio data of the corresponding segment based on the sound subject in the at least one direction in each segment.
16. The method of claim 15, wherein after processing the omnidirectional audio data of the corresponding segment based on the sound subject in the at least one direction in each segment, the method further comprises: if the processing strategies corresponding to two adjacent segments are different, smoothing the audio parameters of the target audio data of the two adjacent segments.
17. An electronic device, comprising a memory and a processor, the memory storing a computer program, wherein when the computer program is executed by the processor, the processor is configured to: acquire panoramic video data to be processed, the panoramic video data including omnidirectional audio data; determine a sound subject in at least one direction in the panoramic video data; and process the omnidirectional audio data based on directional information of the sound subject to obtain target audio data.
18. The electronic device of claim 17, wherein the processor is configured to edit the panoramic video data to obtain a planar video corresponding to the sound subject; and to synthesize the planar video with the target audio data to obtain target audio-visual data.
19. The electronic device of claim 18, wherein the processor is configured to acquire a tracking sequence corresponding to the sound subject, the tracking sequence being obtained by tracking and identifying the sound subject; and to edit a planar video corresponding to the tracking sequence based on the panoramic video data.
20. The electronic device of claim 17, wherein the processor is configured to determine a perspective playback sequence corresponding to the sound subject based on the panoramic video data; associate the perspective playback sequence corresponding to the sound subject with the target audio data; and play the target audio data while playing the perspective playback sequence.
21. The electronic device of claim 17, wherein the processor is configured to determine a corresponding noise reduction strategy based on the directional information of the sound subject; and to perform noise reduction processing on the omnidirectional audio data based on the noise reduction strategy to obtain the target audio data.
22. The electronic device of claim 21, wherein the processor is configured to determine type information of the sound subject; and to determine the corresponding noise reduction strategy based on the directional information and the type information of the sound subject.
23. The electronic device of claim 22, wherein the processor is configured to determine an audio scene in the panoramic video data; and to determine the corresponding noise reduction strategy based on the directional information and the type information of the sound subject and the audio scene.
24. The electronic device of claim 23, wherein the processor is configured to determine audio features and/or image features contained in the panoramic video data; and to determine the audio scene in the panoramic video data based on the audio features and/or the image features.
25. The electronic device of claim 23, wherein the processor is configured to determine a first noise reduction strategy corresponding to the directional information of the sound subject; determine a second noise reduction strategy corresponding to the type information of the sound subject; determine a third noise reduction strategy corresponding to the audio scene; and perform noise reduction processing on the omnidirectional audio data based on the first noise reduction strategy, the second noise reduction strategy, and the third noise reduction strategy.
26. The electronic device of claim 21, wherein the omnidirectional audio data includes audio data of multiple channels, and the processor is configured to determine at least one channel corresponding to the noise reduction strategy; and to perform noise reduction processing on the audio data of the at least one channel based on the noise reduction strategy.
27. The electronic device of claim 17, wherein the processor is configured to acquire a target selected by a user in the panoramic video and use the target as the sound subject in the at least one direction; or, to determine audio features and/or image features contained in the panoramic video data and determine the sound subject in the at least one direction based on the audio features and/or the image features.
28. The electronic device of claim 17, wherein the processor is configured to determine multiple targets in the panoramic video; display the multiple targets to a user; and determine the target selected by the user as the sound subject in the at least one direction.
29. The electronic device of claim 27, wherein the audio features include the direction of the sound source and the volume, and the image features include the location information of the sound subject.
30. The electronic device of claim 21, wherein the processor is configured to determine a noise reduction algorithm and audio parameters corresponding to the noise reduction strategy; and to adjust the audio parameters of the omnidirectional audio data using the noise reduction algorithm.
31. The electronic device of claim 17, wherein the processor is configured to divide the panoramic video data into multiple segments; determine a sound subject in at least one direction in each segment; and process the omnidirectional audio data of the corresponding segment based on the sound subject in the at least one direction in each segment.
32. The electronic device of claim 31, wherein the processor is configured to smooth the audio parameters of the target audio data of two adjacent segments if the processing strategies corresponding to the two adjacent segments are different.