Video processing method and device, electronic equipment and computer storage medium
By performing frame-by-frame processing and feature fusion interaction on the video, and using audio features to determine the segmentation mask of the sound-producing object, the problem of inaccurate segmentation of sound-producing objects in the existing technology is solved, and accurate segmentation of sound-producing objects in the video and recognition of multiple sound-producing objects are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TAOBAO CHINA SOFTWARE
- Filing Date
- 2023-08-19
- Publication Date
- 2026-06-30
AI Technical Summary
Existing video processing methods cannot accurately segment the sound source in a video, resulting in inaccurate segmentation results.
By segmenting the video into frames, extracting multi-size video and audio features, and employing cross-modal feature fusion and temporal interaction, the segmentation mask of the sound-producing object is determined using audio query information, thereby achieving accurate segmentation of the sound-producing object in the video frame.
It achieves accurate segmentation of sound-producing objects in videos, and is applicable to the recognition and segmentation of multiple sound-producing objects in complex scenes, improving the accuracy and efficiency of segmentation.
Figure CN117315524B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, specifically to video processing methods, video processing apparatus, electronic devices, and computer storage media.
Background Technology
[0002] With the increasing prevalence of video in people's daily lives and work, video processing technology has become increasingly important. In video processing, the localization and segmentation of the speaker in the video establishes a connection between audio and video, with wide applications in real-world scenarios. For example, in live video streaming, speaker recognition and segmentation can highlight the speaker in the video, providing a better viewing experience; in video conferencing, especially in multi-person meetings, speaker recognition and segmentation can magnify the speaker's body in the video window, attracting the attention of other listeners; and in short video editing, speaker recognition and segmentation can quickly achieve foreground / background differentiation and content editing.
[0003] However, existing video processing methods can only roughly identify and segment the sound-producing objects in videos, leading to inaccurate segmentation results. Therefore, how to accurately segment the sound-producing objects in videos has become an urgent technical problem to be solved.
Summary of the Invention
[0004] This application provides a video processing method to accurately segment sound-producing objects in a video. This application also provides a video processing apparatus, an electronic device, and a computer storage medium.
[0005] This application provides a video processing method, including:
[0006] The video to be processed is segmented into frames to obtain multiple video frames corresponding to the video to be processed.
[0007] For any target video frame among multiple video frames, extract the video features of the target video frame at multiple sizes to obtain the multi-size video features of the target video frame;
[0008] Based on the audio corresponding to the video to be processed, obtain the audio features corresponding to the target video frame;
[0009] Based on the multi-size video features and the audio features, obtain the audio-related video mask features for the target video frame;
[0010] Based on pre-set audio query information for each video frame, determine, in the processed video features, the target features of the sound-producing object corresponding to the audio query information in the multiple video frames; the processed video features are the video features obtained by feature fusion and temporal interaction between the multi-size video features and the audio features;
[0011] Based on the target features and the video mask features, a segmentation mask for the target sound-emitting object in the target video frame is obtained; the segmentation mask is used to represent the target sound-emitting object in the target video frame.
[0012] Optionally, obtaining the audio-related video mask features for the target video frame based on the multi-size video features and the audio features includes:
[0013] The multi-size video features and the audio features are fused and temporally interacted to obtain video mask features related to the audio for the target video frame.
[0014] Optionally, the step of performing feature fusion and temporal interaction on the multi-size video features and the audio features to obtain audio-related video mask features for the target video frame includes:
[0015] The first attention mechanism is used to fuse the multi-size video features and the audio features to obtain fused video features that incorporate the audio features;
[0016] A second attention mechanism is used to process the multi-size video features and the fused video features to obtain aggregated video features of the multi-size video features at different sizes of the same pixel;
[0017] A third attention mechanism is used to perform temporal interaction processing on the audio features and the aggregated video features to obtain the processed video features after temporal interaction processing.
[0018] Based on the processed video features, obtain the audio-related video mask features for the target video frame.
[0019] Optionally, the step of employing a third attention mechanism to perform temporal interaction processing on the audio features and the aggregated video features to obtain processed video features after temporal interaction processing includes:
[0020] For the audio features, initial features corresponding to the audio features in the multiple video frames are determined from the aggregated video features;
[0021] Self-attention is used to enhance the initial features to determine the features after temporal interaction between different target video frames;
[0022] The features after the temporal interaction are mapped to obtain the processed video features after the temporal interaction.
[0023] Optionally, the step of determining the target features of the sound-producing object corresponding to the audio query information in the multiple video frames based on pre-set audio query information for each video frame includes:
[0024] Based on pre-set audio query information for each video frame, determine the sound source corresponding to the audio query information;
[0025] Based on the processed video features, obtain the video features of the processed video features at a specified size;
[0026] For each audio query, the video features at the specified size and the audio query are used as input to the audio query encoder to obtain the target features of the sound-producing object in the multiple video frames.
[0027] Optionally, the audio query encoder includes: a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module;
[0028] For each audio query, the process of using the video features at the specified size and the audio query information as input to the audio query encoder to obtain the target features of the sound-producing object in the multiple video frames includes:
[0029] For each audio query, the video features at the specified size and the audio query are used as input to the multi-head cross-attention module to obtain the output result information of the multi-head cross-attention module;
[0030] The output information of the multi-head cross-attention module is used as the input information of the multi-head self-attention module to obtain the output information of the multi-head self-attention module;
[0031] The output information of the multi-head self-attention module is used as the input information of the feedforward network module to obtain the output information of the feedforward network module;
[0032] Based on the output information of the feedforward network module, the target features of the sound-emitting object in the multiple video frames are obtained.
[0033] Optionally, the audio query encoder includes a multi-layer attention mechanism, each layer of which includes a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module.
[0034] The audio query information input to the first-layer attention mechanism is the audio features of each video frame corresponding to the audio; the audio query information input to the second-layer attention mechanism or higher is the output result information of the previous-layer attention mechanism; and the output result of the last-layer attention mechanism is the target feature of the sound-producing object in the multiple video frames.
[0035] Optionally, obtaining a segmentation mask for the target sound-emitting object in the target video frame based on the target features and the video mask features includes:
[0036] The target features and the video mask features are subjected to matrix multiplication and preset function operations to obtain a segmentation mask for the target sound object in the target video frame.
[0037] This application provides a video processing apparatus, including:
[0038] The frame segmentation processing unit is used to perform frame segmentation processing on the video to be processed, and obtain multiple video frames corresponding to the video to be processed.
[0039] The video feature extraction unit is used to extract, for any target video frame among the multiple video frames, the video features of the target video frame at multiple sizes, and obtain the multi-size video features of the target video frame.
[0040] An audio feature acquisition unit is used to obtain audio features corresponding to the target video frame based on the audio corresponding to the video to be processed.
[0041] The video mask feature acquisition unit is used to obtain, based on the multi-size video features and the audio features, audio-related video mask features for the target video frame;
[0042] The target feature determination unit is used to determine the target features of the sound-producing object corresponding to the audio query information in the multiple video frames based on the pre-set audio query information for each video frame in the processed video features; the processed video features are video features after feature fusion and temporal interaction between the multi-size video features and the audio features;
[0043] The segmentation mask obtaining unit is used to obtain a segmentation mask for the target sound object of the target video frame based on the target features and the video mask features; the segmentation mask is used to represent the target sound object of the target video frame.
[0044] This application provides an electronic device, including:
[0045] A processor;
[0046] The memory is used to store computer programs, which are executed by the processor to perform the video processing methods described above.
[0047] This application provides a computer storage medium storing a computer program that is executed by a processor to perform the video processing method described above.
[0048] Compared with the prior art, the embodiments of this application have the following advantages:
[0049] This application provides a video processing method, comprising: performing frame segmentation on a video to be processed to obtain multiple video frames corresponding to the video to be processed; extracting video features of the target video frame at multiple sizes for any target video frame among the multiple video frames to obtain multi-size video features of the target video frame; obtaining audio features corresponding to the target video frame based on the audio corresponding to the video to be processed; obtaining audio-related video mask features for the target video frame based on the multi-size video features and the audio features; determining the target features of the sound-producing object corresponding to the audio query information in the multiple video frames based on pre-set audio query information for each video frame in the processed video features; the processed video features are video features obtained by feature fusion and temporal interaction of the multi-size video features and the audio features; obtaining a segmentation mask of the target sound-producing object for the target video frame based on the target features and the video mask features; the segmentation mask is used to represent the target sound-producing object of the target video frame. This video processing method first acquires multi-size video features of video frames and simultaneously obtains audio features corresponding to the target video frame. Then, it can obtain audio-related video mask features for the target video frame based on the multi-size video features and audio features. At the same time, based on pre-set audio query information for each video frame, it can determine the target features of the sound-producing object corresponding to the audio query information in multiple video frames. 
Finally, based on the target features and video mask features, it obtains a segmentation mask for the target sound-producing object in the target video frame, enabling accurate segmentation of the sound-producing object in the video. Furthermore, this method is applicable to complex scenarios where multiple sound-producing objects exist simultaneously in a video.
Attached Figure Description
[0050] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings.
[0051] Figure 1 is a flowchart of the video processing method provided in the first embodiment of this application;
[0052] Figure 2 is a detailed process diagram illustrating the video processing method provided in the first embodiment of this application;
[0053] Figure 3 is a schematic diagram of the video processing apparatus provided in the second embodiment of this application;
[0054] Figure 4 is a schematic diagram of an electronic device provided in the third embodiment of this application.
Detailed Implementation
[0055] Many specific details are set forth in the following description to provide a full understanding of this application. However, this application can be implemented in many other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application. Therefore, this application is not limited to the specific implementations disclosed below.
[0056] This application provides a video processing method, a video processing apparatus, an electronic device, and a computer storage medium. The following specific embodiments describe the video processing method, the video processing apparatus, the electronic device, and the computer storage medium. To more clearly illustrate the video processing method provided by the embodiments of this application, the application scenarios of the video processing method provided by the embodiments of this application are first introduced.
[0057] The video processing method of this application can be applied to scenarios where sound-producing objects in a video are automatically segmented. For example, when a video is two minutes long, assuming that user A speaks in the first minute and user B speaks in the second minute, this video processing method can segment the shape contours corresponding to user A and user B in the video. Specifically, the key is to segment the shape contour corresponding to user A in the first minute, because user A is speaking during that minute; similarly, the key in the second minute is to segment the shape contour corresponding to user B, because user B is speaking during that minute. If user A and user B speak simultaneously in the video, the shape contours corresponding to both user A and user B can be segmented at the same time.
[0058] The above illustration depicts one application scenario of the video processing method of this application. The embodiments of this application do not specifically limit the application scenario of the video processing method. The above-described application scenario is merely one embodiment of the video processing method provided in this application. The purpose of providing this application scenario embodiment is to facilitate understanding of the video processing method provided in this application, and not to limit the video processing method provided in this application. Other application scenarios of the video processing method in the embodiments of this application will not be elaborated upon.
[0059] First Embodiment
[0060] The first embodiment of this application provides a video processing method. Please refer to Figure 1, which is a flowchart of the video processing method provided in the first embodiment of this application.
[0061] The video processing method of this application embodiment includes the following steps:
[0062] Step S101: Perform frame segmentation on the video to be processed to obtain multiple video frames corresponding to the video to be processed.
[0063] In the video processing method of this application, in order to facilitate the segmentation of the sound-producing objects in the video, the video to be processed can be segmented into frames, thereby converting the segmentation of the sound-producing objects in the video to be processed into the segmentation of the sound-producing objects in the video frames.
[0064] The sound source is the object that emits sound in the video to be processed. There can be one or more sound sources in the video to be processed. Of course, after the video to be processed is divided into frames, there can be one or more sound sources in each video frame.
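As a non-limiting illustration of Step S101, the following sketch treats a decoded video as an array of shape (number of frames, height, width, 3) and picks a fixed number of frames spread evenly over the clip. The function name `sample_frames`, the array layout, and the uniform sampling policy are assumptions made for illustration; this application does not specify how the frames are selected.

```python
import numpy as np

def sample_frames(video: np.ndarray, num_frames: int) -> np.ndarray:
    """Pick `num_frames` frames spread evenly over the clip (assumed policy)."""
    total = video.shape[0]
    if not 1 <= num_frames <= total:
        raise ValueError("num_frames must be between 1 and the clip length")
    # Evenly spaced indices from the first frame to the last frame.
    idx = np.linspace(0, total - 1, num_frames).round().astype(int)
    return video[idx]

# Hypothetical clip: 30 RGB frames of 8x8 pixels, each filled with its own index.
clip = np.stack([np.full((8, 8, 3), i, dtype=np.uint8) for i in range(30)])
frames = sample_frames(clip, 3)
```

Each returned frame can then serve as a target video frame in the subsequent steps.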
[0065] Step S102: For any target video frame among multiple video frames, extract the video features of the target video frame at multiple sizes to obtain the multi-size video features of the target video frame.
[0066] After segmenting the video to be processed into frames, each segmented video frame can be used as a target video frame. For each target video frame, it can be input into a video encoder (such as a ResNet or ViT network) to extract multi-size video features. The video encoder can downsample the features of the target video frame to obtain video features of different sizes, i.e., multi-size video features. For example, video features can be extracted from the target video frame (original image), half of the target video frame, a quarter of the target video frame, or an eighth of the target video frame. In this embodiment, the video features can refer to visual features.
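The multi-size idea can be sketched as follows. In this minimal example, repeated 2x2 average pooling stands in for the learned video encoder (such as ResNet or ViT) mentioned above; the pooling operator and the function names `avg_pool2x` and `multi_size_features` are assumptions for illustration only, chosen to show how one frame yields features at full, 1/2, 1/4, and 1/8 size.

```python
import numpy as np

def avg_pool2x(feat: np.ndarray) -> np.ndarray:
    """Halve the height and width of an (H, W, C) map by 2x2 average pooling."""
    h, w, c = feat.shape
    h2, w2 = h // 2, w // 2
    # Crop to an even size, then average every 2x2 block.
    return feat[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2, c).mean(axis=(1, 3))

def multi_size_features(frame: np.ndarray, num_sizes: int = 4) -> list:
    """Return features at full, 1/2, 1/4 and 1/8 size (stand-in for an encoder)."""
    feats = [frame.astype(np.float64)]
    for _ in range(num_sizes - 1):
        feats.append(avg_pool2x(feats[-1]))
    return feats

# Hypothetical 64x48 RGB frame with random content.
frame = np.random.default_rng(0).random((64, 48, 3))
pyramid = multi_size_features(frame)
print([f.shape for f in pyramid])
```

A real encoder would also change the channel dimension at each size; the sketch keeps the three input channels to stay short.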
[0067] To better understand the multi-size video features, please refer to Figure 2, which is a detailed process diagram of the video processing method provided in the first embodiment of this application. In Figure 2, the video to be processed yields three video frames after frame segmentation. The multi-size video features of the first video frame are V1; the multi-size video features of the second video frame are V2; and the multi-size video features of the third video frame are V3. In Figure 2, the multi-size video features are video features at four sizes.
[0068] Step S103: Obtain the audio features corresponding to the target video frame based on the audio corresponding to the video to be processed.
[0069] In the video processing method of this application, it is also necessary to extract the audio features of the video to be processed. Specifically, the audio of the video to be processed can be extracted; then, the audio is input into an audio encoder to extract the audio features corresponding to each target video frame. For details, please refer to Figure 2: the audio features obtained by the audio encoder (such as VGGish) are A (the audio features of the first video frame are A1; the audio features of the second video frame are A2; the audio features of the third video frame are A3).
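By way of example only, the alignment between audio and video frames can be sketched as below. The audio track is cut into one slice per video frame and each slice is summarized by coarse log-scaled frequency-band magnitudes. This hand-written feature is a stand-in for a learned audio encoder such as VGGish and is not that encoder; the function name `per_frame_audio_features`, the number of bands, and the equal-length slicing are assumptions.

```python
import numpy as np

def per_frame_audio_features(wave: np.ndarray, num_frames: int, dim: int = 16) -> np.ndarray:
    """Return a (num_frames, dim) array holding one feature vector per video frame."""
    segments = np.array_split(wave, num_frames)   # one audio slice per frame
    feats = []
    for seg in segments:
        spectrum = np.abs(np.fft.rfft(seg))        # magnitude spectrum of the slice
        bands = np.array_split(spectrum, dim)      # coarse frequency bands
        feats.append(np.log1p([b.mean() for b in bands]))
    return np.stack(feats)

# Hypothetical 3-second, 440 Hz tone sampled at 16 kHz, paired with three frames.
sr = 16000
t = np.arange(3 * sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
A = per_frame_audio_features(wave, num_frames=3)   # rows play the role of A1, A2, A3
```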
[0070] Step S104: Based on the multi-size video features and audio features, obtain the audio-related video mask features for the target video frame.
[0071] After the multi-size video features and the audio features are obtained, the multi-size video features and audio features of the multiple video frames are passed through a pixel encoder, which performs cross-modal feature fusion and temporal interaction on them. This yields the audio-related video mask features for the target video frame. Since audio features and video features belong to different modalities, cross-modal feature fusion is necessary. The temporal interaction operates on the temporal sequence formed by the multi-size video features of the multiple video frames; these are sequential features, and the data from the multiple frames, ordered in time, constitutes the temporal sequence.
[0072] In this embodiment, as a way to obtain audio-related video mask features for a target video frame based on multi-size video features and audio features, it can refer to: performing feature fusion and temporal interaction on multi-size video features and audio features to obtain audio-related video mask features for a target video frame.
[0073] Specifically, feature fusion and temporal interaction of multi-size video features and audio features to obtain audio-related video mask features for a target video frame can refer to:
[0074] First, a first attention mechanism is used to fuse multi-size video features and audio features to obtain fused video features with fused audio features. Then, a second attention mechanism is used to process the multi-size video features and the fused video features to obtain aggregated video features of the multi-size video features at different sizes of the same pixel. Next, a third attention mechanism is used to perform temporal interaction processing on the audio features and the aggregated video features to obtain processed video features after temporal interaction processing. Finally, based on the processed video features, audio-related video mask features for the target video frame are obtained.
[0075] The aforementioned use of a third attention mechanism to perform temporal interaction processing on audio features and aggregated video features to obtain processed video features after temporal interaction processing can refer to the following: First, for audio features, initial features corresponding to the audio features in multiple video frames are determined in the aggregated video features; then, self-attention is used to enhance the initial features to determine the features after temporal interaction between different target video frames; finally, the features after temporal interaction are mapped to obtain the processed video features after temporal interaction processing.
[0076] Specifically, please refer to Figure 2. The pixel encoder consists of three parts: cross-modal attention (an example of the first attention mechanism), multi-scale deformable attention (the MA module in Figure 2, an example of the second attention mechanism), and the ABTI module (Audio-Bridged Temporal Interaction, an example of the third attention mechanism). The cross-modal attention fuses the multi-size video features $V_t$ of frame t among the multiple video frames with the audio features $A_t$ of frame t to obtain the fused video features $M_t$, which incorporate the audio features. For details, please refer to the following formulas:
[0077]
[0078]
[0079] Here, $f_q$, $f_k$, $f_v$, and $f_w$ denote fully connected transform layers, $\otimes$ denotes matrix multiplication, $T$ denotes matrix transpose, and Softmax is an activation function that normalizes a numerical vector into a probability distribution vector whose entries sum to 1. The other quantities appearing in the formulas are intermediate calculation results.
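Because the formulas themselves are not reproduced in this text, the following is only a hedged sketch of one common way to build such a cross-modal attention from four fully connected layers. It assumes that video pixels act as queries, that the audio supplies keys and values, that the scores are scaled by the square root of the channel count, and that the result is added back to the video features as a residual. These choices, the name `cross_modal_fusion`, and the use of several audio tokens per frame are assumptions and may differ from the formulas of this application.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(V_t, A_t, Wq, Wk, Wv, Ww):
    """Fuse audio into one size of the frame-t video features.
    V_t: (HW, C) pixel features; A_t: (Na, C) audio tokens (assumed layout)."""
    q = V_t @ Wq                                    # f_q: queries from video pixels
    k = A_t @ Wk                                    # f_k: keys from audio
    v = A_t @ Wv                                    # f_v: values from audio
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (HW, Na); each row sums to 1
    M_t = V_t + (attn @ v) @ Ww                     # f_w plus an assumed residual
    return M_t, attn

# Hypothetical sizes: 6x5 pixels, 8 channels, 4 audio tokens.
rng = np.random.default_rng(0)
HW, Na, C = 30, 4, 8
V_t, A_t = rng.normal(size=(HW, C)), rng.normal(size=(Na, C))
Wq, Wk, Wv, Ww = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(4))
M_t, attn = cross_modal_fusion(V_t, A_t, Wq, Wk, Wv, Ww)
```

The fused output keeps the shape of the video features, so it can be computed independently at each of the sizes.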
[0080] Multi-scale deformable attention takes the multi-size video features $V_t$ and the fused video features $M_t$ as input, and aggregates the video features at the same pixel location within the same video frame across the different sizes. Multi-scale deformable attention is an existing technique aimed at fusing features between feature maps of different sizes.
[0081] The ABTI module is an audio-bridged temporal interaction module. When using the ABTI module, the following operations are performed:
[0082] First, with the audio used as the query in cross-modal attention, cross-modal attention is computed between the audio features $A_p$ of the multiple video frames and the aggregated video features $F_q$ of the multiple video frames, to obtain the initial features $O_{pq}$ corresponding to the audio features of each video frame across the multiple video frames. The calculation process is as follows:
[0083]
[0084]
[0085] Here, $f_a$, $f_m$, and $f_n$ denote fully connected transform layers, $\otimes$ denotes matrix multiplication, $T$ denotes matrix transpose, and Softmax is an activation function that normalizes a numerical vector into a probability distribution vector whose entries sum to 1. The other quantity subscripted $pq$ is an intermediate calculation result. $O_{pq}$ is the shallow visual feature, within the video features of frame q, that corresponds to the audio feature of frame p; it is also only an intermediate calculation result. The features aggregated by the subsequent audio query encoder are the deep visual features corresponding to each audio query across all video frames, and these are directly related to the final prediction result.
[0086] In the process involved in step S104, there are multiple audio features. The number of audio features is the same as the number of video frames. The audio features can be determined based on the audio information corresponding to each video frame. For example, if the speaker in the first video frame is user A, then the speaker corresponding to the audio features of the first video frame is user A. In fact, the video features of user A from multiple video frames are obtained.
[0087] Next, for each audio feature, its initial features across the multiple video frames (the initial features can be a set of vectors output by a convolutional neural network) are enhanced using self-attention, thereby achieving temporal interaction, across different video frames, of the sound-producing object corresponding to that audio feature. The features after interaction are then mapped back onto the original video features to obtain the temporally enhanced video features (i.e., the processed video features after temporal interaction processing):
[0088]
[0089]
[0090] Here, $f_o$ denotes a fully connected transform layer, $\otimes$ denotes matrix multiplication, $\Sigma$ denotes accumulation, and the $T$ on the accumulation symbol denotes the total number of video frames.
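The three ABTI operations described above can be sketched as follows. Since the formulas are not reproduced in this text, the sketch fills in details by assumption: scaled dot-product scores, self-attention along the frame axis without extra projections, re-use of the first-step attention weights to map features back to pixels, and a residual connection. The name `abti` and all tensor layouts are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def abti(A, F, Wa, Wm, Wn, Wo):
    """A: (T, C), one audio feature per frame; F: (T, HW, C) aggregated video features."""
    T, HW, C = F.shape
    # Step 1: the audio of frame p queries the pixels of frame q, giving O[p, q].
    q = A @ Wa                                    # f_a, shape (T, C)
    k = F @ Wm                                    # f_m, shape (T, HW, C)
    v = F @ Wn                                    # f_n, shape (T, HW, C)
    W = softmax(np.einsum("pc,qnc->pqn", q, k) / np.sqrt(C))   # weights over pixels
    O = np.einsum("pqn,qnc->pqc", W, v)           # initial features, (T, T, C)
    # Step 2: self-attention along the frame axis q, separately for each audio p.
    S = softmax(np.einsum("pqc,prc->pqr", O, O) / np.sqrt(C))
    O_hat = np.einsum("pqr,prc->pqc", S, O)       # features after temporal interaction
    # Step 3: map back to the pixels of frame q and accumulate over the T audio features.
    update = np.einsum("pqn,pqc->qnc", W, O_hat @ Wo)          # f_o, then sum over p
    return F + update                             # assumed residual connection

# Hypothetical sizes: 3 frames, 20 pixels per frame, 8 channels.
rng = np.random.default_rng(0)
T, HW, C = 3, 20, 8
A, F = rng.normal(size=(T, C)), rng.normal(size=(T, HW, C))
Wa, Wm, Wn, Wo = (rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(4))
F_enh = abti(A, F, Wa, Wm, Wn, Wo)
```

In this sketch the audio features act as the bridge: pixels of different frames only exchange information through the per-audio features gathered in the first step.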
[0091] Finally, the first three scales of the temporally enhanced video features (as labeled in Figure 2) are used as input to the subsequent audio query encoder, and the last scale (also labeled in Figure 2) is used as the video mask feature, which is then used to obtain the segmentation mask. The video mask feature is the largest-sized video feature among the multi-size video features.
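The routing of the four sizes can be illustrated with a short sketch. It assumes that the temporally enhanced features of one frame are held in a list ordered so that the last entry is the largest map; the function name `split_scales` and this ordering are assumptions for illustration.

```python
import numpy as np

def split_scales(processed):
    """Split processed features into audio query encoder inputs and the mask feature.
    `processed` is assumed to be ordered so that the last map is the largest."""
    *encoder_inputs, mask_feature = processed
    if any(f.shape[0] > mask_feature.shape[0] for f in encoder_inputs):
        raise ValueError("expected the last scale to be the largest")
    return encoder_inputs, mask_feature

# Hypothetical temporally enhanced features of one frame at four sizes.
C = 8
processed = [np.zeros((s, s, C)) for s in (8, 16, 32, 64)]
encoder_inputs, mask_feature = split_scales(processed)
```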
[0092] Step S105: Based on the pre-set audio query information for each video frame, determine, in the processed video features, the target features of the sound-producing object corresponding to the audio query information in the multiple video frames.
[0093] In this embodiment, the processed video features are the video features obtained by performing feature fusion and temporal interaction on the multi-size video features and the audio features.
[0094] In this embodiment, one way to determine the target features of the sound-producing object corresponding to the audio query information in the multiple video frames, based on the pre-set audio query information for each video frame, is as follows: First, based on the pre-set audio query information for each video frame, the sound-producing object corresponding to the audio query information is determined; then, from the processed video features, the video features at a specified size are obtained; then, for each piece of audio query information, the video features at the specified size and the audio query information are used as input to the audio query encoder to obtain the target features of the sound-producing object in the multiple video frames. During step S105, the pre-set audio query information for each video frame is not fixed: starting from the initial audio query information, it is continuously updated by the audio query encoder.
[0095] The aforementioned audio query encoder includes: a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module.
[0096] To obtain the target features of the sound-producing object across multiple video frames for each audio query, the process involves: First, for each audio query, using the video features and audio query at a specified size as input to a multi-head cross-attention module, and obtaining its output. Then, using the output of the multi-head cross-attention module as input to a multi-head self-attention module, and obtaining its output. Next, using the output of the multi-head self-attention module as input to a feedforward network module, and obtaining its output. Finally, based on the output of the feedforward network module, the target features of the sound-producing object across multiple video frames are obtained.
[0097] As can be seen from Figure 2, the audio query encoder includes a multi-layer attention mechanism, with each layer consisting of a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module.
[0098] The audio query information (initial audio query information) input to the first-layer attention mechanism is the audio features of each video frame corresponding to the audio; the audio query information input to the second-layer attention mechanism or higher is the output result information of the previous layer attention mechanism; the output result of the last layer attention mechanism is the target features of the sounding object in multiple video frames.
[0099] Specifically, please continue to refer to Figure 2. In the audio query encoder, a number of audio queries equal to the number of video frames is first set in advance. Each audio query represents the sound-producing object in the corresponding video frame (there can be one or more such objects). In the audio query encoder, the target features of these sound-producing objects in all video frames are gradually aggregated through a multi-layer attention mechanism. The target features can be regarded as the aggregation of the video features of all pixels, across the video frames, that belong to the same sound-producing object. In Figure 2, ×N indicates that the three modules MHCA, MHSA, and FFN are repeated N times.
[0100] Taking the l-th layer of the multi-layer attention mechanism as an example, the inputs are the audio query features $A_{l-1}$ output by the (l-1)-th layer and the video features output by the pixel encoder. After passing through the multi-head cross-attention (MHCA) module, the multi-head self-attention (MHSA) module, the feedforward network (FFN) module, and layer normalization (LN), the audio query features $A_l$ output by this layer are obtained:
[0101]
[0102]
[0103]
[0104] For example, to obtain the second-layer audio query feature A_2, the first-layer audio query feature A_1 is input to the second-layer MHCA; the output X_2 of the second-layer MHCA is then input to the second-layer MHSA; and the output of the second-layer MHSA is input to the second-layer FFN to obtain the second-layer audio query feature A_2. Obtaining A_2 in this way is an example of updating the audio query information.
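The formulas of paragraphs [0101] to [0103] are not legible in this text. One form consistent with the surrounding description is given below as a hedged sketch only. The residual connections, the symbol F for the video features output by the pixel encoder, and the symbol X'_l for the MHSA output are assumptions and are not taken from the original formulas.

```latex
\begin{aligned}
X_l  &= \mathrm{LN}\big(\mathrm{MHCA}(A_{l-1}, F) + A_{l-1}\big) \\
X'_l &= \mathrm{LN}\big(\mathrm{MHSA}(X_l) + X_l\big) \\
A_l  &= \mathrm{LN}\big(\mathrm{FFN}(X'_l) + X'_l\big)
\end{aligned}
```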
[0105] Step S106: Based on the target features and video mask features, obtain the segmentation mask for the target sound object in the target video frame; the segmentation mask is used to represent the target sound object in the target video frame.
[0106] After obtaining the target features and the video mask features of the multiple video frames, a segmentation mask for the target sound-producing object in the target video frame is obtained based on the target features and the video mask features. Specifically:
[0107] The target features and video mask features are subjected to matrix multiplication and preset function operations to obtain the segmentation mask of the target sound object for the target video frame.
[0108] After repeating the above process L times, the target feature A_L corresponding to each audio query is obtained. The target feature for frame t and the video mask feature for frame t are then matrix-multiplied and passed through the sigmoid activation function to obtain the segmentation mask of the target sound-producing object corresponding to each target feature:
[0109]
[0110] Where σ is the sigmoid function, the product of the two features is a matrix multiplication, T represents matrix transpose, L represents the total number of layers, l represents the l-th layer, and the value of l ranges from 1 to N.
[0111] A segmentation mask can be formed by setting the pixel value to 1 in the foreground of the image (i.e., the region where the sound-producing object is located) and to 0 in the background region.
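The mask computation of step S106 can be illustrated with a short NumPy sketch. The function name, the array shapes, and the 0.5 threshold used to binarize the sigmoid output are assumptions for illustration; the text above only specifies matrix multiplication followed by a sigmoid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segmentation_mask(target_feat_t, mask_feat_t, threshold=0.5):
    """target_feat_t: (d,) target feature of one audio query for frame t.
    mask_feat_t: (H, W, d) video mask features of frame t.
    Returns a binary mask (H, W) and the per-pixel probabilities (H, W)."""
    h, w, d = mask_feat_t.shape
    # Matrix multiplication of the target feature with the transposed mask features.
    logits = target_feat_t.reshape(1, d) @ mask_feat_t.reshape(h * w, d).T  # (1, H*W)
    probs = sigmoid(logits).reshape(h, w)
    # 1 marks the sound-producing object (foreground), 0 marks the background.
    # The threshold value is an assumption.
    mask = (probs > threshold).astype(np.uint8)
    return mask, probs
```

Pixels whose mask features align with the target feature receive probabilities close to 1 and are kept as foreground.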
[0112] After obtaining the segmentation mask of the target sound-producing object corresponding to each target feature, the shape and outline of the sound-producing object in each video frame can be identified, as shown in Figure 2. As can be seen from Figure 2, the sound-producing object in the first video frame is the user on the left, the sound-producing object in the second video frame is the user on the right, and the sound-producing object in the third video frame is the guitar.
[0113] This method uses audio query information to represent the sound-producing object in each video frame and uses that information to extract the video features corresponding to the sound-producing object, thereby establishing a correlation between audio features and video features at the level of the sound-producing object. Compared with existing methods in which audio and video features interact only at the pixel level, the method of this application is better able to identify the sound-producing object in the video quickly and accurately. In addition, it uses an audio-bridged temporal interaction module to perform the temporal interaction. Using audio as the bridge filters out video features in the video frame that are unrelated to the audio, making subsequent data processing more efficient.
[0114] This application provides a video processing method, comprising: performing frame segmentation on a video to be processed to obtain multiple video frames corresponding to the video to be processed; for any target video frame among the multiple video frames, extracting video features of the target video frame at multiple sizes to obtain multi-size video features of the target video frame; obtaining audio features corresponding to the target video frame based on the audio corresponding to the video to be processed; obtaining audio-related video mask features for the target video frame based on the multi-size video features and the audio features; determining, in the processed video features and based on audio query information pre-set for each video frame, the target features of the sound-producing object corresponding to the audio query information in the multiple video frames, where the processed video features are the video features obtained by feature fusion and temporal interaction of the multi-size video features and the audio features; and obtaining a segmentation mask of the target sound-producing object for the target video frame based on the target features and the video mask features, where the segmentation mask is used to represent the target sound-producing object of the target video frame. This video processing method first acquires the multi-size video features of the video frames and the audio features corresponding to the target video frame. It then obtains audio-related video mask features for the target video frame based on the multi-size video features and the audio features. At the same time, based on the audio query information pre-set for each video frame, it determines the target features of the sound-producing object corresponding to the audio query information in the multiple video frames.
Finally, based on the target features and video mask features, it obtains a segmentation mask for the target sound-producing object in the target video frame, enabling accurate segmentation of the sound-producing object in the video. Furthermore, this method is applicable to complex scenarios where multiple sound-producing objects exist simultaneously in a video.
[0115] Second Embodiment
[0116] Corresponding to the video processing method provided in the first embodiment of this application, the second embodiment of this application also provides a video processing apparatus. Since the apparatus embodiment is basically similar to the first embodiment, the description is relatively simple; relevant details can be found in the description of the first embodiment. The apparatus embodiments described below are merely illustrative.
[0117] Please refer to Figure 3, which is a schematic diagram of the video processing apparatus provided in the second embodiment of this application.
[0118] The video processing device 300 includes:
[0119] The frame segmentation processing unit 301 is used to perform frame segmentation on the video to be processed to obtain multiple video frames corresponding to the video to be processed.
[0120] The video feature extraction unit 302 is used to extract, for any target video frame among the multiple video frames, video features of the target video frame at multiple sizes, and obtain multi-size video features of the target video frame.
[0121] The audio feature acquisition unit 303 is used to obtain audio features corresponding to the target video frame based on the audio corresponding to the video to be processed;
[0122] The video mask feature acquisition unit 304 is used to obtain, based on the multi-size video features and the audio features, video mask features related to the audio for the target video frame;
[0123] The target feature determination unit 305 is used to determine, in the processed video features and based on the audio query information pre-set for each video frame, the target features of the sound-producing object corresponding to the audio query information in the multiple video frames; the processed video features are the video features after feature fusion and temporal interaction of the multi-size video features and the audio features;
[0124] The segmentation mask obtaining unit 306 is used to obtain a segmentation mask for the target sound object of the target video frame based on the target features and the video mask features; the segmentation mask is used to represent the target sound object of the target video frame.
[0125] Optionally, the video mask feature acquisition unit is specifically used for:
[0126] The multi-size video features and the audio features are fused and temporally interacted to obtain video mask features related to the audio for the target video frame.
[0127] Optionally, the video mask feature acquisition unit is specifically used for:
[0128] The first attention mechanism is used to fuse the multi-size video features and the audio features to obtain fused video features that incorporate the audio features;
[0129] A second attention mechanism is used to process the multi-size video features and the fused video features to obtain aggregated video features of the multi-size video features at different sizes of the same pixel;
[0130] A third attention mechanism is used to perform temporal interaction processing on the audio features and the aggregated video features to obtain the processed video features after temporal interaction processing.
[0131] Based on the processed video features, obtain the audio-related video mask features for the target video frame.
[0132] Optionally, the video mask feature acquisition unit is specifically used for:
[0133] For the audio features, initial features corresponding to the audio features in the multiple video frames are determined from the aggregated video features;
[0134] Self-attention is used to enhance the initial features to determine the features after temporal interaction between different target video frames;
[0135] The features after the temporal interaction are mapped to obtain the processed video features after the temporal interaction.
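The three temporal-interaction operations above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions: the function name, the use of single-head dot-product attention without learned projections, and in particular the way the enhanced per-frame features are mapped back onto the pixels (step 3) are hypothetical choices, since the text does not fix them.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def audio_bridged_temporal_interaction(audio_feats, video_feats):
    """audio_feats: (T, d) audio feature per frame.
    video_feats: (T, P, d) aggregated video features, P pixels per frame.
    Returns processed video features of shape (T, P, d)."""
    T, P, d = video_feats.shape
    scale = np.sqrt(d)

    # Step 1: for each frame, the audio feature attends to that frame's pixels,
    # yielding one audio-related initial feature per frame.
    attn = softmax(np.einsum("td,tpd->tp", audio_feats, video_feats) / scale, axis=-1)  # (T, P)
    initial = np.einsum("tp,tpd->td", attn, video_feats)                                # (T, d)

    # Step 2: self-attention across the T per-frame features (temporal interaction).
    w = softmax(initial @ initial.T / scale, axis=-1)  # (T, T)
    enhanced = initial + w @ initial                   # (T, d)

    # Step 3 (assumed mapping): redistribute each frame's enhanced feature to its
    # pixels, weighted by the attention each pixel received in step 1.
    return video_feats + attn[:, :, None] * enhanced[:, None, :]
```

Because only T audio-related features take part in the self-attention, the temporal interaction operates on a T x T attention matrix rather than on all pixels of all frames.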
[0136] Optionally, the target feature determination unit is specifically used for:
[0137] Based on pre-set audio query information for each video frame, determine the sound source corresponding to the audio query information;
[0138] Based on the processed video features, obtain the video features of the processed video features at a specified size;
[0139] For each audio query, the video features at the specified size and the audio query are used as input to the audio query encoder to obtain the target features of the sound-producing object in the multiple video frames.
[0140] Optionally, the audio query encoder includes: a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module;
[0141] The target feature determination unit is specifically used for:
[0142] For each audio query, the video features at the specified size and the audio query are used as input to the multi-head cross-attention module to obtain the output result information of the multi-head cross-attention module;
[0143] The output information of the multi-head cross-attention module is used as the input information of the multi-head self-attention module to obtain the output information of the multi-head self-attention module;
[0144] The output information of the multi-head self-attention module is used as the input information of the feedforward network module to obtain the output information of the feedforward network module;
[0145] Based on the output information of the feedforward network module, the target features of the sound-emitting object in the multiple video frames are obtained.
[0146] Optionally, the audio query encoder includes a multi-layer attention mechanism, each layer of which includes a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module.
[0147] The audio query information input to the first-layer attention mechanism is the audio features of each video frame corresponding to the audio; the audio query information input to the second-layer attention mechanism or higher is the output result information of the previous-layer attention mechanism; and the output result of the last-layer attention mechanism is the target feature of the sound-producing object in the multiple video frames.
[0148] Optionally, the segmentation mask obtaining unit is specifically used for:
[0149] The target features and the video mask features are subjected to matrix multiplication and preset function operations to obtain a segmentation mask for the target sound object in the target video frame.
[0150] Third Embodiment
[0151] Corresponding to the method of the first embodiment of this application, the third embodiment of this application also provides an electronic device.
[0152] Figure 4 is a schematic diagram of an electronic device provided in the third embodiment of this application.
[0153] In this embodiment, an optional hardware structure of the electronic device 400 may be as shown in Figure 4, including: at least one processor 401, at least one memory 402, and at least one communication bus 405; the memory 402 contains a program 403 and data 404.
[0154] The bus 405 can be a communication device that transmits data between components within the electronic device 400, such as an internal bus (e.g., a CPU-memory bus, where CPU stands for central processing unit) or an external bus (e.g., a Universal Serial Bus (USB) port or a Peripheral Component Interconnect Express (PCIe) port).
[0155] Additionally, the electronic device also includes at least one network interface 406 and at least one peripheral interface 407. The network interface 406 provides wired or wireless communication with an external network 408 (e.g., the Internet, an intranet, a local area network, a mobile communication network, etc.). In some embodiments, the network interface 406 may include any number of network interface controllers (NICs), radio frequency (RF) modules, repeaters, transceivers, modems, routers, gateways, wired network adapters, wireless network adapters, Bluetooth adapters, infrared adapters, near field communication (NFC) adapters, cellular network chips, etc., in any combination.
[0156] The peripheral interface 407 is used to connect to peripherals, such as peripheral 1 (409 in Figure 4), peripheral 2 (410 in Figure 4), and peripheral 3 (411 in Figure 4). Peripherals are peripheral devices, which may include, but are not limited to, cursor control devices (such as mice, touchpads, or touchscreens), keyboards, displays (such as cathode ray tube displays, liquid crystal displays, or light-emitting diode displays), video input devices (such as cameras or input interfaces coupled to video files), etc.
[0157] The processor 401 may be a CPU, an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application.
[0158] Memory 402 may include high-speed RAM (Random Access Memory) memory, and may also include non-volatile memory, such as at least one disk storage device.
[0159] In this embodiment, the processor 401 calls the program and data stored in the memory 402 to execute the method of the first embodiment of this application.
[0160] Fourth Embodiment
[0161] Corresponding to the method of the first embodiment of this application, the fourth embodiment of this application also provides a computer storage medium storing a computer program that is executed by a processor to perform the method of the first embodiment of this application.
[0162] Although this application discloses preferred embodiments as described above, it is not intended to limit this application. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of this application. Therefore, the scope of protection of this application should be determined by the scope defined in the claims of this application.
[0163] In a typical configuration, a computing device includes one or more processors (CPUs), input / output interfaces, network interfaces, and memory. Memory may include non-persistent storage in computer-readable media, random access memory (RAM), and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0164] 1. Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
[0165] 2. Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0166] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation portals are provided for users to choose to authorize or refuse.
Claims
1. A video processing method, characterized in that it comprises: performing frame segmentation on a video to be processed to obtain multiple video frames corresponding to the video to be processed; for any target video frame among the multiple video frames, extracting video features of the target video frame at multiple sizes to obtain multi-size video features of the target video frame; obtaining, based on the audio corresponding to the video to be processed, audio features corresponding to the target video frame; obtaining, based on the multi-size video features and the audio features, audio-related video mask features for the target video frame, including: performing feature fusion and temporal interaction on the multi-size video features and the audio features to obtain the audio-related video mask features for the target video frame; determining, in processed video features and based on audio query information pre-set for each video frame, target features of the sound-producing object corresponding to the audio query information in the multiple video frames, wherein the processed video features are video features resulting from feature fusion and temporal interaction between the multi-size video features and the audio features, and the audio query information is continuously updated through an audio query encoder; and obtaining, based on the target features and the video mask features, a segmentation mask for the target sound-producing object in the target video frame, wherein the segmentation mask is used to represent the target sound-producing object in the target video frame.
2. The method according to claim 1, characterized in that, The step of fusing and temporally interacting the multi-size video features and the audio features to obtain audio-related video mask features for the target video frame includes: The first attention mechanism is used to fuse the multi-size video features and the audio features to obtain fused video features that incorporate the audio features; A second attention mechanism is used to process the multi-size video features and the fused video features to obtain aggregated video features of the multi-size video features at different sizes of the same pixel; A third attention mechanism is used to perform temporal interaction processing on the audio features and the aggregated video features to obtain the processed video features after temporal interaction processing. Based on the processed video features, obtain the audio-related video mask features for the target video frame.
3. The method according to claim 2, characterized in that, The step of employing a third attention mechanism to perform temporal interaction processing on the audio features and the aggregated video features to obtain processed video features after temporal interaction processing includes: For the audio features, initial features corresponding to the audio features in the multiple video frames are determined from the aggregated video features; Self-attention is used to enhance the initial features to determine the features after temporal interaction between different target video frames; The features after the temporal interaction are mapped to obtain the processed video features after the temporal interaction.
4. The method according to claim 1, characterized in that the step of determining, in the processed video features and based on the pre-set audio query information for each video frame, the target features of the sound-producing object corresponding to the audio query information in the multiple video frames includes: determining, based on the pre-set audio query information for each video frame, the sound-producing object corresponding to the audio query information; obtaining, based on the processed video features, the video features of the processed video features at a specified size; and, for each audio query, using the video features at the specified size and the audio query as input to the audio query encoder to obtain the target features of the sound-producing object in the multiple video frames.
5. The method according to claim 4, characterized in that, The audio query encoder includes: a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module; For each audio query, the process of using the video features at the specified size and the audio query information as input to the audio query encoder to obtain the target features of the sound-producing object in the multiple video frames includes: For each audio query, the video features at the specified size and the audio query are used as input to the multi-head cross-attention module to obtain the output result information of the multi-head cross-attention module; The output information of the multi-head cross-attention module is used as the input information of the multi-head self-attention module to obtain the output information of the multi-head self-attention module; The output information of the multi-head self-attention module is used as the input information of the feedforward network module to obtain the output information of the feedforward network module; Based on the output information of the feedforward network module, the target features of the sound-emitting object in the multiple video frames are obtained.
6. The method according to claim 5, characterized in that, The audio query encoder includes a multi-layer attention mechanism, with each layer of the attention mechanism consisting of a multi-head cross-attention module, a multi-head self-attention module, and a feedforward network module. The audio query information input to the first-layer attention mechanism is the audio features of each video frame corresponding to the audio; the audio query information input to the second-layer attention mechanism or higher is the output result information of the previous-layer attention mechanism; and the output result of the last-layer attention mechanism is the target features of the sound-producing object in the multiple video frames.
7. The method according to claim 1, characterized in that, The step of obtaining a segmentation mask for the target sound-emitting object in the target video frame based on the target features and the video mask features includes: The target features and the video mask features are subjected to matrix multiplication and preset function operations to obtain a segmentation mask for the target sound object in the target video frame.
8. A video processing apparatus, characterized in that it comprises: a frame segmentation processing unit, used to perform frame segmentation on a video to be processed to obtain multiple video frames corresponding to the video to be processed; a video feature extraction unit, used to extract, for any target video frame among the multiple video frames, video features of the target video frame at multiple sizes to obtain multi-size video features of the target video frame; an audio feature acquisition unit, used to obtain audio features corresponding to the target video frame based on the audio corresponding to the video to be processed; a video mask feature acquisition unit, configured to obtain, based on the multi-size video features and the audio features, video mask features related to the audio for the target video frame, including: performing feature fusion and temporal interaction on the multi-size video features and the audio features to obtain the video mask features related to the audio for the target video frame; a target feature determination unit, used to determine, in processed video features and based on audio query information pre-set for each video frame, the target features of the sound-producing object corresponding to the audio query information in the multiple video frames, wherein the processed video features are video features after feature fusion and temporal interaction between the multi-size video features and the audio features, and the audio query information is continuously updated by an audio query encoder; and a segmentation mask obtaining unit, used to obtain a segmentation mask for the target sound-producing object of the target video frame based on the target features and the video mask features, wherein the segmentation mask is used to represent the target sound-producing object of the target video frame.
9. An electronic device, characterized in that it comprises: a processor; and a memory for storing a computer program that is executed by the processor to perform the method described in any one of claims 1-7.
10. A computer storage medium, characterized in that, The computer storage medium stores a computer program that is executed by a processor to perform the method described in any one of claims 1-7.