Video classification method, model training method, device, medium and equipment

CN115391600B · Active · Publication Date: 2026-06-19 · BEIJING YOUZHUJU NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING YOUZHUJU NETWORK TECH CO LTD
Filing Date
2022-08-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, audio information is not effectively utilized in the video classification and labeling process, resulting in inaccurate feature extraction and affecting classification performance.

Method used

By extracting N video frames and corresponding N audio frames from the target video, extracting features from the video frames and audio frames, and fusing them to form dual-stream fusion features, then performing temporal fusion, the final classification label of the video is determined.

Benefits of technology

It improves the recognition accuracy of video classification tags and enhances the ability to represent video content by utilizing the correlation information between video and audio.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN115391600B_ABST

Abstract

This disclosure relates to a video classification method, model training method, apparatus, medium, and device. The video classification method includes: determining a target video and a target audio corresponding to the target video; extracting N video frames from the target video and segmenting the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1; extracting video features corresponding to each video frame and audio features corresponding to each audio frame; for each video frame, fusing the video features of the video frame with the audio features of the corresponding audio frame to obtain a corresponding dual-stream fusion feature; fusing the N dual-stream fusion features temporally to obtain a target fusion feature; and determining the classification label of the target video based on the target fusion feature. This disclosure can more accurately identify the classification label of the target video.

Description

Technical Field

[0001] This disclosure relates to the field of data processing technology, and more specifically, to a video classification method, a model training method, an apparatus, a medium, and a device.

Background Technology

[0002] Video classification refers to categorizing the content of a video and assigning it category tags, such as food, history, and travel. In the process of video tag recognition, the model's information primarily comes from the video's visuals. However, in real-world business scenarios, video and audio often coexist. In many ways, audio can serve as an important supplement to video information. Therefore, efficiently utilizing the audio information of a video should significantly improve the accuracy of video classification tag recognition.

[0003] In existing technologies, the methods for using audio information in video tag recognition tasks include: extracting a scene feature from the video frames and an audio feature from the audio, concatenating the scene feature and the audio feature as the target feature of the video, and classifying the video based on the target feature. However, this feature processing is too simplistic, and the target feature still cannot truly reflect the content of the video, so the classification result is not accurate enough.

Summary of the Invention

[0004] This summary section is provided to briefly introduce the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.

[0005] In a first aspect, this disclosure provides a video classification method, including:

[0006] determining a target video and a target audio corresponding to the target video;

[0007] extracting N video frames from the target video, and dividing the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1;

[0008] extracting the video features corresponding to each video frame and the audio features corresponding to each audio frame;

[0009] for each video frame, fusing the video features of the video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion feature;

[0010] fusing the N dual-stream fusion features in time sequence to obtain a target fusion feature; and

[0011] determining the classification label of the target video based on the target fusion feature.

[0012] In a second aspect, this disclosure provides a model training method for training an end-to-end classification model, wherein the end-to-end classification model includes a backbone network, a fusion model, and a classifier, and the method includes:

[0013] Multiple training samples are identified, each containing a training video and its corresponding training audio.

[0014] For the training samples, N video frames are extracted from the training video, and the training audio is divided into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0015] The N video frames and the N audio frames are input into the end-to-end classification model;

[0016] The training loss is calculated based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0017] Based on the training loss, the model parameters in the end-to-end classification model are updated using gradient descent.

[0018] In the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model.

[0019] The fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features, fuse N dual-stream fusion features in time sequence to obtain the target fusion feature, and output the target fusion feature to the classifier.

[0020] The classifier is used to output corresponding classification labels based on the target fusion features.

[0021] In a third aspect, this disclosure provides a video classification device, comprising:

[0022] The audio and video determination module is used to determine the target video and the target audio corresponding to the target video;

[0023] The audio and video frame segmentation module is used to extract N video frames from the target video and to segment the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0024] The feature extraction module is used to extract the video features corresponding to each video frame and the audio features corresponding to each audio frame.

[0025] The dual-stream fusion module is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features.

[0026] The temporal fusion module is used to fuse N dual-stream fusion features in a temporal sequence to obtain the target fusion feature;

[0027] The video classification module is used to determine the classification label of the target video based on the target fusion features.

[0028] In a fourth aspect, this disclosure provides a model training apparatus for training an end-to-end classification model, the end-to-end classification model including a backbone network, a fusion model, and a classifier, the apparatus comprising:

[0029] The training data determination module is used to determine multiple training samples, each of which contains a training video and a corresponding training audio.

[0030] The training data frame segmentation module is used to extract N video frames from the training video for the training samples, and to segment the training audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0031] A frame input module is used to input the N video frames and the N audio frames into the end-to-end classification model;

[0032] The loss calculation module is used to calculate the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0033] The parameter training module is used to update the model parameters in the end-to-end classification model using gradient descent based on the training loss.

[0034] In the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model.

[0035] The fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features, fuse N dual-stream fusion features in time sequence to obtain the target fusion feature, and output the target fusion feature to the classifier.

[0036] The classifier is used to output corresponding classification labels based on the target fusion features.

[0037] In a fifth aspect, this disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the method described in the first or second aspect.

[0038] In a sixth aspect, this disclosure provides an electronic device, including:

[0039] A storage device on which computer programs are stored;

[0040] A processing device for executing the computer program in the storage device to implement the steps of the method described in the first or second aspect.

[0041] The above technical solution first extracts N video frames from the target video and divides the target audio into N audio frames, with each of the N video frames corresponding to one of the N audio frames. Then, utilizing the correspondence between video and audio, the video features of each video frame are fused with the audio features of the corresponding audio frame to obtain N dual-stream fusion features. Each dual-stream fusion feature incorporates the correlation information between the video and audio at the corresponding frame in the target video. These dual-stream fusion features are then fused temporally, resulting in a target fusion feature that more accurately represents the video content. Consequently, the classification label of the target video determined based on this target fusion feature is more accurate.

[0042] Other features and advantages of this disclosure will be described in detail in the following detailed description section.

Attached Figure Description

[0043] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale. In the drawings:

[0044] Figure 1 shows a flowchart of a video classification method in an exemplary embodiment;

[0045] Figure 2 shows a flowchart of a model training method in an exemplary embodiment;

[0046] Figure 3 shows a schematic diagram of the structure of an end-to-end classification model in an exemplary embodiment;

[0047] Figure 4 shows a block diagram of a video classification apparatus in an exemplary embodiment;

[0048] Figure 5 shows a block diagram of a model training apparatus in an exemplary embodiment;

[0049] Figure 6 shows a block diagram of an electronic device in an exemplary embodiment.

Detailed Implementation

[0050] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.

[0051] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and / or in parallel. Furthermore, the method embodiments may include additional steps and / or omit the steps shown. The scope of this disclosure is not limited in this respect.

[0052] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.

[0053] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0054] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0055] The names of messages or information exchanged between multiple devices or modules in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.

[0056] It is understood that the data involved in this technical solution (including but not limited to the data itself, the acquisition or use of the data) shall comply with the requirements of relevant laws, regulations and related provisions.

[0057] This disclosure provides a video classification method that can more accurately identify video classification tags. Figure 1 shows a flowchart of a video classification method in an exemplary embodiment. Referring to Figure 1, the method includes:

[0058] Step S101: Determine the target video and the target audio corresponding to the target video.

[0059] Step S102: Extract N video frames from the target video and divide the target audio into N audio frames that correspond one-to-one with the N video frames.

[0060] In one implementation, the target video is divided into N video segments at equal intervals, and one frame is extracted from each video segment to obtain a total of N video frames. Similarly, the target audio is divided into N audio segments at equal intervals, with each audio segment serving as an audio frame to obtain a total of N audio frames. Here, N is a positive integer greater than 1.

[0061] It is understandable that in practice, the target video and target audio may not be divided in an equal interval manner, but the division method of the two should be consistent.
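The equal-interval division in step S102 can be illustrated with a minimal Python sketch. It is only one possible reading: the helper names (`segment_bounds`, `sample_video_frames`, `split_audio`) are hypothetical, and taking the middle frame of each video segment is an assumption, since the text only says that one frame is extracted per segment. Both streams use the same segmentation rule, which keeps the two divisions consistent.

```python
from typing import List, Sequence, Tuple


def segment_bounds(length: int, n: int) -> List[Tuple[int, int]]:
    """Split range(length) into n contiguous, near-equal [start, end) segments."""
    if n <= 1:
        raise ValueError("N must be an integer greater than 1")
    if length < n:
        raise ValueError("need at least N items to form N segments")
    return [(i * length // n, (i + 1) * length // n) for i in range(n)]


def sample_video_frames(frames: Sequence, n: int) -> list:
    """Pick one frame per video segment (assumed choice: the middle frame)."""
    return [frames[(start + end) // 2] for start, end in segment_bounds(len(frames), n)]


def split_audio(samples: Sequence[float], n: int) -> List[List[float]]:
    """Each audio segment as a whole serves as one audio frame."""
    return [list(samples[start:end]) for start, end in segment_bounds(len(samples), n)]


# Toy data: 12 video frames (represented by their indices) and 24 audio samples.
video = list(range(12))
audio = [float(i) for i in range(24)]

video_frames = sample_video_frames(video, 4)
audio_frames = split_audio(audio, 4)

# The k-th video frame and the k-th audio frame cover the same part of the timeline.
for k, (v, a) in enumerate(zip(video_frames, audio_frames), start=1):
    print(f"pair {k}: video frame index {v}, audio samples {a[0]:.0f}..{a[-1]:.0f}")
```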

[0062] Step S103: Extract the video features corresponding to each video frame and the audio features corresponding to each audio frame.

[0063] Specifically, video features corresponding to each video frame and audio features corresponding to each audio frame are extracted.

[0064] In one exemplary embodiment, an end-to-end classification model may be used, which includes a backbone network, a fusion model, and a classifier.

[0065] In one implementation, the backbone network of the end-to-end classification model includes a video backbone network and an audio backbone network. In the above steps, each video frame is input into the video backbone network, which extracts video features corresponding to each video frame and outputs these extracted features to the next-level fusion model. Similarly, each audio frame is input into the audio backbone network, which extracts audio features corresponding to each audio frame and outputs these extracted features to the next-level fusion model.

[0066] In another implementation, the backbone network in this end-to-end classification model is a video backbone network. It should be noted beforehand that each video frame extracted from the target video is represented as a two-dimensional or three-dimensional array, and each audio frame segmented from the target audio is represented as a one-dimensional array. Inputting video or audio frames into the network / model means inputting the arrays representing the video or audio frames into the network / model.

[0067] In the above steps, each two-dimensional or three-dimensional array representing a video frame is input into the video backbone network. The video backbone network extracts the video features corresponding to each video frame and outputs the extracted video features to the next-level fusion model. Similarly, for each audio frame, a Fast Fourier Transform (FFT) is used to transform its one-dimensional array into a two-dimensional array. The values in the two-dimensional array are then normalized to obtain the target array representing the audio frame. Optionally, the two-dimensional array is normalized to a mean of 0 and a variance of 1. Then, each target array representing an audio frame is input into the video backbone network. The video backbone network extracts the features corresponding to each audio frame as audio features and outputs the extracted audio features to the next-level fusion model.

[0068] In this way, when extracting audio features from audio frames, the same video backbone network that processes the video frames can be reused, simplifying the structure of the end-to-end classification model. Furthermore, when training this end-to-end classification model, normalizing the values in the array accelerates the convergence of the video backbone network.
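The audio preprocessing described above can be sketched as follows. The text does not say how the FFT turns a one-dimensional array into a two-dimensional one, so this sketch assumes a simple magnitude spectrogram: the frame is cut into non-overlapping windows and each window contributes one row of FFT magnitudes. The window length of 8, the function names, and the hand-written radix-2 FFT are all illustrative choices, not details from the disclosure.

```python
import cmath
import math
from typing import List, Sequence


def fft(x: Sequence[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 0 or n % 2:
        raise ValueError("length must be a power of two")
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def audio_frame_to_target_array(samples: Sequence[float], window: int = 8) -> List[List[float]]:
    """1-D audio frame -> 2-D magnitude spectrogram normalized to mean 0, variance 1."""
    # Assumed layout: non-overlapping windows, one row of FFT magnitudes per window.
    rows = []
    for start in range(0, len(samples) - window + 1, window):
        spectrum = fft(samples[start:start + window])
        rows.append([abs(c) for c in spectrum[: window // 2 + 1]])
    if not rows:
        raise ValueError("audio frame is shorter than one window")
    values = [v for row in rows for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against a constant (e.g. silent) frame
    return [[(v - mean) / std for v in row] for row in rows]


# Toy audio frame: 32 samples of a sine wave with a period of 8 samples.
frame = [math.sin(2 * math.pi * i / 8) for i in range(32)]
target = audio_frame_to_target_array(frame)
print(len(target), "rows x", len(target[0]), "columns")
```

Because the result is a two-dimensional array, it has the same kind of layout as a single-channel image, which is what allows the video backbone network to be reused for audio frames.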

[0069] Step S104: For each video frame, fuse the video features of the video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features.

[0070] Having completed the aforementioned steps, we have obtained N video features corresponding to N video frames and N audio features corresponding to N audio frames. For each video frame, taking the k-th frame as an example, we fuse the video features corresponding to the k-th video frame with the audio features corresponding to the k-th audio frame to obtain the k-th dual-stream fused feature. k ranges from 1 to N, resulting in a total of N dual-stream fused features.

[0071] Step S105: Fuse the N dual-stream fusion features in time sequence to obtain the target fusion feature.

[0072] After obtaining N dual-stream fusion features, these N dual-stream fusion features need to be fused in time sequence to obtain the target fusion feature.

[0073] In an exemplary embodiment, steps S104 and S105 can be completed by the fusion model in the end-to-end classification model. The fusion model performs dual-stream fusion and temporal fusion sequentially based on the video features corresponding to each video frame and the audio features corresponding to each audio frame output by the backbone network to obtain the target fusion features, and outputs the target fusion features to the next-level classifier.

[0074] Step S106: Determine the classification label of the target video based on the target fusion feature.

[0075] In an exemplary embodiment, the classifier classifies the target video based on the target fusion features output by the fusion model and outputs a classification label representing the category of the target video, such as determining that the target video is a food video, a history video, or a travel video.

[0076] Understandably, the embodiments of this disclosure utilize the natural correspondence between video and audio, resulting in better recognition performance. For example, when the target video is a video explaining different landmarks, sequentially introducing landmark A, landmark B, and landmark C, by extracting N video frames and segmenting the audio into corresponding N audio frames, then within these N video and N audio frames, one extracted video frame might depict landmark A, and the corresponding audio frame might be a segment explaining landmark A; another extracted video frame might depict landmark B, and the corresponding audio frame might be a segment explaining landmark B; and yet another extracted video frame might depict landmark C, and the corresponding audio frame might be a segment explaining landmark C. Therefore, by fusing the video features of the video frames with the audio features of the corresponding audio frames, the resulting dual-stream fusion features can respectively reflect the feature information about landmark A, landmark B, and landmark C. Furthermore, by temporally fusing multiple dual-stream fusion features, the final target fusion feature can more accurately represent the content of the video, thus achieving better recognition of video classification tags.

[0077] The existing technology, by contrast, essentially merges the video information of landmarks A, B, and C into a single overall video feature, and the narration of landmarks A, B, and C into a single overall audio feature. When only these two overall features are merged, the frame-level correlation between the video and the audio is lost, resulting in an inaccurate target feature and, consequently, inaccurate recognition of video classification labels.

[0078] Further, in step S104, for each video frame, the video features of the video frame are fused with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features.

[0079] Taking any video frame as an example, the specific methods for fusing the video features of the video frame with the audio features of the corresponding audio frame include:

[0080] 1. Add the video features of the video frame to the audio features of the corresponding audio frame, and use the result of the addition as the corresponding dual-stream fusion feature;

[0081] 2. Perform a dot product between the video features of the video frame and the audio features of the corresponding audio frame, and use the result of the dot product as the corresponding dual-stream fusion feature.

[0082] 3. The video features of the video frame are concatenated with the audio features of the corresponding audio frame, and the concatenation result is used as the corresponding dual-stream fusion feature.

[0083] 4. First, obtain a type representation, which includes a distribution feature vector corresponding to the video type and a distribution feature vector corresponding to the audio type. Add the video features of the video frame to the distribution feature vector corresponding to the video type to obtain the video input vector, and add the audio features of the corresponding audio frame to the distribution feature vector corresponding to the audio type to obtain the audio input vector. Then input the video input vector and the audio input vector together into a predetermined model, and take the vector output by the predetermined model as the corresponding dual-stream fusion feature.

[0084] Here, the predetermined model is an attention-based model, such as a Transformer model.

[0085] In particular, step S104 can be completed by the fusion model, and this type representation is obtained synchronously during the training of the model parameters in the end-to-end classification model.

[0086] It is worth noting that in the fourth method described above, the attention mechanism within the predetermined model is used to fuse video and audio features. The type representation includes the distribution feature vectors corresponding to the video type and the audio type. Understandably, the video and audio features extracted by the backbone network differ significantly in their feature vector distribution spaces. By adding the video features to the distribution feature vector of the video type to obtain the video input vector, and adding the audio features to the distribution feature vector of the audio type to obtain the audio input vector, the video and audio features are adjusted to have similar feature distribution spaces. Then, based on the attention mechanism, the video and audio input vectors with similar feature distribution spaces are fused, resulting in better fusion performance.
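A minimal sketch of the fourth fusion method, under several stated assumptions: the attention model is reduced to one single-head scaled dot-product self-attention layer with identity projections (a real Transformer has learned projections and further layers), the two attended tokens are averaged to give one fused vector (the text does not specify how the output vector is read out), and the type vectors are hand-picked here although the disclosure obtains them during training. All function and variable names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs


def fuse_with_type_vectors(video_feat: Vector, audio_feat: Vector,
                           video_type: Vector, audio_type: Vector) -> Vector:
    """Shift each modality by its type vector, attend, and read out one vector."""
    video_in = add(video_feat, video_type)
    audio_in = add(audio_feat, audio_type)
    attended = self_attention([video_in, audio_in])
    # Assumed readout: average the two attended tokens into one fused vector.
    return [(v + a) / 2 for v, a in zip(attended[0], attended[1])]


# Toy features living in very different ranges, as the text anticipates.
video_feat = [4.0, 5.0, 6.0]
audio_feat = [0.1, -0.2, 0.3]
# Hand-picked stand-ins for the learned type vectors.
video_type = [-4.0, -5.0, -5.5]
audio_type = [0.0, 0.0, 0.0]

fused = fuse_with_type_vectors(video_feat, audio_feat, video_type, audio_type)
print(fused)
```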

[0087] It is understood that in step S104, any one or a combination of the above methods can be selected to fuse video features with corresponding audio features.
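The three arithmetic options (methods 1 to 3) can be sketched in a few lines of Python. One interpretation is assumed here: the "dot product" of method 2 is read as an element-wise product so that the fused result remains a vector, whereas a strict inner product would collapse it to a single number. The function names are illustrative.

```python
from typing import List

Vector = List[float]


def _check_same_length(video: Vector, audio: Vector) -> None:
    if len(video) != len(audio):
        raise ValueError("video and audio features must have the same dimension")


def fuse_add(video: Vector, audio: Vector) -> Vector:
    """Method 1: element-wise addition."""
    _check_same_length(video, audio)
    return [v + a for v, a in zip(video, audio)]


def fuse_multiply(video: Vector, audio: Vector) -> Vector:
    """Method 2, read as an element-wise product (an assumption, see the lead-in)."""
    _check_same_length(video, audio)
    return [v * a for v, a in zip(video, audio)]


def fuse_concat(video: Vector, audio: Vector) -> Vector:
    """Method 3: concatenation; the fused dimension is the sum of both dimensions."""
    return list(video) + list(audio)


video_feature = [1.0, 2.0, 3.0]
audio_feature = [0.5, -1.0, 2.0]

print(fuse_add(video_feature, audio_feature))       # [1.5, 1.0, 5.0]
print(fuse_multiply(video_feature, audio_feature))  # [0.5, -2.0, 6.0]
print(fuse_concat(video_feature, audio_feature))    # [1.0, 2.0, 3.0, 0.5, -1.0, 2.0]
```

Addition and multiplication require both features to have the same dimension, while concatenation does not but doubles the size of the fused feature in this example.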

[0088] Furthermore, in step S105, the N dual-stream fusion features are fused in time to obtain the target fusion feature.

[0089] The specific methods for temporally fusing N dual-stream fusion features include:

[0090] 1. Pool the N dual-stream fusion features to obtain the target fusion feature;

[0091] 2. First, obtain a temporal representation, which includes N temporal feature vectors. These N temporal feature vectors represent inter-frame temporal information. For each dual-stream fusion feature, the dual-stream fusion feature is added to the corresponding temporal feature vector in the temporal representation, and the sum is used as a temporal input vector. For example, the dual-stream fusion feature of the k-th frame is added to the temporal feature vector corresponding to the k-th frame in the temporal representation. The N temporal input vectors are then input into a predetermined model, and the vector output by the predetermined model is taken as the target fusion feature.

[0092] Here, the predetermined model is an attention-based model, such as a Transformer model.

[0093] In particular, step S105 can be completed by the fusion model, and this temporal representation is obtained synchronously during the training of the model parameters in the end-to-end classification model.

[0094] It is worth noting that in the second method described above, the attention mechanism in the predefined model is used to fuse the dual-stream fusion features of multiple different moments in the target video in a temporal sequence. This temporal representation includes N temporal feature vectors, which represent the inter-frame temporal information. For example, if the target video is a video of playing badminton, in the N video frames, the i-th frame shows the swing, the (i+1)-th frame shows the shuttlecock being hit, and the (i+2)-th frame shows the shuttlecock in the air. It is clear that the order of the frames actually contains certain information, which can be represented by a vector, namely the temporal feature vector.

[0095] In this disclosure, based on the attention mechanism, the inter-frame temporal information in the temporal representation is used to fuse multiple temporally sequential dual-stream fusion features, resulting in better fusion performance.
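The role of the temporal feature vectors can be illustrated with a small sketch. It assumes mean pooling for method 1 (the text only says "pool"), a single self-attention layer with identity projections followed by mean pooling as the predetermined model, and hand-picked temporal vectors in place of learned ones. Under these assumptions, pooling alone gives the same result for any frame order, while adding the temporal vectors before attention makes the result depend on the order.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Method 1 (assumed to be mean pooling): average over the N frames."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])
    return outputs


def temporal_fuse(fused_frames: List[Vector], temporal_vectors: List[Vector]) -> Vector:
    """Method 2: add the k-th temporal vector to the k-th feature, then attend."""
    inputs = [
        [f + t for f, t in zip(frame, temporal)]
        for frame, temporal in zip(fused_frames, temporal_vectors)
    ]
    # Assumed readout: mean-pool the attended tokens into the target fusion feature.
    return mean_pool(self_attention(inputs))


# Three dual-stream fusion features (N = 3) and hand-picked temporal vectors.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
temporal = [[0.0, 0.0], [0.5, 0.0], [1.0, 0.0]]
reversed_frames = frames[::-1]

print("pooling, original order :", mean_pool(frames))
print("pooling, reversed order :", mean_pool(reversed_frames))
print("temporal, original order:", temporal_fuse(frames, temporal))
print("temporal, reversed order:", temporal_fuse(reversed_frames, temporal))
```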

[0096] Furthermore, embodiments of this disclosure also provide a model training method for training an end-to-end classification model. Figure 2 shows a flowchart of a model training method in an exemplary embodiment. Referring to Figure 2, the method includes:

[0097] Step S201: Determine multiple training samples, each training sample containing a training video and a corresponding training audio.

[0098] Step S202: For each training sample, extract N video frames from the training video and divide the training audio into N audio frames that correspond one-to-one with the N video frames.

[0099] Where N is a positive integer greater than 1.

[0100] In one implementation, the training video is divided into N video segments at equal intervals, and one frame is extracted from each video segment to obtain a total of N video frames. Similarly, the training audio is divided into N audio segments at equal intervals, with each audio segment serving as an audio frame to obtain a total of N audio frames.

[0101] Step S203: Input N video frames and N audio frames into the end-to-end classification model.

[0102] Step S204: Calculate the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0103] Step S205: Based on the training loss, update the model parameters in the end-to-end classification model using gradient descent.
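Steps S204 and S205 can be sketched on a deliberately reduced model. The disclosure does not name a loss function, so cross-entropy is assumed here, and only a linear softmax classifier acting on a fixed target fusion feature is updated, whereas the method described updates the parameters of the whole end-to-end model (backbone network, fusion model, and classifier). The names `train_step`, `weights`, and `feature` are illustrative.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights: List[List[float]], feature: List[float],
               true_index: int, lr: float = 0.1) -> float:
    """One gradient-descent update in place; returns the loss before the update."""
    scores = [sum(w * f for w, f in zip(row, feature)) for row in weights]
    probs = softmax(scores)
    loss = -math.log(probs[true_index])  # cross-entropy against the real label
    for c, row in enumerate(weights):
        # Gradient of the loss with respect to the score of class c.
        grad_score = probs[c] - (1.0 if c == true_index else 0.0)
        for d in range(len(row)):
            row[d] -= lr * grad_score * feature[d]
    return loss


# Three classes (e.g. food, history, travel) and a 2-dimensional target fusion feature.
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
feature = [1.0, -0.5]
losses = [train_step(weights, feature, true_index=1) for _ in range(5)]
print([round(loss, 4) for loss in losses])
```

With all weights starting at zero, the first loss equals ln 3 (a uniform guess over three classes), and each further step lowers it.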

[0104] In this disclosure, the end-to-end classification model includes a backbone network, a fusion model, and a classifier.

[0105] The backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model.

[0106] In one implementation, the backbone network includes a video backbone network and an audio backbone network. Therefore, each video frame is input into the video backbone network, which extracts video features corresponding to each video frame and outputs these extracted features to the next-level fusion model. Similarly, each audio frame is input into the audio backbone network, which extracts audio features corresponding to each audio frame and outputs these extracted features to the next-level fusion model.

[0107] In another implementation, the end-to-end classification model further includes a preprocessing model, and the backbone network is a video backbone network. It should be noted beforehand that each video frame extracted from the training video is represented by a two-dimensional or three-dimensional array, and each audio frame segmented from the training audio is represented by a one-dimensional array. Inputting video frames or audio frames into the network / model means inputting the arrays representing the video frames or audio frames into the network / model.

[0108] Therefore, each two-dimensional or three-dimensional array representing a video frame is input into the video backbone network, which extracts the video features corresponding to each video frame and outputs these features to the next-level fusion model. Similarly, each one-dimensional array representing an audio frame is input into the preprocessing model.

[0109] This preprocessing model is used to transform the one-dimensional array of each audio frame into a two-dimensional array using a Fast Fourier Transform (FFT). The values in the two-dimensional array are then normalized to obtain a target array representing the audio frame. Optionally, this two-dimensional array is normalized to a mean of 0 and a variance of 1. Each target array representing an audio frame is then output to the video backbone network. This video backbone network is also used to extract features corresponding to each audio frame as audio features, and outputs these extracted audio features to the next-level fusion model.

[0110] The fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features. The N dual-stream fusion features are fused in time to obtain the target fusion feature, and the target fusion feature is output to the classifier.

[0111] The classifier is used to output the corresponding classification label based on the target fusion feature.

[0112] It can be understood that, in this embodiment, N video frames are first extracted from the training video, and the training audio is divided into N audio frames. The N video frames and N audio frames are then input into the end-to-end classification model, which uses the correspondence between video and audio to identify the classification label of the training video. The model is trained based on this classification label and the real label of the training video, so that the final end-to-end classification model recognizes video classification labels more accurately.

[0113] Furthermore, this fusion model is specifically used for:

[0114] Obtain the type representation, which includes the distribution feature vectors corresponding to the video type and the distribution feature vectors corresponding to the audio type;

[0115] For each video frame, the video features of that video frame are added to the distribution feature vector corresponding to the video type, and this is used as the video input vector.

[0116] The audio features of the corresponding audio frame are added to the distribution feature vector corresponding to the audio type, and this is used as the audio input vector.

[0117] The video input vector and the audio input vector are input into a predetermined model, and the corresponding dual-stream fusion feature is obtained from the vector output by the predetermined model.

[0118] Obtain a temporal representation, which includes N temporal feature vectors, and these N temporal feature vectors represent inter-frame temporal information;

[0119] For each dual-stream fusion feature, the dual-stream fusion feature is added to the corresponding temporal feature vector in the temporal representation to form the temporal input vector;

[0120] The N temporal input vectors are input into the predetermined model, the target fusion feature is obtained from the vector output by the predetermined model, and the target fusion feature is output to the next-level classifier.

[0121] Here, the predetermined model is an attention-based model, such as a Transformer model.
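The steps in [0114] to [0120] can be sketched as follows. This is a minimal illustration under several assumptions, not the disclosed implementation: the attention-based predetermined model is reduced to single-head scaled dot-product self-attention with identity projections, the output vector of that model is read out by averaging its output tokens, and all names and numeric values are hypothetical.

```python
import math


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections.

    Stand-in for the attention-based predetermined model; a real Transformer
    would add learned projections, multiple heads, and feed-forward layers.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out


def mean_pool(tokens):
    """Average the token vectors into one vector (an assumed read-out)."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def fuse(video_feats, audio_feats, type_video, type_audio, temporal):
    """Dual-stream fusion per frame, then temporal fusion over the N frames."""
    dual = []
    for v, a in zip(video_feats, audio_feats):
        video_in = add(v, type_video)  # video input vector
        audio_in = add(a, type_audio)  # audio input vector
        dual.append(mean_pool(self_attention([video_in, audio_in])))
    temporal_in = [add(f, t) for f, t in zip(dual, temporal)]
    return mean_pool(self_attention(temporal_in))


# Toy example with N = 3 frames and 4-dimensional features (made-up values).
video = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.1, 0.0]]
audio = [[0.0, 0.1, 0.2, 0.1], [0.2, 0.2, 0.1, 0.0], [0.4, 0.0, 0.1, 0.3]]
type_video = [0.01, 0.02, 0.03, 0.04]
type_audio = [0.04, 0.03, 0.02, 0.01]
temporal = [[0.0] * 4, [0.1] * 4, [0.2] * 4]
target_fusion = fuse(video, audio, type_video, type_audio, temporal)
```

Adding a type vector to each feature lets the attention model tell video tokens from audio tokens, and adding a temporal vector lets it tell frame positions apart, much as positional embeddings do in a Transformer.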

[0122] Note that in the initial end-to-end classification model, the type representation and the temporal representation in the fusion model are generated at initialization and carry no practical meaning. During training, the distribution feature vectors in the type representation and the temporal feature vectors in the temporal representation are updated together with the model parameters through gradient descent. As training proceeds, these vectors therefore gradually learn relevant information from the training data and become more accurate.
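The idea in [0122], that the representation vectors are trained like any other parameter, can be illustrated with a deliberately tiny sketch. It assumes a linear scoring function and a squared-error loss purely for illustration; the function names, the learning rate, and all numbers are hypothetical and do not come from the disclosure.

```python
def predict(w, x, type_vec):
    """Linear score on the input vector (feature plus type vector)."""
    return sum(wi * (xi + ti) for wi, xi, ti in zip(w, x, type_vec))


def sgd_step(w, type_vec, x, y, lr=0.1):
    """One gradient-descent step on squared error, updating w and type_vec."""
    err = predict(w, x, type_vec) - y
    # Gradient of err**2 with respect to the weights and to the type vector.
    grad_w = [2 * err * (xi + ti) for xi, ti in zip(x, type_vec)]
    grad_t = [2 * err * wi for wi in w]
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    new_t = [ti - lr * g for ti, g in zip(type_vec, grad_t)]
    return new_w, new_t


x, y = [1.0, 0.5], 1.0   # made-up feature vector and target
w = [0.3, 0.4]           # ordinary model parameters
type_vec = [0.1, -0.2]   # arbitrary initial type vector, no meaning yet
initial_loss = (predict(w, x, type_vec) - y) ** 2
for _ in range(50):
    w, type_vec = sgd_step(w, type_vec, x, y)
final_loss = (predict(w, x, type_vec) - y) ** 2
```

The type vector starts as an arbitrary value and receives its own gradient at every step, so it changes alongside the weights as the loss falls.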

[0123] It should be noted that this fusion model can also obtain the corresponding dual-stream fusion features by adding, multiplying, or concatenating the video features of each video frame with the audio features of the corresponding audio frame.

[0124] Alternatively, the fusion model can also use a pooling layer to pool the N dual-stream fusion features to obtain the target fusion feature output by the pooling layer.
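The simpler alternatives in [0123] and [0124] can be sketched as below. The function names and the choice of mean and max pooling are assumptions for illustration; the disclosure names only addition, multiplication, concatenation, and a pooling layer.

```python
def fuse_pair(video_feat, audio_feat, mode="add"):
    """Fuse one video feature with its audio feature without attention."""
    if mode == "add":
        return [v + a for v, a in zip(video_feat, audio_feat)]
    if mode == "multiply":
        return [v * a for v, a in zip(video_feat, audio_feat)]
    if mode == "concat":
        return list(video_feat) + list(audio_feat)
    raise ValueError(f"unknown mode: {mode}")


def pool(features, kind="mean"):
    """Pool N dual-stream fusion features into one target fusion feature."""
    columns = list(zip(*features))  # one tuple per feature dimension
    if kind == "mean":
        return [sum(c) / len(c) for c in columns]
    if kind == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown kind: {kind}")


# Toy example with N = 2 frames and 2-dimensional features (made-up values).
video = [[1.0, 2.0], [3.0, 4.0]]
audio = [[0.5, 0.5], [1.0, 2.0]]
dual = [fuse_pair(v, a, "add") for v, a in zip(video, audio)]
target_fusion = pool(dual, "mean")
```

Addition and multiplication require the video and audio features to have the same dimension, whereas concatenation does not but doubles the dimension of the fused feature.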

[0125] Optionally, in step S204, the training loss is calculated based on the classification labels output by the end-to-end classification model and the real labels of the training videos, including: calculating the classification error based on the classification labels output by the end-to-end classification model and the real labels of the training videos; calculating the similarity error based on the similarity between the video features corresponding to each video frame and the audio features of the corresponding audio frame; and calculating the training loss based on the classification error and the similarity error.

[0126] Therefore, when training the model, using the similarity between the video features of each video frame and the audio features of the corresponding audio frame as an additional learning objective can improve the model's learning performance. Here, similarity can refer to the cosine similarity between the two vectors of video features and audio features.
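One possible form of the combined loss in [0125] and [0126] is sketched below. The disclosure states only that the training loss is calculated from a classification error and a similarity error, and that the similarity can be the cosine similarity. The use of cross-entropy for the classification error, of one minus the mean cosine similarity for the similarity error, and of a weighted sum with a hypothetical `weight` parameter are all assumptions.

```python
import math


def cross_entropy(logits, true_index):
    """Softmax cross-entropy for one sample (assumed classification error)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[true_index]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def training_loss(logits, true_index, video_feats, audio_feats, weight=1.0):
    """Classification error plus a weighted similarity error."""
    cls_err = cross_entropy(logits, true_index)
    sims = [cosine_similarity(v, a) for v, a in zip(video_feats, audio_feats)]
    # Assumed form: the error shrinks as paired features become more similar.
    sim_err = sum(1.0 - s for s in sims) / len(sims)
    return cls_err + weight * sim_err
```

With this form, the similarity term is zero when every video feature points in the same direction as its paired audio feature, so the loss then reduces to the classification error alone.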

[0127] Figure 3 shows a schematic diagram of an end-to-end classification model in an exemplary embodiment. For the specific implementation of the end-to-end classification model in this disclosure, refer to the relevant description in the video classification method section above; it is not repeated here.

[0128] Figure 4 shows a block diagram of a video classification apparatus in an exemplary embodiment. Referring to Figure 4, the video classification device 300 includes:

[0129] The audio / video determination module 301 is used to determine the target video and the target audio corresponding to the target video;

[0130] The audio and video frame segmentation module 302 is used to extract N video frames from the target video and to segment the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0131] The feature extraction module 303 is used to extract the video features corresponding to each video frame and the audio features corresponding to each audio frame.

[0132] The dual-stream fusion module 304 is used to fuse the video features of the video frame with the audio features of the corresponding audio frame for each video frame to obtain the corresponding dual-stream fusion features.

[0133] The temporal fusion module 305 is used to fuse N dual-stream fusion features in a temporal sequence to obtain the target fusion feature;

[0134] The video classification module 306 is used to determine the classification label of the target video based on the target fusion features.
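The work of the frame segmentation module 302 can be illustrated with a short sketch. It assumes that the video is divided into N equal segments with the middle frame of each segment extracted, and that the audio samples are split into N contiguous chunks covering the same segments, so that frame i and chunk i correspond. This sampling strategy and the function names are assumptions; this section does not specify how the N frames are chosen.

```python
def sample_video_frame_indices(num_frames, n):
    """Pick the middle frame index of each of n equal video segments."""
    return [min(num_frames - 1, int((i + 0.5) * num_frames / n))
            for i in range(n)]


def split_audio(samples, n):
    """Split audio samples into n contiguous chunks of near-equal length."""
    bounds = [i * len(samples) // n for i in range(n + 1)]
    return [samples[bounds[i]:bounds[i + 1]] for i in range(n)]


# Example: a 100-frame video and 10 audio samples, with N = 4.
indices = sample_video_frame_indices(100, 4)
chunks = split_audio(list(range(10)), 4)
```

Here `indices` is `[12, 37, 62, 87]` and `chunks` holds four consecutive slices of the audio, so the i-th video frame and the i-th audio frame cover the same part of the timeline.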

[0135] Optionally, each video frame extracted from the target video is represented as a two-dimensional or three-dimensional array, and each audio frame segmented from the target audio is represented as a one-dimensional array; the feature extraction module 303 includes:

[0136] The Fourier transform module is used to transform the one-dimensional array of audio frames into a two-dimensional array through a fast Fourier transform.

[0137] A normalization module is used to normalize the values in the two-dimensional array to obtain a target array representing the audio frame;

[0138] An array input module is used to input a two-dimensional or three-dimensional array representing the video frame and a target array representing the audio frame into the same backbone network, so as to extract the video features corresponding to the video frame and the audio features corresponding to the audio frame, respectively.

[0139] Optionally, the dual-stream fusion module 304 includes:

[0140] The type representation acquisition module is used to acquire type representations, which include the distribution feature vectors corresponding to video types and the distribution feature vectors corresponding to audio types.

[0141] The first feature adjustment module is used to add the video features of the video frame to the distribution feature vector corresponding to the video type, and use it as the video input vector;

[0142] The second feature adjustment module is used to add the audio features of the corresponding audio frame to the distribution feature vector corresponding to the audio type, and use it as the audio input vector;

[0143] The first vector input module is used to input the video input vector and the audio input vector into a predetermined model, obtain the vector output by the predetermined model, and obtain the corresponding dual-stream fusion feature.

[0144] Optionally, the temporal fusion module 305 includes:

[0145] A temporal representation acquisition module is used to acquire a temporal representation, which includes N temporal feature vectors that represent inter-frame temporal information;

[0146] The third feature adjustment module is used to add each dual-stream fusion feature to the corresponding temporal feature vector in the temporal representation to form a temporal input vector;

[0147] The second vector input module is used to input the N temporal input vectors into a predetermined model and to obtain the target fusion feature from the vector output by the predetermined model.

[0148] Figure 5 shows a block diagram of a model training apparatus in an exemplary embodiment. This apparatus is used to train an end-to-end classification model, which includes a backbone network, a fusion model, and a classifier. Referring to Figure 5, the model training device 400 includes:

[0149] The training data determination module 401 is used to determine multiple training samples, each training sample containing a training video and a corresponding training audio.

[0150] The training data frame segmentation module 402 is used, for each training sample, to extract N video frames from the training video and to segment the training audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0151] The frame input module 403 is used to input the N video frames and the N audio frames into the end-to-end classification model;

[0152] The loss calculation module 404 is used to calculate the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0153] The parameter training module 405 is used to update the model parameters in the end-to-end classification model by gradient descent based on the training loss.

[0154] In the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model.

[0155] The fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features, fuse N dual-stream fusion features in time sequence to obtain the target fusion feature, and output the target fusion feature to the classifier.

[0156] The classifier is used to output corresponding classification labels based on the target fusion features.

[0157] Optionally, each video frame extracted from the training video is represented as a two-dimensional or three-dimensional array, and each audio frame segmented from the training audio is represented as a one-dimensional array. The end-to-end classification model also includes a preprocessing model.

[0158] The frame input module 403 is used to: input a two-dimensional or three-dimensional array representing the video frame into the backbone network; and input a one-dimensional array representing the audio frame into the preprocessing model.

[0159] The preprocessing model is used to transform the one-dimensional array of audio frames into a two-dimensional array through a fast Fourier transform, normalize the values in the two-dimensional array to obtain a target array representing the audio frames, and output the target array representing the audio frames to the backbone network.

[0160] Optionally, the fusion model is specifically used for:

[0161] Obtain type representations, which include distribution feature vectors corresponding to video types and distribution feature vectors corresponding to audio types;

[0162] The video features of the video frame are added to the distribution feature vector corresponding to the video type to form the video input vector;

[0163] The audio features of the corresponding audio frame are added to the distribution feature vector corresponding to the audio type to form the audio input vector;

[0164] The video input vector and the audio input vector are input into a predetermined model, and the corresponding dual-stream fusion feature is obtained from the vector output by the predetermined model.

[0165] Acquire a temporal representation, which includes N temporal feature vectors, and the N temporal feature vectors represent inter-frame temporal information;

[0166] For each dual-stream fusion feature, the dual-stream fusion feature is added to the corresponding temporal feature vector in the temporal representation to form a temporal input vector;

[0167] The N temporal input vectors are input into the predetermined model, the target fusion feature is obtained from the vector output by the predetermined model, and the target fusion feature is output to the classifier.

[0168] The parameter training module 405 is used to: update the model parameters in the end-to-end classification model through gradient descent, and update the distribution feature vector in the type representation and the temporal feature vector in the temporal representation.

[0169] Optionally, the loss calculation module 404 is used to: calculate the classification error based on the classification label output by the end-to-end classification model and the real label of the training video; calculate the similarity error based on the similarity between the video features corresponding to each video frame and the audio features of the corresponding audio frame; and calculate the training loss based on the classification error and the similarity error.

[0170] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0171] This disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the video classification method or model training method described in the foregoing embodiments.

[0172] This disclosure provides an electronic device, including:

[0173] A storage device on which computer programs are stored;

[0174] A processing device is configured to execute the computer program in the storage device to implement the steps of the video classification method or model training method in the foregoing embodiments.

[0175] Figure 6 shows a block diagram of an electronic device 600 suitable for implementing embodiments of the present disclosure. The electronic device in embodiments of the present disclosure may include, but is not limited to, devices such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), digital TVs, and desktop computers. The electronic device shown in Figure 6 is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.

[0176] As shown in Figure 6, electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 602 or a program loaded from storage device 608 into random access memory (RAM) 603. RAM 603 also stores various programs and data required for the operation of electronic device 600. Processing device 601, ROM 602, and RAM 603 are interconnected via bus 604. Input/output (I/O) interface 605 is also connected to bus 604.

[0177] Typically, the following devices can be connected to I/O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, and gyroscopes; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, and vibrators; storage devices 608 including, for example, magnetic tapes and hard disks; and communication devices 609. Communication device 609 allows electronic device 600 to communicate with other devices over wireless or wired connections to exchange data. Although Figure 6 shows electronic device 600 with various devices, it should be understood that not all of the devices shown are required to be implemented or present. More or fewer devices may alternatively be implemented or present.

[0178] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 609, or installed from a storage device 608, or installed from a ROM 602. When the computer program is executed by the processing device 601, it performs the functions defined in the methods of embodiments of this disclosure.

[0179] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.

[0180] The aforementioned computer-readable medium may be included in the aforementioned electronic device, or it may exist independently without being assembled into the electronic device.

[0181] The aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: determine a target video and a target audio corresponding to the target video; extract N video frames from the target video and segment the target audio into N audio frames corresponding one-to-one with the N video frames, where N is a positive integer greater than 1; extract video features corresponding to each video frame and audio features corresponding to each audio frame; for each video frame, fuse the video features of the video frame with the audio features of the corresponding audio frame to obtain corresponding dual-stream fusion features; fuse the N dual-stream fusion features in a temporal sequence to obtain target fusion features; and determine the classification label of the target video based on the target fusion features.

[0182] Alternatively, the aforementioned computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: determine a plurality of training samples, each training sample containing a training video and a corresponding training audio; for each training sample, extract N video frames from the training video and segment the training audio into N audio frames corresponding one-to-one with the N video frames, where N is a positive integer greater than 1; input the N video frames and the N audio frames into an end-to-end classification model; calculate the training loss based on the classification labels output by the end-to-end classification model and the true labels of the training videos; and update the model parameters in the end-to-end classification model using gradient descent based on the training loss.

[0183] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0184] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0185] The modules described in the embodiments of this disclosure can be implemented in software or in hardware. The names of the modules are not necessarily limiting in certain circumstances; for example, an audio / video determination module can also be described as "a module for determining a target video and the target audio corresponding to the target video".

[0186] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used include, without limitation: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.

[0187] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0188] According to one or more embodiments of this disclosure, Example 1 provides a video classification method, including:

[0189] Determine the target video and the target audio corresponding to the target video;

[0190] N video frames are extracted from the target video, and the target audio is divided into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1;

[0191] Extract the video features corresponding to each video frame and the audio features corresponding to each audio frame;

[0192] For each video frame, the video features of the video frame are fused with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features;

[0193] The target fused feature is obtained by fusing N dual-stream fusion features in time sequence.

[0194] Based on the target fusion features, the classification label of the target video is determined.

[0195] According to one or more embodiments of this disclosure, Example 2 provides the method of Example 1, wherein each video frame extracted from the target video is represented by a two-dimensional or three-dimensional array, and each audio frame segmented from the target audio is represented by a one-dimensional array; the step of extracting the video features corresponding to each video frame and the audio features corresponding to each audio frame includes:

[0196] The one-dimensional array of audio frames is transformed into a two-dimensional array using a fast Fourier transform.

[0197] The values in the two-dimensional array are normalized to obtain a target array representing the audio frame;

[0198] A two-dimensional or three-dimensional array representing the video frame and a target array representing the audio frame are input into the same backbone network to extract the video features corresponding to the video frame and the audio features corresponding to the audio frame, respectively.

[0199] According to one or more embodiments of this disclosure, Example 3 provides the method of Example 1, wherein fusing the video features of the video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features includes:

[0200] Obtain type representations, which include distribution feature vectors corresponding to video types and distribution feature vectors corresponding to audio types;

[0201] The video features of the video frame are added to the distribution feature vector corresponding to the video type to form the video input vector;

[0202] The audio features of the corresponding audio frame are added to the distribution feature vector corresponding to the audio type to form the audio input vector;

[0203] The video input vector and the audio input vector are input together into a predetermined model, and the corresponding dual-stream fusion feature is obtained from the vector output by the predetermined model.

[0204] According to one or more embodiments of this disclosure, Example 4 provides the method of Example 1, wherein fusing N dual-stream fusion features in a temporal sequence to obtain a target fusion feature includes:

[0205] Acquire a temporal representation, which includes N temporal feature vectors, and the N temporal feature vectors represent inter-frame temporal information;

[0206] For each dual-stream fusion feature, the dual-stream fusion feature is added to the corresponding temporal feature vector in the temporal representation to form a temporal input vector;

[0207] The N temporal input vectors are input into a predetermined model, and the target fusion feature is obtained from the vector output by the predetermined model.

[0208] According to one or more embodiments of this disclosure, Example 5 provides a model training method for training an end-to-end classification model, the end-to-end classification model including a backbone network, a fusion model, and a classifier, the method comprising:

[0209] Multiple training samples are determined, each containing a training video and its corresponding training audio.

[0210] For each training sample, N video frames are extracted from the training video, and the training audio is divided into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0211] The N video frames and the N audio frames are input into the end-to-end classification model;

[0212] The training loss is calculated based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0213] Based on the training loss, the model parameters in the end-to-end classification model are updated using gradient descent.

[0214] In the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model.

[0215] The fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features, fuse N dual-stream fusion features in time sequence to obtain the target fusion feature, and output the target fusion feature to the classifier.

[0216] The classifier is used to output corresponding classification labels based on the target fusion features.

[0217] According to one or more embodiments of this disclosure, Example 6 provides the method of Example 5, wherein each video frame extracted from the training video is represented as a two-dimensional or three-dimensional array, and each audio frame segmented from the training audio is represented as a one-dimensional array. The end-to-end classification model further includes a preprocessing model, and the step of inputting the N video frames and the N audio frames into the end-to-end classification model includes:

[0218] A two-dimensional or three-dimensional array representing the video frames is input into the backbone network;

[0219] The one-dimensional array representing the audio frame is input into the preprocessing model;

[0220] The preprocessing model is used to transform the one-dimensional array of audio frames into a two-dimensional array through a fast Fourier transform, normalize the values in the two-dimensional array to obtain a target array representing the audio frames, and output the target array representing the audio frames to the backbone network.

[0221] According to one or more embodiments of this disclosure, Example 7 provides the method of Example 5, wherein the fusion model is specifically used for:

[0222] Obtain type representations, which include distribution feature vectors corresponding to video types and distribution feature vectors corresponding to audio types;

[0223] The video features of the video frame are added to the distribution feature vector corresponding to the video type to form the video input vector;

[0224] The audio features of the corresponding audio frame are added to the distribution feature vector corresponding to the audio type to form the audio input vector;

[0225] The video input vector and the audio input vector are input into a predetermined model, and the corresponding dual-stream fusion feature is obtained from the vector output by the predetermined model;

[0226] Acquire a temporal representation, which includes N temporal feature vectors, and the N temporal feature vectors represent inter-frame temporal information;

[0227] For each dual-stream fusion feature, the dual-stream fusion feature is added to the corresponding temporal feature vector in the temporal representation to form a temporal input vector;

[0228] The N temporal input vectors are input into the predetermined model, the target fusion feature is obtained from the vector output by the predetermined model, and the target fusion feature is output to the classifier.

[0229] The step of updating the model parameters in the end-to-end classification model through gradient descent includes:

[0230] Gradient descent updates the model parameters in the end-to-end classification model together with the distribution feature vectors in the type representation and the temporal feature vectors in the temporal representation.
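The two fusion stages of Example 7 can be sketched as follows, assuming video and audio features share a dimension D. The passage above does not define the predetermined model, so `predetermined_model` below is a hypothetical stand-in (one parameter-free self-attention pass followed by mean pooling); the function names and array shapes are likewise assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predetermined_model(tokens):
    """Hypothetical stand-in: self-attention over the tokens, then mean pooling.

    tokens: (T, D) array; returns a single (D,) vector.
    """
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    attended = softmax(scores, axis=-1) @ tokens
    return attended.mean(axis=0)

def fuse(video_feats, audio_feats, type_repr, temporal_repr):
    """video_feats, audio_feats: (N, D); type_repr: (2, D); temporal_repr: (N, D)."""
    n = video_feats.shape[0]
    dual = []
    for i in range(n):
        # Add the type-specific distribution feature vector to each modality.
        video_in = video_feats[i] + type_repr[0]
        audio_in = audio_feats[i] + type_repr[1]
        # Dual-stream fusion of the paired video and audio input vectors.
        dual.append(predetermined_model(np.stack([video_in, audio_in])))
    dual = np.stack(dual)
    # Add the temporal feature vector of each position, then fuse over time.
    temporal_in = dual + temporal_repr
    return predetermined_model(temporal_in)
```

In this sketch `type_repr` and `temporal_repr` are plain arrays; in training they would be learnable and updated by gradient descent alongside the other model parameters, as paragraph [0230] states.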

[0231] According to one or more embodiments of this disclosure, Example 8 provides a method of any of Examples 5 to 7, wherein calculating the training loss based on the classification labels output by the end-to-end classification model and the true labels of the training videos includes:

[0232] The classification error is calculated based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0233] The similarity error is calculated based on the similarity between the video features corresponding to each video frame and the audio features corresponding to the audio frame.

[0234] The training loss is calculated based on the classification error and the similarity error.
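A minimal sketch of the combined loss in Example 8 follows. The disclosure states only that the training loss is calculated from a classification error and a similarity error; the use of cross-entropy, of (1 - cosine similarity) averaged over the N frame pairs, and of a weighted sum with the hypothetical `weight` parameter are assumptions made here for illustration.

```python
import numpy as np

def classification_error(logits, true_label):
    """Cross-entropy of softmax(logits) against an integer class index."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[true_label])

def similarity_error(video_feats, audio_feats, eps=1e-8):
    """Mean (1 - cosine similarity) over the N paired frames (assumed form)."""
    v = video_feats / (np.linalg.norm(video_feats, axis=1, keepdims=True) + eps)
    a = audio_feats / (np.linalg.norm(audio_feats, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(v * a, axis=1)))

def training_loss(logits, true_label, video_feats, audio_feats, weight=0.1):
    # Weighted sum of the two errors; the weighting scheme is an assumption.
    return (classification_error(logits, true_label)
            + weight * similarity_error(video_feats, audio_feats))
```

Under this assumed form, the similarity term pushes the video and audio features of the same frame toward each other, so the backbone is encouraged to encode the correlation between the two streams.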

[0235] According to one or more embodiments of this disclosure, Example 9 provides a video classification apparatus, comprising:

[0236] The audio and video determination module is used to determine the target video and the target audio corresponding to the target video;

[0237] The audio and video frame segmentation module is used to extract N video frames from the target video and to segment the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0238] The feature extraction module is used to extract the video features corresponding to each video frame and the audio features corresponding to each audio frame.

[0239] The dual-stream fusion module is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features.

[0240] The temporal fusion module is used to fuse N dual-stream fusion features in a temporal sequence to obtain the target fusion feature;

[0241] The video classification module is used to determine the classification label of the target video based on the target fusion features.
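The frame segmentation step performed by the audio and video frame segmentation module can be sketched as follows. The passage above does not fix a sampling strategy, so the uniform scheme below (one video frame from the centre of each of N equal segments, and N equal-length audio frames with any trailing remainder dropped) is an assumption, and both function names are hypothetical.

```python
import numpy as np

def sample_frame_indices(num_video_frames, n):
    """Pick N frame indices, one at the centre of each of N equal segments."""
    edges = np.linspace(0, num_video_frames, n + 1)
    centres = (edges[:-1] + edges[1:]) / 2
    return np.minimum(centres.astype(int), num_video_frames - 1)

def segment_audio(audio, n):
    """Split a 1-D waveform into N equal-length audio frames.

    Samples that do not fill the last frame are dropped (assumed policy).
    """
    audio = np.asarray(audio)
    frame_len = audio.size // n
    return audio[:frame_len * n].reshape(n, frame_len)
```

With this scheme the i-th audio frame covers the same time span as the segment from which the i-th video frame is taken, which gives the one-to-one correspondence between the two sets of frames.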

[0242] According to one or more embodiments of this disclosure, Example 10 provides a model training apparatus for training an end-to-end classification model, the end-to-end classification model including a backbone network, a fusion model, and a classifier, the apparatus comprising:

[0243] The training data determination module is used to determine multiple training samples, each of which contains a training video and a corresponding training audio.

[0244] The training data frame segmentation module is used to extract N video frames from the training video for the training samples, and to segment the training audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1.

[0245] A frame input module is used to input the N video frames and the N audio frames into the end-to-end classification model;

[0246] The loss calculation module is used to calculate the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos.

[0247] The parameter training module is used to update the model parameters in the end-to-end classification model using gradient descent based on the training loss.

[0248] In the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model.

[0249] The fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features, fuse N dual-stream fusion features in time sequence to obtain the target fusion feature, and output the target fusion feature to the classifier.

[0250] The classifier is used to output corresponding classification labels based on the target fusion features.
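The update performed by the parameter training module can be illustrated with a toy example. The sketch below is not the disclosed model: it assumes mean pooling as the temporal fusion, a single linear layer as the classifier, and cross-entropy as the training loss, and it holds the dual-stream fusion features fixed. It shows only how one gradient-descent step can update both the classifier weights and the temporal feature vectors; `train_step` and its parameters are hypothetical.

```python
import numpy as np

def train_step(dual_feats, temporal_repr, weights, true_label, lr=0.1):
    """One gradient-descent step on a toy fusion model and linear classifier.

    dual_feats: (N, D) dual-stream fusion features, held fixed here;
    temporal_repr: (N, D) learnable temporal feature vectors;
    weights: (C, D) linear classifier.
    Returns (loss, new_temporal_repr, new_weights).
    """
    n = dual_feats.shape[0]
    # Toy temporal fusion: add temporal vectors, then mean-pool over time.
    target = (dual_feats + temporal_repr).mean(axis=0)
    logits = weights @ target
    shifted = logits - logits.max()
    probs = np.exp(shifted) / np.exp(shifted).sum()
    loss = float(-np.log(probs[true_label]))
    # Gradient of cross-entropy w.r.t. the logits: softmax minus one-hot.
    d_logits = probs.copy()
    d_logits[true_label] -= 1.0
    d_weights = np.outer(d_logits, target)
    d_target = weights.T @ d_logits
    # Mean pooling spreads the gradient evenly over the N temporal vectors.
    d_temporal = np.tile(d_target / n, (n, 1))
    return loss, temporal_repr - lr * d_temporal, weights - lr * d_weights
```

Calling `train_step` repeatedly with the returned arrays performs plain gradient descent; a real implementation would also backpropagate into the backbone network and the type representation.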

[0251] According to one or more embodiments of the present disclosure, Example 11 provides a computer-readable medium having a computer program stored thereon that, when executed by a processing device, implements the method of any of Examples 1 to 4 or any of Examples 5 to 8.

[0252] According to one or more embodiments of this disclosure, Example 12 provides an electronic device, including:

[0253] A storage device on which computer programs are stored;

[0254] A processing device for executing the computer program in the storage device to implement the method of any of Examples 1 to 4 or any of Examples 5 to 8.

[0255] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, the scope covers technical solutions formed by substituting the above features with (but not limited to) technical features disclosed in this disclosure that have similar functions.

[0256] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.

[0257] Although the subject matter has been described using language specific to structural features and/or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative examples of implementing the claims.

Claims

1. A video classification method, characterized by comprising:
determining a target video and a target audio corresponding to the target video;
extracting N video frames from the target video, and segmenting the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1;
extracting the video features corresponding to each video frame and the audio features corresponding to each audio frame;
for each video frame, fusing the video features of the video frame with the audio features of the corresponding audio frame to obtain a corresponding dual-stream fusion feature;
fusing the N dual-stream fusion features in time sequence to obtain a target fusion feature; and
determining the classification label of the target video based on the target fusion feature;
wherein the step of fusing the video features of the video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion feature includes:
obtaining type representations, which include distribution feature vectors corresponding to video types and distribution feature vectors corresponding to audio types;
adding the video features of the video frame to the distribution feature vector corresponding to the video type to form a video input vector;
adding the audio features of the corresponding audio frame to the distribution feature vector corresponding to the audio type to form an audio input vector; and
inputting the video input vector and the audio input vector into a predetermined model, and obtaining the corresponding dual-stream fusion feature from the vector output by the predetermined model;
and wherein the step of fusing the N dual-stream fusion features in time sequence to obtain the target fusion feature includes:
acquiring a temporal representation, which includes N temporal feature vectors, the N temporal feature vectors representing inter-frame temporal information;
for each dual-stream fusion feature, adding the dual-stream fusion feature to the corresponding temporal feature vector in the temporal representation to form a temporal input vector; and
inputting the N temporal input vectors into a predetermined model, and obtaining the target fusion feature from the vector output by the predetermined model.

2. The method according to claim 1, characterized in that each video frame extracted from the target video is represented as a two-dimensional or three-dimensional array, and each audio frame segmented from the target audio is represented as a one-dimensional array;
and the extracting of the video features corresponding to each video frame and the audio features corresponding to each audio frame includes:
transforming the one-dimensional array of the audio frame into a two-dimensional array using a fast Fourier transform;
normalizing the values in the two-dimensional array to obtain a target array representing the audio frame; and
inputting the two-dimensional or three-dimensional array representing the video frame and the target array representing the audio frame into the same backbone network to extract the video features corresponding to the video frame and the audio features corresponding to the audio frame, respectively.

3. A model training method, characterized in that it is used for training an end-to-end classification model comprising a backbone network, a fusion model, and a classifier, the method comprising:
determining multiple training samples, each containing a training video and its corresponding training audio;
for the training samples, extracting N video frames from the training video, and segmenting the training audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1;
inputting the N video frames and the N audio frames into the end-to-end classification model;
calculating the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos;
based on the training loss, updating the model parameters in the end-to-end classification model using gradient descent;
obtaining type representations, which include distribution feature vectors corresponding to video types and distribution feature vectors corresponding to audio types;
adding the video features of the video frame to the distribution feature vector corresponding to the video type to form a video input vector;
adding the audio features of the corresponding audio frame to the distribution feature vector corresponding to the audio type to form an audio input vector;
inputting the video input vector and the audio input vector into a predetermined model, and obtaining the corresponding dual-stream fusion feature from the vector output by the predetermined model;
wherein, in the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model;
the fusion model is used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features, fuse the N dual-stream fusion features in time sequence to obtain the target fusion feature, and output the target fusion feature to the classifier;
the classifier is used to output a corresponding classification label based on the target fusion feature;
the fusion model is specifically used for:
acquiring a temporal representation, which includes N temporal feature vectors, the N temporal feature vectors representing inter-frame temporal information;
for each dual-stream fusion feature, adding the dual-stream fusion feature to the corresponding temporal feature vector in the temporal representation to form a temporal input vector;
inputting the N temporal input vectors into the predetermined model, obtaining the target fusion feature from the vector output by the predetermined model, and outputting the target fusion feature to the classifier;
and the step of updating the model parameters in the end-to-end classification model through gradient descent includes:
updating, by gradient descent, the model parameters in the end-to-end classification model together with the distribution feature vectors in the type representation and the temporal feature vectors in the temporal representation.

4. The method according to claim 3, characterized in that each video frame extracted from the training video is represented as a two-dimensional or three-dimensional array, each audio frame segmented from the training audio is represented as a one-dimensional array, and the end-to-end classification model further includes a preprocessing model;
the step of inputting the N video frames and the N audio frames into the end-to-end classification model includes:
inputting the two-dimensional or three-dimensional array representing the video frames into the backbone network; and
inputting the one-dimensional array representing the audio frame into the preprocessing model;
wherein the preprocessing model is used to transform the one-dimensional array of audio frames into a two-dimensional array through a fast Fourier transform, normalize the values in the two-dimensional array to obtain a target array representing the audio frames, and output the target array representing the audio frames to the backbone network.

5. The method according to claim 3 or 4, characterized in that the step of calculating the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos includes:
calculating the classification error based on the classification labels output by the end-to-end classification model and the real labels of the training videos;
calculating the similarity error based on the similarity between the video features corresponding to each video frame and the audio features corresponding to the audio frame; and
calculating the training loss based on the classification error and the similarity error.

6. A video classification device, characterized by comprising:
an audio and video determination module, used to determine the target video and the target audio corresponding to the target video;
an audio and video frame segmentation module, used to extract N video frames from the target video and to segment the target audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1;
a feature extraction module, used to extract the video features corresponding to each video frame and the audio features corresponding to each audio frame;
a dual-stream fusion module, used to fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain the corresponding dual-stream fusion features;
a temporal fusion module, used to fuse the N dual-stream fusion features in a temporal sequence to obtain the target fusion feature; and
a video classification module, used to determine the classification label of the target video based on the target fusion feature;
wherein the dual-stream fusion module is further used to: obtain type representations, which include distribution feature vectors corresponding to video types and distribution feature vectors corresponding to audio types; add the video features of the video frame to the distribution feature vector corresponding to the video type to obtain a video input vector; add the audio features of the corresponding audio frame to the distribution feature vector corresponding to the audio type to obtain an audio input vector; and input the video input vector and the audio input vector into a predetermined model, and obtain the corresponding dual-stream fusion feature from the vector output by the predetermined model;
and the temporal fusion module is further used to: obtain a temporal representation, which includes N temporal feature vectors, the N temporal feature vectors representing inter-frame temporal information; for each dual-stream fusion feature, add the dual-stream fusion feature to the corresponding temporal feature vector in the temporal representation to form a temporal input vector; and input the N temporal input vectors into a predetermined model, and obtain the target fusion feature from the vector output by the predetermined model.

7. A model training device, characterized in that it is used for training an end-to-end classification model including a backbone network, a fusion model, and a classifier, the device comprising:
a training data determination module, used to determine multiple training samples, each of which contains a training video and a corresponding training audio;
a training data frame segmentation module, used to extract N video frames from the training video for the training samples, and to segment the training audio into N audio frames that correspond one-to-one with the N video frames, where N is a positive integer greater than 1;
a frame input module, used to input the N video frames and the N audio frames into the end-to-end classification model;
a loss calculation module, used to calculate the training loss based on the classification labels output by the end-to-end classification model and the real labels of the training videos; and
a parameter training module, used to update the model parameters in the end-to-end classification model using gradient descent based on the training loss;
wherein, in the end-to-end classification model, the backbone network is used to extract video features corresponding to each video frame and audio features corresponding to each audio frame, and output them to the fusion model;
the fusion model is used to: fuse the video features of each video frame with the audio features of the corresponding audio frame to obtain a corresponding dual-stream fusion feature; fuse the N dual-stream fusion features in a temporal sequence to obtain a target fusion feature; output the target fusion feature to the classifier; obtain a type representation, which includes a distribution feature vector corresponding to the video type and a distribution feature vector corresponding to the audio type; add the video features of the video frame to the distribution feature vector corresponding to the video type to obtain a video input vector; add the audio features of the corresponding audio frame to the distribution feature vector corresponding to the audio type to obtain an audio input vector; and input the video input vector and the audio input vector together into a predetermined model, and obtain the corresponding dual-stream fusion feature from the vector output by the predetermined model;
the classifier is used to output a corresponding classification label based on the target fusion feature;
the fusion model is specifically used for:
acquiring a temporal representation, which includes N temporal feature vectors, the N temporal feature vectors representing inter-frame temporal information;
for each dual-stream fusion feature, adding the dual-stream fusion feature to the corresponding temporal feature vector in the temporal representation to form a temporal input vector;
inputting the N temporal input vectors into the predetermined model, obtaining the target fusion feature from the vector output by the predetermined model, and outputting the target fusion feature to the classifier;
and the step of updating the model parameters in the end-to-end classification model through gradient descent includes:
updating, by gradient descent, the model parameters in the end-to-end classification model together with the distribution feature vectors in the type representation and the temporal feature vectors in the temporal representation.

8. A computer-readable medium having a computer program stored thereon, characterized in that, when executed by a processing device, the program implements the steps of the method according to any one of claims 1-2 or any one of claims 3-5.

9. An electronic device, characterized by comprising:
a storage device on which a computer program is stored; and
a processing device for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1-2 or any one of claims 3-5.
