Video data feature extraction method based on audio and video multi-mode time sequence prediction

A feature extraction and time-series prediction technology, applied in character and pattern recognition, instrumentation, computing, etc. It addresses the problems of reduced accuracy in video action classification, reduced robustness of the model to noise, and reduced action recognition accuracy, and achieves the effects of reducing the network's learning burden, removing redundant modal features, and improving video understanding ability.

Active Publication Date: 2021-06-04
HEFEI UNIV OF TECH


Problems solved by technology

Single-modality self-supervised learning is often borrowed from the image domain, for example generating missing frames in a video clip or predicting the video playback rate. For video understanding, however, relying on the video stream alone is far from sufficient: the laughter of the people being observed and cheerful background music, for instance, help improve the classification accuracy of a funny video. For multimodal video representation learning, some researchers use clustering results to construct pseudo-labels that guide feature classification across modalities, but the performance of such methods depends heavily on the chosen clustering algorithm or requires the number of category clusters to be set in advance. In addition, most researchers adopt multimodal representation learning with a second modality of audio or optical flow, where extracting optical flow is time-consuming and its quality depends on the performance of the selected optical-flow extraction network. For audio-video representation learning, temporal alignment between the two streams is usually used, and a large number of negative pairs are introduced for self-supervised contrastive learning.

However, existing audio-video multimodal feature extraction methods ignore the temporal order between audio and video, i.e. the connection between frames is not considered, even though temporal order is precisely what distinguishes video from still images. Processing frames independently loses important temporally coherent information and limits the machine's understanding of the video; frame-level noise then easily reduces action recognition accuracy, while the loss of temporal information reduces the accuracy of video action classification and the noise robustness of the model.



Embodiment Construction

[0047] In this embodiment, as shown in Figure 1, a video data feature extraction method based on audio-video multimodal temporal prediction comprises the following steps:

[0048] Step 1. Use a video acquisition device to obtain a video data set, denoted X = {X_1, X_2, ..., X_i, ..., X_N}, where X_i represents the i-th video, 1 ≤ i ≤ N, and N is the total number of videos. Extract the audio stream A = {A_1, A_2, ..., A_i, ..., A_N} and the video stream V = {V_1, V_2, ..., V_i, ..., V_N} from the video data set X, where A_i denotes the audio stream of the i-th video X_i and V_i denotes the video stream of the i-th video X_i. Let S_i = (A_i, V_i) represent the i-th audio-video data pair, thereby constructing the audio-video data pair set S = {S_1, S_2, ..., S_i, ..., S_N};

[0049] In a specific implementation, for example, the opencv and moviepy tools are used (other tools may also be used in practice) to extract the video frames and the audio of each video separately, constructing the audio-video data pair set S, and the frame timestamps are retained for subsequent...
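As a hedged illustration of Step 1, the sketch below uses OpenCV and moviepy to separate the audio and video streams of each video and assemble the pair set S while keeping per-frame timestamps. The file layout, sampling rate, and helper names are assumptions made for illustration and are not specified by the patent.

```python
# Illustrative sketch of Step 1 (not the patent's reference implementation).
import os
import cv2
from moviepy.editor import VideoFileClip

def build_pair_set(video_dir, out_dir, audio_sr=16000):
    """Return a list of (audio_path, frames) pairs, one pair S_i = (A_i, V_i) per video X_i."""
    pairs = []
    for name in sorted(os.listdir(video_dir)):
        if not name.lower().endswith((".mp4", ".avi")):
            continue
        path = os.path.join(video_dir, name)

        # Audio stream A_i: write the soundtrack to a wav file with moviepy.
        audio_path = os.path.join(out_dir, name + ".wav")
        with VideoFileClip(path) as clip:
            if clip.audio is None:          # skip videos without a soundtrack
                continue
            clip.audio.write_audiofile(audio_path, fps=audio_sr, logger=None)

        # Video stream V_i: decode frames with OpenCV, keeping each frame's timestamp (ms)
        # so audio and video can be aligned in the later temporal-prediction steps.
        frames = []
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            t_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
            frames.append((t_ms, frame))
        cap.release()

        pairs.append((audio_path, frames))  # S_i = (A_i, V_i)
    return pairs
```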


Abstract

The invention discloses a video data feature extraction method based on audio-video multimodal temporal prediction. The method comprises the following steps: 1) obtaining a video data set through a video acquisition device and constructing audio-video two-stream data pairs; 2) applying a series of within-modality data augmentation operations to each video frame in the video stream and each audio clip in the audio stream, and converting the one-dimensional audio into a two-dimensional spectrogram; 3) constructing an audio-video multimodal prediction model comprising a video-stream feature extraction network unit, an audio-stream feature extraction network unit, a temporal information aggregation network unit, and a multimodal interaction prediction network unit; and 4) calculating the total audio-video multimodal prediction loss from the features obtained by multimodal interaction prediction and optimizing the network. By exploiting the temporal order of video and the interaction between the audio and video streams, the method can effectively mine useful information from video in a self-supervised manner, improving the effectiveness of feature extraction and benefiting practical downstream tasks such as video understanding, sound source localization, and anomaly detection.
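Step 2 of the abstract, converting one-dimensional audio into a two-dimensional spectrogram, can be illustrated with a short, hedged sketch. The patent does not fix the spectrogram type or its parameters, so the log-mel settings below are assumptions chosen only to show the transformation.

```python
# Illustrative sketch: 1-D audio clip -> 2-D (log-mel) spectrogram (parameters are assumptions).
import torch
import torchaudio

def audio_to_spectrogram(wav_path, sample_rate=16000, n_mels=64):
    waveform, sr = torchaudio.load(wav_path)              # (channels, samples): the 1-D signal
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels
    )(waveform)                                            # (channels, n_mels, time): image-like 2-D map
    return torch.log(mel + 1e-6)                           # log compression stabilizes the dynamic range
```

The resulting 2-D tensor can be fed to an image-style audio feature extraction network, analogous to how the video frames are fed to the video-stream feature extraction network unit.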

Description

Technical Field

[0001] The invention relates to the field of video data processing and analysis, in particular to a video data feature extraction method based on audio-video multimodal time-series prediction.

Background Technique

[0002] In the context of today's Internet big data, processing and analyzing specific kinds of data is becoming more and more important. In some fields of artificial intelligence, this kind of data analysis is also called "representation learning", that is, extracting useful information from data. Machine learning, and especially deep learning algorithms, rely heavily on data representation, so how to self-supervisedly mine the latent useful information in the massive data on the Internet has attracted extensive attention from researchers. As is well known, human cognition is a response based on the combined perception of multiple modalities, in which vision and hearing usually accompany each other; for example, the wind whistl...


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06K9/00
CPCG06V20/46
Inventor 陈雁翔赵鹏铖朱玉鹏盛振涛
Owner HEFEI UNIV OF TECH