Video data feature extraction method based on audio and video multi-mode time sequence prediction

A feature extraction and time series prediction technology, applied in character and pattern recognition, instrumentation, computing, etc. It addresses the problems of reduced accuracy in video action classification, reduced robustness of the model to noise, and reduced accuracy of action recognition. Its effects are to reduce the burden of network learning, remove redundant modal features, and improve understanding ability.
CN112906624A · Active · Publication Date: 2021-06-04 · HEFEI UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HEFEI UNIV OF TECH
Publication Date
2021-06-04

Abstract

The invention discloses a video data feature extraction method based on audio and video multi-modal time series prediction. The method comprises the following steps: 1, obtaining a video data set through a video collection device and constructing audio-video dual-stream data pairs; 2, applying a series of modality-specific data augmentation operations to each video frame in the video stream and each audio clip in the audio stream, and converting the one-dimensional audio into a two-dimensional spectrogram; 3, constructing an audio and video multi-modal prediction model comprising a video stream feature extraction network unit, an audio stream feature extraction network unit, a time sequence information aggregation network unit and a multi-modal interaction prediction network unit; and 4, calculating the total loss of the audio and video multi-modal prediction from the uncertain features obtained by the multi-modal interaction prediction, and optimizing the network. By exploiting the temporal order of the video together with the interaction between the audio and video streams, the method mines useful information from the video in a self-supervised manner, which improves the effectiveness of feature extraction and benefits downstream tasks such as video understanding, sound source localization and anomaly detection.
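Step 2 mentions converting one-dimensional audio into a two-dimensional spectrogram. The abstract does not say which spectrogram is used, so the following is only a minimal sketch assuming a log-magnitude short-time Fourier transform computed with NumPy. The function name `log_spectrogram` and the values of `frame_len`, `hop` and `eps` are illustrative assumptions, not taken from the patent.

```python
import numpy as np


def log_spectrogram(waveform, frame_len=256, hop=128, eps=1e-10):
    """Convert a 1-D waveform into a 2-D log-magnitude spectrogram.

    frame_len, hop and eps are illustrative values, not from the patent.
    Returns an array of shape (num_frames, frame_len // 2 + 1).
    """
    waveform = np.asarray(waveform, dtype=np.float64)
    if waveform.ndim != 1 or waveform.size < frame_len:
        raise ValueError("expected a 1-D waveform with at least frame_len samples")
    num_frames = 1 + (waveform.size - frame_len) // hop
    # Index matrix that slices the waveform into overlapping frames.
    idx = hop * np.arange(num_frames)[:, None] + np.arange(frame_len)[None, :]
    # Apply a Hann window to each frame to limit spectral leakage.
    frames = waveform[idx] * np.hanning(frame_len)
    # Magnitude of the one-sided FFT of each windowed frame.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)


# Synthetic test signal: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = log_spectrogram(tone)  # rows are time frames, columns are frequency bins
```

The resulting 2-D array can be treated like a single-channel image, which is presumably what lets an audio stream feature extraction network process it in the same way a video frame is processed.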

Description

Technical field

[0001] The invention relates to the field of video data processing and analysis, and in particular to a video data feature extraction method based on audio and video multi-modal time series prediction.

Background art

[0002] In the context of today's Internet big data, processing and analyzing specific data is becoming increasingly important. In some fields of artificial intelligence, this kind of data analysis is also called "representation learning", that is, extracting useful information from data. Machine learning, and deep learning algorithms in particular, rely heavily on data representation, so how to mine the potentially useful information in the massive data on the Internet in a self-supervised manner has attracted extensive attention from researchers. Human cognition is a response built on the combined perception of multiple modalities, in which vision and hearing usually coexist, for example, the wind whistl...
