Video data feature extraction method based on audio and video multi-mode time sequence prediction

A feature extraction and time-series prediction technology, applied in character and pattern recognition, instrumentation, computing, etc. It addresses the problems of reduced accuracy in video action classification, reduced robustness of the model to noise, and reduced action recognition accuracy, and achieves the effects of reducing the network's learning burden, removing redundant modal features, and improving video understanding ability.

Active Publication Date: 2021-06-04
HEFEI UNIV OF TECH


Problems solved by technology

Single-modality self-supervised learning is often borrowed from the image domain, for example generating missing frames in a video clip or predicting the video playback rate. For video understanding, however, relying on the video stream alone is far from sufficient: the laughter of the people being observed and cheerful background music, for instance, help improve the classification accuracy of a funny video. For multimodal video representation learning, some researchers use clustering results to construct pseudo-labels that guide feature classification across modalities, but the performance of such methods depends heavily on the chosen clustering algorithm or requires the number of category clusters to be set in advance. In addition, most researchers adopt multimodal representation learning with a second modality of audio or optical flow, where extracting optical flow is time-consuming and its quality depends on the performance of the selected optical-flow extraction network. For audio-video representation learning, temporal alignment between the two streams is usually used, and a large number of negative pairs are introduced for self-supervised contrastive learning.

However, existing audio-video multimodal feature extraction methods ignore the temporal order between audio and video, i.e. the connection between frames is not considered, even though temporal order is precisely what distinguishes video from still images. Processing frames independently loses important temporally coherent information and limits the machine's understanding of the video; frame-level noise then easily reduces action recognition accuracy, while the loss of temporal information reduces the accuracy of video action classification and the noise robustness of the model.



Embodiment Construction

[0047] In this embodiment, as shown in Figure 1, a video data feature extraction method based on audio-video multimodal temporal prediction comprises the following steps:

[0048] Step 1. Use a video acquisition device to obtain a video data set, denoted X = {X_1, X_2, ..., X_i, ..., X_N}, where X_i represents the i-th video, 1 ≤ i ≤ N, and N is the total number of videos. Extract the audio stream A = {A_1, A_2, ..., A_i, ..., A_N} and the video stream V = {V_1, V_2, ..., V_i, ..., V_N} from the video data set X, where A_i denotes the audio stream of the i-th video X_i and V_i denotes the video stream of the i-th video X_i. Let S_i = (A_i, V_i) represent the i-th audio-video data pair, thereby constructing the audio-video data pair set S = {S_1, S_2, ..., S_i, ..., S_N};

[0049] In a specific implementation, for example, the opencv and moviepy tools are used (other tools may also be used in practice) to extract the video frames and the audio of each video separately, constructing the audio-video data pair set S, and the frame timestamps are retained for subsequent...
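As a hedged illustration of Step 1, the sketch below uses OpenCV and moviepy to separate the audio and video streams of each video and assemble the pair set S while keeping per-frame timestamps. The file layout, sampling rate, and helper names are assumptions made for illustration and are not specified by the patent.

```python
# Illustrative sketch of Step 1 (not the patent's reference implementation).
import os
import cv2
from moviepy.editor import VideoFileClip

def build_pair_set(video_dir, out_dir, audio_sr=16000):
    """Return a list of (audio_path, frames) pairs, one pair S_i = (A_i, V_i) per video X_i."""
    pairs = []
    for name in sorted(os.listdir(video_dir)):
        if not name.lower().endswith((".mp4", ".avi")):
            continue
        path = os.path.join(video_dir, name)

        # Audio stream A_i: write the soundtrack to a wav file with moviepy.
        audio_path = os.path.join(out_dir, name + ".wav")
        with VideoFileClip(path) as clip:
            if clip.audio is None:          # skip videos without a soundtrack
                continue
            clip.audio.write_audiofile(audio_path, fps=audio_sr, logger=None)

        # Video stream V_i: decode frames with OpenCV, keeping each frame's timestamp (ms)
        # so audio and video can be aligned in the later temporal-prediction steps.
        frames = []
        cap = cv2.VideoCapture(path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            t_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
            frames.append((t_ms, frame))
        cap.release()

        pairs.append((audio_path, frames))  # S_i = (A_i, V_i)
    return pairs
```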


Abstract

The invention discloses a video data feature extraction method based on audio-video multimodal temporal prediction. The method comprises the following steps: 1) obtaining a video data set through a video acquisition device and constructing audio-video two-stream data pairs; 2) applying a series of within-modality data augmentation operations to each video frame in the video stream and each audio clip in the audio stream, and converting the one-dimensional audio into a two-dimensional spectrogram; 3) constructing an audio-video multimodal prediction model comprising a video-stream feature extraction network unit, an audio-stream feature extraction network unit, a temporal information aggregation network unit, and a multimodal interaction prediction network unit; and 4) calculating the total audio-video multimodal prediction loss from the features obtained by multimodal interaction prediction and optimizing the network. By exploiting the temporal order of video and the interaction between the audio and video streams, the method can effectively mine useful information from video in a self-supervised manner, improving the effectiveness of feature extraction and benefiting practical downstream tasks such as video understanding, sound source localization, and anomaly detection.
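Step 2 of the abstract, converting one-dimensional audio into a two-dimensional spectrogram, can be illustrated with a short, hedged sketch. The patent does not fix the spectrogram type or its parameters, so the log-mel settings below are assumptions chosen only to show the transformation.

```python
# Illustrative sketch: 1-D audio clip -> 2-D (log-mel) spectrogram (parameters are assumptions).
import torch
import torchaudio

def audio_to_spectrogram(wav_path, sample_rate=16000, n_mels=64):
    waveform, sr = torchaudio.load(wav_path)              # (channels, samples): the 1-D signal
    if sr != sample_rate:
        waveform = torchaudio.functional.resample(waveform, sr, sample_rate)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels
    )(waveform)                                            # (channels, n_mels, time): image-like 2-D map
    return torch.log(mel + 1e-6)                           # log compression stabilizes the dynamic range
```

The resulting 2-D tensor can be fed to an image-style audio feature extraction network, analogous to how the video frames are fed to the video-stream feature extraction network unit.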

Description

Technical Field

[0001] The invention relates to the field of video data processing and analysis, in particular to a video data feature extraction method based on audio-video multimodal time-series prediction.

Background Technique

[0002] In the context of today's Internet big data, processing and analyzing specific kinds of data is becoming more and more important. In some fields of artificial intelligence, this kind of data analysis is also called "representation learning", that is, extracting useful information from data. Machine learning, and especially deep learning algorithms, rely heavily on data representation, so how to self-supervisedly mine the latent useful information in the massive data on the Internet has attracted extensive attention from researchers. As is well known, human cognition is a response based on the combined perception of multiple modalities, in which vision and hearing usually accompany each other; for example, the wind whistl...


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06K9/00
CPCG06V20/46
Inventor 陈雁翔赵鹏铖朱玉鹏盛振涛
Owner HEFEI UNIV OF TECH