
A Cross-modal Feature Fusion System Based on Attention Mechanism

A feature fusion and cross-modal technology, applied to computer components, character and pattern recognition, biological neural network models, etc.; it addresses the problem that prior approaches do not use audio and RGB images together as input for video representation learning.

Active Publication Date: 2022-07-05
SUN YAT SEN UNIV
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

However, the prior invention does not involve any technical content on using audio and RGB images as input to achieve video representation learning.



Examples


Embodiment 1

[0063] As shown in Figure 1, a cross-modal feature fusion system based on an attention mechanism includes:

[0064] an audio-video correlation analysis module, used to align the two modalities of audio and video RGB images;

[0065] a supervised contrastive learning module, used to extract modality features from the two modalities of audio and video RGB images;

[0066] a cross-modal feature fusion module, used to learn global contextual representations by exploiting the relevant knowledge between modalities.

[0067] The audio-video correlation analysis module samples 16 consecutive RGB frames from a video i to generate an RGB segment v_i as the input of the RGB image modality. Since only one segment is sampled per video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the video's mel spectrogram a_i as the input of the audio modality, where i = 1, ..., N.
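A minimal sketch of this input preparation step, assuming the video has already been decoded into a frame array and a mono audio waveform; the helper names, sample rate, and mel parameters are illustrative assumptions rather than values given in the patent:

```python
# Hedged sketch of the (v_i, a_i) input preparation described above; not the patent's code.
import numpy as np
import librosa

def make_rgb_segment(frames: np.ndarray, start: int, m: int = 16) -> np.ndarray:
    """Sample m consecutive RGB frames from a decoded video of shape (T, H, W, 3)."""
    segment = frames[start:start + m]           # (m, H, W, 3)
    return segment.transpose(3, 0, 1, 2)        # reorder to (c, l, h, w)

def make_mel_spectrogram(waveform: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Convert the whole-video audio track to a log-mel spectrogram."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sample_rate, n_mels=64)
    return librosa.power_to_db(mel)             # (n_mels, time)

# Toy usage: one video i yields an aligned (v_i, a_i) pair.
frames = np.random.rand(120, 112, 112, 3).astype(np.float32)   # fake decoded frames
audio = np.random.randn(16000 * 5).astype(np.float32)          # fake 5-second waveform
v_i = make_rgb_segment(frames, start=0, m=16)
a_i = make_mel_spectrogram(audio)
```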

[0068] The specific pr...

Embodiment 2

[0076] As shown in Figure 1, a cross-modal feature fusion system based on an attention mechanism includes:

[0077] an audio-video correlation analysis module, used to align the two modalities of audio and video RGB images;

[0078] a supervised contrastive learning module, used to extract modality features from the two modalities of audio and video RGB images;

[0079] a cross-modal feature fusion module, used to learn global contextual representations by exploiting the relevant knowledge between modalities.

[0080] The audio-video correlation analysis module samples 16 consecutive RGB frames from a video i to generate an RGB segment v_i as the input of the RGB image modality. Since only one segment is sampled per video, and in order to make full use of the effective audio information in the video, the audio extracted from the entire video i is converted into the video's mel spectrogram a_i as the input of the audio modality, where i = 1, ..., N.

[0081] The specific pr...
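The paragraphs above only name the supervised contrastive learning module; as a hedged illustration of how such a module is commonly trained, the following sketch follows the standard SupCon formulation (Khosla et al.), which the excerpt here does not spell out, so treat it as an assumption rather than the patent's loss:

```python
# Hedged sketch of a supervised contrastive (SupCon-style) loss over per-video embeddings.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features: torch.Tensor, labels: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """features: (B, D) embeddings, labels: (B,) class ids."""
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                    # pairwise similarities
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()    # numerical stability

    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same_class & not_self                               # positives: same class, not itself

    exp_logits = torch.exp(logits) * not_self
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)

    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count
    return loss.mean()

# Toy usage: 8 embeddings from 3 classes.
loss = supervised_contrastive_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```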

Embodiment 3

[0119] For convenience in describing each module, given N different videos, each video segment has size c × l × h × w, where c is the number of channels, l is the number of frames, and h and w are the height and width of a frame. The size of the 3D convolution kernel is t × d × d, where t is the temporal length and d is the spatial size. The video RGB image sequence is defined as V = {v_1, ..., v_N}, where v_i is the RGB segment generated by consecutively sampling m frames from video i (i = 1, ..., N). The audio modality is the mel spectrogram generated by applying the short-time Fourier transform to the entire audio of a video; a segment of video RGB images and the mel spectrogram generated from the entire video are aligned as input. The audio mel spectrogram sequence is expressed as A = {a_1, ..., a_N}, where a_i is the mel spectrogram generated from the audio extracted from video i, and the category label of video i is denoted y_i.
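To make the notation concrete, here is a minimal shape check with arbitrarily chosen values for c, l, h, w, t, and d; it illustrates the tensor layout only and is not the patent's model:

```python
# Shape illustration for the notation above (values are assumptions, not from the patent).
import torch
import torch.nn as nn

N, c, l, h, w = 4, 3, 16, 112, 112        # a batch of N RGB segments v_i
t, d = 3, 3                               # 3D kernel: time length t, spatial size d

segments = torch.randn(N, c, l, h, w)     # video modality input, size c x l x h x w per segment
conv3d = nn.Conv3d(in_channels=c, out_channels=64,
                   kernel_size=(t, d, d), padding=(1, 1, 1))
video_feat = conv3d(segments)             # (N, 64, l, h, w) with this padding

mel = torch.randn(N, 1, 64, 200)          # audio modality: N mel spectrograms a_i
labels = torch.randint(0, 10, (N,))       # y_i: category label of each video i
print(video_feat.shape, mel.shape, labels.shape)
```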

[0120] 1) Audio-video correlation analysis (audio-video alignment)

[0121] The sound si...


Abstract

The invention provides a cross-modal feature fusion system based on an attention mechanism. Based on the complementary relationship between audio information and video image information, the system uses supervised contrastive learning as a framework to extract features from the two modalities, audio and video. An audio-video correlation analysis module is constructed to realize audio-video alignment, and a cross-modal feature fusion module based on an attention mechanism is designed to fuse the audio and video features. Audio and RGB images are used as input to learn video representations.
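As a hedged sketch of what attention-based cross-modal fusion of this kind can look like, the block below uses standard cross-attention in both directions between audio and video features; the feature dimension, number of heads, pooling, and projection are assumptions, since the abstract does not disclose the exact architecture:

```python
# Illustrative cross-modal attention fusion (an assumption-driven sketch, not the patented design).
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Video features attend to audio features and vice versa.
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, video_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (N, Tv, dim), audio_feat: (N, Ta, dim)
        v_ctx, _ = self.v2a(video_feat, audio_feat, audio_feat)   # video queries audio
        a_ctx, _ = self.a2v(audio_feat, video_feat, video_feat)   # audio queries video
        fused = torch.cat([v_ctx.mean(dim=1), a_ctx.mean(dim=1)], dim=-1)
        return self.proj(fused)                                   # global context representation

fusion = CrossModalFusion()
out = fusion(torch.randn(2, 16, 256), torch.randn(2, 32, 256))    # -> (2, 256)
```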

Description

technical field

[0001] The invention relates to the technical field of audio and video processing, and more particularly, to a cross-modal feature fusion system based on an attention mechanism.

background technique

[0002] For video representation learning, a large number of supervised learning methods have received increasing attention, including both traditional and deep learning methods. For example, a two-stream CNN processes video frames and dense optical flow in separate streams, then directly fuses the class scores of the two networks to obtain the classification result. C3D processes video with 3D convolution kernels. A temporal segment network (TSN) samples each video into several segments to model the long-range temporal structure of the video. A temporal relation network (TRN) introduces an interpretable network to learn and reason about temporal dependencies between video frames at multiple temporal scales. The temporal shift module (TSM) shifts part of the chann...

Claims


Application Information

Patent Type & Authority: Patent (China)
IPC(8): G06F16/583, G06F16/55, G06F16/65, G06F16/683, G06V10/764, G06V10/82, G06V10/80, G06K9/62, G06N3/04, G06N3/08
Inventor: 王青, 兰浩源, 刘阳, 林倞
Owner: SUN YAT SEN UNIV