Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium

A cross-modal retrieval technology based on an attention mechanism, applied in the video field, which addresses the problem of maintaining the semantic similarity of data across modalities.

Pending Publication Date: 2021-01-19
HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL


Problems solved by technology

[0005] Cross-modal video retrieval based on deep learning faces several problems: 1) It is an NP problem to map the or




Embodiment Construction

[0032] The invention discloses a cross-modal video retrieval method based on a multi-head self-attention mechanism. It mainly addresses the problem of fully mining the semantic information inside multi-modal data and generating efficient vector representations. Through supervised training, the semantic information in the multi-modal training data is fully exploited, and a multi-head self-attention mechanism is introduced to capture the subtle interactions within videos and texts and to selectively focus on the key information of the multi-modal data. This enhances the representation ability of the model, better mines the data semantics, and ensures the consistency of data distances between the original space and the shared subspace. When training the model, a supervised machine learning method is used with a triplet-based ranking loss function that introduces the order of positive samples within each batch, which better corrects ranking errors. For two different mo...
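The triplet-based ranking loss described above can be sketched as follows. This is a minimal max-margin formulation, not the patent's exact loss; the function name, margin value, and similarity inputs are illustrative assumptions:

```python
def triplet_ranking_loss(sim_pos, sim_negs, margin=0.2):
    """Max-margin triplet ranking loss for one (anchor, positive) pair.

    sim_pos  : similarity between the anchor and its matching sample
    sim_negs : similarities between the anchor and non-matching samples
    margin   : how far below the positive every negative must be pushed
    """
    # Each negative contributes only if it violates the margin.
    return sum(max(0.0, margin - sim_pos + s) for s in sim_negs)

# One negative already satisfies the margin (0.5), one violates it (0.75).
loss = triplet_ranking_loss(sim_pos=0.8, sim_negs=[0.5, 0.75], margin=0.2)
```

In a batch, such terms would be accumulated over all (video, text) pairs in both retrieval directions; here only the first negative satisfies the margin, so the loss comes entirely from the second.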



Abstract

The invention provides a cross-modal video retrieval method and system based on a multi-head self-attention mechanism, and a storage medium. The cross-modal video retrieval method comprises a video encoding step, a text encoding step, and a joint embedding step. Semantic information in the multi-modal training data is fully utilized for training; a multi-head self-attention mechanism is introduced to capture fine-grained interactions within videos and texts and to selectively attend to the key information of the multi-modal data, enhancing the representation capability of the model and better mining the data semantics. The method is highly practical and easy to popularize, and it ensures consistency of the data distances between the original space and the shared subspace. Experiments show that the similarity of the data in the original space can be effectively maintained and that retrieval accuracy can be improved.
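As an illustration of the multi-head self-attention step named in the abstract, the NumPy sketch below computes scaled dot-product attention over a sequence of feature vectors. This is a generic textbook formulation, not the patent's specific encoder; all weight matrices, dimensions, and the random input are hypothetical:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product self-attention with num_heads parallel heads.

    X            : (seq_len, d_model) input features (e.g. frame or word vectors)
    Wq/Wk/Wv/Wo  : (d_model, d_model) projection matrices
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then split the model dimension across heads: (heads, seq, d_head)
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores per head: (heads, seq, seq)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # (heads, seq, d_head)
    # Concatenate heads back to (seq, d_model) and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
```

Each head attends to the sequence independently, which is what lets the mechanism capture several different intra-video or intra-text interactions at once.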

Description

technical field [0001] The present invention relates to the field of video technology, and in particular to a cross-modal video retrieval method, system, and storage medium based on a multi-head self-attention mechanism. background technique [0002] With the explosive growth of multimedia data, traditional single-modal retrieval can no longer meet people's retrieval needs in the multimedia field. Users want to use data of one modality as the query object to retrieve semantically similar content of another modality, such as retrieving text with an image, or retrieving images or videos with text; this is cross-modal retrieval. [0003] Cross-modal retrieval must process data of different modalities at the same time. These data have a certain similarity in content, but their underlying features are heterogeneous, so their similarity cannot be computed directly; this is the "semantic gap" problem. The method of mapping data from different modalities into a com...
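To illustrate the common-subspace idea from paragraph [0003]: once text and video features are mapped into the same space, cross-modal retrieval reduces to a nearest-neighbor search under a similarity measure such as cosine similarity. The sketch below uses made-up embeddings and identifiers; it shows the retrieval step only, not how the mapping is learned:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, assumed already mapped into the shared subspace
text_query = [0.9, 0.1, 0.3]
video_embeddings = {
    "video_1": [0.8, 0.2, 0.4],   # semantically close to the query
    "video_2": [0.1, 0.9, 0.0],   # semantically distant
}

# Text-to-video retrieval: rank videos by similarity to the query
best = max(video_embeddings, key=lambda k: cosine(text_query, video_embeddings[k]))
```

Because both modalities live in one space, the same ranking works in the reverse direction (video-to-text) with the roles of query and candidates swapped.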


Application Information

IPC(8): G06F16/732, G06F16/783, G06K9/00, G06N3/04
CPC: G06F16/732, G06F16/783, G06V20/40, G06N3/045
Inventor: 漆舒汉, 王轩, 丁洛, 张加佳, 廖清, 刘洋, 夏文, 蒋琳
Owner HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL