Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium

A cross-modal retrieval technology based on an attention mechanism, applied in the video field, which addresses the problem of maintaining the semantic similarity of data across modalities.

Pending Publication Date: 2021-01-19
HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL


Problems solved by technology

[0005] Cross-modal video retrieval based on deep learning faces several problems: 1) It is an NP problem to map the or




Embodiment Construction

[0032] The invention discloses a cross-modal video retrieval method based on a multi-head self-attention mechanism. It mainly addresses the problem of fully mining the semantic information inside multi-modal data and generating efficient vector representations. Through supervised training, the semantic information in the multi-modal training data is fully exploited, and a multi-head self-attention mechanism is introduced to capture the subtle interactions within videos and texts and to selectively focus on the key information of the multi-modal data. This enhances the representation ability of the model, better mines the data semantics, and ensures the consistency of data distances between the original space and the shared subspace. When training the model, a supervised machine learning method is used with a triplet-based ranking loss function that introduces the order of positive samples within each batch, which better corrects ranking errors. For two different mo...
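The triplet-based ranking loss described above can be sketched as follows. This is a minimal max-margin formulation, not the patent's exact loss; the function name, margin value, and similarity inputs are illustrative assumptions:

```python
def triplet_ranking_loss(sim_pos, sim_negs, margin=0.2):
    """Max-margin triplet ranking loss for one (anchor, positive) pair.

    sim_pos  : similarity between the anchor and its matching sample
    sim_negs : similarities between the anchor and non-matching samples
    margin   : how far below the positive every negative must be pushed
    """
    # Each negative contributes only if it violates the margin.
    return sum(max(0.0, margin - sim_pos + s) for s in sim_negs)

# One negative already satisfies the margin (0.5), one violates it (0.75).
loss = triplet_ranking_loss(sim_pos=0.8, sim_negs=[0.5, 0.75], margin=0.2)
```

In a batch, such terms would be accumulated over all (video, text) pairs in both retrieval directions; here only the first negative satisfies the margin, so the loss comes entirely from the second.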



Abstract

The invention provides a cross-modal video retrieval method and system based on a multi-head self-attention mechanism, and a storage medium. The cross-modal video retrieval method comprises a video encoding step, a text encoding step, and a joint embedding step. Semantic information in the multi-modal training data is fully utilized for training; a multi-head self-attention mechanism is introduced to capture fine-grained interactions within videos and texts and to selectively attend to the key information of the multi-modal data, enhancing the representation capability of the model and better mining the data semantics. The method is highly practical and easy to popularize, and it ensures consistency of the data distances between the original space and the shared subspace. Experiments show that the similarity of the data in the original space can be effectively maintained and that retrieval accuracy can be improved.
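As an illustration of the multi-head self-attention step named in the abstract, the NumPy sketch below computes scaled dot-product attention over a sequence of feature vectors. This is a generic textbook formulation, not the patent's specific encoder; all weight matrices, dimensions, and the random input are hypothetical:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product self-attention with num_heads parallel heads.

    X            : (seq_len, d_model) input features (e.g. frame or word vectors)
    Wq/Wk/Wv/Wo  : (d_model, d_model) projection matrices
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then split the model dimension across heads: (heads, seq, d_head)
    Q = (X @ Wq).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    # Scaled dot-product scores per head: (heads, seq, seq)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                   # (heads, seq, d_head)
    # Concatenate heads back to (seq, d_model) and apply the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads)
```

Each head attends to the sequence independently, which is what lets the mechanism capture several different intra-video or intra-text interactions at once.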

Description

technical field [0001] The present invention relates to the field of video technology, and in particular to a cross-modal video retrieval method, system, and storage medium based on a multi-head self-attention mechanism. background technique [0002] With the explosive growth of multimedia data, traditional single-modal retrieval can no longer meet people's retrieval needs in the multimedia field. Users want to use data of one modality as the query object to retrieve semantically similar content of another modality, such as retrieving text with an image, or retrieving images or videos with text; this is cross-modal retrieval. [0003] Cross-modal retrieval must process data of different modalities at the same time. These data have a certain similarity in content, but their underlying features are heterogeneous, so their similarity cannot be computed directly; this is the "semantic gap" problem. The method of mapping data from different modalities into a com...
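To illustrate the common-subspace idea from paragraph [0003]: once text and video features are mapped into the same space, cross-modal retrieval reduces to a nearest-neighbor search under a similarity measure such as cosine similarity. The sketch below uses made-up embeddings and identifiers; it shows the retrieval step only, not how the mapping is learned:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, assumed already mapped into the shared subspace
text_query = [0.9, 0.1, 0.3]
video_embeddings = {
    "video_1": [0.8, 0.2, 0.4],   # semantically close to the query
    "video_2": [0.1, 0.9, 0.0],   # semantically distant
}

# Text-to-video retrieval: rank videos by similarity to the query
best = max(video_embeddings, key=lambda k: cosine(text_query, video_embeddings[k]))
```

Because both modalities live in one space, the same ranking works in the reverse direction (video-to-text) with the roles of query and candidates swapped.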


Application Information

IPC(8): G06F16/732, G06F16/783, G06K9/00, G06N3/04
CPC: G06F16/732, G06F16/783, G06V20/40, G06N3/045
Inventor: 漆舒汉, 王轩, 丁洛, 张加佳, 廖清, 刘洋, 夏文, 蒋琳
Owner HARBIN INST OF TECH SHENZHEN GRADUATE SCHOOL