Video and text cross-modal retrieval method based on relational reasoning network

A cross-modal retrieval technology based on relational reasoning, applied in the field of video and text cross-modal retrieval. It addresses the problems of ignoring the internal information relationships within a single modality, failing to extract temporal-domain information well, and incomplete and insufficient expression of single-modal information, and achieves a good cross-modal retrieval effect.

Active Publication Date: 2021-08-10
CHENGDU KOALA URAN TECH CO LTD

AI Technical Summary

Problems solved by technology

For a video, although existing convolutional neural networks can extract a large amount of spatial information, they do not perform well on information such as spatial transformations, background changes, or temporal actions, and cannot adequately extract information in the time domain.
[0006] Another shortcoming o



Examples


Embodiment 1

[0048] This embodiment proposes a video and text cross-modal retrieval method based on a relational reasoning network, the flow chart of which is shown in Figure 1. The method includes the following steps:

[0049] Step 1. Extract video data features and text data features.

[0050] Step 2. Use a recurrent neural network to obtain the global features of the video and the global features of the text.
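
The patent does not specify here which recurrent architecture is used. As a hedged sketch, the example below uses a bidirectional GRU and mean-pools its hidden states to produce one global vector per sequence; the class name, layer sizes, and pooling choice are illustrative assumptions, and the same encoder can be applied to both the frame-feature sequence and the word-embedding sequence.

```python
import torch
import torch.nn as nn

class GlobalEncoder(nn.Module):
    """Recurrent encoder that turns a feature sequence into one global vector.

    Applicable to either modality: video frame features or text word
    embeddings. The bidirectional GRU and the mean-pooling over time are
    illustrative choices, not mandated by the patent.
    """

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, steps, in_dim) -> hidden states: (batch, steps, 2 * hidden_dim)
        states, _ = self.rnn(seq)
        # Mean-pool over the time dimension to obtain the global feature.
        return states.mean(dim=1)

# Example: 20 ResNet frame features of dim 2048 and 12 GloVe embeddings of dim 300.
video_global = GlobalEncoder(2048, 512)(torch.randn(1, 20, 2048))
text_global = GlobalEncoder(300, 512)(torch.randn(1, 12, 300))
```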

[0051] Step 3. Use the multi-scale relational reasoning network to construct video local relation features and text local relation features.
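
The step only names a multi-scale relational reasoning network. The sketch below assumes a design in the spirit of temporal relation reasoning: for each scale k, a few ordered k-element subsets of the feature sequence are concatenated, passed through a per-scale MLP, averaged, and the per-scale outputs are concatenated into the local relation feature. The sampling count and layer sizes are illustrative, not taken from the patent.

```python
import itertools
import random

import torch
import torch.nn as nn

class MultiScaleRelation(nn.Module):
    """Multi-scale relation reasoning over a feature sequence (sketch).

    For each scale k, a few ordered k-element subsets of the sequence are
    concatenated and passed through a per-scale MLP; the per-scale outputs
    are averaged and then concatenated across scales into one local
    relation feature. Works for frame features as well as word features.
    """

    def __init__(self, in_dim: int, rel_dim: int, scales=(2, 3, 4), samples: int = 3):
        super().__init__()
        self.scales, self.samples = scales, samples
        self.mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(k * in_dim, rel_dim), nn.ReLU(),
                          nn.Linear(rel_dim, rel_dim))
            for k in scales
        ])

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (steps, in_dim) -> (len(scales) * rel_dim,)
        outputs = []
        for k, mlp in zip(self.scales, self.mlps):
            combos = list(itertools.combinations(range(seq.size(0)), k))
            chosen = random.sample(combos, min(self.samples, len(combos)))
            relations = torch.stack([mlp(seq[list(idx)].flatten()) for idx in chosen])
            outputs.append(relations.mean(dim=0))
        return torch.cat(outputs)

# Example: local relation feature for a 20-frame video feature sequence.
local_relation = MultiScaleRelation(2048, 256)(torch.randn(20, 2048))
```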

[0052] Step 4. Fuse the global features and local relation features of each modality's data to obtain video fusion features and text fusion features.
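
The patent states only that global and local relation features are fused. A minimal sketch, assuming concatenation followed by a linear projection (an element-wise sum or a gating mechanism would fit the same step equally well):

```python
import torch
import torch.nn as nn

class Fusion(nn.Module):
    """Fuse one modality's global feature with its local relation feature."""

    def __init__(self, global_dim: int, local_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(global_dim + local_dim, out_dim)

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(torch.cat([global_feat, local_feat], dim=-1)))

# Example: fuse a 1024-d global feature with a 768-d local relation feature.
video_fused = Fusion(1024, 768, 1024)(torch.randn(1, 1024), torch.randn(1, 768))
```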

[0053] Step 5. Map video fusion features and text fusion features to the common space, and align the distribution of video fusion features and text fusion features in the common space.
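
How the two distributions are aligned in the common space is not detailed at this step. The sketch below assumes linear projections to a shared dimension and a bidirectional triplet ranking loss over cosine similarity, which is one common way to pull matched video and text pairs together and push mismatched pairs apart in the common space; the specific loss is an assumption, not the patent's stated alignment method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Project video and text fusion features into one shared space (sketch)."""

    def __init__(self, video_dim: int, text_dim: int, common_dim: int):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, common_dim)
        self.text_proj = nn.Linear(text_dim, common_dim)

    def forward(self, video_fused: torch.Tensor, text_fused: torch.Tensor):
        v = F.normalize(self.video_proj(video_fused), dim=-1)
        t = F.normalize(self.text_proj(text_fused), dim=-1)
        return v, t

def triplet_ranking_loss(v: torch.Tensor, t: torch.Tensor, margin: float = 0.2):
    """Bidirectional hinge loss; matched pairs lie on the diagonal of `sim`."""
    sim = v @ t.t()                                  # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # similarity of each matched pair
    cost_vt = (margin + sim - pos).clamp(min=0)      # video-to-text direction
    cost_tv = (margin + sim - pos.t()).clamp(min=0)  # text-to-video direction
    mask = torch.eye(sim.size(0), dtype=torch.bool)
    return cost_vt.masked_fill(mask, 0).mean() + cost_tv.masked_fill(mask, 0).mean()
```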

[0054] Step 6. Train the overall network of steps 1-5.
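
A minimal sketch of one training step for the overall network, reusing the hypothetical modules sketched in the previous steps on a fake paired batch; the optimizer, learning rate, batch size, and feature dimensions are all illustrative assumptions rather than values given by the patent.

```python
import torch

# Reuses the illustrative GlobalEncoder, MultiScaleRelation, Fusion,
# CommonSpace, and triplet_ranking_loss sketches defined above.
video_enc, text_enc = GlobalEncoder(2048, 512), GlobalEncoder(300, 512)
video_rel, text_rel = MultiScaleRelation(2048, 256), MultiScaleRelation(300, 256)
video_fuse, text_fuse = Fusion(1024, 768, 1024), Fusion(1024, 768, 1024)
common = CommonSpace(1024, 1024, 256)
modules = [video_enc, text_enc, video_rel, text_rel, video_fuse, text_fuse, common]
optimizer = torch.optim.Adam([p for m in modules for p in m.parameters()], lr=1e-4)

frame_feats = torch.randn(8, 20, 2048)   # 8 videos, 20 frame features each
word_feats = torch.randn(8, 12, 300)     # 8 captions, 12 word embeddings each

# One training step: forward through both branches, compute the ranking loss
# in the common space, and update all parameters jointly.
v_fused = video_fuse(video_enc(frame_feats),
                     torch.stack([video_rel(s) for s in frame_feats]))
t_fused = text_fuse(text_enc(word_feats),
                    torch.stack([text_rel(s) for s in word_feats]))
loss = triplet_ranking_loss(*common(v_fused, t_fused))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```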

[0055] Step 7. Utilize the trained network to perform cross-modal retrieval.
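
Once trained, retrieval reduces to ranking candidates by similarity in the common space. A minimal sketch, assuming cosine similarity over precomputed common-space embeddings (the excerpt does not commit to a specific similarity measure, so this choice is an assumption):

```python
import torch
import torch.nn.functional as F

def retrieve(query_embedding: torch.Tensor, gallery_embeddings: torch.Tensor, top_k: int = 5):
    """Rank gallery items by cosine similarity to the query in the common space.

    `query_embedding` is one common-space vector (e.g. a projected text fusion
    feature); `gallery_embeddings` holds one row per candidate video. The same
    function works in the other direction (video query against a text gallery).
    """
    query = F.normalize(query_embedding, dim=-1)
    gallery = F.normalize(gallery_embeddings, dim=-1)
    scores = gallery @ query                     # cosine similarity per candidate
    top_scores, top_indices = scores.topk(min(top_k, scores.numel()))
    return top_indices.tolist(), top_scores.tolist()

# Example with random stand-in embeddings: rank 100 candidates, keep the top 5.
indices, scores = retrieve(torch.randn(256), torch.randn(100, 256))
```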

Embodiment 2

[0057] See Figure 2 and Figure 3. The video and text cross-modal retrieval method based on the relational reasoning network proposed in this embodiment exploits the dependencies between video frames: the relational reasoning network extracts the dependencies between different frames at multiple time scales and constructs the implicit relationships among multiple frames to obtain local relation features; global features are constructed at the same time, and the multi-scale local relation features and global features are fused into a strongly semantic feature that serves as the fusion feature of the video.

[0058] In addition, according to the dependencies between text words, the relational reasoning network extracts the dependencies between different words at multiple scales and constructs the implicit relationships among multiple words to obtain local relation features; global features are constructed at the same time. An...

Embodiment 3

[0063] See Figure 4. The video and text cross-modal retrieval method based on the relational reasoning network proposed in this embodiment first constructs the model for training, then trains the entire network, and then performs cross-modal retrieval, mainly comprising steps S1 to S6.

[0064] Step S1: Extract multimodal data features.

[0065] Multimodal data includes video, text, etc. These raw data are expressed in forms that humans can understand, but computers cannot process them directly; their features need to be extracted and represented as numbers that computers can process.

[0066] Wherein, step S1 specifically includes the following steps:

[0067] Step S11: For the video, use the convolutional neural network ResNet to perform feature extraction; the video feature sequence is expressed as …, where n is the number of frames in the sequence;

[0068] Step S12: For the text, use GloVe to perform feature extraction; the text feature sequence is expressed as …, where m is...
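
A hedged sketch of steps S11 and S12, assuming a pre-trained torchvision ResNet (the depth is not fixed by the excerpt) with its classification layer removed for per-frame features, and a GloVe table assumed to be pre-loaded as a plain dictionary from word to vector; both loading choices are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Step S11 (sketch): per-frame features from a pre-trained ResNet with its
# classification layer removed. The depth (ResNet-152) is an assumption.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(resnet.children())[:-1])
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def video_features(frames):
    """frames: list of n PIL images sampled from the video -> (n, 2048) tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    with torch.no_grad():
        return backbone(batch).flatten(1)

# Step S12 (sketch): word features looked up from a pre-trained GloVe table,
# assumed to have been loaded beforehand into a dict mapping word -> 300-d tensor.
def text_features(words, glove, dim: int = 300):
    """words: list of m tokens -> (m, dim) tensor; unknown words map to zeros."""
    return torch.stack([glove.get(w, torch.zeros(dim)) for w in words])
```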



Abstract

The invention relates to the field of cross-modal retrieval, and discloses a video and text cross-modal retrieval method based on a relational reasoning network, which comprises the following steps: extracting video data features and text data features; obtaining video global features and text global features by using a recurrent neural network; constructing video local relation features and text local relation features by using a multi-scale relational reasoning network; respectively fusing the global features and the local relation features of the single-modal data to obtain video fusion features and text fusion features; mapping the video fusion features and the text fusion features to a common space, and aligning the video fusion feature distribution and the text fusion feature distribution in the common space; and training the whole network. By attending to global features and local relation features at the same time, the method can focus more effectively on the key information in single-modal data and thereby achieve cross-modal retrieval.

Description

Technical Field

[0001] The invention relates to the field of cross-modal retrieval, in particular to a video and text cross-modal retrieval method based on a relational reasoning network.

Background Technique

[0002] Cross-media retrieval means that users can retrieve semantically related data in all media types by inputting query data of any media type. In the present invention, it specifically refers to the mutual retrieval of video and text. In general, a data set provides videos and corresponding video description texts. The task of cross-media retrieval is: for any video, retrieve the video description text most relevant to its content, or for any video description text, retrieve the video most relevant to its description. With the increasing amount of multimedia data such as text, images, and videos on the Internet, retrieval across different modalities has become a new trend in information retrieval. The difficulty of this problem lies in how...


Application Information

IPC (8): G06F16/332; G06F16/33; G06F16/532; G06F16/583; G06K9/00; G06N3/04
CPC: G06F16/332; G06F16/334; G06F16/532; G06F16/583; G06V20/40; G06N3/044; Y02D10/00
Inventor: 沈复民 (Shen Fumin), 徐行 (Xu Xing), 王妮 (Wang Ni), 邵杰 (Shao Jie), 申恒涛 (Shen Hengtao)
Owner CHENGDU KOALA URAN TECH CO LTD