Cross-modal image text retrieval method of hybrid fusion model

A hybrid fusion model for cross-modal image-text retrieval, applied in the field of cross-modal retrieval. It addresses problems such as coarse global-only fusion, insufficient cross-modal learning, and mediocre performance, so as to improve retrieval accuracy, promote cross-modal information exchange, and enhance representational ability.

Active Publication Date: 2021-05-11
UNIV OF ELECTRONICS SCI & TECH OF CHINA


Problems solved by technology

[0006] At present, mainstream cross-modal retrieval methods adopt a late fusion strategy: image and text data are embedded and encoded separately by relatively complex network designs. Such methods often suffer from insufficient cross-modal learning while also incurring a high computational cost. Existing early fusion methods, on the other hand, tend to be coarse: they can only fuse image and text data at the global level, and their performance is mediocre.
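
To make the contrast concrete, the following minimal sketch (PyTorch; the encoders, dimensions, and scoring choices are illustrative placeholders, not the patent's networks) shows late fusion scoring separately encoded features against a coarse global early fusion that encodes one concatenated vector:

```python
import torch
import torch.nn as nn

# Placeholder encoders; feature dimensions (2048-d image, 1024-d text)
# are assumptions for illustration only.
img_enc, txt_enc = nn.Linear(2048, 512), nn.Linear(1024, 512)
joint_enc = nn.Linear(2048 + 1024, 512)
score = nn.Linear(512, 1)

img, txt = torch.randn(1, 2048), torch.randn(1, 1024)

# Late fusion: embed each modality separately and compare only at the
# end -- little cross-modal interaction happens during encoding.
s_late = torch.cosine_similarity(img_enc(img), txt_enc(txt), dim=-1)

# Global early fusion: concatenate raw global features, then encode the
# joint vector -- cross-modal interaction occurs, but only coarsely.
s_early = score(joint_enc(torch.cat([img, txt], dim=-1)))
```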


Examples

Embodiment

[0054] Figure 1 is a flowchart of the cross-modal image-text retrieval method of the hybrid fusion model of the present invention.

[0055] In this embodiment, as shown in Figure 1, the cross-modal image-text retrieval method of the hybrid fusion model of the present invention comprises the following steps:

[0056] S1. Extracting cross-modal data features;

[0057] S1.1. Download cross-modal image-text pair data including N groups of images and their corresponding descriptive texts;

[0058] S1.2. For each group of cross-modal image-text pair data, use the region-based convolutional neural network Faster R-CNN to extract the image region feature set V = {v_i}, where v_i represents the i-th image region feature, i = 1, 2, ..., k, and k is the number of elements in the image region feature set (k = 36 in this embodiment); use the gated recurrent unit GRU to extract the text word feature set T = {t_j}, where t_j represents the j-th text word feature, j = 1, 2, ..., l, where l is...
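
As a hedged illustration of step S1.2: region features are assumed here to be pre-extracted by a bottom-up-attention-style Faster R-CNN (the patent's detector configuration is not reproduced), and the GRU word-feature extractor below is a sketch whose vocabulary size and dimensions are assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Maps a token-id sequence to word features T = {t_j}, j = 1..l."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True,
                          bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (batch, l) -> word features: (batch, l, hidden_dim)
        x = self.embed(token_ids)
        out, _ = self.gru(x)              # (batch, l, 2 * hidden_dim)
        fwd, bwd = out.chunk(2, dim=-1)   # average the two directions
        return (fwd + bwd) / 2

# Region features V: k = 36 regions per image, 2048-d each (a typical
# bottom-up-attention setting; the exact dimension is an assumption).
V = torch.randn(1, 36, 2048)
T = TextEncoder()(torch.randint(0, 10000, (1, 12)))  # l = 12 words
print(V.shape, T.shape)
```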



Abstract

The invention discloses a cross-modal image-text retrieval method based on a hybrid fusion model. In the early fusion structure, local visual region features are first combined with the original global features of the text to obtain a unified cross-modal fusion representation; the fused features are then taken as input to a subsequent embedding network that enhances the interaction between local visual features and language information. Meanwhile, on top of the traditional late fusion structure, the original image and sentence features are fed into a visual encoder and a text encoder respectively for intra-modal feature enhancement, enriching the semantic information of each modality. Finally, the overall network similarity is a weighted linear combination of the early fusion similarity and the late fusion similarity, so that early fusion at the cross-modal learning level and late fusion at the intra-modal learning level complement each other, completing the latent alignment between the image and text modalities.
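
A minimal sketch of the final scoring step described above: the abstract states the overall similarity is a weighted linear combination of the early and late fusion similarities, but the weight alpha, the cosine similarity for the late branch, and the small scoring head for the early branch are assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical scoring head for the early fusion branch: maps the unified
# cross-modal fusion representation to a scalar similarity.
score_head = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

def hybrid_similarity(fused_repr, img_emb, txt_emb, alpha=0.5):
    # Early fusion similarity: scored from the joint fusion representation.
    s_early = score_head(fused_repr).squeeze(-1)
    # Late fusion similarity: compares the intra-modal enhanced image and
    # sentence embeddings produced by the two separate encoders.
    s_late = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    # Overall network similarity: weighted linear combination of the two.
    return alpha * s_early + (1 - alpha) * s_late

# Toy usage with random features of assumed dimensionality.
s = hybrid_similarity(torch.randn(4, 512), torch.randn(4, 512),
                      torch.randn(4, 512))
print(s.shape)  # torch.Size([4])
```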

Description

Technical field

[0001] The invention belongs to the technical field of cross-modal retrieval, and more specifically relates to a cross-modal image-text retrieval method based on a hybrid fusion model.

Background technique

[0002] Cross-modal retrieval means that a user retrieves semantically relevant data across all modalities by inputting query data in any one modality. With the growing volume of multi-modal data such as text, images, and videos on the mobile Internet, retrieval across different modalities has become a new trend in information retrieval. Realizing fast and accurate image-text retrieval therefore has great application value and economic benefit.

[0003] Since computer vision features from image data and language features from text data naturally exhibit a "heterogeneous gap" in data distribution and underlying feature representation, how to measure the high-level semantic correlation between images and text remains a challenge. The solution of current methods is usually to fu...

Claims


Application Information

Patent Type & Authority Applications(China)
IPC IPC(8): G06F16/583G06F40/194G06F40/30G06K9/62G06N3/04G06N3/08G06N5/04
CPCG06F16/5846G06F40/194G06F40/30G06N3/04G06N3/08G06N5/04G06F18/22G06F18/214G06F18/253
Inventor 徐行王依凡杨阳邵杰申恒涛
Owner UNIV OF ELECTRONICS SCI & TECH OF CHINA