Multi-modal pre-training method based on image-text linear combination

A linear combination and pre-training technology, applied in neural learning methods, character and pattern recognition, biological neural network models, etc., to improve processing speed, performance, and accuracy

Pending Publication Date: 2022-04-08
HUNAN UNIV OF TECH
Cites: 0 | Cited by: 14
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0014] In view of the above technical problems, the present invention proposes a multi-modal pre-training method based on the linear combination of images and texts. It addresses the bottleneck of model running time and the post-fine-tuning performance of the improved pre-training model, and therefore has important scientific significance and practical application value.



Examples

Experimental program
Comparison scheme
Effect test

Embodiment 1

[0091] For the image-to-text retrieval scenario, the hyperparameters (number of images a, number of image-related descriptions, i.e. text annotations/sentences, b, image weight ξ, and text weight μ) are set to a=1; b=3; ξ=μ=1; this strategy yields a better recall rate. Figure 5 shows a diagram of the entire pre-trained model structure. A detailed description with an example follows:

[0092] S1-S2.1: An image-text pair, such as the one shown in Figure 6 and Figure 7, undergoes the feature extraction operation and is then spliced to obtain the feature sequence Y:

[0093] Y = [V_type + v; T_type + t] = [0.87, 0.15, ..., 0.857]
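To make the splicing step concrete, the following is a minimal sketch (not the patent's implementation) of building the spliced sequence from hypothetical image and text features. The names v, t, v_type, t_type mirror the formula above; applying the image weight ξ and text weight μ at this point is an assumption about how the linear combination enters the input.

```python
import numpy as np

def splice_features(v, t, v_type, t_type, xi=1.0, mu=1.0):
    """Build the spliced feature sequence Y = [V_type + v; T_type + t].

    v, t         : image / text feature matrices of shape (num_tokens, dim)
    v_type,
    t_type       : type (segment) embeddings, broadcast over the tokens
    xi, mu       : image and text weights; applying them here is an
                   assumption, not the patent's exact formulation
    """
    image_part = xi * (v_type + v)   # type embedding added to each visual token
    text_part = mu * (t_type + t)    # type embedding added to each text token
    return np.concatenate([image_part, text_part], axis=0)  # splice along the sequence axis


# Toy example with random features (dim = 8, 3 visual tokens, 5 text tokens).
rng = np.random.default_rng(0)
v, t = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
v_type, t_type = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
Y = splice_features(v, t, v_type, t_type, xi=1.0, mu=1.0)
print(Y.shape)  # (8, 8): 3 visual + 5 text tokens
```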

[0094] S2.2: The feature sequence Y is input into the Transformer Encoder interaction layer, where the attention value is calculated through the attention mechanism; the final feature sequence Y_p is then obtained through the nonlinear activation function tanh():

[0095] Attention(Q, K, V) = softmax(QK^T / √d_k) V

[0096] Y_p = tanh(Attention(Q, K, V)) = [0.108, 0.732, -0.852, ...]
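A minimal sketch of this interaction step, assuming a single attention head with hypothetical projection matrices W_q, W_k, W_v; the patent's Transformer Encoder is deeper and multi-headed, so this only illustrates the Attention followed by tanh() computation that produces Y_p.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

def interaction_layer(Y, W_q, W_k, W_v):
    """Single-head sketch of the interaction layer: Y_p = tanh(Attention(Q, K, V))."""
    Q, K, V = Y @ W_q, Y @ W_k, Y @ W_v
    return np.tanh(attention(Q, K, V))

# Toy example: 8 spliced tokens of dimension 8, random projections.
rng = np.random.default_rng(1)
Y = rng.normal(size=(8, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Y_p = interaction_layer(Y, W_q, W_k, W_v)
print(Y_p.shape, bool(Y_p.min() >= -1 and Y_p.max() <= 1))  # (8, 8) True
```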

Embodiment 2

[0112] For the text-to-image retrieval scenario, the input strategy a=2; b=3; ξ=μ=1 yields a better recall rate. A detailed description with an example follows:

[0113] S1-S2.1: An image-text pair, such as the one shown in Figure 6 and Figure 7, undergoes the feature extraction operation and is then spliced to obtain the feature sequence Y:

[0114] Y = [V_type + v; T_type + t] = [0.27, 0.59, ..., 0.437]

[0115] S2.2: The feature sequence Y is input into the Transformer Encoder interaction layer, where the attention value is calculated through the attention mechanism; the final feature sequence Y_p is then obtained through the nonlinear activation function tanh():

[0116] Attention(Q, K, V) = softmax(QK^T / √d_k) V

[0117] Y_p = tanh(Attention(Q, K, V)) = [0.271, -0.842, -0.312, ..., 0.662].

[0118] S3: After the interacted feature sequence of the two modalities is obtained, different downstream tasks can be attached. In Embodiment 2, as mentioned above, for the application scenario of text sea...
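Since the downstream task in this embodiment is text-to-image retrieval evaluated by recall, the sketch below shows how Recall@K could be computed from a hypothetical text-to-image similarity matrix (for example, matching scores produced by a head on top of Y_p); it illustrates the metric, not the patent's evaluation code.

```python
import numpy as np

def recall_at_k(sim, k=1):
    """Recall@K for text-to-image retrieval.

    sim[i, j] is the matching score between text query i and image j;
    the ground-truth image for query i is assumed to be image i.
    """
    ranking = np.argsort(-sim, axis=1)          # images sorted by descending score
    hits = (ranking[:, :k] == np.arange(len(sim))[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 4 text queries vs. 4 images with hypothetical scores.
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.8, 0.1, 0.4],
                [0.3, 0.8, 0.7, 0.1],
                [0.1, 0.2, 0.3, 0.9]])
print(recall_at_k(sim, k=1))  # 0.75: query 2's highest score goes to image 1, not image 2
```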

Embodiment 3

[0132] In the image-text multimodal classification task, for the VQA application scenario, the input strategy a=1; b=2; ξ=μ=1 yields better accuracy. A detailed description with an example follows:

[0133] S1-S2.1: An image-text pair, such as the one shown in Figure 6 and Figure 7, undergoes the feature extraction operation and is then spliced to obtain the feature sequence Y:

[0134] Y = [V_type + v; T_type + t] = [0.821, -0.159, ..., -0.825]

[0135] S2.2: The feature sequence Y is input into the Transformer Encoder interaction layer, where the attention value is calculated through the attention mechanism; the final feature sequence Y_p is then obtained through the nonlinear activation function tanh():

[0136] Attention(Q, K, V) = softmax(QK^T / √d_k) V

[0137] Y_p = tanh(Attention(Q, K, V)) = [0.172, -0.451, -0.312, ..., -0.662].

[0138] S3: In Embodiment 3, as mentioned above, for the VQA application scenario, the strategy a=1; b=2; ξ=μ=1 yields a better accuracy rate. V...
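For the VQA classification task named here, a common head design (assumed for illustration, not taken from the patent text) is to pool the interacted sequence Y_p and apply a linear classifier over a fixed answer vocabulary; the weights W, b and the candidate answers below are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def vqa_head(Y_p, W, b, answers):
    """Sketch of a VQA classification head on top of the interacted sequence Y_p.

    Mean-pools the token features, applies a linear layer, and returns the
    most probable answer from a hypothetical answer vocabulary.
    """
    pooled = Y_p.mean(axis=0)          # (dim,) summary of the fused sequence
    probs = softmax(pooled @ W + b)    # distribution over candidate answers
    return answers[int(np.argmax(probs))], probs

# Toy example: 8 fused tokens of dimension 8, 3 candidate answers.
rng = np.random.default_rng(2)
Y_p = np.tanh(rng.normal(size=(8, 8)))
W, b = rng.normal(size=(8, 3)), np.zeros(3)
answer, probs = vqa_head(Y_p, W, b, ["yes", "no", "two"])
print(answer, probs.round(3))
```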



Abstract

A multi-modal pre-training method based on image-text linear combination belongs to the technical field of image-text multi-modal retrieval and comprises the following steps: S1, feature extraction is performed on the text and the image respectively; S2, a relation between the two modalities, text and image, is established in the interaction layer; S2.1, the feature vectors of the visual modality and the language modality obtained in step S1 are jointly input into the interaction layer of the multi-modal pre-training model; S2.2, the attention mechanism in the Transformer is used to connect the two modalities with each other; S3, with image-text matching or the masked language model as the pre-training target, the model is trained until it is ready for use; and S4, with a specific application scenario and downstream task as the training target, the pre-training model is fine-tuned so that its performance is optimal in that scenario. The training method provided by the invention solves the bottleneck problem of model running time and the performance problem of the improved pre-training model after fine-tuning, and has important scientific significance and practical application value.
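Step S3 of the abstract names image-text matching and the masked language model as pre-training targets. The sketch below shows one common way such objectives are combined into a single pre-training loss; the specific loss forms and the weighting factors lambda_itm and lambda_mlm are assumptions, not the patent's definition.

```python
import numpy as np

def binary_cross_entropy(p_match, is_match):
    """Image-text matching (ITM) loss: is this image-text pair a true pair?"""
    eps = 1e-9
    return -(is_match * np.log(p_match + eps) + (1 - is_match) * np.log(1 - p_match + eps))

def masked_lm_loss(token_probs, target_ids, masked_positions):
    """Masked language model (MLM) loss over the masked text positions only."""
    eps = 1e-9
    losses = [-np.log(token_probs[pos, target_ids[pos]] + eps) for pos in masked_positions]
    return float(np.mean(losses)) if losses else 0.0

def pretraining_loss(p_match, is_match, token_probs, target_ids, masked_positions,
                     lambda_itm=1.0, lambda_mlm=1.0):
    """Weighted sum of the two pre-training targets (the weighting is an assumption)."""
    return (lambda_itm * binary_cross_entropy(p_match, is_match)
            + lambda_mlm * masked_lm_loss(token_probs, target_ids, masked_positions))

# Toy example: a matched pair, 5 text tokens over a 10-word vocabulary, 2 masked.
rng = np.random.default_rng(3)
token_probs = rng.dirichlet(np.ones(10), size=5)   # each row sums to 1
target_ids = rng.integers(0, 10, size=5)
print(pretraining_loss(0.8, 1, token_probs, target_ids, masked_positions=[1, 3]))
```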

Description

Technical field

[0001] The invention belongs to the technical field of image-text multimodal retrieval, and more specifically relates to a multi-modal pre-training method based on the linear combination of image and text.

Background technique

[0002] Modality is the way in which things are experienced or occur. We live in a world composed of many kinds of modal information, including visual, auditory, textual, and olfactory information. When a research problem or data set contains multiple kinds of such modal information, we call it a multimodal problem. Studying multimodal problems is key to advancing artificial intelligence toward a better understanding and recognition of the world around us.

[0003] Today's more common applications include media description, event recognition, multimedia retrieval, visual reasoning, visual question answering, and more. At present, many visual tasks are fine-tuned on a fully pre-trained convolutional model. In addition, ...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06K9/62, G06N3/04, G06N3/08, G06F40/284, G06V10/46, G06V10/764, G06V10/774
Inventor: 袁鑫攀, 张知奇, 陈博, 王克, 李长云
Owner: HUNAN UNIV OF TECH