Multimodal image processing method and system based on Transformer network and hypersphere space learning

A multimodal image processing technology in the field of deep learning. It addresses problems such as unreasonable application settings and the limited performance of basic network structures, improves the ability to model and align multimodal distributions, achieves zero-shot cross-modal retrieval, and eliminates the problem of modality differences.

Active Publication Date: 2022-03-25
UNIV OF ELECTRONIC SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0006] In view of this, the present invention provides a multimodal image processing method and processing system based on a Transformer network and hypersphere space learning, which solves the problems of unreasonable application settings and the limited performance of basic network structures in existing multimodal image processing methods.



Examples

Experimental program
Comparison scheme
Effect test

Embodiment 1

[0025] As shown in Figure 1, the present invention provides a multimodal image processing method based on a Transformer network and hypersphere space learning, comprising steps S1 to S5.

[0026] S1: Obtain a pre-trained Transformer network model, and fine-tune it in a self-supervised manner on the image data of each modality to obtain the teacher model.
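The teacher model obtained in S1 later supplies soft targets for distillation (the "teacher distillation vectors" of the abstract). A minimal pure-Python sketch of that distillation signal is shown below; the function names and the temperature value are illustrative assumptions, not taken from the patent:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions:
    zero when the student matches the teacher, positive otherwise."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Matching logits give zero loss; mismatched logits give a positive loss.
same = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1])
diff = distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0])
```

The temperature softens both distributions so the student also learns from the teacher's relative rankings of the wrong classes, not only from the top prediction.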

[0027] The Transformer network was first proposed in the field of natural language processing, taking serialized text data as input. More recently, the Transformer structure has been adapted to handle image data and performs well in the field of computer vision. As shown in Figure 3 (ignoring the fusion token and distillation token), the Transformer network consists of L layers of alternating multi-head self-attention modules and feed-forward network modules, where each module contains pre-layer normalization and a residual connection. It cuts each image into a series ...
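The serialization step the paragraph describes, cutting an image into a sequence of patches before linear embedding, can be sketched in a few lines of pure Python; the function name and the nested-list image representation are illustrative assumptions:

```python
def split_into_patches(image, patch_size):
    """Cut an H x W image (nested lists of pixel values) into a
    row-major sequence of patch_size x patch_size patches — the kind
    of serialization a vision Transformer applies before embedding."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for top in range(0, h, patch_size):
        for left in range(0, w, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

# A 4x4 image with 2x2 patches yields a sequence of 4 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
```

Each patch is then flattened and linearly projected to a token vector, after which the L attention/feed-forward layers operate on the token sequence exactly as they would on text.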

Embodiment 2

[0062] In this embodiment, experimental verification is carried out on the basis of Embodiment 1. Three mainstream datasets in the field of zero-shot cross-modal retrieval are used for training and testing: Sketchy, TU-Berlin, and QuickDraw. All three contain data and labels for the photo and sketch modalities for the zero-shot photo-sketch retrieval task. Specifically, Sketchy initially comprised 75,471 sketches and 12,500 photos in 125 categories, with a pairing relationship between sketches and photos; its photo collection was later expanded to 73,002. TU-Berlin consists of 20,000 sketches and 204,489 photos in 250 categories, so the numbers of sketches and photos are severely unbalanced, and its sketches are highly abstract. QuickDraw, the largest of the three, consists of 330,000 sketches and 204,000 photos in 110 categories, and its sketches are the most abstract.

[0063] For Ske...



Abstract

The invention discloses a multimodal image processing method and system based on a Transformer network and hypersphere space learning. The method comprises the steps of: obtaining a pre-trained Transformer network model and deriving a teacher model from it; constructing a multi-branch model composed of the teacher model and a multimodal fusion model; extracting teacher distillation vectors and student distillation vectors, as well as the features and classification probabilities of each modality's images in a unit hypersphere space; calculating the distillation loss, inter-modal center alignment loss, intra-modal uniformity loss, and classification loss of each modality, and updating the multimodal fusion model according to these losses; and using the updated multimodal fusion model to generate a zero-shot cross-modal retrieval result from the image of the modality to be retrieved and the image of the modality to be queried. The method effectively improves the multimodal fusion model's ability to model and align multimodal distributions and eliminates the problem of modality differences between modalities, thereby realizing zero-shot cross-modal retrieval.
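Of the losses listed in the abstract, the inter-modal center alignment loss on the unit hypersphere admits a compact sketch: normalize each modality's features onto the sphere, compute per-class centers, and penalize the angular gap between the two modalities' centers. The code below is a minimal pure-Python illustration under those assumptions; the function names and the exact loss form (1 minus cosine similarity) are illustrative, not taken from the patent text:

```python
import math

def normalize(v):
    """Project a feature vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def class_center(features):
    """Mean of unit-norm features, re-normalized back onto the sphere."""
    dim = len(features[0])
    mean = [sum(f[i] for f in features) / len(features) for i in range(dim)]
    return normalize(mean)

def center_alignment_loss(photo_feats, sketch_feats):
    """1 - cosine similarity between the two modalities' class centers;
    zero when the centers coincide on the hypersphere."""
    cp = class_center([normalize(f) for f in photo_feats])
    cs = class_center([normalize(f) for f in sketch_feats])
    return 1.0 - sum(a * b for a, b in zip(cp, cs))

# Features pointing the same way align perfectly regardless of magnitude;
# orthogonal features incur the maximum-gap penalty of 1.0 here.
aligned = center_alignment_loss([[1.0, 0.0]], [[2.0, 0.0]])
misaligned = center_alignment_loss([[1.0, 0.0]], [[0.0, 3.0]])
```

Working on the unit sphere makes the loss depend only on feature direction, which is what lets a single loss pull photo and sketch distributions of the same class together while an intra-modal uniformity term keeps features from collapsing.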

Description

Technical Field [0001] The invention relates to the field of deep learning, and in particular to a multimodal image processing method and system based on a Transformer network and hypersphere space learning. Background [0002] With the rapid development of science and technology, image data are becoming increasingly accessible. These image data come from various sources, perspectives, and styles, forming multimodal image datasets. For example, sketches and photos are two modalities with different styles: sketches are highly abstract and depict the structural details of objects, while photos carry rich visual features and complex background information. Data processing and retrieval of multimodal images has become a research hotspot in the field of deep learning. [0003] However, most existing multimodal image processing methods assume that the categories of the image to be queried and the image of the queried modality in actual application a...

Claims


Application Information

IPC(8): G06V10/80, G06V10/778, G06V10/764, G06K9/62
CPC: G06F18/217, G06F18/25, G06F18/2415
Inventor: 徐行, 田加林, 沈复民, 申恒涛
Owner UNIV OF ELECTRONIC SCI & TECH OF CHINA