A cross-modal hashing retrieval method

By using a cross-modal feature fusion graph attention classification learning module and a deep feature extraction network, the problems of semantic gap and modal correlation loss in cross-modal hash retrieval are solved, improving the feature representation capabilities of image and text modalities and enhancing the accuracy and efficiency of retrieval.

CN118643198B · Active · Publication Date: 2026-06-26 · XINJIANG UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XINJIANG UNIVERSITY
Filing Date
2023-11-29
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing cross-modal hashing retrieval methods suffer from semantic gaps, modal correlation loss, and limited feature representation capabilities. In particular, the semantic differences and insufficient feature fusion between image and text modalities lead to poor retrieval performance.

Method used

A cross-modal feature fusion graph attention classification learning module is adopted, and the image semantic features are extracted by combining the SwinT Small model. A deep feature extraction module and an automatic encoder-decoder are designed to learn text features. High-quality hash codes are generated through graph attention feature fusion network and multi-scale label feature fusion network.

Benefits of technology

It effectively reduces the semantic gap, enhances modal feature fusion, improves the performance and accuracy of cross-modal hash retrieval, compensates for the lack of rich text features, and improves retrieval efficiency.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a cross-modal hash retrieval method and relates to the technical field of cross-modal learning. The method comprises the following steps: 1) a cross-modal feature learning network comprising an image network, a text network and a label network maps each image-text-label triple to a high-dimensional Hamming space through a feature fusion graph attention classification learning module; 2) the SwinT-S model (SwinT Small) is used to extract semantic features of images; 3) a graph attention feature fusion module performs deep fusion and alignment of the obtained image semantic features and text semantic features; 4) a deep text feature extraction network optimizes the generation of text hash codes to produce high-quality text hash codes; and 5) a linear layer (Linear) and a Tanh function map text features to the hash code length required by the application. The application effectively reduces the semantic gap, fuses features of different modalities and improves the performance of cross-modal hash retrieval.

Description

Technical Field

[0001] This invention relates mainly to the field of cross-modal learning technology, and specifically to a cross-modal hash retrieval method.

Background Technology

[0002] Cross-modal hashing learning is a data mining and retrieval technique that has emerged in recent years. It utilizes neural networks to extract effective representations of different modalities and establishes semantic associations between these modalities at a high level, offering advantages such as small storage space and fast retrieval speed. Currently, cross-modal hashing learning aims to maintain the consistency and similarity of sample feature representations. It is an interdisciplinary field of computer vision and natural language processing, and has significant application value in areas such as speech-face matching and retrieval, sign language translation, material recognition and classification, e-commerce, healthcare, and smart cities.

[0003] Existing technologies have the following problems in cross-modal hash retrieval:

[0004] The semantic gap problem: There are natural semantic differences between images and text. When text describes images, there is a lack of information, which causes models to tend to learn the features of the image modality while ignoring the text modality, thus exacerbating the semantic gap problem.

[0005] Modal correlation loss: Existing methods fail to fully consider the correlation between different modalities and lack an effective mechanism to fuse feature representations of different modalities, leading to semantic discrepancies.

[0006] Limited feature representation capability: Some methods have limited ability to represent text features in the feature extraction stage. They use simple fully connected layers to extract text information, which cannot capture deep semantic features.

Summary of the Invention

[0007] To address the shortcomings of existing technologies, this invention provides a cross-modal hash retrieval method that can effectively reduce the semantic gap problem, integrate features from different modalities, and improve the performance of cross-modal hash retrieval.

[0008] To achieve the above objectives, the present invention employs the following technical solution:

[0009] A cross-modal hash retrieval method includes the following steps:

[0010] 1) First, a cross-modal feature learning network comprising an image network, a text network, and a label network is provided. The cross-modal feature fusion graph attention classification learning module maps the information of each image-text-label triple to a high-dimensional Hamming space.

[0011] 2) The SwinT-S model (SwinT Small) is used to extract semantic features of the image;

[0012] Two deep feature extraction modules and an autoencoder are designed as a text network to learn text features. Each text is converted into a bag-of-words (BoW) vector using the bag-of-words method.

[0013] Multi-scale label feature information is extracted using an attention network that integrates multi-scale features;

[0014] 3) The graph attention feature fusion module is used to deeply fuse and align the acquired image semantic features and text semantic features;

[0015] 4) A deep text feature extraction network is used to optimize the generation of text hash codes, resulting in higher-quality text hash codes;

[0016] 5) A linear layer and the Tanh function are used to map text features to the hash code length required by this invention.

[0017] In step 1, for the cross-modal feature fusion graph attention classification learning module, assume that image features and text features have been extracted by the image feature extractor and the text feature extractor, respectively, and that label features have been extracted by the label network. The adjacency matrix is defined as $A \in \mathbb{R}^{C \times C}$, where C is the dimension of the label, i.e., the number of categories. With a symbol defined for each layer of the GAT network, the feature fusion method of the present invention can be expressed as:

[0018]

[0019] where the result is the feature tensor formed by concatenating the features of the two modalities;

[0020]

[0021] where the result is the feature vector extracted by the first deep feature fusion layer of this invention;

[0022] This invention uses GRU for deeper cross-modal fusion, and the specific operation is as follows:

[0023]

[0024] where the result is the feature extracted using the GRU;

[0025] The present invention then applies different weighting operations to the features obtained at different levels, as shown below:

[0026]

[0027]

[0028] where the two results denote the attention generated for the first-level fused feature and the attention generated after the GRU, respectively;

[0029] The present invention then multiplies the obtained attentions with the corresponding features, as follows:

[0030]

[0031] Next, the present invention maps the obtained attention back to the original feature space dimension through a linear layer and forms a residual connection with the fused image-text features to enhance the robustness of the network module. The specific operation is as follows:

[0032]

[0033] where the result denotes the fused attention features obtained after the residual connection, PReLU is the activation function used, RMSNorm denotes root mean square layer normalization, and the linear layer maps the attention features back to the original feature dimension;

[0034]

[0035] where the terms denote, respectively, global max pooling with a pooling window of (1, 1), the Local Kernel Alignment (LKA) module, and a one-dimensional convolution with a kernel size of 3 x 3, a stride of 1, and padding of 1.

[0036] The algorithm for LKA is as follows:

[0037] Assume an input tensor of a given shape.

[0038] ① Create a copy of the input tensor;

[0039] ② Pass the input through a convolutional layer and record the result;

[0040] ③ Apply the activation function to the result:

[0041] ④ Create a variable that holds a copy of the activated result;

[0042] ⑤ Compute the attention weights through three different convolutional layers:

[0043]

[0044] These convolutional layers are used to compute the attention weights;

[0045] ⑥ Multiply the attention weights by the copied variable to obtain the weighted feature vector:

[0046]

[0047] ⑦ Pass the weighted feature vector through a convolutional layer to obtain the final output:

[0048]

[0049] ⑧ Add the final output to the copy of the input tensor to form the residual connection:

[0050]

[0051] These formulas represent the LKA algorithm flow, in which the multiplication is element-wise and the activation function is ReLU;

[0052] In the GAT module, this invention fuses the obtained fused features with the features obtained from GAT to obtain pseudo-labels. The specific process is as follows:

[0053]

[0054]

[0055] In these formulas, the node feature vectors are the outputs of the previous layer; the weight matrix is shared across nodes; 'a' denotes the attention weight vector, which is a parameter to be learned; an activation function is applied to the attention scores; || denotes vector concatenation; and the number of nodes is fixed by the graph. In GAT, each node has an attention weight with respect to every other node, which regulates how information propagates. These attention weights are obtained from a linear combination of the node features and are then normalized so that they sum to 1. Finally, the new feature of a node is the weighted sum of the features of all its neighboring nodes, with the weights given by the attention weights. The operation rules of GAT can be expressed as follows:

[0056]

[0057] The output of the last GAT layer is defined as a function that accepts two inputs: the weighted fusion feature obtained by the GRU module and the cosine adjacency matrix. The process is as follows:

[0058]

[0059] M denotes the pseudo-label obtained through the GAT module and the fusion module, and the fusion operator denotes the attention fusion process. From M, this invention calculates the similarity probability score between the final pseudo-labels and the real labels. The label loss of this invention is calculated as follows:

[0060]

[0061] where $\sigma$ denotes the sigmoid function, whose expression is:

[0062] $\sigma(x) = \frac{1}{1 + e^{-x}}$.

[0063] In step 2, multi-level modules are used to extract multi-scale features of the labels, and attention is used to fuse the extracted multi-scale features of the labels. The features are then filtered to generate corresponding hash codes. The specific algorithm is as follows:

[0064]

[0065]

[0066]

[0067] where the four terms denote convolutional layers with non-uniform strides of 10, 5, 3, and 1, respectively; each is computed as follows:

[0068]

[0069] where the terms denote, respectively, the stride of the one-dimensional convolution used, the samples with different labels, and global max pooling with a pooling window size of (1, 1).

[0070]

[0071]

[0072] where the result denotes the feature information extracted through multi-scale attention. The structure of the autoencoder / decoder is as follows:

[0073]

[0074] where LN denotes LayerNorm, the convolution used is one-dimensional, and u denotes the dimensions used, which are 2048 and 8192, respectively.

[0075] A linear layer followed by an activation function maps the obtained multi-scale label fusion features to the hash code length required by this invention:

[0076]

[0077] where the result denotes the text hash code, k denotes the length of the hash code, and the mapping denotes the hash layer.

[0078] In step 2, a deep text feature extraction network is used to convert text tags into bag-of-words (BoW) vectors. The specific algorithm is as follows:

[0079]

[0080]

[0081] where the convolution is one-dimensional with a kernel size of 3 x 3, a stride of 1, and padding of 1; its input dimension is the dimension of the text bag-of-words vector, and its output dimension is 2048. The pooling is global average pooling with a window size of (1, 1). The first mapping of the autoencoder / decoder has input and output dimensions of 2048 and 8192, respectively; the second, conversely, has input and output dimensions of 8192 and 2048.

[0082] In step 3, the features obtained from the autoencoder / decoder are fused using deep modules. The specific process is as follows:

[0083]

[0084] where the result denotes the text features and k denotes the different feature dimensions. The convolution is one-dimensional with a kernel size of 3 x 3, a stride of 1, and padding of 1; its input dimension is 2048, and its output dimension is also 2048.

[0085] In step 4, the linear layer and the Tanh function are applied as follows:

[0086]

[0087] where the result denotes the text hash code, k denotes the length of the hash code, and the mapping denotes the hash layer.

[0088] Compared with the prior art, the beneficial effects of the present invention are:

[0089] 1. This invention proposes a cross-modal hash retrieval framework for heterogeneous network tag fusion, which is used to solve the semantic divergence problem in the network and to a certain extent compensates for the semantic loss between modalities.

[0090] 2. Based on cross-modal feature fusion, this invention proposes a graph attention module for modal feature fusion, which can better integrate feature representations from different modalities to enhance network performance.

[0091] 3. Label networks can enhance the semantic features of text. Therefore, this invention also treats label information as a modality and designs a multi-scale label feature fusion attention network, which extracts label features more effectively. Unlike the conventional approach of extracting text information with fully connected layers, the text network uses a deep feature extraction module together with an autoencoder / decoder to learn text features. A deep feature extraction module is applied again before the hash function to better fuse deep text features, thereby improving the utilization of text features.

Description of the Drawings

[0092] Figure 1 is a schematic diagram of the overall framework structure of the present invention;

[0093] Figure 2 is a schematic diagram of the modal feature fusion graph attention module structure of the present invention;

[0094] Figure 3 is a schematic diagram of the multi-scale label feature fusion attention network structure of the present invention;

[0095] Figure 4 is a schematic diagram of the deep feature text network structure of the present invention.

Detailed Description

[0096] The present invention will be further described in conjunction with the accompanying drawings and specific embodiments. It should be understood that these embodiments are for illustrative purposes only and are not intended to limit the scope of the invention. Furthermore, it should be understood that after reading the teachings of this invention, those skilled in the art can make various alterations or modifications to the invention, and these equivalent forms also fall within the scope defined in this application.

[0097] Like most other methods, this invention uses image-text data pairs for cross-modal hash retrieval. Assume there are N pairs of image-text data together with their labels, so that each sample is an image-text-label triple consisting of an image, a text, and a label, and N is the number of such triples. The images, texts, and labels form an image set, a text set, and a label set, respectively. The label network uses one-hot encoding over the label categories: an entry of the label vector is 1 when the image or text sample belongs to the corresponding category and 0 otherwise. The set of triples therefore contains all image, text, and label triples. The hash codes of the image, text, and label modalities share a common length k, and a sign function is used to generate the corresponding binary hash codes. Its definition is as follows:

[0098]
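The binarization and retrieval step described above can be illustrated with a minimal sketch. It assumes that codes take values in {-1, +1}, that zero is mapped to +1, and that retrieval ranks database items by Hamming distance; the helper names and the example values are hypothetical and not taken from this invention.

```python
def sign_binarize(values):
    """Map real-valued outputs to a {-1, +1} hash code with the sign function.

    Assumption: zero is mapped to +1; the tie rule is not taken from the source.
    """
    return [1 if v >= 0 else -1 for v in values]


def hamming_distance(code_a, code_b):
    """Count the positions at which two equal-length codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have the same length")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted by ascending Hamming distance to the query."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming_distance(query_code, database_codes[i]))


# Hypothetical continuous outputs for one text query and three images (k = 4).
query = sign_binarize([0.7, -0.2, 0.1, -0.9])
database = [sign_binarize(v) for v in ([-0.5, 0.3, -0.4, 0.8],
                                       [0.6, -0.1, 0.2, -0.3],
                                       [0.9, 0.4, 0.5, -0.7])]
ranking = rank_by_hamming(query, database)
```

Because the codes are binary, the distance computation reduces to counting mismatched positions, which is the source of the small storage and fast retrieval mentioned in the background.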

[0099] Specifically, as shown in Figure 2, the Modal Feature Fusion Graph Attention Module (GAMMF) fuses image and text features, which are then weighted by the GRU and attention before being input into the GAT network to obtain predicted pseudo-labels. GAT (Graph Attention Network) is a graph neural network based on a self-attention mechanism. Currently, there are no applications of GAT in cross-modal hash retrieval; GCNs are more commonly used for feature extraction or classification. This invention introduces the GAT network into cross-modal hash retrieval, where it can effectively learn representations of graph-structured data. GAT can use the structural information of heterogeneous graphs to place image and text data in a unified space, thereby better capturing high-level semantic relationships between data. By aggregating neighbor features through a multi-layer graph attention network, the expressive power of each node is enhanced, and different weights can be adaptively assigned to different neighbors. The model is trained using adversarial loss and triplet loss to achieve personalized cross-modal retrieval, improving the accuracy and efficiency of retrieval. In summary, GAT can effectively fuse image and text features and, through its self-attention mechanism, adaptively weight data from different modalities to generate more accurate pseudo-labels. This helps address the problem of insufficient cross-modal data annotation and improves the generalization ability of the model.

[0100] Some methods use GCN as a feature extractor to extract features from different modalities. However, this can degrade retrieval performance because it's impossible to construct graph-structured retrieval data for every retrieval set. Therefore, this invention uses a weighted fusion operation to integrate features from different modalities, allowing the predicted pseudo-labels to compensate for the limited richness of text data and thus supplement it. In this invention, a cross-modal feature fusion graph attention classification learning module based on feature fusion is applied to the retrieval framework. Although the GAT network requires co-occurrence matrix information, this invention reuses the adjacency matrix as the co-occurrence matrix through cosine quantization weighting. Label information can be considered, to some extent, as a weight matrix. By reusing the label and adjacency matrices, feature fusion optimizes the retrieval process, not just during the training phase.

[0101] The cross-modal feature fusion graph attention classification learning module of this invention is configured as shown in Figure 2. Assume that image features and text features have been extracted by the image feature extractor and the text feature extractor, respectively, and that label features have been extracted by the label network. The adjacency matrix is defined as $A \in \mathbb{R}^{C \times C}$, where C is the dimension of the label, i.e., the number of categories. With a symbol defined for each layer of the GAT network, the feature fusion method of the present invention can be expressed as:

[0102]

[0103] where the result is the feature tensor formed by concatenating the features of the two modalities;

[0104]

[0105] where the result is the feature vector extracted by the first deep feature fusion layer of this invention. To better fuse feature representations from different modalities, this invention uses a GRU for deeper cross-modal fusion, specifically as follows:

[0106]

[0107] where the result is the feature extracted using the GRU. The features obtained at different levels are then subjected to different weighting operations, as follows:

[0108]

[0109]

[0110] where the two results denote the attention generated for the first-level fused feature and the attention generated after the GRU, respectively.
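As a reference point for the GRU fusion step, the textbook GRU update can be written as follows. The symbols are the conventional ones and are assumptions here, not the notation of this invention: $x_t$ is the input (assumed to be the fused image-text feature), $h_{t-1}$ is the previous hidden state, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication. Some libraries swap the roles of $z_t$ and $1 - z_t$ in the last line.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) && \text{(update gate)} \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) && \text{(candidate state)} \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(new state)}
\end{aligned}
```

The gates let the fused representation keep part of the earlier feature and overwrite the rest, which is one plausible reading of the "deeper cross-modal fusion" described above.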

[0111] The present invention then multiplies the obtained attentions with the corresponding features, as follows:

[0112]

[0113] Next, the present invention maps the obtained attention back to the original feature space dimension through a linear layer and forms a residual connection with the fused image-text features to enhance the robustness of the network module. The specific operation is as follows:

[0114]

[0115] where the result denotes the fused attention features obtained after the residual connection, PReLU is the activation function used, RMSNorm denotes Root Mean Square Layer Normalization, and the linear layer maps the attention features back to the original feature dimension.
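The residual step above combines three standard building blocks, which can be sketched as follows. The order of operations (linear layer, then RMSNorm, then PReLU, then the residual addition), the PReLU slope of 0.25, the epsilon value, and all sizes and weights are assumptions made for illustration; only the components themselves are named in the text.

```python
import math


def rms_norm(x, gain=None, eps=1e-8):
    """Root mean square layer normalization: x / sqrt(mean(x^2) + eps) * gain."""
    gain = gain if gain is not None else [1.0] * len(x)
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def prelu(x, slope=0.25):
    """PReLU: identity for positive inputs, a learnable slope for negative ones."""
    return [v if v > 0 else slope * v for v in x]


def linear(x, weight, bias):
    """Dense layer; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def residual_fusion(attention, fused, weight, bias):
    """Project attention back to the feature dimension and add the skip path.

    Assumed order (not confirmed by the source): Linear -> RMSNorm -> PReLU -> add.
    """
    projected = prelu(rms_norm(linear(attention, weight, bias)))
    return [p + f for p, f in zip(projected, fused)]


# Hypothetical sizes: a 2-dim attention vector mapped back to a 3-dim feature space.
attention = [0.5, -1.0]
fused = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
output = residual_fusion(attention, fused, weight, bias)
```

Unlike LayerNorm, RMSNorm rescales by the root mean square only and does not subtract the mean, which makes it slightly cheaper to compute.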

[0116]

[0117] where the terms denote, respectively, global max pooling with a pooling window of (1, 1), the Local Kernel Alignment (LKA) module, and a one-dimensional convolution with a kernel size of 3 x 3, a stride of 1, and padding of 1.

[0118] The algorithm for LKA is as follows:

[0119] Assume an input tensor of a given shape.

[0120] 1. Create a copy of the input tensor.

[0121] 2. Pass the input through a convolutional layer and record the result.

[0122] 3. Apply the activation function to the result:

[0123] 4. Create a variable that holds a copy of the activated result.

[0124] 5. Compute the attention weights through three different convolutional layers:

[0125]

[0126] These convolutional layers are used to compute the attention weights.

[0127] 6. Multiply the attention weights by the copied variable to obtain the weighted feature vector:

[0128]

[0129] 7. Pass the weighted feature vector through a convolutional layer to obtain the final output:

[0130]

[0131] 8. Add the final output to the copy of the input tensor to form the residual connection:

[0132]

[0133] These formulas represent the LKA algorithm flow, in which the multiplication is element-wise and the activation function is ReLU.
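The eight LKA steps can be traced in a small single-channel sketch. The kernel sizes, the dilation of the middle convolution, and the choice to stack the three attention convolutions sequentially are assumptions; the text only states that three different convolutional layers produce the attention weights. The function and parameter names are hypothetical.

```python
def conv1d_same(signal, kernel, dilation=1):
    """Single-channel 1-D cross-correlation with zero padding chosen so that
    the output has the same length as the input (odd kernel sizes only)."""
    half = (len(kernel) // 2) * dilation
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j * dilation] for j, k in enumerate(kernel))
            for i in range(len(signal))]


def relu(x):
    return [max(0.0, v) for v in x]


def lka_block(x, k_in, k_a, k_b, k_c, k_out):
    shortcut = list(x)                              # step 1: copy for the residual path
    y = relu(conv1d_same(x, k_in))                  # steps 2-3: convolution, then ReLU
    u = list(y)                                     # step 4: copy of the activated tensor
    attn = conv1d_same(y, k_a)                      # step 5: three convolutions,
    attn = conv1d_same(attn, k_b, dilation=2)       #         assumed to be stacked
    attn = conv1d_same(attn, k_c)
    weighted = [a * v for a, v in zip(attn, u)]     # step 6: element-wise product
    out = conv1d_same(weighted, k_out)              # step 7: output convolution
    return [o + s for o, s in zip(out, shortcut)]   # step 8: residual addition


# Identity kernels make the data flow easy to follow by hand.
x = [1.0, -2.0, 3.0, 0.5]
result = lka_block(x, k_in=[1.0], k_a=[0.0, 1.0, 0.0],
                   k_b=[0.0, 1.0, 0.0], k_c=[1.0], k_out=[1.0])
```

With identity kernels the attention equals the activated input, so each output is `relu(x)**2 + x`; learned kernels would instead let each position weight itself by its neighborhood.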

[0134] In the GAT module, this invention fuses the obtained fused features with the features obtained from GAT to obtain pseudo-labels. The specific process is as follows:

[0135]

[0136]

[0137] In these formulas, the node feature vectors are the outputs of the previous layer; the weight matrix is shared across nodes; 'a' denotes the attention weight vector, which is a parameter to be learned; an activation function is applied to the attention scores; || denotes vector concatenation; and the number of nodes is fixed by the graph. In GAT, each node has an attention weight with respect to every other node, which regulates how information propagates. These attention weights are obtained from a linear combination of the node features and are then normalized so that they sum to 1. Finally, the new feature of a node is the weighted sum of the features of all its neighboring nodes, with the weights given by the attention weights. The operation rules of GAT can be expressed as follows:

[0138]
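The GAT rule described above (score each neighbor, normalize the scores to sum to 1, take the weighted sum of neighbor features) can be sketched for a single attention head. The LeakyReLU activation with slope 0.2 follows the commonly used GAT formulation and is an assumption here, as are the function names, the toy graph, and the omission of a final nonlinearity.

```python
import math


def matvec(weight, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(features, adjacency, weight, attn_vec):
    """One single-head graph attention layer.

    features:  list of N node feature vectors
    adjacency: N x N matrix, nonzero where node j is a neighbour of node i
               (every node needs at least one neighbour, e.g. a self-loop)
    weight:    shared projection matrix
    attn_vec:  attention vector applied to the concatenation [W h_i || W h_j]
    """
    projected = [matvec(weight, h) for h in features]
    outputs, all_alphas = [], []
    for i, h_i in enumerate(projected):
        neighbours = [j for j, edge in enumerate(adjacency[i]) if edge != 0]
        scores = [leaky_relu(sum(c * v for c, v in zip(attn_vec, h_i + projected[j])))
                  for j in neighbours]
        peak = max(scores)                      # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]      # attention weights sum to 1
        new_h = [sum(alpha * projected[j][d] for alpha, j in zip(alphas, neighbours))
                 for d in range(len(h_i))]
        outputs.append(new_h)
        all_alphas.append(alphas)
    return outputs, all_alphas


# Hypothetical 3-node graph with self-loops and 2-dim node features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
weight = [[1.0, 0.0], [0.0, 1.0]]
attn_vec = [0.5, -0.5, 0.25, 0.25]
outputs, alphas = gat_layer(features, adjacency, weight, attn_vec)
```

In this invention the nodes would correspond to label categories and the adjacency to the cosine adjacency matrix; the toy graph above only demonstrates the mechanics.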

[0139] The output of the last GAT layer is defined as a function that accepts two inputs: the weighted fusion feature obtained by the GRU module and the cosine adjacency matrix. The process is as follows:

[0140]

[0141] M denotes the pseudo-label obtained through the GAT module and the fusion module, and the fusion operator denotes the attention fusion process. From M, this invention calculates the similarity probability score between the final pseudo-labels and the real labels. The label loss of this invention is calculated as follows:

[0142]

[0143] where $\sigma$ denotes the sigmoid function, whose expression is:

[0144] $\sigma(x) = \frac{1}{1 + e^{-x}}$

[0145] Because the feature fusion module interacts well with the features, the graph attention classification learning module can generate relatively high-quality pseudo-labels, thus better compensating for the imbalance of text features and better learning hash representations.

[0146] As shown in Figure 3, the Multi-Scale Label Feature Fusion Attention Network (MSFA) consists of a feature extraction module and four hierarchical multi-scale attention modules, whose outputs are then weighted and fused. Label information, like text, contains a large amount of feature information. In cross-modal hash retrieval, this invention treats labels as a supplement to text to compensate for the limited semantic features of text. Unlike the modal feature fusion graph attention module described above, which primarily fuses image and text features, the label network treats labels as a new type of modal information. In cross-modal hash retrieval tasks, the same image sample corresponds to multiple different labels; therefore, labels also carry multi-scale information. To exploit this, this invention employs multi-level modules to extract multi-scale features of labels, fuses the extracted features with attention, and then filters them to generate the corresponding hash codes. The specific algorithm is as follows:

[0147]

[0148]

[0149]

[0150] where the four terms denote convolutional layers with non-uniform strides of 10, 5, 3, and 1, respectively. Each is computed as follows:

[0151]

[0152] where the terms denote, respectively, the stride of the one-dimensional convolution used, the samples with different labels, and global max pooling with a pooling window size of (1, 1).
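The idea of reading the same label vector at strides 10, 5, 3, and 1 and pooling each result can be sketched as below. Only the strides and the use of global max pooling come from the text; the shared kernel, the reduction of each scale to a single scalar, and the softmax used as the attention fusion are assumptions, and the function names are hypothetical.

```python
import math


def conv1d_strided(signal, kernel, stride):
    """Valid 1-D cross-correlation evaluated every `stride` positions."""
    last = len(signal) - len(kernel)
    return [sum(k * signal[i + j] for j, k in enumerate(kernel))
            for i in range(0, last + 1, stride)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_label_feature(label_vector, kernel, strides=(10, 5, 3, 1)):
    """Extract one pooled feature per stride, then fuse them with attention."""
    # Global max pooling keeps the strongest response at each scale.
    pooled = [max(conv1d_strided(label_vector, kernel, s)) for s in strides]
    # Assumed attention: softmax over the pooled responses.
    weights = softmax(pooled)
    fused = sum(w * p for w, p in zip(weights, pooled))
    return pooled, weights, fused


# Hypothetical multi-hot label vector over 24 categories.
labels = [0.0] * 24
for index in (2, 3, 11, 20):
    labels[index] = 1.0
pooled, weights, fused = multi_scale_label_feature(labels, kernel=[1.0, 1.0, 1.0])
```

Large strides sample the label vector coarsely and can miss adjacent active categories, while stride 1 sees every window; here only the stride-1 branch detects that categories 2 and 3 are active together.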

[0153]

[0154]

[0155] where the result denotes the feature information extracted through multi-scale attention. The structure of the autoencoder / decoder is as follows:

[0156]

[0157] where LN denotes LayerNorm, the convolution used is one-dimensional, and u denotes the dimensions used, which are 2048 and 8192, respectively. Finally, this invention uses a linear layer followed by an activation function to map the obtained multi-scale label fusion features to the hash code length required by this invention:

[0158]

[0159] where the result denotes the text hash code, k denotes the length of the hash code, and the mapping denotes the hash layer.

[0160] As shown in Figure 4, the Deep Feature Text Network (DSTN) consists of two deep extraction modules and an autoencoder / decoder. Text tags are converted into bag-of-words (BoW) vectors, and this invention processes them with a deep text feature extraction network instead of the traditional fully connected layers (MLP) or Transformers. Because BoW features are sparse, this network learns text features better than an MLP, aggregating more of the sparse text features and thus learning hash features more effectively. Compared with Transformers, it requires fewer computational resources, computes faster, and has fewer parameters, while not significantly increasing the parameter count relative to fully connected layers (MLP).
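For readers unfamiliar with the bag-of-words conversion mentioned above, a minimal sketch follows. The vocabulary construction, the use of raw counts rather than binary indicators, and the example tags are assumptions for illustration only.

```python
def build_vocabulary(documents):
    """Assign a column index to every distinct token, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary token occurs; unknown tokens are ignored."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        index = vocabulary.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# Hypothetical text tags attached to three images.
corpus = [["dog", "grass", "dog"], ["cat", "sofa"], ["dog", "sofa"]]
vocab = build_vocabulary(corpus)
vectors = [bag_of_words(doc, vocab) for doc in corpus]
```

With a realistic vocabulary of thousands of tags, almost every entry of each vector is zero, which is the sparsity the paragraph above refers to.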

[0161] The text network proposed in this invention is relatively simple, and the specific algorithm is as follows:

[0162]

[0163]

[0164] where the convolution is one-dimensional with a kernel size of 3 x 3, a stride of 1, and padding of 1; its input dimension is the dimension of the text bag-of-words vector, and its output dimension is 2048. The pooling is global average pooling with a window size of (1, 1). The first mapping of the autoencoder / decoder has input and output dimensions of 2048 and 8192, respectively; the second, conversely, has input and output dimensions of 8192 and 2048.

[0165] To obtain a better text representation, this invention further uses a deep module to fuse the features obtained from the autoencoder / decoder. The specific process is as follows:

[0166]

[0167] where the result denotes the text features and k denotes the different feature dimensions. The convolution is one-dimensional with a kernel size of 3 x 3, a stride of 1, and padding of 1; its input dimension is 2048, and its output dimension is also 2048.

[0168] Finally, this invention uses a linear layer and the Tanh function to map text features to the hash code length required by this invention:

[0169]

[0170] where the result denotes the text hash code, k denotes the length of the hash code, and the mapping denotes the hash layer.
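The hash layer described above, a linear layer followed by Tanh, can be sketched as follows. The feature size, the code length k = 2, and the weights are hypothetical; in the method itself the input would be the 2048-dimensional text feature.

```python
import math


def hash_layer(features, weight, bias):
    """Project a feature vector to k dimensions and squash each value into (-1, 1).

    weight holds k rows, one per hash bit; bias holds k offsets.
    """
    return [math.tanh(sum(w * f for w, f in zip(row, features)) + b)
            for row, b in zip(weight, bias)]


# Hypothetical 3-dim text feature mapped to a hash code of length k = 2.
text_feature = [0.5, -1.0, 2.0]
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
relaxed_code = hash_layer(text_feature, weight, bias)
```

Tanh keeps the outputs continuous and differentiable during training while pushing them toward -1 and +1, so that applying the sign function afterwards loses little information.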

Claims

1. A cross-modal hash retrieval method, characterized in that it includes the following steps: 1) a cross-modal feature learning network comprising an image network, a text network, and a label network is provided, and the cross-modal feature fusion graph attention classification learning module maps the information of each image-text-label triple to a high-dimensional Hamming space; 2) the SwinT-S model, that is, the SwinT Small model, is used to extract semantic features of images; two deep feature extraction modules and an autoencoder are designed as a text network to learn text features, and each text is converted into a bag-of-words (BoW) vector using the bag-of-words method; multi-scale label feature information is extracted using an attention network that integrates multi-scale features; 3) the feature fusion graph attention classification learning module is used to deeply fuse and align the acquired image semantic features and text semantic features; 4) a deep text feature extraction network is used to optimize the generation of text hash codes, resulting in higher-quality text hash codes; 5) a linear layer and the Tanh function are used to map text features to the corresponding hash code length.

2. The cross-modal hash retrieval method according to claim 1, characterized in that: in step 1, for the cross-modal feature fusion graph attention classification learning module, image features and text features are extracted by the image feature extractor and the text feature extractor, respectively, and label features are extracted by the label network; the adjacency matrix is defined as $A \in \mathbb{R}^{C \times C}$, where C is the dimension of the label, i.e., the number of categories; a symbol is defined for each layer of the GAT network; in the feature fusion method, a feature tensor is formed by concatenating the features of the two modalities, and a feature tensor is then extracted by the first deep feature fusion layer.

3. The cross-modal hash retrieval method according to claim 2, characterized in that: a GRU is used for deeper cross-modal fusion to obtain the features extracted using the GRU; the features obtained at different levels are then subjected to different weighting operations, yielding the attention generated for the first-level fused feature and the attention generated after the GRU.

4. The cross-modal hash retrieval method according to claim 3, characterized in that: the obtained attentions are multiplied with the corresponding features; next, the obtained attention is mapped back to the original feature space dimension through a linear layer and combined with the fused image-text features through a residual connection to enhance the robustness of the network module, yielding the fused attention features obtained after the residual connection, wherein PReLU is the activation function used, RMSNorm denotes Root Mean Square Layer Normalization, and the linear layer maps the attention features back to the original feature dimension; and wherein global max pooling with a pooling window of (1, 1), the Local Kernel Alignment (LKA) module, and a one-dimensional convolution with a kernel size of 3 x 3, a stride of 1, and padding of 1 are used.

5. The cross-modal hash retrieval method according to claim 4, characterized in that: the LKA algorithm is as follows. Assume the input tensor is [symbol], with shape [formula]: ① create a copy tensor of the input; ② pass the input through a convolutional layer and record the result; ③ apply the activation function to the result; ④ create a variable that is a copy of the activated result; ⑤ compute the attention weights through three different convolutional layers; ⑥ multiply the attention weights with the variable from step ④ to obtain a weighted feature tensor; ⑦ pass the weighted feature tensor through a convolutional layer to obtain the final output; ⑧ add the final output to the copy tensor from step ① to form the residual connection. In the formulas of the LKA algorithm flow, [symbol] denotes element-wise multiplication and [symbol] denotes the ReLU activation function.
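
The LKA flow listed in this claim (copy, convolution, activation, three attention convolutions, element-wise multiplication, output convolution, residual addition) can be sketched on a single-channel one-dimensional signal. This is a simplified illustration, not the claimed module: every convolution here uses a hypothetical kernel of size 3 with stride 1 and zero padding 1, and the names `conv1d_same` and `lka_flow` are assumptions.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with kernel size 3, stride 1, zero padding 1."""
    padded = [0.0] + list(x) + [0.0]
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def lka_flow(x, k_in, k_attn, k_out):
    shortcut = list(x)                       # copy of the input for the residual
    y = relu(conv1d_same(x, k_in))           # input convolution, then ReLU
    u = list(y)                              # copy of the activated features
    attn = y
    for kernel in k_attn:                    # three convolutions give the attention weights
        attn = conv1d_same(attn, kernel)
    weighted = [a * b for a, b in zip(u, attn)]   # element-wise multiplication
    out = conv1d_same(weighted, k_out)       # output convolution
    return [o + s for o, s in zip(out, shortcut)]  # residual connection

# Toy usage: identity kernels make every convolution a pass-through.
identity = [0.0, 1.0, 0.0]
result = lka_flow([1.0, -2.0, 3.0], identity, [identity] * 3, identity)
```

With identity kernels the attention weights equal the activated features, so the output reduces to relu(x) squared plus x, which makes the data flow easy to verify by hand.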

6. The cross-modal hash retrieval method according to claim 2, characterized in that: in the GAT module, the obtained fusion features [symbol] are fused with the features obtained from the GAT to obtain pseudo-labels. The specific process is as follows: [formula], where [symbol] denotes the feature vector of node [symbol], i.e., the output of the previous layer; [symbol] denotes the weight matrix; a denotes the attention weight vector, which is a parameter to be learned; [symbol] denotes the activation function; || denotes the vector concatenation operation; and [symbol] denotes the number of nodes. In the GAT, each node [symbol] has a corresponding attention weight [symbol] with respect to each other node [symbol], which regulates the propagation of information. The attention weights are obtained from a linear combination of the node features and are then normalized so that they sum to 1. Finally, the new feature of a node is the weighted sum of the features of all its neighboring nodes, with the weights given by the attention weights [symbol]. The operation rule of the GAT is expressed as: [formula]. The function of the last layer, [symbol], is defined by its output: [formula]. It accepts two inputs: [symbol], the weighted fusion feature obtained by the GRU module, and [symbol], the cosine adjacency matrix. The subsequent process is as follows: [formula], where M denotes the pseudo-label obtained through the GAT module and the fusion module. The attention fusion process calculates the similarity probability score between the final pseudo-labels M and the real labels. The label loss is calculated as follows: [formula], where [symbol] denotes the sigmoid function, whose expression is: [formula].

7. The cross-modal hash retrieval method according to claim 2, characterized in that: in step 2, multi-level modules are used to extract multi-scale features of the labels, attention is used to fuse the extracted multi-scale label features, and the features are then filtered to generate the corresponding hash codes. The specific algorithm is as follows: [formula], where [symbol], [symbol], [symbol], and [symbol] denote the convolutional layers, which use non-uniform strides of 10, 5, 3, and 1; the algorithm for [symbol] is as follows: [formula], where [symbol] denotes the stride of the one-dimensional convolution used; [symbol] denotes the different label samples used; [symbol] denotes global max pooling with a pooling window size of (1, 1); and [symbol] denotes the feature information extracted through multi-scale attention. The structure of the autoencoder-decoder is as follows: [formula], where LN denotes LayerNorm, and [symbol] denotes a one-dimensional convolution in which u denotes the dimensions used, namely 2048 and 8192 respectively. A linear layer and the Tanh function are used to map the obtained multi-scale label fusion features to the corresponding hash code length: [formula], where [symbol] denotes the text hash code, k denotes the length of the hash code, and [symbol] denotes the hash layer.
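
The graph attention update described in claim 6 (attention scores computed from concatenated projected node features, normalized so that they sum to 1, followed by a weighted sum over neighboring nodes) can be sketched as a single attention head. This is a minimal sketch of the standard GAT layer, not the claimed module: the LeakyReLU activation with slope 0.2 is an assumption taken from the usual GAT formulation, and the function names and toy values are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(h, adj, W, a):
    """h: node features; adj[i][j] > 0 marks j as a neighbor of i;
    W: shared weight matrix; a: attention vector over [W h_i || W h_j].
    Assumes every node has at least one neighbor (for example a self-loop)."""
    wh = [matvec(W, hi) for hi in h]
    new_h, all_alpha = [], []
    for i in range(len(h)):
        nbrs = [j for j in range(len(h)) if adj[i][j] > 0]
        scores = [leaky_relu(sum(ak * vk for ak, vk in zip(a, wh[i] + wh[j])))
                  for j in nbrs]
        top = max(scores)                          # subtract max for a stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]          # attention weights sum to 1
        new_h.append([sum(al * wh[j][d] for al, j in zip(alpha, nbrs))
                      for d in range(len(wh[i]))])
        all_alpha.append(dict(zip(nbrs, alpha)))
    return new_h, all_alpha

# Toy graph: 3 fully connected nodes with self-loops (hypothetical values).
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated, alphas = gat_layer(nodes, adjacency, identity, [0.1, -0.2, 0.3, 0.4])
```

Setting the attention vector to zero makes all scores equal, in which case each node simply averages its neighbors; a learned attention vector shifts that average toward the more relevant neighbors.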

8. The cross-modal hash retrieval method according to claim 1, characterized in that: in step 2, a deep text feature extraction network is used, and the text tags are converted into bag-of-words (BoW) vectors. The specific algorithm is as follows: [formula], where [symbol] denotes a one-dimensional convolution with a kernel size of 3 × 3, a stride of 1, and padding of 1, whose input dimension is the dimension of the text bag-of-words vector and whose output dimension is 2048; [symbol] denotes global average pooling with a window size of (1, 1); [symbol] denotes a layer with input and output dimensions of 2048 and 8192 respectively; and, conversely, [symbol] denotes a layer with input and output dimensions of 8192 and 2048 respectively.
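
The conversion of text tags into a bag-of-words vector mentioned in this claim can be illustrated as follows. This is a minimal sketch assuming a binary indicator encoding over a fixed vocabulary; the vocabulary, the tags, and the function name `bag_of_words` are hypothetical, and the source does not state whether counts or indicators are used.

```python
def bag_of_words(tags, vocabulary):
    """Return a binary vector with 1.0 at the position of every tag found in the vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0.0] * len(vocabulary)
    for tag in tags:
        position = index.get(tag.lower())
        if position is not None:          # tags outside the vocabulary are ignored
            vector[position] = 1.0
    return vector

# Toy usage: "kite" is not in the vocabulary and is therefore dropped.
vocabulary = ["beach", "dog", "sky", "tree"]
bow = bag_of_words(["Sky", "dog", "kite"], vocabulary)
```

The length of the resulting vector equals the vocabulary size, which is the input dimension of the first convolution described in the claim.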

9. The cross-modal hash retrieval method according to claim 1, characterized in that: in step 3, the features obtained from the autoencoder-decoder are fused using deep modules. The specific process is as follows: [formula], where [symbol] denotes the text features and k denotes the different feature dimensions; [symbol] denotes a one-dimensional convolution with a kernel size of 3 × 3, a stride of 1, and padding of 1, whose input dimension and output dimension are both 2048.

10. The cross-modal hash retrieval method according to claim 1, characterized in that: in step 4, the linear layer and the Tanh function are as follows: [formula], where [symbol] denotes the text hash code, k denotes the length of the hash code, and [symbol] denotes the hash layer.
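
The hash layer of this claim (a linear layer followed by the Tanh function, producing a code of length k) can be sketched as follows. This is a minimal sketch with hypothetical weights: the final sign step that turns the relaxed values into a binary code is a common practice in deep hashing and is an assumption here, since the claim itself only names the linear layer and Tanh.

```python
import math

def hash_layer(features, weight, bias):
    """Map a feature vector to k relaxed values in (-1, 1) and a binary code in {-1, +1}."""
    relaxed = [math.tanh(sum(w * f for w, f in zip(row, features)) + b)
               for row, b in zip(weight, bias)]
    code = [1 if v >= 0 else -1 for v in relaxed]   # assumed binarization by sign
    return relaxed, code

# Toy usage: 3-dimensional features mapped to a hash code of length k = 4.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0],
          [0.0, 0.0, 1.0],
          [-1.0, -1.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.0]
relaxed, code = hash_layer([0.5, -1.0, 2.0], weight, bias)
```

Tanh keeps the outputs differentiable during training while pushing them toward the two binary values, which is why it is commonly placed directly before binarization.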