Multimodal hash retrieval method and apparatus, electronic device, and storage medium
By constructing a sample training set, deriving attribution confidence from feature vectors fused across the modal data, and refining category prototypes through confidence-weighted fusion, the method addresses the low retrieval accuracy of cross-modal hashing in multi-label noise scenarios and achieves efficient and robust multimodal hash retrieval.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- STATE GRID SICHUAN ELECTRIC POWER CO MARKETING SERVICE CENT
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-19
AI Technical Summary
Existing cross-modal hashing techniques have low retrieval accuracy in real-world multi-label noise scenarios and cannot effectively adapt to the problem of redundant noise in labeled data.
By constructing a sample training set, the attribution confidence is obtained by fusing feature vectors from the modal data of the samples to be trained. Based on the attribution confidence, the feature vectors are weighted and fused to dynamically optimize and refine the category prototype, obtain the predicted probability distribution, and optimize the initial hash retrieval model to obtain the target hash retrieval model.
It significantly narrows the semantic gap between modalities without the need for clean labels, improves the accuracy and robustness of multimodal hash retrieval, reduces the reliance on high-quality manual annotation, and enhances model training efficiency and cross-modal retrieval performance.
Figure CN122240895A_ABST
Description
Technical Field
[0001] This invention relates to the field of cross-modal hashing technology, specifically to a multimodal hash retrieval method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Cross-modal hashing technology maps sample data to binary hash codes and performs distance calculations in Hamming space to achieve efficient cross-modal retrieval, which has significant application value in large-scale data management. Currently, although cross-modal hashing has made significant progress in both unsupervised and supervised approaches, it has not yet escaped the ideal assumption of "clean labels and one-to-one semantic correspondence," thus systematically failing in real-world multi-label noise scenarios. Unsupervised methods rely solely on maximizing cross-modal relevance, lacking semantic anchors, and their retrieval accuracy rapidly declines in complex contexts. Traditional supervised methods, while introducing labels, assume that the labels are complete and correct. Once the labeling contains mixed noise of "partially correct + partially incorrect," the model treats the incorrect labels as reliable supervision, amplifying false associations, distorting the common hash space, and causing a precipitous drop in performance.
[0003] Therefore, classification methods based on cross-modal hashing rely on completely accurate supervision information, making it difficult to adapt to the problem of redundant noise in labeled data in real-world scenarios. Furthermore, the candidate label set of a sample often contains incorrect, noisy labels; directly utilizing all labels can lead to model overfitting to noise and distorting cross-modal semantic alignment, thus significantly reducing retrieval performance.
[0004] Therefore, in order to overcome the above-mentioned technical problems, the present invention provides a multimodal hash retrieval method, apparatus, electronic device, and storage medium.
Summary of the Invention
[0005] The technical problem to be solved by the present invention is how to improve the accuracy of multimodal hash retrieval. The purpose is to provide a multimodal hash retrieval method, device, electronic device, and storage medium to improve the accuracy of multimodal hash retrieval.
[0006] This invention is achieved through the following technical solution:
[0007] In a first aspect, a multimodal hash retrieval method is provided, the method comprising: constructing a sample training set based on a plurality of training samples and sample labels corresponding to each training sample; the sample labels indicating the category to which the training sample belongs; inputting the sample training set into a preset initial hash retrieval model to obtain a predicted probability distribution corresponding to each training sample; the predicted probability distribution being obtained by classifying refined category prototypes; the refined category prototypes being obtained by dynamically weighting the fused feature vectors corresponding to each training sample using the category attribution confidence of each training sample; the attribution confidence being obtained through the fused feature vectors; the fused feature vectors being obtained by extracting and fusing modal data corresponding to each training sample respectively; obtaining a model loss based on the predicted probability distribution, the sample labels, and each refined category prototype; optimizing the initial hash retrieval model using the model loss to obtain a target hash retrieval model; and retrieving the data to be retrieved using the target hash retrieval model.
[0008] In some embodiments, the initial hash retrieval model obtains the fused feature vector in the following manner: obtaining modal data corresponding to each of the training samples; concatenating the modal data corresponding to each of the training samples to obtain modal concatenation data corresponding to each of the training samples; calculating the Euclidean norm corresponding to each of the modal concatenation data; and calculating the fused feature vector corresponding to each of the training samples using the modal concatenation data as the numerator and the Euclidean norm as the denominator.
[0009] In some embodiments, the initial hash retrieval model obtains the attribution confidence as follows: it initializes the category prototype corresponding to each category based on the fused feature vector to obtain the original category prototype; it calculates the cosine similarity between each training sample and each of the original category prototypes based on the fused feature vector; and it obtains the attribution confidence of each training sample to each category based on the cosine similarity.
[0010] In some embodiments, obtaining the model loss based on the predicted probability distribution, the sample labels, and each refined category prototype includes: obtaining the intra-class aggregation loss corresponding to the training sample based on each sample label, the refined category prototype, and the fused feature vector; obtaining the inter-class separation loss corresponding to the training sample based on the refined category prototype; obtaining the inter-instance contrast alignment loss corresponding to the training sample based on each sample label; obtaining the intra-instance positive label ranking loss based on each sample label and the sample category probability corresponding to each training sample; the sample category probability is obtained through the predicted probability distribution; and obtaining the model loss by combining the intra-class aggregation loss, the inter-class separation loss, the inter-instance contrast alignment loss, and the intra-instance positive label ranking loss.
[0011] In some embodiments, obtaining the intra-class aggregation loss corresponding to the training sample based on each of the sample labels, the refined category prototypes, and the fused feature vectors includes: obtaining the cosine similarity between each of the fused feature vectors and each of the refined category prototypes; and obtaining the intra-class aggregation loss based on each of the sample labels and the cosine similarity.
[0012] In some embodiments, obtaining the inter-instance contrast alignment loss corresponding to the training sample based on each of the sample labels includes: constructing a cross-modal similarity matrix based on the modal data corresponding to each of the training samples; constructing a plurality of positive sample pairs and a plurality of negative sample pairs based on each of the sample labels; and obtaining the inter-instance contrast alignment loss based on the cross-modal similarity matrix, the positive sample pairs, and the negative sample pairs.
[0013] In some embodiments, obtaining the positive label ranking loss within an instance based on the sample labels and the sample class probabilities corresponding to each training sample includes: obtaining the positive and negative label pairs corresponding to each training sample based on the sample labels; and obtaining the positive label ranking loss within the instance based on the sample class probabilities and the positive and negative label pairs.
[0014] In a second aspect, a multimodal hash retrieval device is provided, comprising: a construction module configured to construct a sample training set based on a plurality of training samples and sample labels corresponding to each training sample; the sample labels are used to indicate the category to which the training sample belongs; an input module configured to input the sample training set into a preset initial hash retrieval model to obtain a predicted probability distribution corresponding to each training sample; the predicted probability distribution is obtained by classifying refined category prototypes; the refined category prototypes are obtained by dynamically weighting the fused feature vectors corresponding to each training sample using the category attribution confidence of each training sample; the attribution confidence is obtained through the fused feature vectors; the fused feature vectors are obtained by extracting and fusing modal data corresponding to each training sample respectively; a loss acquisition module configured to acquire a model loss based on the predicted probability distribution, the sample labels, and each refined category prototype; an optimization module configured to optimize the initial hash retrieval model using the model loss to obtain a target hash retrieval model; and a retrieval module configured to retrieve the data to be retrieved using the target hash retrieval model.
[0015] In a third aspect, an electronic device is provided, comprising a processor and a memory storing program instructions, the processor being configured to execute the aforementioned multimodal hash retrieval method when running the program instructions.
[0016] In a fourth aspect, a storage medium is provided, storing program instructions that, when executed, perform the aforementioned multimodal hash retrieval method.
[0017] Compared with existing technologies, this invention constructs a sample training set based on several training samples and sample labels indicating the category to which the training samples belong. The training set is then input into a pre-defined initial hash retrieval model to extract and fuse the modal data corresponding to each training sample. The fused feature vector is used to obtain the attribution confidence, and the weighted fused feature vector based on the attribution confidence is used to obtain refined category prototypes. The refined category prototypes are then classified to obtain a predicted probability distribution. Based on the predicted probability distribution, sample labels, and each refined category prototype, the model loss is obtained. The model loss is then used to optimize the initial hash retrieval model to obtain a target hash retrieval model, which facilitates the retrieval of the data to be retrieved using the target hash retrieval model. In this way, compared with existing technologies that rely on completely accurate supervision information, this scheme uses the fused feature vector obtained by fusing the modal data corresponding to each training sample to obtain the attribution confidence, and obtains the refined category prototype by weighting the fused feature vector based on the attribution confidence. This realizes the dynamic weighting of the fused feature vector, which can alleviate the interference of noisy labels on semantic center estimation. Thus, it can significantly reduce the semantic gap between modalities without the need for clean labels, thereby greatly improving the robustness of the model to multi-label noise while maintaining high retrieval accuracy.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly described below. It should be understood that the following drawings only show some embodiments of the present invention and should not be considered as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort. In the drawings:
[0019] Figure 1 is a flowchart illustrating a multimodal hash retrieval method provided in an embodiment of this disclosure;
[0020] Figure 2 is a flowchart illustrating a method for obtaining model loss provided in an embodiment of this disclosure;
[0021] Figure 3 is a schematic diagram of the framework of an initial hash model corresponding to an image modality provided in an embodiment of this disclosure;
[0022] Figure 4 is a schematic diagram of the framework of an initial hash model corresponding to a text modality provided in an embodiment of this disclosure;
[0023] Figure 5 is a table of the average precision of different types of cross-modal hashing algorithms or models trained on the MIRFlickr-25k and IAPR TC-1 databases at different noise rates, provided in an embodiment of this disclosure;
[0024] Figure 6 is a table of the average precision of different types of cross-modal hashing algorithms or models trained on the MS-COCO and NUS-WIDE databases at different noise rates, provided in an embodiment of this disclosure;
[0025] Figure 7 is a schematic diagram of a multimodal hash retrieval device provided in an embodiment of this disclosure;
[0026] Figure 8 is a schematic diagram of another multimodal hash retrieval device provided in an embodiment of this disclosure.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the embodiments and accompanying drawings. The illustrative embodiments and descriptions of the present invention are only used to explain the present invention and are not intended to limit the present invention.
[0028] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0029] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.
[0030] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.
[0031] In this application, "multiple" refers to two or more. "And / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0032] Please refer to Figure 1, which is a schematic diagram illustrating a multimodal hash retrieval method as an exemplary embodiment of this application.
[0033] As shown in Figure 1, this disclosure provides a multimodal hash retrieval method, which includes:
[0034] Step S101: Construct a sample training set based on several training samples and the sample labels corresponding to each training sample; the sample labels are used to indicate the category to which the training sample belongs.
[0035] Step S102: Input the sample training set into the preset initial hash retrieval model to obtain the predicted probability distribution corresponding to each training sample; the predicted probability distribution is obtained by classifying the refined category prototype; the refined category prototype is obtained by dynamically weighting the fused feature vector corresponding to each training sample using the category attribution confidence of each training sample; the attribution confidence is obtained through the fused feature vector; the fused feature vector is obtained by extracting and fusing the modal data corresponding to each training sample respectively;
[0036] Step S103: Obtain the model loss based on the predicted probability distribution, the sample labels, and each refined category prototype.
[0037] Step S104: Optimize the initial hash retrieval model using model loss to obtain the target hash retrieval model.
[0038] Step S105: Use the target hash retrieval model to retrieve the data to be retrieved.
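As a rough illustration of the retrieval in step S105, the sketch below ranks database items by Hamming distance to a query code, which is the distance computation the background section attributes to cross-modal hashing. It is a minimal sketch under stated assumptions: the function names `hamming_distance` and `rank_by_hamming`, the ±1 code convention, and the 4-bit example codes are all hypothetical and do not come from the disclosure.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted by ascending Hamming distance to the query."""
    distances = [hamming_distance(query_code, code) for code in database_codes]
    return sorted(range(len(database_codes)), key=lambda idx: distances[idx])


# Hypothetical 4-bit codes: one text query against three image codes.
query = [1, -1, 1, 1]
database = [
    [-1, 1, -1, -1],  # differs from the query in all 4 bits
    [1, -1, 1, -1],   # differs from the query in 1 bit
    [1, -1, 1, 1],    # identical to the query
]
ranking = rank_by_hamming(query, database)
```

Because the query and database codes may come from different modalities but live in the same Hamming space, the same ranking routine serves text-to-image and image-to-text retrieval.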
[0039] The multimodal hash retrieval method provided in this disclosure constructs a sample training set based on several training samples and sample labels indicating the category to which the training samples belong. The training set is then input into a preset initial hash retrieval model. Modal data corresponding to each training sample is extracted and fused. The fused feature vector is used to obtain the attribution confidence. The attribution confidence is then used to weight the fused feature vector to obtain refined category prototypes. The refined category prototypes are then classified to obtain a predicted probability distribution. Based on the predicted probability distribution, sample labels, and each refined category prototype, the model loss is obtained. The model loss is then used to optimize the initial hash retrieval model to obtain a target hash retrieval model, which facilitates the retrieval of the data to be retrieved using the target hash retrieval model. In this way, compared with existing technologies that rely on completely accurate supervision information, this scheme uses the fused feature vector obtained by fusing the modal data corresponding to each training sample to obtain the attribution confidence, and obtains the refined category prototype by weighting the fused feature vector based on the attribution confidence. This realizes the dynamic weighting of the fused feature vector, which can alleviate the interference of noisy labels on semantic center estimation. Thus, it can significantly reduce the semantic gap between modalities without the need for clean labels, thereby greatly improving the robustness of the model to multi-label noise while maintaining high retrieval accuracy.
[0040] Meanwhile, by significantly reducing the reliance on high-quality manual annotation, this approach falls under the category of weakly supervised cross-modal hashing, requiring only noisy multi-label annotations for efficient training. Compared to traditional methods that require precise and complete single / multi-label annotations, this scheme significantly reduces the human and time costs of data annotation. Furthermore, end-to-end joint optimization improves model training efficiency and cross-modal retrieval performance, resulting in greater practicality and scalability.
[0041] It should be noted that the training samples are obtained in the following way: several training data are obtained; modal features corresponding to each modality of each training data are extracted using a preset feature extraction model; and the modal features corresponding to each modality of each training data are determined as each training sample.
[0042] The training data is multimodal data. For example, the training data includes image data of the image modality and text data of the text modality.
[0043] In some embodiments, for image modalities, the feature extraction model may include a CNN (Convolutional Neural Network) such as a VGG19 (Visual Geometry Group 19-layer Network) model; for text modalities, the feature extraction model may include a BERT (Bidirectional Encoder Representations from Transformers) network or a Doc2Vec (Paragraph Vector) model; for audio modalities, the feature extraction model may extract features using the Mel-frequency cepstral coefficient (MFCC) algorithm or the Mel spectrogram algorithm. Feature extraction models may also use other methods for feature extraction, which are not limited here.
[0044] It should be noted that the sample labels corresponding to each training sample are noisy labels.
[0045] For multimodal hashing with noisy labels, the input space of the initial hash retrieval model is X; the label space is Y = {1, 2, ..., C}. Each item in the label space represents a category. C is the total number of categories.
[0046] The sample training set is then denoted $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ and consists of each training sample and its corresponding sample label. Here, $x_i = \{x_i^m\}_{m=1}^{M}$ is the i-th training sample; $x_i^m$ is the m-th modality data of the i-th training sample; $M$ is the number of modal types in the modal data; $y_i$ is the sample label corresponding to the i-th training sample; and $N$ is the number of training samples. It should be noted that M = 2, since the modal types of the modal data include the image modality and the text modality. Therefore, the i-th training sample is $x_i = \{x_i^1, x_i^2\}$, where $x_i^1$ is the image modal data in the i-th training sample and $x_i^2$ is the text modal data in the i-th training sample.
[0047] $y_i \in \{0, 1\}^C$ is the noisy multi-label vector over the C categories for the i-th training sample. The value of each element of this vector indicates whether the i-th training sample is labeled as belonging to the category corresponding to that element. For example, when the c-th element of $y_i$ is 1, the training sample is labeled as belonging to the c-th category; it should be noted that this labeling may be incorrect.
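[0048] In some embodiments, C = 3, so the noisy multi-label vector of the i-th training sample has three elements; when its c-th element is 1, the i-th training sample is labeled as belonging to the c-th category.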
[0049] The predicted probability distribution can be obtained by classifying the refined category prototypes using a preset classifier. A predicted probability distribution is obtained for each training sample and each modality type; the distribution for the m-th modality type represents the category to which the m-th modality data of that training sample belongs.
[0050] It should be noted that the goal of cross-modal hashing is to learn a modality-specific hash function for each modality, which maps data from different modalities into a unified Hamming space. During training, applying the hyperbolic tangent function tanh yields a continuous real-valued representation $h_i^m \in (-1, 1)^L$, and the final binary hash code is then generated using a sign function, $b_i^m = \operatorname{sign}(h_i^m) \in \{-1, +1\}^L$, where L represents the hash code length. Furthermore, a linear classifier combined with an activation function is introduced to obtain the predicted probability distribution of the sample based on the m-th modality type. Through the above mechanism, cross-modal hashing uses compact binary codes to achieve efficient and semantically consistent cross-modal similarity retrieval.
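The relaxation described above (tanh during training, sign for the final code) can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function names, the example `projection` values standing in for a hash layer output, and the convention of mapping an exact zero to +1 are assumptions made here.

```python
import math


def relaxed_code(projection):
    """Continuous code used during training: element-wise tanh, values in (-1, 1)."""
    return [math.tanh(v) for v in projection]


def binary_code(relaxed):
    """Final hash code in {-1, +1}: element-wise sign (0 is mapped to +1 by assumption)."""
    return [1 if v >= 0 else -1 for v in relaxed]


# Hypothetical hash layer output for one sample of one modality, code length L = 4.
projection = [2.3, -0.7, 0.05, -4.1]
h = relaxed_code(projection)
b = binary_code(h)

# Largest per-bit gap between the relaxed and the binary code; values of the
# projection near zero produce the largest quantization gap.
quantization_gap = max(abs(x - y) for x, y in zip(h, b))
```

The tanh output is differentiable, so gradients can flow during training, whereas the sign function is only applied when the final codes are produced.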
[0051] Furthermore, the initial hash retrieval model obtains the fused feature vector in the following way: obtain the modal data corresponding to each training sample; concatenate the modal data corresponding to each training sample to obtain the modal concatenation data corresponding to each training sample; calculate the Euclidean norm corresponding to each modal concatenation data; and calculate the fused feature vector corresponding to each training sample by using the modal concatenation data as the numerator and the Euclidean norm as the denominator.
[0052] In this way, concatenating the modal data corresponding to each training sample yields modal concatenation data that reflects the cross-modal semantic information of the training sample. The fused feature vector corresponding to each training sample is then calculated with the modal concatenation data as the numerator and its Euclidean norm as the denominator, which normalizes the modal concatenation data. Thus, while preserving the cross-modal semantic information, the normalization eliminates scale differences and provides a unified metric space for subsequent prototype construction.
[0053] Furthermore, the modal data corresponding to each training sample is obtained, that is, the training samples are converted into hash representations through a hash code linear layer. These hash representations are then identified as modal data.
[0054] It should be noted that the fused feature vector corresponding to the i-th training sample is obtained by calculating $z_i = \dfrac{[h_i^1; \ldots; h_i^M]}{\lVert [h_i^1; \ldots; h_i^M] \rVert_2}$, where $h_i^1$ is the hash representation of the first modality data, i.e., the first modality type; $h_i^M$ is the hash representation of the M-th modality data, i.e., the M-th modality type; $[h_i^1; \ldots; h_i^M]$ is the modal concatenation data corresponding to the i-th training sample; and $\lVert \cdot \rVert_2$ is the L2 norm, i.e., the Euclidean norm, of the i-th modal concatenation data.
[0055] Specifically, when the types of modal data include the image modality and the text modality, the fused feature vector corresponding to the i-th training sample is $z_i = \dfrac{[h_i^1; h_i^2]}{\lVert [h_i^1; h_i^2] \rVert_2}$. This fused feature preserves cross-modal semantic information and eliminates scale differences through normalization, providing a unified metric space for subsequent prototype construction.
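The fusion step, concatenation followed by division by the Euclidean norm, can be sketched in a few lines. The function name `fuse`, the two-dimensional example codes, and the choice to raise an error on an all-zero vector are assumptions for illustration only.

```python
import math


def fuse(modal_codes):
    """Concatenate per-modality hash representations and L2-normalize the result."""
    concatenated = [v for code in modal_codes for v in code]
    norm = math.sqrt(sum(v * v for v in concatenated))
    if norm == 0.0:
        # Assumption: an all-zero concatenation is treated as invalid input.
        raise ValueError("cannot normalize an all-zero vector")
    return [v / norm for v in concatenated]


# Hypothetical hash representations of one sample (image and text modalities).
image_code = [0.6, -0.8]
text_code = [0.0, 1.0]
z = fuse([image_code, text_code])
```

After normalization every fused vector has unit length, so dot products between fused vectors (or between a fused vector and a unit-length prototype) are cosine similarities.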
[0056] Furthermore, the initial hash retrieval model obtains the attribution confidence as follows: it initializes the category prototype corresponding to each category based on the fused feature vector to obtain the original category prototype; it calculates the cosine similarity between each training sample and each original category prototype based on the fused feature vector; and it obtains the attribution confidence of each training sample to each category based on the cosine similarity.
[0057] The construction of the original category prototype depends on the sample labels, and in multi-label training scenarios the labels are often incorrect or incomplete, so noisy samples can skew the original category prototype and cause it to deviate from the true semantic center. Furthermore, in multi-label data, a sample typically belongs to multiple categories simultaneously. If a hard average is applied to each category independently, the co-occurrence dependency and semantic overlap between categories are ignored, leading to semantic ambiguity or even conflict in the prototype and weakening its effectiveness as a semantic anchor. By calculating the cosine similarity between each training sample and each original category prototype based on the fused feature vector, and then obtaining the attribution confidence of each training sample for each category based on the cosine similarity, the attribution confidence can be used to weight the fused feature vector, mitigating the interference of noisy labels on semantic center estimation.
[0058] Furthermore, initializing the category prototype corresponding to each category based on the fused feature vector to obtain the original category prototype includes calculating $\mu_c^{(0)} = \dfrac{\sum_{i=1}^{N} y_{ic}\, z_i}{\sum_{i=1}^{N} y_{ic} + \epsilon_1}$ to obtain the original category prototype corresponding to the c-th category, where $\mu_c^{(0)}$ is the original category prototype corresponding to the c-th category; $z_i$ is the fused feature vector corresponding to the i-th training sample; $y_{ic} \in \{0, 1\}$ is the value of the c-th element in the sample label corresponding to the i-th training sample; $N$ is the number of training samples; $\epsilon_1$ is a preset first constant whose value is greater than 0, used to prevent division by zero; and $c \in \{1, 2, \ldots, C\}$ denotes the c-th category.
[0059] It should be noted that $y_{ic} = 1$ indicates that the i-th training sample belongs to the c-th category, and $y_{ic} = 0$ indicates that the i-th training sample does not belong to the c-th category.
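The label-based initialization can be illustrated as a masked mean of fused feature vectors per category. This is a sketch under assumptions: the function name `init_prototypes`, the default value of the small constant, and the toy data are hypothetical, and the formula is the plain label-masked average that the text describes as a hard average relying directly on the original labels.

```python
def init_prototypes(fused, labels, eps=1e-8):
    """Label-masked mean of fused feature vectors, one prototype per category.

    fused  : list of N fused feature vectors, each of length D
    labels : list of N binary label vectors, each of length C (possibly noisy)
    eps    : small positive constant guarding against division by zero
    """
    num_classes, dim = len(labels[0]), len(fused[0])
    prototypes = []
    for c in range(num_classes):
        count = sum(y[c] for y in labels)
        total = [sum(y[c] * z[d] for z, y in zip(fused, labels)) for d in range(dim)]
        prototypes.append([t / (count + eps) for t in total])
    return prototypes


# Toy data: three samples, three categories; the third category has no labeled sample.
fused = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
labels = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
protos = init_prototypes(fused, labels)
```

Every labeled sample contributes with the same weight here, so a single mislabeled sample shifts the prototype as much as a correctly labeled one, which is the weakness the confidence weighting is meant to address.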
[0060] While this initialization method is intuitive, it relies directly on the original labels, which can easily introduce bias in noisy multi-label scenarios and fail to accurately reflect the true semantic distribution. Therefore, it is necessary to calculate the cosine similarity between each training sample and each original category prototype based on the fused feature vector; obtain the attribution confidence of each training sample for each category based on the cosine similarity; and then dynamically weight the fused feature vector corresponding to each training sample using the attribution confidence to obtain a refined category prototype, thereby refining and optimizing the category prototype.
[0061] Furthermore, calculating the cosine similarity between each training sample and each original category prototype based on the fused feature vectors includes calculating $s_{ic} = \dfrac{z_i^{T} \mu_c^{(0)}}{\lVert z_i \rVert_2 \, \lVert \mu_c^{(0)} \rVert_2}$ to obtain the cosine similarity between the i-th training sample and the original category prototype corresponding to the c-th category, where $s_{ic}$ is that cosine similarity; $z_i$ is the fused feature vector corresponding to the i-th training sample; $\mu_c^{(0)}$ is the original category prototype corresponding to the c-th category; and T represents the transpose operation.
[0062] Furthermore, obtaining the attribution confidence of each training sample for each category based on the cosine similarity includes calculating $w_{ic} = \dfrac{\exp(s_{ic} / \tau)}{\sum_{c'=1}^{C} \exp(s_{ic'} / \tau)}$ to obtain the attribution confidence of the i-th training sample for the c-th category, where $w_{ic}$ is the attribution confidence of the i-th training sample for the c-th category; $c'$ indexes the categories; $s_{ic'}$ is the cosine similarity between the i-th training sample and the original category prototype corresponding to the $c'$-th category; and $\tau$ is a preset temperature coefficient used to control the sharpness of the soft distribution.
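The two steps above, cosine similarity to each prototype followed by a temperature-controlled soft distribution, can be sketched as below. The softmax form is an assumption consistent with the description of a temperature coefficient that controls the sharpness of a soft distribution; the function names, the treatment of zero vectors, and the example values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def attribution_confidence(z, prototypes, tau=0.1):
    """Softmax over cosine similarities, scaled by the temperature tau."""
    sims = [cosine(z, p) for p in prototypes]
    top = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp((s - top) / tau) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# One fused vector against two hypothetical category prototypes.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
z = [0.6, 0.8]
sharp = attribution_confidence(z, prototypes, tau=0.1)  # low temperature
soft = attribution_confidence(z, prototypes, tau=1.0)   # high temperature
```

A smaller temperature concentrates the confidence on the most similar category, while a larger one spreads it more evenly, which is the sense in which the coefficient controls sharpness.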
[0063] It should be noted that the attribution confidence reflects the semantic affinity between the training sample and each category.
[0064] Furthermore, the refined category prototype is obtained by dynamically weighting the fused feature vectors of each training sample using the attribution confidence of each training sample for each category, by calculating $\mu_c = \dfrac{\sum_{i=1}^{N} w_{ic}\, z_i}{\sum_{i=1}^{N} w_{ic} + \epsilon_2}$ to obtain the refined category prototype corresponding to the c-th category, where $\mu_c$ is the refined category prototype corresponding to the c-th category; $w_{ic}$ is the attribution confidence of the i-th training sample for the c-th category; $z_i$ is the fused feature vector corresponding to the i-th training sample; and $\epsilon_2$ is a preset second constant with a value greater than 0, used to prevent division by zero. $\epsilon_2$ can be equal to the first constant; no restrictions are imposed here.
[0065] This facilitates obtaining model loss through dynamically weighted refined category prototypes, and then training the model. This not only effectively identifies and suppresses the negative impact of high-noise samples, but also makes reasonable use of some reliable weak supervision information to achieve progressive learning that "discards falsehoods and retains truth," thereby significantly improving the retrieval accuracy and stability of cross-modal hashing in real noisy environments.
[0066] It should be noted that, after the refined category prototype is obtained, L2-norm normalization can also be applied to it to eliminate scale differences between feature vectors. The normalized prototype is then used as the updated refined category prototype.
[0067] It should be noted that the weights assigned to the c-th category are summed to give the total weight of the c-th category. To avoid unreliable updates, if the total weight of a category is lower than a preset weight threshold, that category can be ignored in subsequent loss calculations. This ensures that only high-confidence semantic structures participate in the optimization.
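The prototype refinement, normalization, and weight-threshold check described above can be sketched as follows. This is an illustration under assumptions: the refined prototype is taken to be a confidence-weighted mean of the fused feature vectors, and `refine_prototypes`, `eps`, and `min_weight` are hypothetical names and values.

```python
import math

def refine_prototypes(features, confidences, eps=1e-6, min_weight=0.5):
    """Confidence-weighted mean of fused features per category, then L2-normalized.

    Returns (prototypes, active); active[c] is False when the total weight of
    category c falls below min_weight, so that category can be skipped later.
    """
    num_classes = len(confidences[0])
    dim = len(features[0])
    prototypes, active = [], []
    for c in range(num_classes):
        total = sum(conf[c] for conf in confidences)  # total weight of category c
        mean = [
            sum(conf[c] * feat[d] for feat, conf in zip(features, confidences)) / (total + eps)
            for d in range(dim)
        ]
        norm = math.sqrt(sum(x * x for x in mean))
        prototypes.append([x / (norm + eps) for x in mean])  # L2 normalization
        active.append(total >= min_weight)
    return prototypes, active

features = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
confidences = [[0.9, 0.1], [0.7, 0.3], [0.05, 0.95]]
protos, active = refine_prototypes(features, confidences)
```

Samples with low confidence for a category contribute little to its prototype, which is how noisy labels are kept from shifting the semantic center.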
[0068] Furthermore, the model loss is obtained based on the predicted probability distribution, sample labels, and prototypes of each refined category, including: obtaining the intra-class aggregation loss corresponding to the training sample based on each sample label, the prototype of the refined category, and the fused feature vector; obtaining the inter-class separation loss corresponding to the training sample based on the prototype of the refined category; obtaining the inter-instance contrast alignment loss corresponding to the training sample based on each sample label; obtaining the intra-instance positive label ranking loss based on each sample label and the sample class probability corresponding to each training sample; the sample class probability is obtained through the predicted probability distribution; and the model loss is obtained by jointly considering the intra-class aggregation loss, the inter-class separation loss, the inter-instance contrast alignment loss, and the intra-instance positive label ranking loss.
[0069] In this way, by combining the model loss obtained from intra-class aggregation loss, inter-class separation loss, inter-instance contrast alignment loss, and intra-instance positive label ranking loss for training, the consistency between class-level semantic structure and instance-level relationship is optimized in a coordinated manner. This facilitates the learning of a cross-modal binary representation that is both semantically consistent and highly discriminative under noisy multi-label supervision, significantly improving retrieval accuracy and generalization ability.
[0070] Furthermore, the intra-class aggregation loss corresponding to the training sample is obtained based on the sample labels, refined category prototypes, and fused feature vectors, including: obtaining the cosine similarity between each fused feature vector and each refined category prototype; and obtaining the intra-class aggregation loss based on each sample label and cosine similarity.
[0071] In this way, by obtaining the cosine similarity between each fused feature vector and each refined category prototype, and then obtaining the intra-class aggregation loss based on each sample label and cosine similarity, it is possible to encourage samples to move closer to the refined prototype of their respective category, thereby forming a compact semantic cluster, suppressing feature dispersion caused by noise, and thus enhancing intra-class compactness.
[0072] Furthermore, obtaining the cosine similarity between each fused feature vector and each refined category prototype includes: calculating ... to obtain the cosine similarity between the fused feature vector corresponding to the i-th training sample and the refined category prototype corresponding to the c-th category.
[0073] Furthermore, obtaining the intra-class aggregation loss based on the sample labels and the cosine similarity includes: calculating ... to obtain the intra-class aggregation loss. The calculation includes a preset third constant with a value greater than 0, which is used to prevent division by zero; it may be equal to the other preset constants, for example all set to 1e-6, and no restriction is imposed here.
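One plausible form of the intra-class aggregation loss is sketched below. It assumes the loss averages `1 - cosine similarity` over every (sample, category) entry whose label is 1, with a small constant in the denominator; the exact formula of the disclosure is not shown here, and `intra_class_loss` and `eps` are hypothetical names.

```python
def intra_class_loss(sims, labels, eps=1e-6):
    """Mean of (1 - similarity) over all positive (sample, category) entries.

    sims[i][c] is the cosine similarity between sample i and refined prototype c;
    labels[i][c] is 1 when sample i carries label c, otherwise 0.
    """
    total = 0.0
    count = 0.0
    for sim_row, label_row in zip(sims, labels):
        for s, y in zip(sim_row, label_row):
            total += y * (1.0 - s)
            count += y
    return total / (count + eps)  # eps guards against a batch with no positive labels

sims = [[0.9, 0.2], [0.1, 0.8]]
labels = [[1, 0], [1, 1]]
loss = intra_class_loss(sims, labels)
```

The loss shrinks as each sample moves closer to the refined prototypes of its own categories, which matches the stated goal of compact semantic clusters.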
[0074] Furthermore, obtaining the inter-class separation loss corresponding to the training samples based on the refined category prototypes includes: calculating ... to obtain the inter-class separation loss. The calculation ranges over pairs of different categories and includes a preset separation margin, which forces the refined category prototypes corresponding to any two different categories to maintain a minimum angular interval. Introducing this margin-based inter-class separation loss explicitly pushes different category prototypes away from each other, thereby preventing different categories from being confused in the embedding space and enhancing inter-class separability.
[0075] It should be noted that, with the margin-based inter-class separation loss, when two refined category prototypes are too close, that is, when their similarity exceeds a certain threshold, the inter-class separation loss is activated and imposes a penalty, encouraging each category to occupy an independent and highly discriminative region of the feature space.
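The margin behavior described above can be illustrated with a hinge-style penalty. This sketch assumes the penalty for a pair of prototypes is `max(0, cosine similarity - margin)` averaged over all distinct pairs; the threshold form, `inter_class_loss`, and `margin=0.2` are assumptions rather than the disclosed formula.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def inter_class_loss(prototypes, margin=0.2):
    """Average hinge penalty over distinct prototype pairs.

    A pair contributes only when its similarity exceeds the margin, so
    well-separated prototypes incur no penalty.
    """
    penalties = []
    for c in range(len(prototypes)):
        for k in range(c + 1, len(prototypes)):
            penalties.append(max(0.0, cosine(prototypes[c], prototypes[k]) - margin))
    return sum(penalties) / len(penalties) if penalties else 0.0

separated = inter_class_loss([[1.0, 0.0], [0.0, 1.0]])   # orthogonal prototypes
collapsed = inter_class_loss([[1.0, 0.0], [1.0, 0.0]])   # identical prototypes
```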
[0076] Furthermore, the inter-instance contrast alignment loss corresponding to the training samples is obtained based on the labels of each sample, including: constructing a cross-modal similarity matrix based on the modal data corresponding to each training sample; constructing several positive sample pairs and several negative sample pairs based on the labels of each sample; and obtaining the inter-instance contrast alignment loss based on the cross-modal similarity matrix, positive sample pairs, and negative sample pairs.
[0077] In this way, a cross-modal similarity matrix is constructed based on the modal data corresponding to each training sample; several positive sample pairs and several negative sample pairs are constructed based on the labels of each sample; then, the inter-instance contrast alignment loss is obtained based on the cross-modal similarity matrix, positive sample pairs, and negative sample pairs, so as to use the inter-instance contrast alignment loss to focus on the feature alignment of different modalities, and realize a three-dimensional optimization mechanism of positive sample merging, negative sample penalty, and self-alignment enhancement, which systematically improves the accuracy and robustness of cross-modal retrieval.
[0078] It should be noted that the cross-modal similarity matrix includes cross-modal similarity sub-matrices between different training samples and cross-modal similarity sub-matrices corresponding to each training sample.
[0079] Furthermore, constructing the cross-modal similarity matrix based on the modal data corresponding to each training sample includes: normalizing the modal data corresponding to each training sample; calculating ... to obtain the cross-modal similarity submatrix between the i-th training sample and the j-th training sample, where the calculation uses the normalized matrix of the first modality data corresponding to the i-th training sample and the normalized matrix of the first modality data corresponding to the j-th training sample; and calculating ... to obtain the cross-modal similarity submatrix corresponding to the i-th training sample.
[0080] It should be noted that the cross-modal similarity submatrices are all N×N real matrices.
[0081] Furthermore, constructing several positive sample pairs and several negative sample pairs based on each sample label includes: calculating ... to obtain a set of positive sample pairs, and determining each element of that set as a positive sample pair; and calculating ... to obtain a set of negative sample pairs, and determining each element of that set as a negative sample pair.
[0082] It should be noted that a positive sample pair is a pair of samples that share at least one positive label, that is, their sample labels both have the value 1 at one or more of the same positions. A negative sample pair is a pair of samples that share no label, that is, there is no position at which both sample labels have the value 1.
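The pair construction rule can be sketched as follows: two samples form a positive pair when their multi-hot labels share a 1 at some position, and a negative pair otherwise. Excluding a sample paired with itself is an assumption, and `build_pairs` is a hypothetical helper name.

```python
def build_pairs(labels):
    """Split ordered sample pairs (i, j), i != j, by whether their labels share a 1."""
    positives, negatives = set(), set()
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-pairs are left out of both sets in this sketch
            shared = any(a == 1 and b == 1 for a, b in zip(labels[i], labels[j]))
            (positives if shared else negatives).add((i, j))
    return positives, negatives

# Samples 0 and 1 share the third label; sample 2 shares nothing with either.
labels = [[1, 0, 1], [0, 0, 1], [0, 1, 0]]
pos, neg = build_pairs(labels)
```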
[0083] Furthermore, obtaining the inter-instance contrast alignment loss based on the cross-modal similarity matrix, the positive sample pairs, and the negative sample pairs includes: calculating ... to obtain the inter-instance contrast alignment loss.
[0084] It should be noted that, in the inter-instance contrast alignment loss, the first term brings semantically related samples closer together; the second term penalizes sample pairs that are semantically unrelated but similar in features; and the third term strengthens the self-alignment between the different modal features of the same sample.
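The three-term structure described above can be illustrated with a simplified loss. This is a sketch only: it assumes one feature vector per modality per sample, a similarity matrix `S[i][j]` between the image feature of sample i and the text feature of sample j, and illustrative term definitions; none of these forms or names are taken from the disclosure.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_modal_similarity(image_feats, text_feats):
    """S[i][j] = cosine similarity between image i and text j."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(img, txt)) for txt in txts] for img in imgs]

def alignment_loss(sim, positives, negatives):
    """Pull positive pairs together, penalize similar negatives, align each sample with itself."""
    pull = sum(1.0 - sim[i][j] for i, j in positives) / max(len(positives), 1)
    push = sum(max(0.0, sim[i][j]) for i, j in negatives) / max(len(negatives), 1)
    self_align = sum(1.0 - sim[i][i] for i in range(len(sim))) / len(sim)
    return pull + push + self_align

image_feats = [[1.0, 0.0], [0.0, 2.0]]
text_feats = [[3.0, 0.0], [0.0, 1.0]]
sim = cross_modal_similarity(image_feats, text_feats)
loss = alignment_loss(sim, positives=[], negatives=[(0, 1), (1, 0)])
```

In this toy case each image matches its own text and is orthogonal to the other sample's text, so all three terms vanish.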
[0085] Furthermore, the positive label ranking loss within an instance is obtained based on the sample label and the sample class probability corresponding to each training sample, including: obtaining the positive and negative label pairs corresponding to each training sample based on each sample label; and obtaining the positive label ranking loss within an instance based on the sample class probability and the positive and negative label pairs.
[0086] In this way, by obtaining the positive and negative label pairs corresponding to each training sample based on each sample label, and then obtaining the positive label ranking loss within the instance based on the sample class probability and the positive and negative label pairs, the ranking relationship within the multi-label data is modeled. This can compensate for the shortcomings of contrastive learning in modeling the label structure within the sample, thus making the prediction score for any positive label higher than that for any negative label, thereby preserving the relative ranking information in the multi-label data. Even if some labels are missing or incorrect, as long as reliable positive-negative pairs exist, the model can still learn effective discrimination boundaries, significantly improving robustness under incomplete supervision.
[0087] Furthermore, obtaining the positive and negative label pairs corresponding to each training sample based on each sample label includes: calculating ... to obtain the set of positive and negative label pairs corresponding to the i-th training sample, which includes all positive and negative label pairs of that sample. In each pair, the positive label is a position at which the sample label of the i-th training sample has the value 1, and the negative label is a position at which it has the value 0; any element of the set is one positive and negative label pair corresponding to the i-th training sample.
[0088] Furthermore, obtaining the intra-instance positive label ranking loss based on the sample class probability and the positive and negative label pairs includes: calculating ... to obtain the intra-instance positive label ranking loss. The calculation is taken over every positive and negative label pair in the set corresponding to the i-th training sample, and uses the probability that the m-th modality of the i-th training sample belongs to the positive-label category, the probability that the m-th modality of the i-th training sample belongs to the negative-label category, a sigmoid activation, and a binary cross-entropy loss.
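The ranking constraint can be sketched for a single sample and a single modality. The sketch assumes the loss is the binary cross-entropy between `sigmoid(p_pos - p_neg)` and a target of 1, averaged over all positive and negative label pairs; this equals `log(1 + exp(-(p_pos - p_neg)))`. The function names are hypothetical.

```python
import math

def label_pairs(label):
    """All (positive index, negative index) pairs for one multi-hot label vector."""
    pos = [k for k, y in enumerate(label) if y == 1]
    neg = [k for k, y in enumerate(label) if y == 0]
    return [(p, q) for p in pos for q in neg]

def ranking_loss(probs, label):
    """Mean BCE between sigmoid(p_pos - p_neg) and target 1 over all label pairs."""
    pairs = label_pairs(label)
    if not pairs:
        return 0.0  # no positive or no negative label, so nothing to rank
    losses = [math.log1p(math.exp(-(probs[p] - probs[q]))) for p, q in pairs]
    return sum(losses) / len(losses)

label = [1, 0, 1]
good = ranking_loss([0.9, 0.1, 0.8], label)  # positives scored above the negative
bad = ranking_loss([0.1, 0.9, 0.2], label)   # positives scored below the negative
```

The loss is lower when every positive label is scored above every negative label, which is the ordering the constraint is meant to preserve.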
[0089] Furthermore, obtaining the model loss by combining the intra-class aggregation loss, the inter-class separation loss, the inter-instance contrast alignment loss, and the intra-instance positive label ranking loss includes: calculating ... to obtain the model loss, where the calculation uses a preset first weight, a preset second weight, and a preset third weight.
[0090] It should be noted that the first weight, the second weight, and the third weight are all adjustable hyperparameters used to balance three objectives: the class-level semantic structure (intra-class compactness and inter-class separability), cross-modal alignment between instances, and intra-sample label ranking.
[0091] In some embodiments, the first weight, the second weight, and the third weight are set to preset values; other values can also be set, and there are no restrictions here.
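As an illustration only, one weighted sum that is consistent with the three balancing objectives described above is given below. The symbols and the grouping of the two class-level losses under a single weight are assumptions, not the disclosed formula.

```latex
\mathcal{L} = \lambda_1\left(\mathcal{L}_{\text{intra}} + \mathcal{L}_{\text{inter}}\right)
            + \lambda_2\,\mathcal{L}_{\text{align}}
            + \lambda_3\,\mathcal{L}_{\text{rank}}
```

Here the four loss symbols stand for the intra-class aggregation loss, the inter-class separation loss, the inter-instance contrast alignment loss, and the intra-instance positive label ranking loss, and the three lambda symbols stand for the first, second, and third weights.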
[0092] Furthermore, in step S104, the initial hash retrieval model is optimized using model loss to obtain the target hash retrieval model. That is, with the goal of minimizing model loss, the initial hash retrieval model is iteratively optimized using the Adam algorithm to obtain the target hash retrieval model. In some embodiments, the learning rate of the Adam algorithm can be set to 1e-4.
[0093] As shown in Figure 2, this disclosure provides a method for obtaining the model loss, the method comprising:
[0094] Step S201: Obtain the cosine similarity between each fused feature vector and each refined category prototype.
[0095] Step S202: Obtain the intra-class aggregation loss based on the label and cosine similarity of each sample.
[0096] Step S203: Obtain the inter-class separation loss corresponding to the training sample based on the refined category prototype.
[0097] Step S204: Construct a cross-modal similarity matrix based on the modal data corresponding to each training sample.
[0098] Step S205: Construct several positive sample pairs and several negative sample pairs based on each sample label.
[0099] Step S206: Obtain the inter-instance contrast alignment loss based on the cross-modal similarity matrix, positive sample pairs, and negative sample pairs.
[0100] Step S207: Obtain the positive and negative label pairs corresponding to each training sample based on each sample label; the sample class probability is obtained from the predicted probability distribution.
[0101] Step S208: Obtain the positive label ranking loss within the instance based on the sample class probability and the positive and negative label pairs.
[0102] Step S209: Combine intra-class aggregation loss, inter-class separation loss, inter-instance contrast alignment loss, and intra-instance positive label ranking loss to obtain the model loss.
[0103] In this embodiment, the cosine similarity between each fused feature vector and each refined category prototype is obtained separately. Intra-class aggregation loss is obtained based on each sample label and the cosine similarity. Inter-class separation loss is obtained based on the refined category prototypes. A cross-modal similarity matrix is constructed based on the modal data corresponding to each training sample. Then, several positive sample pairs and several negative sample pairs are constructed based on each sample label. Inter-instance contrast alignment loss is obtained based on the cross-modal similarity matrix, positive sample pairs, and negative sample pairs. Simultaneously, positive and negative label pairs corresponding to each training sample are obtained based on each sample label. Sample class probabilities are obtained through predicted probability distributions. Intra-instance positive label ranking loss is obtained based on the sample class probabilities and positive and negative label pairs. Finally, the model loss is obtained by combining the intra-class aggregation loss, inter-class separation loss, inter-instance contrast alignment loss, and intra-instance positive label ranking loss. The model loss obtained by combining intra-class aggregation loss, inter-class separation loss, inter-instance contrast alignment loss and intra-instance positive label ranking loss is used for training. This collaboratively optimizes the consistency between class-level semantic structure and instance-level relationship, making it easier to learn a cross-modal binary representation that is both semantically consistent and highly discriminative under noisy multi-label supervision, thus significantly improving retrieval accuracy and generalization ability.
[0104] In some embodiments, the model can be trained on data from four cross-modal datasets: MIRFlickr-25k, NUS-WIDE, MS-COCO, and IAPR TC-12. Each sample contains an image-text pair, along with a label set containing some correct labels and some incorrect labels.
[0105] For image data, the VGG19 model pre-trained on ImageNet can be used to extract the corresponding modal features; for text data, the trained Doc2Vec model can be used to extract the features to obtain the corresponding modal features.
[0106] Please refer to Table 1, which shows the data for the four cross-modal datasets.
[0107]
[0108] Table 1
[0109] As shown in Table 1, the MIRFlickr-25k dataset contains 18,015 data points, of which 2,000 can be selected as test data and 10,000 as training data; its average number of true labels per sample is 3.78, and its total number of categories is 24. The NUS-WIDE dataset contains 188,321 data points, of which 2,100 can be selected as test data and 10,500 as training data; its average number of true labels per sample is 2.09, and its total number of categories is 21. The MS-COCO dataset contains 117,218 data points, of which 5,000 can be selected as test data and 10,000 as training data; its average number of true labels per sample is 2.76, and its total number of categories is 80. The IAPR TC-12 dataset contains 18,000 data points, of which 2,000 can be selected as test data and 10,000 as training data; its average number of true labels per sample is 4.11, and its total number of categories is 255.
[0110] It should be noted that, after feature extraction, the length of the modal feature corresponding to the image modality is 4096 in all four datasets. The length of the modal feature corresponding to the text modality is 1386 in the MIRFlickr-25k dataset, 1000 in the NUS-WIDE dataset, 300 in the MS-COCO dataset, and 2912 in the IAPR TC-12 dataset.
[0111] After feature extraction, the modal features corresponding to each modality of each data to be trained are determined as each sample to be trained. Combined with the sample labels in the label set, the sample training set can be obtained.
[0112] Then the sample training set can be input into the preset initial hash retrieval model to obtain the predicted probability distribution corresponding to each sample to be trained.
[0113] It should be noted that the initial hash retrieval model includes initial hash classification sub-models corresponding to each modality. For example, there are initial hash models for image modalities and initial hash models for text modalities.
[0114] The initial hash model corresponding to each modality includes several linear layers, hash code linear layers, and classifier linear layers.
[0115] In some embodiments, please refer to Table 2 and Figure 3. Table 2 is the framework structure table of the initial hash model corresponding to the image modality, and Figure 3 is a schematic diagram of the framework of that model.
[0116]
[0117] Table 2
[0118] As shown in Table 2 and Figure 3, the initial hash model corresponding to the image modality includes a first linear layer 1, a second linear layer 2, a third linear layer 3, a first hash code linear layer 4, and a first classifier linear layer 5. The first linear layer 1 performs initial compression on the image data of the image modality for preliminary feature extraction; the dimension of this image data is the image feature length. It introduces non-linearity through a preset ReLU activation function, enabling the model to capture low-level image features and overcome the expressive limitations of linear models. Guided by a preset loss function, it is adjusted through backpropagation to initially screen task-related features and suppress noise interference. The second linear layer 2 introduces non-linearity through the ReLU activation function and further extracts intermediate image features from the 8096-dimensional data output by the first linear layer 1. The third linear layer 3 introduces non-linearity through the ReLU activation function and retains key image information from the 8096-dimensional data output by the second linear layer 2, preparing for subsequent hash code generation. The first hash code linear layer 4 converts the data output by the third linear layer 3 into a hash code of the hash representation length using a preset Tanh function. It should be noted that the compactness of the hash code (64 dimensions) supports efficient similarity retrieval (such as Hamming distance calculation), making it suitable for large-scale image retrieval, recommendation systems, and other scenarios. The first classifier linear layer 5 performs multi-label classification based on the hash code using a preset Sigmoid function to obtain the predicted probability distribution corresponding to the image modality.
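The layer sequence described above (linear layers with ReLU, a Tanh hash code layer, and a Sigmoid classifier layer) can be sketched as a forward pass. This is a toy illustration with small, assumed layer widths and random weights; the final sign binarization step and all names are assumptions, and the real widths are those of Table 2.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Random weights and zero bias for an n_in -> n_out layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, hidden_layers, hash_layer, cls_layer):
    for weights, bias in hidden_layers:
        x = [max(0.0, v) for v in linear(x, weights, bias)]  # linear layer + ReLU
    code = [math.tanh(v) for v in linear(x, *hash_layer)]    # relaxed hash code in (-1, 1)
    logits = linear(code, *cls_layer)
    probs = [1.0 / (1.0 + math.exp(-v)) for v in logits]     # per-category sigmoid
    bits = [1 if v >= 0 else -1 for v in code]               # assumed binarization for retrieval
    return code, probs, bits

rng = random.Random(0)
sizes = [16, 32, 32, 32]  # toy stand-ins for the input length and hidden widths
hidden = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
hash_layer = init_layer(sizes[-1], 8, rng)  # 8-bit code here instead of 64
cls_layer = init_layer(8, 4, rng)           # 4 toy categories
code, probs, bits = forward([rng.random() for _ in range(16)], hidden, hash_layer, cls_layer)
```

Because each category has its own sigmoid output, several categories can be predicted at once, which suits multi-label data.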
[0119] In other embodiments, please refer to Table 3 and Figure 4. Table 3 is the framework structure table of the initial hash model corresponding to the text modality, and Figure 4 is a schematic diagram of the framework of that model.
[0120]
[0121] Table 3
[0122] As shown in Table 3 and Figure 4, the initial hash model corresponding to the text modality includes a fourth linear layer 6, a fifth linear layer 7, a second hash code linear layer 8, and a second classifier linear layer 9. The fourth linear layer 6 performs initial compression on the text data of the text modality for preliminary feature extraction; the dimension of this text data is the text feature length. It introduces non-linearity through a preset ReLU activation function, enabling the model to capture low-level semantic features and overcome the expressive limitations of linear models. Guided by a preset loss function, it is adjusted through backpropagation to initially screen task-related features and suppress noise interference. The fifth linear layer 7 introduces non-linearity through the ReLU activation function and focuses on object-level semantic information in the 8096-dimensional data output by the fourth linear layer 6, retaining key semantic information while reducing dimensionality in preparation for subsequent hash code generation. The second hash code linear layer 8 converts the data output by the fifth linear layer 7 into a hash code of the hash representation length using a preset Tanh function. It should be noted that the compactness of the hash code (64 dimensions) supports efficient similarity retrieval (such as Hamming distance calculation), making it suitable for large-scale retrieval, recommendation systems, and other scenarios. The second classifier linear layer 9 performs multi-label classification based on the hash code using a preset Sigmoid function to obtain the predicted probability distribution corresponding to the text modality.
[0123] Then, based on the predicted probability distribution, sample labels, and prototypes of each refined category, the model loss is obtained, and the model loss is used to optimize the initial hash retrieval model to obtain the target hash retrieval model.
[0124] For image-to-text retrieval and text-to-image retrieval tasks, the Mean Average Precision (MAP) of the test data can be used as a core evaluation metric. MAP is a common metric in the cross-modal retrieval field. It is calculated as the average precision of each relevant sample in the retrieval results under different recall rates. It comprehensively considers the accuracy of the retrieval results (whether relevant samples are recalled) and the ranking quality (whether relevant samples are ranked highly), and can comprehensively reflect the semantic matching ability of the model.
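The MAP metric described above can be sketched as follows, assuming each query's retrieval result is given as a ranked list of 0/1 relevance flags. The function names are illustrative.

```python
def average_precision(relevance):
    """AP for one query: mean of precision@k taken at each rank k holding a relevant item."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_relevance):
    """MAP: the average of AP over all queries."""
    return sum(average_precision(r) for r in ranked_relevance) / len(ranked_relevance)

# Query 1 finds relevant items at ranks 1 and 3; query 2 finds one at rank 2.
queries = [[1, 0, 1], [0, 1]]
score = mean_average_precision(queries)
```

Relevant items ranked earlier raise the precision at their rank, so the metric rewards both recall of relevant samples and their ordering.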
[0125] The experiment used hash code lengths of 32, 64, and 128 bits to analyze the impact of bit length on retrieval performance. Symmetric noise was used: the labels of the data were randomly perturbed according to a uniform distribution. For example, under 60% symmetric noise, one label of 60% of the samples was randomly replaced with another category, while the labels of the remaining 40% of the samples remained correct. This noise model assumes that all categories have the same level of confusion.
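Symmetric label noise of this kind can be simulated with a short routine. The sketch assumes the noise rate is the fraction of samples that have one positive label swapped for a uniformly chosen category they do not already carry; `inject_symmetric_noise` and its parameters are hypothetical.

```python
import random

def inject_symmetric_noise(labels, noise_rate, seed=0):
    """Return a copy of multi-hot labels with one label replaced in a fraction of samples."""
    rng = random.Random(seed)
    noisy = [list(row) for row in labels]  # leave the input untouched
    n_corrupt = round(noise_rate * len(labels))
    for i in rng.sample(range(len(labels)), n_corrupt):
        positives = [c for c, y in enumerate(noisy[i]) if y == 1]
        candidates = [c for c, y in enumerate(noisy[i]) if y == 0]
        if positives and candidates:
            noisy[i][rng.choice(positives)] = 0   # drop one true label
            noisy[i][rng.choice(candidates)] = 1  # add one uniformly chosen wrong label
    return noisy

clean = [[1, 0, 0, 0] for _ in range(10)]
noisy = inject_symmetric_noise(clean, noise_rate=0.6, seed=1)
```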
[0126] The experiment used four noise rate settings: 0%, 20%, 50%, and 80%. The experimental results are shown in Figure 5 and Figure 6. Figure 5 is a table of the mean average precision of different cross-modal hashing algorithms or models trained on the MIRFlickr-25k and IAPR TC-12 databases at different noise rates. Figure 6 is a table of the mean average precision of different cross-modal hashing algorithms or models trained on the MS-COCO and NUS-WIDE databases at different noise rates.
[0127] In these figures, Flickr represents the MIRFlickr-25k database; IAPR represents the IAPR TC-12 database; COCO represents the MS-COCO database; and NUS represents the NUS-WIDE database. The highest value is highlighted in bold, and the second highest value is underlined. The higher the value, the better the performance of the model.
[0128] As can be seen from Figure 5 and Figure 6, the "ours" model, that is, the model in this scheme, performs best, and its mean average precision is greater than that of the other models.
[0129] It should be noted that during the retrieval, a pre-set retrieval sample library containing multiple retrieval samples is used. In step S105, the target hash retrieval model is used to retrieve the data to be retrieved. Specifically, this includes: First, obtaining the hash code corresponding to the data to be retrieved and the hash code corresponding to each retrieval sample in the retrieval sample library using the hash retrieval model. Second, calculating the Hamming distance between the hash code corresponding to the data to be retrieved and the hash codes corresponding to each retrieval sample. Third, sorting the obtained Hamming distances in ascending order; the sorted result is the retrieval result. The most similar sample has the smallest Hamming distance, which is the sample ranked first in terms of Hamming distance.
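The three retrieval steps above (encode, compute Hamming distances, sort in ascending order) can be sketched as follows, assuming hash codes are already available as lists of -1/+1 values. The names `hamming_distance` and `retrieve` are illustrative.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))

def retrieve(query_code, database_codes, top_k=None):
    """Return database indices sorted by ascending Hamming distance to the query."""
    ranked = sorted(
        range(len(database_codes)),
        key=lambda idx: hamming_distance(query_code, database_codes[idx]),
    )
    return ranked if top_k is None else ranked[:top_k]

query = [1, -1, 1, 1]
database = [[-1, 1, -1, -1], [1, -1, 1, -1], [1, -1, 1, 1]]
ranking = retrieve(query, database)
```

The first index in the ranking is the most similar retrieval sample, since it has the smallest Hamming distance to the query.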
[0130] It should be noted that this solution has the following effects:
[0131] 1. By collaboratively modeling class-aware prototypes and instance-aware relationships, and utilizing the semantic consistency between cross-modal fusion features and refined category prototypes, we adaptively calculate the confidence of samples to belong to each category, realize dynamic weighting of training samples, effectively mine the supervision signals of high-confidence samples and suppress noise interference, thereby obtaining a robust and discriminative hash representation.
[0132] 2. A two-level semantic alignment strategy was constructed: at the class level, a soft allocation mechanism was used to model the partial matching relationship between the sample and multiple categories, capturing the co-occurrence of labels and semantic overlap; at the instance level, positive and negative sample pairs were defined based on the intersection of labels—positive pairs were those with non-empty intersections, and negative pairs were those with no intersection at all. Continuous similarity was preserved for partially overlapping samples to avoid hard partitioning and achieve fine modeling of multi-label structures.
[0133] 3. An in-instance positive and negative label pair ordering constraint is introduced. By constructing all satisfying label pairs, the positive label prediction score is forced to be higher than the negative label. The score difference is supervised by binary cross-entropy, preserving the relative discrimination order within multiple labels and improving the robustness of the model under label missing or noise.
[0134] 4. It breaks through the dependence on high-quality complete annotations, requiring only a noisy multi-label candidate set for training. Through class-instance two-level semantic refinement and cross-modal consistency learning, it significantly reduces data costs without additional annotation cleaning, while maintaining excellent retrieval performance, promoting the application of cross-modal hashing in large-scale weakly supervised real-world scenarios.
[0135] As shown in Figure 7, this embodiment of the present disclosure provides a multimodal hash retrieval device 70, which includes: a construction module 71, an input module 72, a loss acquisition module 73, an optimization module 74, and a retrieval module 75.
[0136] The construction module 71 is configured to construct a sample training set based on several training samples and the sample labels corresponding to each training sample; the sample labels are used to indicate the category to which the training sample belongs.
[0137] Input module 72 is configured to input the sample training set into a preset initial hash retrieval model to obtain the predicted probability distribution corresponding to each training sample; the predicted probability distribution is obtained by classifying the refined category prototype; the refined category prototype is obtained by dynamically weighting the fused feature vector corresponding to each training sample using the category attribution confidence of each training sample; the attribution confidence is obtained through the fused feature vector; the fused feature vector is obtained by extracting and fusing the modal data corresponding to each training sample separately;
[0138] The loss acquisition module 73 is configured to acquire the model loss based on the predicted probability distribution, sample labels, and prototypes of each refined category.
[0139] Optimization module 74 is configured to optimize the initial hash retrieval model using model loss to obtain the target hash retrieval model;
[0140] The retrieval module 75 is configured to retrieve the data to be retrieved using a target hash retrieval model.
[0141] The multimodal hash retrieval device provided in this disclosure constructs a sample training set based on several training samples and sample labels indicating the category to which the training samples belong. The training set is then input into a preset initial hash retrieval model. Modal data corresponding to each training sample is extracted and fused. The fused feature vector is used to obtain the attribution confidence. The attribution confidence is then used to weight the fused feature vector to obtain refined category prototypes. The refined category prototypes are then classified to obtain a predicted probability distribution. Based on the predicted probability distribution, sample labels, and each refined category prototype, the model loss is obtained. The model loss is then used to optimize the initial hash retrieval model to obtain a target hash retrieval model, which facilitates the retrieval of the data to be retrieved using the target hash retrieval model. In this way, compared with existing technologies that rely on completely accurate supervision information, this scheme uses the fused feature vector obtained by fusing the modal data corresponding to each training sample to obtain the attribution confidence, and obtains the refined category prototype by weighting the fused feature vector based on the attribution confidence. This realizes the dynamic weighting of the fused feature vector, which can alleviate the interference of noisy labels on semantic center estimation. Thus, it can significantly reduce the semantic gap between modalities without the need for clean labels, thereby greatly improving the robustness of the model to multi-label noise while maintaining high retrieval accuracy.
[0142] Furthermore, the initial hash retrieval model obtains the fused feature vector as follows: it acquires the modal data corresponding to each training sample; it concatenates the modal data corresponding to each training sample to obtain the concatenated modal data corresponding to each training sample; it calculates the Euclidean norm of each concatenated modal data; and it calculates the fused feature vector corresponding to each training sample by dividing the concatenated modal data (the numerator) by its Euclidean norm (the denominator).
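The fusion step amounts to concatenation followed by L2 normalization. The sketch below assumes two modalities (image and text) already represented as feature lists; the function name `fuse_modalities` and the handling of an all-zero vector are assumptions made for illustration.

```python
import math


def fuse_modalities(image_feat, text_feat):
    """Concatenate per-modality features and divide by the Euclidean norm."""
    concatenated = list(image_feat) + list(text_feat)
    norm = math.sqrt(sum(x * x for x in concatenated))
    if norm == 0.0:
        # Assumption: an all-zero vector is returned unchanged.
        return concatenated
    return [x / norm for x in concatenated]


# Hypothetical two-dimensional features for each modality.
fused = fuse_modalities([3.0, 0.0], [0.0, 4.0])
```

Dividing by the norm places every fused feature vector on the unit sphere, so later cosine-similarity computations reduce to dot products.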
[0143] Furthermore, the initial hash retrieval model obtains the attribution confidence as follows: it initializes the category prototype corresponding to each category based on the fused feature vector to obtain the original category prototype; it calculates the cosine similarity between each training sample and each original category prototype based on the fused feature vector; and it obtains the attribution confidence of each training sample to each category based on the cosine similarity.
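A minimal sketch of these three steps follows. Two choices here are assumptions rather than details from the disclosure: the original category prototype is initialized as the normalized mean of the fused vectors carrying that (possibly noisy) label, and the cosine similarities are mapped to attribution confidence with a temperature-scaled softmax over categories. The names `init_prototypes`, `attribution_confidence`, and `temperature` are hypothetical.

```python
import math


def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def init_prototypes(fused, labels, num_classes):
    """Assumed initialization: normalized mean of vectors carrying each label."""
    dim = len(fused[0])
    protos = []
    for c in range(num_classes):
        members = [f for f, y in zip(fused, labels) if y[c] == 1]
        if members:
            mean = [sum(col) / len(members) for col in zip(*members)]
        else:
            mean = [0.0] * dim
        protos.append(normalize(mean))
    return protos


def attribution_confidence(fused, protos, temperature=0.1):
    """Assumed mapping: softmax over categories of cosine / temperature."""
    conf = []
    for f in fused:
        scaled = [cosine(f, p) / temperature for p in protos]
        m = max(scaled)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        z = sum(exps)
        conf.append([e / z for e in exps])
    return conf


# Hypothetical fused vectors with multi-hot labels over two categories.
fused = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
labels = [[1, 0], [1, 0], [0, 1]]
protos = init_prototypes(fused, labels, num_classes=2)
conf = attribution_confidence(fused, protos)
```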
[0144] Furthermore, the loss acquisition module is configured to acquire the model loss based on the predicted probability distribution, sample labels, and each refined category prototype in the following manner: acquiring the intra-class aggregation loss corresponding to the training sample based on each sample label, refined category prototype, and fused feature vector; acquiring the inter-class separation loss corresponding to the training sample based on the refined category prototype; acquiring the inter-instance contrast alignment loss corresponding to the training sample based on each sample label; acquiring the intra-instance positive label ranking loss based on each sample label and the sample class probability corresponding to each training sample, the sample class probability being obtained through the predicted probability distribution; and acquiring the model loss by combining the intra-class aggregation loss, inter-class separation loss, inter-instance contrast alignment loss, and intra-instance positive label ranking loss.
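One way to read the combination step is as a weighted sum of the four loss terms. The sketch below also shows one plausible inter-class separation loss, computed from the refined category prototypes alone. Both the weighted-sum form and the pairwise positive-cosine penalty are assumptions; the disclosure states only which inputs each loss uses, and the names `inter_class_separation`, `model_loss`, and `weights` are hypothetical.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def inter_class_separation(protos):
    """Assumed form: mean positive cosine similarity over distinct prototype pairs."""
    pairs = [(i, j) for i in range(len(protos)) for j in range(i + 1, len(protos))]
    if not pairs:
        return 0.0
    return sum(max(0.0, cosine(protos[i], protos[j])) for i, j in pairs) / len(pairs)


def model_loss(l_intra, l_inter, l_align, l_rank, weights=(1.0, 1.0, 1.0, 1.0)):
    """Assumed combination: weighted sum of the four loss terms."""
    terms = (l_intra, l_inter, l_align, l_rank)
    return sum(w * t for w, t in zip(weights, terms))


# Orthogonal prototypes are fully separated; identical ones are not.
separated = inter_class_separation([[1.0, 0.0], [0.0, 1.0]])
collapsed = inter_class_separation([[1.0, 0.0], [1.0, 0.0]])
total = model_loss(0.5, separated, 1.0, 0.25)
```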
[0145] Furthermore, the loss acquisition module is configured to obtain the intra-class aggregation loss corresponding to the training sample based on each sample label, refined category prototype, and fused feature vector in the following manner: obtaining the cosine similarity between each fused feature vector and each refined category prototype; and obtaining the intra-class aggregation loss based on each sample label and the cosine similarity.
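A minimal sketch of an intra-class aggregation loss built from these two ingredients is given below. It assumes the loss is the mean of `1 - cosine` between each fused feature vector and the refined prototypes of the categories its label marks as positive; that specific form and the function name are assumptions.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def intra_class_aggregation(fused, labels, protos):
    """Assumed form: mean (1 - cosine) between samples and their labeled prototypes."""
    total, count = 0.0, 0
    for vec, label in zip(fused, labels):
        for c, proto in enumerate(protos):
            if label[c] == 1:
                total += 1.0 - cosine(vec, proto)
                count += 1
    return total / count if count else 0.0


# Hypothetical data: samples aligned with, then orthogonal to, their prototypes.
fused = [[1.0, 0.0], [0.0, 1.0]]
protos = [[1.0, 0.0], [0.0, 1.0]]
aligned = intra_class_aggregation(fused, [[1, 0], [0, 1]], protos)
misaligned = intra_class_aggregation(fused, [[0, 1], [1, 0]], protos)
```

Minimizing this quantity pulls each sample toward the semantic centers of its labeled categories.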
[0146] Furthermore, the loss acquisition module is configured to acquire the inter-instance contrast alignment loss corresponding to the training sample based on each sample label in the following manner: constructing a cross-modal similarity matrix based on the modal data corresponding to each training sample; constructing several positive sample pairs and several negative sample pairs based on each sample label; and acquiring the inter-instance contrast alignment loss based on the cross-modal similarity matrix, positive sample pairs, and negative sample pairs.
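The following sketch walks through the three steps for image and text modalities. It assumes that two samples form a positive pair when they share at least one label, and it swaps in an InfoNCE-style multi-positive contrastive objective as the loss; neither choice is stated in the disclosure, and `temperature` is a hypothetical parameter.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def similarity_matrix(img, txt):
    """sim[i][j] = cosine similarity of image i and text j."""
    return [[cosine(a, b) for b in txt] for a in img]


def build_pairs(labels):
    """Assumed rule: (i, j) is positive if samples i and j share a label."""
    pos, neg = [], []
    for i, yi in enumerate(labels):
        for j, yj in enumerate(labels):
            shared = any(a == 1 and b == 1 for a, b in zip(yi, yj))
            (pos if shared else neg).append((i, j))
    return pos, neg


def contrast_alignment_loss(sim, pos, neg, temperature=0.5):
    """Assumed InfoNCE-style loss averaged over rows that have positives."""
    loss, rows = 0.0, 0
    for i in range(len(sim)):
        p = sum(math.exp(sim[i][j] / temperature) for a, j in pos if a == i)
        q = sum(math.exp(sim[i][j] / temperature) for a, j in neg if a == i)
        if p == 0:
            continue  # sample has no positive partner
        loss += -math.log(p / (p + q))
        rows += 1
    return loss / rows if rows else 0.0


labels = [[1, 0], [0, 1]]
pos, neg = build_pairs(labels)
img = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrast_alignment_loss(similarity_matrix(img, img), pos, neg)
swapped = contrast_alignment_loss(similarity_matrix(img, img[::-1]), pos, neg)
```

The loss is small when positive pairs are more similar across modalities than negative pairs, and large when the text features are swapped so that positives become dissimilar.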
[0147] Furthermore, the loss acquisition module is configured to obtain the intra-instance positive label ranking loss based on each sample label and the sample class probability corresponding to each training sample in the following manner: obtaining the positive and negative label pairs corresponding to each training sample based on each sample label; and obtaining the intra-instance positive label ranking loss based on the sample class probability and the positive and negative label pairs.
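A sketch of a pairwise ranking loss over positive and negative label pairs is given below. It assumes a hinge form with a fixed margin, penalizing any pair in which the predicted probability of a positive label does not exceed that of a negative label by at least the margin; the hinge form, the `margin` value, and the function names are assumptions.

```python
def label_pairs(label):
    """All (positive class, negative class) index pairs for one multi-hot label."""
    positives = [c for c, v in enumerate(label) if v == 1]
    negatives = [c for c, v in enumerate(label) if v == 0]
    return [(p, n) for p in positives for n in negatives]


def positive_label_ranking_loss(probs, labels, margin=0.5):
    """Assumed hinge form: mean of max(0, margin - (p_pos - p_neg)) over pairs."""
    total, count = 0.0, 0
    for prob, label in zip(probs, labels):
        for pos_c, neg_c in label_pairs(label):
            total += max(0.0, margin - (prob[pos_c] - prob[neg_c]))
            count += 1
    return total / count if count else 0.0


# Hypothetical class probabilities for one sample labeled with category 0 only.
labels = [[1, 0, 0]]
well_ranked = positive_label_ranking_loss([[0.9, 0.1, 0.2]], labels)
badly_ranked = positive_label_ranking_loss([[0.1, 0.9, 0.2]], labels)
```

The loss depends only on the relative order of probabilities within a single instance, so it is independent of how other samples in the batch are labeled.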
[0148] It should be noted that the multimodal hash retrieval device and the multimodal hash retrieval method provided in the above embodiments belong to the same concept. The specific manner in which each module and unit performs its operations has been described in detail in the method embodiments and will not be repeated here. In practical applications, the functions described above may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. No limitation is imposed in this respect.
[0149] As shown in Figure 8, this disclosure provides another multimodal hash retrieval device, including a processor 81 and a memory 82. Optionally, the device may further include a communication interface 83 and a bus 84. The processor 81, communication interface 83, and memory 82 can communicate with each other via the bus 84. The communication interface 83 can be used for information transmission. The processor 81 can call logic instructions in the memory 82 to execute the multimodal hash retrieval method of the above embodiment.
[0150] Furthermore, the logic instructions in the aforementioned memory 82 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0151] The memory 82, as a storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this disclosure. The processor 81 executes functional applications and data processing by running the program instructions / modules stored in the memory 82, that is, it implements the multimodal hash retrieval method in the above embodiments.
[0152] The memory 82 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 82 may include high-speed random access memory and may also include non-volatile memory.
[0153] This disclosure provides a storage medium storing computer-executable instructions configured to execute the above-described multimodal hash retrieval method.
[0154] The aforementioned storage medium may be either a transitory computer-readable storage medium or a non-transitory computer-readable storage medium. Non-transitory storage media include various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0155] The foregoing description and accompanying drawings fully illustrate embodiments of this disclosure to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operations may vary. Parts and features of some embodiments may be included in or replace parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of embodiments and claims, the singular forms "a," "an," and "the" are intended to equally include the plural forms unless the context clearly indicates otherwise. Similarly, the term "and / or" as used in this application means any one or more of the associated listed items and all possible combinations thereof. Additionally, when used in this application, the term "comprise" and its variations "comprises" and / or "comprising" refer to the presence of stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Without further limitations, an element defined by the phrase "comprises a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element. In this document, each embodiment may focus on its differences from other embodiments, and for similar or identical parts the embodiments may be referred to one another. For methods, products, etc., disclosed in the embodiments, if they correspond to the method section disclosed in the embodiments, reference may be made to the description of the method section for the relevant parts.
[0156] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this disclosure. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the systems, devices, and units described above, which will not be repeated here.
[0157] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. In some alternative implementations, the functions marked in the blocks may occur in a different order than that shown in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. In the descriptions corresponding to the flowcharts and block diagrams in the accompanying drawings, the operations or steps corresponding to different blocks may also occur in a different order than disclosed in the description, and sometimes there is no specific order between different operations or steps. For example, two consecutive operations or steps may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. Each block in a block diagram and / or flowchart, and combinations of blocks in a block diagram and / or flowchart, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
Claims
1. A training method based on a multimodal hash retrieval model, characterized in that it comprises: A sample training set is constructed based on several training samples and the sample label corresponding to each training sample; the sample labels are used to indicate the category to which the training sample belongs; The sample training set is input into a preset initial hash retrieval model to obtain the predicted probability distribution corresponding to each of the training samples; the predicted probability distribution is obtained by classifying the refined category prototype; the refined category prototype is obtained by dynamically weighting the fused feature vector corresponding to each of the training samples using the category attribution confidence of each of the training samples; the attribution confidence is obtained through the fused feature vector; the fused feature vector is obtained by extracting and fusing the modal data corresponding to each of the training samples respectively; The model loss is obtained based on the predicted probability distribution, the sample labels, and each of the refined category prototypes; The initial hash retrieval model is optimized using the model loss to obtain the target hash retrieval model; The target hash retrieval model is used to retrieve the data to be retrieved.
2. The method according to claim 1, characterized in that, The initial hash retrieval model obtains the fused feature vector in the following manner: Obtain the modal data corresponding to each of the training samples; Concatenate the modal data corresponding to each of the training samples to obtain the concatenated modal data corresponding to each of the training samples; Calculate the Euclidean norm corresponding to each concatenated modal data; Using the concatenated modal data as the numerator and the Euclidean norm as the denominator, calculate the fused feature vector corresponding to each of the training samples.
3. The method according to claim 1, characterized in that, The initial hash retrieval model obtains the attribution confidence in the following manner: The category prototype corresponding to each category is initialized based on the fused feature vector to obtain the original category prototype; Based on the fused feature vector, the cosine similarity between each training sample and each original category prototype is calculated; Based on the cosine similarity, the attribution confidence of each training sample to each category is obtained.
4. The method according to claim 1, characterized in that, The step of obtaining the model loss based on the predicted probability distribution, the sample labels, and each of the refined category prototypes includes: Based on each of the sample labels, the refined category prototypes, and the fused feature vectors, the intra-class aggregation loss corresponding to the training samples is obtained; Based on the refined category prototype, obtain the inter-class separation loss corresponding to the training sample; Based on each of the sample labels, obtain the inter-instance contrast alignment loss corresponding to the training sample; The intra-instance positive label ranking loss is obtained based on the sample labels and the sample category probabilities corresponding to each training sample; the sample category probabilities are obtained through the predicted probability distribution; The model loss is obtained by combining the intra-class aggregation loss, the inter-class separation loss, the inter-instance contrast alignment loss, and the intra-instance positive label ranking loss.
5. The method according to claim 4, characterized in that, The step of obtaining the intra-class aggregation loss corresponding to the training sample based on each of the sample labels, the refined category prototype, and the fused feature vector includes: The cosine similarity between each of the fused feature vectors and each of the refined category prototypes is obtained respectively; The intra-class aggregation loss is obtained based on the sample labels and the cosine similarity.
6. The method according to claim 4, characterized in that, The step of obtaining the inter-instance contrast alignment loss corresponding to the training sample based on each of the sample labels includes: Construct a cross-modal similarity matrix based on the modal data corresponding to each of the training samples; Based on the sample labels, construct several positive sample pairs and several negative sample pairs; The inter-instance contrast alignment loss is obtained based on the cross-modal similarity matrix, the positive sample pairs, and the negative sample pairs.
7. The method according to claim 4, characterized in that, The step of obtaining the intra-instance positive label ranking loss based on the sample labels and the sample category probabilities corresponding to each training sample includes: Based on the sample labels, obtain the positive and negative label pairs corresponding to each training sample; The intra-instance positive label ranking loss is obtained based on the sample category probabilities and the positive and negative label pairs.
8. A multimodal hash retrieval device, characterized in that it comprises: The construction module is configured to construct a sample training set based on a number of training samples and the sample label corresponding to each training sample; the sample labels are used to indicate the category to which the training sample belongs; The input module is configured to input the sample training set into a preset initial hash retrieval model to obtain the predicted probability distribution corresponding to each of the training samples; the predicted probability distribution is obtained by classifying the refined category prototype; the refined category prototype is obtained by dynamically weighting the fused feature vector corresponding to each of the training samples using the category attribution confidence of each of the training samples; the attribution confidence is obtained through the fused feature vector; the fused feature vector is obtained by extracting and fusing the modal data corresponding to each of the training samples respectively; The loss acquisition module is configured to acquire the model loss based on the predicted probability distribution, the sample labels, and each of the refined category prototypes; The optimization module is configured to optimize the initial hash retrieval model using the model loss to obtain the target hash retrieval model; The retrieval module is configured to retrieve the data to be retrieved using the target hash retrieval model.
9. An electronic device comprising a processor and a memory storing program instructions, characterized in that, The processor is configured to execute the multimodal hash retrieval method as described in any one of claims 1 to 7 when running the program instructions.
10. A storage medium storing program instructions, characterized in that, When the program instructions are executed, they perform the multimodal hash retrieval method as described in any one of claims 1 to 7.