A robust cross-modal retrieval method and apparatus

By combining a dual-model multimodal classifier and a contrastive learning model, the problem of decreased cross-modal retrieval performance under noisy labels is solved, achieving higher accuracy and robustness.

CN122309768A · Pending · Publication Date: 2026-06-30 · UNIV OF SCI & TECH BEIJING +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
UNIV OF SCI & TECH BEIJING
Filing Date
2026-03-17
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing cross-modal retrieval methods learn biased semantic correspondences in the presence of noisy labels, leading to a significant performance degradation.

Method used

The method employs a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning. During training on the multimodal dataset, the dual-model classifier separates the clean samples into a clean dataset, and contrastive learning maps the different modalities of that clean dataset into the same feature space. The model thereby learns the mapping relationship between data from different modalities, and the resulting cross-modal knowledge is used for cross-modal retrieval.

Benefits of technology

It effectively reduces the interference of noisy multimodal data on model training, improves the accuracy and robustness of cross-modal retrieval, and enhances the model's ability to perform cross-modal semantic alignment and representation in noisy environments.



Abstract

This invention discloses a robust cross-modal retrieval method and apparatus, belonging to the field of artificial intelligence technology. The method includes: acquiring multimodal data and constructing a multimodal dataset; training a pre-defined cross-modal learning model using the multimodal dataset; wherein the model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning; the classifier is used to divide clean data in the multimodal dataset into a clean dataset; the cross-modal pre-trained model is used to map data of different modalities in the clean dataset to the same feature space through contrastive learning, thereby obtaining cross-modal knowledge; and cross-modal retrieval is achieved based on the obtained cross-modal knowledge. This invention can effectively reduce the interference of noisy multimodal data on model training, improve the accuracy and robustness of cross-modal retrieval, and enhance the model's ability to learn cross-modal semantic alignment and representation in noisy environments.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a robust cross-modal retrieval method and apparatus.

Background Technology

[0002] With the rapid development of the Internet, the amount of multimodal data (such as images and text) has increased dramatically, and cross-modal retrieval (CMR) has therefore received continuous attention. This task aims to retrieve relevant data (such as images) of another modality using query data of one modality (such as text), and has important application value in multimedia content management, intelligent retrieval systems and other fields.

[0003] Existing methods typically map data from different modalities to the same semantic representation space and utilize label information to distinguish samples of different categories, thereby achieving cross-modal semantic alignment. However, in practical applications, the annotation of multimodal data often contains noise, including mislabels and missing labels. This label noise leads to a significant performance degradation of existing methods that rely on clean labels, because noisy labels cause the model to learn biased semantic correspondences, thus affecting retrieval results. Therefore, how to achieve robust cross-modal retrieval in the presence of label noise has become a key challenge in current research.

[0004] In summary, existing cross-modal retrieval methods suffer from the drawback of learning biased semantic correspondences due to noisy labels, which leads to a significant performance degradation.

Summary of the Invention

[0005] This invention provides a robust cross-modal retrieval method and apparatus to solve the technical problem that existing cross-modal retrieval methods suffer from biased semantic correspondences learned by the model due to noisy labels, which leads to a significant decrease in cross-modal retrieval performance.

[0006] To solve the above-mentioned technical problem, the present invention provides the following technical solution: In one aspect, the present invention provides a robust cross-modal retrieval method, comprising: acquiring multimodal data and constructing a multimodal dataset; training a pre-defined cross-modal learning model using the multimodal dataset, enabling the model to learn the mapping relationships between data from different modalities and obtain cross-modal knowledge, wherein the cross-modal learning model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning, the dual-model multimodal classifier is used to partition the clean data in the multimodal dataset into a clean dataset, and the cross-modal pre-trained model based on contrastive learning is used to map data from different modalities in the clean dataset to the same feature space through contrastive learning, learning the mapping relationships between data from different modalities and obtaining cross-modal knowledge; and achieving cross-modal retrieval based on the obtained cross-modal knowledge.

[0007] Furthermore, the multimodal data includes two different modalities: text data and image data.

[0008] Furthermore, the dual-model multimodal classifier includes a first multimodal classifier and a second multimodal classifier. The step of dividing the clean data in the multimodal dataset into a clean dataset includes: inputting the multimodal data into a dual-branch coding network for feature extraction to obtain text features and image features; inputting the text features and image features into the first multimodal classifier to obtain the predicted values of the semantic labels of the corresponding samples, calculating the confidence of each sample based on the predicted values and the true values of the semantic labels, and considering a sample to be clean data when its confidence is greater than a preset threshold; inputting the text features and image features into the second multimodal classifier and applying the same prediction, confidence calculation, and threshold test; and, for a given sample, assigning it to the clean dataset only if both the first multimodal classifier and the second multimodal classifier consider it to be clean data.

[0009] Furthermore, the first and second multimodal classifiers have the same structure, both consisting of several fully connected layers.

[0010] Furthermore, the dual-branch coding network encodes image data using a 19-layer VGGNet to obtain image features; and the dual-branch coding network encodes text data using a multilayer perceptron to obtain text features.

[0011] Furthermore, both the first and second multimodal classifiers use a classification loss as the loss function. The classification loss is expressed as: [formula not reproduced in the source text]; where the formula combines two terms: the classification loss when predicting the semantic labels of samples from image features, and the classification loss when predicting the semantic labels of samples from text features.

[0012] Furthermore, the cross-modal pre-trained model based on contrastive learning employs the InfoNCE loss function.

[0013] Furthermore, the loss function of the cross-modal learning model is a combination of classification loss and InfoNCE loss function.

[0014] Furthermore, the loss function of the cross-modal learning model is expressed as: [formula not reproduced in the source text]; where the formula defines the loss function of the cross-modal learning model in terms of the classification loss, the InfoNCE loss, and a preset trade-off parameter.

[0015] In another aspect, the present invention also provides a robust cross-modal retrieval device, comprising: a data acquisition module, used to acquire multimodal data and construct a multimodal dataset; a cross-modal knowledge learning module, used to train a pre-defined cross-modal learning model using the multimodal dataset, enabling the model to learn the mapping relationships between data from different modalities and obtain cross-modal knowledge, wherein the cross-modal learning model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning, the dual-model multimodal classifier is used to partition the clean data in the multimodal dataset into a clean dataset, and the cross-modal pre-trained model based on contrastive learning is used to map data from different modalities in the clean dataset to the same feature space through contrastive learning, learning the mapping relationships between data from different modalities and obtaining cross-modal knowledge; and a cross-modal retrieval module, used to perform cross-modal retrieval based on the obtained cross-modal knowledge.

[0016] In another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory stores at least one instruction, which is loaded and executed by the processor to implement the above-described method.

[0017] In another aspect, the present invention also provides a computer-readable storage medium storing at least one instruction, which is loaded and executed by a processor to implement the above method.

[0018] The beneficial effects of the technical solution provided by this invention include at least the following: The cross-modal retrieval method provided by this invention trains a pre-defined cross-modal learning model using a multimodal dataset. The cross-modal learning model includes a dual-model multimodal classifier and a contrastive learning-based cross-modal pre-trained model. The dual-model multimodal classifier is used to partition clean data in the multimodal dataset into a clean dataset. The contrastive learning-based cross-modal pre-trained model uses contrastive learning to map data from different modalities in the clean dataset to the same feature space, learning the mapping relationships between data from different modalities and obtaining cross-modal knowledge. Based on the obtained cross-modal knowledge, cross-modal retrieval is achieved. This effectively reduces the interference of noisy multimodal data on model training, improves the accuracy and robustness of cross-modal retrieval, and enhances the model's ability to learn cross-modal semantic alignment and representation in noisy environments.

Description of the Drawings

[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0020] Figure 1 is a schematic diagram of the execution flow of the robust cross-modal retrieval method provided in the embodiments of the present invention; Figure 2 is a structural block diagram of the cross-modal learning model provided in an embodiment of the present invention; Figure 3 is a structural block diagram of the robust cross-modal retrieval device provided in the embodiments of the present invention; Figure 4 is a structural block diagram of the electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

[0022] First, it should be noted that in the embodiments of the present invention, the words "exemplarily," "for example," etc., are used to indicate that they are examples, illustrations, or descriptions. Any embodiment or design scheme described as "exemplary" in the present invention should not be construed as being more preferred or advantageous than other embodiments or design schemes. Specifically, the use of the term "exemplarily" is intended to present the concept in a specific manner. Furthermore, in the embodiments of the present invention, the meaning expressed by "and / or" can be both, or it can be either one or the other.

[0023] First Embodiment

[0024] This embodiment provides a robust cross-modal retrieval method, which can be implemented by an electronic device, such as a terminal or a server. The execution flow of this method is shown in Figure 1 and includes the following steps: S1, acquire multimodal data and construct a multimodal dataset. The multimodal data includes two different modalities: text data and image data. For example, when the solution of this invention is used for geographic information retrieval, the two modalities can be a text describing the geographic information (topography, hydrology, water bodies, etc.) of a certain geographic location and a satellite remote sensing image corresponding to that geographic location.

[0025] S2, train the pre-defined cross-modal learning model using the multimodal dataset, enabling the model to learn the mapping relationship between data from different modalities and obtain cross-modal knowledge; wherein the cross-modal learning model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning; the dual-model multimodal classifier is used to divide the clean data in the multimodal dataset into a clean dataset; the cross-modal pre-trained model based on contrastive learning is used to map data from different modalities in the clean dataset to the same feature space through contrastive learning, learn the mapping relationship between data from different modalities, and obtain cross-modal knowledge. The cross-modal learning model in this embodiment is shown in Figure 2.

[0026] Dividing the clean data in the multimodal dataset into a clean dataset includes: 1. Input the multimodal data into a dual-branch coding network for feature extraction to obtain text features and image features.

[0027] For image feature extraction, this embodiment uses the convolutional layers of a 19-layer VGGNet, pre-trained on ImageNet, as the convolutional layers of ImgNet, the network for the image modality. A 4096-dimensional feature vector is taken from the fc7 layer as the high-level semantic representation of each image [formula not reproduced in the source text]. Several fully connected layers then map this feature to a common representation [formula not reproduced in the source text], whose dimension is the dimension of the common representation space.

[0028] For text feature extraction, TxtNet uses a multilayer perceptron (MLP) pre-trained on a general classification task. The MLP consists of two fully connected layers and generates the text feature [formula not reproduced in the source text]. Several fully connected layers then convert this feature to a common text representation [formula not reproduced in the source text], whose dimension is the dimension of the common representation space.

[0029] Based on the above, this embodiment employs a dual-branch coding network to project multimodal data into the same semantic space, obtaining common feature representations of images and text. Thus, the features of images and text are input into a shared representation space to learn cross-modal high-order similarity representations of images and text, optimizing the modality-invariant common representation.
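The dual-branch projection described above can be illustrated with a minimal sketch. It is not the embodiment's network: the layer sizes, the random initialisation, the ReLU activation, and all names (`make_layer`, `project`, `COMMON_DIM`) are assumptions chosen for illustration, and a 16-dimensional vector stands in for the 4096-dimensional fc7 feature. The sketch only shows that two modality-specific stacks of fully connected layers can bring features of different sizes into one common space.

```python
import random

def linear(x, weights, bias):
    """Apply one fully connected layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(in_dim, out_dim, rng):
    """Randomly initialised layer (assumption); a trained model would load learned weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim

def project(x, layers):
    """Run a feature vector through a stack of fully connected layers."""
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:  # no activation after the last layer
            x = relu(x)
    return x

rng = random.Random(0)
COMMON_DIM = 8  # assumed dimension of the common representation space

# Modality-specific heads; input sizes 16 and 10 are illustrative stand-ins.
image_head = [make_layer(16, 12, rng), make_layer(12, COMMON_DIM, rng)]
text_head = [make_layer(10, 12, rng), make_layer(12, COMMON_DIM, rng)]

image_feature = [rng.random() for _ in range(16)]
text_feature = [rng.random() for _ in range(10)]

image_common = project(image_feature, image_head)
text_common = project(text_feature, text_head)
```

Because both outputs have the same dimension, they can be compared directly, which is what the later classification and contrastive steps rely on.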

[0030] 2. Based on the prediction of the dual-model classifier and the original labeling information, a confidence calculation and model consensus strategy is adopted to divide the multimodal training data into a clean set and a noisy set, so as to achieve robust filtering of noisy labels.

[0031] Specifically, the dual-model multimodal classifier predicts the semantic labels of the data, calculates a confidence score from the predictions and the original labels, and employs a dual-model consensus strategy to assign the data that both models consider clean to the clean dataset. Each multimodal classifier network consists of several fully connected layers and takes the common representations of both modalities as input to predict the class labels, as shown below:

[formula not reproduced in the source text]

[0032] where the first symbol denotes a function (its name is not preserved in the source text) and the second denotes the parameters of the classifier network. This embodiment trains the multimodal classifier network by calculating the classification loss on labeled data:

[formula not reproduced in the source text]

[0033] where one symbol denotes the total number of training image-text pairs and another denotes the label of the i-th image-text pair.

[0034] Then, in this embodiment, the multi-label classification losses of the two modalities are combined as follows:

[formula not reproduced in the source text]

[0035] Then, the confidence that a sample is a clean sample is calculated from the predicted labels and the original labels:

[formula not reproduced in the source text]

[0036] where the symbols denote, respectively, the original image label, the predicted image label, the original text label, and the predicted text label, and a further index marks variables belonging to model A or model B.

[0037] Subsequently, a dual-model collaborative strategy is adopted, and clean samples are distinguished from noisy samples based on a given partition threshold:

[formula not reproduced in the source text]

[0038] In this way, based on adaptive multimodal noise filtering, the common representation features of the multimodal data can be separated into high-confidence labeled data and low-confidence data, resulting in a clean sample dataset and a noisy sample dataset.
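The dual-model consensus partition can be sketched as follows. The confidence formula of this embodiment is not available in this text, so the sketch substitutes an assumed one: the mean probability that a model's image branch and text branch assign to the annotated label. The function names, the threshold of 0.5, and the toy predictions are likewise hypothetical. What the sketch does follow from the description is the decision rule: a sample enters the clean set only when both models rate it above the threshold.

```python
def confidence(image_probs, text_probs, label):
    """Assumed confidence: mean probability both branches give to the annotated label."""
    return 0.5 * (image_probs[label] + text_probs[label])

def split_clean_noisy(labels, preds_a, preds_b, threshold=0.5):
    """Partition sample indices by dual-model consensus.

    preds_a / preds_b hold, per sample, a pair (image_probs, text_probs)
    produced by model A and model B respectively.
    """
    clean, noisy = [], []
    for idx, label in enumerate(labels):
        conf_a = confidence(*preds_a[idx], label)
        conf_b = confidence(*preds_b[idx], label)
        if conf_a > threshold and conf_b > threshold:
            clean.append(idx)   # both models agree the label looks reliable
        else:
            noisy.append(idx)   # at least one model doubts the label
    return clean, noisy

# Toy data: three samples, three classes, annotated labels 0, 1, 2.
labels = [0, 1, 2]
preds_a = [([0.9, 0.05, 0.05], [0.8, 0.1, 0.1]),
           ([0.2, 0.7, 0.1], [0.1, 0.8, 0.1]),
           ([0.6, 0.3, 0.1], [0.5, 0.4, 0.1])]
preds_b = [([0.85, 0.1, 0.05], [0.9, 0.05, 0.05]),
           ([0.5, 0.4, 0.1], [0.6, 0.3, 0.1]),
           ([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])]

clean, noisy = split_clean_noisy(labels, preds_a, preds_b, threshold=0.5)
```

In the toy data, the second sample is accepted by model A but rejected by model B, so the consensus rule sends it to the noisy set. This is the behaviour that distinguishes a consensus strategy from filtering with a single classifier.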

[0039] 3. A cross-modal pre-trained model based on contrastive learning is adopted. Through contrastive learning, using a clean sample set and a dynamically updated matching matrix, and employing the cross-modal InfoNCE loss function, a lower bound of mutual information between image and text representations is learned, achieving robust and unbiased cross-modal representation alignment under noisy supervision. This allows for the construction of category-level similarity on clean data based on labeled and predicted tags, establishing contrastive alignment of cross-modal representations.

[0040] In this embodiment, high-confidence labeled data is used to construct category-level similarity on clean data based on labeled and predicted labels, and to establish a contrast alignment of cross-modal representations.
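As an illustration of the cross-modal InfoNCE loss named in step 3, the following is a minimal sketch under stated assumptions. It uses cosine similarity scaled by a temperature, and it treats every entry marked in a 0/1 matching matrix as a positive pair by summing the positives in the numerator. That multi-positive form, the symmetric averaging over both retrieval directions, the temperature of 0.1, and all function names are assumptions, since the embodiment's exact formula is not available in this text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(queries, keys, match, temperature=0.1):
    """One-directional InfoNCE, averaged over queries with at least one positive key."""
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        peak = max(logits)                      # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        positives = sum(e for e, is_pos in zip(exps, match[i]) if is_pos)
        if positives > 0:
            losses.append(-math.log(positives / sum(exps)))
    return sum(losses) / len(losses) if losses else 0.0

def cross_modal_info_nce(image_feats, text_feats, match, temperature=0.1):
    """Average of the image-to-text and text-to-image directions."""
    transposed = [list(column) for column in zip(*match)]
    return 0.5 * (info_nce(image_feats, text_feats, match, temperature)
                  + info_nce(text_feats, image_feats, transposed, temperature))

images = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.1, 0.9]]
texts_swapped = [[0.1, 0.9], [0.9, 0.1]]
match = [[1, 0], [0, 1]]   # match[i][j] == 1 marks image i and text j as a positive pair

loss_aligned = cross_modal_info_nce(images, texts_aligned, match)
loss_swapped = cross_modal_info_nce(images, texts_swapped, match)
```

The aligned features yield a much smaller loss than the swapped ones, which is the pull-together and push-apart behaviour the contrastive objective is meant to reward.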

[0041] Specifically, this embodiment employs the InfoNCE loss, which encourages the model to distinguish positive sample pairs, i.e., semantically aligned cross-modal samples, from a set of negative sample pairs (dissimilar samples), thereby pulling related representations closer together in the representation space while pushing unrelated representations apart. To bridge the cross-modal semantic gap, this embodiment uses a cross-modal InfoNCE objective, which explicitly aligns the representations of different modalities by treating cross-modal matched sample pairs as positive samples. Positive and negative sample pairs can be formally represented by a correspondence matrix as follows:

[formula not reproduced in the source text]

[0042] where the matrix entry for a sample pair indicates that the pair is a positive pair; otherwise, the pair is a negative pair.

[0043] The cross-modal InfoNCE loss is defined as follows:

[formula not reproduced in the source text]

[0044] where the similarity function is cosine similarity and the remaining symbol is the temperature parameter. Combining the above two parts yields the curriculum-based contrastive loss for multi-label cross-modal learning in this embodiment:

[formula not reproduced in the source text]

[0045] 4. Based on the joint optimization of multimodal classification loss and cross-modal contrastive loss, a unified objective function is constructed. By dynamically balancing classification accuracy and representation alignment through parameter trade-offs, end-to-end learning and generalization performance improvement of shared semantic space are achieved.

[0046] Specifically, in this embodiment, label prediction is improved by minimizing the multimodal classification loss, while the shared representation space of the two modalities is optimized by minimizing the cross-modal contrastive loss. Therefore, this embodiment combines the two losses to obtain the final objective function as follows:

[formula not reproduced in the source text]

[0047] where the trade-off parameter adjusts the relative contributions of the first and second components of the objective function. In one feasible implementation, the trade-off parameter can take a value within a range (the range is not reproduced in the source text), which can be determined based on the specific application scenario and dataset characteristics.

[0048] For example, the value that achieves optimal performance can be selected from multiple candidate values through validation set evaluation, cross-validation, grid search, or empirical parameter tuning. In some embodiments, the value can be selected from 0.01, 0.05, 0.1, 0.2, 1, 2, 5, or 10. A smaller value can be used when the influence of the balancing term is weak; a larger value can be used when the influence of the balancing term is strong. The optimal value may vary across datasets or tasks, and this invention does not limit it.
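The selection of the trade-off parameter from candidate values can be sketched as a simple grid search. Two points are assumptions: the objective is written as the classification loss plus the trade-off parameter times the contrastive loss (this text does not show which term the parameter multiplies), and `toy_validation_score` is a hypothetical stand-in for training with a given value and measuring retrieval quality on a validation set. Only the list of candidate values comes from the description.

```python
import math

def total_loss(classification_loss, contrastive_loss, trade_off):
    """Assumed form of the joint objective: L_cls + trade_off * L_contrastive."""
    return classification_loss + trade_off * contrastive_loss

# Candidate values listed in the description.
CANDIDATES = [0.01, 0.05, 0.1, 0.2, 1, 2, 5, 10]

def select_trade_off(evaluate, candidates=CANDIDATES):
    """Grid search: return the candidate with the highest validation score."""
    return max(candidates, key=evaluate)

def toy_validation_score(trade_off):
    """Hypothetical score curve peaking at 0.2; a real run would train and evaluate a model."""
    return -abs(math.log10(trade_off) - math.log10(0.2))

best = select_trade_off(toy_validation_score)
```

In practice `evaluate` would be the expensive part, since each candidate requires a full training run. That cost is one reason to keep the candidate list short and roughly logarithmically spaced, as the listed values are.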

[0049] S3, perform cross-modal retrieval based on the obtained cross-modal knowledge.
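Once both modalities are mapped into the same feature space, retrieval can be carried out by ranking gallery items by their similarity to the query. The following is a minimal sketch assuming cosine similarity as the ranking score; the function names, the three-dimensional toy vectors, and the `top_k` parameter are illustrative assumptions rather than details of the embodiment.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query, gallery, top_k=2):
    """Return the indices of the top_k gallery items most similar to the query."""
    scored = sorted(((cosine(query, item), idx) for idx, item in enumerate(gallery)),
                    reverse=True)
    return [idx for _, idx in scored[:top_k]]

# A text query and an image gallery, both already projected into the common space.
text_query = [0.9, 0.1, 0.0]
image_gallery = [[0.0, 1.0, 0.0],
                 [1.0, 0.0, 0.0],
                 [0.7, 0.7, 0.0]]

ranking = retrieve(text_query, image_gallery, top_k=2)
```

The same function serves image-to-text retrieval by swapping the roles of query and gallery, since both modalities share one space.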

[0050] In summary, this embodiment provides a cross-modal retrieval method by training a pre-defined cross-modal learning model using a multimodal dataset. The cross-modal learning model includes a dual-model multimodal classifier and a contrastive learning-based cross-modal pre-trained model. The dual-model multimodal classifier is used to partition clean data from the multimodal dataset into a clean dataset. The contrastive learning-based cross-modal pre-trained model uses contrastive learning to map data from different modalities in the clean dataset to the same feature space, learning the mapping relationships between data from different modalities and obtaining cross-modal knowledge. Based on the obtained cross-modal knowledge, cross-modal retrieval is achieved. This effectively reduces the interference of noisy multimodal data on model training, improves the accuracy and robustness of cross-modal retrieval, and enhances the model's ability to learn cross-modal semantic alignment and representation in noisy environments.

[0051] Second Embodiment

[0052] This embodiment provides a robust cross-modal retrieval device which, as shown in Figure 3, includes the following modules: a data acquisition module, used to acquire multimodal data and construct a multimodal dataset; a cross-modal knowledge learning module, used to train a pre-defined cross-modal learning model using the multimodal dataset, enabling the model to learn the mapping relationships between data from different modalities and obtain cross-modal knowledge, wherein the cross-modal learning model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning, the dual-model multimodal classifier is used to partition the clean data in the multimodal dataset into a clean dataset, and the cross-modal pre-trained model based on contrastive learning is used to map data from different modalities in the clean dataset to the same feature space through contrastive learning, learning the mapping relationships between data from different modalities and obtaining cross-modal knowledge; and a cross-modal retrieval module, used to perform cross-modal retrieval based on the obtained cross-modal knowledge.

[0053] It should be noted that, for ease of explanation, Figure 3 shows only the main components of the device. Furthermore, the robust cross-modal retrieval device of this embodiment corresponds to the robust cross-modal retrieval method of the first embodiment described above; the functions implemented by each functional module in the robust cross-modal retrieval device of this embodiment correspond one-to-one with the process steps in the robust cross-modal retrieval method described above; therefore, they will not be described again here.

[0054] Third Embodiment

[0055] This embodiment provides an electronic device. As shown in Figure 4, the electronic device includes a processor and a memory; the processor and the memory can be connected via a communication bus; the memory stores at least one instruction, which is loaded and executed by the processor to implement the method of the first embodiment described above. Furthermore, the electronic device may also include a transceiver; the processor and the transceiver can be connected via a communication bus, and the transceiver is used to communicate with other devices.

[0056] Each component of this electronic device is described in detail below in conjunction with Figure 4: The processor is the control center of the electronic device. The electronic device may include multiple processors, each of which can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). The term "processor" can refer to a single processor or a collective term for multiple processing elements. For example, a processor can be one or more central processing units (CPUs), other general-purpose processors, application-specific integrated circuits (ASICs), or one or more integrated circuits configured to implement embodiments of the present invention, such as one or more digital signal processors (DSPs), one or more field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor can perform various functions of the electronic device by running or executing software programs stored in memory and by calling data stored in memory.

[0057] In a specific implementation, as one example, the processor may include one or more CPUs, for example CPU0 and CPU1 shown in Figure 4, which are, of course, merely illustrative examples.

[0058] The memory is used to store the software program that executes the solution of the present invention, and the processor controls its execution. For specific implementation methods, please refer to the above method embodiments, which will not be repeated here.

[0059] Optionally, the memory may be a read-only memory (ROM) or other type of static storage device capable of storing static information and instructions, random access memory (RAM) or other type of dynamic storage device capable of storing information and instructions, or electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but not limited thereto. The memory may be integrated with the processor or exist independently, and may be coupled to the processor through the interface circuit of the electronic device (not shown in Figure 4); this embodiment of the invention does not impose specific limitations on this.

[0060] The transceiver may include a receiver and a transmitter (not shown separately in Figure 4). The receiver is used to implement the receiving function, and the transmitter is used to implement the transmitting function. The transceiver can be integrated with the processor or exist independently, and can be coupled to the processor through the interface circuit of the electronic device (not shown in Figure 4); this embodiment of the invention does not specifically limit this.

[0061] In addition, it should be noted that the structure of the electronic device shown in Figure 4 is not intended to limit the device. Actual devices may include more or fewer components than shown, or combine certain components, or have different component arrangements. Furthermore, for the technical effects achieved by this electronic device when performing the method of the first embodiment described above, reference can be made to the technical effects described in the first embodiment; therefore, they will not be repeated here.

[0062] Fourth Embodiment

[0063] This embodiment provides a computer-readable storage medium storing at least one instruction, which is loaded and executed by a processor to implement the method of the first embodiment described above. The computer-readable storage medium may be a ROM, random access memory, CD-ROM, magnetic tape, floppy disk, or optical data storage device, etc. The instruction stored therein can be loaded and executed by a processor in a terminal.

[0064] Furthermore, it should be noted that the present invention can be provided as a method, apparatus, or computer program product. Therefore, embodiments of the present invention can take the form of a completely or partially hardware embodiment, a completely or partially software embodiment, or an embodiment combining software and hardware aspects. Moreover, when implemented in software, embodiments of the present invention can take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any usable medium accessible to a computer or a data storage device such as a server or data center containing one or more sets of usable media. The usable medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive (SSD).

[0065] Embodiments of the present invention are described with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, generate instructions for implementing the flowchart illustrations. Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.

[0066] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device to cause a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0067] It should also be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. The terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element. Furthermore, the term "and/or" merely describes an association between related objects and indicates that three relationships can exist. For example, A and/or B can represent: A alone, both A and B, or B alone, where A and B can be singular or plural. Additionally, the character "/" in this text generally indicates an "or" relationship between the preceding and following objects, but it can also indicate an "and/or" relationship, depending on the context. "At least one" refers to one or more items, while "a plurality" refers to two or more items. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or multiple items. For example, at least one of a, b, or c can be represented as: a, b, c, ab, ac, bc, or abc, where a, b, and c can each be single or multiple.

[0068] Furthermore, it is understood that in various embodiments of the present invention, the order of the above-mentioned process numbers does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0069] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0070] In the several embodiments provided by this invention, it should be understood that the disclosed devices, apparatuses, and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of functional modules/units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another device, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs. Additionally, the functional units in the various embodiments of this invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

[0071] If the method is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0072] Finally, it should be noted that the above description is merely a preferred embodiment of the present invention. It should be pointed out that although preferred embodiments of the present invention have been described, those skilled in the art, once they understand the basic inventive concept of the present invention, can make several improvements and modifications without departing from the principles described herein. These improvements and modifications should also be considered within the scope of protection of the present invention. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present invention.

Claims

1. A robust cross-modal retrieval method, characterized by comprising:
acquiring multimodal data and constructing a multimodal dataset;
training a pre-defined cross-modal learning model using the multimodal dataset, so that the model learns the mapping relationships between data of different modalities and obtains cross-modal knowledge; wherein the cross-modal learning model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning; the dual-model multimodal classifier is used to partition the clean data in the multimodal dataset into a clean dataset; and the cross-modal pre-trained model based on contrastive learning is used to map data of different modalities in the clean dataset to the same feature space through contrastive learning, so as to learn the mapping relationships between data of different modalities and obtain cross-modal knowledge; and
performing cross-modal retrieval based on the obtained cross-modal knowledge.
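The final retrieval step of claim 1 can be illustrated with a minimal sketch. It assumes that text and image items have already been mapped into the shared feature space and that candidates are ranked by cosine similarity; the claim does not name a similarity measure, so that choice, the function names `cosine` and `retrieve`, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, gallery, top_k=2):
    """Return the ids of the top_k gallery items most similar to the query.

    gallery maps an item id to its embedding in the shared feature space.
    """
    ranked = sorted(
        gallery,
        key=lambda item_id: cosine(query_vec, gallery[item_id]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical embeddings: one text query and three image embeddings.
text_query = [0.9, 0.1, 0.0]
image_gallery = {
    "img_cat": [1.0, 0.0, 0.0],
    "img_dog": [0.0, 1.0, 0.0],
    "img_car": [0.0, 0.0, 1.0],
}
top = retrieve(text_query, image_gallery, top_k=2)
```

The same function serves image-to-text retrieval by passing an image embedding as the query and a gallery of text embeddings.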

2. The robust cross-modal retrieval method as described in claim 1, characterized in that the multimodal data includes two different modalities: text data and image data.

3. The robust cross-modal retrieval method as described in claim 2, characterized in that the dual-model multimodal classifier includes a first multimodal classifier and a second multimodal classifier, and the step of partitioning the clean data in the multimodal dataset into a clean dataset includes:
inputting the multimodal data into a dual-branch coding network for feature extraction to obtain text features and image features;
inputting the text features and image features into the first multimodal classifier to obtain predicted values of the semantic labels of the corresponding samples; calculating the confidence of each sample based on the predicted value and the true value of its semantic label; and, when the confidence of a sample is greater than a preset threshold, considering that sample to be clean data;
inputting the text features and image features into the second multimodal classifier to obtain predicted values of the semantic labels of the corresponding samples; calculating the confidence of each sample based on the predicted value and the true value of its semantic label; and, when the confidence of a sample is greater than a preset threshold, considering that sample to be clean data; and
for a given sample, if both the first multimodal classifier and the second multimodal classifier consider it to be clean data, assigning that sample to the clean dataset.
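The agreement rule of claim 3 can be sketched as follows. The claim states only that confidence is calculated from the predicted and true label values; this sketch assumes, as one plausible reading, that a sample's confidence is the probability a classifier assigns to the sample's given label. The names `label_confidence` and `select_clean`, the threshold of 0.5, and the toy probabilities are hypothetical.

```python
def label_confidence(probs, label):
    """Assumed confidence: probability the classifier assigns to the given label."""
    return probs[label]

def select_clean(probs_first, probs_second, labels, threshold=0.5):
    """Return indices of samples that BOTH classifiers consider clean.

    probs_first / probs_second: per-sample class probabilities from the
    first and second multimodal classifiers.
    labels: the (possibly noisy) semantic label index of each sample.
    """
    clean = []
    for i, label in enumerate(labels):
        clean_by_first = label_confidence(probs_first[i], label) > threshold
        clean_by_second = label_confidence(probs_second[i], label) > threshold
        if clean_by_first and clean_by_second:
            clean.append(i)
    return clean

# Toy example with four samples and two classes.
labels = [0, 1, 1, 0]
probs_first = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]
probs_second = [[0.8, 0.2], [0.3, 0.7], [0.4, 0.6], [0.3, 0.7]]
clean_indices = select_clean(probs_first, probs_second, labels, threshold=0.5)
```

Samples 2 and 3 are excluded because at least one of the two classifiers assigns their given label a confidence at or below the threshold, which is the conservative behavior the intersection rule is meant to provide.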

4. The robust cross-modal retrieval method as described in claim 3, characterized in that the first and second multimodal classifiers have the same structure, each consisting of several fully connected layers.

5. The robust cross-modal retrieval method as described in claim 3, characterized in that the dual-branch coding network encodes image data using a 19-layer VGGNet to obtain image features, and encodes text data using a multilayer perceptron to obtain text features.

6. The robust cross-modal retrieval method as described in claim 3, characterized in that both the first and second multimodal classifiers use a classification loss as the loss function; the classification loss is expressed as: [formula not reproduced in this text]; where the symbols of the formula denote, respectively, the classification loss, the classification loss when predicting the semantic labels of samples using image features, and the classification loss when predicting the semantic labels of samples using text features.

7. The robust cross-modal retrieval method as described in claim 6, characterized in that the cross-modal pre-trained model based on contrastive learning employs the InfoNCE loss function.
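For orientation, a generic InfoNCE loss over paired image and text features can be sketched as below. This is the standard textbook form and not necessarily the exact formulation used here: the cosine-similarity logits, the temperature of 0.1, the symmetric averaging over both retrieval directions, and the function names are all assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def info_nce(anchors, candidates, temperature=0.1):
    """Mean InfoNCE loss where candidates[i] is the positive for anchors[i]."""
    anchors = [normalize(a) for a in anchors]
    candidates = [normalize(c) for c in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching candidate.
        total += log_denominator - logits[i]
    return total / len(anchors)

def symmetric_info_nce(image_feats, text_feats, temperature=0.1):
    """Average of the image-to-text and text-to-image InfoNCE losses (assumed form)."""
    return 0.5 * (
        info_nce(image_feats, text_feats, temperature)
        + info_nce(text_feats, image_feats, temperature)
    )

# Toy features: aligned pairs should give a lower loss than swapped pairs.
image_feats = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = symmetric_info_nce(image_feats, aligned_text)
loss_swapped = symmetric_info_nce(image_feats, swapped_text)
```

Minimizing this loss pulls each matched image-text pair together in the shared feature space while pushing apart the mismatched pairs in the batch.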

8. The robust cross-modal retrieval method as described in claim 7, characterized in that the loss function of the cross-modal learning model is a combination of the classification loss and the InfoNCE loss function.

9. The robust cross-modal retrieval method as described in claim 8, characterized in that the loss function of the cross-modal learning model is expressed as: [formula not reproduced in this text]; where the symbols of the formula denote, respectively, the loss function of the cross-modal learning model and a preset trade-off parameter.
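One way such a combined objective could be computed is sketched below. Because the formulas themselves are not reproduced in this text, every structural choice here is an assumption: the classification loss is taken as the unweighted sum of an image-branch and a text-branch cross-entropy, and the trade-off parameter is assumed to scale the InfoNCE term. The function names and numeric values are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one sample: negative log-probability of its label."""
    return -math.log(probs[label])

def classification_loss(image_branch_loss, text_branch_loss):
    """Assumed form: unweighted sum of the image-feature and text-feature losses."""
    return image_branch_loss + text_branch_loss

def total_loss(cls_loss, contrastive_loss, trade_off=1.0):
    """Assumed form: classification loss plus trade_off times the InfoNCE loss."""
    return cls_loss + trade_off * contrastive_loss

# Toy values: one sample with label 0, plus a hypothetical InfoNCE value of 0.5.
cls = classification_loss(
    cross_entropy([0.8, 0.2], 0),  # prediction from image features
    cross_entropy([0.6, 0.4], 0),  # prediction from text features
)
total = total_loss(cls, contrastive_loss=0.5, trade_off=0.1)
```

Under these assumptions, a larger trade-off value gives the contrastive alignment term more influence relative to label prediction.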

10. A robust cross-modal retrieval device, characterized by comprising:
a data acquisition module, used to acquire multimodal data and construct a multimodal dataset;
a cross-modal knowledge learning module, used to train a pre-defined cross-modal learning model using the multimodal dataset, so that the model learns the mapping relationships between data of different modalities and obtains cross-modal knowledge; wherein the cross-modal learning model includes a dual-model multimodal classifier and a cross-modal pre-trained model based on contrastive learning; the dual-model multimodal classifier is used to partition the clean data in the multimodal dataset into a clean dataset; and the cross-modal pre-trained model based on contrastive learning is used to map data of different modalities in the clean dataset to the same feature space through contrastive learning, so as to learn the mapping relationships between data of different modalities and obtain cross-modal knowledge; and
a cross-modal retrieval module, used to perform cross-modal retrieval based on the obtained cross-modal knowledge.