Training method and apparatus for image-text retrieval model, retrieval method and apparatus, device, and medium

What is AI technical title?
AI technical title is built by PatSnap AI team. It summarizes the technical point description of the patent document.
By replacing attribute fields and calculating similarity in the labeled text of the image-text retrieval model, and adjusting the model parameters, the attribute problem in pedestrian image-text retrieval was solved, and the accuracy of image-text retrieval was improved.

WO2026124045A1PCT designated stage Publication Date: 2026-06-18SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD +2

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
Filing Date: 2025-11-04
Publication Date: 2026-06-18

AI Technical Summary

⚠Technical Problem

Existing technologies struggle to effectively address the attribute issues of search objects in pedestrian image and text retrieval, resulting in low accuracy in cross-modal image and text retrieval.

⚗Method used

By acquiring image-text pairs, a pre-trained image-text retrieval model is used to replace attribute fields in the labeled text, calculate the similarity between the replaced text and the target image, and adjust the model parameters so that the first similarity is greater than the second similarity, thereby improving the accuracy of the model.

🎯Benefits of technology

It improves the accuracy of image and text retrieval, enabling the model to encode text and images more accurately, ensuring that text retrieval yields accurate images.

✦ Generated by Eureka AI based on patent content.

Smart Images

Figure CN2025132406_18062026_PF_FP_ABST

Patent Text Reader

Abstract

The present application relates to the technical field of image retrieval, and in particular to a training method and apparatus for an image-text retrieval model, a retrieval method and apparatus, a device, and a medium. In the method, fine-tuning training is performed on a pre-trained model by means of a replacement text, so that the model can better distinguish a positive sample from a negative sample corresponding to the replacement text, and thus the trained model can more accurately encode a text and an image, and the alignment degree between the text encoding and the image encoding is higher. During subsequent retrieval, the trained model is used to establish a gallery, and the trained model is used to encode the text, so that an accurate image can be retrieved for the text from the gallery, that is, the accuracy of image-text retrieval is improved.

Need to check novelty before this filing date? Find Prior Art

Description

Training methods, retrieval methods, devices, equipment and media for image and text retrieval models

[0001] This application claims priority to Chinese Patent Application No. 202411844256.0, filed on December 12, 2024, entitled “Training Method, Retrieval Method, Apparatus, Device and Medium for Image and Text Retrieval Model”, the entire contents of which are incorporated herein by reference. Technical Field

[0002] This application relates to the field of image retrieval technology, and in particular to a training method, retrieval method, apparatus, device and medium for an image and text retrieval model. Background Technology

[0003] With the development of image recognition and natural language processing technologies, image-text retrieval has gradually become an important direction in multimodal research. For example, pedestrian image-text retrieval aims to find pedestrian images that match the text content by inputting a text description. The process mainly relies on hand-designed feature engineering and rule-based methods, which extract features from text and images separately, and then calculate the matching degree between images and text through similarity measures (e.g., Euclidean distance or cosine similarity). However, due to the high semantic complexity of natural language descriptions, and the potential interference of external factors such as background, pose, and lighting in pedestrian images, traditional methods have the following shortcomings: (1) Insufficient feature expression; hand-designed features are difficult to effectively capture subtle semantic differences between images and text; (2) Large modal differences; there are huge differences in the representation spaces between text and images, making effective alignment difficult; (3) Poor scalability; as the scale and complexity of data increase, traditional methods have poor scalability and cannot cope with large-scale and diverse datasets.

[0004] To address this, we present the application of deep learning in image and text retrieval. Deep learning can automatically learn feature representations from data, significantly improving the accuracy of cross-modal matching. Pedestrian image and text multimodal retrieval is mainly divided into two types: one is a single-tower structure, which uses a single model to represent the interaction between text and image. This model has high accuracy but low inference efficiency. The other is a dual-tower structure, where text is represented by a text encoder and images by an image encoder, and finally the two features are interacted. This method has high inference efficiency, but its accuracy is slightly lower than that of the single-tower structure.

[0005] Currently, based on the dual-tower structure, research mainly focuses on the following three aspects: training paradigms, such as Contrastive Language-Image Pre-Training (CLIP), Learning Interpretability Tool (LIT), and Bootstrapping Language-Image Pre-training (BLIP); feature interaction methods, such as various attention mechanisms; and data generation, such as generating corresponding captions based on rules or by having a large model generate them. Although different methods can improve the performance of pedestrian image-text retrieval, they do not have a good ability to solve the attribute problem of the retrieval object in image-text retrieval.

[0006] Therefore, how to perform image retrieval on text containing attributes of the search object in order to improve the accuracy of cross-modal image-text retrieval has become an urgent problem to be solved. Summary of the Invention

[0007] In view of this, embodiments of this application provide a training method, retrieval method, apparatus, device, and medium for an image-text retrieval model to solve the problem of how to perform image retrieval on text containing attributes of the retrieval object, so as to improve the accuracy of cross-modal image-text retrieval.

[0008] In a first aspect, embodiments of this application provide a method for training an image-text retrieval model, comprising:

[0009] Obtain N image-text pairs, and use the N image-text pairs to pre-train the image-text retrieval model to obtain a pre-trained image-text retrieval model. The image-text pairs include training images and corresponding labeled text, where N is an integer greater than zero.

[0010] For any image-text pair, obtain the attribute field representing the image attribute in the annotation text of the image-text pair, and replace the attribute field in the annotation text to obtain the replacement text;

[0011] The labeled text in the image-text pair is input into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image. The second similarity between the replacement text and the target image is calculated using the pre-trained image-text retrieval model.

[0012] Traverse all image-text pairs to obtain the first similarity and second similarity of each image-text pair. With the goal of all image-text pairs having a first similarity greater than the corresponding second similarity, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model.

[0013] Secondly, embodiments of this application provide a retrieval method, including:

[0014] Obtain the query text, and execute the trained image and text retrieval model using the training method of the image and text retrieval model as described in the first aspect to perform text encoding on the query text to obtain a semantic vector;

[0015] Using the semantic vector, images corresponding to the recalled matching image vectors are retrieved from the image database, which stores existing images and corresponding image vectors. The image vectors are obtained by encoding the existing images using a trained image retrieval model obtained by performing the training method of the image retrieval model as described in the first aspect.

[0016] Thirdly, an embodiment of this application provides a training device for a text-image retrieval model, comprising:

[0017] The pre-training module is used to acquire N image-text pairs and use the N image-text pairs to pre-train the image-text retrieval model to obtain a pre-trained image-text retrieval model. The image-text pairs include training images and corresponding labeled text, where N is an integer greater than zero.

[0018] The text replacement module is used to, for any image-text pair, obtain the attribute fields representing image attributes in the annotation text of the image-text pair, replace the attribute fields in the annotation text, and obtain the replacement text.

[0019] The similarity calculation module is used to input the labeled text in the image-text pair into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image, and to use the pre-trained image-text retrieval model to calculate the second similarity between the replacement text and the target image;

[0020] The retraining module is used to traverse all image-text pairs, obtain the first similarity and second similarity of each image-text pair, and adjust the parameters of the pre-trained image-text retrieval model with the goal that the first similarity of all image-text pairs is greater than the corresponding second similarity, so as to obtain the trained image-text retrieval model.

[0021] Fourthly, embodiments of this application provide a retrieval device, including:

[0022] The text encoding module is used to obtain the query text, execute the trained image and text retrieval model using the training method of the image and text retrieval model as described in the first aspect, and encode the query text to obtain a semantic vector.

[0023] The image retrieval module is used to retrieve images corresponding to matching image vectors from an image database using the semantic vectors. The image database stores existing images and corresponding image vectors. The image vectors are obtained by encoding the existing images using a trained image retrieval model trained using the image retrieval model training method described in the first aspect.

[0024] Fifthly, embodiments of this application provide a computer device, the computer device including a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, it implements the training method of the image and text retrieval model as described in the first aspect, or the retrieval method as described in the second aspect.

[0025] Sixthly, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the training method for the image and text retrieval model as described in the first aspect, or the retrieval method as described in the second aspect.

[0026] The beneficial effects of the embodiments in this application compared with the prior art are:

[0027] The method of this application pre-trains an image-text retrieval model by using image-text pairs consisting of training images and corresponding labeled text to obtain a pre-trained image-text retrieval model. Attribute fields in the labeled text of the image-text pairs are replaced to obtain replacement text. The labeled text is then input into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text. Using the pre-trained image-text retrieval model, the second similarity between the replacement text and the target image is calculated. All image-text pairs are traversed to obtain the first and second similarities for each pair. With the goal of all image-text pairs having a first similarity greater than their corresponding second similarity, the parameters of the pre-trained image-text retrieval model are adjusted to obtain a trained image-text retrieval model.

[0028] By fine-tuning the pre-trained model through text replacement, the model can better distinguish between positive samples and negative samples corresponding to the replaced text. This allows the trained model to encode text and images more accurately, with higher alignment between text and image encoding. In subsequent retrieval, using this trained model to build an image library and encode text with it enables accurate image retrieval from the image library, thus improving the accuracy of image-text retrieval. Attached Figure Description

[0029] To more clearly illustrate the technical solutions in the embodiments of this application, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0030] Figure 1 is a schematic diagram of an application environment for a training method or retrieval method of an image and text retrieval model provided in Embodiment 1 of this application;

[0031] Figure 2 is a flowchart illustrating a training method for a text and image retrieval model provided in Embodiment 2 of this application;

[0032] Figure 3 is a flowchart illustrating a training method for a text and image retrieval model provided in Embodiment 3 of this application;

[0033] Figure 4 is a flowchart illustrating a training method for a text and image retrieval model provided in Embodiment 4 of this application;

[0034] Figure 5 is a flowchart illustrating a training method for a text and image retrieval model provided in Embodiment 5 of this application;

[0035] Figure 6 is a flowchart illustrating a retrieval method provided in Embodiment Six of this application;

[0036] Figure 7 is a schematic diagram of the structure of a training device for a text and image retrieval model provided in Embodiment 7 of this application;

[0037] Figure 8 is a schematic diagram of a retrieval device provided in Embodiment 8 of this application;

[0038] Figure 9 is a schematic diagram of the structure of a computer device provided in Embodiment 9 of this application. Detailed Implementation

[0039] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0040] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integrals, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integrals, steps, operations, elements, components and / or a collection thereof.

[0041] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0042] As used in this application specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when," "once," "in response to determination," or "in response to detection." Similarly, the phrase "if determined" or "if detected [the described condition or event]" may be interpreted, depending on the context, as meaning "once determined," "in response to determination," "once detected [the described condition or event]," or "in response to detection [the described condition or event]."

[0043] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0044] References to "one embodiment" or "some embodiments" as described in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless otherwise specifically emphasized.

[0045] The embodiments of this application can acquire and process relevant data based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0046] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.

[0047] It should be understood that the sequence number of each step in the following embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0048] To illustrate the technical solution of this application, specific embodiments are described below.

[0049] The first embodiment of this application provides a training method or retrieval method for a text and image retrieval model, which can be applied in the application environment shown in Figure 1. In this method, the client communicates with the server. The client can send retrieval instructions to the server, thereby forming retrieval features on the server. The server then combines these features with the database it is connected to to perform the query operation and finally outputs the query results to the client.

[0050] Both the training and retrieval methods of this image-text retrieval model can be applied to the server. The server is used to deploy the models corresponding to the training and retrieval methods of the image-text retrieval model. The model corresponding to the training method of the image-text retrieval model is an untrained model, which is trained by acquiring image data from the client or the database. The model corresponding to the retrieval method is a model trained using the training method of the image-text retrieval model, which is used to obtain client instructions to perform retrieval or build an image database for retrieval.

[0051] The client side includes, but is not limited to, PDAs, desktop computers, laptops, ultra-mobile personal computers (UMPCs), netbooks, cloud terminal devices, and personal digital assistants (PDAs). The server side can be a standalone server or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms.

[0052] Referring to Figure 2, which is a flowchart illustrating a training method for an image-text retrieval model according to Embodiment 2 of this application, the training method for the image-text retrieval model described above can be applied to the server in Figure 1. The computer device corresponding to the server connects to the client, database, etc., to obtain relevant data and process the data accordingly. Of course, the image-text retrieval model of this application may include two modules: a text encoding module for feature encoding of the query text and an image encoding module for feature encoding of the image. Details will be provided below.

[0053] As shown in Figure 2, the training method for this image-text retrieval model may include the following steps:

[0054] Step S201: Obtain N image-text pairs, and use the N image-text pairs to pre-train the image-text retrieval model to obtain the pre-trained image-text retrieval model.

[0055] In this embodiment, the image-text pair includes a training image and its corresponding labeled text. That is, the training image and the labeled text form a data pair. The labeled text is text describing the attributes of the training image. For example, the labeled text could be: "A man wearing a yellow helmet, a black short-sleeved shirt, and giving a thumbs-up." The corresponding image forms an image-text pair. Here, N is a positive integer representing the quantity of training data and can be designed according to user needs.

[0056] Image-text retrieval models are models that align images and text. These models may include image encoding modules and text encoding modules, and may also include a matching module. This matching module can be constructed using a pre-trained matching algorithm, and in this application, training is not required. The image encoding module encodes the image to form feature representations in the form of image vectors, and the text encoding module encodes the text to form feature representations in the form of text vectors. For example, for an image-text pair, where the training image is input into the image-text retrieval model, the image vector E is obtained through encoding by the image encoding module. img The text input to the image-text retrieval model is labeled with text, and the text encoding module encodes the text to obtain E. text E img and E text Calculate similarity, requiring a high degree of similarity to achieve the purpose of matching.

[0057] For example, if the user inputs a text query that is a string of natural language statements, such as "people wearing yellow helmets on the construction site," denoted as T, the extracted feature is E. text =M text (T), E text ∈R d , of which M text This represents the text encoding module, corresponding to image I, with extracted features E. img =M img (I), E img ∈R d , of which M img This represents the image encoding module, where R represents a real number and d represents the encoding dimension.

[0058] Image coding modules include, but are not limited to, Residual Networks (ResNet), Inception series models, MobileNet series models, EfficientNet models, ConvNeXt models, VGG, etc.

[0059] Text encoding modules include, but are not limited to, one-hot encoding, bag of words (BoW), TF-IDF (Term Frequency-Inverse Document Frequency), N-gram models, word embeddings, and pre-trained models such as BERT and GPT.

[0060] Similarity calculations include, but are not limited to, Euclidean distance, Manhattan distance, cosine similarity, Pearson correlation coefficient, Hamming distance, Jaccard similarity, and edit distance.

[0061] Using N image-text pairs, with the goal of aligning image encoding and text encoding, the image-text retrieval model can be pre-trained to obtain a pre-trained image-text retrieval model.

[0062] The pre-training phase aims to align the text feature space with the image feature space. In this phase, the input is text-image pairs, and the images are not limited to pedestrian images. During training, a common contrastive image-text loss is used for optimization, specifically: L clip =(L I→T +L T→I ) / 2

[0063] In the formula, clip represents CLIP (Contrastive Language-Image Pre-Training), a multimodal pre-trained neural network. Of course, this image-text retrieval model can also employ other model architectures. I→T The loss of the image to text is represented as the loss of the image encoding module, L. T→I The loss of text to image represents the loss of the text encoding module, L. clip The total loss is represented by the average of the loss from image to text and the loss from text to image.

[0064] In the formula, f I (x i Let f represent the features of the i-th image, xi represent the features of the i-th image, and f T (y i ) represents the feature of the labeled text corresponding to the i-th image, y i Let represent the labeled text corresponding to the i-th image. The optimization goal is to maximize the similarity between the text and its corresponding image, and minimize the similarity between the text and non-corresponding images.

[0065] In specific pedestrian retrieval scenarios, the collected image-text equivalent data is mainly divided into two parts. One part is image-text data, meaning a sample consists of an image and its description. This data can be used in the cross-modal retrieval pre-training stage. The other part consists of images without descriptions. For this part, if the image is related to pedestrians, it can be stored and processed later. If it is not related to pedestrians, a target retrieval model is needed. The input of this target retrieval model is an image, and the output is the detection box corresponding to all pedestrians in the image. The returned detection box generally contains two coordinate points: the coordinates of the upper left corner and the lower right corner of the pedestrian in the image. After obtaining a series of coordinates, the pedestrian sub-image is extracted from the image based on these coordinates and stored. In practical applications, the data source is images from cameras. The pedestrians extracted from these images may be duplicated, which may affect the training of the model. Therefore, image deduplication is required. At this point, the method of deduplication using imdedup+faiss-gpu is adopted. imdedup is an image deduplication library. Its basic principle is to convert the image into a hash code and then deduplicate the image according to the hash code. However, its internal implementation makes deduplication very time-consuming.

[0066] In one embodiment, the hash code obtained from the converted image can also be used, but deduplication is performed using Faiss-gpu. Specifically: First, the hash code is converted into binary code, and the integer binary code is converted into floating-point number. Then, a Faiss index is constructed from the binary codes of all images. To accelerate deduplication, the index is transferred to the GPU for speedup. After constructing the index to store the binary codes, each binary code is traversed, and the L2 distance between it and the binary codes of all other images is calculated. The lower the distance, the more similar the images are. A threshold of 8 is set, and all images below this threshold are considered duplicate images, which are then removed from the data. The above operation is repeated until all binary codes have been traversed, ultimately resulting in data that does not contain duplicate images.

[0067] After image deduplication, it is necessary to generate pedestrian-related annotations for these unannotated images. This embodiment uses qwen2-vl-72B as the base model for annotation generation and deploys it using vllm. After deployment, vllm provides a request URL, which can be called by multiple processes to accelerate annotation generation. After adjustments and testing, this embodiment ultimately uses the following text instructions to guide the model in generating the corresponding annotations:

[0068] Model Input: Please describe the attributes of the people in the image as briefly as possible. Gender can be male or female, and age can be middle-aged, elderly, child, etc. For example: A man wearing a white short-sleeved shirt and brown trousers, white sneakers, and a watch. Two people wearing black clothes and black trousers. A woman with long blonde hair, etc.

[0069] The model can generate corresponding annotations based on the input image and text instructions:

[0070] For example, the input image is: [Image Example]

[0071] Model output: A man wearing a yellow helmet, a black short-sleeved shirt, and giving a thumbs-up.

[0072] Using the method described above, unlabeled images of pedestrians can be labeled.

[0073] Step S202: For any image-text pair, obtain the attribute fields representing image attributes in the annotation text of the image-text pair, replace the attribute fields in the annotation text, and obtain the replacement text.

[0074] In this embodiment, image attributes can refer to features that characterize the image by analyzing the image, such as color features, action features, object features, human features, etc. Correspondingly, attribute fields are textual descriptions of features, such as yellow, man, glasses, etc.

[0075] Replace one or more attribute fields in the labeled text. The resulting text is the replacement text. In practice, the labeled text can be copied, and then the attribute fields within the copied text can be replaced. Replacement can be done by direct deletion or insertion, or by mask prediction. For example, if the labeled text is "a woman wearing a black top and white pants," replacing "black" with "red" will result in the replacement text "a woman wearing a red top and white pants."

[0076] Taking "a man wearing a yellow helmet, a black short-sleeved shirt, and giving a thumbs up" as an example, replacing the attributes will yield: "a man wearing a white helmet, a black short-sleeved shirt, and giving a thumbs up", "a man wearing a white helmet, a red short-sleeved shirt, and giving a thumbs up", "a man wearing a white helmet, a red jacket, and giving a thumbs up", etc.

[0077] Step S203: Input the labeled text in the image-text pair into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image. Then, use the pre-trained image-text retrieval model to calculate the second similarity between the replacement text and the target image.

[0078] In this embodiment, the pre-trained image and text retrieval model has been obtained through pre-training in step S201. Steps S203 and S204 are adjustments to the pre-trained image and text retrieval model, which can be considered as a retraining or fine-tuning.

[0079] The labeled text is input into a pre-trained image-text retrieval model. Its text encoding module encodes the labeled text into a labeled text vector. This vector is then used for image retrieval to find the target image. During the retrieval process, the similarity between the labeled text and the target image is calculated to obtain a first similarity score. Furthermore, the retrieval process also records negative samples corresponding to the labeled text; therefore, the similarity between these negative samples and the target image can also be calculated.

[0080] The text encoding module of the pre-trained image-text retrieval model is used to encode the replacement text corresponding to the labeled text to obtain the replacement text vector. The image encoding module of the pre-trained image-text retrieval model is used to encode the target image to obtain the image encoding vector. Then, the similarity between the replacement text vector and the image encoding vector is calculated by the matching algorithm in the pre-trained image-text retrieval model, which yields the second similarity between the replacement text and the target image.

[0081] Step S204: Traverse all image-text pairs and obtain the first similarity and second similarity of each image-text pair. With the goal of all image-text pairs having a first similarity greater than the corresponding second similarity, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model.

[0082] In this embodiment, the first similarity is required to be greater than the second similarity, so that the image can accurately match the labeled text, rather than replacing the text. For example, for the labeled text "woman wearing a black top and white pants", when searching for images, it is necessary to avoid retrieving "woman wearing a white top and black pants".

[0083] A new loss is constructed using the first and second similarities to adjust the parameters of the pre-trained image and text retrieval model. This results in the pre-trained image and text retrieval model with adjusted parameters having a higher first similarity than the second similarity after iteratively calculating the first and second similarities. In some cases, the first similarity is significantly higher than the second similarity.

[0084] Finally, after the first and second similarity scores meet the preset conditions, the iteration ends, and the trained image-text retrieval model is obtained. After the model is trained, a test set is used to test its performance. Typically, an image database is built. This database is used to recall corresponding images based on pedestrian text annotations. Then, manual judgment is used to determine whether the recalled images and annotations match. If they match, the retrieval is correct; otherwise, it is incorrect. Generally, the first 10 or 20 recalled images are used to judge the accuracy of the retrieval.

[0085] Optionally, after inputting the labeled text in the image-text pair into the pre-trained image-text retrieval model, a third similarity between the negative samples of the model and the target image is also obtained;

[0086] Iterate through all image-text pairs, obtaining the first and second similarities for each pair. With the goal of ensuring that the first similarity of all image-text pairs is greater than their corresponding second similarities, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model, including:

[0087] Traverse all image-text pairs and obtain the first, second, and third similarities for each pair. With the goal of ensuring that the first similarity of all image-text pairs is greater than the corresponding second and third similarities, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model.

[0088] The model is adjusted using negative samples. The goal is to make the third similarity between the negative samples and the target image much smaller than the first similarity, and the third similarity smaller than the second similarity. This adjusts the parameters of the pre-trained image and text retrieval model so that the parameters of the image and text retrieval model can match more accurately when encoding images and text, and prevent mismatches.

[0089] This application embodiment uses image-text pairs consisting of training images and corresponding labeled text to pre-train an image-text retrieval model, obtaining a pre-trained image-text retrieval model. Attribute fields in the labeled text of the image-text pairs are replaced to obtain replacement text. The labeled text is then input into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text. Using the pre-trained image-text retrieval model, the second similarity between the replacement text and the target image is calculated. All image-text pairs are traversed to obtain the first and second similarities for each pair. With the goal of all image-text pairs having a first similarity greater than their corresponding second similarity, the parameters of the pre-trained image-text retrieval model are adjusted to obtain a trained image-text retrieval model. By fine-tuning the pre-trained model through text replacement, the model can better distinguish between positive samples and negative samples corresponding to the replaced text. This allows the trained model to encode text and images more accurately, with higher alignment between text and image encoding. In subsequent retrieval, using this trained model to build an image library and encode text with it enables accurate image retrieval from the image library, thus improving the accuracy of image-text retrieval.

[0090] Referring to Figure 3, which is a flowchart illustrating a training method for an image-text retrieval model according to Embodiment 3 of this application, as shown in Figure 3, adjusting the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model, with the goal of ensuring that the first similarity of all image-text pairs is greater than the corresponding second and third similarities, may include the following steps:

[0091] Step S301: For any image-text pair, subtract the first similarity of the image-text pair from the corresponding second similarity to obtain the first difference; subtract the second similarity of the image-text pair from the corresponding third similarity to obtain the second difference.

[0092] Step S302: Add the maximum value between the negative number and zero of the first difference to the maximum value between the negative number and zero of the second difference to obtain the sum. Iterate through all image-text pairs to obtain the sum of all image-text pairs.

[0093] Step S303: Based on the sum of all image-text pairs, adjust the parameters of the pre-trained image-text retrieval model to obtain the fine-tuned image-text retrieval model.

[0094] Step S304: Using the fine-tuned image-text retrieval model as the trained image-text retrieval model, return to execute the input of the labeled text in the image-text pair into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image, until the sum of all image-text pairs reaches the minimum, and obtain the trained image-text retrieval model.

[0095] In this embodiment, the first similarity, second similarity, and third similarity are arranged in a hierarchical order, with the first similarity being greater than the second similarity, the second similarity being greater than the third similarity, and the first similarity being significantly different from the second and third similarities.

[0096] If the negative value of the first difference is greater than zero, the maximum value between it and zero is the negative value of the first difference, which means that the second similarity is greater than the first similarity, which is contrary to the purpose. The same applies to the second difference. Therefore, it is required that the sum of the results be minimized.

[0097] In one implementation, the parameters of the pre-trained image-text retrieval model are adjusted based on the sum of all image-text pairs to obtain a fine-tuned image-text retrieval model, including:

[0098] The average difference is obtained by summing all the text and image pairs.

[0099] With the goal of minimizing the average difference, the parameters of the pre-trained image and text retrieval model are adjusted to obtain the fine-tuned image and text retrieval model.

[0100] This involves averaging the sum of all image-text pairs to obtain the average difference, and then adjusting the parameters of the pre-trained image-text retrieval model to minimize the average difference.

[0101] The embodiments of this application adopt a hierarchical sorting method, requiring the similarity between the labeled text, the replacement text, and the negative samples of the model and the target image to decrease sequentially, so that the model has more refined features that conform to the similarity.

[0102] Referring to Figure 4, which is a flowchart illustrating a training method for an image-text retrieval model provided in Embodiment 4 of this application, if the labeled text contains M attribute fields, then replacing the attribute fields in the labeled text in step S202 above to obtain the replaced text may include the following steps:

[0103] Step S401: Select m attribute fields from the M attribute fields, and mask the positions of the m attribute fields in the annotation text to obtain the masked text.

[0104] In this embodiment, a mask prediction method is used for replacement. M is a preset value, and the mask is randomly selected from 1 to M to obtain the masked text. Where M≥m>0.

[0105] Step S402: Use a text prediction model to predict the position of the mask in the masked text to obtain m-level predicted text, and use all m-level predicted text as replacement text.

[0106] In this embodiment, the masking is performed by sequentially increasing the number of attribute fields. For example, masking one attribute field and predicting the predicted text at level 1, masking one attribute field and predicting the predicted text at level 2, thereby constructing hierarchical negative samples with gradually deviating meaning representations.

[0107] Take, for example, "a man wearing a yellow helmet, a black short-sleeved shirt, and giving a thumbs-up."

[0108] First, the text is analyzed using a text part-of-speech analysis model, and then part-of-speech tagging is performed using the Language Technology Platform (LTP). The final result is:

[0109] A man wearing a yellow helmet, a black short-sleeved shirt, and giving a thumbs-up.

[0110] In this context, v represents a verb, n represents a noun, wp represents a punctuation mark, and u represents an auxiliary word.

[0111] Then, for words in the text that are verbs or nouns, they are randomly replaced with [MASK]. The number of [MASK] replacements varies depending on the level of the text; for example, level 1 contains one [MASK], level 2 contains two [MASK] replacements, and so on. The text containing [MASK] is then input into BERT to predict the text represented by the [MASK] positions. Texts different from the original text are considered candidates. Below are examples of the hard samples (i.e., the replacement texts) from the first three levels:

[0112] Original text: A man wearing a yellow helmet, a black short-sleeved shirt, and giving a thumbs-up;

[0113] Level 1 Difficult Sample: A man wearing a white helmet, a black short-sleeved shirt, and giving a thumbs-up;

[0114] Level 2 Difficulty Sample: A man wearing a white helmet and a red short-sleeved shirt, giving a thumbs-up;

[0115] Level 3 Difficulty Sample: A man wearing a white helmet and a red jacket, giving a thumbs-up.

[0116] This application uses a mask prediction method to perform text replacement, which requires no manual intervention and avoids interference from human factors. At the same time, it constructs hierarchically ranked samples to meet the requirements of hierarchical similarity calculation, thereby improving the training accuracy.

[0117] Referring to Figure 5, it is a flowchart illustrating a training method for an image-text retrieval model provided in Embodiment 5 of this application. As shown in Figure 5, the step S301 above, which involves subtracting the first similarity of the image-text pair from its corresponding second similarity to obtain the first difference, and subtracting the second similarity of the image-text pair from its corresponding third similarity to obtain the second difference, may include the following steps:

[0118] Step S501: Subtract the first similarity of the image-text pair from the second similarity corresponding to the first-level predicted text to obtain the first-level difference.

[0119] Step S502: Subtract the second similarity corresponding to the predicted text at level i from the second similarity corresponding to the predicted text at level i+1 to obtain the difference at level i+1, and determine the difference from level i to level i+1 as the first difference.

[0120] Where m-1≥i≥1.

[0121] Step S503: Subtract the second similarity and the corresponding third similarity of the predicted text at level m to obtain the second difference.

[0122] The step S302 above, which involves adding the maximum value between the negative number and zero of the first difference to the maximum value between the negative number and zero of the second difference, to obtain the sum, may include the following steps:

[0123] Step S504: Determine the maximum value between the negative number and zero of each difference in the first difference, and determine the maximum value between the negative number and the collar of the second difference.

[0124] Step S505: Add the maximum value corresponding to all differences in the first difference to the maximum value between the negative number and the collar of the second difference to obtain the sum result.

[0125] This embodiment aims to ensure that the similarity between the image and the query text is greater than the similarity between the image and hard samples and negative samples, specifically expressed as follows:

[0126] in, It represents y and The negative log-likelihood between them Let h0, h1, h2, and h represent the significance scores for the i-th level difficult sample. n The parameters for adjusting similarity scores are: Y = labeled text, S = ... P The target image.

[0127] Assume the similarity score between the positive sample (i.e., the labeled text) and the target image is 0.9, the similarities between the difficult samples (levels 1 to 3) are 0.95, 0.9, and 0.8 respectively, and the similarity between the positive sample and the target image is 0.7. Assume h0 = h1 = h2 = h n=0, therefore, L at this time hr =max(0,0+(-0.9+0.95))+max(0,0+(-0.95+0.9))+max(0,0+(-0.9+0.8))+max(0,0+(-0.8+0.7))=0.05+0+0+0=0.05. The goal of model optimization is to make this value smaller and smaller. Therefore, for this example, f(Y,S) will be continuously adjusted during training. P ), Let the value be -f(Y,S) P (greater than) The value of is the same for other cases, and will not be elaborated further here.

[0128] The embodiments of this application adopt a hierarchical sorting method, requiring the similarity between the labeled text, the replacement text, and the negative samples of the model and the target image to decrease sequentially, so that the model has more refined features that conform to the similarity.

[0129] Referring to Figure 6, which is a flowchart illustrating a retrieval method provided in Embodiment Six of this application, the training method of the above-described image-text retrieval model can be applied to the server in Figure 1. The computer device corresponding to the server connects to the client, database, etc., to obtain relevant data and process it accordingly. Of course, the image-text retrieval model of this application may include two modules: a text encoding module for feature encoding of the query text and an image encoding module for feature encoding of the image. Details will be provided below.

[0130] As shown in Figure 6, the retrieval method may include the following steps:

[0131] Step S601: Obtain the query text, execute the trained image and text retrieval model using the training method of the image and text retrieval model in the above embodiments, and encode the query text to obtain a semantic vector.

[0132] Step S602: Using semantic vectors, retrieve the image corresponding to the image vector that was recalled from the image database.

[0133] The image database stores existing images and their corresponding image vectors. The image vectors are obtained by encoding existing images using the trained image-text retrieval model trained according to the methods described in the above embodiments. One or more images can be recalled, and the top k images with the highest similarity are returned and displayed using a top-K approach.

[0134] Before searching, an image database needs to be built. Let's say we have collected K images, denoted as K. To facilitate retrieval, each image needs to undergo feature extraction and storage, using the following formula:

[0135] In the formula, Let R represent the image features corresponding to the m-th image, R represent real numbers, d represent the dimension of the vector, and M represent the vector dimensions. img This is the image encoding module in the image-text retrieval model. After extracting the image features, these features are stored for subsequent retrieval.

[0136] This application embodiment obtains query text, executes the trained retrieval model using the training methods of the image-text retrieval models in the above embodiments, performs text encoding on the query text to obtain semantic vectors, and uses the semantic vectors to retrieve images corresponding to matching image vectors from the image database. By replacing the text and fine-tuning the pre-trained model, the model can better distinguish between positive samples and negative samples corresponding to the replaced text, thereby enabling the trained model to encode text and images more accurately, with higher alignment between text and image encoding. In subsequent retrievals, by building an image library using this trained model and encoding the text with it, accurate images can be retrieved from the image library, thus improving the accuracy of image-text retrieval.

[0137] Corresponding to the training method of the image and text retrieval model in the above embodiments, Figure 7 shows a structural block diagram of the training device for the image and text retrieval model provided in Embodiment 7 of this application. The training device for the image and text retrieval model can be applied to the server in Figure 1. The computer device corresponding to the server connects to the client, database, etc., to obtain relevant data and process it accordingly. Of course, the image and text retrieval model of this application may include two modules: a text encoding module for feature encoding of the query text and an image encoding module for feature encoding of the image. For ease of explanation, only the parts related to the embodiments of this application are shown.

[0138] Referring to Figure 7, the training device for this image and text retrieval model includes:

[0139] The pre-training module 71 is used to obtain N image-text pairs and use the N image-text pairs to pre-train the image-text retrieval model to obtain the pre-trained image-text retrieval model. The image-text pairs include training images and corresponding labeled text, where N is an integer greater than zero.

[0140] The text replacement module 72 is used to obtain the attribute fields representing image attributes in the annotation text of any image-text pair, replace the attribute fields in the annotation text, and obtain the replacement text.

[0141] The similarity calculation module 73 is used to input the labeled text in the image-text pair into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image, and to use the pre-trained image-text retrieval model to calculate the second similarity between the replacement text and the target image.

[0142] The retraining module 74 is used to traverse all image-text pairs, obtain the first similarity and second similarity of each image-text pair, and adjust the parameters of the pre-trained image-text retrieval model with the goal that the first similarity of all image-text pairs is greater than the corresponding second similarity, so as to obtain the trained image-text retrieval model.

[0143] Optionally, after inputting the labeled text in the image-text pair into the pre-trained image-text retrieval model, a third similarity between the negative samples of the model and the target image is also obtained;

[0144] This retraining module 74 is specifically used for:

[0145] Traverse all image-text pairs and obtain the first, second, and third similarities for each pair. With the goal of ensuring that the first similarity of all image-text pairs is greater than the corresponding second and third similarities, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model.

[0146] Optionally, the retraining module 74 includes:

[0147] The difference calculation unit is used to calculate the first difference by subtracting the first similarity of the image-text pair from the corresponding second similarity for any image-text pair, and to calculate the second difference by subtracting the second similarity of the image-text pair from the corresponding third similarity.

[0148] The difference merging unit is used to add the maximum value between the negative number and zero of the first difference and the maximum value between the negative number and zero of the second difference to obtain the sum result. It iterates through all image-text pairs to obtain the sum result of all image-text pairs.

[0149] The parameter adjustment unit is used to adjust the parameters of the pre-trained image and text retrieval model based on the sum of all image and text pairs, so as to obtain the fine-tuned image and text retrieval model.

[0150] The iterative training unit is used to take the fine-tuned image-text retrieval model as the trained image-text retrieval model, and return to execute the input of the labeled text in the image-text pair into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image, until the sum of all image-text pairs reaches the minimum, and the trained image-text retrieval model is obtained.

[0151] Optionally, if the labeled text contains M attribute fields, then the text replacement module 72 includes:

[0152] The masking unit is used to select m attribute fields from M attribute fields, mask the positions of the m attribute fields in the annotation text, and obtain the masked text, where M≥m>0;

[0153] The replacement unit is used to predict the position of the mask in the masked text using a text prediction model, obtain m-level predicted text, and use all m-level predicted text as replacement text.

[0154] Optionally, the difference calculation unit includes:

[0155] The first-level difference calculation subunit is used to calculate the difference between the first similarity of the image-text pair and the second similarity corresponding to the first-level predicted text to obtain the first-level difference;

[0156] The hierarchical difference calculation subunit is used to calculate the difference between the second similarity corresponding to the predicted text at level i and the second similarity corresponding to the predicted text at level i+1 to obtain the difference at level i+1. The difference from level 1 to level i+1 is determined as the first difference, where m-1≥i≥1.

[0157] The second difference calculation subunit is used to calculate the difference between the second similarity and the corresponding third similarity of the predicted text at level m to obtain the second difference;

[0158] This difference merging unit includes:

[0159] The maximum value determination subunit is used to determine the maximum value between the negative number and zero of each difference in the first difference, and to determine the maximum value between the negative number and the collar of the second difference;

[0160] The difference merging subunit is used to add the maximum value corresponding to all differences in the first difference to the maximum value between the negative number and the collar of the second difference, and obtain the sum result.

[0161] It should be noted that the information interaction and execution process between the above modules, units, and sub-units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.

[0162] Corresponding to the retrieval method in the above embodiments, Figure 8 shows a structural block diagram of the retrieval device provided in Embodiment 8 of this application. The retrieval device can be applied to the server in Figure 1. The computer device corresponding to the server connects to the client, database, etc., to obtain relevant data and process it accordingly. Of course, the image-text retrieval model of this application can include two modules: a text encoding module for feature encoding of the query text and an image encoding module for feature encoding of the image. For ease of explanation, only the parts related to the embodiments of this application are shown.

[0163] Referring to Figure 8, the retrieval device includes:

[0164] The text encoding module 81 is used to obtain the query text, execute the trained image and text retrieval model obtained by using the training method of the image and text retrieval model in the above embodiments, and encode the query text to obtain a semantic vector.

[0165] Image retrieval module 82 is used to retrieve images corresponding to matching image vectors from an image database using semantic vectors. The image database stores existing images and their corresponding image vectors. The image vectors are obtained by encoding existing images using the trained image retrieval model obtained by executing the training method of the image retrieval model in the above embodiments.

[0166] It should be noted that the information interaction and execution process between the above modules, units, and sub-units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.

[0167] Figure 9 is a schematic diagram of the structure of a computer device provided in Embodiment 9 of this application. As shown in Figure 9, the computer device of this embodiment includes: at least one processor (only one is shown in Figure 9), a memory, and a computer program stored in the memory and executable on at least one processor. When the processor executes the computer program, it implements the steps in the training method or retrieval method embodiment of any of the above-described image and text retrieval models.

[0168] The computer device may include, but is not limited to, a processor and memory. Those skilled in the art will understand that Figure 9 is merely an example of a computer device and does not constitute a limitation thereof. The computer device may include more or fewer components than illustrated, or a combination of certain components, or different components, such as a network interface, a display screen, and input devices.

[0169] The processor referred to can be a CPU, but it can also be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor.

[0170] Memory includes readable storage media, internal memory, etc., wherein internal memory can be the RAM of a computer device, providing an environment for the operation of the operating system and computer-readable instructions stored in the readable storage media. The readable storage media can be the hard drive of a computer device, or in other embodiments, it can be an external storage device of the computer device, such as a plug-in hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or Flash Card. Furthermore, memory can include both internal storage units and external storage devices of a computer device. Memory is used to store the operating system, applications, bootloader, data, and other programs, such as program code for computer programs. Memory can also be used to temporarily store data that has been output or will be output.

[0171] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is used as an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above device can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here. If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the above method embodiments. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. A computer-readable medium can include at least: any entity or device capable of carrying computer program code, a recording medium, a computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.

[0172] The implementation of all or part of the processes in the methods of the above embodiments can also be accomplished by a computer program product. When the computer program product is run on a computer device, it enables the computer device to execute the steps in the above method embodiments.

[0173] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0174] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0175] In the embodiments provided in this application, it should be understood that the disclosed apparatus / computer devices and methods can be implemented in other ways. For example, the apparatus / computer device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0176] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0177] The above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. A training method for an image-text retrieval model, characterized in that, include: Obtain N image-text pairs, and use the N image-text pairs to pre-train the image-text retrieval model to obtain a pre-trained image-text retrieval model. The image-text pairs include training images and corresponding labeled text, where N is an integer greater than zero. For any image-text pair, obtain the attribute field representing the image attribute in the annotation text of the image-text pair, and replace the attribute field in the annotation text to obtain the replacement text; The labeled text in the image-text pair is input into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image. The second similarity between the replacement text and the target image is calculated using the pre-trained image-text retrieval model. Traverse all image-text pairs to obtain the first similarity and second similarity of each image-text pair. With the goal of all image-text pairs having a first similarity greater than the corresponding second similarity, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model.

2. The training method for the image and text retrieval model according to claim 1, characterized in that, After inputting the labeled text in the image-text pair into the pre-trained image-text retrieval model, a third similarity between the negative samples of the model and the target image is also obtained; The process involves traversing all image-text pairs to obtain the first and second similarities for each pair. With the goal of ensuring that the first similarity of all image-text pairs is greater than the corresponding second similarity, the parameters of the pre-trained image-text retrieval model are adjusted to obtain the trained image-text retrieval model, including: Traverse all image-text pairs to obtain the first similarity, second similarity, and third similarity for each pair. With the goal of ensuring that the first similarity of all image-text pairs is greater than the corresponding second and third similarities, adjust the parameters of the pre-trained image-text retrieval model to obtain the trained image-text retrieval model.

3. The training method for the image and text retrieval model according to claim 2, characterized in that, The step of adjusting the parameters of the pre-trained image-text retrieval model to obtain a trained image-text retrieval model, with the goal of ensuring that the first similarity of all image-text pairs is greater than the corresponding second and third similarities, includes: For any image-text pair, the first similarity of the image-text pair is subtracted from the corresponding second similarity to obtain the first difference; the second similarity of the image-text pair is subtracted from the corresponding third similarity to obtain the second difference. Add the maximum value between the negative number and zero of the first difference to the maximum value between the negative number and zero of the second difference to obtain the sum. Iterate through all image-text pairs to obtain the sum of all image-text pairs. Based on the sum of all image-text pairs, the parameters of the pre-trained image-text retrieval model are adjusted to obtain the fine-tuned image-text retrieval model; The fine-tuned image-text retrieval model is used as the trained image-text retrieval model. The process of inputting the labeled text in the image-text pair into the pre-trained image-text retrieval model is repeated to obtain the first similarity between the target image and the labeled text and the target image. This process continues until the sum of all image-text pairs reaches the minimum, thus obtaining the trained image-text retrieval model.

4. The training method for the image and text retrieval model according to claim 3, characterized in that, If the labeled text contains M attribute fields, then replacing the attribute fields in the labeled text to obtain the replaced text includes: Select m attribute fields from the M attribute fields, and mask the positions of the m attribute fields in the labeled text to obtain the masked text, where M ≥ m > 0; Using a text prediction model, the position of the mask in the masked text is predicted to obtain m-level predicted text, and all m-level predicted text are used as replacement text.

5. The training method for the image and text retrieval model according to claim 4, characterized in that, The step of subtracting the first similarity of the image-text pair from the corresponding second similarity to obtain the first difference, and subtracting the second similarity of the image-text pair from the corresponding third similarity to obtain the second difference, includes: The difference between the first similarity of the image-text pair and the second similarity corresponding to the first-level predicted text is used to obtain the first-level difference; The difference between the second similarity corresponding to the predicted text at level i and the second similarity corresponding to the predicted text at level i+1 is obtained to obtain the difference at level i+1. The difference from level i to level i+1 is determined as the first difference, where m-1≥i≥1. The second difference is obtained by subtracting the second similarity and the corresponding third similarity of the predicted text at level m. The step of adding the maximum value between the negative number and zero of the first difference to the maximum value between the negative number and zero of the second difference to obtain the summation result includes: Determine the maximum value between the negative number and zero for each difference in the first difference, and determine the maximum value between the negative number and the collar for the second difference; Add the maximum value corresponding to all differences in the first difference to the maximum value between the negative number of the second difference and the collar to obtain the sum.

6. A retrieval method, characterized in that, include: Obtain the query text, and execute the trained image and text retrieval model using the training method of the image and text retrieval model as described in any one of claims 1 to 5 to perform text encoding on the query text to obtain a semantic vector; Using the semantic vector, images corresponding to the matched image vectors are retrieved from the image database. The image database stores existing images and corresponding image vectors. The image vectors are obtained by encoding the existing images using a trained image retrieval model obtained by performing image encoding on the image retrieval model training method as described in any one of claims 1 to 5.

7. A training device for an image-text retrieval model, characterized in that, include: The pre-training module is used to acquire N image-text pairs and use the N image-text pairs to pre-train the image-text retrieval model to obtain a pre-trained image-text retrieval model. The image-text pairs include training images and corresponding labeled text, where N is an integer greater than zero. The text replacement module is used to, for any image-text pair, obtain the attribute fields representing image attributes in the annotation text of the image-text pair, replace the attribute fields in the annotation text, and obtain the replacement text. The similarity calculation module is used to input the labeled text in the image-text pair into the pre-trained image-text retrieval model to obtain the first similarity between the target image and the labeled text and the target image, and to use the pre-trained image-text retrieval model to calculate the second similarity between the replacement text and the target image; The retraining module is used to traverse all image-text pairs, obtain the first similarity and second similarity of each image-text pair, and adjust the parameters of the pre-trained image-text retrieval model with the goal that the first similarity of all image-text pairs is greater than the corresponding second similarity, so as to obtain the trained image-text retrieval model.

8. A retrieval device, characterized in that, include: The text encoding module is used to obtain query text, execute the trained image and text retrieval model obtained by the training method of the image and text retrieval model as described in any one of claims 1 to 5, and perform text encoding on the query text to obtain a semantic vector; The image retrieval module is used to retrieve images corresponding to matching image vectors from an image database using the semantic vectors. The image database stores existing images and corresponding image vectors. The image vectors are obtained by encoding the existing images using a trained image retrieval model obtained by performing image encoding on the image retrieval model training method as described in any one of claims 1 to 5.

9. A computer device, characterized in that, The computer device includes a processor, a memory, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the training method of the image and text retrieval model as described in any one of claims 1 to 5, or the retrieval method as described in claim 6.

10. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by the processor, it implements the training method of the image and text retrieval model as described in any one of claims 1 to 5, or the retrieval method as described in claim 6.