Data retrieval methods and apparatus, non-volatile storage media

By integrating and training the multimodal feature extraction model and adjusting the dynamic loss function, the problem of input modality limitation in the multimodal retrieval system is solved, enabling flexible retrieval of arbitrary modal data and high-accuracy retrieval results, and improving retrieval performance under complex modalities.

CN120104650B · Active · Publication Date: 2026-06-30 · CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date
2025-02-17
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing multimodal retrieval systems require users to provide both image and text information as query input, which limits the flexibility of retrieval, fails to meet the complex and diverse needs of users, and lacks reinforcement learning of the interaction features between targets, thus failing to meet the data retrieval needs of complex modalities.

Method used

A multimodal feature extraction model is used to extract features from the input data. By fusing and training multiple single-modal feature extraction models of different types, it supports arbitrary modalities as input. The model is trained through similarity relationship transfer theory and dynamic adjustment of the loss function to ensure efficient retrieval under complex modalities.

Benefits of technology

It enables data retrieval that supports any modality, improves the accuracy of retrieval results in complex modality retrieval scenarios, enhances the ability to understand the interaction features between targets, and improves the flexibility and accuracy of the retrieval system.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a data retrieval method and apparatus, and a non-volatile storage medium. The method includes: receiving input data, wherein the input data includes at least one of the following: single-modal data, and multimodal combined data obtained by combining multiple single-modal data; performing feature extraction processing on the input data using a multimodal feature extraction model to obtain feature information of the input data, wherein the multimodal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data; determining target data in a retrieval database whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result. This application solves the technical problem that the multimodal retrieval technology in related technologies restricts the format of input information, making it impossible to achieve retrieval of complex modalities.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and more specifically, to a data retrieval method and apparatus, and a non-volatile storage medium.

Background Technology

[0002] With the rapid development of deep learning technology, multimodal retrieval systems have emerged in the field of artificial intelligence, becoming an important tool for handling cross-media information retrieval. These systems can integrate information from multiple modalities, such as text, images, and speech, to provide users with more accurate and richer content retrieval services. For example, the emergence of Contrastive Language-Image Pre-training (CLIP) models enables systems to identify any category in an image based on text prompts, significantly improving the performance of cross-modal retrieval. However, related technologies still have significant limitations in terms of the flexibility of multimodal retrieval, understanding of inter-target interactions, and handling of long-tail tasks. For instance, most current multimodal retrieval systems require users to provide both image and text information as query input, which fails to meet the complex and diverse needs of users.

[0003] There is currently no effective solution to the above problems.

Summary of the Invention

[0004] This application provides a data retrieval method and apparatus, and a non-volatile storage medium, to at least solve the technical problem that complex modal retrieval cannot be achieved due to the limitation of input information format by multimodal retrieval technology in related technologies.

[0005] According to one aspect of the embodiments of this application, a data retrieval method is provided, comprising: receiving input data, wherein the input data includes at least one of the following: single-modal data, multimodal combined data obtained by combining multiple single-modal data; performing feature extraction processing on the input data using a multimodal feature extraction model to obtain feature information of the input data, wherein the multimodal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data; determining target data in a retrieval database whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result.
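The final retrieval step of this method can be illustrated with a minimal Python sketch. It assumes cosine similarity as the similarity measure and an in-memory dictionary as the retrieval database; the application specifies neither, and the names `cosine_similarity`, `retrieve`, and `preset_similarity` are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(
    query_feature: Sequence[float],
    database: Dict[str, Sequence[float]],
    preset_similarity: float,
) -> List[Tuple[str, float]]:
    """Return (item_id, similarity) pairs above the preset value, best first."""
    scored = [
        (item_id, cosine_similarity(query_feature, feature))
        for item_id, feature in database.items()
    ]
    # Keep only target data whose similarity is strictly greater than the preset value.
    hits = [(item_id, score) for item_id, score in scored if score > preset_similarity]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy retrieval database: item identifiers mapped to stored feature vectors.
database = {"img_001": [1.0, 0.0], "txt_007": [0.6, 0.8], "aud_042": [0.0, 1.0]}
results = retrieve([1.0, 0.0], database, preset_similarity=0.5)
```

In this toy database, only the entries whose stored features point in a direction close to the query feature pass the preset similarity value.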

[0006] Optionally, the multimodal feature extraction model is trained as follows: Training data is acquired, including: various types of unimodal data and multimodal data generated from the unimodal data. The unimodal data includes: image data, text data, and speech data. The multimodal data includes: combinations of any unimodal data. A training data set corresponding to the multimodal training task is determined, including: a first-class training data set composed of unimodal data of the same type and a second-class training data set generated from unimodal data of different types. The unimodal feature extraction model corresponding to the training data set is used to extract features from the training data set to obtain multiple classes of unimodal features, where the type of unimodal feature corresponds to the type of unimodal data. The unimodal feature extraction model is trained based on the multiple classes of unimodal features, and the weights of the multiple classes of unimodal features are updated.

[0007] Optionally, during the training of the multimodal feature extraction model, the method further includes: determining a total loss function based on multiple classes of single-modal features, and updating the weight of one class of single-modal features in the multimodal features according to the total loss function, until the total change in the total loss function is less than or equal to a preset change, at which point the training is considered complete and the multimodal feature extraction model is obtained. Here, the total change in the total loss function is the difference between the previous total loss function and the next total loss function. The previous total loss function is the total loss function determined in the Nth training iteration, and the next total loss function is the total loss function determined in the (N+1)th training iteration. When updating the weight of one class of single-modal features, the other classes of single-modal features in the multimodal features remain unchanged.

[0008] Optionally, when the multimodal training task is an image-text retrieval task that indicates the determination of retrieval results based on image data and text data, determining the training data set corresponding to the multimodal training task includes: determining a second type of training data set generated based on image data and text data as the training data set corresponding to the multimodal training task; and using a single-modal feature extraction model corresponding to the training data set to extract features from the training data set, including: using a text feature extraction model to extract features from the text data to obtain text features; using an image feature extraction model to extract features from the image data to obtain first image features; and using an image feature extraction model to extract features from a masked image to obtain second image features, wherein the masked image is obtained by masking the image data based on the text data.

[0009] Optionally, when the multimodal training task is an image and text retrieval task, the total loss function is determined based on multiple single-modal features, including: determining a first loss function based on text features and a first image feature; fusing the first image feature and the second image feature to obtain a third image feature, and determining a second loss function based on text features and the third image feature; and determining the total loss function based on the first loss function, the second loss function, and the model parameters.
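The structure of this total loss can be sketched as follows. The application does not state the form of the individual losses, the fusion operation, or how the model parameters enter the total, so everything concrete here is an assumption made for illustration: each loss is taken as one minus cosine similarity, fusion is an element-wise mean, and a single hypothetical parameter `alpha` balances the two losses.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def total_loss(
    text_feature: Sequence[float],
    first_image_feature: Sequence[float],
    second_image_feature: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Combine the first and second loss into a total loss (assumed weighted sum)."""
    # First loss: text feature vs. first image feature (assumed: 1 - cosine).
    first_loss = 1.0 - _cosine(text_feature, first_image_feature)
    # Third image feature: fusion of the full-image and masked-image features
    # (assumed: element-wise mean).
    third_image_feature: List[float] = [
        (p + q) / 2.0 for p, q in zip(first_image_feature, second_image_feature)
    ]
    # Second loss: text feature vs. the fused third image feature.
    second_loss = 1.0 - _cosine(text_feature, third_image_feature)
    # alpha is a hypothetical stand-in for "the model parameters".
    return alpha * first_loss + (1.0 - alpha) * second_loss
```

With these assumptions, the total loss is zero when the text feature points in the same direction as both image features and grows as they diverge.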

[0010] Optionally, the method further includes: determining the feature extraction result corresponding to each type of single-modal data; determining, based on the feature extraction result, the independent loss function of the single-modal feature extraction model corresponding to each type of single-modal data; determining the fusion loss function based on the multiple feature extraction results; determining a first change amount of the independent loss function and a second change amount of the fusion loss function; and, when the first change amount is less than or equal to a preset change amount and the second change amount is less than or equal to a preset change amount, triggering execution of the step of determining the total loss function based on the multiple single-modal feature extraction results.
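The trigger condition in this paragraph can be sketched as a small check. The function name `ready_for_total_loss` is hypothetical, and two points are assumptions: the change amount is taken as an absolute difference between consecutive loss values, and every independent loss (one per single-modal model) must meet the condition.

```python
from typing import Sequence


def ready_for_total_loss(
    prev_independent: Sequence[float],
    curr_independent: Sequence[float],
    prev_fusion: float,
    curr_fusion: float,
    preset_change: float,
) -> bool:
    """Return True when the independent losses and the fusion loss have stabilized."""
    if len(prev_independent) != len(curr_independent):
        raise ValueError("loss lists must cover the same single-modal models")
    # First change amount: one value per single-modal feature extraction model.
    first_changes = [abs(p - c) for p, c in zip(prev_independent, curr_independent)]
    # Second change amount: change of the fusion loss.
    second_change = abs(prev_fusion - curr_fusion)
    return (
        all(change <= preset_change for change in first_changes)
        and second_change <= preset_change
    )
```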

[0011] Optionally, the retrieval library is generated by: extracting features from the input data using different unimodal feature extraction models to obtain multiple unimodal features; and analyzing the input data using an object detection model to determine the objects to be detected contained in the input data and extracting the feature information of the objects to be detected, wherein the feature information includes at least: the boundary information of the objects to be detected; and generating the retrieval library based on the multiple unimodal features and feature information.

[0012] According to another aspect of the embodiments of this application, a training method for a multimodal feature extraction model is also provided, comprising: acquiring training data, wherein the training data includes: multiple types of unimodal data, multimodal data generated from the unimodal data, wherein the unimodal data includes: image data, text data, and speech data, and the multimodal data includes: a combination of any unimodal data; using multiple types of unimodal data as training data to perform fusion training on multiple different types of unimodal feature extraction models to obtain a multimodal feature extraction model, wherein, during the fusion training process, each unimodal feature extraction model is used to perform feature extraction processing on one type of unimodal data.

[0013] According to another aspect of the embodiments of this application, a data retrieval apparatus is also provided, comprising: a receiving module for receiving input data, wherein the input data includes at least one of the following: single-modal data, multi-modal combined data obtained by combining multiple single-modal data; a feature extraction module for performing feature extraction processing on the input data using a multi-modal feature extraction model to obtain feature information of the input data, wherein the multi-modal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data; and a retrieval module for determining target data in a retrieval database whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result.

[0014] According to another aspect of the embodiments of this application, a non-volatile storage medium is also provided, which stores a computer program, wherein the above-described data retrieval method is executed by running the computer program in the device where the non-volatile storage medium is located.

[0015] According to another aspect of the embodiments of this application, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the above-described data retrieval method through the computer program.

[0016] According to another aspect of the embodiments of this application, a computer program product is also provided, including computer instructions, which, when executed by a processor, implement the steps of the data retrieval method described above.

[0017] In this embodiment, input data is received, including at least one of the following: single-modal data, multimodal combined data obtained by combining multiple single-modal data; a multimodal feature extraction model is used to perform feature extraction processing on the input data to obtain feature information of the input data, wherein the multimodal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data; target data with a similarity greater than a preset similarity value with the feature information is determined in the retrieval database, and the target data is used as the retrieval result. By providing a multimodal retrieval model generated by fusing multiple single-modal feature extraction models, and performing data retrieval based on the multimodal retrieval model, the purpose of supporting arbitrary modality as input and performing arbitrary modality retrieval is achieved. This realizes the technical effect of providing a retrieval method that is adaptable to various retrieval tasks and improving the accuracy of retrieval results in complex modality retrieval scenarios, thereby solving the technical problem that the multimodal retrieval technology in related technologies is limited by the format of input information and cannot achieve complex modality retrieval.

Description of the Drawings

[0018] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0019] Figure 1 is a hardware structure block diagram of a computer terminal for implementing a data retrieval method according to an embodiment of this application;

[0020] Figure 2 is a flowchart of the steps of a data retrieval method according to an embodiment of this application;

[0021] Figure 3 is a schematic diagram illustrating the determination of the total loss function of a multimodal feature extraction model according to an embodiment of this application;

[0022] Figure 4 is a flowchart illustrating the creation of a retrieval database according to an embodiment of this application;

[0023] Figure 5 is a flowchart illustrating the steps of a training method for a multimodal feature extraction model according to an embodiment of this application;

[0024] Figure 6 is a schematic diagram of a training framework for a multimodal feature extraction model according to an embodiment of this application;

[0025] Figure 7 is a structural diagram of a data retrieval apparatus according to an embodiment of this application.

Detailed Implementation

[0026] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.

[0027] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0028] To better understand the embodiments of this application, the technical terms involved in the embodiments of this application are explained below:

[0029] Everything detection models are deep learning models used to detect multiple targets or objects in images, videos, or other data. These models typically use convolutional neural networks (CNNs) or other deep learning techniques to extract features from the input data and classify them, in order to determine the location and category of the objects or targets present in the data. They are widely applied in computer vision, autonomous driving, security monitoring, and other fields. Common everything detection models include You Only Look Once (YOLO) and the Faster Region-based Convolutional Neural Network (Faster R-CNN).

[0030] In related technologies, multimodal retrieval systems require users to input image and text pairs for retrieval. This input method limits the flexibility of retrieval and cannot meet the needs of data retrieval under complex modalities. Furthermore, machine learning models for data retrieval primarily focus on global and local features in image understanding, lacking reinforcement learning of interaction features between targets. This results in limited model performance in understanding relationships between targets in complex scenes, thus failing to meet the needs of data retrieval under complex modalities. To address this issue, this application provides relevant solutions, which are detailed below.

[0031] According to an embodiment of this application, a data retrieval method embodiment is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0032] The method embodiments provided in this application can be executed on mobile terminals, computer terminals, or similar computing devices. Figure 1 shows a hardware block diagram of a computer terminal for implementing the data retrieval method. As shown in Figure 1, the computer terminal 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n in the figure; a processor 102 may include, but is not limited to, a processing device such as a microcontroller unit (MCU) or a programmable logic device such as an FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, it may also include a display, an input/output (I/O) interface, a universal serial bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. Those skilled in the art will understand that the structure shown in Figure 1 is for illustrative purposes only and does not limit the structure of the aforementioned electronic device. For example, the computer terminal 10 may include more or fewer components than shown in Figure 1, or have a configuration different from that shown in Figure 1.

[0033] It should be noted that the aforementioned one or more processors 102 and / or other data processing circuits are generally referred to herein as "data processing circuits". These data processing circuits may be embodied, in whole or in part, in software, hardware, firmware, or any other combination thereof. Furthermore, the data processing circuits may be a single, independent processing module, or may be integrated, in whole or in part, into any other element within the computer terminal 10. As involved in the embodiments of this application, the data processing circuits serve as a form of processor control (e.g., selection of a variable resistor termination path connected to an interface).

[0034] The memory 104 can be used to store software programs and modules of application software, such as the program instructions / data storage device corresponding to the data retrieval method in this embodiment. The processor 102 executes various functional applications and data processing by running the software programs and modules stored in the memory 104, thereby realizing the aforementioned data retrieval method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory remotely located relative to the processor 102, and these remote memories can be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0035] The transmission device 106 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication provider of the computer terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 106 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.

[0036] The display may be, for example, a touchscreen liquid crystal display (LCD) that allows the user to interact with the user interface of the computer terminal 10.

[0037] This application provides a data retrieval method that can be applied in the above-described operating environment. Figure 2 is a flowchart of the data retrieval method provided according to the embodiments of this application. As shown in Figure 2, the method includes the following steps:

[0038] Step S202: Receive input data, wherein the input data includes at least one of the following: single-modal data, multi-modal combination data obtained by combining multiple single-modal data.

[0039] The method provided in this application supports input data of any modality. Therefore, when performing data retrieval, the input data received in step S202 can be any single type of single-modal data, such as images (including pictures and videos), text, or audio. It can also be multimodal combined data generated by combining different types of single-modal data, such as image-text pairs composed of images and text, or image-audio pairs composed of images and audio. For example, an image-text pair can take the following form: the text is "find a picture of two dogs running on the beach", and the image is a picture containing two dogs running on the beach. Since the retrieval method provided in this application supports retrieval of any modality, it can be widely applied to complex modal retrieval scenarios such as power energy retrieval.

[0040] Step S204: Use a multimodal feature extraction model to perform feature extraction processing on the input data to obtain the feature information of the input data. The multimodal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types. Each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data.

[0041] The input data serves as a retrieval request, which can be a single-task request or a batch-task request. A single-task request is one in which the input data contains only one retrieval request, while a batch-task request is one in which the input data contains multiple retrieval requests. In step S204, the multimodal feature extraction model extracts general features, and the data that best matches the retrieval request is selected based on these general features. The multimodal large model used in this embodiment is obtained by fusing and training different types of single-modal feature extraction models, that is, models used for feature extraction from different types of single-modal data. For example, when the input data is the text "Find a picture of two dogs running on a beach" and an image containing two dogs running on a beach, the multimodal feature extraction model uses a text encoder (a unimodal feature extraction model for text data) to extract features from the text, and simultaneously uses an image feature extraction model (a unimodal feature extraction model for image data) to extract features from the image. The features extracted from the text and the features extracted from the image are then aligned. Additionally, if the input data includes speech data, the multimodal feature extraction model also uses a speech encoder (a unimodal feature extraction model for speech data) for feature extraction.
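The per-modality dispatch described in this paragraph can be sketched as follows. The three encoders are stubs that only stand in for trained unimodal feature extraction models, the names `ENCODERS`, `extract_features`, and `extract_batch` are hypothetical, and the feature alignment step is deliberately omitted.

```python
from typing import Callable, Dict, List

Feature = List[float]


# Stub encoders (assumptions): each returns a toy 3-dimensional feature vector.
def encode_text(text: str) -> Feature:
    return [float(len(text)), 0.0, 0.0]


def encode_image(image: bytes) -> Feature:
    return [0.0, float(len(image)), 0.0]


def encode_speech(speech: bytes) -> Feature:
    return [0.0, 0.0, float(len(speech))]


# One unimodal feature extraction model per type of single-modal data.
ENCODERS: Dict[str, Callable] = {
    "text": encode_text,
    "image": encode_image,
    "speech": encode_speech,
}


def extract_features(request: Dict[str, object]) -> Dict[str, Feature]:
    """Encode whichever modalities are present in a single retrieval request."""
    if not request:
        raise ValueError("a retrieval request needs at least one modality")
    unknown = set(request) - set(ENCODERS)
    if unknown:
        raise ValueError(f"unsupported modalities: {sorted(unknown)}")
    return {modality: ENCODERS[modality](data) for modality, data in request.items()}


def extract_batch(requests: List[Dict[str, object]]) -> List[Dict[str, Feature]]:
    """A batch-task request is handled as a list of single-task requests."""
    return [extract_features(request) for request in requests]
```

A text-only request, an image-text pair, and a request that adds speech all go through the same entry point, which is the flexibility the paragraph describes.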

[0042] In step S204, the multimodal feature extraction model can be loaded into memory. For example, the raw data of the multimodal feature extraction model can be loaded from non-volatile memory into volatile memory so that the processor can run the multimodal feature extraction model. The raw data of the multimodal feature extraction model refers to unprocessed data, which typically includes the parameters and structural data of the multimodal feature extraction model. The structural data can be the computational relationships based on the parameters, such as the forward propagation computational relationships between intermediate layers and between neurons. Specifically, the structural data can include code related to the structure of the first neural network, such as code used to perform related calculations between intermediate layers and between neurons.

[0043] In one implementation, a region can be partitioned in memory for loading the multimodal feature extraction model, which may include a structure data storage area and a parameter storage area. The structure data storage area stores structure-related code, and the parameters referenced by it can be pointed to by pointers to the addresses of specific parameters in the parameter storage area. During the training of the multimodal feature extraction model, it may be necessary to frequently update the parameters; in this case, updating the parameter values in the parameter storage area is sufficient.
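The separation between structure data and parameters can be illustrated with a small Python sketch. Python has no raw pointers, so parameter names serve as stand-ins for addresses in the parameter storage area; the classes `ParameterStore` and `LinearLayer` and the parameter names are hypothetical.

```python
from typing import Dict, List, Sequence


class ParameterStore:
    """Parameter storage area: values addressed by name (a stand-in for an address)."""

    def __init__(self) -> None:
        self._values: Dict[str, List[float]] = {}

    def write(self, name: str, values: Sequence[float]) -> None:
        self._values[name] = list(values)

    def read(self, name: str) -> List[float]:
        return self._values[name]


class LinearLayer:
    """Structure data: the computation plus references to parameters, not the values."""

    def __init__(self, store: ParameterStore, weight_ref: str, bias_ref: str) -> None:
        self.store = store
        self.weight_ref = weight_ref
        self.bias_ref = bias_ref

    def forward(self, x: Sequence[float]) -> float:
        # Parameters are looked up at call time, so updates are picked up automatically.
        weight = self.store.read(self.weight_ref)
        bias = self.store.read(self.bias_ref)[0]
        return sum(w * xi for w, xi in zip(weight, x)) + bias


store = ParameterStore()
store.write("layer0.weight", [0.5, -1.0])
store.write("layer0.bias", [0.1])
layer = LinearLayer(store, "layer0.weight", "layer0.bias")

before = layer.forward([2.0, 1.0])
# A training update only touches the parameter storage area.
store.write("layer0.weight", [1.0, -1.0])
after = layer.forward([2.0, 1.0])
```

Because the layer holds references rather than copies, the structure code is left untouched when training rewrites a parameter.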

[0044] Optionally, the multimodal feature extraction model is trained as follows: Training data is acquired, including: various types of unimodal data and multimodal data generated from the unimodal data. The unimodal data includes: image data, text data, and speech data. The multimodal data includes: combinations of any unimodal data. A training data set corresponding to the multimodal training task is determined, including: a first-class training data set composed of unimodal data of the same type and a second-class training data set generated from unimodal data of different types. The unimodal feature extraction model corresponding to the training data set is used to extract features from the training data set to obtain multiple classes of unimodal features, where the type of unimodal feature corresponds to the type of unimodal data. The unimodal feature extraction model is trained based on the multiple classes of unimodal features, and the weights of the multiple classes of unimodal features are updated.

[0045] As mentioned in step S204, the multimodal feature extraction model in this embodiment is obtained by fusing and training different types of single-modal feature extraction models. Before the fusion training begins, a large amount of image data, text data, and speech data, as well as any combination thereof, is collected as training data. In this embodiment, the similarity relation transfer theory is used for fusion training to ensure that the trained multimodal feature extraction model can be extended to any modality. Fusion training based on this theory includes two training stages: a contrastive learning stage and a transfer learning stage. In the contrastive learning stage, the different types of unimodal feature extraction models are trained using the corresponding types of unimodal data. In the transfer learning stage, the multimodal feature extraction model is trained as a whole using different combinations of multimodal data. The training data therefore needs to be divided during fusion training into training data for the individual unimodal feature extraction models and training data for the multimodal feature extraction model. For example, the different types of unimodal data can be grouped into one dataset (i.e., the first type of training dataset) and used to train the different unimodal feature extraction models in the contrastive learning stage. Multimodal data generated from different types of unimodal data can be grouped into another dataset (i.e., the second type of training dataset) and used to train the multimodal feature extraction model in the transfer learning stage.
During training, in the contrastive learning stage, the different types of unimodal data are used to train the corresponding unimodal feature extraction models, which updates the weight that each type of unimodal feature carries when the retrieval method is executed. For example, an image feature extraction model is used to extract image features, a text feature extraction model to extract text features, and a speech feature extraction model to extract speech features. The models are then trained using a contrastive loss (the loss function of the contrastive learning stage) to minimize the distance between the features corresponding to different modalities.
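The application does not give the form of the contrastive loss. As one common choice, the sketch below uses an InfoNCE-style loss over paired features from two modalities; this is a swapped-in technique for illustration, and the function name `contrastive_loss` and the `temperature` parameter are assumptions.

```python
import math
from typing import List, Sequence


def _normalize(vector: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def contrastive_loss(
    features_a: Sequence[Sequence[float]],
    features_b: Sequence[Sequence[float]],
    temperature: float = 0.1,
) -> float:
    """InfoNCE-style loss: row i of features_a and row i of features_b are a matched pair."""
    a = [_normalize(v) for v in features_a]
    b = [_normalize(v) for v in features_b]
    total = 0.0
    for i, a_i in enumerate(a):
        # Similarity of sample i in modality A to every sample in modality B.
        logits = [sum(x * y for x, y in zip(a_i, b_j)) / temperature for b_j in b]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(value - peak) for value in logits))
        # Cross-entropy with the matched pair as the correct class.
        total += log_sum - logits[i]
    return total / len(a)


# Toy image/text features: matched pairs share a direction in the aligned case.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when matched features from the two modalities lie close together and large when they do not, which is the behavior the contrastive learning stage relies on.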

[0046] According to some optional embodiments of this application, during the training of the multimodal feature extraction model, the method further includes: determining a total loss function based on multiple types of single-modal features, and updating the weight of one type of single-modal feature among the multiple types of single-modal features based on the total loss function, until the total change in the total loss function is less than or equal to a preset change, determining that the training is complete and obtaining the multimodal feature extraction model. The total change in the total loss function is the difference between the previous total loss function and the next total loss function. The previous total loss function is the total loss function determined in the Nth training iteration, and the next total loss function is the total loss function determined in the (N+1)th training iteration. When updating the weight of one type of single-modal feature, the other types of single-modal features among the multiple types of single-modal features remain unchanged.
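A minimal sketch of the stopping rule and of the single-modality weight update described in this paragraph follows. The names `has_converged`, `train_until_converged`, and `update_one_modality` are hypothetical, the use of an absolute difference is an assumption (the source only speaks of "the difference"), and scalar weights with a plain gradient step stand in for real model parameters.

```python
def has_converged(prev_loss, next_loss, preset_change):
    # Total change = difference between the Nth and (N+1)th total loss.
    # Taking the absolute value is an assumption made for this sketch.
    return abs(prev_loss - next_loss) <= preset_change

def update_one_modality(weights, grads, trainable, lr):
    # Only the selected modality is updated; all others stay unchanged.
    return {
        name: (w - lr * grads[name] if name == trainable else w)
        for name, w in weights.items()
    }

def train_until_converged(step_fn, preset_change, max_iters=1000):
    """Call step_fn() (one training iteration returning the total loss)
    until two adjacent losses differ by at most preset_change."""
    prev = step_fn()
    for n in range(1, max_iters):
        nxt = step_fn()
        if has_converged(prev, nxt, preset_change):
            return n + 1, nxt  # iterations run, final loss
        prev = nxt
    return max_iters, prev

losses = iter([1.0, 0.5, 0.3, 0.29, 0.289])
print(train_until_converged(lambda: next(losses), preset_change=0.02))
print(update_one_modality({"image": 1.0, "text": 2.0}, {"image": 0.5, "text": 0.5}, "image", 0.1))
```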

[0047] As mentioned in the previous embodiment, when the model is trained based on the similarity relation transitivity theory, the training process is divided into a contrastive learning stage and a transitive learning stage. In this embodiment, during the transitive learning stage, training is performed using multimodal data generated by combining different types of single-modal data, and completion of training is determined by observing the degree of change of the overall loss function (i.e., the total loss function) of the multimodal feature extraction model. Specifically, the degree of change of the total loss function is represented by the difference between the total loss functions produced by two adjacent training iterations (i.e., the Nth training iteration and the (N+1)th training iteration). When this difference (i.e., the total change) is less than or equal to the preset change, continuing training has little impact on the accuracy of the model output, i.e., the model has converged, and training ends. When the difference (i.e., the total change) is greater than the preset change, the weight of one type of unimodal feature is adjusted while the weights of the other types of unimodal features remain unchanged. Adjusting the weights of the unimodal features changes the influence of each unimodal feature in the retrieval process. For example, the matching priority of the unimodal features can be defined by their weights: in the retrieval process, a preliminary retrieval result is first matched based on the unimodal feature with the higher weight; then, within the preliminary retrieval result, a further retrieval result is matched based on another unimodal feature with a lower weight. 
This continues until each unimodal feature has been used as a retrieval condition once, at which point the retrieval ends and the final retrieval result is output. As mentioned in the previous embodiment, when the weights of single-modal features are adjusted during the transitive learning stage, the weights of one type of single-modal feature are adjusted while the weights of the other types are kept fixed. For example, the image feature extraction model is selected for weight updates, while the weights of the text and speech feature extraction models remain unchanged. In this process, only the weights of image features are updated, while the weights of text and speech features remain fixed. This ensures that the accuracy of text and speech features is not affected while image feature extraction is optimized. It should be noted that, in the method provided in this application embodiment, the weights of text features are preferably kept fixed. This improves the multimodal feature extraction model's understanding of semantics and thereby the accuracy of retrieval results.
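The weight-ordered matching described above can be sketched as a cascade in which each stage narrows the previous stage's result. The function name `cascaded_retrieval`, the per-stage cutoff `keep`, and the pluggable `score_fn` are hypothetical; the source does not say how many candidates survive each stage or how matching is scored.

```python
def cascaded_retrieval(candidates, query, weights, score_fn, keep):
    """Match on the highest-weight modality first, then refine the
    surviving candidates with each lower-weight modality in turn."""
    order = sorted(weights, key=weights.get, reverse=True)
    results = list(candidates)
    for modality in order:
        if modality not in query:
            continue  # the query did not supply this modality
        results.sort(
            key=lambda c: score_fn(query[modality], c[modality]),
            reverse=True,
        )
        results = results[:keep]  # assumed cutoff per stage
    return results

# Toy data: scalar "features", similarity = negative absolute difference.
candidates = [
    {"id": "a", "text": 1.0, "image": 5.0},
    {"id": "b", "text": 1.1, "image": 1.0},
    {"id": "c", "text": 9.0, "image": 1.0},
]
query = {"text": 1.0, "image": 1.0}
weights = {"text": 0.7, "image": 0.3}
ranked = cascaded_retrieval(candidates, query, weights, lambda q, c: -abs(q - c), keep=2)
print([c["id"] for c in ranked])
```

Because text has the higher weight here, candidate "c" is dropped in the first stage even though its image feature matches the query exactly.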

[0048] According to some optional embodiments of this application, the method further includes: determining the feature extraction result corresponding to each type of single-modal data; determining the independent loss function of the single-modal feature extraction model corresponding to each type of single-modal data based on the feature extraction result; and determining the fusion loss function based on multiple feature extraction results; determining a first change amount of the independent loss function and a second change amount of the fusion loss function; and determining, if the first change amount is less than or equal to a preset change amount and the second change amount is less than or equal to a preset change amount, triggering the step of determining the total loss function based on multiple single-modal feature extraction results.

[0049] To effectively train a multimodal feature extraction model and achieve performance balance and overall optimization when processing different modalities such as images, text, and speech, this application provides a training strategy that dynamically adjusts independent loss functions and fusion loss functions. As mentioned in the above embodiments, the training process of the multimodal feature extraction model is divided into two stages: contrastive learning and transitive learning. In the training of the multimodal feature extraction model, the contrastive learning stage includes an accuracy learning stage and a similarity learning stage. In the accuracy learning stage, the accuracy of feature extraction for each single-modal feature extraction model is determined. In the alignment learning stage (i.e., the similarity learning stage), the similarity of specific modal pairs (multimodal data) is learned. After transitioning to the transitive learning stage, the similarity between other modalities and that modality is learned by fixing the weight of one modality. In order to improve the model's understanding of semantics, the method provided in this application fixes the weight of data containing semantic information (text data and speech data) in the transitive learning stage and learns the similarity between image data (a single-modal data) and data containing semantic information. In this embodiment, the transition from the contrastive learning stage to the transitive learning stage is determined by evaluating the accuracy of feature extraction by each unimodal feature extraction model and the similarity of features extracted by multiple unimodal feature extraction models. When evaluating the accuracy of feature extraction by each unimodal feature extraction model, the loss function (i.e., independent loss function) of each model is evaluated separately. 
The convergence of each unimodal feature extraction model is determined by whether the difference (i.e., the first change) between its independent loss functions in two adjacent training iterations is less than or equal to a preset change. When evaluating the similarity of features extracted by multiple unimodal feature extraction models, the semantic similarity of the multiple unimodal features (i.e., the feature extraction results corresponding to each type of unimodal data) is determined. Specifically, this can be determined by comparing the retrieval results obtained when the model performs a multimodal training task with the actual retrieval results contained in the training data. Therefore, the overall loss function of the multimodal feature extraction model can be used to evaluate the similarity of the features extracted by the multiple unimodal feature extraction models. Because the overall loss function of the multimodal feature extraction model is determined based on the independent loss functions of the multiple unimodal feature extraction models, it is also referred to in this embodiment as the fusion loss function. During training, the difference between the fusion loss functions in two adjacent iterations (i.e., the second change) is compared with the preset change to determine whether the multiple single-modal feature extraction models have converged in learning the similarity of different single-modal feature data. 
If the single-modal feature extraction models have converged in accuracy (i.e., the first change of the independent loss function is less than or equal to the preset change) and in the similarity of different modal data (i.e., the second change of the fusion loss function is less than or equal to the preset change), training of the model can enter the transitive learning stage. In the transitive learning stage, the total loss function is determined based on the multiple single-modal feature extraction models in order to continue determining whether the multimodal feature extraction model has converged.
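The transition condition between the two stages can be written as a small gate. This is a sketch under assumptions: the function name `ready_for_next_stage` is hypothetical, absolute differences are used, and separate thresholds are allowed for the first and second change even though the source may intend a single preset change.

```python
def ready_for_next_stage(indep_prev, indep_next, fusion_prev, fusion_next,
                         preset_first, preset_second=None):
    """Return True when both convergence conditions hold.

    indep_prev / indep_next: dicts mapping a modality name to the
    independent loss in two adjacent iterations.
    fusion_prev / fusion_next: fusion loss in the same two iterations.
    """
    if preset_second is None:
        preset_second = preset_first  # assume one shared preset change
    first_ok = all(
        abs(indep_prev[m] - indep_next[m]) <= preset_first for m in indep_prev
    )
    second_ok = abs(fusion_prev - fusion_next) <= preset_second
    return first_ok and second_ok

prev = {"image": 0.40, "text": 0.31, "speech": 0.52}
nxt = {"image": 0.395, "text": 0.309, "speech": 0.519}
print(ready_for_next_stage(prev, nxt, fusion_prev=0.80, fusion_next=0.795, preset_first=0.01))
```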

[0050] According to some optional embodiments of this application, when the multimodal training task is an image-text retrieval task that instructs the determination of retrieval results based on image data and text data, determining the training data set corresponding to the multimodal training task includes: determining a second type of training data set generated based on image data and text data as the training data set corresponding to the multimodal training task; and performing feature extraction on the training data set corresponding to the multimodal training task using single-modal feature extraction models includes: performing feature extraction processing on text data using a text feature extraction model to obtain text features; performing feature extraction processing on image data using an image feature extraction model to obtain first image features; and performing feature extraction processing on a masked image using the image feature extraction model to obtain second image features, wherein the masked image is obtained by masking the image data based on the text data.

[0051] To improve the semantic feature understanding ability of the multimodal feature extraction model, the method provided in this application adds other modal data processed from text data to the training data during the model training process. That is, the generation of multimodal data includes the following two methods: combining different unimodal data (in this embodiment, the multimodal feature data generated according to this combination method is denoted as the first type of multimodal feature data); and processing another type of unimodal data based on unimodal data with semantic information to extract multimodal data that conforms to semantic information (in this embodiment, the multimodal feature data generated according to this processing method is denoted as the second type of multimodal feature data). In other words, the training data includes the following types of data: unimodal data (image data, text data, etc.), the aforementioned first type of multimodal data, and the second type of multimodal data. The aforementioned unimodal data with semantic information is not limited to text data; speech data also possesses semantic information, and its semantic information can be extracted by converting speech data into text. This application embodiment uses a multimodal training scenario comparing training image data and text data as an example to illustrate the process of training a multimodal feature extraction model using the similarity relation transitivity theory. As mentioned in the above embodiments, during the model training process, the training data set corresponding to the multimodal training task is first determined, and the multimodal feature extraction model uses the training data in the training set to implement the multimodal task. 
When the multimodal training task is a text-image retrieval task that retrieves data whose matching degree with both the input image data and the text data is greater than a preset matching degree, the data contained in its corresponding training set is a data set containing multimodal data generated from image data and text data (i.e., the second type of training set). Specifically, the first type of multimodal data contained in the training set corresponding to the text-image retrieval task is a combination of text data and image data expressing the same information; the second type of multimodal data contained is a mask image generated from text data and image data. That is, the data contained in the training set corresponding to the text-image retrieval task is: text data and image data expressing the same meaning, and mask image data generated from the above text data and image data expressing the same meaning. Only the target region related to the description of the text data is retained in the mask image (data). As mentioned above, when the multimodal training task is image-text retrieval, a mask image (data) is added to the training data. This mask image is obtained by masking the image data based on the text data. The generation process includes: extracting all text-related masks using a segmentation model (e.g., Semantic-SAM); setting the pixels outside the mask to zero; and retaining only the masked portion of the image as the text-guided mask image. The text-guided mask image retains only the target information related to the text, discarding some irrelevant background noise. 
This ensures that the extracted features focus only on the masked portion of the information, thereby enhancing the target interaction information in the image and improving the text understanding ability of the multimodal feature extraction model. In this embodiment, when the multimodal feature extraction model performs image and text retrieval tasks using data from its corresponding training set, during the contrastive learning phase, each unimodal feature extraction model extracts features for unimodal data with the same modality, resulting in unimodal data features. The text feature extraction model (the unimodal feature extraction model corresponding to the text data) only extracts features for text data, and the result is text features. The image feature extraction model (the unimodal feature extraction model corresponding to the image data) only extracts features for image data. In this embodiment, both image data and masked image data belong to image data. The image feature extraction model extracts features for image data (i.e., the first image feature) and extracts features for masked image data (i.e., the second image feature).
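The zeroing step in the mask image generation can be sketched as follows. The sketch assumes the segmentation model has already produced a boolean mask for the text-related region; the function name `apply_text_guided_mask` and the nested-list grayscale image are illustrative choices, and no segmentation API is called.

```python
def apply_text_guided_mask(image, mask):
    """Keep pixels where mask is True and set all other pixels to zero.

    image: rows of pixel values (grayscale here for simplicity).
    mask:  rows of booleans with the same shape, assumed to come from a
           text-prompted segmentation model.
    """
    return [
        [pixel if keep else 0 for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]

image = [[5, 6, 7],
         [8, 9, 10]]
mask = [[True, False, False],
        [True, True, False]]
print(apply_text_guided_mask(image, mask))
```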

[0052] Optionally, when the multimodal training task is an image and text retrieval task, the total loss function is determined based on multiple single-modal features, including: determining a first loss function based on text features and a first image feature; fusing the first image feature and the second image feature to obtain a third image feature, and determining a second loss function based on text features and the third image feature; and determining the total loss function based on the first loss function, the second loss function, and the model parameters.

[0053] Next, taking the multimodal training scenario in which image data and text data are trained contrastively as an example, the method for determining and updating the overall loss function (i.e., the total loss function) of the multimodal feature extraction model during training under the similarity relation transitivity theory is described. As mentioned in the above embodiments, the total loss function of the multimodal feature extraction model is determined jointly from the loss functions of the different single-modal features. Figure 3 is a schematic diagram illustrating the determination of the total loss function for a multimodal feature extraction model. As shown in Figure 3, in the scenario where the multimodal feature extraction model performs an image-text retrieval task, the total loss function can be determined by the following formula: Loss = Loss1 + a * Loss2, where Loss represents the total loss function; Loss1 is the loss calculated by aligning the text features (T) with the features (F1) of the original image data (i.e., the first loss function); Loss2 is the loss calculated by aligning the text features with the image features (F2, i.e., the third image features) obtained by fusing the features of the masked image (data) with the features (F1) of the original image data (i.e., the second loss function); and 'a' is a learnable parameter used to balance the contributions of the two loss terms to the total loss function, whose value can be updated during model training. (The formula for the first loss function is not reproduced in this text.) In the first loss function, S_i denotes the features of the original image data (equivalent to F1), S_j denotes the text features (equivalent to T), N is the number of training samples, and i indexes the training data used to calculate Loss1; σ is a hyperparameter that is adjusted during training to balance the influence of different unimodal features on the output. (The formula for the second loss function is not reproduced in this text.) 
In the second loss function, k indexes the training data used to calculate Loss2, and S_k denotes the features extracted from the masked image data (equivalent to F2). The method provided in this embodiment uses masked images to assist text-image matching during the training phase, allowing the model to focus more on text-related features and improve text understanding. During the inference phase (i.e., when the multimodal feature extraction model is applied), however, the masked image input is removed, so that the large multimodal model gains improved overall text understanding without any increase in inference time, which improves retrieval performance for user-input text in the retrieval system.
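The combination Loss = Loss1 + a * Loss2 can be illustrated with a small sketch. Because the formulas for Loss1 and Loss2 are not available in this text, the alignment loss below (mean of one minus cosine similarity) is only a stand-in, and the element-wise average used to fuse the original and masked image features is likewise an assumption; `align_loss`, `fuse`, and `total_loss` are hypothetical names.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def align_loss(text_feats, image_feats):
    # Stand-in alignment loss: 0 when every matched pair is parallel.
    pairs = list(zip(text_feats, image_feats))
    return sum(1.0 - cosine(t, f) for t, f in pairs) / len(pairs)

def fuse(original_feats, masked_feats):
    # Assumed fusion operator: element-wise average of the two features.
    return [
        [(x + y) / 2.0 for x, y in zip(orig, masked)]
        for orig, masked in zip(original_feats, masked_feats)
    ]

def total_loss(text_feats, original_feats, masked_feats, a):
    loss1 = align_loss(text_feats, original_feats)                      # T vs F1
    loss2 = align_loss(text_feats, fuse(original_feats, masked_feats))  # T vs fused
    return loss1 + a * loss2

T = [[1.0, 0.0]]
F1 = [[0.0, 1.0]]
F_masked = [[0.0, 1.0]]
print(total_loss(T, F1, F_masked, a=0.5))
```

In a real training setup `a` would be a learnable parameter updated together with the model weights; here it is passed in as a plain number.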

[0054] Step S206: Determine the target data in the retrieval database whose similarity to the feature information is greater than a preset similarity value, and use the target data as the retrieval result.

[0055] In this embodiment, after feature extraction by the multimodal feature extraction model, the data in the retrieval library is sorted according to its relevance to the retrieval request, and the best-matching data is displayed first as the retrieval result. The returned data can be text and image content or other types of data (determined according to the retrieval request), which improves retrieval accuracy and user satisfaction. The retrieval library stores quantified image features, text features, and, where present, speech features. The relevance between the data in the retrieval library and the retrieval request can be determined by calculating the similarity between that data and the feature information extracted by the multimodal feature extraction model; the higher the similarity value, the higher the relevance. During sorting, either only the (target) data whose similarity to the extracted feature information is greater than a preset similarity value is sorted, or all data in the retrieval library is sorted. The preset similarity value can be adjusted according to actual needs to balance the precision and recall of the retrieval.
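Step S206 and the sorting described above amount to thresholding and ranking by similarity. The sketch below assumes that the stored features and the query feature are L2-normalized, so that a dot product equals cosine similarity; the source does not name the similarity measure, and `retrieve` and the dictionary-based library are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query_feat, library, preset_similarity):
    """Return (key, similarity) pairs whose similarity exceeds the preset
    value, ordered from most to least similar."""
    scored = [(dot(query_feat, feat), key) for key, feat in library.items()]
    hits = [(score, key) for score, key in scored if score > preset_similarity]
    hits.sort(key=lambda h: h[0], reverse=True)
    return [(key, score) for score, key in hits]

library = {
    "cat": [1.0, 0.0],
    "dog": [0.8, 0.6],
    "car": [0.0, 1.0],
}
print(retrieve([1.0, 0.0], library, preset_similarity=0.5))
```

Raising `preset_similarity` returns fewer, closer matches, and lowering it returns more candidates, which is the precision and recall trade-off mentioned above.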

[0056] Optionally, the retrieval library is generated by: extracting features from the input data using different unimodal feature extraction models to obtain multiple unimodal features; and analyzing the input data using an object detection model to determine the objects to be detected contained in the input data and extracting the feature information of the objects to be detected, wherein the feature information includes at least: the boundary information of the objects to be detected; and generating the retrieval library based on the multiple unimodal features and feature information.

[0057] The retrieval library mentioned in step S206 is also constructed from input data. The retrieval library can be created while the system or device executing the data retrieval method provided in this application is offline (it can also be created while online). To support retrieval of any target, a multi-object detection model and single-modal feature extraction models are used jointly in the library-loading stage that generates the retrieval library. Figure 4 is a flowchart for creating the retrieval library. As shown in Figure 4, after the input data is received, the object detection model and the task-specific detection models M1 and M2 (i.e., unimodal feature extraction models) each process the input data and produce their respective detection results (boxes1 and boxes2). Boxes1 is the set of target detection boxes output by the object detection model for the input data, with each box containing one object to be detected from the input data. Boxes2 is the set of target detection boxes output by the unimodal feature extraction models for the input data. The overlap rate of the target detection boxes in boxes1 and boxes2 is compared to determine whether features have been extracted correctly from the input data. The overlap rate of any two target detection boxes can be determined based on the feature information of the target detection boxes in boxes2 (e.g., the coordinates of the bounding box (i.e., boundary information) and the size of the target detection box) and the feature information of the target detection boxes in boxes1 (i.e., unimodal features). If the overlap rate is greater than a preset overlap rate, the feature extraction is considered correct and the target detection box is retained. If a target detection box in boxes2 is not in boxes1, that target detection box and its corresponding feature information are added, ultimately yielding all the target detection boxes for the image. 
Furthermore, when image data is present in the input data, the mask region of the image is determined based on the image data and the other types of data in the input data. The intersection of the mask region and the overlap-rate filtering results (boxes) is taken to obtain the target image. For example, for a human body detection box, after merging with the mask region only the pixel values of the human body remain, and the pixel values of the pixels outside the human body are 0. Using this target image (the target image is the data, with feature information, stored in the retrieval library) reduces the noise introduced by background pixels and increases the focus of the features on the target, thereby improving the intra-class similarity score in retrieval.
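The overlap-rate comparison between boxes1 and boxes2 can be sketched with intersection over union (IoU). Treating the overlap rate as IoU, representing a box as `(x1, y1, x2, y2)`, and the merge policy in `merge_boxes` (keep boxes1, then add any box from boxes2 that matches nothing in boxes1) are one reading of the paragraph above rather than a confirmed implementation.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def merge_boxes(boxes1, boxes2, preset_overlap):
    """Keep every box in boxes1 and add each box from boxes2 whose overlap
    with all boxes in boxes1 is at or below the preset overlap rate."""
    merged = list(boxes1)
    for candidate in boxes2:
        if not any(iou(kept, candidate) > preset_overlap for kept in boxes1):
            merged.append(candidate)
    return merged

boxes1 = [(0, 0, 10, 10)]
boxes2 = [(1, 1, 10, 10), (20, 20, 30, 30)]
print(merge_boxes(boxes1, boxes2, preset_overlap=0.5))
```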

[0058] Through the above steps, the system can support user input in any modality, including voice, images, and text, and enable retrieval of data in any modality. Furthermore, for retrieved visual information, it can output a complete image or a target image, along with target information within the image: coordinates and category, for use in downstream tasks.

[0059] The data retrieval method provided in this application can be applied to various scenarios, including image search (inputting an image and retrieving images with similar content), text search (inputting a text description and retrieving images semantically related to the text description), text + image search (inputting both text and an image and retrieving images related to both), video search (inputting a video or video description and retrieving video clips or videos related to the input), and multi-image search (inputting multiple images and retrieving images or image sets related to these image sets). For single-task requests or batch task requests, a multimodal feature extraction model extracts common features and intelligently sorts them according to the relevance between the user's intent and the query results, prioritizing the display of the most matching text and image content to improve retrieval accuracy and user satisfaction. In the scenario of text-to-image search: It searches for images that match the semantics of the text, extracts features from the text, projects them onto the image feature space, returns the image with the highest feature similarity, and provides a similarity score; it is suitable for precise retrieval by name and cross-modal retrieval. In image search scenarios: The system finds a set of images semantically similar to the searched image in a self-built image library and assigns a similarity score (considering features such as the scenario and specific target). This is applicable to various scenarios including similar image searching, related scene searching, and similar target searching. 
Search results are deduplicated and merged according to the following rules: results with the same image name are merged as targets of that image (including the whole image and small targets); the targets within an image are sorted in descending order of similarity score, and the highest target score is used as the image's score; the images are then sorted in descending order of their scores. The retrieval process for other retrieval modalities is the same as that for text search or image search.
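The deduplication and merging rules can be sketched as a group-then-sort operation. The hit format `(image_name, target_id, score)` and the function name `merge_results` are hypothetical.

```python
def merge_results(hits):
    """Group hits by image name, rank targets within each image, score
    each image by its best target, then rank the images."""
    grouped = {}
    for image_name, target_id, score in hits:
        grouped.setdefault(image_name, []).append((target_id, score))

    merged = []
    for image_name, targets in grouped.items():
        targets.sort(key=lambda t: t[1], reverse=True)
        merged.append({
            "image": image_name,
            "score": targets[0][1],  # best target score represents the image
            "targets": targets,
        })
    merged.sort(key=lambda m: m["score"], reverse=True)
    return merged

hits = [
    ("a.jpg", "whole", 0.6),
    ("b.jpg", "person", 0.9),
    ("a.jpg", "dog", 0.8),
]
for entry in merge_results(hits):
    print(entry)
```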

[0060] Figure 5 is a flowchart illustrating the steps of training a multimodal feature extraction model according to embodiments of this application. As shown in Figure 5, the method includes the following steps:

[0061] Step S502: Obtain training data, wherein the training data includes: various types of unimodal data and multimodal data generated from the unimodal data. The unimodal data includes: image data, text data, and speech data. The multimodal data includes: any combination of unimodal data.

[0062] In step S502, training data for the model is acquired. In this embodiment, the multimodal feature extraction model is obtained by fusing and training different types of single-modal feature extraction models. Before the fusion training begins, a large amount of image data, text data, and speech data, as well as any combination thereof, are collected as training data. In this embodiment, similarity relation transitivity theory is used for fusion training to ensure that the trained multimodal feature extraction model can be extended to any modality.

[0063] Step S504: Multiple types of single-modal data are used as training data to fuse and train multiple single-modal feature extraction models of different types to obtain a multimodal feature extraction model. During the fusion training process, each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data.

[0064] In step S504, the model is trained according to the similarity relation transitivity theory. Figure 6 is a schematic diagram of the training framework for the multimodal feature extraction model. As shown in Figure 6, the model is trained using similarity transitivity theory so that it can be extended to any modality. For example, after the similarity between images and text has been trained contrastively, the text weights are fixed and the similarity between speech and text is trained; through similarity transitivity, the model thereby also learns the similarity between image and speech features. As shown in Figure 6, a mask image is added to each training image-text pair. The mask image generation process is as follows: based on the text, a segmentation model (such as SAM) extracts all text-related masks, the pixels outside the masks are set to zero, and only the masked portion of the image is retained as the text-guided mask image. The text-guided mask image retains only the target information related to the text and discards irrelevant background noise, so that the extracted features focus only on the information in the masked portion, thereby enhancing the target interaction information in the image and improving text understanding capabilities.

[0065] Through the above steps, during the training phase of the multimodal feature extraction model, a step is added based on the feature extraction results of input data from other modalities filtered by text. This achieves supervised learning of the features of the large multimodal model, improves the ability to understand multi-objective spatial interactions between text and images, and thus enhances the ability to understand multi-objective text descriptions during retrieval, thereby improving retrieval accuracy. Furthermore, training the large multimodal model using similarity relation transitivity theory can also achieve the technical effect of extending the model to any modality.

[0066] Figure 7 is a structural diagram of a data retrieval device provided according to an embodiment of this application. As shown in Figure 7, the device includes: a receiving module 70 for receiving input data, wherein the input data includes at least one of the following: single-modal data, multi-modal combined data obtained by combining multiple single-modal data; a feature extraction module 72 for performing feature extraction processing on the input data using a multi-modal feature extraction model to obtain feature information of the input data, wherein the multi-modal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data; and a retrieval module 74 for determining target data in the retrieval database whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result.

[0067] When using a data retrieval device for retrieval, the receiving module 70 receives input data. Since the method provided in this embodiment supports retrieval of any modality, the input data can be single-type single-modality data or multi-modality combined data obtained by combining multiple types of single-modality data. The receiving module 70 transmits the input data to the feature extraction module 72, which calls the multi-modal feature extraction model to perform feature extraction processing on the input data to obtain the feature information of the input data. The feature extraction module transmits the extracted feature information to the retrieval module 74, which retrieves data that matches the feature information from the retrieval database as the retrieval result.
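The data flow between the three modules can be sketched as a thin wrapper class. The class name `DataRetrievalDevice`, the dictionary input format (modality name mapped to payload), and the injected callables standing in for the feature extraction model and the retrieval step are all hypothetical.

```python
class DataRetrievalDevice:
    """Illustrative wiring of receiving module 70, feature extraction
    module 72, and retrieval module 74."""

    def __init__(self, extract_features, retrieve):
        self.extract_features = extract_features  # stands in for module 72
        self.retrieve = retrieve                  # stands in for module 74

    def receive(self, input_data):
        # Module 70: accept single-modal data or any combination of modalities.
        if not input_data:
            raise ValueError("input data must contain at least one modality")
        return dict(input_data)

    def run(self, input_data):
        data = self.receive(input_data)
        feature_info = self.extract_features(data)
        return self.retrieve(feature_info)

# Toy stand-ins: "features" are the sorted modality names, and retrieval
# simply echoes them back.
device = DataRetrievalDevice(
    extract_features=lambda data: sorted(data),
    retrieve=lambda feature_info: feature_info,
)
print(device.run({"text": "a red car", "image": "car.jpg"}))
```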

[0068] It should be noted that, for preferred implementations of the embodiment shown in Figure 7, reference may be made to the relevant description of the embodiment shown in Figure 2, which is not repeated here.

[0069] This application also provides a non-volatile storage medium storing a computer program, wherein the above data retrieval method is executed by running the computer program on the device where the non-volatile storage medium is located.

[0070] The aforementioned non-volatile storage medium is used to store a program that performs the following functions: receiving input data, wherein the input data includes at least one of the following: single-modal data, multi-modal combined data obtained by combining multiple single-modal data; performing feature extraction processing on the input data using a multi-modal feature extraction model to obtain feature information of the input data, wherein the multi-modal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to perform feature extraction processing on one type of single-modal data; determining target data in the retrieval database whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result.

[0071] This application also provides an electronic device, including a memory and a processor. The memory stores a computer program, and the processor is configured to execute the above-described data retrieval method through the computer program.

[0072] The processor in the aforementioned electronic device is used to run a program that performs the following functions: receiving input data, wherein the input data includes at least one of the following: unimodal data, multimodal combined data obtained by combining multiple unimodal data; performing feature extraction processing on the input data using a multimodal feature extraction model to obtain feature information of the input data, wherein the multimodal feature extraction model is obtained by fusing and training multiple different types of unimodal feature extraction models, and each unimodal feature extraction model is used to perform feature extraction processing on one type of unimodal data; determining target data in the retrieval database whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result.

[0073] This application also provides a computer program product, including computer instructions, which, when executed by a processor, implement the steps of the above-described data retrieval method.

[0074] It should be noted that each module in the above data retrieval device can be a program module (e.g., a set of program instructions to implement a certain function) or a hardware module. For the latter, it can be manifested in the following forms, but is not limited to them: each of the above modules is manifested as a processor, or the functions of each of the above modules are implemented by a processor.

[0075] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0076] In the above embodiments of this application, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0077] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.

[0078] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0079] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0080] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to related technologies, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0081] The above description is only a preferred embodiment of this application. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of this application, and these improvements and modifications should also be considered within the scope of protection of this application.

Claims

1. A data retrieval method, characterized in that it comprises: receiving input data, wherein the input data includes at least one of the following: single-modal data, and multi-modal combined data obtained by combining multiple single-modal data; performing feature extraction on the input data using a multimodal feature extraction model to obtain feature information of the input data, wherein the multimodal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to extract features from one type of single-modal data; wherein, when the multimodal training task is a text and image retrieval task, the training set used to train the multimodal feature extraction model includes: text data and image data that express the same meaning, and mask image data generated based on the text data and image data that express the same meaning, wherein the mask image data represents a mask image that retains only the target region related to the description of the text data; training the multimodal feature extraction model includes: determining whether the multimodal feature extraction model has converged based on the total loss function, provided that the multiple single-modal feature extraction models have converged in accuracy and in semantic similarity across different modal data; determining whether the multimodal feature extraction model has converged based on the total loss function includes: when the total change in the total loss function is greater than a preset change, updating the weights of one class of single-modal features while keeping the weights of the other classes of single-modal features unchanged, wherein the other classes of single-modal features include text features; and determining that the multimodal feature extraction model has converged when the total change is less than or equal to the preset change; wherein the total loss function is determined by the following method: determining a first loss function based on text features and a first image feature, wherein the text features are obtained by feature extraction from the text data, and the first image feature is obtained by feature extraction from the image data; fusing the first image feature and a second image feature to obtain a third image feature, and determining a second loss function based on the text features and the third image feature, wherein the second image feature is obtained by feature extraction from the mask image; and determining the total loss function based on the first loss function, the second loss function, and the model parameters; and determining, in the retrieval database, target data whose similarity to the feature information is greater than a preset similarity value, and using the target data as the retrieval result.
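
For illustration only, the composition of the total loss function recited in claim 1 may be sketched as below. The claim does not specify the form of either loss, the fusion operation, or how the model parameters enter the total; cosine distance, a weighted average, and an L2 penalty are assumptions chosen for this sketch, as are all function and parameter names.

```python
import math
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed loss form: 1 - cosine similarity (not specified by the claim).
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)


def fuse(first_image: Sequence[float], second_image: Sequence[float],
         alpha: float = 0.5) -> List[float]:
    # Assumed fusion: weighted average of the full-image and mask-image features.
    return [alpha * p + (1.0 - alpha) * q for p, q in zip(first_image, second_image)]


def total_loss(
    text: Sequence[float],
    first_image: Sequence[float],    # features of the image data
    second_image: Sequence[float],   # features of the mask image
    model_parameters: Sequence[float],
    regularization: float = 0.0,     # assumed role of the model parameters
) -> float:
    first_loss = cosine_distance(text, first_image)
    third_image = fuse(first_image, second_image)
    second_loss = cosine_distance(text, third_image)
    penalty = regularization * sum(p * p for p in model_parameters)
    return first_loss + second_loss + penalty


aligned = total_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [])
misaligned = total_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [2.0], regularization=0.1)
```

In the toy values, the second call has a full-image feature orthogonal to the text feature, so the first loss is large, while fusing in the mask-image feature pulls the third image feature closer to the text and yields a smaller second loss.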

2. The method according to claim 1, characterized in that the multimodal feature extraction model is trained in the following way: acquiring training data, wherein the training data includes multiple types of unimodal data and multimodal data generated based on the unimodal data, the unimodal data includes image data, text data, and speech data, and the multimodal data includes any combination of the unimodal data; determining a training data set corresponding to the multimodal training task, wherein the training data set includes: a first type of training data set composed of single-modal data of the same type, and a second type of training data set generated from single-modal data of different types; using the single-modal feature extraction model corresponding to the training data set to extract features from the training data set to obtain multiple types of single-modal features, wherein the type of each single-modal feature corresponds to the type of the single-modal data; and training the single-modal feature extraction model based on the multiple types of single-modal features, and updating the weights of the multiple types of single-modal features.

3. The method according to claim 2, characterized in that training the multimodal feature extraction model further includes: determining the total loss function based on the multiple types of single-modal features, and updating the weight of one type of single-modal feature among the multiple types of single-modal features according to the total loss function until the total change of the total loss function is less than or equal to a preset change, at which point the training is considered complete and the multimodal feature extraction model is obtained; wherein the total change of the total loss function is the difference between the previous total loss function and the next total loss function, the previous total loss function is the total loss function determined in the Nth training iteration, and the next total loss function is the total loss function determined in the (N+1)th training iteration.
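
The stopping rule of claim 3 may be sketched as the following loop, given purely as an example. The `update` and `total_loss` callables, the dictionary of per-class weights, the use of the absolute value of the difference, and the `max_iterations` guard are assumptions of this sketch rather than features recited in the claim.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, List[float]]


def train_until_converged(
    weights: Weights,
    trainable_class: str,
    total_loss: Callable[[Weights], float],
    update: Callable[[List[float]], List[float]],
    preset_change: float,
    max_iterations: int = 1000,
) -> Tuple[Weights, int]:
    current = {name: list(values) for name, values in weights.items()}
    previous_loss = total_loss(current)  # total loss of iteration N
    for iteration in range(1, max_iterations + 1):
        # Only the chosen class is updated; every other class stays frozen.
        current[trainable_class] = update(current[trainable_class])
        next_loss = total_loss(current)  # total loss of iteration N + 1
        total_change = abs(previous_loss - next_loss)
        if total_change <= preset_change:
            return current, iteration  # considered converged
        previous_loss = next_loss
    raise RuntimeError("total loss did not converge within max_iterations")


# Toy example: the loss depends only on the image weight, which is halved each step.
initial = {"text": [1.0], "image": [4.0]}
final, iterations = train_until_converged(
    initial,
    trainable_class="image",
    total_loss=lambda w: w["image"][0] ** 2,
    update=lambda values: [v * 0.5 for v in values],
    preset_change=0.1,
)
```

In the toy run the text weights are never touched, mirroring the requirement in claim 1 that the other classes of single-modal features, including text features, remain unchanged while one class is updated.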

4. The method according to claim 2, characterized in that, in the case where the multimodal training task is an image-text retrieval task indicating that retrieval results are to be determined based on the image data and the text data, determining the training data set corresponding to the multimodal training task includes: determining the second type of training data set generated based on the image data and the text data as the training data set corresponding to the multimodal training task; and using the single-modal feature extraction model corresponding to the training data set to extract features from the training data set includes: using a text feature extraction model to extract features from the text data to obtain text features; using an image feature extraction model to extract features from the image data to obtain first image features; and using the image feature extraction model to extract features from a masked image to obtain second image features, wherein the masked image is obtained by masking the image data based on the text data.
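
As a hedged illustration of the masked image referred to in claim 4, the sketch below keeps only a rectangular target region and blanks everything else. How the target region is derived from the text data is not described in this claim, so the region is assumed to be supplied as a bounding box; the nested-list image representation and the function name are likewise assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def mask_outside_region(image: Image, box: Box, fill: int = 0) -> Image:
    # Retain pixels inside the target region and replace all others with `fill`.
    top, left, bottom, right = box
    return [
        [
            pixel if top <= r < bottom and left <= c < right else fill
            for c, pixel in enumerate(row)
        ]
        for r, row in enumerate(image)
    ]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
masked = mask_outside_region(image, box=(0, 1, 2, 3))
```

The masked image has the same dimensions as the original, so the same image feature extraction model can be applied to both to obtain the first and second image features.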

5. The method according to claim 3, characterized in that the method further includes: determining the feature extraction result corresponding to each class of the single-modal data, determining the independent loss function of the single-modal feature extraction model corresponding to each class of the single-modal data based on the feature extraction result, and determining the fusion loss function based on multiple feature extraction results; and determining a first change amount of the independent loss function and a second change amount of the fusion loss function, and, if the first change amount is less than or equal to a preset change amount and the second change amount is less than or equal to the preset change amount, triggering the step of determining the total loss function based on the multiple types of single-modal features.
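
The trigger condition of claim 5 may be expressed, as one possible sketch, by the check below. Treating the first change amount as one value per single-modal class and comparing absolute values are assumptions of this example, as is the function name.

```python
from typing import Dict


def ready_for_total_loss(
    independent_changes: Dict[str, float],  # first change amount, per modality class
    fusion_change: float,                   # second change amount
    preset_change: float,
) -> bool:
    # The total-loss stage starts only once every independent loss and the
    # fusion loss have each changed by no more than the preset amount.
    independent_ok = all(
        abs(change) <= preset_change for change in independent_changes.values()
    )
    return independent_ok and abs(fusion_change) <= preset_change


still_training = ready_for_total_loss({"text": 0.001, "image": 0.2}, 0.001, 0.01)
stabilized = ready_for_total_loss({"text": 0.001, "image": 0.002}, 0.005, 0.01)
```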

6. The method according to claim 1, characterized in that the retrieval database is generated in the following way: using different unimodal feature extraction models to extract features from the input data to obtain multiple unimodal features; analyzing the input data using an object detection model to determine the objects to be detected contained in the input data, and extracting feature information of the objects to be detected, wherein the feature information includes at least boundary information of the objects to be detected; and generating the retrieval database based on the multiple unimodal features and the feature information.
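
One way to picture an entry of the retrieval database described in claim 6 is sketched below. The `DatabaseEntry` and `DetectedObject` types, the bounding-box representation of the boundary information, and the toy extractor and detector callables are all assumptions made for illustration; the claim does not prescribe a storage layout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

BoundingBox = Tuple[float, float, float, float]  # assumed form of boundary information


@dataclass
class DetectedObject:
    label: str
    boundary: BoundingBox


@dataclass
class DatabaseEntry:
    item_id: str
    unimodal_features: Dict[str, List[float]]
    objects: List[DetectedObject] = field(default_factory=list)


def build_entry(
    item_id: str,
    item: Dict[str, object],
    extractors: Dict[str, Callable[[object], List[float]]],
    detector: Callable[[Dict[str, object]], List[DetectedObject]],
) -> DatabaseEntry:
    # One feature vector per modality that has a matching unimodal extractor.
    features = {
        modality: extractors[modality](value)
        for modality, value in item.items()
        if modality in extractors
    }
    # The object detection model contributes per-object boundary information.
    return DatabaseEntry(item_id, features, detector(item))


# Toy stand-ins for a text extractor and an object detection model.
extractors = {"text": lambda text: [float(len(text))]}
detector = lambda item: [DetectedObject("car", (0.0, 0.0, 10.0, 5.0))]
entry = build_entry("img-1", {"text": "red car", "image": "raw-bytes"}, extractors, detector)
```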

7. A training method for a multimodal feature extraction model, characterized in that it comprises: acquiring training data, wherein the training data includes multiple types of unimodal data and multimodal data generated based on the unimodal data, the unimodal data includes image data, text data, and speech data, and the multimodal data includes any combination of the unimodal data; and obtaining the multimodal feature extraction model by fusing and training multiple single-modal feature extraction models of different types using the various types of single-modal data as training data, wherein, during the fusing and training process, each single-modal feature extraction model is used to extract features from one type of single-modal data; wherein, in the case of a text-image retrieval task, the training set used to train the multimodal feature extraction model includes: text data and image data expressing the same meaning, and masked image data generated from the text data and image data expressing the same meaning, wherein the masked image data represents a masked image that retains only the target region related to the description of the text data; training the multimodal feature extraction model includes: determining whether the multimodal feature extraction model has converged based on a total loss function, provided that the multiple single-modal feature extraction models have converged in accuracy and in semantic similarity across different modalities; determining whether the multimodal feature extraction model has converged based on the total loss function includes: when the total change in the total loss function is greater than a preset change, updating the weights of one class of single-modal features while keeping the weights of the other classes of single-modal features unchanged, wherein the other classes of single-modal features include text features; and determining that the multimodal feature extraction model has converged when the total change is less than or equal to the preset change; wherein the total loss function is determined by: determining a first loss function based on text features and a first image feature, wherein the text features are obtained by feature extraction from the text data and the first image feature is obtained by feature extraction from the image data; fusing the first image feature and a second image feature to obtain a third image feature, and determining a second loss function based on the text features and the third image feature, wherein the second image feature is obtained by feature extraction from the masked image; and determining the total loss function based on the first loss function, the second loss function, and the model parameters.

8. A data retrieval device, characterized in that it comprises: a receiving module configured to receive input data, wherein the input data includes at least one of the following: single-modal data, and multi-modal combined data obtained by combining multiple single-modal data; a feature extraction module configured to perform feature extraction processing on the input data using a multimodal feature extraction model to obtain feature information of the input data, wherein the multimodal feature extraction model is obtained by fusing and training multiple single-modal feature extraction models of different types, and each single-modal feature extraction model is used to extract features from one type of single-modal data; wherein, when the multimodal training task is an image-text retrieval task, the training set used to train the multimodal feature extraction model includes: text data and image data expressing the same meaning, and masked image data generated based on the text data and image data expressing the same meaning, wherein the masked image data represents a masked image that retains only the target region related to the description of the text data; training the multimodal feature extraction model includes: determining whether the multimodal feature extraction model has converged based on a total loss function, provided that the multiple single-modal feature extraction models have converged in accuracy and in semantic similarity across different modalities; determining whether the multimodal feature extraction model has converged includes: when the total change in the total loss function is greater than a preset change, updating the weights of one class of single-modal features while keeping the weights of the other classes of single-modal features unchanged, wherein the other classes of single-modal features include text features; and determining that the multimodal feature extraction model has converged when the total change is less than or equal to the preset change; wherein the total loss function is determined by the following method: determining a first loss function based on text features and a first image feature, wherein the text features are obtained by feature extraction from the text data, and the first image feature is obtained by feature extraction from the image data; fusing the first image feature and a second image feature to obtain a third image feature, and determining a second loss function based on the text features and the third image feature, wherein the second image feature is obtained by feature extraction from the masked image; and determining the total loss function based on the first loss function, the second loss function, and the model parameters; and a retrieval module configured to determine target data in the retrieval database whose similarity to the feature information is greater than a preset similarity value, and to use the target data as the retrieval result.

9. A non-volatile storage medium, characterized in that the non-volatile storage medium stores a computer program, wherein the device in which the non-volatile storage medium is located executes the data retrieval method according to any one of claims 1 to 6 by running the computer program.

10. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to execute the data retrieval method according to any one of claims 1 to 6 through the computer program.

11. A computer program product comprising computer instructions, characterized in that, when the computer instructions are executed by a processor, they implement the steps of the data retrieval method according to any one of claims 1 to 6.