Data retrieval method, retrieval model training method, and related device

By combining global and local semantic identifiers, the method addresses the high latency and low accuracy of traditional cross-modal retrieval methods on large-scale corpora and achieves efficient, accurate multimodal data retrieval.

WO2026130331A1 · PCT designated stage · Publication Date: 2026-06-25 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2025-12-16
Publication Date
2026-06-25


Abstract

The present application relates to the field of AI, and specifically relates to a data retrieval method, a retrieval model training method, and a related device. The method comprises: a retrieval apparatus acquiring first retrieved text; inputting the first retrieved text into a retrieval model for processing, so as to obtain a first semantic identifier, wherein the first semantic identifier comprises a first global semantic identifier and a first local semantic identifier; and the retrieval apparatus determining a first target retrieval object on the basis of the first semantic identifier and a correspondence table between a retrieval object and a semantic identifier, wherein a correspondence exists between the first target retrieval object and the first semantic identifier. The solution of the present application is beneficial to improving retrieval accuracy and retrieval efficiency.

Description

Data retrieval methods, retrieval model training methods and related equipment

[0001] This application claims priority to Chinese Patent Application No. 202411889542.9, filed on December 19, 2024, entitled "Data Retrieval Method, Retrieval Model Training Method and Related Equipment", the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This application relates to the field of artificial intelligence (AI), and in particular to a data retrieval method, a retrieval model training method, and related equipment.

Background Technology

[0003] A retrieval task refers to the process of finding relevant data that meets given query criteria within a database or dataset. In cross-modal retrieval tasks, the input and output involve different modalities (such as text, images, audio, and video).

[0004] Traditional cross-modal retrieval methods tend to use single-tower or dual-tower models. Single-tower models perform fine-grained interactions between queries and candidates within a unified module, offering excellent retrieval accuracy, but they require real-time calculation of the match between queries and candidates, resulting in high latency and making them unsuitable for large-scale corpora. Dual-tower models map different modalities to a joint embedding space using two encoders, and the encodings of candidate data can be pre-computed and cached, improving retrieval efficiency compared to single-tower models. However, due to the differences between modalities, dual-tower models often struggle to effectively align multimodal data, thus reducing the accuracy of retrieval results.

Summary of the Invention

[0005] This application provides a data retrieval method, a retrieval model training method, and related equipment. The embodiments of this application help improve retrieval accuracy and retrieval efficiency.

[0006] In a first aspect, embodiments of this application provide a data retrieval method. The method includes:

[0007] The retrieval device acquires the first retrieval text; inputs the first retrieval text into the retrieval model for processing to obtain the first semantic identifier, which includes a first global semantic identifier and a first local semantic identifier; the retrieval device determines the first target retrieval object based on the first semantic identifier and a correspondence table between the retrieval object and the semantic identifier, and there is a correspondence between the first target retrieval object and the first semantic identifier.

[0008] The first target retrieval object includes one or more of the following: text, video, audio, image, hyperlink, and webpage.

[0009] As can be seen, the semantic identifier derived from the retrieval text includes both a global semantic identifier and a local semantic identifier and therefore carries richer semantic information, so using it to determine the target retrieval object improves retrieval accuracy. Furthermore, when the target retrieval object is determined from the semantic identifier, the retrieval scope can first be narrowed using the global semantic identifier, and the target retrieval object can then be located precisely using the local semantic identifier, which improves retrieval efficiency. Moreover, compared with explicitly calculating the similarity between the query and each candidate, this semantic-identifier-based retrieval method reduces the memory usage of the retrieval device.
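
The coarse-then-fine lookup described in [0009] can be sketched as follows. This is only a minimal illustration under stated assumptions, not the application's implementation: the nested-dictionary layout of the correspondence table, the function name `lookup`, and the sample identifiers and object names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical correspondence table: global identifier -> local identifier -> objects.
# The layout and the sample entries are assumptions made for illustration only.
CorrespondenceTable = Dict[int, Dict[Tuple[int, ...], List[str]]]

table: CorrespondenceTable = {
    3: {(12, 7): ["cat_photo.jpg"], (12, 9): ["cat_video.mp4"]},
    5: {(1, 4): ["dog_article.html"]},
}


def lookup(table: CorrespondenceTable, global_id: int,
           local_id: Tuple[int, ...]) -> List[str]:
    """Narrow the search by global identifier, then match the local identifier."""
    bucket = table.get(global_id, {})   # coarse step: keep a single cluster
    return bucket.get(local_id, [])     # fine step: exact match inside that cluster


print(lookup(table, 3, (12, 9)))
```

Because the lookup only touches the entries stored under one global identifier, no similarity score between the query and every candidate has to be computed or kept in memory.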

[0010] In conjunction with the first aspect, in one possible implementation, the retrieval device acquires the first retrieval text, including:

[0011] The retrieval device acquires retrieval condition information, which includes text-type retrieval condition information or non-text-type retrieval condition information. When the retrieval condition information is non-text-type retrieval condition information, the retrieval device converts the non-text-type retrieval condition information into text-type retrieval condition information, wherein the first retrieval text is text-type retrieval condition information.

[0012] The non-text-type retrieval condition information includes one or more of the following: video, audio, images, hyperlinks, and web pages.

[0013] It can be seen that converting non-text-type retrieval condition information into text-type retrieval condition information, and then retrieving with the text-type retrieval condition information, helps achieve retrieval with multimodal input.

[0014] In conjunction with the first aspect, in one possible implementation, the method of this embodiment further includes:

[0015] The retrieval device acquires the second retrieval text; the retrieval device inputs the second retrieval text into the retrieval model for processing to obtain the second semantic identifier, which includes a second global semantic identifier and a second local semantic identifier; the retrieval device determines the second target retrieval object based on the second semantic identifier and the correspondence table, and there is a correspondence between the second target retrieval object and the second semantic identifier; the modality of the first target retrieval object is different from the modality of the second target retrieval object.

[0016] The second target retrieval object includes one or more of the following: text, video, audio, image, hyperlink, and webpage.

[0017] It can be seen that, by feeding the first retrieval text and the second retrieval text to one retrieval model, retrieval results of different modalities can be obtained; that is, a single retrieval model can be used to retrieve objects of different modalities.

[0018] In conjunction with the first aspect, in one possible implementation, the retrieval model includes an encoder and a decoder. The retrieval device inputs the first retrieval text into the retrieval model for processing to obtain a first semantic identifier, including:

[0019] The retrieval device uses the encoder to extract multi-scale features from the first retrieval text to obtain multiple first feature vectors. Based on these first feature vectors, the retrieval device obtains a second feature vector and a first vector matrix, where the second feature vector is obtained by concatenating the multiple first feature vectors, and the first vector matrix is obtained by stacking the multiple first feature vectors. The retrieval device uses the decoder to perform multiple first operations based on the second feature vector, the first vector matrix, and first intermediate data to obtain multiple logical values, each corresponding to one first operation. These logical values are then processed to obtain the first semantic identifier. During the 1st first operation, the first intermediate data is the starting token. During the x-th first operation, where x is greater than 1 and not greater than M, the first intermediate data is the (x-1)-th token in the first semantic identifier. The token obtained based on the logical value from the last first operation is the termination token.
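
The token-by-token generation in [0019] can be sketched as a simple loop. The sketch makes several assumptions that are not in the source: the token ids `START` and `END`, greedy selection of the highest-scoring token as the way the logical values are "processed", and a toy `step` function that stands in for the decoder (the encoder features it would also receive are assumed to be captured inside it).

```python
from typing import Callable, List, Sequence

START, END = 0, 1  # hypothetical ids for the starting and termination tokens


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (greedy choice, an assumption)."""
    return max(range(len(values)), key=lambda i: values[i])


def generate_identifier(step: Callable[[int], Sequence[float]],
                        max_steps: int) -> List[int]:
    """Run the first operation repeatedly, feeding back the previous token."""
    tokens: List[int] = []
    previous = START                     # 1st operation: starting token
    for _ in range(max_steps):
        token = argmax(step(previous))   # one first operation yields one token
        if token == END:                 # the last operation yields the termination token
            break
        tokens.append(token)
        previous = token                 # x-th operation sees the (x-1)-th token
    return tokens


def toy_step(previous: int) -> List[float]:
    """Toy stand-in for the decoder over a 5-token vocabulary: START -> 3 -> 4 -> END."""
    nxt = {START: 3, 3: 4, 4: END}[previous]
    return [1.0 if i == nxt else 0.0 for i in range(5)]


print(generate_identifier(toy_step, max_steps=8))
```

The `max_steps` bound is a safety limit added for the sketch so that the loop always terminates even if the termination token is never produced.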

[0020] In conjunction with the first aspect, in one possible implementation, the decoder includes multiple decoding layers, and the retrieval device utilizes the decoder to perform multiple first operations based on the second feature vector, the first vector matrix, and the first intermediate data to obtain multiple logical values, including:

[0021] During the x-th first operation, the retrieval device uses the s-th decoding layer to process the second feature vector, the first vector matrix, and the second intermediate data to obtain the output of the s-th decoding layer. Specifically, when s > 1, the second intermediate data is the output of the (s-1)-th decoding layer; when s = 1, the second intermediate data is the first intermediate data; and when the s-th decoding layer is the last decoding layer, its output is the logical value corresponding to the x-th first operation.

[0022] In conjunction with the first aspect, in one possible implementation, the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function. The retrieval device uses the s-th decoding layer to process the second feature vector, the first vector matrix, and the second intermediate data to obtain the output of the s-th decoding layer, including:

[0023] The retrieval device uses the fusion layer to fuse the second feature vector to obtain a first fused vector, the dimension of which is lower than that of the second feature vector. The retrieval device uses the cross-attention layer to process the second intermediate data and the first fused vector to obtain a third feature vector. The activation function is used to process the third feature vector to obtain a fourth feature vector. The fourth feature vector is averaged to obtain a first average value. The retrieval device uses the cross-attention layer to process the multiple first feature vectors in the first vector matrix to obtain multiple fifth feature vectors, each corresponding to one first feature vector. Here, the second intermediate data serves as the query (Q) of the cross-attention layer, and the first feature vectors serve as its key (K) and value (V). The retrieval device uses the linearization layer to linearize the multiple fifth feature vectors to obtain multiple first processing results. The activation function is used to process the multiple first processing results to obtain multiple weights corresponding to the multiple fifth feature vectors. A dot product operation is performed between the multiple fifth feature vectors and their corresponding weights to obtain multiple sixth feature vectors. Mathematical operations are performed between the multiple sixth feature vectors and the first average value to obtain the output of the s-th decoding layer.

[0024] As can be seen, the features output from each layer of the encoder are concatenated to form global features, which interact with the decoder input to generate global fusion features. The features output from each layer of the encoder undergo a cross-attention operation with the decoder input, and then are weighted and summarized to obtain local fusion features. These global and local fusion features are then input into the next decoder to generate the next token for the semantic identifier. Through this coarse-grained to fine-grained feature fusion strategy, the resulting semantic identifiers possess rich semantic information, thereby improving the accuracy of the semantic identifiers and consequently enhancing retrieval accuracy.
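
The weighting of per-scale features and their combination with the global term can be sketched as below. The source leaves several details open, so the sketch rests on explicit assumptions: the per-scale vectors passed in are taken to be the already attended fifth feature vectors, the linearization layer maps each vector to one scalar (`w . v + b`), the activation is a sigmoid, and the final "mathematical operation" is a sum followed by adding the global average. The names `local_fusion`, `w`, `b`, and `global_mean` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def local_fusion(features: Sequence[Vector], w: Vector, b: float,
                 global_mean: float) -> Vector:
    """Weight each per-scale vector, sum the results, and add the global term."""
    dim = len(features[0])
    fused = [0.0] * dim
    for v in features:
        score = sum(wi * vi for wi, vi in zip(w, v)) + b  # linear layer (assumed scalar output)
        weight = sigmoid(score)                           # activation gives the weight
        for d in range(dim):
            fused[d] += weight * v[d]                     # weighted vector, accumulated
    return [x + global_mean for x in fused]               # combine with the global average


features = [[1.0, 0.0], [0.0, 2.0]]
print(local_fusion(features, w=[0.0, 0.0], b=0.0, global_mean=0.5))
```

With zero weights and bias every vector receives the same weight of 0.5, which makes the example easy to check by hand; a trained layer would assign different weights to different scales.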

[0025] In conjunction with the first aspect, in one possible implementation, the method of this embodiment further includes:

[0026] The retrieval device acquires multiple candidate retrieval objects and performs feature extraction on each candidate retrieval object to obtain multiple seventh feature vectors. It performs clustering processing on the multiple seventh feature vectors to obtain multiple first clusters and multiple first cluster centers. Each of the multiple first clusters includes one or more seventh feature vectors, and the multiple first clusters correspond to the multiple first cluster centers. The global semantic identifier of the semantic identifier of any candidate retrieval object A among the multiple candidate retrieval objects is the index of the first cluster to which candidate retrieval object A belongs. The retrieval device determines multiple first residual vectors based on the multiple first clusters and the multiple first cluster centers; the multiple first residual vectors correspond to the multiple candidate retrieval objects. Any first residual vector B among the multiple first residual vectors is the difference between the seventh feature vector of the candidate retrieval object corresponding to B and the first cluster center of the first cluster to which that seventh feature vector belongs. The retrieval device determines the local semantic identifier of the semantic identifier of each candidate retrieval object based on the multiple first residual vectors, and establishes the correspondence table based on the multiple candidate retrieval objects and their corresponding semantic identifiers.
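
The derivation of the global semantic identifier and the residual vector in [0026] can be illustrated with a short sketch. It assumes that the cluster centers are already available (for example from a k-means style clustering step, which the source does not name) and that an object belongs to the cluster whose center is nearest in squared Euclidean distance. The function names and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def global_id_and_residual(feature: Vector,
                           centers: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return the index of the nearest cluster center and the residual vector."""
    index = min(range(len(centers)),
                key=lambda c: squared_distance(feature, centers[c]))
    # residual = feature vector minus the center of the cluster it belongs to
    residual = [x - y for x, y in zip(feature, centers[index])]
    return index, residual


# Hypothetical cluster centers, as a clustering step might produce.
centers = [[0.0, 0.0], [10.0, 10.0]]
print(global_id_and_residual([9.0, 11.5], centers))
```

The cluster index plays the role of the global semantic identifier, and the residual is what the later quantization steps turn into the local semantic identifier.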

[0027] In conjunction with the first aspect, in one possible implementation, the local semantic identifier of each retrieval object includes k tokens, where k is an integer greater than 1. The retrieval device determines the local semantic identifiers of multiple candidate retrieval objects based on multiple first residual vectors, including:

[0028] The retrieval device performs dimensionality reduction processing on multiple first residual vectors to obtain multiple second residual vectors. The retrieval device performs k processing steps on each of the multiple second residual vectors to obtain k tokens in the local semantic identifier. When i is greater than 1, the i-th token among the k tokens is determined by the input data during the i-th processing step. The input data during the i-th processing step is determined based on the input data during the (i-1)-th processing step and the (i-1)-th token among the k tokens. When i = 1, the input data during the i-th processing step is the second residual vector.
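
The k processing steps in [0028] read like residual quantization, which can be sketched as follows. The source only states that the input of step i is determined from the input of step i-1 and the (i-1)-th token; the concrete update used here (subtracting the codeword selected in the previous step) and the nearest-codeword rule are assumptions, as are the codebook values and the function names. Dimensionality reduction is assumed to have been applied already.

```python
from typing import List, Sequence

Vector = List[float]
Codebook = List[Vector]


def nearest(vector: Sequence[float], codebook: Codebook) -> int:
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda j: sum((x - y) ** 2 for x, y in zip(vector, codebook[j])),
    )


def local_tokens(residual: Vector, codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize a residual in k steps, one codebook per step."""
    tokens: List[int] = []
    current = list(residual)                 # step 1 input: the second residual vector
    for codebook in codebooks:
        token = nearest(current, codebook)   # i-th token from the i-th input
        tokens.append(token)
        # next input = this input minus the codeword chosen for this token
        current = [x - y for x, y in zip(current, codebook[token])]
    return tokens


codebooks = [
    [[0.0, 0.0], [4.0, 4.0]],    # coarse codebook (hypothetical values)
    [[0.0, 0.0], [1.0, -1.0]],   # finer codebook (hypothetical values)
]
print(local_tokens([5.0, 3.0], codebooks))
```

Each step encodes what the previous steps left unexplained, so later tokens refine earlier ones.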

[0029] In conjunction with the first aspect, in one possible implementation, for a first candidate retrieval object and a second candidate retrieval object in which both the global semantic identifier and the local semantic identifier are the same among multiple candidate retrieval objects, the semantic identifier of the first candidate retrieval object and the semantic identifier of the second candidate retrieval object further include a first identifier and a second identifier, respectively, which are used to distinguish the first candidate retrieval object and the second candidate retrieval object.

[0030] Multimodal data is discretely encoded using a clustering algorithm and RQ-VAE to obtain global and local semantic identifiers. By introducing the first identifier and the second identifier, candidate retrieval objects with similar semantic information can still be distinguished by their semantic identifiers, ensuring that semantic identifiers are unique.
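
One simple way to realize the distinguishing identifiers of [0029] and [0030] is a per-collision counter, sketched below. The counter scheme, the function name `make_unique`, and the choice to append a suffix to every identifier (so that all identifiers have the same length) are assumptions for illustration; the source only requires that colliding objects receive different additional identifiers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SemanticId = Tuple[int, ...]


def make_unique(ids: List[SemanticId]) -> List[SemanticId]:
    """Append a per-collision counter so that every identifier is distinct."""
    seen: Dict[SemanticId, int] = defaultdict(int)
    unique: List[SemanticId] = []
    for semantic_id in ids:
        suffix = seen[semantic_id]       # 0 for the first object, 1 for the next, ...
        seen[semantic_id] += 1
        unique.append(semantic_id + (suffix,))
    return unique


print(make_unique([(3, 12, 7), (3, 12, 7), (5, 1, 4)]))
```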

[0031] In a second aspect, embodiments of this application provide a retrieval model training method. The method includes:

[0032] The training device acquires multiple training samples. Each training sample includes a third retrieval text and a third semantic identifier of the retrieval object corresponding to the third retrieval text. The third semantic identifier includes a third global semantic identifier and a third local semantic identifier. The training device trains the retrieval model based on the third retrieval text and the third semantic identifier in the multiple training samples.

[0033] It can be seen that semantic identifiers containing both global and local semantic identifiers are used during training. Training the retrieval model with such semantically rich identifiers enriches the semantic information about the candidate retrieval objects that the retrieval model learns, which in turn helps improve the accuracy of retrieval results when the retrieval model is used for retrieval.

[0034] In conjunction with the second aspect, in one possible implementation, multiple training samples include a first training sample and a second training sample, wherein the modality of the retrieval object corresponding to the semantic identifier in the first training sample is different from the modality of the retrieval object corresponding to the semantic identifier in the second training sample.

[0035] It can be seen that by introducing semantic identifiers of different modalities of retrieval objects into the training samples, the trained retrieval model can perform retrieval of retrieval objects of different modalities.

[0036] In conjunction with the second aspect, in one possible implementation, the training device acquires multiple training samples including:

[0037] The training device acquires multiple pieces of retrieval condition information, including text-type retrieval condition information and / or non-text-type retrieval condition information; for the non-text-type retrieval condition information, the training device converts it into text-type retrieval condition information; based on the multiple pieces of retrieval condition information, it retrieves multiple corresponding retrieval objects, and determines multiple third semantic identifiers corresponding to the multiple retrieval objects; based on the obtained text-type retrieval condition information and the multiple third semantic identifiers, it obtains multiple training samples, wherein the text-type retrieval condition information is the retrieval text in the training samples.

[0038] The non-text-type retrieval condition information includes one or more of the following: video, audio, images, hyperlinks, and web pages.

[0039] It can be seen that by converting non-text-type retrieval condition information into text-type retrieval condition information and using this data to train the retrieval model, the retrieval model can perform retrieval with multimodal input.

[0040] In conjunction with the second aspect, in one possible implementation, the retrieval model includes an encoder, a decoder, and a softmax function. The training device trains the retrieval model based on the third retrieval text and the third semantic identifier in the multiple training samples, including:

[0041] The training device uses the encoder to perform multi-scale feature extraction on the third retrieval text in each of the multiple training samples to obtain multiple eighth feature vectors for each training sample. Based on these eighth feature vectors, the training device obtains a ninth feature vector and a second vector matrix for each training sample. The ninth feature vector is obtained by concatenating the multiple eighth feature vectors, and the second vector matrix is obtained by stacking the multiple eighth feature vectors. The training device uses the decoder to perform the first operation twice based on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two output result sets for each training sample. The training device uses the softmax function to process one of the two output result sets for each training sample to obtain a first probability value, which represents the probability of generating the third semantic identifier in each training sample. The training device determines the cross-entropy loss value based on the first probability values for the multiple training samples. The training device determines the consistency loss value for each training sample based on the two output result sets for each training sample. The training device trains the retrieval model based on the cross-entropy loss value and the consistency loss values for the multiple training samples.

[0042] It can be seen that when calculating the loss, the cross-entropy loss value is introduced to maximize the probability of the retrieval model generating the correct semantic identifier, which helps to ensure the accuracy of the retrieval model; by introducing the consistency loss value, it helps to reduce the overfitting of the retrieval model during training.
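
A per-sample loss combining the two terms of [0041] and [0042] might look like the sketch below. The source does not say how the consistency loss is computed or how the two terms are combined, so the symmetric KL divergence between the two passes, the weighting factor `alpha`, and all function names are assumptions. Each pass is represented as a list of per-token score vectors.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence KL(p || q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sample_loss(pass_a: Sequence[Sequence[float]],
                pass_b: Sequence[Sequence[float]],
                target: Sequence[int], alpha: float = 1.0) -> float:
    """Cross-entropy on one pass plus a consistency term between the two passes."""
    cross_entropy = 0.0
    consistency = 0.0
    for logits_a, logits_b, token in zip(pass_a, pass_b, target):
        p, q = softmax(logits_a), softmax(logits_b)
        cross_entropy += -math.log(p[token])          # probability of the correct token
        consistency += 0.5 * (kl(p, q) + kl(q, p))    # assumed: symmetric KL divergence
    return cross_entropy + alpha * consistency


same = [[2.0, 0.0], [0.0, 2.0]]
other = [[0.0, 2.0], [2.0, 0.0]]
print(sample_loss(same, same, target=[0, 1]))
print(sample_loss(same, other, target=[0, 1]))
```

When both passes agree the consistency term is zero and only the cross-entropy remains; the more the two passes disagree, the larger the penalty.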

[0043] In conjunction with the second aspect, in one possible implementation, the training device performs the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two sets of output results for each training sample, including:

[0044] The training device uses the decoder to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample, to obtain M first output results corresponding to each training sample; the third semantic identifier includes M tokens;

[0045] In the 1st second operation, the decoder's input data includes the ninth feature vector, the second vector matrix, and the starting token for each training sample. In the j-th second operation, the decoder's input data includes the ninth feature vector, the second vector matrix, and the (j-1)-th token from the third semantic identifier for each training sample. The training device then uses the decoder to perform M second operations again based on the ninth feature vector, the second vector matrix, and the third semantic identifier for each training sample to obtain M second output results for each training sample. The two output result sets respectively include the M first output results and the M second output results.
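
The token that the decoder receives in each of the M second operations can be listed with a short helper. This is a minimal sketch of what is commonly called teacher forcing; the id used for the starting token and the function name `decoder_inputs` are hypothetical.

```python
from typing import List

START = 0  # hypothetical id of the starting token


def decoder_inputs(target_tokens: List[int]) -> List[int]:
    """Input token for each of the M second operations."""
    # operation 1 sees the starting token; operation j sees target token j-1
    return [START] + target_tokens[:-1]


print(decoder_inputs([3, 12, 7]))
```

During training the previous token comes from the known third semantic identifier rather than from the model's own prediction, which is why all M inputs can be prepared in advance.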

[0046] In conjunction with the second aspect, in one possible implementation, the decoder includes multiple decoding layers. The training device utilizes the decoder to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample, to obtain M first output results corresponding to each training sample, including:

[0047] During the j-th second operation of the M second operations, the s-th decoding layer processes the ninth feature vector, the second vector matrix, and the third intermediate data corresponding to each training sample to obtain the output of the s-th decoding layer. When s is greater than 1, the third intermediate data is the output of the (s-1)-th decoding layer; when s = 1, the third intermediate data is the fourth intermediate data. When j = 1, the fourth intermediate data is the starting token; when j is greater than 1 and not greater than M, the fourth intermediate data is the (j-1)-th token in the third semantic identifier. The output of the last decoding layer among the multiple decoding layers is the j-th first output result among the M first output results.

[0048] In conjunction with the second aspect, in one possible implementation, the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function. The training device uses the s-th decoding layer to process the ninth feature vector, the second vector matrix, and the third intermediate data corresponding to each training sample to obtain the output of the s-th decoding layer, including:

[0049] The training device uses the fusion layer to fuse the ninth feature vector to obtain a second fused vector, the dimension of which is lower than that of the ninth feature vector. The training device uses the cross-attention layer to process the third intermediate data and the second fused vector to obtain a tenth feature vector. The activation function is then used to process the tenth feature vector to obtain an eleventh feature vector. The eleventh feature vector is averaged to obtain a second average value. The training device uses the cross-attention layer to process the multiple eighth feature vectors in the second vector matrix to obtain multiple twelfth feature vectors, each corresponding to a different eighth feature vector. Here, the third intermediate data serves as the query (Q) of the cross-attention layer, and the eighth feature vectors serve as its key (K) and value (V). The training device uses the linearization layer to linearize the multiple twelfth feature vectors to obtain multiple second processing results. The activation function is then used to process these second processing results to obtain multiple weights corresponding to the twelfth feature vectors. A dot product operation is performed between the multiple twelfth feature vectors and their corresponding weights to obtain multiple thirteenth feature vectors. Finally, a mathematical operation is performed between the multiple thirteenth feature vectors and the second average value to obtain the output of the s-th decoding layer.

[0050] As can be seen, the features output from each layer of the encoder are concatenated to form global features, which interact with the decoder input to generate global fusion features. The features output from each layer of the encoder undergo a cross-attention operation with the decoder input, and then are weighted and summarized to obtain local fusion features. These global and local fusion features are then input into the next decoder to generate the next token for the semantic identifier. This coarse-to-fine-grained feature fusion strategy results in semantic identifiers with rich semantic information, thus improving the accuracy of the retrieval model.

[0051] In conjunction with the second aspect, in one possible implementation, the training device determines multiple third semantic identifiers corresponding to multiple retrieval objects based on multiple retrieval objects, including:

[0052] The training device extracts features from multiple retrieval objects to obtain multiple fourteenth feature vectors. It then performs clustering on these fourteenth feature vectors to obtain multiple second clusters and multiple second cluster centers. Each second cluster includes one or more fourteenth feature vectors. The multiple second clusters correspond to multiple second cluster centers, and the global semantic identifier of the third semantic identifier of any retrieval object C is the index of the second cluster to which retrieval object C belongs. Based on the multiple second clusters and cluster centers, the training device determines multiple third residual vectors, each corresponding to a retrieval object. Any third residual vector D is the difference between the fourteenth feature vector of the retrieval object corresponding to D and the second cluster center corresponding to the second cluster to which the fourteenth feature vector of the retrieval object corresponding to D belongs. Finally, based on the multiple third residual vectors, the training device determines the local semantic identifier of the third semantic identifier of each retrieval object.

[0053] In conjunction with the second aspect, in one possible implementation, the local semantic identifier of each retrieval object includes k tokens, where k is an integer greater than 1. The training device determines the local semantic identifiers of the third semantic identifiers of the multiple retrieval objects based on the multiple third residual vectors, including:

[0054] The training device performs dimensionality reduction on multiple third residual vectors to obtain multiple fourth residual vectors. The training device performs k processing steps on each of the multiple fourth residual vectors to obtain k tokens in the local semantic identifier. When i is greater than 1, the i-th token among the k tokens is determined by the input data during the i-th processing step. The input data during the i-th processing step is determined based on the input data during the (i-1)-th processing step and the (i-1)-th token among the k tokens. When i = 1, the input data during the i-th processing step is the fourth residual vector.

[0055] In conjunction with the second aspect, in one possible implementation, for a first retrieval object and a second retrieval object in which the global semantic identifier and the local semantic identifier are the same in multiple retrieval objects, the third semantic identifier of the first retrieval object and the third semantic identifier of the second retrieval object also include a third identifier and a fourth identifier, respectively, which are used to distinguish the first retrieval object and the second retrieval object.

[0056] Multimodal data is discretely encoded using a clustering algorithm and RQ-VAE to obtain global and local semantic identifiers. By introducing the third identifier and the fourth identifier, retrieval objects with similar semantic information can still be distinguished by their semantic identifiers, ensuring that semantic identifiers are unique.

[0057] In a third aspect, embodiments of this application provide a retrieval device, including units or modules for implementing the method provided in the first aspect or any possible implementation of the first aspect.

[0058] In a fourth aspect, embodiments of this application provide a training device, including units or modules for implementing the method provided in the second aspect or any possible implementation of the second aspect.

[0059] In a fifth aspect, embodiments of this application provide an electronic device including a processor and a memory. The memory is used to store program code. The processor is used to invoke the program code stored in the memory to execute the method provided in the first aspect or any possible implementation of the first aspect, or the method provided in the second aspect or any possible implementation of the second aspect.

[0060] In a sixth aspect, embodiments of this application provide a computer storage medium including computer instructions that, when executed on an electronic device, cause the electronic device to perform the method provided in the first aspect or any possible implementation of the first aspect, or the method provided in the second aspect or any possible implementation of the second aspect.

[0061] In a seventh aspect, embodiments of this application provide a computer program product that, when run on a computer, causes the computer to perform the method provided in the first aspect or any possible implementation of the first aspect, or the method provided in the second aspect or any possible implementation of the second aspect.

[0062] Understandably, the retrieval device described in the third aspect above is used to execute any of the methods provided in the first aspect; the training device described in the fourth aspect above is used to execute any of the methods provided in the second aspect; the electronic device described in the fifth aspect is used to execute any of the methods provided in the first aspect or any of the methods provided in the second aspect; and the computer storage medium described in the sixth aspect and the computer program product described in the seventh aspect are both used to implement any of the methods provided in the first aspect or any of the methods provided in the second aspect. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods, which are not repeated here.

Attached Figure Description

[0063] Figure 1 is a schematic diagram of a system architecture provided in an embodiment of this application;

[0064] Figure 2 is a schematic flowchart of a data retrieval method provided in an embodiment of this application;

[0065] Figure 3 illustrates the conversion of non-text search criteria information into text-type search criteria information;

[0066] Figure 4 is a schematic diagram of the structure of a retrieval model provided in an embodiment of this application;

[0067] Figure 5 is a schematic diagram of a decoding layer provided in an embodiment of this application;

[0068] Figure 6 illustrates the process of generating semantic identifiers;

[0069] Figure 7 is a schematic diagram of a residual quantization model provided in an embodiment of this application;

[0070] Figure 8 is a flowchart illustrating a retrieval model training method provided in an embodiment of this application;

[0071] Figure 9 illustrates the performance comparison results of the retrieval models;

[0072] Figure 10 is a schematic diagram of a retrieval device provided in an embodiment of this application;

[0073] Figure 11 is a schematic diagram of a training device provided in an embodiment of this application;

[0074] Figure 12 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0075] The terms “first,” “second,” “third,” and “fourth,” etc., used in the specification, claims, and drawings of this application are used to distinguish different objects, not to describe a specific order.

[0076] "Multiple" refers to two or more. "And / or" describes the relationship between related objects, indicating three possible relationships. For example, A and / or B means: A exists alone, A and B exist simultaneously, or B exists alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.

[0077] The embodiments of this application will now be described with reference to the accompanying drawings.

[0078] Referring to Figure 1, Figure 1 is a schematic diagram of a system architecture provided by an embodiment of this application. As shown in Figure 1, the system architecture includes a user equipment 101, a retrieval server 102, and a data server 103.

[0079] User equipment 101 is a device that provides voice and / or data connectivity to a user, such as a handheld device or in-vehicle device with wireless connectivity. Common user devices include mobile phones, tablets, laptops, PDAs, mobile internet devices (MIDs), and wearable devices such as smartwatches, smart bracelets, and pedometers.

[0080] The retrieval server 102 is a device capable of data processing and data transmission, such as a cloud server, distributed server, rack server, blade server, tower server, etc.

[0081] The data server 103 is a server capable of storing data, such as a cloud server, distributed server, rack server, blade server, tower server, etc.

[0082] Optionally, the retrieval server 102 and the data server 103 may be integrated together or be two separate physical entities.

[0083] User equipment 101 sends a search request to the retrieval server 102. The search request includes search criteria information, which includes at least one of the following: search text, search image, search video, and search audio. The retrieval server 102 determines a corresponding semantic identifier based on the search criteria information included in the search request. This semantic identifier includes a global semantic identifier and a local semantic identifier. The retrieval server 102 retrieves the target search object corresponding to the determined semantic identifier from the data server 103 and returns the target search object to user equipment 101. The data server 103 stores the correspondence between search objects and semantic identifiers.

[0084] As can be seen, in the scheme of this embodiment, semantic identifiers are obtained based on the search text. Since the semantic identifiers include global semantic identifiers and local semantic identifiers, the semantic information contained in such semantic identifiers is richer. Therefore, when determining the target search object through the semantic identifiers, it is beneficial to improve the search accuracy. At the same time, when determining the target search object through the semantic identifiers, the search scope can be narrowed first through the global semantic identifiers, and then the target search object can be accurately retrieved through the local semantic identifiers, thereby improving the search efficiency.

[0085] Referring to Figure 2, Figure 2 is a flowchart illustrating a data retrieval method provided in an embodiment of this application. As shown in Figure 2, the method includes:

[0086] S201, The retrieval device obtains the first retrieval text.

[0087] The retrieval device is either the retrieval server 102 in Figure 1 or a functional module or functional unit within the retrieval server 102.

[0088] Optionally, the first search text is text-type search criteria information entered by the user, such as text data, or it can be obtained based on non-text-type search criteria information entered by the user. The non-text-type search criteria information may include at least one of video data, image data, audio data, web pages, hyperlinks, etc.

[0089] For non-textual retrieval criteria, the retrieval device uses a multimodal large language model to generate textual retrieval criteria based on the non-textual retrieval criteria. This textual retrieval criteria can be seen as a textual description of the corresponding non-textual retrieval criteria. For example, the multimodal large language model could be the Qwen-VL model, the BLIP-2 model, or the Qwen-Audio model. Among these, the Qwen-VL model or the BLIP-2 model can be used to generate corresponding textual descriptions based on video or image data, while the Qwen-Audio model can be used to generate textual descriptions corresponding to audio data.

[0090] As shown in Figure 3, the retrieval device uses a multimodal large language model to convert image data into one or more corresponding text descriptions. Optionally, the retrieval device outputs the text description, such as by displaying it or playing it via voice, allowing the user to manually check the text description to ensure its accuracy.

[0091] In this way, the retrieval device can convert non-text data into text data to facilitate subsequent retrieval, thus enabling retrieval from multimodal input.

[0092] Furthermore, when the retrieval device uses a multimodal large language model to generate text-type retrieval condition information from non-text-type retrieval condition information, it also uses prompts to guide the model to describe the non-text-type retrieval condition information from different perspectives. Taking image data as an example, the retrieval device uses the multimodal large language model to generate, based on the prompts and the image data, text descriptions of the image background, main objects, colors, and actions.

[0093] Optionally, the prompt is input by the user through their device, or generated by the retrieval device based on pre-set rules.

[0094] This method effectively avoids repetitive and monotonous text descriptions generated for different non-text data, thereby improving the accuracy of search results.
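As an illustration only, the perspective-specific prompting described above might be sketched as follows. The prompt wording, the function names, and the `caption_fn` callable (a stand-in for a multimodal large language model) are assumptions made for this sketch and are not taken from this application.

```python
# Perspectives named in the text for image data.
PERSPECTIVES = ("background", "main objects", "colors", "actions")

def build_prompts(perspectives=PERSPECTIVES):
    """Pre-set rule (assumed): one prompt per perspective."""
    return {p: f"Describe the {p} of the image in one sentence." for p in perspectives}

def describe(image, caption_fn, perspectives=PERSPECTIVES):
    """Query the model once per perspective and collect the descriptions."""
    prompts = build_prompts(perspectives)
    return {p: caption_fn(image, prompt) for p, prompt in prompts.items()}

def fake_caption_fn(image, prompt):
    # Hypothetical placeholder model: echoes its inputs instead of reading the image.
    return f"[{image}] {prompt}"

descriptions = describe("photo.jpg", fake_caption_fn)
```

In a real deployment `fake_caption_fn` would be replaced by a call to the chosen multimodal model; each perspective then yields its own text description of the same input.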

[0095] S202, The retrieval device inputs the first retrieval text into the retrieval model to obtain the first semantic identifier.

[0096] The first semantic identifier includes a first global semantic identifier and a first local semantic identifier. It should be understood that the first semantic identifier is used to represent the semantic information of the first target retrieval object; in other words, the first semantic identifier has two functions: first, to identify the first target retrieval object; and second, to contain the semantic information of the first target retrieval object.

[0097] In one feasible implementation, the retrieval device performs multi-scale feature extraction on the first retrieval text to obtain multiple first feature vectors and, based on the multiple first feature vectors, obtains a second feature vector and a first vector matrix. The second feature vector is obtained by concatenating the multiple first feature vectors, and the first vector matrix is obtained by stacking them. For example, if the number of first feature vectors is N and each first feature vector is a 1*512 vector, the second feature vector is a 1*(N*512) vector, that is, a one-dimensional vector with N*512 elements, and the size of the first vector matrix is N*512.

[0098] In one example, the retrieval model includes an encoder and a decoder. The retrieval device uses the encoder to perform multi-scale feature extraction on the first retrieval text to obtain multiple first feature vectors. The encoder includes N encoding layers, as shown in Figure 4. The retrieval device uses these N encoding layers to extract features from the first retrieval text to obtain N first feature vectors, which are the outputs of the N encoding layers. The first feature vector output by the d-th encoding layer is the input of the (d+1)-th encoding layer, where d is an integer greater than 0 and less than N. The input of the first encoding layer is the first retrieval text.
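The chaining of encoding layers and the construction of the second feature vector and the first vector matrix can be sketched as follows. This is a toy illustration only: the layers are random linear maps followed by tanh, the input is assumed to be an already embedded text vector, and the dimension is reduced from 512 to 8. None of these choices come from this application.

```python
import math
import random

def make_layer(dim, seed):
    """Return a toy encoding layer: a fixed random linear map followed by tanh."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def layer(vec):
        return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

    return layer

def encode(text_embedding, layers):
    """Run the layers in sequence and keep the output of every layer."""
    outputs = []
    hidden = text_embedding
    for layer in layers:
        hidden = layer(hidden)   # output of layer d is the input of layer d + 1
        outputs.append(hidden)
    return outputs

N, DIM = 3, 8   # the text uses 512-dimensional vectors; 8 keeps the demo small
layers = [make_layer(DIM, seed) for seed in range(N)]
first_vectors = encode([1.0] * DIM, layers)

# Second feature vector: concatenation into one flat vector of N * DIM elements.
second_feature_vector = [x for vec in first_vectors for x in vec]
# First vector matrix: stacking into N rows of DIM elements each.
first_vector_matrix = [list(vec) for vec in first_vectors]
```

The two results hold the same numbers in different layouts: the flat vector is fed to the fusion layer, while the matrix keeps each layer's output addressable on its own.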

[0099] The retrieval device uses the decoder to perform multiple first operations on the second feature vector, the first vector matrix, and first intermediate data to obtain multiple logical values (logits), one for each first operation; the multiple logical values are then processed to obtain the first semantic identifier.

[0100] Specifically, during the first of the first operations, the first intermediate data is the starting token; during the x-th first operation, where x is greater than 1 and not greater than M, the first intermediate data is the (x-1)-th token in the first semantic identifier. The token obtained from the logical value of the last first operation is the termination token; in other words, when the result of a first operation is the termination token, that first operation is the last one.

[0101] In a specific example, the retrieval device uses the softmax function to process the multiple logical values to obtain the first semantic identifier. Specifically, the retrieval device first obtains the global semantic identifier of the first semantic identifier, and then obtains the local semantic identifier.

[0102] In one example, as shown in Figure 4, the retrieval device performs the first of the first operations based on the starting token, the second feature vector, and the first vector matrix, and obtains the result 9, which is the first token of the first semantic identifier. It then performs the second first operation based on the first token, the second feature vector, and the first vector matrix, and obtains the result 21, which is the second token of the first semantic identifier. It then performs the third first operation based on the second token, the second feature vector, and the first vector matrix, and obtains the result 38, which is the third token of the first semantic identifier. Finally, it performs the fourth first operation based on the third token, the second feature vector, and the first vector matrix, and obtains the termination token, which indicates the end of the above processing. The retrieval device thus obtains the first semantic identifier: 9 21 38.
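The token-by-token generation in this example can be sketched as the loop below. It is a minimal illustration under stated assumptions: `step_fn` is a hypothetical stand-in for one pass through the decoder (in the described method it would also receive the second feature vector and the first vector matrix), the token with the highest softmax probability is selected greedily, and `toy_step` is scripted only to reproduce the sequence 9 21 38.

```python
import math

START, END = "<start>", "<end>"

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_identifier(step_fn, vocab, max_steps=16):
    """Repeat the decoding step until the termination token is produced."""
    tokens = []
    previous = START                       # first operation uses the starting token
    for _ in range(max_steps):
        probs = softmax(step_fn(previous))
        token = vocab[probs.index(max(probs))]   # greedy choice (assumption)
        if token == END:                   # termination token ends the processing
            break
        tokens.append(token)
        previous = token                   # next operation uses the token just produced
    return tokens

VOCAB = [9, 21, 38, END]
NEXT = {START: 9, 9: 21, 21: 38, 38: END}

def toy_step(previous):
    # Hypothetical decoder: puts a high logit on a scripted next token.
    target = NEXT[previous]
    return [5.0 if tok == target else 0.0 for tok in VOCAB]

first_semantic_identifier = generate_identifier(toy_step, VOCAB)
```

With the scripted decoder, four decoding steps are run and the fourth yields the termination token, matching the four first operations in the example.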

[0103] In one possible implementation, during the x-th first operation, the retrieval device uses the decoder to process the first intermediate data, the second feature vector, and the first vector matrix to obtain the logical value $L_x$. In one example, as shown in Figure 4, the decoder includes N decoding layers, and the retrieval device uses these N decoding layers to process the first intermediate data, the second feature vector, and the first vector matrix to obtain the logical value $L_x$. The input data of the s-th decoding layer includes second intermediate data, the second feature vector, and the first vector matrix, where s is an integer greater than 0 and not greater than N. When s = 1, the second intermediate data is the first intermediate data. When s is greater than 1 and not greater than N, the second intermediate data is the output data of the (s-1)-th decoding layer. When s = N, the output data of the s-th decoding layer is the logical value $L_x$. The retrieval device determines the x-th token in the first semantic identifier based on the logical value $L_x$; in one example, the retrieval device processes the logical value $L_x$ with the softmax function to obtain the x-th token in the first semantic identifier.

[0104] In one possible implementation, as shown in Figure 5, the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function. The retrieval device uses the s-th decoding layer to process the second intermediate data, the second feature vector, and the first vector matrix to obtain the output of the s-th decoding layer as follows:

[0105] The retrieval device uses the fusion layer to fuse the second feature vector to obtain a first fused vector, the dimension of which is lower than that of the second feature vector. The retrieval device then inputs the first fused vector and the second intermediate data into the cross-attention layer to obtain a third feature vector, where the first fused vector serves as the K and V values of the cross-attention layer, and the second intermediate data serves as the Q value. The activation function is then used to process the third feature vector to obtain a fourth feature vector. Finally, the retrieval device performs an averaging operation on the fourth feature vector to obtain a first average value.

[0106] The retrieval device uses the cross-attention layer to process each of the N first feature vectors and the second intermediate data to obtain N fifth feature vectors. The first feature vectors serve as the K and V values of the cross-attention layer, and the second intermediate data serves as the Q value of the cross-attention layer. The retrieval device linearizes each of the N fifth feature vectors to obtain N first processing results. The retrieval device then processes each of the N first processing results using the activation function to obtain N weights, which correspond to the N fifth feature vectors. Based on these weights, the retrieval device weights the N fifth feature vectors to obtain N sixth feature vectors. Finally, the retrieval device sums the N sixth feature vectors with the first average value to obtain the output of the s-th decoding layer.

[0107] It should be noted that, since the first vector matrix is composed of the N first feature vectors stacked together, the retrieval device can obtain the N first feature vectors from the first vector matrix during the above processing.

[0108] In one example, the first fused vector can be represented as $Z = W[E_1, E_2, \ldots, E_N] + b$, where $W$ is the weight, $E_1, E_2, \ldots, E_N$ are the outputs of the N encoding layers, that is, the N first feature vectors, and $b$ is the bias.

[0109] The third feature vector can be expressed as $C(Y, Z) = \mathrm{Attention}(W_q Y, W_k Z, W_v Z)$, where $Y$ is the second intermediate data, and $W_q$, $W_k$, and $W_v$ are the weight matrices corresponding to the Q, K, and V values of the cross-attention layer, respectively.

[0110] The first average can be expressed as: [formula missing from the source text], where N is the number of encoding layers in the encoder, and $\sigma$ is the activation function.

[0111] The i-th sixth feature vector among the N sixth feature vectors can be represented as $\alpha_i \odot \mathrm{Attention}(Y, E_i)$, where $\alpha_i$ is the i-th weight among the N weights, $\mathrm{Attention}(Y, E_i)$ is the i-th fifth feature vector among the N fifth feature vectors, and $E_i$ is the feature output by the i-th encoding layer, that is, the i-th first feature vector among the N first feature vectors. Here $\alpha_i = \sigma(W_i[Y, \mathrm{Attention}(Y, E_i)] + b)$, where $W_i$ is the weight and $b$ is the bias.
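The weighting and summation in the decoding layer can be sketched numerically as below. This is an illustration under assumptions rather than the implementation of this application: the fifth feature vectors Attention(Y, E_i) are taken as already computed inputs, σ is assumed to be the logistic sigmoid, and the first average value is assumed to be a vector with the same dimension as the feature vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(y, attended_i, w_i, b):
    """alpha_i = sigma(W_i [Y, Attention(Y, E_i)] + b)."""
    joined = y + attended_i   # list concatenation [Y, Attention(Y, E_i)]
    return [sigmoid(sum(w * v for w, v in zip(row, joined)) + b_j)
            for row, b_j in zip(w_i, b)]

def decoding_layer_output(y, attended, weights, b, first_average):
    """Sum the weighted fifth feature vectors together with the first average."""
    output = list(first_average)
    for attended_i, w_i in zip(attended, weights):
        alpha_i = gate(y, attended_i, w_i, b)
        # Sixth feature vector: element-wise product alpha_i * Attention(Y, E_i).
        sixth = [a * v for a, v in zip(alpha_i, attended_i)]
        output = [o + s for o, s in zip(output, sixth)]
    return output

y = [1.0, 2.0]                          # second intermediate data Y
attended = [[2.0, 4.0], [6.0, 8.0]]     # two fifth feature vectors (N = 2)
zero_w = [[0.0] * 4 for _ in range(2)]  # zero weights make every gate equal 0.5
result = decoding_layer_output(y, attended, [zero_w, zero_w], [0.0, 0.0], [1.0, 1.0])
```

With zero weights and zero bias every gate equals 0.5, so `result` is half the sum of the fifth feature vectors plus the first average, which makes the arithmetic easy to check by hand.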

[0112] S203. The retrieval device determines the first target retrieval object based on the first semantic identifier and the correspondence table between the retrieval object and the semantic identifier, wherein there is a correspondence between the first semantic identifier and the first target retrieval object.

[0113] Specifically, the retrieval device first determines multiple reference retrieval objects from the correspondence table between retrieval objects and semantic identifiers based on the global semantic identifier in the first semantic identifier. The global semantic identifier of the semantic identifier corresponding to these multiple reference retrieval objects is the same as the global semantic identifier in the first semantic identifier. Then, it determines a first target retrieval object from the multiple reference retrieval objects based on the local semantic identifier in the first semantic identifier. The local semantic identifier of the semantic identifier corresponding to the first target retrieval object is the same as the local semantic identifier in the first semantic identifier. If there are multiple retrieval objects among the multiple reference retrieval objects whose local semantic identifier is the same as the local semantic identifier in the first semantic identifier, the retrieval device determines the first target retrieval object from these multiple retrieval objects based on the unique identifier in the first semantic identifier. The first target retrieval object is the retrieval object among these multiple retrieval objects whose unique identifier is the same as the unique identifier in the first semantic identifier. It should be understood that these multiple retrieval objects are retrieval objects whose global semantic identifier and local semantic identifier in the correspondence table between retrieval objects and semantic identifiers are both the same as the global semantic identifier and local semantic identifier in the first semantic identifier.
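The coarse-to-fine lookup described above can be sketched with nested dictionaries. The table layout, the function names, and the sample objects are assumptions made for illustration; this application does not prescribe a storage format for the correspondence table.

```python
from collections import defaultdict

def build_index(table):
    """table: iterable of (global_id, local_id, unique_id, retrieval_object)."""
    index = defaultdict(lambda: defaultdict(dict))
    for global_id, local_id, unique_id, obj in table:
        index[global_id][tuple(local_id)][unique_id] = obj
    return index

def lookup(index, global_id, local_id, unique_id=None):
    """Narrow by global identifier, then local identifier, then unique identifier."""
    reference_objects = index.get(global_id, {})          # same global identifier
    matches = reference_objects.get(tuple(local_id), {})  # same local identifier
    if not matches:
        return None
    if len(matches) == 1:
        return next(iter(matches.values()))
    return matches.get(unique_id)   # several candidates: use the unique identifier

TABLE = [
    (56, (12, 34), 1, "cat.jpg"),
    (56, (12, 34), 2, "cat.mp4"),
    (56, (7, 90), 1, "dog.wav"),
    (9, (21, 38), 1, "doc.txt"),
]
index = build_index(TABLE)
```

For instance, `lookup(index, 56, (12, 34), 2)` resolves the two objects that share the prefix 56 12 34 by their unique identifier, while `lookup(index, 9, (21, 38))` needs no unique identifier because only one object matches.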

[0114] In one feasible implementation, the retrieval device obtains a table of correspondences between semantic identifiers and retrieval objects, which can be obtained from other devices or generated by the retrieval device itself.

[0115] It should be noted that the table of relationships between semantic identifiers and search objects generated by the retrieval device can be obtained during the training of the retrieval model or generated at other times.

[0116] In one specific implementation, as shown in Figure 6, the training device acquires multiple candidate retrieval objects. These candidate retrieval objects include, but are not limited to, at least one of video, image, audio, text, webpage, hyperlink, and document. The training device extracts features from the multiple candidate retrieval objects to obtain multiple seventh feature vectors. In one example, the training device extracts features from the multiple candidate retrieval objects using a trained ImageBind model to obtain the multiple seventh feature vectors. The training device performs clustering processing on the multiple seventh feature vectors to obtain one or more first clusters and one or more corresponding first cluster centers. The global semantic identifier of the semantic identifier of any candidate retrieval object A among the multiple candidate retrieval objects is the index of the first cluster to which candidate retrieval object A belongs.

[0117] The training device obtains multiple first residual vectors corresponding to the multiple candidate retrieval objects based on the one or more first clusters and the one or more first cluster centers. Any first residual vector A among the multiple first residual vectors is the difference between the seventh feature vector of the candidate retrieval object corresponding to the first residual vector A and the first cluster center of the first cluster to which that seventh feature vector belongs. In other words, the training device subtracts the corresponding first cluster center from each of the multiple seventh feature vectors, and the difference obtained is the first residual vector.
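The derivation of the global semantic identifier and the first residual vector can be sketched as follows. The sketch assumes that the first cluster centers are already available (this application does not name the clustering algorithm) and that a feature vector belongs to the cluster whose center is nearest in squared Euclidean distance; both points are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def global_id_and_residual(feature, centers):
    """Return (index of the nearest cluster, feature minus that cluster's center)."""
    cluster_index = min(range(len(centers)),
                        key=lambda i: squared_distance(feature, centers[i]))
    residual = [f - c for f, c in zip(feature, centers[cluster_index])]
    return cluster_index, residual

centers = [[0.0, 0.0], [10.0, 10.0]]   # toy first cluster centers
seventh_feature_vector = [9.0, 12.0]   # toy feature of one candidate retrieval object
global_id, first_residual = global_id_and_residual(seventh_feature_vector, centers)
```

Here the feature vector is closest to the second center, so its global semantic identifier is the index 1 and its first residual vector is what remains after that center is subtracted.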

[0118] The training device obtains local semantic identifiers of semantic identifiers of multiple candidate retrieval objects based on multiple first residual vectors, and establishes a correspondence table between multiple candidate retrieval objects and their corresponding semantic identifiers.

[0119] The training device processes each of the multiple first residual vectors as follows to obtain the local semantic identifier of the semantic identifier of the candidate retrieval object corresponding to each first residual vector:

[0120] The training device performs dimensionality reduction on the multiple first residual vectors to obtain multiple second residual vectors. The training device performs k processing steps on each of the multiple second residual vectors to obtain k tokens in the local semantic identifier. When i is greater than 1, the i-th token among the k tokens is determined by the input data during the i-th processing step. The input data during the i-th processing step is determined based on the input data during the (i-1)-th processing step and the (i-1)-th token among the k tokens. When i = 1, the input data during the i-th processing step is the second residual vector.

[0121] Specifically, each first residual vector is linearized to obtain a second residual vector; k processing steps are performed based on the second residual vector to obtain the k tokens of the local semantic identifier. During the i-th processing step, the i-th dictionary is queried based on the i-th query vector to obtain the i-th token of the local semantic identifier. The i-th dictionary includes multiple vectors, and the i-th token of the local semantic identifier is the vector in the i-th dictionary with the smallest distance to the i-th query vector. When i = 1, the i-th query vector is the second residual vector; when i is greater than 1 and not greater than k, the i-th query vector is the difference between the (i-1)-th query vector and the (i-1)-th token of the local semantic identifier. Here k is an integer greater than 1 and is the number of dictionaries, and the local semantic identifier of the semantic identifier of the candidate retrieval object includes k tokens. Following this method, the training device can obtain the local semantic identifier of the semantic identifier of each candidate retrieval object.
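The k query steps can be sketched as a short residual quantization loop. The dictionaries below are toy values, and recording each token as the index of the nearest dictionary vector (rather than the vector itself) is an assumption made so that the tokens print as numbers, as in the worked examples of this application.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_identifier(second_residual, dictionaries):
    """Query each dictionary in turn; subtract the chosen vector before the next query."""
    query = list(second_residual)        # first query vector is the second residual vector
    tokens = []
    for dictionary in dictionaries:      # k dictionaries give k tokens
        nearest = min(range(len(dictionary)),
                      key=lambda j: squared_distance(query, dictionary[j]))
        tokens.append(nearest)
        # Next query vector: previous query vector minus the chosen dictionary vector.
        query = [q - c for q, c in zip(query, dictionary[nearest])]
    return tokens

dictionaries = [
    [[0.0, 0.0], [4.0, 4.0]],    # first dictionary
    [[0.0, 0.0], [1.0, -1.0]],   # second dictionary
]
tokens = local_identifier([5.0, 3.0], dictionaries)
```

Each later dictionary only has to describe what the earlier ones left unexplained, which is why a few small dictionaries can distinguish many objects inside one cluster.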

[0122] In the above manner, the training device obtains multiple semantic identifiers corresponding to multiple candidate retrieval objects, and then determines the correspondence between multiple candidate retrieval objects and multiple semantic identifiers.

[0123] It should be noted that the operation in which the training device performs the k query operations based on the second residual vector to obtain the local semantic identifier can be implemented by a residual quantization model.

[0125] In one possible implementation, as shown in Figure 7, the training device acquires a first residual vector sample and performs a first linearization process (dimensionality reduction) on it to obtain a second residual vector sample. The training device inputs the second residual vector sample into the residual quantization model to obtain multiple tokens, and sums the multiple tokens to obtain a third residual vector. It then performs a second linearization process (dimensionality increase) on the third residual vector to obtain a fourth residual vector, the dimension of which is the same as that of the first residual vector sample. The training device calculates a loss value based on the fourth residual vector and the first residual vector sample, and adjusts the parameters in the residual quantization model based on the loss value, thereby training the residual quantization model.
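The forward pass of this training procedure can be sketched as follows. The mean squared error used as the loss is an assumption, since this application does not specify the loss function, and the parameter update itself is omitted; the projection matrices and the dictionary are toy values.

```python
def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def reconstruction_loss(sample, down, up, dictionaries):
    """Project down, quantize, sum the chosen vectors, project up, compare with the input."""
    query = matvec(down, sample)             # first linearization (dimensionality reduction)
    summed = [0.0] * len(query)              # will hold the third residual vector
    for dictionary in dictionaries:
        code = min(dictionary,
                   key=lambda c: sum((q - x) ** 2 for q, x in zip(query, c)))
        summed = [s + x for s, x in zip(summed, code)]
        query = [q - x for q, x in zip(query, code)]
    restored = matvec(up, summed)            # second linearization (dimensionality increase)
    # Assumed loss: mean squared error between the fourth residual vector and the sample.
    return sum((r - s) ** 2 for r, s in zip(restored, sample)) / len(sample)

down = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # 3 dimensions -> 2 dimensions
up = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]     # 2 dimensions -> 3 dimensions
dictionaries = [[[0.0, 0.0], [2.0, 0.0]]]
loss_exact = reconstruction_loss([2.0, 0.0, 0.0], down, up, dictionaries)
loss_lossy = reconstruction_loss([2.0, 0.0, 3.0], down, up, dictionaries)
```

In the second call the third component is discarded by the down-projection and cannot be restored, so the loss is nonzero; training would adjust the parameters to reduce such losses.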

[0126] It should be noted that the above training process can be iterated multiple times, and the training can be terminated when the number of iterations reaches a threshold or when the loss value converges.

[0127] In one possible implementation, for a first candidate search object and a second candidate search object among multiple candidate search objects, the semantic identifiers of the first candidate search object and the second candidate search object further include a first identifier and a second identifier, respectively. The first identifier and the second identifier are used to distinguish the first candidate search object and the second candidate search object; wherein, the global semantic identifier and the local semantic identifier of the first candidate search object are the same as the global semantic identifier and the local semantic identifier of the second candidate search object. The first identifier and the second identifier can be regarded as unique identifiers of the first candidate search object and the second candidate search object, respectively.

[0128] In a specific example, the training device stores a semantic identifier database, which includes the semantic identifiers already determined by the training device. After obtaining the global and local semantic identifiers of candidate retrieval object E, the training device determines whether the semantic identifier database contains semantic identifiers whose global and local semantic identifiers are the same as those of candidate retrieval object E. If so, the training device counts those semantic identifiers and uses that number plus 1 as the unique identifier of candidate retrieval object E.

[0129] In a specific example, as shown in Figure 6, the training device performs clustering on the multiple seventh feature vectors to obtain one or more first clusters and the one or more first cluster centers corresponding to them. In one example, the index of the first cluster to which the seventh feature vector of candidate retrieval object F belongs is 56, that is, the global semantic identifier of the semantic identifier of candidate retrieval object F is 56. The training device subtracts the first cluster center of the first cluster to which candidate retrieval object F belongs from the seventh feature vector of candidate retrieval object F; the difference is the first residual vector corresponding to candidate retrieval object F. The training device linearizes (reduces the dimension of) this first residual vector to obtain the second residual vector corresponding to candidate retrieval object F. The training device queries the first dictionary based on the second residual vector and obtains the first vector: 12. The first vector is the vector in the first dictionary with the smallest distance to the second residual vector, and it is the first token in the local semantic identifier of the semantic identifier of candidate retrieval object F. The training device subtracts the first vector from the second residual vector to obtain the fifth residual vector corresponding to candidate retrieval object F. The training device queries the second dictionary based on the fifth residual vector and obtains the second vector: 34. The second vector is the vector in the second dictionary with the smallest distance to the fifth residual vector, and it is the second token in the local semantic identifier of the semantic identifier of candidate retrieval object F. At this point, the training device has obtained the local semantic identifier of the semantic identifier of candidate retrieval object F: 12 34.

[0130] It should be noted that the global semantic identifier and local semantic identifier of the candidate retrieval object can be referred to as the prefix of the candidate retrieval object.

[0131] The training device maintains a semantic identifier database, which includes the semantic identifiers already determined by the training device. After obtaining the prefix of candidate retrieval object F (that is, 56 12 34), the training device checks whether semantic identifiers with the prefix 56 12 34 already exist in the semantic identifier database. If they do, the training device counts them; here there are 12 semantic identifiers with the prefix 56 12 34 in the database, so the training device uses 13 as the unique identifier of the semantic identifier of candidate retrieval object F. The training device thus obtains the semantic identifier of candidate retrieval object F: 56 12 34 13.
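The assignment of the unique identifier can be sketched with a counter keyed by prefix. The class name and the in-memory counter are assumptions standing in for the semantic identifier database; giving the first object with a new prefix the unique identifier 1 is likewise an assumption, since the text only describes the case in which the prefix already exists.

```python
from collections import Counter

class SemanticIdRegistry:
    """Hypothetical stand-in for the semantic identifier database."""

    def __init__(self):
        self._counts = Counter()   # number of identifiers issued per prefix

    def assign(self, global_id, local_id):
        """Return prefix + unique identifier, where unique = existing count + 1."""
        prefix = (global_id, *local_id)
        self._counts[prefix] += 1
        return prefix + (self._counts[prefix],)

registry = SemanticIdRegistry()
for _ in range(12):                 # 12 identifiers with prefix 56 12 34 already exist
    registry.assign(56, (12, 34))
identifier_f = registry.assign(56, (12, 34))
```

After twelve earlier assignments with the same prefix, candidate retrieval object F receives the identifier 56 12 34 13, as in the example above.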

[0132] In one feasible implementation, the method of this embodiment further includes:

[0133] The retrieval device acquires the second retrieval text; the retrieval device inputs the second retrieval text into the retrieval model for processing to obtain the second semantic identifier, which includes a second global semantic identifier and a second local semantic identifier; the retrieval device determines the second target retrieval object based on the second semantic identifier and the correspondence table, and there is a correspondence between the second target retrieval object and the second semantic identifier; the modality of the first target retrieval object is different from the modality of the second target retrieval object.

[0134] As can be seen, semantic identifiers derived from the search text, including both global and local semantic identifiers, contain richer semantic information. Therefore, using these identifiers to determine the target search object improves search accuracy. Furthermore, when determining the target search object using semantic identifiers, the search scope can be narrowed down first using global semantic identifiers, and then the target search object can be precisely retrieved using local semantic identifiers, thus improving search efficiency. Moreover, this semantic identifier-based search method, compared to explicitly calculating the similarity between the query and candidate options, reduces the memory usage of the search device. Converting non-text search criteria into text-based search criteria and then using these text-based criteria facilitates multimodal input retrieval. Using a single search model with both the first and second search texts yields search results across different modalities, meaning that a single search model can retrieve information from various modalities.

[0135] Referring to Figure 8, which is a flowchart illustrating a retrieval model training method provided in an embodiment of this application, the method includes:

[0136] S801, the training device acquires multiple training samples.

[0137] Optionally, the training device is the retrieval server 102 in Figure 1 or a unit or module in the retrieval server 102.

[0138] Each training sample in the multiple training samples includes a third retrieval text and a third semantic identifier. The third semantic identifier is the semantic identifier of the retrieval object corresponding to the third retrieval text, and the third semantic identifier includes a third global semantic identifier and a third local semantic identifier.

[0139] The training samples include a first training sample and a second training sample. The modality of the retrieval object corresponding to the semantic identifier in the first training sample is different from that of the retrieval object corresponding to the semantic identifier in the second training sample.

[0140] By introducing semantic identifiers for different modalities of retrieval objects into the training samples, the trained retrieval model can retrieve objects of different modalities. For example, it can retrieve image data using text data, and retrieve video data using image data.

[0141] Optionally, the training device generates the multiple training samples itself or obtains the multiple training samples from other devices.

[0142] In one possible implementation, the training device acquires multiple training samples in the following manner:

[0143] The training device acquires multiple search condition information, including text-type search condition information and / or non-text-type search condition information; for the non-text-type search condition information, the training device converts it into text-type search condition information; based on the multiple search condition information, it retrieves multiple corresponding search objects, and determines multiple third semantic identifiers corresponding to the multiple search objects; based on the obtained multiple text-type search condition information and multiple third semantic identifiers, it obtains multiple training samples, wherein the text-type search condition information is the search text in the training samples.

[0144] The non-text search criteria include one or more of the following: video, audio, images, hyperlinks, and web pages.

[0145] For non-textual retrieval criteria, the retrieval device uses a multimodal large language model to generate textual retrieval criteria from them. The generated textual retrieval criteria can be regarded as a textual description of the corresponding non-textual retrieval criteria. For example, the multimodal large language model may be the Qwen-VL model, the BLIP-2 model, or the Qwen-Audio model. Of these, the Qwen-VL model or the BLIP-2 model can be used to generate textual descriptions from video or image data, while the Qwen-Audio model can be used to generate textual descriptions of audio data.

[0146] As shown in Figure 3, the retrieval device uses a multimodal large language model to convert image data into one or more corresponding text descriptions. Optionally, the retrieval device outputs the text description, such as by displaying it or playing it via voice, allowing the user to manually check the text description to ensure its accuracy.

[0147] In this way, the retrieval device can convert non-text data into text data to facilitate subsequent retrieval, thus enabling retrieval from multimodal input.

[0148] Furthermore, when the retrieval device uses a multimodal large language model to generate text-type retrieval condition information from non-textual retrieval condition information, it also uses prompts to guide the multimodal large language model to describe the non-textual retrieval condition information from different perspectives. Taking image data as an example, the retrieval device uses the multimodal large language model to generate, based on the prompts and the image data, text descriptions of the image background, main objects, colors, and actions.

[0149] Optionally, the prompt is input by the user through their device, or generated by the retrieval device based on pre-set rules.

[0150] It should be understood that the retrieval object corresponding to the third retrieval text is the object obtained by retrieving with the third retrieval text. If the third retrieval text is obtained from non-textual retrieval condition information, then the retrieval object corresponding to the third retrieval text is obtained by retrieving with that non-textual retrieval condition information.

[0151] In one feasible implementation, the training device determines multiple third semantic identifiers corresponding to multiple retrieval objects based on multiple retrieval objects, including:

[0152] The training device extracts features from multiple retrieval objects to obtain multiple fourteenth feature vectors. It then performs clustering on these fourteenth feature vectors to obtain multiple second clusters and multiple second cluster centers. Each second cluster includes one or more fourteenth feature vectors. The multiple second clusters correspond to multiple second cluster centers, and the global semantic identifier of the third semantic identifier of any retrieval object C is the index of the second cluster to which retrieval object C belongs. Based on the multiple second clusters and cluster centers, the training device determines multiple third residual vectors, each corresponding to a retrieval object. Any third residual vector D is the difference between the fourteenth feature vector of the retrieval object corresponding to D and the second cluster center corresponding to the second cluster to which the fourteenth feature vector of the retrieval object corresponding to D belongs. Finally, based on the multiple third residual vectors, the training device determines the local semantic identifier of the third semantic identifier of each retrieval object.
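The clustering and residual steps of [0152] can be sketched as follows. The text does not name a clustering algorithm, so plain k-means with a deterministic initialization is an assumption here, as are the function names and the toy two-dimensional feature vectors.

```python
import numpy as np

def nearest_center(features, centers):
    """Index of the closest cluster center for every feature vector."""
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(features, num_clusters, num_iters=20):
    """Plain Lloyd's k-means (assumed; the text only says 'clustering')."""
    # Deterministic start: the first num_clusters rows act as initial centers.
    centers = features[:num_clusters].astype(float).copy()
    for _ in range(num_iters):
        assign = nearest_center(features, centers)
        for c in range(num_clusters):
            members = features[assign == c]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[c] = members.mean(axis=0)
    return nearest_center(features, centers), centers

def global_ids_and_residuals(features, num_clusters):
    """Global identifier = cluster index; residual = feature - cluster center."""
    global_ids, centers = kmeans(features, num_clusters)
    residuals = features - centers[global_ids]
    return global_ids, centers, residuals

# Toy feature vectors for four retrieval objects.
features = np.array([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.0, 10.4]])
global_ids, centers, residuals = global_ids_and_residuals(features, num_clusters=2)
print(global_ids.tolist())
print(residuals)
```

Each feature vector equals its cluster center plus its residual, so the global identifier and the residual together retain the information in the original vector.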

[0153] In one possible implementation, the local semantic identifier of each retrieval object includes k tokens, where k is an integer greater than 1. The training device determines the local semantic identifiers of the third semantic identifiers of the multiple retrieval objects based on the multiple third residual vectors, including:

[0154] The training device performs dimensionality reduction on multiple third residual vectors to obtain multiple fourth residual vectors. The training device performs k processing steps on each of the multiple fourth residual vectors to obtain k tokens in the local semantic identifier. When i is greater than 1, the i-th token among the k tokens is determined by the input data during the i-th processing step. The input data during the i-th processing step is determined based on the input data during the (i-1)-th processing step and the (i-1)-th token among the k tokens. When i = 1, the input data during the i-th processing step is the fourth residual vector.
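One way to read the k processing steps in [0154] is as residual quantization. The sketch below assumes that reading: each step picks the nearest codeword from a per-step codebook as the token, and the next step's input is the current input minus that codeword. The codebooks, the nearest-codeword rule, and the function name `local_tokens` are illustrative assumptions; the dimensionality reduction is taken as already done.

```python
import numpy as np

def local_tokens(reduced_residual, codebooks):
    """Return k tokens, one per codebook (assumed residual-quantization scheme)."""
    tokens = []
    current = np.asarray(reduced_residual, dtype=float)  # input of step 1
    for codebook in codebooks:
        # Token i: index of the codeword nearest to the input of step i.
        dists = ((codebook - current) ** 2).sum(axis=1)
        token = int(dists.argmin())
        tokens.append(token)
        # Input of step i+1 depends on the input of step i and on token i.
        current = current - codebook[token]
    return tokens

# Two hypothetical codebooks, so k = 2.
codebooks = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.2, 0.0], [0.0, 0.2]]),
]
tokens = local_tokens([1.05, 0.15], codebooks)
print(tokens)
```

Under this reading, earlier tokens capture the coarse part of the residual and later tokens refine it.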

[0155] It should be noted that the specific implementation process of the training device determining multiple third semantic identifiers corresponding to multiple retrieval objects based on multiple retrieval objects can be found in the relevant description of S203, and will not be described here again.

[0156] S802, The training device trains the retrieval model based on the third retrieval text and third semantic identifier of multiple training samples.

[0157] In one possible implementation, as shown in Figure 6, the retrieval model includes an encoder, a decoder, and a softmax function. The training device uses the encoder to perform multi-scale feature extraction on the third retrieval text of each of the multiple training samples to obtain multiple eighth feature vectors corresponding to each training sample. The training device obtains a ninth feature vector and a second vector matrix based on the multiple eighth feature vectors. The ninth feature vector is obtained by concatenating the multiple eighth feature vectors, and the second vector matrix is obtained by stacking the multiple eighth feature vectors. For example, the encoder includes N coding layers, where N is an integer greater than 1. Each eighth feature vector is a 1*512 vector, and the ninth feature vector is a 1*(N*512) vector. In other words, the ninth feature vector is a one-dimensional vector with N*512 elements. The size of the second vector matrix is N*512.
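The difference between concatenating and stacking in [0157] is easiest to see from the resulting shapes. In this small sketch, N = 4 and the constant-valued layer outputs are placeholders chosen only to make the layout visible.

```python
import numpy as np

N, d = 4, 512  # N coding layers, each emitting a 1*512 vector

# Placeholder layer outputs: layer s emits a vector filled with the value s.
layer_outputs = [np.full((1, d), float(s)) for s in range(1, N + 1)]

# Concatenation joins the vectors end to end: shape (1, N*512).
ninth_feature_vector = np.concatenate(layer_outputs, axis=1)

# Stacking places one vector per row: shape (N, 512).
second_vector_matrix = np.concatenate(layer_outputs, axis=0)

print(ninth_feature_vector.shape)
print(second_vector_matrix.shape)
```

Both hold the same N*512 numbers; only the arrangement differs.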

[0158] The training device uses the decoder to perform the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two output result sets for each training sample. Each output result set includes multiple logical values (that is, logits). The softmax function is used to process the multiple logical values of one of the two output result sets to obtain a first probability value for each training sample. This first probability value characterizes the probability of generating the third semantic identifier in that training sample. The cross-entropy loss value is determined based on the multiple first probability values corresponding to the multiple training samples, and a consistency loss value is determined for each training sample based on its two output result sets. The training device calculates the target loss value based on the cross-entropy loss value and the multiple consistency loss values corresponding to the multiple training samples, and adjusts the parameters of the retrieval model based on the target loss value, thereby training the retrieval model.

[0159] In one possible implementation, as shown in Figure 4, the encoder includes multiple coding layers. The training device uses multiple coding layers to perform multi-scale feature extraction on the third retrieval text to obtain multiple eighth feature vectors. The multiple eighth feature vectors are the output data of multiple coding layers. The input data of the s-th coding layer is the output data of the (s-1)-th coding layer. s is an integer greater than 1 and not greater than N, and N is the number of coding layers in the encoder. The input data of the first coding layer is the third retrieval text.

[0160] The training device uses the decoder to perform the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two sets of output results for each training sample, including:

[0161] In the multiple training samples, the third semantic identifier of each training sample includes M tokens, where M is an integer greater than 1. The training device uses the decoder to perform M second operations based on the ninth feature vector and the second vector matrix corresponding to each training sample to obtain M first output results for each training sample. Specifically, during the first of the M second operations, the decoder's input data includes the ninth feature vector, the second vector matrix, and the starting token; during the j-th second operation (j greater than 1 and not greater than M), the decoder's input data includes the ninth feature vector, the second vector matrix, and the (j-1)-th token of the third semantic identifier in each training sample. The training device repeats the first operation once in the above manner to obtain M second output results. It should be noted that the two output result sets corresponding to each training sample respectively include the M first output results and the M second output results.
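The input schedule in [0161] matches what is commonly called teacher forcing: step 1 sees the starting token and step j sees the (j-1)-th ground-truth token. The sketch below shows only that schedule. `toy_decode_step` is a hypothetical stand-in for the decoder, and the token strings are invented. The text does not say why the two passes give different outputs; stochastic regularization such as dropout is one common reason and is an assumption here.

```python
START = "<s>"  # hypothetical starting token

def decoder_inputs(target_tokens):
    """Token fed to the decoder at each of the M second operations."""
    # Step 1 gets START; step j (j > 1) gets the (j-1)-th target token.
    return [START] + list(target_tokens[:-1])

def run_pass(decode_step, encoder_state, target_tokens):
    """One 'first operation': M second operations, one output per step."""
    return [decode_step(encoder_state, token) for token in decoder_inputs(target_tokens)]

def toy_decode_step(encoder_state, token):
    # Stand-in for the real decoder: just records what it was given.
    return (encoder_state, token)

target = ["g7", "l3", "l9"]  # toy third semantic identifier, M = 3
first_outputs = run_pass(toy_decode_step, "enc(q)", target)
second_outputs = run_pass(toy_decode_step, "enc(q)", target)
print(decoder_inputs(target))
```

The last target token is never fed as an input, because each step predicts the token that follows its input.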

[0162] The training device processes either the M first output results or the M second output results using the softmax function to obtain M second probability values. The j-th second probability value among the M second probability values represents the probability of generating the j-th token of the third semantic identifier in the training sample. The first probability value corresponding to the training sample includes the M second probability values.

[0163] In one example, the cross-entropy loss value can be expressed as:

[0164] Where q and T are the third retrieval text and the third semantic identifier in the target training sample, respectively; D represents the multiple training samples mentioned above; M is the length of the third semantic identifier, i.e., the number of tokens in the third semantic identifier; t_i is the i-th identifier in T, i.e., the i-th token in T; θ denotes the parameters of the retrieval model; p(t_i | E(q), t_{<i}, θ) is the probability of generating t_i; and t_{<i} denotes the identifiers generated before t_i.

[0165] In one example, the consistency loss value corresponding to the training sample can be expressed as:

[0166] Where P(i) and Q(i) are the i-th output result in the two output result sets, respectively.

[0167] In one example, the target loss value can be expressed as:

[0168] Where ω is the scaling factor for the consistency loss value.
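The three expressions referred to in [0163] to [0168] are not reproduced in the text. One form consistent with the symbol definitions given there, and with the bidirectional KL divergence mentioned for the consistency loss, is sketched below. Several points are assumptions rather than statements of this application: that E(q) denotes the encoder output for q, that the softmax is applied to P(i) and Q(i) before the divergence is taken, the factor 1/2, and the absence of any averaging over D or M.

```latex
% Cross-entropy loss over the training samples D
\mathcal{L}_{\mathrm{ce}}
  = -\sum_{(q,T)\in D}\;\sum_{i=1}^{M}
    \log p\left(t_i \mid E(q),\, t_{<i},\, \theta\right)

% Consistency loss for one training sample (bidirectional KL divergence)
\mathcal{L}_{\mathrm{cons}}
  = \frac{1}{2}\sum_{i=1}^{M}\Big[
      D_{\mathrm{KL}}\big(\mathrm{softmax}(P(i)) \,\|\, \mathrm{softmax}(Q(i))\big)
    + D_{\mathrm{KL}}\big(\mathrm{softmax}(Q(i)) \,\|\, \mathrm{softmax}(P(i))\big)
    \Big]

% Target loss, with \omega scaling the consistency term
\mathcal{L}
  = \mathcal{L}_{\mathrm{ce}}
  + \omega \sum_{(q,T)\in D} \mathcal{L}_{\mathrm{cons}}
```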

[0169] It should be noted that the cross-entropy loss is used to maximize the probability of generating correct semantic identifiers, while the consistency loss is used to mitigate overfitting of the retrieval model, and is achieved through bidirectional KL divergence loss. By adjusting the parameters of the retrieval model based on the target loss value determined by the cross-entropy and consistency losses, the probability of the retrieval model generating correct semantic identifiers can be improved while avoiding overfitting.
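How the cross-entropy and bidirectional KL terms combine can be shown numerically. The sketch below is an assumed form: per-token cross-entropy is taken from the first pass, the symmetric KL term is halved, and the two are summed with scaling factor `omega`. The function names and toy logits are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL divergence D(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def sample_losses(first_pass, second_pass, target_tokens):
    """first_pass / second_pass: M lists of logits; target_tokens: M token ids."""
    ce, cons = 0.0, 0.0
    for logits_p, logits_q, tok in zip(first_pass, second_pass, target_tokens):
        p, q = softmax(logits_p), softmax(logits_q)
        ce += -math.log(p[tok])                 # cross-entropy from one pass
        cons += 0.5 * (kl(p, q) + kl(q, p))     # bidirectional KL (assumed 1/2)
    return ce, cons

def target_loss(samples, omega):
    """Sum of cross-entropy plus omega-scaled consistency over all samples."""
    total = 0.0
    for first_pass, second_pass, tokens in samples:
        ce, cons = sample_losses(first_pass, second_pass, tokens)
        total += ce + omega * cons
    return total

# Two passes that disagree produce a positive consistency term.
ce, cons = sample_losses([[2.0, 0.0]], [[0.0, 2.0]], [0])
print(ce, cons)
```

When the two passes agree exactly, the consistency term is zero and the target loss reduces to the cross-entropy.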

[0170] In one possible implementation, the decoder includes multiple decoding layers, with the input of the first decoding layer being the input of the decoder, the output of the last decoding layer being the output of the decoder, and the input of the intermediate decoding layers including the ninth feature vector, the second vector matrix, and the output data of the previous decoding layer.

[0171] During the j-th second operation among the M second operations, the training device uses the s-th decoding layer to process the ninth feature vector, the second vector matrix, and the third intermediate data corresponding to each training sample, and obtains the output result of the s-th decoding layer.

[0172] Wherein, when s is greater than 1, the third intermediate data is the output result of the (s-1)-th decoding layer, and when s = 1, the third intermediate data is the fourth intermediate data; when j = 1, the fourth intermediate data is the starting token, and when j is greater than 1 and not greater than M, the fourth intermediate data is the (j-1)-th token in the third semantic identifier; the output result of the last decoding layer among the multiple decoding layers is the j-th first output result among the M first output results.

[0173] As shown in Figure 5, each decoding layer in the decoder includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function. Specifically, the process of the training device using the s-th decoding layer includes: fusing the ninth feature vector using the fusion layer to obtain a second fused vector, the dimension of which is lower than that of the ninth feature vector; processing the third intermediate data and the second fused vector using the cross-attention layer to obtain the tenth feature vector; wherein the third intermediate data serves as the Q-value of the cross-attention layer, and the second fused vector serves as the K-value and V-value of the cross-attention layer; activating the tenth feature vector using the activation function to obtain the eleventh feature vector, and averaging the eleventh feature vector to obtain the first average value. This process can be called a coarse-grained feature fusion process.

[0174] The training device uses the cross-attention layer to process the multiple eighth feature vectors in the second vector matrix to obtain multiple twelfth feature vectors, which correspond one-to-one to the multiple eighth feature vectors. Here, the third intermediate data serves as the Q-value of the cross-attention layer, and each eighth feature vector serves as the K-value and V-value of the cross-attention layer. The training device then uses the linearization layer to linearize each of the twelfth feature vectors, resulting in multiple second processing results. Next, the training device uses the activation function to process each of the second processing results, obtaining multiple weights corresponding to the twelfth feature vectors. A dot product operation is then performed between each of the twelfth feature vectors and its corresponding weight to obtain multiple thirteenth feature vectors. Finally, the training device performs a mathematical operation on the multiple thirteenth feature vectors and the first average value to obtain the output of the s-th decoding layer. The mathematical operation can be addition or weighted summation. This process can be called fine-grained feature fusion.

[0175] Figure 9 illustrates the performance of the retrieval model trained using the above method.

[0176] The retrieval model was trained and tested on six publicly available datasets: Flickr30k, MS-COCO, Clotho, AudioCaps, MSR-VTT, and MSVD. Four evaluation metrics (Recall@1, Recall@5, Recall@10, and MRR@10) were used to measure model performance, with higher values indicating better retrieval results. The retrieval model was compared with existing models, including CLIP, OpenCLIP, MobileCLIP, ONE-PEACE, ImageBind, LanguageBind, CLAP, CLIP2Video, CLIP4Clip, UMT-B, and Cap4Video. The comparison results are shown in Figure 9. The retrieval model achieved better results than the baseline models on Recall@1, Recall@5, Recall@10, and MRR@10 across the different datasets. Experiments on the six datasets show that the retrieval model achieves state-of-the-art performance in cross-modal retrieval and, on average, outperforms the strong baseline on Recall@1 by 15.27%.
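For reference, Recall@k and MRR@10 can be computed as in the sketch below. It assumes exactly one relevant object per query, which may not match the evaluation protocol of every dataset listed above; the function names and toy rankings are illustrative and are not results from this application.

```python
def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant object appears in the top k."""
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant) if rel in ranked[:k])
    return hits / len(ranked_lists)

def mrr_at_k(ranked_lists, relevant, k=10):
    """Mean reciprocal rank, counting only hits within the top k."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant):
        top = ranked[:k]
        if rel in top:
            total += 1.0 / (top.index(rel) + 1)  # ranks start at 1
    return total / len(ranked_lists)

# Toy rankings for three queries; the third query's target is never returned.
ranked_lists = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
relevant = ["a", "z", "missing"]

print(recall_at_k(ranked_lists, relevant, 1))
print(recall_at_k(ranked_lists, relevant, 5))
print(mrr_at_k(ranked_lists, relevant, 10))
```

Recall@k only asks whether the target appears in the top k, while MRR@10 also rewards placing it nearer the top.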

[0177] As can be seen, in this embodiment, semantic identifiers include global semantic identifiers and local semantic identifiers. Therefore, training the retrieval model with semantically rich semantic identifiers enables the retrieval model to learn richer semantic information about candidate retrieval objects, thereby improving the accuracy and precision of retrieval results when using the retrieval model. Introducing semantic identifiers of retrieval objects of different modalities into the training samples enables the trained retrieval model to perform retrieval of retrieval objects of different modalities. During the training process, non-textual retrieval condition information is acquired and converted into textual retrieval condition information. This data is then used to train the retrieval model, enabling the retrieval model to perform retrieval with multimodal inputs.

[0178] Referring to Figure 10, a schematic diagram of a retrieval device provided in an embodiment of this application is shown. As shown in Figure 10, the retrieval device 1000 includes:

[0179] The acquisition unit 1001 is used to acquire the first retrieval text;

[0180] The processing unit 1002 is used to input the first retrieval text into the retrieval model for processing to obtain a first semantic identifier, the first semantic identifier including a first global semantic identifier and a first local semantic identifier;

[0181] The determining unit 1003 is used to determine the first target retrieval object based on the first semantic identifier and the correspondence table between the retrieval object and the semantic identifier, wherein there is a correspondence between the first target retrieval object and the first semantic identifier.

[0182] In one feasible implementation, the acquisition unit 1001 acquires the first retrieval text, including:

[0183] The retrieval device acquires retrieval condition information, which includes text-type retrieval condition information or non-text-type retrieval condition information. When the retrieval condition information is non-text-type retrieval condition information, the retrieval device converts the non-text-type retrieval condition information into text-type retrieval condition information, wherein the first retrieval text is text-type retrieval condition information.

[0184] In one feasible implementation, the acquisition unit 1001 is also used to acquire the second retrieval text;

[0185] The processing unit 1002 is also used to input the second retrieval text into the retrieval model for processing to obtain a second semantic identifier, the second semantic identifier including a second global semantic identifier and a second local semantic identifier;

[0186] The determining unit 1003 is further configured to determine the second target retrieval object based on the second semantic identifier and the correspondence table, wherein there is a correspondence between the second target retrieval object and the second semantic identifier; the modality of the first target retrieval object is different from the modality of the second target retrieval object.

[0187] In one feasible implementation, the retrieval model includes an encoder and a decoder. The processing unit 1002 inputs the first retrieval text into the retrieval model for processing to obtain a first semantic identifier, including:

[0188] An encoder is used to extract multi-scale features from the first retrieval text to obtain multiple first feature vectors. Based on these first feature vectors, a second feature vector and a first vector matrix are obtained, where the second feature vector is obtained by concatenating the multiple first feature vectors, and the first vector matrix is obtained by stacking the multiple first feature vectors. A decoder is used to perform multiple first operations based on the second feature vector, the first vector matrix, and first intermediate data to obtain multiple logical values, each corresponding to one first operation. These logical values are then processed to obtain the first semantic identifier. During the first of the first operations, the first intermediate data is the starting token. During the x-th first operation, where x is greater than 1 and not greater than M, the first intermediate data is the (x-1)-th token in the first semantic identifier. The token obtained based on the logical value from the last first operation is the termination token.
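The generation loop in [0188] can be sketched as below, followed by the lookup in the correspondence table. Greedy argmax selection is an assumption (the text only says the logical values are "processed"; beam search would be another option). `toy_next_logits`, the vocabulary, the scripted transitions, and the table entry are all hypothetical stand-ins for the real decoder and data.

```python
START, END = "<s>", "</s>"  # hypothetical starting and termination tokens

def generate_identifier(next_logits, vocab, max_steps):
    """Greedy decoding: each loop iteration is one 'first operation'."""
    tokens = []
    previous = START                      # first intermediate data at step 1
    for _ in range(max_steps):
        logits = next_logits(previous)    # logical values for this step
        best = max(range(len(vocab)), key=lambda i: logits[i])
        if vocab[best] == END:            # the last step yields the termination token
            break
        tokens.append(vocab[best])
        previous = vocab[best]            # step x uses the (x-1)-th token
    return tuple(tokens)

vocab = ["g7", "l3", "l9", END]
script = {START: "g7", "g7": "l3", "l3": "l9", "l9": END}

def toy_next_logits(previous):
    # Stand-in decoder: puts all weight on a scripted next token.
    return [1.0 if tok == script[previous] else 0.0 for tok in vocab]

# Hypothetical correspondence table: semantic identifier -> retrieval object.
table = {("g7", "l3", "l9"): "image_0042.jpg"}

identifier = generate_identifier(toy_next_logits, vocab, max_steps=8)
target_object = table.get(identifier)
print(identifier, target_object)
```

In this toy run the first token plays the role of the global semantic identifier and the remaining tokens the local one.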

[0189] In one feasible implementation, the decoder includes multiple decoding layers, and the processing unit 1002 utilizes the decoder to perform multiple first operations based on the second feature vector, the first vector matrix, and the first intermediate data to obtain multiple logical values, including:

[0190] During the x-th first operation, the s-th decoding layer processes the second feature vector, the first vector matrix, and the second intermediate data to obtain the output of the s-th decoding layer. Specifically, when s > 1, the second intermediate data is the output of the (s-1)-th decoding layer; when s = 1, the second intermediate data is the first intermediate data; and when the s-th decoding layer is the last decoding layer, its output is the logical value corresponding to the x-th first operation.

[0191] In one feasible implementation, the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function. The processing unit 1002 uses the s-th decoding layer to process the second feature vector, the first vector matrix, and the second intermediate data to obtain the output of the s-th decoding layer, including:

[0192] The second feature vector is fused using a fusion layer to obtain a first fused vector, the dimension of which is lower than that of the second feature vector. A cross-attention layer processes the second intermediate data and the first fused vector to obtain a third feature vector. An activation function processes the third feature vector to obtain a fourth feature vector. The fourth feature vector is averaged to obtain a first average value. A cross-attention layer processes multiple first feature vectors in the first vector matrix to obtain multiple fifth feature vectors, each corresponding to a first feature vector. The second intermediate data serves as the Q-value of the cross-attention layer, and the first feature vectors serve as the K-value and V-value of the cross-attention layer. The retrieval device linearizes the multiple fifth feature vectors using a linearization layer to obtain multiple first processing results. An activation function processes the multiple first processing results to obtain multiple weights corresponding to the multiple fifth feature vectors. A dot product operation is performed between the multiple fifth feature vectors and their corresponding weights to obtain multiple sixth feature vectors. Mathematical operations are performed between the multiple sixth feature vectors and the first average value to obtain the output of the s-th decoding layer.

[0193] In one feasible implementation, the acquisition unit 1001 is also used to acquire multiple candidate retrieval objects;

[0194] The processing unit 1002 is further configured to extract features from multiple candidate retrieval objects to obtain multiple seventh feature vectors; perform clustering processing on the multiple seventh feature vectors to obtain multiple first clusters and multiple first cluster centers; wherein each of the multiple first clusters includes one or more seventh feature vectors; the multiple first clusters correspond to multiple first cluster centers, and the global semantic identifier of the semantic identifier of any candidate retrieval object A among the multiple candidate retrieval objects is the index of the first cluster to which candidate retrieval object A belongs; the retrieval device determines multiple first residual vectors based on the multiple first clusters and multiple first cluster centers, the multiple first residual vectors correspond to multiple candidate retrieval objects, and any one of the multiple first residual vectors B is the difference between the seventh feature vector of the candidate retrieval object corresponding to the first residual vector B and the first cluster center corresponding to the first cluster to which the seventh feature vector of the candidate retrieval object corresponding to the first residual vector B belongs; determine the local semantic identifier of the semantic identifier of each candidate retrieval object among the multiple candidate retrieval objects based on the multiple first residual vectors; and establish a correspondence table based on the multiple candidate retrieval objects and their corresponding semantic identifiers.

[0195] In one feasible implementation, the local semantic identifier of each retrieval object includes k tokens, where k is an integer greater than 1. The processing unit 1002 determines the local semantic identifiers of the semantic identifiers of multiple candidate retrieval objects based on multiple first residual vectors, including:

[0196] Multiple first residual vectors are dimensionality reduced to obtain multiple second residual vectors. Each second residual vector is processed k times to obtain k tokens in the local semantic identifier. When i is greater than 1, the i-th token among the k tokens is determined by the input data during the i-th processing. The input data during the i-th processing is determined based on the input data during the (i-1)-th processing and the (i-1)-th token among the k tokens. When i = 1, the input data during the i-th processing is the second residual vector.

[0197] In one feasible implementation, for a first candidate retrieval object and a second candidate retrieval object among multiple candidate retrieval objects, where the global semantic identifier and the local semantic identifier are the same, the semantic identifier of the first candidate retrieval object and the semantic identifier of the second candidate retrieval object further include a first identifier and a second identifier, respectively. The first identifier and the second identifier are used to distinguish the first candidate retrieval object and the second candidate retrieval object.
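The disambiguation described in [0197] can be sketched when building the correspondence table. Using a running integer as the extra identifier, and appending it only when a collision occurs, are assumptions; the text only requires that the extra identifiers distinguish the colliding objects. The object names and identifier values are invented.

```python
from collections import defaultdict

def build_correspondence_table(objects):
    """objects: iterable of (object_name, global_id, local_tokens)."""
    groups = defaultdict(list)
    for name, global_id, local_tokens in objects:
        groups[(global_id, tuple(local_tokens))].append(name)

    table = {}
    for (global_id, local_tokens), names in groups.items():
        for suffix, name in enumerate(names):
            if len(names) == 1:
                key = (global_id, *local_tokens)          # already unique
            else:
                key = (global_id, *local_tokens, suffix)  # extra identifier
            table[key] = name
    return table

objects = [
    ("cat.jpg", 3, [1, 4]),
    ("cat.mp4", 3, [1, 4]),   # same global and local identifier as cat.jpg
    ("dog.wav", 5, [2, 2]),
]
table = build_correspondence_table(objects)
print(table)
```

Every key in the resulting table maps to exactly one retrieval object, so a generated semantic identifier resolves unambiguously.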

[0198] It is worth noting that the specific functional implementation of the retrieval device 1000 is described in the relevant description of the data retrieval method above. For example, the acquisition unit 1001 is used to execute the relevant content of S201, the processing unit 1002 is used to execute the relevant content of S202, and the determination unit 1003 is used to execute the relevant content of S203. Each unit or module in the retrieval device 1000 can be individually or entirely merged into one or more other units or modules, or some of the units or modules can be further divided into multiple functionally smaller units or modules. This achieves the same operation without affecting the technical effect of the embodiments of this application. The above-mentioned units or modules are based on logical function division. In practical applications, the function of one unit (or module) is implemented by multiple units (or modules), or the function of multiple units (or modules) is implemented by one unit (or module).

[0199] Referring to Figure 11, a schematic diagram of a training device provided in an embodiment of this application is shown. As shown in Figure 11, the training device 1100 includes:

[0200] The acquisition unit 1101 is used to acquire multiple training samples. Each training sample includes a third retrieval text and a third semantic identifier of the retrieval object corresponding to the third retrieval text. The third semantic identifier includes a third global semantic identifier and a third local semantic identifier.

[0201] The training unit 1102 is used to train a retrieval model based on the third retrieval text and the third semantic identifiers of the multiple training samples.

[0202] In one feasible implementation, multiple training samples include a first training sample and a second training sample, wherein the modality of the retrieval object corresponding to the semantic identifier in the first training sample is different from the modality of the retrieval object corresponding to the semantic identifier in the second training sample.

[0203] In one feasible implementation, the acquisition unit 1101 acquires multiple training samples including:

[0204] Multiple search criteria are obtained, including text-based and/or non-text-based search criteria. Non-text-based search criteria are converted into text-based search criteria. Based on the multiple search criteria, multiple corresponding search objects are retrieved, and multiple third semantic identifiers are determined based on the multiple search objects. Based on the obtained multiple text-based search criteria and the multiple third semantic identifiers, multiple training samples are obtained, where the text-based search criteria are the search text in the training samples.

[0205] In one feasible implementation, the retrieval model includes an encoder, a decoder, and a softmax function. The training unit 1102 trains the retrieval model based on the third retrieval text and the third semantic identifiers of the multiple training samples, including:

[0206] An encoder is used to extract multi-scale features from the third retrieval text in each of the multiple training samples to obtain multiple eighth feature vectors for each training sample. Based on these eighth feature vectors, a ninth feature vector and a second vector matrix are obtained for each training sample. The ninth feature vector is obtained by concatenating the multiple eighth feature vectors, and the second vector matrix is obtained by stacking the multiple eighth feature vectors. A decoder performs the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two output result sets for each training sample. A softmax function is used to process one of the two output result sets for each training sample to obtain a first probability value, which represents the probability of generating the third semantic identifier in each training sample. A cross-entropy loss value is determined based on the first probability values for the multiple training samples. A consistency loss value is determined based on the two output result sets for each training sample. The retrieval model is trained based on the cross-entropy loss value and the consistency loss values for the multiple training samples.

[0207] In one feasible implementation, the training unit 1102 performs the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two sets of output results for each training sample, including:

[0208] The decoder performs M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample to obtain M first output results corresponding to each training sample; the third semantic identifier includes M tokens.

[0209] In the first of the M second operations, the decoder's input data includes the ninth feature vector, the second vector matrix, and the starting token for each training sample. In the j-th second operation, where j is greater than 1 and not greater than M, the decoder's input data includes the ninth feature vector, the second vector matrix, and the (j-1)-th token of the third semantic identifier for each training sample. The training device then uses the decoder again to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier for each training sample to obtain M second output results for each training sample. The two output result sets consist of the M first output results and the M second output results, respectively.
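The per-step token inputs described above follow the familiar teacher-forcing pattern: the decoder sees the starting token first and then the previous ground-truth token at each later step. A minimal sketch, in which the `START` marker and the function name are hypothetical:

```python
START = "<s>"  # hypothetical starting token; the application does not name it

def decoder_token_inputs(identifier_tokens):
    """Return the token fed to the decoder at each of the M second operations.

    Step 1 receives the starting token; step j (j > 1) receives the
    (j-1)-th token of the third semantic identifier.
    """
    return [START] + list(identifier_tokens[:-1])
```

For an identifier with tokens `[3, 7, 1]`, the three steps receive `"<s>"`, `3`, and `7`. The same input sequence would be used for both passes; the application does not state what makes the two passes differ.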

[0210] In one feasible implementation, the decoder includes multiple decoding layers. The training unit 1102 utilizes the decoder to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample, to obtain M first output results corresponding to each training sample, including:

[0211] During the j-th of the M second operations, the s-th decoding layer processes the ninth feature vector, the second vector matrix, and the third intermediate data corresponding to each training sample to obtain the output result of the s-th decoding layer. When s is greater than 1, the third intermediate data is the output result of the (s-1)-th decoding layer; when s = 1, the third intermediate data is the fourth intermediate data. When j = 1, the fourth intermediate data is the starting token; when j is greater than 1 and not greater than M, the fourth intermediate data is the (j-1)-th token in the third semantic identifier. The output result of the last of the multiple decoding layers is the j-th first output result among the M first output results.

[0212] In one feasible implementation, the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function. The training unit 1102 uses the s-th decoding layer to process the ninth feature vector, the second vector matrix, and the third intermediate data corresponding to each training sample to obtain the output result of the s-th decoding layer, including:

[0213] The ninth feature vector is fused using a fusion layer to obtain a second fused vector, the dimension of which is lower than that of the ninth feature vector. A cross-attention layer processes the third intermediate data and the second fused vector to obtain a tenth feature vector. An activation function is used to process the tenth feature vector to obtain an eleventh feature vector. The eleventh feature vector is then averaged to obtain a second average value. The training device uses a cross-attention layer to process multiple eighth feature vectors in the second vector matrix to obtain multiple twelfth feature vectors, each corresponding to a different eighth feature vector. The third intermediate data serves as the Q-value of the cross-attention layer, and the eighth feature vectors serve as the K and V-values. A linearization layer linearizes the multiple twelfth feature vectors to obtain multiple second processing results. An activation function is used to process these second processing results to obtain multiple weights corresponding to the twelfth feature vectors. A dot product operation is performed between the twelfth feature vectors and their corresponding weights to obtain multiple thirteenth feature vectors. Finally, a mathematical operation is performed between the multiple thirteenth feature vectors and the second average value to obtain the output of the s-th decoding layer.
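The final weighting and merging steps of the decoding layer can be sketched as follows. This covers only the part after the cross-attention outputs are available, and it rests on several assumptions the application leaves open: the linearization layer is taken to map each twelfth feature vector to a single scalar, the activation function is taken to be a sigmoid, and the unspecified "mathematical operation" is taken to be a sum of the weighted vectors plus the scalar second average value. All names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_combine(twelfth_vectors, linear_w, linear_b, second_average):
    """Weight each per-scale vector by a learned gate, then merge with the average.

    twelfth_vectors: one vector per eighth feature vector (cross-attention outputs).
    linear_w, linear_b: parameters of the ASSUMED scalar linearization layer.
    second_average: scalar average taken from the fused-vector branch.
    """
    dim = len(twelfth_vectors[0])
    # ASSUMPTION: the scalar average is added to every output element.
    output = [second_average] * dim
    for vec in twelfth_vectors:
        score = sum(w * v for w, v in zip(linear_w, vec)) + linear_b  # linearization
        weight = sigmoid(score)                                       # activation -> weight
        for d in range(dim):
            output[d] += weight * vec[d]  # accumulate the weighted (thirteenth) vector
    return output
```

With zero linear parameters every gate equals 0.5, so each scale contributes half of its vector to the layer output.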

[0214] In one feasible implementation, the acquisition unit 1101 determines multiple third semantic identifiers corresponding to multiple search objects based on multiple search objects, including:

[0215] Feature extraction is performed on the multiple search objects to obtain multiple fourteenth feature vectors. These fourteenth feature vectors are then clustered to obtain multiple second clusters and multiple second cluster centers. Each second cluster includes one or more fourteenth feature vectors. The multiple second clusters correspond to the multiple second cluster centers, and the global semantic identifier of the third semantic identifier of any search object C is the index of the second cluster to which search object C belongs. Based on the multiple second clusters and the multiple second cluster centers, multiple third residual vectors are determined. These third residual vectors correspond to the multiple search objects: any third residual vector D is the difference between the fourteenth feature vector of its corresponding search object and the center of the second cluster to which that fourteenth feature vector belongs. Based on the multiple third residual vectors, the local semantic identifier of the third semantic identifier of each search object is determined.
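The relationship between the cluster index (global semantic identifier) and the residual vector can be sketched as below. The application does not name the clustering algorithm; this sketch assumes the cluster centers have already been computed (for example by k-means) and that cluster membership is decided by squared Euclidean distance. The function names are hypothetical.

```python
def nearest_center(vec, centers):
    """Index of the cluster center closest to vec (squared Euclidean distance, assumed)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda c: sqdist(vec, centers[c]))

def global_id_and_residual(vec, centers):
    """Return (cluster index, residual vector) for one feature vector.

    The cluster index plays the role of the global semantic identifier;
    the residual is the feature vector minus its cluster center.
    """
    idx = nearest_center(vec, centers)
    residual = [x - c for x, c in zip(vec, centers[idx])]
    return idx, residual
```

For example, with centers `[[0, 0], [10, 10]]`, the feature vector `[9, 12]` falls in cluster 1 and leaves the residual `[-1, 2]`, which is what the local semantic identifier then encodes.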

[0216] In one feasible implementation, the local semantic identifier of each search object includes k tokens, where k is an integer greater than 1. The acquisition unit 1101 determines the local semantic identifiers of the third semantic identifiers of the multiple search objects based on the multiple third residual vectors, including:

[0217] Dimensionality reduction is performed on multiple third residual vectors to obtain multiple fourth residual vectors. Each of the multiple fourth residual vectors is processed k times to obtain k tokens in the local semantic identifier. When i is greater than 1, the i-th token among the k tokens is determined by the input data during the i-th processing. The input data during the i-th processing is determined based on the input data during the (i-1)-th processing and the (i-1)-th token among the k tokens. When i = 1, the input data during the i-th processing is the fourth residual vector.
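The k-step token derivation above resembles residual quantization, in which each step picks the nearest codeword and passes what is left over to the next step. The sketch below is one possible reading, not a statement of the applicant's exact procedure: the per-level codebooks, the nearest-codeword rule, and the subtraction used to form the next input are all assumptions.

```python
def residual_quantize(fourth_residual, codebooks):
    """Derive k tokens from a (dimension-reduced) residual vector.

    codebooks: k lists of codeword vectors (hypothetical, one list per level).
    Step 1 takes the residual vector itself as input; step i > 1 takes the
    previous input minus the codeword selected by the (i-1)-th token.
    """
    tokens = []
    current = list(fourth_residual)
    for codebook in codebooks:
        # The i-th token is the index of the codeword nearest to the current input.
        tok = min(
            range(len(codebook)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(current, codebook[c])),
        )
        tokens.append(tok)
        # The next input depends on the previous input and the token just chosen.
        current = [x - y for x, y in zip(current, codebook[tok])]
    return tokens
```

With two levels of codebooks, `[[0, 0], [4, 4]]` and `[[0, 0], [1, -1]]`, the vector `[5, 3]` is first matched to `[4, 4]` (token 1), leaving `[1, -1]`, which matches the second-level codeword exactly (token 1).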

[0218] In one feasible implementation, for a first retrieval object and a second retrieval object among the multiple retrieval objects that have the same global semantic identifier and the same local semantic identifier, the third semantic identifier of the first retrieval object further includes a third identifier and the third semantic identifier of the second retrieval object further includes a fourth identifier. The third identifier and the fourth identifier are used to distinguish the first retrieval object from the second retrieval object.
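One simple way to realize such distinguishing identifiers is to append a running index to objects whose global and local identifiers collide. The sketch below is an assumption about how this could be done; the application does not specify how the extra identifiers are generated, and the function name and tuple representation are hypothetical.

```python
from collections import defaultdict

def build_identifier_table(object_ids):
    """Map each final semantic identifier to exactly one retrieval object.

    object_ids: dict of object name -> tuple (global token, local tokens...).
    Objects sharing the same tuple receive an extra trailing index
    (ASSUMED scheme) so that every identifier stays unique.
    """
    groups = defaultdict(list)
    for obj, sem_id in object_ids.items():
        groups[tuple(sem_id)].append(obj)
    table = {}
    for sem_id, objs in groups.items():
        if len(objs) == 1:
            table[sem_id] = objs[0]          # no collision, identifier unchanged
        else:
            for n, obj in enumerate(objs):   # collision, append a distinguishing token
                table[sem_id + (n,)] = obj
    return table
```

Given objects `a` and `b` that both map to `(3, 1, 2)` and an object `c` that maps to `(4, 0, 0)`, the table contains `(3, 1, 2, 0)`, `(3, 1, 2, 1)`, and `(4, 0, 0)`.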

[0219] It is worth noting that the specific functional implementation of the training device 1100 is described in the relevant description of the retrieval model training method above. For example, the acquisition unit 1101 is used to execute the relevant content of S801, and the training unit 1102 is used to execute the relevant content of S802. Each unit or module in the training device 1100 can be individually or entirely merged into one or more other units or modules, or some of the units or modules can be further divided into multiple functionally smaller units or modules. This achieves the same operation without affecting the technical effect of the embodiments of this application. The above-mentioned units or modules are based on logical function division. In practical applications, the function of one unit (or module) is implemented by multiple units (or modules), or the function of multiple units (or modules) is implemented by one unit (or module).

[0220] Based on the description of the above method embodiments and related device embodiments, refer to Figure 12, which provides a schematic diagram of the structure of an electronic device 1200. The electronic device 1200 shown in Figure 12 includes a memory 1201, a processor 1202, a communication interface 1203, and a bus 1204. The memory 1201, the processor 1202, and the communication interface 1203 are interconnected through the bus 1204.

[0221] Optionally, the memory 1201 can be a ROM, a static storage device, a dynamic storage device, or RAM.

[0222] The memory 1201 is capable of storing programs. When a program stored in the memory 1201 is executed by the processor 1202, the processor 1202 and the communication interface 1203 are used to execute the steps of the data retrieval method of the embodiment shown in Figure 2 or the retrieval model training method of the embodiment shown in Figure 8.

[0223] The processor 1202 uses a general-purpose CPU, microprocessor, application-specific integrated circuit (ASIC), GPU, or one or more integrated circuits to execute relevant programs to implement the data retrieval method of the embodiment shown in Figure 2 or the retrieval model training method of the embodiment shown in Figure 8.

[0224] The processor 1202 can also be an integrated circuit chip with signal processing capabilities. During implementation, each step of the data retrieval method or retrieval model training method of this application can be completed through the integrated logic circuitry of the hardware in the processor 1202 or through software instructions. Optionally, the processor 1202 can be a general-purpose processor, DSP, ASIC, FPGA, or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor is a microprocessor or any conventional processor, etc. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or can be executed by a combination of hardware and software modules in the decoding processor. Optional software modules are located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The storage medium is located in the memory 1201. The processor 1202 reads the information in the memory 1201 and, in conjunction with its hardware, performs the functions required by the units included in the retrieval device 1000 or training device 1100 of this application embodiment, or executes the data retrieval method or retrieval model training method of the method embodiment of this application.

[0225] The communication interface 1203 uses transceiver-related devices, such as, but not limited to, transceivers, to enable communication between the electronic device 1200 and other devices or communication networks.

[0226] Bus 1204 may include a pathway for transmitting information between various components of electronic device 1200 (e.g., memory 1201, processor 1202, communication interface 1203).

[0227] It should be noted that although the electronic device 1200 shown in Figure 12 only illustrates the memory, processor, and communication interface, those skilled in the art should understand that in specific implementations, the electronic device 1200 may also include other devices necessary for normal operation. Furthermore, depending on specific needs, those skilled in the art should understand that the electronic device 1200 may also include hardware devices for implementing other additional functions. In addition, those skilled in the art should understand that the electronic device 1200 may only include the devices necessary for implementing the embodiments of this application, and not necessarily all the devices shown in Figure 12.

[0228] This application also provides a chip, which includes a processor and a data interface. The processor reads instructions stored in the memory through the data interface to implement the data retrieval method or retrieval model training method of this application.

[0229] Optionally, as one implementation, the chip further includes a memory storing instructions, and the processor is used to execute the instructions stored in the memory. When the instructions are executed, the processor is used to execute the data retrieval method or the retrieval model training method.

[0230] This application also provides a computer-readable storage medium storing instructions that, when executed on a computer or processor, cause the computer or processor to perform one or more steps of any of the above methods.

[0231] This application also provides a computer program product containing instructions. When the computer program product is run on a computer or processor, it causes the computer or processor to perform one or more steps of any of the methods described above.

[0232] Those skilled in the art will appreciate that the functionality described in conjunction with the various illustrative logic blocks, modules, and algorithmic steps disclosed herein can be implemented by hardware, software, firmware, or any combination thereof. If implemented in software, the functionality described by the various illustrative logic blocks, modules, and steps can be stored or transmitted as one or more instructions or codes on a computer-readable medium and executed by a hardware-based processing unit. The computer-readable medium may comprise a computer-readable storage medium, which corresponds to a tangible medium, such as a data storage medium, or a communication medium that includes any medium facilitating the transfer of a computer program from one place to another (e.g., based on a communication protocol). In this way, the computer-readable medium may substantially correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium, such as a signal or carrier wave. The data storage medium may be any available medium accessible by one or more computers or one or more processors to retrieve instructions, code, and / or data structures for implementing the techniques described in this application. A computer program product may comprise a computer-readable medium.

[0233] By way of example and not limitation, such computer-readable storage media include RAM, ROM, EEPROM, CD-ROM or other optical disc storage devices, magnetic disk storage devices or other magnetic storage devices, flash memory, or any other media that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Furthermore, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but instead refer to non-transitory tangible storage media. As used herein, disks and discs include compact discs (CDs), laser discs, optical discs, digital versatile discs (DVDs), and Blu-ray discs, where disks typically reproduce data magnetically, while discs reproduce data optically using lasers. Combinations of the above should also be included within the scope of computer-readable media.

[0234] Instructions can be executed by one or more processors, such as one or more DSPs, general-purpose microprocessors, ASICs, FPGAs, or other equivalent integrated or discrete logic circuits. Therefore, the term "processor" as used herein can refer to any of the foregoing structures or any other structures suitable for implementing the techniques described herein. Furthermore, in some aspects, the functions described in the various illustrative logic blocks, modules, and steps described herein are provided within dedicated hardware and / or software modules configured for encoding and decoding, or incorporated into combined codecs. Moreover, the techniques can be fully implemented within one or more circuit or logic elements.

[0235] In the embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the division of units is merely a logical functional division, and in actual implementation, there may be other division methods. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Optionally, the coupling, direct coupling, or communication connection shown or discussed between them may be through some interfaces, indirect coupling or communication connection of devices or units, such as electrical, mechanical, or other forms.

[0236] Optionally, the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0237] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented, in whole or in part, as a computer program product. This computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions according to the embodiments of this application are generated.

[0238] The above description is merely a specific implementation of the embodiments of this application, but the protection scope of the embodiments of this application is not limited thereto. Any changes or substitutions within the technical scope disclosed in the embodiments of this application should be covered within the protection scope of the embodiments of this application. Therefore, the protection scope of the embodiments of this application should be determined by the protection scope of the claims.

Claims

1. A data retrieval method, characterized in that the method comprises: acquiring first retrieval text; inputting the first retrieval text into a retrieval model for processing to obtain a first semantic identifier, wherein the first semantic identifier includes a first global semantic identifier and a first local semantic identifier; and determining a first target retrieval object based on the first semantic identifier and a correspondence table between retrieval objects and semantic identifiers, wherein a correspondence exists between the first target retrieval object and the first semantic identifier.

2. The method according to claim 1, characterized in that acquiring the first retrieval text comprises: acquiring retrieval condition information, wherein the retrieval condition information is of a text type or a non-text type; and when the retrieval condition information is of the non-text type, converting it into retrieval condition information of the text type; wherein the first retrieval text is the retrieval condition information of the text type.

3. The method according to claim 1 or 2, characterized in that the method further comprises: acquiring second retrieval text; inputting the second retrieval text into the retrieval model for processing to obtain a second semantic identifier, wherein the second semantic identifier includes a second global semantic identifier and a second local semantic identifier; and determining a second target retrieval object based on the second semantic identifier and the correspondence table, wherein a correspondence exists between the second target retrieval object and the second semantic identifier, and the modality of the first target retrieval object is different from that of the second target retrieval object.

4. The method according to any one of claims 1-3, characterized in that the retrieval model includes an encoder and a decoder, and inputting the first retrieval text into the retrieval model for processing to obtain the first semantic identifier comprises: using the encoder to extract multi-scale features from the first retrieval text to obtain multiple first feature vectors; obtaining a second feature vector and a first vector matrix based on the multiple first feature vectors, wherein the second feature vector is obtained by concatenating the multiple first feature vectors, and the first vector matrix is obtained by stacking the multiple first feature vectors; using the decoder to perform multiple first operations based on the second feature vector, the first vector matrix, and first intermediate data to obtain multiple logical values, wherein the multiple logical values correspond to the multiple first operations; and processing the multiple logical values to obtain the first semantic identifier; wherein, during the 1st first operation, the first intermediate data is the starting token; during the x-th first operation, where x is greater than 1 and not greater than M, the first intermediate data is the (x-1)-th token in the first semantic identifier; and the token obtained based on the logical value obtained during the last first operation is the termination token.

5. The method according to claim 4, characterized in that the decoder includes multiple decoding layers, and using the decoder to perform multiple first operations based on the second feature vector, the first vector matrix, and the first intermediate data to obtain multiple logical values comprises: during the x-th first operation, using the s-th decoding layer to process the second feature vector, the first vector matrix, and second intermediate data to obtain the output result of the s-th decoding layer; wherein, when s is greater than 1, the second intermediate data is the output result of the (s-1)-th decoding layer; when s = 1, the second intermediate data is the first intermediate data; and when the s-th decoding layer is the last decoding layer, the output result of the s-th decoding layer is the logical value corresponding to the x-th first operation.

6. The method according to claim 5, characterized in that the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function, and using the s-th decoding layer to process the second feature vector, the first vector matrix, and the second intermediate data to obtain the output result of the s-th decoding layer comprises: fusing the second feature vector using the fusion layer to obtain a first fused vector, the dimension of which is lower than the dimension of the second feature vector; processing the second intermediate data and the first fused vector using the cross-attention layer to obtain a third feature vector; processing the third feature vector using the activation function to obtain a fourth feature vector; averaging the fourth feature vector to obtain a first average value; processing the multiple first feature vectors in the first vector matrix using the cross-attention layer to obtain multiple fifth feature vectors, which correspond to the multiple first feature vectors, wherein the second intermediate data is used as the Q value of the cross-attention layer, and the first feature vectors are used as the K value and V value of the cross-attention layer; linearizing the multiple fifth feature vectors using the linearization layer to obtain multiple first processing results; processing the multiple first processing results using the activation function to obtain multiple weights corresponding to the multiple fifth feature vectors; performing a dot product operation on the multiple fifth feature vectors and their corresponding weights to obtain multiple sixth feature vectors; and performing a mathematical operation on the multiple sixth feature vectors and the first average value to obtain the output result of the s-th decoding layer.

7. The method according to any one of claims 1-6, characterized in that the method further comprises: acquiring multiple candidate retrieval objects; performing feature extraction on the multiple candidate retrieval objects to obtain multiple seventh feature vectors; clustering the multiple seventh feature vectors to obtain multiple first clusters and multiple first cluster centers, wherein each of the multiple first clusters includes one or more of the seventh feature vectors, the multiple first clusters correspond to the multiple first cluster centers, and the global semantic identifier of the semantic identifier of any candidate retrieval object A among the multiple candidate retrieval objects is the index of the first cluster to which the candidate retrieval object A belongs; determining multiple first residual vectors based on the multiple first clusters and the multiple first cluster centers, wherein the multiple first residual vectors correspond to the multiple candidate retrieval objects, and any first residual vector B among the multiple first residual vectors is the difference between the seventh feature vector of its corresponding candidate retrieval object and the center of the first cluster to which that seventh feature vector belongs; determining, based on the multiple first residual vectors, the local semantic identifiers of the semantic identifiers of the multiple candidate retrieval objects respectively; and establishing the correspondence table based on the multiple candidate retrieval objects and their corresponding semantic identifiers.

8. The method according to claim 7, characterized in that the local semantic identifier of each candidate retrieval object includes k tokens, where k is an integer greater than 1, and determining the local semantic identifiers of the multiple candidate retrieval objects based on the multiple first residual vectors comprises: performing dimensionality reduction on the multiple first residual vectors to obtain multiple second residual vectors; and performing k processing operations based on each of the multiple second residual vectors to obtain the k tokens in the local semantic identifier; wherein the i-th token among the k tokens is determined by the input data of the i-th processing operation; when i is greater than 1, the input data of the i-th processing operation is determined based on the input data of the (i-1)-th processing operation and the (i-1)-th token among the k tokens; and when i = 1, the input data of the i-th processing operation is the second residual vector.

9. The method according to claim 7 or 8, characterized in that, for a first candidate retrieval object and a second candidate retrieval object among the multiple candidate retrieval objects that have the same global semantic identifier and the same local semantic identifier, the semantic identifier of the first candidate retrieval object further includes a first identifier and the semantic identifier of the second candidate retrieval object further includes a second identifier, wherein the first identifier and the second identifier are used to distinguish the first candidate retrieval object from the second candidate retrieval object.

10. A method for training a retrieval model, characterized in that the method comprises: obtaining multiple training samples, each of which includes a third retrieval text and a third semantic identifier of the retrieval object corresponding to the third retrieval text, wherein the third semantic identifier includes a third global semantic identifier and a third local semantic identifier; and training the retrieval model based on the third retrieval text and the third semantic identifier in the multiple training samples.

11. The method according to claim 10, characterized in that the multiple training samples include a first training sample and a second training sample, wherein the modality of the retrieval object corresponding to the semantic identifier in the first training sample is different from the modality of the retrieval object corresponding to the semantic identifier in the second training sample.

12. The method according to claim 10 or 11, characterized in that obtaining the multiple training samples comprises: obtaining multiple pieces of retrieval condition information, including retrieval condition information of a text type and/or retrieval condition information of a non-text type; converting any retrieval condition information of the non-text type among the multiple pieces of retrieval condition information into retrieval condition information of the text type; obtaining multiple corresponding retrieval objects based on the multiple pieces of retrieval condition information, and determining multiple third semantic identifiers corresponding to the multiple retrieval objects based on the multiple retrieval objects; and obtaining the multiple training samples based on the multiple pieces of retrieval condition information of the text type and the multiple third semantic identifiers, wherein the retrieval condition information of the text type is the retrieval text in the training samples.

13. The method according to any one of claims 10-12, characterized in that the retrieval model includes an encoder, a decoder, and a softmax function, and training the retrieval model based on the third retrieval text and the third semantic identifier in the multiple training samples comprises: using the encoder to perform multi-scale feature extraction on the third retrieval text in each of the multiple training samples to obtain multiple eighth feature vectors corresponding to each training sample; obtaining a ninth feature vector and a second vector matrix corresponding to each training sample based on the multiple eighth feature vectors, wherein the ninth feature vector is obtained by concatenating the multiple eighth feature vectors, and the second vector matrix is obtained by stacking the multiple eighth feature vectors; using the decoder to perform the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two output result sets for each training sample; using the softmax function to process one of the two output result sets corresponding to each training sample to obtain a first probability value corresponding to each training sample, wherein the first probability value characterizes the probability of generating the third semantic identifier in each training sample; determining a cross-entropy loss value based on the first probability values corresponding to the multiple training samples; determining a consistency loss value corresponding to each training sample based on the two output result sets corresponding to each training sample; and training the retrieval model based on the cross-entropy loss value and the consistency loss values corresponding to the multiple training samples.

14. The method according to claim 13, characterized in that performing the first operation twice on the ninth feature vector, the second vector matrix, and the third semantic identifier in each training sample to obtain two output result sets for each training sample comprises: using the decoder to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample to obtain M first output results corresponding to each training sample, wherein the third semantic identifier includes M tokens; during the first of the M second operations, the input data of the decoder includes the ninth feature vector, the second vector matrix, and the starting token corresponding to each training sample; and during the j-th second operation, the input data of the decoder includes the ninth feature vector, the second vector matrix, and the (j-1)-th token in the third semantic identifier corresponding to each training sample; and using the decoder again to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample to obtain M second output results corresponding to each training sample; wherein the two output result sets consist of the M first output results and the M second output results, respectively.

15. The method according to claim 14, characterized in that the decoder includes multiple decoding layers, and using the decoder to perform M second operations based on the ninth feature vector, the second vector matrix, and the third semantic identifier corresponding to each training sample to obtain M first output results corresponding to each training sample includes: during the j-th second operation of the M second operations, using the s-th decoding layer to process the ninth feature vector, the second vector matrix, and third intermediate data corresponding to each training sample to obtain the output result of the s-th decoding layer; wherein, when s is greater than 1, the third intermediate data is the output result of the (s-1)-th decoding layer; when s = 1, the third intermediate data is fourth intermediate data; when j = 1, the fourth intermediate data is the starting token; when j is greater than 1 and not greater than M, the fourth intermediate data is the (j-1)-th token in the third semantic identifier; and the output result of the last of the multiple decoding layers is the j-th first output result among the M first output results.

16. The method according to claim 15, characterized in that the s-th decoding layer includes a cross-attention layer, a fusion layer, a linearization layer, and an activation function, and using the s-th decoding layer to process the ninth feature vector, the second vector matrix, and the third intermediate data corresponding to each training sample to obtain the output result of the s-th decoding layer includes: using the fusion layer to fuse the ninth feature vector to obtain a second fused vector, the dimension of which is lower than that of the ninth feature vector; using the cross-attention layer to process the third intermediate data and the second fused vector to obtain a tenth feature vector; using the activation function to process the tenth feature vector to obtain an eleventh feature vector; averaging the eleventh feature vector to obtain a second average value; using the cross-attention layer to process the multiple eighth feature vectors in the second vector matrix to obtain multiple twelfth feature vectors, which correspond to the multiple eighth feature vectors, wherein the third intermediate data is used as the Q value of the cross-attention layer, and the eighth feature vectors are used as the K value and V value of the cross-attention layer; using the linearization layer to linearize the multiple twelfth feature vectors to obtain multiple second processing results; using the activation function to process the multiple second processing results to obtain multiple weights corresponding to the multiple twelfth feature vectors; performing dot-product multiplication on the multiple twelfth feature vectors and their corresponding weights to obtain multiple thirteenth feature vectors; and performing mathematical operations on the multiple thirteenth feature vectors and the second average value to obtain the output result of the s-th decoding layer.

17. The method according to claim 12, characterized in that determining multiple third semantic identifiers corresponding to the multiple retrieval objects based on the multiple retrieval objects includes: performing feature extraction on the multiple retrieval objects to obtain multiple fourteenth feature vectors; clustering the multiple fourteenth feature vectors to obtain multiple second clusters and multiple second cluster centers, wherein each of the multiple second clusters includes one or more of the fourteenth feature vectors, the multiple second clusters correspond to the multiple second cluster centers, and the global semantic identifier of the third semantic identifier of any retrieval object C among the multiple retrieval objects is the index of the second cluster to which the retrieval object C belongs; determining multiple third residual vectors based on the multiple second clusters and the multiple second cluster centers, wherein the multiple third residual vectors correspond to the multiple retrieval objects, and any third residual vector D among the multiple third residual vectors is the difference between the fourteenth feature vector of the retrieval object corresponding to D and the second cluster center of the second cluster to which that fourteenth feature vector belongs; and determining, based on the multiple third residual vectors, the local semantic identifier of the third semantic identifier of each of the multiple retrieval objects.
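
For illustration, the clustering and residual steps of claim 17 can be sketched in plain Python. The claim does not name a clustering algorithm; the simple k-means (Lloyd's algorithm) with deterministic initialization used here, and all function names, are assumptions. Feature extraction is assumed to have already produced the input vectors.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, centers):
    """Index of the closest cluster center (ties go to the lower index)."""
    return min(range(len(centers)), key=lambda c: sq_dist(vec, centers[c]))

def kmeans(vectors, n_clusters, n_iters=10):
    """Minimal k-means; the first n_clusters vectors seed the centers."""
    centers = [list(v) for v in vectors[:n_clusters]]
    for _ in range(n_iters):
        groups = [[] for _ in range(n_clusters)]
        for v in vectors:
            groups[nearest(v, centers)].append(v)
        for c, members in enumerate(groups):
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return centers

def global_ids_and_residuals(vectors, n_clusters):
    """Global identifier = cluster index; residual = vector - center."""
    centers = kmeans(vectors, n_clusters)
    ids = [nearest(v, centers) for v in vectors]
    residuals = [[x - c for x, c in zip(v, centers[i])]
                 for v, i in zip(vectors, ids)]
    return ids, centers, residuals
```

The residuals returned here are the inputs from which the local semantic identifiers are then derived.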

18. The method according to claim 17, characterized in that the local semantic identifier of each retrieval object includes k tokens, where k is an integer greater than 1, and determining the local semantic identifier of the third semantic identifier of each of the multiple retrieval objects based on the multiple third residual vectors includes: performing dimensionality reduction on the multiple third residual vectors to obtain multiple fourth residual vectors; and performing k processing operations based on each of the multiple fourth residual vectors to obtain the k tokens in the local semantic identifier; wherein the i-th token among the k tokens is determined from the input data of the i-th processing operation; when i = 1, the input data of the i-th processing operation is the fourth residual vector; and when i is greater than 1, the input data of the i-th processing operation is determined based on the input data of the (i-1)-th processing operation and the (i-1)-th token among the k tokens.
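
For illustration, one way to realize the k processing operations of claim 18 is residual quantization, sketched below. This is an assumption: the claim only states that each step's input depends on the previous input and the previous token. The per-step `codebooks`, the nearest-codeword rule, and the subtraction update are hypothetical choices, and the dimensionality reduction step is assumed to have been applied already.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_tokens(fourth_residual, codebooks):
    """Derives k tokens from one reduced residual vector.

    codebooks: list of k codebooks, each a list of codeword vectors.
    Step i picks the nearest codeword in codebook i (the i-th token),
    then subtracts it so the next step quantizes what is left over.
    """
    current = list(fourth_residual)  # input data of the first step
    tokens = []
    for codebook in codebooks:
        tok = min(range(len(codebook)),
                  key=lambda c: sq_dist(current, codebook[c]))
        tokens.append(tok)
        # Input of the next step depends on this input and this token.
        current = [x - y for x, y in zip(current, codebook[tok])]
    return tokens
```

With k codebooks of size n, this scheme can represent up to n to the power k distinct local identifiers per cluster.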

19. The method according to claim 17 or 18, characterized in that, for a first retrieval object and a second retrieval object among the plurality of retrieval objects whose global semantic identifiers are the same and whose local semantic identifiers are the same, the third semantic identifier of the first retrieval object further includes a third identifier and the third semantic identifier of the second retrieval object further includes a fourth identifier, the third identifier and the fourth identifier being used to distinguish the first retrieval object from the second retrieval object.
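
For illustration, the distinguishing identifiers of claim 19 can be sketched as a per-collision counter. The function name and the choice of a running integer are assumptions; for simplicity this sketch appends a suffix to every object (0 when there is no collision), whereas the claim only requires it for colliding objects.

```python
from collections import defaultdict

def add_disambiguators(identifiers):
    """identifiers: list of (global_id, local_tokens) pairs, one per
    retrieval object. Returns (global_id, local_tokens, suffix) triples
    in which objects sharing the same global and local identifiers
    receive different suffixes."""
    counts = defaultdict(int)
    result = []
    for global_id, local_tokens in identifiers:
        key = (global_id, tuple(local_tokens))
        result.append((global_id, tuple(local_tokens), counts[key]))
        counts[key] += 1
    return result
```

Two objects that both map to global identifier 2 and local tokens (1, 4) would thus receive suffixes 0 and 1, keeping their full semantic identifiers unique.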

20. A retrieval device, characterized in that the retrieval device includes units or modules for implementing the method as described in any one of claims 1-9.

21. A training device, characterized in that the training device includes units or modules for implementing the method as described in any one of claims 10-19.

22. An electronic device, characterized in that the electronic device includes a processor and a memory, wherein the memory is used to store program code, and the processor is used to execute the program code to implement the method of any one of claims 1 to 19.

23. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method according to any one of claims 1-19.

24. A computer program product that, when run on a computer, causes the computer to perform the method as described in any one of claims 1-19.