Data retrieval method and apparatus, electronic device, and storage medium

By constructing and optimizing the feature encoding model, the problem of semantic information being ignored in data retrieval was solved, achieving more efficient and accurate data retrieval.

CN116737749B · Active · Publication Date: 2026-06-16 · PING AN TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PING AN TECH (SHENZHEN) CO LTD
Filing Date
2023-04-20
Publication Date
2026-06-16


Abstract

The application relates to the technical field of artificial intelligence, and provides a data retrieval method and device, electronic equipment and a storage medium. Historical sample data is constructed based on historical retrieval records, pseudo historical sample data is generated according to the historical sample data, the historical sample data and the corresponding pseudo historical sample data are input into a same preset feature coding model for coding to obtain feature coding data, the relationship between the historical sample data and the pseudo historical sample data is sufficiently learned, a target feature coding model is obtained when the preset feature coding model is optimized based on multiple feature coding data, the accuracy of the model can be improved, and the accuracy of retrieval is improved. Finally, the target feature coding model is used to obtain the most similar target sample data of the to-be-retrieved data from multiple historical sample data, and the efficiency of retrieval is improved.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, specifically to a data retrieval method, apparatus, electronic device, and storage medium.

Background Technology

[0002] Data retrieval is the process or technique of storing selected, organized, and evaluated data in a certain medium, and retrieving accurate data that can answer questions from a certain dataset according to user needs.

[0003] In the process of realizing this invention, the inventors discovered that most data retrieval methods use vector matching, but vector matching often ignores the semantic information between query data and historical retrieval data. Furthermore, during the query process, the query data needs to be matched with historical retrieval data one by one, resulting in inaccurate retrieval results and low retrieval efficiency.

Summary of the Invention

[0004] In view of the above, it is necessary to propose a data retrieval method, device, electronic device and storage medium that can improve the accuracy and efficiency of data retrieval.

[0005] A first aspect of the present invention provides a data retrieval method, the method comprising:

[0006] Historical sample data is constructed based on historical retrieval records;

[0007] Pseudo-historical sample data is generated based on the historical sample data;

[0008] The feature-encoding data is obtained by encoding based on the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model.

[0009] The preset feature coding model is optimized based on multiple feature coding data to obtain the target feature coding model;

[0010] The target feature encoding model is used to obtain the closest target sample data of the data to be retrieved from multiple historical sample data.

[0011] According to an optional embodiment of the present invention, constructing historical sample data based on historical retrieval records includes:

[0012] Obtain each historical query data and its corresponding historical search data, as well as the click tags of the historical search data, from the historical search records;

[0013] A first triple is constructed from the historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data;

[0014] Each first triple is treated as one historical sample data.

[0015] According to an optional embodiment of the present invention, generating pseudo-historical sample data based on the historical sample data includes:

[0016] The first triple corresponding to each historical sample data is input into the pre-trained language model;

[0017] The pre-trained language model outputs pseudo-historical query data corresponding to each first triple;

[0018] A second triple is generated from each pseudo-historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data;

[0019] Each second triple is used as one pseudo-historical sample data.

[0020] According to an optional embodiment of the present invention, the step of encoding based on the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoded data includes:

[0021] The historical query data in the historical sample data is input into the first encoder of the preset feature encoding model to obtain the first encoding vector;

[0022] The pseudo-historical query data and the corresponding historical retrieval data in the pseudo-historical sample data are input into the second encoder of the preset feature encoding model to obtain the second encoding vector;

[0023] The feature encoding data is obtained based on the first encoding vector and the second encoding vector.

[0024] According to an optional embodiment of the present invention, obtaining the closest target sample data of the data to be retrieved from the plurality of historical sample data using the target feature encoding model includes:

[0025] The target feature encoding model is used to encode the historical sample data to obtain target feature encoded data;

[0026] The target feature encoding model is used to encode the data to be retrieved to obtain retrieval feature encoded data;

[0027] Based on the target feature encoding data and the retrieval feature encoding data, the closest target sample data of the data to be retrieved is obtained from multiple historical sample data.

[0028] According to an optional embodiment of the present invention, the step of encoding the historical sample data using the target feature encoding model to obtain target feature encoded data includes: inputting the historical retrieval data and the corresponding pseudo-historical query data in the historical sample data into the second encoder of the target feature encoding model to obtain the target feature encoded data.

[0029] According to an optional embodiment of the present invention, the step of encoding the data to be retrieved using the target feature encoding model to obtain retrieval feature encoded data includes: inputting the data to be retrieved into the first encoder of the target feature encoding model to obtain the retrieval feature encoded data.

[0030] According to an optional embodiment of the present invention, obtaining the closest target sample data of the data to be retrieved from a plurality of historical sample data based on the target feature encoding data and the retrieval feature encoding data includes:

[0031] Based on a preset hash function, the hash bucket to which the retrieval feature encoded data is mapped is determined;

[0032] Use the target feature encoding data corresponding to the hash bucket as the target retrieval set;

[0033] Calculate the similarity between the retrieval feature encoding data and the target feature encoding data included in the target retrieval set;

[0034] Determine whether the similarity is greater than a preset similarity threshold;

[0035] If the similarity is greater than the preset similarity threshold, then the historical sample data corresponding to the target feature encoding data corresponding to the similarity is determined as the closest target sample data of the data to be retrieved.

[0036] A second aspect of the present invention provides a data retrieval apparatus, the apparatus comprising:

[0037] The building module is used to construct historical sample data based on historical retrieval records;

[0038] The generation module is used to generate pseudo-historical sample data based on the historical sample data;

[0039] The encoding module is used to encode the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoded data.

[0040] The optimization module is used to optimize the preset feature coding model based on multiple feature coding data to obtain the target feature coding model;

[0041] The retrieval module is used to obtain the closest target sample data of the data to be retrieved from multiple historical sample data using the target feature encoding model.

[0042] A third aspect of the present invention provides an electronic device comprising a processor and a memory, wherein the processor is configured to implement the data retrieval method when executing a computer program stored in the memory.

[0043] A fourth aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the data retrieval method when executed by a processor.

[0044] The data retrieval method, apparatus, electronic device, and storage medium provided in this invention construct historical sample data based on historical retrieval records, generate pseudo-historical sample data based on the historical sample data, and then input the historical sample data and the corresponding pseudo-historical sample data into the same preset feature encoding model for encoding to obtain feature encoded data. By fully learning the relationship between the historical sample data and the pseudo-historical sample data, the accuracy of the preset feature encoding model can be improved when optimizing it based on multiple feature encoded data, thereby improving the accuracy of retrieval. Finally, the target feature encoding model is used to obtain the closest target sample data of the data to be retrieved from multiple historical sample data, improving the efficiency of retrieval.

Attached Figure Description

[0045] Figure 1 is a flowchart of the data retrieval method provided in Embodiment 1 of the present invention.

[0046] Figure 2 is an architecture diagram of the data retrieval model provided in an embodiment of the present invention.

[0047] Figure 3 is a structural diagram of the data retrieval device provided in Embodiment 2 of the present invention.

[0048] Figure 4 is a schematic diagram of the structure of the electronic device provided in Embodiment 3 of the present invention.

Detailed Implementation

[0049] To better understand the above-mentioned objects, features, and advantages of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. Unless otherwise specified, the embodiments of the present invention and the features thereof can be combined with each other.

[0050] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

[0051] The data retrieval method provided in this embodiment of the invention is executed by an electronic device, and correspondingly, the data retrieval device operates in the electronic device.

[0052] The embodiments of this invention can standardize data processing based on artificial intelligence technology. Artificial intelligence (AI) refers to the theories, methods, technologies, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.

[0053] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning / deep learning.

[0054] Embodiment 1

[0055] Figure 1 is a flowchart of a data retrieval method provided in Embodiment 1 of the present invention. The data retrieval method specifically includes the following steps. Depending on different needs, the order of the steps in this flowchart can be changed, and some steps can be omitted.

[0056] S11, construct historical sample data based on historical retrieval records.

[0057] Historical search records refer to data records of queries performed in the first database based on user-input search criteria and the resulting search results. Electronic devices can associate and store each user-input search criteria and corresponding search results in a second database. The first database stores a large amount of raw data, while the second database stores historical search records. The search criteria stored in the second database are called historical query data, and the stored search results are called historical search data.

[0058] The electronic device retrieves multiple historical search records from a second database and constructs a historical sample dataset based on these records. Each historical search record is a historical sample data point within the historical sample dataset. The historical sample data can be text (e.g., medical text, news text, academic paper text) or images (e.g., facial images, gesture images, facial expression images), and this invention does not impose any limitations. That is, the method described in this invention can be applied to the retrieval of medical text, news text, academic paper text, facial images, gesture images, facial expression images, etc.

[0059] In an optional implementation, constructing historical sample data based on historical retrieval records includes:

[0060] Obtain each historical query data and its corresponding historical search data, as well as the click tags of the historical search data, from the historical search records;

[0061] A first triple is constructed from the historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data;

[0062] Each first triple is treated as one historical sample data.

[0063] The click tags in the historical search data can include a first tag and a second tag. The first tag can be represented by the number 1, and the second tag can be represented by the number 0. In other embodiments, the first tag can also be represented by the letter A, and the second tag by the letter B. This invention does not impose any limitations on the representation of the first tag and the second tag.

[0064] The first tag indicates that the user clicked on the historical search data when it was displayed as a search result, while the second tag indicates that the user did not click on the historical search data when it was displayed as a search result.

[0065] Let $D$ denote the historical sample dataset, $q_i$ the i-th historical query data in $D$, $d_i$ the historical retrieval data in $D$ that is closest to or matches the i-th historical query data $q_i$, and $c_i \in \{0,1\}$ the click tag of $d_i$. Then $D = \{(q_1, d_1, c_1), \ldots, (q_i, d_i, c_i), \ldots, (q_n, d_n, c_n)\}$, where $n$ is the number of historical sample data in $D$.
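As an illustration only, the construction of the historical sample dataset from search-log rows can be sketched as follows. The record keys (`query`, `result`, `clicked`), the `HistoricalSample` type, and the toy rows are assumptions made for this sketch and do not come from the invention.

```python
from typing import NamedTuple


class HistoricalSample(NamedTuple):
    """One (q_i, d_i, c_i) triple; the type name is hypothetical."""
    query: str      # q_i: historical query data
    document: str   # d_i: historical retrieval data shown for the query
    click: int      # c_i: 1 if the user clicked the result, else 0


def build_historical_samples(records: list[dict]) -> list[HistoricalSample]:
    """Convert log rows (assumed keys: query, result, clicked) into triples."""
    samples = []
    for row in records:
        click = 1 if row["clicked"] else 0
        samples.append(HistoricalSample(row["query"], row["result"], click))
    return samples


# Toy log rows, invented purely for illustration.
records = [
    {"query": "flu symptoms", "result": "Influenza: signs and treatment", "clicked": True},
    {"query": "flu symptoms", "result": "Stock market news", "clicked": False},
]
dataset = build_historical_samples(records)
```

Each element of `dataset` plays the role of one historical sample data, with the click tag encoded as 1 or 0.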

[0066] S12, generate pseudo-historical sample data based on the historical sample data.

[0067] After obtaining the historical sample dataset, the electronic device can generate one pseudo-historical sample data for each historical sample data in the historical sample dataset, combine all the pseudo-historical sample data into a pseudo-historical sample dataset, and then train the feature encoding model based on the historical sample dataset and the pseudo-historical sample dataset.

[0068] In an optional implementation, generating pseudo-historical sample data based on the historical sample data includes:

[0069] The first triple corresponding to each historical sample data is input into the pre-trained language model;

[0070] The pre-trained language model outputs pseudo-historical query data corresponding to each first triple;

[0071] A second triple is generated from each pseudo-historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data;

[0072] Each second triple is used as one pseudo-historical sample data.

[0073] The pre-trained language model can be either a doc2query model or a docT5query model.

[0074] The core idea of the doc2query model is to simultaneously train a judgment model and a generation model using pre-labeled (q, d) pairs in the training set. On one hand, it can use only the encoder to concatenate (q, d) for relevance prediction; on the other hand, it can add a decoder to generate the query. Learning these two parts together lets each strengthen the other, making the encoder stronger. The structure of doc2query is a 6-layer Transformer. The difference between docT5query and doc2query is that docT5query uses a more powerful pre-trained model, T5, to generate expanded terms. The core idea of both is to train a Seq2Seq generation model using given pairs of query data and retrieval data, and then generate pseudo-query data for each retrieval data.

[0075] In specific implementation, the first triple $(q_i, d_i, c_i)$ is input into a pre-trained language model, such as a doc2query model, which generates and outputs the pseudo-historical query data. Here, $q_i$ is the historical query data entered by the user, whereas the pseudo-historical query data is pseudo-data generated and output by the pre-trained language model. Although the historical query data $q_i$ and the pseudo-historical query data differ, they have the same semantics, and the pseudo-historical query data contains the same topic and keywords as the corresponding historical retrieval data $d_i$.
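The generation of second triples can be sketched as below, purely for illustration. The trained doc2query or docT5query model is replaced by a hypothetical callable (`QueryGenerator`), and `toy_generator` is a placeholder that only echoes words from the retrieval data; neither reflects the behavior of a real model.

```python
from typing import Callable, NamedTuple


class Sample(NamedTuple):
    query: str
    document: str
    click: int


# Hypothetical signature: (query, document, click) -> pseudo query.
QueryGenerator = Callable[[str, str, int], str]


def make_pseudo_samples(samples: list[Sample], generate: QueryGenerator) -> list[Sample]:
    """Build one second triple per first triple, reusing d_i and c_i unchanged."""
    pseudo = []
    for s in samples:
        pseudo_query = generate(s.query, s.document, s.click)
        pseudo.append(Sample(pseudo_query, s.document, s.click))
    return pseudo


def toy_generator(query: str, document: str, click: int) -> str:
    """Placeholder for a trained seq2seq model: echoes the document's first three words."""
    return " ".join(document.lower().split()[:3])


history = [Sample("flu symptoms", "Influenza signs and treatment options", 1)]
pseudo_history = make_pseudo_samples(history, toy_generator)
```

Only the query component changes between a first triple and its second triple; the retrieval data and click tag are carried over as described above.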

[0076] S13, use a preset feature encoding model to encode based on the historical sample data and the corresponding pseudo-historical sample data to obtain feature encoded data.

[0077] The electronic device can use Bidirectional Encoder Representations from Transformers (BERT) as the base encoder of the preset feature encoding model, and extract features from the historical sample data and the pseudo-historical sample data through the base encoder to obtain feature data.

[0078] In an optional implementation, the step of encoding based on the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoded data includes:

[0079] The historical query data in the historical sample data is input into the first encoder of the preset feature encoding model to obtain the first encoding vector;

[0080] The pseudo-historical query data and the corresponding historical retrieval data in the pseudo-historical sample data are input into the second encoder of the preset feature encoding model to obtain the second encoding vector;

[0081] The feature encoding data is obtained based on the first encoding vector and the second encoding vector.

[0082] The preset feature encoding model consists of two parts: a first encoder and a second encoder. The first encoder is a query encoder, and the second encoder is a document encoder.

[0083] As shown in Figure 2, QueryEncoder is the query encoder, and DocumentEncoder is the document encoder.

[0084] First, the historical query data $q_i$ is input into the query encoder to obtain the first encoding vector output by the query encoder.

[0085] Secondly, the pseudo-historical query data and the historical retrieval data $d_i$ are input into the document encoder to obtain the second encoding vector output by the document encoder.

[0086] Finally, the inner product of the first encoding vector and the second encoding vector is calculated and used as the feature encoding data.

[0087] The above optional implementation method trains the Query encoder using historical query data and the DocumentEncoder using pseudo-historical query data and historical retrieval data, so that the historical query data and historical retrieval data can fully interact and influence each other, thereby learning feature encoding data that can better represent historical sample data.
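The two-encoder scoring described above can be illustrated with the following minimal sketch. The BERT-based encoders are replaced by a toy hashed bag-of-words function (`embed`), and `DIM`, the CRC32 hashing, and the sample strings are all assumptions of this sketch. In a real model the two encoders would have separate learned parameters, whereas here they share one toy function.

```python
import zlib

DIM = 16  # toy embedding size, chosen arbitrarily


def embed(text: str) -> list[float]:
    """Toy stand-in for a BERT encoder: hashed bag-of-words counts."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    return vec


def query_encoder(query: str) -> list[float]:
    """First encoder: sees only the historical query data."""
    return embed(query)


def document_encoder(pseudo_query: str, document: str) -> list[float]:
    """Second encoder: sees the pseudo-historical query plus the retrieval data."""
    return embed(pseudo_query + " " + document)


def feature_encoding(query: str, pseudo_query: str, document: str) -> float:
    """Inner product of the first and second encoding vectors."""
    q_vec = query_encoder(query)
    d_vec = document_encoder(pseudo_query, document)
    return sum(a * b for a, b in zip(q_vec, d_vec))


score_related = feature_encoding("flu symptoms", "influenza symptoms", "Flu signs and treatment")
score_unrelated = feature_encoding("flu symptoms", "", "")
```

The inner product grows when the query and the document-side input share content, which is the signal the training step then relates to the click tag.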

[0088] S14, optimize the preset feature coding model based on multiple feature coding data to obtain the target feature coding model.

[0089] Each sample data corresponds to one feature encoding data, and multiple sample data correspond to multiple feature encoding data.

[0090] Each feature encoding data and its corresponding click tag are input into the preset feature encoding model for iterative training, and the cross-entropy loss function value is calculated in each iteration. A gradient descent algorithm then optimizes the parameters of the preset feature encoding model based on the cross-entropy loss function value; when training completes, the target feature encoding model is obtained.
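A heavily simplified sketch of this optimization loop follows. It assumes that the inner-product score is mapped to a click probability with a sigmoid, and it learns only a scale `w` and a bias `b` in place of the full encoder parameters; both simplifications, along with the toy scores and the learning rate, are assumptions of this sketch rather than details stated in the invention.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce(p: float, label: int) -> float:
    """Binary cross-entropy for one (probability, click tag) pair."""
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


def train(scores: list[float], labels: list[int], lr: float = 0.1, epochs: int = 200):
    """Fit p = sigmoid(w * score + b) by batch gradient descent on mean cross-entropy."""
    w, b = 0.0, 0.0
    losses = []
    n = len(scores)
    for _ in range(epochs):
        probs = [sigmoid(w * s + b) for s in scores]
        losses.append(sum(bce(p, y) for p, y in zip(probs, labels)) / n)
        # For sigmoid followed by cross-entropy, dLoss/dlogit = p - y.
        grad_w = sum((p - y) * s for p, y, s in zip(probs, labels, scores)) / n
        grad_b = sum(p - y for p, y in zip(probs, labels)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


# Toy inner-product scores and click tags, invented for illustration.
scores = [4.0, 3.0, 0.5, 0.0]
labels = [1, 1, 0, 0]
w, b, losses = train(scores, labels)
```

In a full implementation the gradients would flow back into both encoders, so that clicked pairs receive larger inner products than unclicked ones.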

[0091] S15, using the target feature encoding model, obtain the closest target sample data of the data to be retrieved from multiple historical sample data.

[0092] The closest target sample data of the data to be retrieved refers to target sample data whose similarity to the data to be retrieved is greater than a preset similarity threshold.

[0093] In an optional implementation, obtaining the closest target sample data for the data to be retrieved from the plurality of historical sample data using the target feature encoding model includes:

[0094] The target feature encoding model is used to encode the historical sample data to obtain target feature encoded data;

[0095] The target feature encoding model is used to encode the data to be retrieved to obtain retrieval feature encoded data;

[0096] Based on the target feature encoding data and the retrieval feature encoding data, the closest target sample data of the data to be retrieved is obtained from multiple historical sample data.

[0097] After optimizing and training the preset feature encoding model, a target feature encoding model is obtained. The target feature encoding model is used to encode the historical sample data, and the resulting target feature encoding data can better represent the historical sample data. The target feature encoding model is also used to encode the data to be retrieved, and the resulting retrieval feature encoding data can also better represent the data to be retrieved. Thus, based on the target feature encoding data and the retrieval feature encoding data, target sample data corresponding to the data to be retrieved is retrieved from multiple historical sample data, and the target sample data has the highest similarity to the data to be retrieved.

[0098] The step of encoding the historical sample data using the target feature encoding model to obtain target feature encoded data includes: inputting the historical retrieval data and the corresponding pseudo-historical query data from the historical sample data into the second encoder of the target feature encoding model to obtain the target feature encoded data.

[0099] The step of encoding the data to be retrieved using the target feature encoding model to obtain retrieval feature encoded data includes: inputting the data to be retrieved into the first encoder of the target feature encoding model to obtain the retrieval feature encoded data.

[0100] After obtaining the target feature encoding model, the electronic device can input the pseudo-historical query data corresponding to the historical query data in each historical sample data and the historical retrieval data in that historical sample data into the DocumentEncoder encoder of the target feature encoding model to obtain the target feature encoding data for each historical sample data.

[0101] Here, the data to be retrieved refers to the retrieval conditions input by the user. After receiving an instruction to retrieve the data to be retrieved, the electronic device inputs the data to be retrieved into the QueryEncoder of the target feature encoding model to obtain the retrieval feature encoding data.

[0102] Electronic devices can aggregate all historical sample data and corresponding target feature encoding data to obtain a historical sample target feature encoding dataset, and save the correspondence between historical sample data and corresponding target feature encoding data. This makes it convenient to directly call the historical sample target feature encoding dataset when a retrieval instruction for the data to be retrieved is received in the future, and obtain the most approximate target sample data of the data to be retrieved from the historical sample target feature encoding dataset based on vector matching.

[0103] In the above optional implementation, the DocumentEncoder encoder in the optimized target feature encoding model is used to convert all historical search data into encoding vectors and save them. In actual use, the retrieval data can be obtained from the set of encoding vectors corresponding to the historical search simply by obtaining the encoding vectors of the historical query data, thereby greatly improving the retrieval speed.

[0104] In an optional implementation, obtaining the closest target sample data of the data to be retrieved from a plurality of historical sample data based on the target feature encoding data and the retrieval feature encoding data includes:

[0105] Based on a preset hash function, the hash bucket to which the retrieval feature encoded data is mapped is determined;

[0106] Use the target feature encoding data corresponding to the hash bucket as the target retrieval set;

[0107] Calculate the similarity between the retrieval feature encoding data and the target feature encoding data included in the target retrieval set;

[0108] Determine whether the similarity is greater than a preset similarity threshold;

[0109] If the similarity is greater than the preset similarity threshold, then the historical sample data corresponding to the target feature encoding data corresponding to the similarity is determined as the closest target sample data of the data to be retrieved.

[0110] The preset hash function can be a locality-sensitive hash function.

[0111] After obtaining the target feature encoding data for each historical sample data, the electronic device can use the preset hash function to calculate the hash value of each target feature encoding data and use that hash value as the bucket number of a hash bucket. It then stores the correspondence between the target feature encoding data that share a hash value and the corresponding hash bucket. Identical hash values indicate that the target feature encoding data in the same hash bucket are highly similar, and therefore that the historical sample data corresponding to the same hash bucket are similar.

[0112] Similarly, the hash value of the retrieval feature encoded data is calculated using the preset hash function to determine the hash bucket to which the retrieval feature encoded data is mapped. The target feature encoded data corresponding to the determined hash bucket is used as the target retrieval set, and the similarity between the retrieval feature encoded data and each target feature encoded data included in the target retrieval set is calculated. When the similarity is greater than the preset similarity threshold, the historical sample data corresponding to the similarity is determined as the closest target sample data of the data to be retrieved.

[0113] The similarity can be represented by distance, such as cosine distance, Euclidean distance, Hamming distance, etc., and this embodiment of the invention is not limited thereto. In the above optional implementation, since the target feature encoding data is pre-divided into multiple sets according to hash values, and thus multiple historical sample data are divided into multiple sets according to similarity, when determining the most similar target sample data corresponding to the data to be retrieved, it is only necessary to determine the target set corresponding to the data to be retrieved from the pre-determined sets. Then, the similarity between the data to be retrieved and the historical sample data corresponding to the target feature encoding data in the target set is calculated to determine the similar data corresponding to the data to be retrieved. Therefore, the amount of similarity calculation can be greatly reduced, the efficiency of determining similar data can be improved, and the determination result of similar data can be returned quickly.
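The bucket-then-compare procedure can be sketched as follows. Random-hyperplane hashing is used here as one common locality-sensitive hash for cosine similarity; the invention does not specify which locality-sensitive hash function is used, so this choice, the class name `HyperplaneLSH`, and the toy vectors are assumptions of the sketch.

```python
import math
import random
from collections import defaultdict

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HyperplaneLSH:
    """Random-hyperplane LSH: the bucket id is the sign pattern of k projections."""

    def __init__(self, dim: int, num_planes: int, seed: int = 0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]
        self.buckets: dict[tuple, list[tuple[str, Vector]]] = defaultdict(list)

    def bucket_id(self, vec: Vector) -> tuple:
        return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0.0 for plane in self.planes)

    def add(self, sample_id: str, vec: Vector) -> None:
        """Offline step: store a target feature encoding in its hash bucket."""
        self.buckets[self.bucket_id(vec)].append((sample_id, vec))

    def search(self, query_vec: Vector, threshold: float) -> list[tuple[str, float]]:
        """Online step: score only the query's bucket and keep hits above the threshold."""
        candidates = self.buckets.get(self.bucket_id(query_vec), [])
        hits = [(sid, cosine(query_vec, vec)) for sid, vec in candidates]
        return sorted((h for h in hits if h[1] > threshold), key=lambda h: -h[1])


index = HyperplaneLSH(dim=3, num_planes=4)
index.add("doc_a", [1.0, 0.0, 0.0])
index.add("doc_b", [-1.0, 0.0, 0.0])
results = index.search([2.0, 0.0, 0.0], threshold=0.8)
```

Only the vectors in the query's bucket are compared, which is where the reduction in similarity computations comes from. A single hash table of this kind can miss near neighbors that fall into adjacent buckets, so practical systems often query several tables.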

[0114] In this embodiment of the invention, historical sample data is constructed based on historical retrieval records, and pseudo-historical sample data is generated based on the historical sample data. Then, the historical sample data and the corresponding pseudo-historical sample data are input into the same preset feature encoding model for encoding to obtain feature-encoded data. By fully learning the relationship between the historical sample data and the pseudo-historical sample data, the accuracy of the preset feature encoding model can be improved when optimizing it based on multiple feature-encoded data, thereby improving the accuracy of the retrieval. Finally, the target feature encoding model is used to obtain the closest target sample data for the data to be retrieved from multiple historical sample data sets, improving the efficiency of the retrieval.

[0115] Embodiment 2

[0116] Figure 3 is a structural diagram of the data retrieval device provided in Embodiment 2 of the present invention.

[0117] In some embodiments, the data retrieval device 30 may include a plurality of functional modules composed of computer program segments. The computer programs for each program segment in the data retrieval device 30 may be stored in the memory of an electronic device and executed by at least one processor to perform the data retrieval function (see the description of Figure 1 for details).

[0118] In this embodiment, the data retrieval device 30 can be divided into multiple functional modules according to its functions. These functional modules may include: a construction module 301, a generation module 302, an encoding module 303, an optimization module 304, and a retrieval module 305. The module referred to in this invention is a series of computer program segments that can be executed by at least one processor and perform a fixed function, stored in memory. In this embodiment, the functions of each module will be detailed in subsequent embodiments.

[0119] The construction module 301 is used to construct historical sample data based on historical retrieval records.

[0120] Historical search records refer to data records of queries performed in the first database based on user-input search criteria and the resulting search results. Electronic devices can associate and store each user-input search criteria and corresponding search results in a second database. The first database stores a large amount of raw data, while the second database stores historical search records. The search criteria stored in the second database are called historical query data, and the stored search results are called historical search data.

[0121] The electronic device retrieves multiple historical search records from a second database and constructs a historical sample dataset based on these records. Each historical search record is a historical sample data point within the historical sample dataset. The historical sample data can be text (e.g., medical text, news text, academic paper text) or images (e.g., facial images, gesture images, facial expression images), and this invention does not impose any limitations. That is, the method described in this invention can be applied to the retrieval of medical text, news text, academic paper text, facial images, gesture images, facial expression images, etc.

[0122] In an optional implementation, constructing historical sample data based on historical retrieval records includes:

[0123] Obtain each historical query data and its corresponding historical search data, as well as the click tags of the historical search data, from the historical search records;

[0124] A first triplet is constructed from each historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data;

[0125] Each first triplet is treated as one historical sample data.

[0126] The click tags in the historical search data can include a first tag and a second tag. The first tag can be represented by the number 1, and the second tag can be represented by the number 0. In other embodiments, the first tag can also be represented by the letter A, and the second tag by the letter B. This invention does not impose any limitations on the representation of the first tag and the second tag.

[0127] The first tag indicates that the user clicked on the historical search data when it was displayed as a search result, while the second tag indicates that the user did not click on the historical search data when it was displayed as a search result.

[0128] Assume $D$ represents the historical sample dataset, $q_i$ represents the $i$-th historical query data in $D$, $d_i$ represents the historical search data in $D$ that is closest to or matches the $i$-th historical query data $q_i$, and $c_i \in \{0, 1\}$ is the click tag of $d_i$. The historical sample dataset is then $D = \{(q_1, d_1, c_1), \ldots, (q_i, d_i, c_i), \ldots, (q_n, d_n, c_n)\}$, where $n$ is the number of historical sample data in $D$.
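As a non-limiting illustration, the construction of the historical sample dataset D can be sketched as follows. The `Triplet` class, the `build_historical_samples` function, and the example records are hypothetical names and data introduced only for this sketch; the text above does not prescribe a storage format for the second database.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Triplet:
    """One historical sample (q_i, d_i, c_i)."""
    query: str      # historical query data q_i
    document: str   # historical retrieval data d_i
    click: int      # click tag c_i: 1 = clicked, 0 = not clicked


def build_historical_samples(
    records: Iterable[Tuple[str, str, bool]]
) -> List[Triplet]:
    """Turn (query, result, was_clicked) records into first triplets."""
    return [Triplet(q, d, 1 if clicked else 0) for q, d, clicked in records]


# Hypothetical records, as they might be read from the second database.
records = [
    ("chest pain causes", "Common causes of chest pain include ...", True),
    ("chest pain causes", "Stock market news for today ...", False),
]
D = build_historical_samples(records)
n = len(D)  # number of historical samples in D
```

Here the first tag and the second tag are represented by 1 and 0; any other pair of distinct values would serve equally well.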

[0129] The generation module 302 is used to generate pseudo-historical sample data based on the historical sample data.

[0130] After obtaining the historical sample dataset, the electronic device can generate one pseudo-historical sample data for each historical sample data in the historical sample dataset, combine all the pseudo-historical sample data into a pseudo-historical sample dataset, and then train the feature encoding model based on the historical sample dataset and the pseudo-historical sample dataset.

[0131] In an optional implementation, generating pseudo-historical sample data based on the historical sample data includes:

[0132] The first triplet corresponding to each historical sample data is input into the pre-trained language model;

[0133] The pre-trained language model outputs pseudo-historical query data corresponding to each first triplet;

[0134] A second triplet is generated from each pseudo-historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data;

[0135] The second triplet is used as one pseudo-historical sample data.

[0136] The pre-trained language model can be either the doc2Query model or the docT5Query model.

[0137] The core idea of the doc2Query model is to use the pre-labeled (q, d) pairs in the training set to train a discriminative model and a generative model at the same time. On the one hand, the Encoder alone can take the concatenated (q, d) pair and predict relevance; on the other hand, a Decoder can be added to generate the query. Learning the two parts jointly lets each reinforce the other and strengthens the Encoder. The structure of doc2Query is a 6-layer Transformer. docT5Query differs from doc2Query in that it uses a more powerful pre-trained model, T5, to generate the expansion terms. Both models share the same core idea: train a Seq2Seq generative model on given pairs of query data and retrieval data, and then generate pseudo-query data for each retrieval data.

[0138] In a specific implementation, the first triplet $(q_i, d_i, c_i)$ is input into a pre-trained language model, such as the doc2Query model, which generates and outputs pseudo-historical query data. Here, $q_i$ is the historical query data entered by the user, whereas the pseudo-historical query data is pseudo-data generated and output by the pre-trained language model. Although the historical query data $q_i$ and the pseudo-historical query data are different, they have the same semantics, and the pseudo-historical query data contains the same topic and keywords as the corresponding historical retrieval data $d_i$.
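As a non-limiting illustration, the generation of second triplets can be sketched as follows. The `keyword_query_generator` function is a toy stand-in introduced only for this sketch and is not the doc2Query or docT5Query model; it also conditions only on the retrieval data, which is an assumption. In practice the `generate_query` argument would wrap a trained Seq2Seq model.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, int]  # (query, retrieval data, click tag)


def keyword_query_generator(document: str, max_terms: int = 4) -> str:
    """Toy stand-in for a doc2Query-style model (NOT the real model):
    returns the first few longer words of the document as a pseudo-query."""
    words = [w.strip(".,;:").lower() for w in document.split()]
    keywords = [w for w in words if len(w) > 3]
    return " ".join(keywords[:max_terms])


def make_pseudo_samples(
    samples: List[Triplet],
    generate_query: Callable[[str], str] = keyword_query_generator,
) -> List[Triplet]:
    """Build second triplets: the query is replaced by a generated
    pseudo-query; the retrieval data and click tag are carried over."""
    return [(generate_query(d), d, c) for (_q, d, c) in samples]


samples = [
    ("chest pain causes", "Common causes of chest pain include angina.", 1),
]
pseudo = make_pseudo_samples(samples)
```

Each pseudo-historical sample keeps the retrieval data and the click tag of the historical sample it was derived from, so the two datasets stay aligned index by index.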

[0139] The encoding module 303 is used to encode the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoded data.

[0140] The electronic device can use Bidirectional Encoder Representations from Transformers (BERT) as the base encoder of the preset feature encoding model, and extract features from the historical sample data and the pseudo-historical sample data through the base encoder to obtain the feature encoding data.

[0141] In an optional implementation, the step of encoding based on the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoded data includes:

[0142] The historical query data in the historical sample data is input into the first encoder of the preset feature encoding model to obtain the first encoding vector;

[0143] The pseudo-historical query data and the corresponding historical retrieval data in the pseudo-historical sample data are input into the second encoder of the preset feature encoding model to obtain the second encoding vector;

[0144] The feature encoding data is obtained based on the first encoding vector and the second encoding vector.

[0145] The preset feature encoding model consists of two parts: a first encoder and a second encoder. The first encoder is a query encoder, and the second encoder is a document encoder.

[0146] As shown in Figure 2, QueryEncoder is the Query encoder, and DocumentEncoder is the Document encoder.

[0147] First, the historical query data $q_i$ is input into the Query encoder to obtain the first encoding vector output by the Query encoder.

[0148] Secondly, the pseudo-historical query data and the historical retrieval data $d_i$ are input into the Document encoder to obtain the second encoding vector output by the Document encoder.

[0149] Finally, the inner product of the first encoding vector and the second encoding vector is calculated and used as the feature encoding data.

[0150] The above optional implementation method trains the Query encoder using historical query data and the DocumentEncoder using pseudo-historical query data and historical retrieval data, so that the historical query data and historical retrieval data can fully interact and influence each other, thereby learning feature encoding data that can better represent historical sample data.
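As a non-limiting illustration, the two-encoder arrangement and the inner product can be sketched as follows. The bag-of-words encoders below are toy stand-ins for BERT-based encoders, and all function names and example texts are assumptions made only for this sketch.

```python
from typing import Dict, List


def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign one vector index to every distinct token."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bag_of_words(text: str, vocab: Dict[str, int]) -> List[float]:
    """Token-count vector; a toy replacement for a learned encoder."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def query_encoder(query: str, vocab: Dict[str, int]) -> List[float]:
    """First encoder: encodes the historical query data only."""
    return bag_of_words(query, vocab)


def document_encoder(pseudo_query: str, document: str,
                     vocab: Dict[str, int]) -> List[float]:
    """Second encoder: encodes pseudo-query and retrieval data together."""
    return bag_of_words(pseudo_query + " " + document, vocab)


def inner_product(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


query = "chest pain causes"
pseudo_query = "common causes chest pain"
document = "common causes of chest pain include angina"
other_document = "stock market news today"

vocab = build_vocab([query, pseudo_query, document, other_document])
u = query_encoder(query, vocab)
v_related = document_encoder(pseudo_query, document, vocab)
v_unrelated = document_encoder("stock market news", other_document, vocab)
score_related = inner_product(u, v_related)
score_unrelated = inner_product(u, v_unrelated)
```

In this toy setting the inner product simply counts shared tokens, so the related pair scores higher than the unrelated pair; a trained model would learn this alignment from the click tags instead.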

[0151] The optimization module 304 is used to optimize the preset feature coding model based on multiple feature coding data to obtain the target feature coding model.

[0152] Each sample data corresponds to one feature code data, and multiple sample data correspond to multiple feature code data.

[0153] Each feature encoding data and its corresponding click tag are input into the preset feature encoding model for iterative training, and the value of the cross-entropy loss function is calculated in each iteration. A gradient descent algorithm then updates the parameters of the preset feature encoding model based on the cross-entropy loss value. When training is complete, the target feature encoding model is obtained.
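As a non-limiting illustration, one way to realize the cross-entropy objective is sketched below. The sketch assumes that the inner product is passed through a sigmoid to obtain a click probability, and it treats the two encoding vectors themselves as the trainable parameters in place of encoder weights; both choices, the learning rate, and all names are assumptions made only for this sketch.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(score: float, click: int) -> float:
    """Binary cross-entropy between sigmoid(score) and the click tag."""
    p = min(max(sigmoid(score), 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(click * math.log(p) + (1 - click) * math.log(1.0 - p))


def sgd_step(u: List[float], v: List[float], click: int,
             lr: float = 0.1) -> Tuple[List[float], List[float]]:
    """One gradient descent step on both vectors.

    With score = <u, v>, dLoss/dscore = sigmoid(score) - click,
    so dLoss/du = (sigmoid(score) - click) * v and symmetrically for v.
    """
    score = sum(a * b for a, b in zip(u, v))
    grad = sigmoid(score) - click
    new_u = [a - lr * grad * b for a, b in zip(u, v)]
    new_v = [b - lr * grad * a for a, b in zip(u, v)]
    return new_u, new_v


# Toy training loop on a single clicked (c = 1) sample.
u, v = [0.5, -0.2], [0.1, 0.4]
losses = []
for _ in range(50):
    score = sum(a * b for a, b in zip(u, v))
    losses.append(bce_loss(score, 1))
    u, v = sgd_step(u, v, 1)
```

For a clicked sample the updates push the inner product upward, so the recorded loss falls over the iterations; for an unclicked sample the same rule pushes it downward.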

[0154] The retrieval module 305 is used to obtain the closest target sample data of the data to be retrieved from multiple historical sample data using the target feature encoding model.

[0155] The closest target sample data to the data to be retrieved refers to the target sample data and the data to be retrieved having a similarity greater than a preset similarity threshold.

[0156] In an optional implementation, obtaining the closest target sample data for the data to be retrieved from the plurality of historical sample data using the target feature encoding model includes:

[0157] The target feature encoding model is used to encode the historical sample data to obtain target feature encoded data;

[0158] The target feature encoding model is used to encode the data to be retrieved to obtain retrieval feature encoded data;

[0159] Based on the target feature encoding data and the retrieval feature encoding data, the closest target sample data of the data to be retrieved is obtained from multiple historical sample data.

[0160] After optimizing and training the preset feature encoding model, a target feature encoding model is obtained. The target feature encoding model is used to encode the historical sample data, and the resulting target feature encoding data can better represent the historical sample data. The target feature encoding model is also used to encode the data to be retrieved, and the resulting retrieval feature encoding data can also better represent the data to be retrieved. Thus, based on the target feature encoding data and the retrieval feature encoding data, target sample data corresponding to the data to be retrieved is retrieved from multiple historical sample data, and the target sample data has the highest similarity to the data to be retrieved.

[0161] The step of encoding the historical sample data using the target feature encoding model to obtain target feature encoded data includes: inputting the historical retrieval data and the corresponding pseudo-historical query data from the historical sample data into the second encoder of the target feature encoding model to obtain the target feature encoded data.

[0162] The step of encoding the data to be retrieved using the target feature encoding model to obtain retrieval feature encoded data includes: inputting the data to be retrieved into the first encoder of the target feature encoding model to obtain the retrieval feature encoded data.

[0163] After obtaining the target feature encoding model, the electronic device can input the pseudo-historical query data corresponding to the historical query data in each historical sample data and the historical retrieval data in that historical sample data into the DocumentEncoder encoder of the target feature encoding model to obtain the target feature encoding data for each historical sample data.

[0164] Here, the data to be retrieved refers to the search criteria input by the user. After receiving an instruction to retrieve the data to be retrieved, the electronic device inputs the data to be retrieved into the QueryEncoder of the target feature encoding model to obtain the retrieval feature encoding data.

[0165] Electronic devices can aggregate all historical sample data and corresponding target feature encoding data to obtain a historical sample target feature encoding dataset, and save the correspondence between historical sample data and corresponding target feature encoding data. This makes it convenient to directly call the historical sample target feature encoding dataset when a retrieval instruction for the data to be retrieved is received in the future, and obtain the most approximate target sample data of the data to be retrieved from the historical sample target feature encoding dataset based on vector matching.

[0166] In the above optional implementation, the DocumentEncoder encoder in the optimized target feature encoding model is used to convert all historical search data into encoding vectors and save them. In actual use, the retrieval data can be obtained from the set of encoding vectors corresponding to the historical search simply by obtaining the encoding vectors of the historical query data, thereby greatly improving the retrieval speed.
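As a non-limiting illustration, saving the encoding vectors in advance and matching against them at retrieval time can be sketched as follows. The `EncodedSampleStore` class, the sample identifiers, and the example vectors are hypothetical; in practice the stored vectors would come from the DocumentEncoder and the query vector from the QueryEncoder.

```python
from typing import Dict, List, Tuple


class EncodedSampleStore:
    """Keeps the correspondence between samples and their saved vectors."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def add(self, sample_id: str, vector: List[float]) -> None:
        """Save the target feature encoding data of one historical sample."""
        self._vectors[sample_id] = list(vector)

    def search(self, query_vector: List[float],
               top_k: int = 1) -> List[Tuple[str, float]]:
        """Rank saved samples by inner product with the query vector."""
        scored = [
            (sid, sum(a * b for a, b in zip(query_vector, vec)))
            for sid, vec in self._vectors.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


# Offline: encode every historical sample once and save the result.
store = EncodedSampleStore()
store.add("sample-1", [1.0, 0.0, 0.5])
store.add("sample-2", [0.0, 1.0, 0.0])

# Online: only the data to be retrieved needs to be encoded.
best = store.search([0.9, 0.1, 0.4], top_k=1)
```

Because the historical samples are encoded only once, the cost of a retrieval is reduced to one query encoding plus the vector comparison.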

[0167] In an optional implementation, obtaining the closest target sample data of the data to be retrieved from a plurality of historical sample data based on the target feature encoding data and the retrieval feature encoding data includes:

[0168] Based on a preset hash function, the hash bucket to which the retrieval feature encoded data is mapped is determined;

[0169] Use the target feature encoding data corresponding to the hash bucket as the target retrieval set;

[0170] Calculate the similarity between the retrieval feature encoding data and the target feature encoding data included in the target retrieval set;

[0171] Determine whether the similarity is greater than a preset similarity threshold;

[0172] If the similarity is greater than the preset similarity threshold, then the historical sample data corresponding to the target feature encoding data corresponding to the similarity is determined as the closest target sample data of the data to be retrieved.

[0173] The preset hash function can be a locality-sensitive hash function.

[0174] After obtaining the target feature encoding data for each historical sample data, the electronic device can use the preset hash function to calculate the hash value of each target feature encoding data and use the hash value as the bucket number of a hash bucket. It then stores the correspondence between each hash bucket and the target feature encoding data that share its hash value. Because their hash values are the same, the target feature encoding data in the same hash bucket are highly similar to one another, and so are the historical sample data they correspond to.

[0175] Similarly, the hash value of the retrieval feature encoded data is calculated using the preset hash function to determine the hash bucket to which the retrieval feature encoded data is mapped. The target feature encoded data corresponding to the determined hash bucket is used as the target retrieval set, and the similarity between the retrieval feature encoded data and each target feature encoded data included in the target retrieval set is calculated. When the similarity is greater than the preset similarity threshold, the historical sample data corresponding to the similarity is determined as the closest target sample data of the data to be retrieved.

[0176] The similarity can be represented by distance, such as cosine distance, Euclidean distance, Hamming distance, etc., and this embodiment of the invention is not limited thereto. In the above optional implementation, since the target feature encoding data is pre-divided into multiple sets according to hash values, and thus multiple historical sample data are divided into multiple sets according to similarity, when determining the most similar target sample data corresponding to the data to be retrieved, it is only necessary to determine the target set corresponding to the data to be retrieved from the pre-determined sets. Then, the similarity between the data to be retrieved and the historical sample data corresponding to the target feature encoding data in the target set is calculated to determine the similar data corresponding to the data to be retrieved. Therefore, the amount of similarity calculation can be greatly reduced, the efficiency of determining similar data can be improved, and the determination result of similar data can be returned quickly.
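As a non-limiting illustration, the bucket lookup followed by threshold filtering can be sketched as follows. The sketch assumes random-hyperplane hashing as the locality-sensitive hash function and cosine similarity as the similarity measure; the `HyperplaneLSH` class, the number of hyperplanes, the seed, and the threshold are all assumptions made only for this sketch.

```python
import math
import random
from collections import defaultdict
from typing import DefaultDict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class HyperplaneLSH:
    """Random-hyperplane LSH: one sign bit per hyperplane forms the bucket number."""

    def __init__(self, dim: int, n_planes: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_planes)]
        self.buckets: DefaultDict[int, List[Tuple[str, List[float]]]] = defaultdict(list)

    def bucket_id(self, vec: List[float]) -> int:
        bits = 0
        for plane in self.planes:
            dot = sum(a * b for a, b in zip(plane, vec))
            bits = (bits << 1) | (1 if dot >= 0 else 0)
        return bits

    def add(self, sample_id: str, vec: List[float]) -> None:
        """Store a target feature encoding vector in its hash bucket."""
        self.buckets[self.bucket_id(vec)].append((sample_id, list(vec)))

    def query(self, vec: List[float], threshold: float) -> List[str]:
        """Compare only against the bucket the query maps to,
        then keep the samples whose similarity exceeds the threshold."""
        candidates = self.buckets.get(self.bucket_id(vec), [])
        return [sid for sid, v in candidates if cosine(vec, v) > threshold]


index = HyperplaneLSH(dim=3)
index.add("a", [1.0, 2.0, 3.0])
index.add("b", [-1.0, -2.0, -3.0])   # opposite direction to "a"
index.add("c", [3.0, 0.0, -1.0])     # orthogonal to "a"
matches = index.query([2.0, 4.0, 6.0], threshold=0.8)
```

Only the vectors in one bucket are compared with the query, which is where the reduction in similarity calculations comes from. A single hash table can miss near neighbors that fall just across a hyperplane, so practical systems often query several tables.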

[0177] In this embodiment of the invention, historical sample data is constructed based on historical retrieval records, and pseudo-historical sample data is generated based on the historical sample data. Then, the historical sample data and the corresponding pseudo-historical sample data are input into the same preset feature encoding model for encoding to obtain feature-encoded data. By fully learning the relationship between the historical sample data and the pseudo-historical sample data, the accuracy of the preset feature encoding model can be improved when optimizing it based on multiple feature-encoded data, thereby improving the accuracy of the retrieval. Finally, the target feature encoding model is used to obtain the closest target sample data for the data to be retrieved from multiple historical sample data sets, improving the efficiency of the retrieval.

[0178] Embodiment 3

[0179] This embodiment provides a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the steps described in the data retrieval method embodiment above, for example, steps S11-S15 shown in Figure 1:

[0180] S11, Construct historical sample data based on historical retrieval records;

[0181] S12, generate pseudo-historical sample data based on the historical sample data;

[0182] S13, use a preset feature coding model to encode based on the historical sample data and the corresponding pseudo-historical sample data to obtain feature coding data;

[0183] S14, optimize the preset feature coding model based on multiple feature coding data to obtain the target feature coding model;

[0184] S15, using the target feature encoding model, obtain the closest target sample data of the data to be retrieved from multiple historical sample data.

[0185] Alternatively, when the computer program is executed by the processor, it implements the functions of each module / unit in the above-described device embodiments, for example, modules 301-305 shown in Figure 3:

[0186] The construction module 301 is used to construct historical sample data based on historical retrieval records;

[0187] The generation module 302 is used to generate pseudo-historical sample data based on the historical sample data.

[0188] The encoding module 303 is used to encode the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoded data.

[0189] The optimization module 304 is used to optimize the preset feature coding model based on multiple feature coding data to obtain a target feature coding model;

[0190] The retrieval module 305 is used to obtain the closest target sample data of the data to be retrieved from multiple historical sample data using the target feature encoding model.

[0191] Embodiment 4

[0192] Figure 4 is a structural schematic of an electronic device provided in Embodiment 4 of the present invention. In a preferred embodiment of the present invention, the electronic device 4 includes a memory 41, at least one processor 42, at least one communication bus 43, and a transceiver 44.

[0193] Those skilled in the art should understand that the structure of the electronic device shown in Figure 4 does not constitute a limitation of the embodiments of the present invention. It can be a bus structure or a star structure. The electronic device 4 may also include more or fewer hardware or software components than shown, or a different arrangement of components.

[0194] In some embodiments, the electronic device 4 is a device capable of automatically performing numerical calculations and / or information processing according to pre-set or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital processors, and embedded devices. The electronic device 4 may also include client devices, including, but not limited to, any electronic product capable of human-computer interaction with a user via a keyboard, mouse, remote control, touchpad, or voice control device, such as personal computers, tablet computers, smartphones, and digital cameras.

[0195] The electronic device 4 described herein is merely an example. Other existing or future electronic products that are adaptable to this invention should also be included within the scope of protection of this invention and are incorporated herein by reference.

[0196] In some embodiments, the memory 41 stores a computer program that, when executed by the at least one processor 42, implements all or part of the steps in the data retrieval method described above. The memory 41 includes a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), a one-time programmable read-only memory (OTPROM), an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.

[0197] Furthermore, the computer-readable storage medium may primarily include a program storage area and a data storage area, wherein the program storage area may store the operating system, at least one application program required for a function, etc.; and the data storage area may store data created based on the use of blockchain nodes, etc.

[0198] The blockchain referred to in this invention is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked together using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.

[0199] In some embodiments, the at least one processor 42 is the control unit of the electronic device 4, connecting various components of the electronic device 4 via various interfaces and lines. It executes programs or modules stored in the memory 41 and calls data stored in the memory 41 to perform various functions and process data. For example, when the at least one processor 42 executes a computer program stored in the memory, it implements all or part of the steps of the data retrieval method described in this embodiment of the invention; or it implements all or part of the functions of the data retrieval device. The at least one processor 42 may be composed of integrated circuits, such as a single-packaged integrated circuit or multiple integrated circuits with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips.

[0200] In some embodiments, the at least one communication bus 43 is configured to enable communication between the memory 41 and the at least one processor 42, etc.

[0201] Although not shown, the electronic device 4 may also include a power supply (such as a battery) to power the various components. Preferably, the power supply can be logically connected to the at least one processor 42 via a power management device, thereby enabling functions such as charging, discharging, and power consumption management. The power supply may also include one or more DC or AC power sources, a recharging device, a power fault detection circuit, a power converter or inverter, a power status indicator, or any other components. The electronic device 4 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described in detail here.

[0202] The integrated unit implemented as a software functional module described above can be stored in a computer-readable storage medium. This software functional module, stored in a storage medium, includes several instructions to cause a computer device (which may be a personal computer, electronic device, or network device, etc.) or processor to execute portions of the methods described in the various embodiments of the present invention.

[0203] In the several embodiments provided by this invention, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.

[0204] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0205] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.

[0206] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the invention. Therefore, the embodiments should be considered illustrative and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims. Furthermore, it is clear that the word "comprising" does not exclude other elements, and the singular does not exclude the plural. Multiple elements or devices recited in the specification may also be implemented by a single element or device in software or hardware. The terms "first," "second," etc., are used to indicate names and do not indicate any particular order.

[0207] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A data retrieval method, characterized in that the method includes:
constructing historical sample data based on historical search records, including: obtaining each historical query data, the corresponding historical search data, and the click tag of the historical search data from the historical search records; constructing a first triplet based on the historical query data, the corresponding historical search data, and the click tag of the historical search data; and treating each first triplet as one historical sample data;
generating pseudo-historical sample data based on the historical sample data, including: inputting the first triplet corresponding to the historical sample data into a pre-trained language model; outputting, through the pre-trained language model, pseudo-historical query data corresponding to each first triplet; generating a second triplet based on each pseudo-historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data; and using the second triplet as one pseudo-historical sample data;
encoding the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoding data, including: inputting the historical query data in the historical sample data into a first encoder of the preset feature encoding model to obtain a first encoding vector; inputting the pseudo-historical query data and the corresponding historical retrieval data in the pseudo-historical sample data into a second encoder of the preset feature encoding model to obtain a second encoding vector; and obtaining the feature encoding data based on the first encoding vector and the second encoding vector;
optimizing the preset feature encoding model based on multiple feature encoding data to obtain a target feature encoding model; and
using the target feature encoding model to obtain the closest target sample data of the data to be retrieved from multiple historical sample data.

2. The data retrieval method as described in claim 1, characterized in that the step of using the target feature encoding model to obtain the closest target sample data of the data to be retrieved from multiple historical sample data includes:
encoding the historical sample data using the target feature encoding model to obtain target feature encoding data;
encoding the data to be retrieved using the target feature encoding model to obtain retrieval feature encoding data; and
obtaining, based on the target feature encoding data and the retrieval feature encoding data, the closest target sample data of the data to be retrieved from multiple historical sample data.

3. The data retrieval method as described in claim 2, characterized in that the step of encoding the historical sample data using the target feature encoding model to obtain target feature encoding data includes:
inputting the historical retrieval data and the corresponding pseudo-historical query data in the historical sample data into the second encoder of the target feature encoding model to obtain the target feature encoding data;
and the step of encoding the data to be retrieved using the target feature encoding model to obtain retrieval feature encoding data includes:
inputting the data to be retrieved into the first encoder of the target feature encoding model to obtain the retrieval feature encoding data.

4. The data retrieval method as described in claim 2 or 3, characterized in that the step of obtaining the closest target sample data of the data to be retrieved from multiple historical sample data based on the target feature encoding data and the retrieval feature encoding data includes:
determining, based on a preset hash function, the hash bucket to which the retrieval feature encoding data is mapped;
using the target feature encoding data corresponding to the hash bucket as a target retrieval set;
calculating the similarity between the retrieval feature encoding data and the target feature encoding data included in the target retrieval set;
determining whether the similarity is greater than a preset similarity threshold; and
if the similarity is greater than the preset similarity threshold, determining the historical sample data corresponding to the target feature encoding data corresponding to the similarity as the closest target sample data of the data to be retrieved.

5. A data retrieval device, characterized in that the device includes:
a construction module, used to construct historical sample data based on historical search records, including: obtaining each historical query data, the corresponding historical search data, and the click tag of the historical search data from the historical search records; constructing a first triplet based on the historical query data, the corresponding historical search data, and the click tag of the historical search data; and treating each first triplet as one historical sample data;
a generation module, used to generate pseudo-historical sample data based on the historical sample data, including: inputting the first triplet corresponding to the historical sample data into a pre-trained language model; outputting, through the pre-trained language model, pseudo-historical query data corresponding to each first triplet; generating a second triplet based on each pseudo-historical query data, the corresponding historical retrieval data, and the click tag of the historical retrieval data; and using the second triplet as one pseudo-historical sample data;
an encoding module, used to encode the historical sample data and the corresponding pseudo-historical sample data using a preset feature encoding model to obtain feature encoding data, including: inputting the historical query data in the historical sample data into a first encoder of the preset feature encoding model to obtain a first encoding vector; inputting the pseudo-historical query data and the corresponding historical retrieval data in the pseudo-historical sample data into a second encoder of the preset feature encoding model to obtain a second encoding vector; and obtaining the feature encoding data based on the first encoding vector and the second encoding vector;
an optimization module, used to optimize the preset feature encoding model based on multiple feature encoding data to obtain a target feature encoding model; and
a retrieval module, used to obtain the closest target sample data of the data to be retrieved from multiple historical sample data using the target feature encoding model.

6. An electronic device, characterized in that the electronic device includes a processor and a memory, wherein the processor is configured to implement the data retrieval method as described in any one of claims 1 to 4 when executing a computer program stored in the memory.

7. A computer-readable storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it implements the data retrieval method as described in any one of claims 1 to 4.