Model data processing method and device, computer device, readable storage medium and program product

By constructing contrastive learning sample data and fine-tuning a text retrieval model on it, the method addresses the insufficient capture of semantic information in traditional text retrieval methods and improves retrieval accuracy and efficiency.

CN122309694A · Pending · Publication Date: 2026-06-30 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2024-12-31
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Traditional text retrieval methods cannot effectively capture the semantic information of the entire text through average word vectors, resulting in insufficient retrieval accuracy.

Method used

First contrastive learning sample data consisting of a query text, a positive sample document text, and a list of negative sample document texts is constructed. An initial text vector retrieval model is trained on this data through contrastive learning to obtain a first text retrieval model. The first text retrieval model is then fine-tuned on the first contrastive learning sample data together with second contrastive learning sample data, built from the query text and its similar and dissimilar query texts, to obtain a second text retrieval model.

Benefits of technology

It improves the accuracy and efficiency of text retrieval, enhances the model's robustness to query text, and is able to better understand and capture the similarities and differences between different query texts.

✦ Generated by Eureka AI based on patent content.


Abstract

This application relates to a model data processing method, apparatus, computer device, computer-readable storage medium, and computer program product. The method includes: acquiring first contrastive learning sample data consisting of a query text, a list of positive sample document texts, and a list of negative sample document texts; performing sample generation processing based on the query text to construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts; training an initial text vector retrieval model based on the first contrastive learning sample data to obtain a first text retrieval model; and fine-tuning the first text retrieval model based on the first and second contrastive learning sample data to obtain a second text retrieval model. The second text retrieval model in this application can better understand and capture the similarities and differences between different query texts, thereby improving the robustness of the second text retrieval model to query texts.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a model data processing method, apparatus, computer equipment, computer-readable storage medium, and computer program product.

Background Technology

[0002] With the development of computer technology, Natural Language Processing (NLP) has emerged. NLP is an important research direction in the field of artificial intelligence, integrating knowledge from multiple disciplines such as linguistics, computer science, machine learning, mathematics, and cognitive psychology. It is an interdisciplinary field combining computer science, artificial intelligence, and linguistics, encompassing two main aspects: natural language understanding and natural language generation. Its research content includes multiple levels such as characters, words, phrases, sentences, paragraphs, and texts, serving as a bridge between machine language and human language. It aims to enable machines to understand, interpret, and generate human language, achieving effective communication between humans and machines, and enabling computers to perform tasks such as language translation, sentiment analysis, and text summarization. Furthermore, NLP-based retrieval is an effective method for text retrieval.

[0003] Traditional text retrieval techniques typically employ pre-trained word vector representation models (such as Word2Vec and GloVe) to represent each word in the input text as a word vector. The average word vector is then calculated as the vector representation of the entire text, and finally, the cosine similarity between the vectors is calculated. This allows the retrieval of target content that is similar to the input text. However, this approach focuses on the vector representation of individual words, and the text vector obtained through averaging word vectors cannot effectively capture the semantic information of the entire text. Therefore, it is difficult to guarantee the accuracy of text retrieval.

Summary of the Invention

[0004] Therefore, it is necessary to provide a model data processing method, apparatus, computer equipment, computer-readable storage medium, and computer program product that can improve the accuracy of text retrieval in response to the above-mentioned technical problems.

[0005] In a first aspect, this application provides a model data processing method, including:

[0006] Obtain first contrastive learning sample data consisting of query text, positive sample document text, and a list of negative sample document text, wherein the query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document text in the negative sample document text list;

[0007] Based on the query text, sample generation processing is performed to construct a second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0008] The initial text vector retrieval model is trained based on the first contrastive learning sample data to obtain the first text retrieval model.

[0009] Based on the first and second contrastive learning sample data, the first text retrieval model is fine-tuned and trained to obtain a second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0010] In a second aspect, this application also provides a model data processing apparatus, comprising:

[0011] The first sample construction module is used to obtain first contrastive learning sample data consisting of query text, positive sample document text, and a list of negative sample document text. The query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document text in the negative sample document text list.

[0012] The second sample construction module is used to perform sample generation processing based on the query text, and construct a second contrastive learning sample data consisting of the query text, similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0013] The first model training module is used to train the initial text vector retrieval model based on the first contrastive learning sample data to obtain the first text retrieval model.

[0014] The second model training module is used to fine-tune the first text retrieval model based on the first contrastive learning sample data and the second contrastive learning sample data to obtain a second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0015] In a third aspect, this application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:

[0016] Obtain first contrastive learning sample data consisting of query text, positive sample document text, and a list of negative sample document text, wherein the query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document text in the negative sample document text list;

[0017] Based on the query text, sample generation processing is performed to construct a second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0018] The initial text vector retrieval model is trained based on the first contrastive learning sample data to obtain the first text retrieval model.

[0019] Based on the first and second contrastive learning sample data, the first text retrieval model is fine-tuned and trained to obtain a second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0020] In a fourth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the following steps:

[0021] Obtain first contrastive learning sample data consisting of query text, positive sample document text, and a list of negative sample document text, wherein the query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document text in the negative sample document text list;

[0022] Based on the query text, sample generation processing is performed to construct a second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0023] The initial text vector retrieval model is trained based on the first contrastive learning sample data to obtain the first text retrieval model.

[0024] Based on the first and second contrastive learning sample data, the first text retrieval model is fine-tuned and trained to obtain a second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0025] In a fifth aspect, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps:

[0026] Obtain first contrastive learning sample data consisting of query text, positive sample document text, and a list of negative sample document text, wherein the query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document text in the negative sample document text list;

[0027] Based on the query text, sample generation processing is performed to construct a second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0028] The initial text vector retrieval model is trained based on the first contrastive learning sample data to obtain the first text retrieval model.

[0029] Based on the first and second contrastive learning sample data, the first text retrieval model is fine-tuned and trained to obtain a second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0030] The aforementioned model data processing method, apparatus, computer equipment, computer-readable storage medium, and computer program product acquire first contrastive learning sample data consisting of query text, a list of positive sample document text, and a list of negative sample document text. Using the first contrastive learning sample data, the semantic association and semantic interaction between query text and document text can be modeled. Based on the query text, sample generation processing is performed to construct second contrastive learning sample data consisting of query text, a list of similar query text, and a list of dissimilar query text. Using the second contrastive learning sample data, it is possible to learn how to map query texts expressing the same or similar intent to nearby positions in the vector space. Then, based on the first contrastive learning sample data, an initial text vector retrieval model is trained to obtain a first text retrieval model. This learns the basic association between query text and document text. Based on the first and second contrastive learning sample data, the first text retrieval model is fine-tuned to obtain a second text retrieval model. Fine-tuning training is used to learn the relationships between similar query texts. This application uses first contrastive learning sample data to learn the basic correlation between query text and document text, and then combines it with second contrastive learning sample data to achieve contrastive learning between similar query texts. By combining the two contrastive learning methods, the second text retrieval model can better understand and capture the similarities and differences between different query texts, thereby improving the robustness of the second text retrieval model to query texts, and further improving the accuracy and efficiency of text retrieval based on the second text retrieval model.

Description of the Drawings

[0031] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is an application environment diagram of the model data processing method in one embodiment;

[0033] Figure 2 is a flowchart illustrating a model data processing method in one embodiment;

[0034] Figure 3 is a schematic diagram of the prompt text in one embodiment;

[0035] Figure 4 is a schematic diagram illustrating the contrastive learning process of query text and document text in one embodiment;

[0036] Figure 5 is a schematic diagram illustrating the contrastive learning process of query text and similar query text in one embodiment;

[0037] Figure 6 is a schematic diagram illustrating the determination of similarity between query text and documents in one embodiment;

[0038] Figure 7 is a flowchart illustrating the model data processing method in another embodiment;

[0039] Figure 8 is a structural block diagram of a model data processing device in one embodiment;

[0040] Figure 9 is a structural block diagram of the model data processing device in another embodiment;

[0041] Figure 10 is an internal structural diagram of a computer device in one embodiment.

Detailed Description

[0042] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0043] The model data processing method provided in the embodiments of this application can be applied in the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process. The data storage system can be integrated into server 104 or placed on a cloud or other network server. When a user on terminal 102 wants to train a text retrieval model applicable to a specific domain, they can submit a corresponding request to server 104. After receiving the request, server 104 collects various types of text information in the relevant domain and then performs model training. Server 104 acquires first contrastive learning sample data consisting of query text, a list of positive sample document texts, and a list of negative sample document texts. The query text has semantic relevance to the positive sample document texts, but no semantic relevance to the negative sample document texts in the negative sample document text list. Based on the query text, sample generation processing is performed to construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, but not semantically similar to the dissimilar query texts in the dissimilar query text list. Based on the first contrastive learning sample data, an initial text vector retrieval model is trained to obtain a first text retrieval model. Based on the first and second contrastive learning sample data, the first text retrieval model is fine-tuned to obtain a second text retrieval model. Users can then use the second text retrieval model to query target documents. Terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, and portable wearable devices. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, projection devices, etc. Portable wearable devices can include smartwatches, smart bracelets, head-mounted devices, etc. Head-mounted devices can be virtual reality (VR) devices, augmented reality (AR) devices, smart glasses, etc. Server 104 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing cloud computing services.

[0044] In one exemplary embodiment, as shown in Figure 2, a model data processing method is provided. The method is described taking its application to server 104 in Figure 1 as an example, and includes the following steps 201 to 207:

[0045] Step 201: Obtain the first contrastive learning sample data consisting of the query text, positive sample document text, and negative sample document text list. The query text has semantic correlation with the positive sample document text, but the query text does not have semantic correlation with the negative sample document text in the negative sample document text list.

[0046] In this context, the query text, also known as the query term, represents the content specified in the user's query request. Positive sample document texts are the document texts (passages) that satisfy the query text's intent, and these passages have a semantic relationship with the query text. The negative sample document text list is a list file composed of negative sample document texts, which are document texts that do not satisfy the query text's intent, and these passages have no semantic relationship with the query text. The first contrastive learning sample data consists of triples composed of the above three types of data. This application trains the text retrieval model using contrastive learning. Each training data point in the contrastive learning is a triple, namely (query text, positive sample document text, list of negative sample document texts). Each query text corresponds to one positive sample document text and multiple negative sample document texts.

[0047] For example, when a user on terminal 102 wants to train a text retrieval model and use the trained model to retrieve specified document content, they can submit a corresponding model training request to server 104, which then carries out the training. The model training request can specify the base model and the base data for model training. The text retrieval model is essentially a dense vector retrieval model. The essence of dense vector retrieval is to learn vector representations of text in a dense latent semantic space and to model the semantic association and semantic interaction between query text and document text, so that a relevance score between the query text and the document text can be calculated. This application constructs the model training data for contrastive learning in order to train the text retrieval model. Contrastive learning refers to identifying relevant positive sample text and distinguishing it from irrelevant negative sample text. Positive sample pairs are similar or related data pairs, while negative sample pairs are dissimilar or irrelevant data pairs. The basic idea of contrastive learning is to pull similar, related samples (positive samples) closer together and push dissimilar, irrelevant samples (negative samples) further apart. The goal of a vector retrieval model is to map similar samples to nearby positions in the vector space and dissimilar samples to distant positions. Therefore, when constructing training data, first contrastive learning sample data consisting of a query text, a positive sample document text, and a list of negative sample document texts is obtained. Positive and negative sample pairs can then be extracted from the first contrastive learning sample data to construct the model training data. In one embodiment, the first contrastive learning sample data can be obtained from datasets such as T2Ranking and mMARCO. An example of sample data is shown below:

[0048] Query text: How does snowmelt affect the urban heat island effect?

[0049] Positive sample document text: Melting snow and ice exacerbate the urban heat island effect. As snow and ice cover decreases, the amount of solar radiation absorbed by the ground increases, leading to higher urban temperatures and thus worsening the urban heat island effect.

[0050] List of negative sample document texts:

[0051] Negative Sample 1: Analysis of the impact of global warming on agricultural production. Global warming may lead to a shortened crop growth cycle and reduced crop yields, thereby affecting food security.

[0052] Negative Sample 2: The urban heat island effect refers to the phenomenon where cities become "high-temperature" due to factors such as excessive artificial heat generation, high heat storage bodies like buildings and roads, and reduced green space. The temperature in the city is significantly higher than in the surrounding suburbs.

[0053] Negative Sample 3: The main factors contributing to the urban heat island effect include urban underlying surface, artificial heat sources, water and air influences, air pollution, reduction of green space, and population migration.
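
The triple in the example above can be held in a small record, from which the positive pair and the negative pairs are read off directly. The following is a minimal sketch only: the class name `QueryPassageSample` and the helper `to_pairs` are illustrative names that do not come from this application, and the document texts are shortened.

```python
from dataclasses import dataclass
from typing import List, Tuple

Pair = Tuple[str, str]


@dataclass
class QueryPassageSample:
    """One first-stage sample: (query, positive document, negative documents)."""
    query: str
    positive: str
    negatives: List[str]


def to_pairs(sample: QueryPassageSample) -> Tuple[Pair, List[Pair]]:
    """Split a triple into one positive pair and several negative pairs."""
    positive_pair = (sample.query, sample.positive)
    negative_pairs = [(sample.query, doc) for doc in sample.negatives]
    return positive_pair, negative_pairs


# Document texts are shortened versions of the example above.
sample = QueryPassageSample(
    query="How does snowmelt affect the urban heat island effect?",
    positive="Melting snow and ice exacerbate the urban heat island effect.",
    negatives=[
        "Analysis of the impact of global warming on agricultural production.",
        "The urban heat island effect refers to cities being hotter than suburbs.",
        "The main factors contributing to the urban heat island effect.",
    ],
)
positive_pair, negative_pairs = to_pairs(sample)
print(positive_pair[0])
print(len(negative_pairs))
```

Each query yields exactly one positive pair and as many negative pairs as there are entries in the negative list, which matches the one-positive, multiple-negative layout described in paragraph [0046].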

[0054] Step 203: Based on the query text, perform sample generation processing to construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0055] In this context, a similar query text is another query text that is semantically similar to the input query text, while a dissimilar query text is another query text that is semantically dissimilar to the input query text. Sample generation involves constructing similar text based on the query text, which can be achieved through keyword replacement or a large language model. For example, after identifying the keywords in the query text, these keywords are replaced with corresponding similar words to construct a similar query text. A dissimilar query text can be taken directly from the query text of another first contrastive learning sample and used as a dissimilar query text for the current query text.
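
As a rough illustration of the keyword-replacement route and of drawing dissimilar queries from other samples, consider the sketch below. The synonym table, the function names, and the fixed random seed are all assumptions made for the example; the application does not specify how keywords are identified or how other samples are chosen.

```python
import random
from typing import Dict, List


def replace_keywords(query: str, synonyms: Dict[str, str]) -> str:
    """Build a similar query by swapping known keywords for similar words."""
    words = query.split()
    return " ".join(synonyms.get(word.lower(), word) for word in words)


def sample_dissimilar_queries(
    current: str, other_queries: List[str], k: int, seed: int = 0
) -> List[str]:
    """Draw up to k queries that belong to other first-stage samples."""
    pool = [q for q in other_queries if q != current]
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return rng.sample(pool, min(k, len(pool)))


query = "How to cook pasta?"
# Hypothetical synonym table; a real system would need a proper lexicon.
similar_query = replace_keywords(query, {"cook": "prepare"})

other_queries = [
    "How to bake pizza?",
    "How to cook pasta?",
    "Why is football the world's number one sport?",
]
dissimilar_queries = sample_dissimilar_queries(query, other_queries, k=2)
print(similar_query)
print(dissimilar_queries)
```

Splitting on whitespace means a keyword attached to punctuation (for example `pasta?`) is not matched, so a production version would need proper tokenization. Queries drawn from other samples are only assumed to be dissimilar; nothing in this sketch checks that.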

[0056] For example, the applicant found that existing vector retrieval models are quite sensitive to the wording of query text. Slight modifications to the wording can lead to inaccurate or inconsistent search results, although similar query texts should point to the same search result. For instance, a similar query text to "How to cook pasta?" could be "What are the methods for cooking pasta?" Although the wording differs, the search intent is the same. A vector retrieval model trained using only query-passage contrastive learning fails to recognize the similarity between these wordings: despite the shared search intent, the model does not map them to similar vector representations. A vector retrieval model sensitive to query wording might return the correct search result for the original query text, but return less relevant content, such as the history or ingredients of pasta, for modified similar query texts.

[0057] To this end, the applicant uses query-query contrastive learning to enable the vector retrieval model to better understand and capture the similarities and differences between different query texts, thereby improving the robustness of the vector retrieval model to query texts. Therefore, it is also necessary to construct training data for query-query contrastive learning. A set of similar query text pairs must be collected or generated as positive samples, and a set of dissimilar query text pairs as negative samples. Similar query text pairs can be different expressions of the same or similar intent, such as "How to cook pasta?" and "What are the methods for cooking pasta?"; dissimilar query text pairs are query texts expressing different intents, such as "How to cook pasta?" and "How to bake pizza?". To construct the training data for query-query contrastive learning, sample generation processing can be performed based on the first contrastive learning sample data to construct second contrastive learning sample data consisting of the query text, similar query texts, and a list of dissimilar query texts. For example, the semantic understanding and in-context learning capabilities of large language models (such as ChatGPT and GPT-4) can be leveraged to rewrite the original query text by constructing appropriate prompt templates. The rewritten similar query text is similar to the original query text, has the same retrieval intent, and is related to the positive sample document text. For example, for the query text mentioned above, the second contrastive learning sample data obtained after sample generation processing is shown below:

[0058] Query text: How does snowmelt affect the urban heat island effect?

[0059] Similar query text: What impact does snowmelt have on the urban heat island effect?

[0060] List of dissimilar query texts:

[0061] Negative Sample 1: Why is football the world's number one sport?

[0062] Negative Sample 2: Why is plaster particularly suitable as an interior decoration material?

[0063] Negative Sample 3: Why do my gums bleed when I brush my teeth after a filling?
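
The rewriting route described in paragraph [0057] can be sketched as a prompt template plus a call to a language model. Everything specific here is an assumption: the prompt wording is hypothetical (the prompt text of Figure 3 is not reproduced in this section), `rewrite_fn` stands for whatever model call is actually used, and `fake_llm` is an offline stand-in that simply returns the similar query from the example above.

```python
from typing import Callable, List, Tuple

# Hypothetical prompt wording, not the prompt text shown in Figure 3.
REWRITE_PROMPT = (
    "Rewrite the following query so that the wording changes but the "
    "retrieval intent stays the same.\n"
    "Query: {query}\n"
    "Rewritten query:"
)


def build_rewrite_prompt(query: str) -> str:
    """Fill the prompt template with the original query text."""
    return REWRITE_PROMPT.format(query=query)


def build_query_query_sample(
    query: str,
    rewrite_fn: Callable[[str], str],
    dissimilar_queries: List[str],
) -> Tuple[str, str, List[str]]:
    """Assemble (query, similar query, dissimilar queries) for one sample."""
    similar = rewrite_fn(build_rewrite_prompt(query)).strip()
    return query, similar, list(dissimilar_queries)


def fake_llm(prompt: str) -> str:
    # Stand-in for a large language model call so the example runs offline.
    return "What impact does snowmelt have on the urban heat island effect?"


second_sample = build_query_query_sample(
    "How does snowmelt affect the urban heat island effect?",
    fake_llm,
    [
        "Why is football the world's number one sport?",
        "Why is plaster particularly suitable as an interior decoration material?",
    ],
)
print(second_sample[1])
```

Passing the model call in as a function keeps the sample-construction logic independent of any particular model or client library.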

[0064] Step 205: Train the initial text vector retrieval model based on the first contrastive learning sample data to obtain the first text retrieval model.

[0065] The initial text vector retrieval model represents the initial state of the model, which can specifically be a dual-encoder model. This is because dense vector retrieval essentially learns the vector representation of text in a dense latent semantic space, modeling the semantic associations and interactions between query text and document text, thereby calculating and measuring the relevance score between them. The dual-encoder independently encodes each text sequence, rather than treating them as a pair of inputs. Specifically, the dual-encoder uses a dual-tower structure, employing two independent encoders to learn the latent semantic representations of the query and document texts—that is, query vectors and document vectors. The relevance score is then calculated based on these two vectors, using a relevance function (dot product / cosine similarity) between the query and document vectors. The dual-encoder structure can use two independent encoders or share the same encoder to encode both the query and document texts. The dual-encoder uses a transformer model structure, which boasts strong modeling capabilities, good scalability, and excellent parallel computation performance. The transformer model consists of multiple identical transformer layers stacked on top of each other. Each transformer layer contains two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. Furthermore, each sub-layer is followed by a residual connection and layer normalization. The multi-head self-attention mechanism calculates the degree of association between each word in the input sequence and other words, thus capturing long-distance dependencies in the sentence. The multi-head mechanism allows the transformer model to simultaneously pay attention to information from different locations in the text sequence. 
The feed-forward neural network is used to extract local features from the input sequence and typically contains two fully connected layers and an activation function.
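
The relevance function applied to the two encoder outputs can be illustrated without any model. In the sketch below, the vectors are hand-written stand-ins for the query and document vectors a dual encoder would produce, and the function names are illustrative only.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot-product relevance between two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine-similarity relevance; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank_documents(
    query_vec: Sequence[float], doc_vecs: List[Sequence[float]]
) -> List[int]:
    """Return document indices ordered from most to least relevant."""
    scores = [cosine(query_vec, doc) for doc in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for encoder outputs.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[0.9, 0.1, 0.8], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
print(rank_documents(query_vec, doc_vecs))
```

Because query and document are encoded independently, document vectors can be computed once in advance and only the query needs encoding at search time.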

[0066] For example, the initial text vector retrieval model can be trained using the first contrastive learning sample data to perform the first stage of contrastive learning. This stage of contrastive learning focuses on learning the relationship between query text and related document text. By minimizing the distance between positive sample pairs (query text and positive sample document text) and maximizing the distance between negative sample pairs (query text and negative sample document text), the vector retrieval model learns how to map related query text and document text to similar positions in the vector space. This helps improve the model's accuracy and recall in retrieval tasks. In one embodiment, specifically for each first contrastive learning sample data, a positive sample pair and multiple negative sample pairs can be constructed. These pairs are then input into the dual encoder of the initial text vector retrieval model to determine the similarity between samples within each pair. The loss is then calculated based on the similarity to optimize the initial text vector retrieval model. This process is iterated repeatedly using different first contrastive learning sample data to complete the training of the initial text vector retrieval model.
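
This section does not name the loss function. One common choice for one positive pair against several negative pairs is the InfoNCE (softmax cross-entropy) loss, assumed here purely for illustration; the temperature value is likewise an assumption.

```python
import math
from typing import Sequence


def info_nce(
    pos_sim: float, neg_sims: Sequence[float], temperature: float = 0.05
) -> float:
    """Cross-entropy of picking the positive pair among all candidate pairs.

    pos_sim is the similarity of the positive pair and neg_sims holds the
    similarities of the negative pairs for the same query.
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# A well-separated sample gives a small loss, a confused one a large loss.
good = info_nce(0.9, [0.1, 0.2, 0.0])
bad = info_nce(0.1, [0.9, 0.8, 0.7])
print(good < bad)
```

Minimizing this quantity raises the positive pair's similarity relative to the negatives, which is the pull-together, push-apart behavior described in paragraph [0066].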

[0067] Step 207: Based on the first contrastive learning sample data and the second contrastive learning sample data, the first text retrieval model is fine-tuned and trained to obtain the second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0068] For example, after the first stage of model training, the first text retrieval model primarily models the key relationship between query text and document text. Contrastive learning between query text and similar query text (query-query contrastive learning) focuses on the relationship between similar query texts with different wording, so that the vector retrieval model learns to map query texts that express the same or similar intent in different words to nearby positions in the vector space. Therefore, building on the first stage of contrastive learning, query-query contrastive learning is introduced to further train the vector retrieval model. In this training stage, query-passage contrastive learning and query-query contrastive learning can be treated as two related sub-tasks trained on the same first text retrieval model. During training, the two sub-tasks reinforce each other and jointly improve the performance of the vector retrieval model. By minimizing the distance between positive sample pairs (similar queries) while maximizing the distance between negative sample pairs (dissimilar queries), the model becomes more robust to queries, that is, better able to handle slightly modified queries. In one embodiment, for each second contrastive learning sample data, a positive sample pair and multiple negative sample pairs can be constructed. These positive and negative sample pairs are then input into the dual encoder of the first text retrieval model to determine the similarity between the samples within each sample pair of the second contrastive learning sample data. This process also uses the first contrastive learning sample data; the similarity of the sample pairs in the first contrastive learning sample data can be obtained as described in step 205. The loss is then calculated by combining the two sets of similarities to optimize the first text retrieval model. By iteratively repeating this process, the fine-tuning of the first text retrieval model is completed, yielding the second text retrieval model.

[0069] The aforementioned model data processing method, apparatus, computer device, computer-readable storage medium, and computer program product acquire first contrastive learning sample data consisting of a query text, a list of positive sample document texts, and a list of negative sample document texts. With the first contrastive learning sample data, the semantic association and semantic interaction between query text and document text can be modeled. Sample generation processing is performed based on the query text to construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. With the second contrastive learning sample data, the model can learn to map query texts expressing the same or similar intent to nearby positions in the vector space. An initial text vector retrieval model is then trained on the first contrastive learning sample data to obtain a first text retrieval model, which learns the basic association between query text and document text. The first text retrieval model is then fine-tuned on the first and second contrastive learning sample data to obtain a second text retrieval model; the fine-tuning learns the relationships between similar query texts. By combining these two kinds of contrastive learning, the second text retrieval model can better understand and capture the similarities and differences between different query texts, which improves its robustness to query texts and ultimately improves the accuracy and efficiency of text retrieval based on it. The training strategy is phased: the model first learns the basic relationship between query text and document text and then learns the relationships between similar query texts. This can reduce the training difficulty to a certain extent and improve the model's convergence speed and stability.

[0070] In an exemplary embodiment, step 203 includes: filling the query text into the prompt word template to obtain text generation prompt words; inputting the text generation prompt words into a large language model, and performing text generation processing through the large language model to obtain similar query texts corresponding to the query text; extracting dissimilar query texts that are not similar to the query text from each of the first contrastive learning sample data to form a list of dissimilar query texts; and summarizing the query text, similar query texts, and dissimilar query text lists to form second contrastive learning sample data.

[0071] Large Language Models (LLMs) are a type of natural language processing model built using deep learning techniques, designed to simulate human language processing and generation capabilities. LLMs typically employ neural network structures and are trained on large-scale text data to learn the grammar, semantics, and contextual information within the text, yielding a model with a certain level of language ability. In this application, an LLM is used to generate similar query texts so as to complete the construction of the second contrastive learning sample data. A prompt is text or an instruction provided as input to the LLM to guide it toward a specific output; it is used to trigger the model to produce the expected response.

[0072] For example, the solution of this application can use a large language model to generate similar query text. The large language model has strong semantic understanding and in-context learning capabilities, so with a suitable prompt word template it can rewrite the original query text. The rewritten query text is similar to the original query text, has the same retrieval intent, and is relevant to the positive sample document text. The query text is filled into the prompt word template to obtain the text generation prompt words. The prompt word text input to the large language model includes three parts: instructions describing the rewriting task, the original query text, and the positive sample document text. The instructions describing the rewriting task are the template content, which is the same for all query texts. The original query text and the positive sample document text are the content filled into the template. In one embodiment, the prompt word text can be as shown in Figure 3. Take the query text "How does snowmelt affect the urban heat island effect?" and the positive sample document text "Snowmelt exacerbates the urban heat island effect. Due to reduced snow cover, the amount of solar radiation absorbed by the ground increases, leading to higher urban temperatures and thus exacerbating the urban heat island effect." These are combined with the instructions for the rewriting task: rewrite the query text using different expressions, ensuring that the rewritten question is similar to the original question, has the same intent, and is relevant to the document text. Combining these three elements yields the prompt text for the large language model. This prompt text is then input into the large language model to obtain the similar query text: "What impact does snowmelt have on the urban heat island effect?" The rewritten similar query text can then be used as a positive sample of the original query text. Other query texts can be randomly selected from the first contrastive learning sample data as negative samples to construct the list of dissimilar query texts. Finally, the query text, the similar query text, and the list of dissimilar query texts are combined to form the second contrastive learning sample data. In this embodiment, using a large language model to generate similar query texts can effectively improve the efficiency and accuracy of constructing similar query texts, thereby ensuring the effectiveness of the subsequent contrastive learning.
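The construction described above can be sketched as follows. This is only an illustrative sketch: the template wording, the function names, the dictionary layout of a sample, the extra example queries, and the `fake_llm` stand-in for a real large language model call are all assumptions, not details given in this application.

```python
import random

# Hypothetical instruction text paraphrasing the rewriting task described above.
PROMPT_TEMPLATE = (
    "Rewrite the query using a different expression. The rewritten question must "
    "be similar to the original question, keep the same intent, and stay relevant "
    "to the document.\n"
    "Query: {query}\n"
    "Document: {document}\n"
    "Rewritten query:"
)


def build_prompt(query, positive_document):
    """Fill the query text and positive sample document text into the template."""
    return PROMPT_TEMPLATE.format(query=query, document=positive_document)


def build_second_sample(query, similar_query, all_queries, num_negatives, rng):
    """Assemble (query, similar query list, dissimilar query list) for one query."""
    candidates = [q for q in all_queries if q != query]
    negatives = rng.sample(candidates, min(num_negatives, len(candidates)))
    return {"query": query, "similar": [similar_query], "dissimilar": negatives}


def fake_llm(prompt):
    """Stand-in for a real large language model call (assumption for this sketch)."""
    return "What impact does snowmelt have on the urban heat island effect?"


# The second and third queries are made-up placeholders for other training queries.
queries = [
    "How does snowmelt affect the urban heat island effect?",
    "What causes ocean tides?",
    "How do vaccines work?",
]
document = "Snowmelt exacerbates the urban heat island effect."
prompt = build_prompt(queries[0], document)
sample = build_second_sample(
    queries[0], fake_llm(prompt), queries, 2, random.Random(0)
)
```

Passing a seeded `random.Random` instance keeps the negative sampling reproducible, which is convenient when the same sample set has to be rebuilt across training runs.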

[0073] Furthermore, the method also includes: replacing the query text in the first contrastive learning sample data with the corresponding similar query text to obtain augmented sample data; and updating the set of first contrastive learning sample data based on the augmented sample data. Once the similar query text has been obtained, it can be used for data augmentation of the first contrastive learning sample data. Because the similar query text and the original query text are semantically similar, both should retrieve the same document text. Therefore, the query text in a first contrastive learning sample data can be replaced with the constructed similar query text to obtain a new first contrastive learning sample data. Constructing multiple such samples completes the data augmentation. During training, the initial text vector retrieval model can be trained on the updated set of first contrastive learning sample data to obtain the first text retrieval model. This allows the vector retrieval model to learn the association between differently worded query texts and the positive sample document texts, which helps improve its robustness to slightly modified query texts.
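A minimal sketch of this augmentation step is given below. The dictionary layout of a sample, the function name, and the negative document text are hypothetical and chosen only for illustration.

```python
def augment_first_samples(first_samples, similar_queries):
    """Return the original samples plus copies whose query is a similar query.

    first_samples: list of dicts with keys "query", "positives", "negatives".
    similar_queries: dict mapping an original query to its generated similar query.
    """
    augmented = list(first_samples)
    for sample in first_samples:
        rewritten = similar_queries.get(sample["query"])
        if rewritten is None:
            continue  # no similar query was generated for this sample
        new_sample = dict(sample)  # same positive and negative documents
        new_sample["query"] = rewritten
        augmented.append(new_sample)
    return augmented


first_samples = [{
    "query": "How does snowmelt affect the urban heat island effect?",
    "positives": ["Snowmelt exacerbates the urban heat island effect."],
    # Made-up negative document, used only for illustration.
    "negatives": ["Tides are caused by the gravitational pull of the moon."],
}]
similar_queries = {
    first_samples[0]["query"]:
        "What impact does snowmelt have on the urban heat island effect?",
}
updated = augment_first_samples(first_samples, similar_queries)
```

The original samples are kept alongside the augmented ones, so the updated set contains both wordings of each query paired with the same documents.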

[0074] In one exemplary embodiment, step 205 includes:

[0075] The query text, positive sample document text, and negative sample document text in the first contrastive learning sample data are vectorized to obtain the query input vector corresponding to the query text, the positive sample input vector corresponding to the positive sample document text, and the negative sample input vector corresponding to the negative sample document text.

[0076] Construct positive sample vector pairs consisting of the query input vector and the positive sample input vector, and negative sample vector pairs consisting of the query input vector and each negative sample input vector.

[0077] The positive sample vector pairs and each negative sample vector pair are input into the initial text vector retrieval model to obtain the first cosine similarity score of the encoded feature vector of the positive sample vector pair and the corresponding second cosine similarity score of the encoded feature vector of each negative sample vector pair.

[0078] Loss calculation is performed based on the first and second cosine similarity scores to obtain loss parameters. The loss parameters are used to minimize the distance between the positive sample vector and the query vector, and to maximize the distance between each negative sample vector and the query vector.

[0079] The model parameters of the initial text vector retrieval model are updated based on the loss parameters to obtain the first text retrieval model.

[0080] Vectorization is a form of preprocessing, and preprocessing is a crucial part of machine learning that covers data organization, cleaning, transformation, and expansion. Data cleaning and preprocessing aim to improve model performance, increase prediction accuracy, and reduce errors. In this application, vectorization preprocessing primarily transforms text data into a form acceptable to the encoder model. Specifically, it constructs the corresponding input vectors for the query text, positive sample document text, and negative sample document text. Positive and negative sample vector pairs are the paired feature data input to the model. Since the initial text vector retrieval model is implemented as a dual-tower model, positive and negative sample vector pairs can be formed and input into the dual-tower encoder model. The first cosine similarity score is calculated from the cosine similarity of the two vectors in the positive sample vector pair. Similarly, the second cosine similarity score is calculated from the cosine similarity of the two vectors in a negative sample vector pair. Cosine similarity assesses the similarity between two vectors by calculating the cosine of the angle between them; the vectors are placed in a vector space according to their coordinates, such as the familiar two-dimensional space. The loss parameter is a metric used during neural network training to measure the difference between the model's predictions and the true labels. Its magnitude reflects how well the model fits the training data with the current parameters. The loss parameter can be used to optimize the parameters of the initial text vector retrieval model through backpropagation, thereby improving the model's performance.

[0081] For example, the initial text vector retrieval model can be implemented using a dual encoder. The dual encoder encodes each text sequence independently rather than treating the two texts as a single joint input. Specifically, the dual encoder employs a dual-tower structure, using two independent encoders to learn the latent semantic representations of the query text and the document text, i.e., query vectors and document vectors. A relevance score is then obtained by applying a relevance function (dot product or cosine similarity) to the query vector and the document vector. Therefore, during model training, the query text, positive sample document text, and negative sample document text first need to undergo vectorization preprocessing. This can be achieved through word segmentation, indexing, etc., to transform the query text into a query input vector, the positive sample document text into a positive sample input vector, and the negative sample document text into a negative sample input vector. Then, since each input to the dual encoder is a pair of input vectors, a positive sample vector pair can be formed from the query input vector and the positive sample input vector, and multiple negative sample vector pairs can be formed from the query input vector and each negative sample input vector. After the vector pairs are formed, the two vectors of each pair are input into the dual encoder of the initial text vector retrieval model for encoding, resulting in two dense encoded feature vectors: the dense vector representation of the query text and the dense vector representation of the document text. With these two encoded feature vectors, the cosine similarity score between the two can be calculated, which satisfies the formula:

[0082]

[0083] After calculating the cosine similarity score for each positive and negative sample vector pair, loss calculation can be performed based on the first and second cosine similarity scores to determine the loss parameter corresponding to the current first contrastive learning sample data. The calculation process of the loss parameter can satisfy the following formula:

[0084]

[0085] In the formula, the two score terms are the first cosine similarity score (for the positive sample vector pair) and the second cosine similarity score (for each negative sample vector pair). During the contrastive learning training process, the model parameters of the dual encoder are updated by minimizing the above loss function, thereby minimizing the distance between positive sample pairs (query text and relevant positive sample document text) and maximizing the distance between negative sample pairs (query text and irrelevant negative sample document text). By constructing different training data and iterating the above training process multiple times, the final first text retrieval model can be obtained. In this embodiment, the dual encoder of the initial text vector retrieval model is trained through contrastive learning, enabling the model to learn how to map relevant query text and document text to nearby positions in the vector space. This ensures the model's accuracy and recall during the retrieval process.
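The loss formula is not reproduced in this text. One commonly used loss with the stated behavior is the InfoNCE-style softmax loss, sketched below in plain Python. Treating the loss as InfoNCE is an assumption made for illustration, as are the function and argument names; the scores are taken as already-computed similarity scores.

```python
import math


def contrastive_loss(positive_score, negative_scores):
    """InfoNCE-style loss: -log(exp(s_pos) / (exp(s_pos) + sum(exp(s_neg))))."""
    scores = [positive_score] + list(negative_scores)
    max_score = max(scores)  # subtract the max for numerical stability
    log_denominator = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    return log_denominator - positive_score


# Positive pair clearly more similar than the negatives: loss close to zero.
easy = contrastive_loss(5.0, [-5.0, -5.0])
# All pairs equally similar: loss equals log(number of pairs).
hard = contrastive_loss(0.0, [0.0, 0.0])
```

The loss shrinks as the positive score rises above the negative scores, which is the "pull positives closer, push negatives apart" behavior the paragraph describes.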

[0086] In an exemplary embodiment, vectorizing the query text in the first contrastive learning sample data to obtain the query input vector corresponding to the query text includes: performing word segmentation on the query text in the first contrastive learning sample data to obtain the word segmentation result of the query text; and indexing the query text based on the word segmentation result to obtain the query input vector corresponding to the query text.

[0087] For example, this application constructs the query input vector corresponding to the query text through vectorization preprocessing. First, the query text needs to be segmented into words or phrases. For instance, for the query text "How will melting ice and snow affect the urban heat island effect?", the plain text content is extracted first and then segmented into words such as "ice and snow, melting, will, how, affect, urban heat island effect". These segmentation results are then indexed to transform them into an efficient data structure, which forms the query input vector. Indexing can be implemented using dictionary encoding: each word is looked up in a pre-defined encoding dictionary and converted into its encoded form, so that "ice and snow, melting, will, how, affect, urban heat island effect" is represented as a sequence of word codes. In one embodiment, a special symbol "cls" can be added to the beginning of the query text to indicate the start of a sentence. In this case, the query input vector consists of the code for "cls" followed by the word codes, and it can be used as the input feature vector for the encoder model. The positive and negative sample document texts can be transformed in the same way as the query text to obtain their respective input vectors. In this embodiment, constructing the query input vector through word segmentation and indexing can effectively improve the construction efficiency and accuracy of the query input vector.
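The indexing step can be illustrated with a small dictionary-encoding sketch. The vocabulary, its numeric codes, the `[unk]` fallback entry, and the function name are hypothetical. Word segmentation is taken as already done, using the word list from the example above.

```python
# Hypothetical encoding dictionary; a real system would use a far larger one.
VOCAB = {
    "[unk]": 0,
    "cls": 1,
    "ice and snow": 2,
    "melting": 3,
    "will": 4,
    "how": 5,
    "affect": 6,
    "urban heat island effect": 7,
}


def index_tokens(tokens, vocab, add_cls=True):
    """Map segmented words to dictionary codes, optionally prefixed with "cls"."""
    ids = [vocab["cls"]] if add_cls else []
    # Words missing from the dictionary fall back to the "[unk]" code.
    ids.extend(vocab.get(token, vocab["[unk]"]) for token in tokens)
    return ids


tokens = ["ice and snow", "melting", "will", "how", "affect",
          "urban heat island effect"]
query_input = index_tokens(tokens, VOCAB)
```

With this dictionary, `query_input` starts with the code for "cls" and continues with one code per segmented word, in order.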

[0088] In an exemplary embodiment, inputting positive sample vector pairs into an initial text vector retrieval model to obtain a first cosine similarity score of the encoded feature vectors of the positive sample vector pairs includes: inputting the positive sample vector pairs into a dual encoder of the initial text vector retrieval model for encoding processing to obtain a first encoded feature vector corresponding to the query input vector and a second encoded feature vector corresponding to the positive sample input vector; determining the cosine similarity between the first encoded feature vector and the second encoded feature vector; and scaling the cosine similarity based on a preset scaling temperature coefficient to obtain a first cosine similarity score of the encoded feature vectors of the positive sample vector pairs.

[0089] For example, this application uses a dual encoder to encode vector pairs. After the positive sample vector pairs are obtained, they can be input into the dual encoder of the initial text vector retrieval model for encoding. This process is shown in Figure 4. The dual encoder is implemented using a transformer, which encodes the query input vector and the positive sample input vector of the positive sample vector pair. After calculation by the transformer, the corresponding dense vectors are obtained. For the query text, the output vector of the last transformer layer corresponding to the starting character "cls" is used as the dense vector representation of the query text, i.e., the first encoded feature vector. For the document text, the output vector of the last transformer layer corresponding to the starting character "cls" is used as the dense vector representation of the document text, i.e., the second encoded feature vector. The cosine similarity score between the first and second encoded feature vectors in the vector pair is then calculated:

[0090]

[0091] In the formula, the temperature coefficient is used to scale the cosine similarity. The temperature coefficient is a hyperparameter with a value greater than 0, typically between 0 and 1. In text generation, it adjusts the creativity and diversity of the generated text by affecting the probability distribution from which predicted words are sampled. When the temperature coefficient is high (e.g., 0.8, 1, or higher), the model tends to choose from a wider variety of words, resulting in riskier and more creative text, but also potentially more errors and inconsistencies. When the temperature coefficient is low (e.g., 0.2, 0.3, etc.), the model primarily chooses words with higher probabilities, resulting in smoother and more coherent text, although the generated text may appear overly conservative and repetitive. In practical applications, developers can choose an appropriate temperature coefficient based on the application requirements of the model. Similarly, for negative sample vector pairs, the corresponding second cosine similarity score can be calculated using the above method, and the corresponding loss parameter can then be calculated. In this embodiment, a dual encoder is used to estimate the cosine similarity of positive sample vector pairs, effectively ensuring the accuracy and efficiency of model training.
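The scaled score can be sketched as below. The text only states that the cosine similarity is scaled by a preset temperature coefficient; dividing by the temperature, as is common in contrastive learning, is an assumption here, as are the function names and the default value of 0.05.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scaled_score(u, v, temperature=0.05):
    """Cosine similarity divided by a temperature coefficient (assumed form)."""
    if temperature <= 0:
        raise ValueError("temperature must be greater than 0")
    return cosine_similarity(u, v) / temperature


query_vec = [1.0, 0.0]      # toy encoded feature vectors
positive_vec = [2.0, 0.0]   # same direction as the query
negative_vec = [0.0, 3.0]   # orthogonal to the query

positive_score = scaled_score(query_vec, positive_vec, temperature=0.5)
negative_score = scaled_score(query_vec, negative_vec, temperature=0.5)
```

Under this assumed form, a smaller temperature widens the gap between the positive and negative scores before they enter the loss, so the loss reacts more sharply to small differences in cosine similarity.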

[0092] In an exemplary embodiment, step 207 includes: inputting first contrastive learning sample data into a first text retrieval model to obtain a first contrastive loss parameter corresponding to the first contrastive learning sample data; inputting second contrastive learning sample data into the first text retrieval model to obtain a second contrastive loss parameter corresponding to the second contrastive learning sample data; performing loss calculation processing based on the first contrastive loss parameter, the second contrastive loss parameter, and preset weight parameters to obtain a total contrastive loss parameter; and updating the model parameters of the first text retrieval model based on the total contrastive loss parameter to obtain a second text retrieval model.

[0093] For example, after training the first text retrieval model, this application further includes a training phase combining first and second contrastive learning sample data. This phase introduces second contrastive learning sample data to further train the vector retrieval model through contrastive learning of similar query texts. Query text-document text contrastive learning and query text-similar query text contrastive learning are considered as two related sub-tasks, trained using the same vector retrieval model. During training, the two sub-tasks can mutually promote each other, jointly improving the performance of the vector retrieval model. For the specific training process, the two types of sample data can be input into the trained first text retrieval model respectively, thereby calculating the corresponding contrastive loss parameters for each type of sample. Then, the first contrastive loss parameter for query text-document text and the second contrastive loss parameter for query text-similar query text are combined into a total loss function. Specifically, this can be expressed as:

[0094]

[0095] In the formula, the terms are the first contrastive loss parameter, the preset weight parameter of the first contrastive loss parameter, the second contrastive loss parameter, and the preset weight parameter of the second contrastive loss parameter. The preset weight parameters of the two loss parameters can be adjusted according to actual needs. The model parameters of the first text retrieval model are then optimized and updated based on the total contrastive loss parameter. The resulting second text retrieval model can simultaneously learn the relationship between query text and document text and the relationship between similar query texts. By constructing different training data and iterating the above training process multiple times, the final second text retrieval model can be obtained. In this embodiment, training the text retrieval model with the two kinds of contrastive learning combined improves the robustness of the resulting second text retrieval model to query text while maintaining retrieval accuracy and recall. This gives the model better performance and adaptability in practical applications.
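The total loss formula is not reproduced in this text. A weighted sum consistent with the description would take the following form, where the symbols are names assumed here rather than the notation of this application:

```latex
\mathcal{L}_{\text{total}} = \lambda_{1}\,\mathcal{L}_{1} + \lambda_{2}\,\mathcal{L}_{2}
```

Here $\mathcal{L}_{1}$ stands for the first contrastive loss parameter (query text and document text), $\mathcal{L}_{2}$ for the second contrastive loss parameter (query text and similar query text), and $\lambda_{1}$, $\lambda_{2}$ for their preset weight parameters. For instance, with $\lambda_{1} = \lambda_{2} = 1$ the two sub-tasks contribute equally, while a smaller $\lambda_{2}$ keeps the fine-tuning closer to the query-document objective.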

[0096] In an exemplary embodiment, inputting the second contrastive learning sample data into the first text retrieval model and obtaining the second contrastive loss parameter corresponding to the second contrastive learning sample data includes:

[0097] The query text, similar query text, and dissimilar query text in the second contrastive learning sample data are vectorized to obtain the query input vector corresponding to the query text, the similar input vector corresponding to the similar query text, and the dissimilar input vector corresponding to the dissimilar query text.

[0098] Construct similar vector pairs consisting of the query input vector and similar input vectors, and dissimilar vector pairs consisting of the query input vector and each dissimilar input vector.

[0099] By inputting the similar vector pairs and each dissimilar vector pair into the first text retrieval model, the third cosine similarity score of the encoded feature vectors of the similar vector pairs and the fourth cosine similarity score of the encoded feature vectors of each dissimilar vector pair are obtained.

[0100] Loss calculation is performed based on the third and fourth cosine similarity scores to obtain the second contrastive loss parameter. The second contrastive loss parameter is used to minimize the distance between the similar vector and the query vector, and to maximize the distance between each dissimilar vector and the query vector.

[0101] For example, the first contrastive loss parameter can be calculated in the same way as the loss parameter during the training of the first text retrieval model. The second contrastive loss parameter can be calculated in the same way as well, except that the first contrastive learning sample data input to the model is replaced with the second contrastive learning sample data. First, the query text, similar query text, and dissimilar query text are vectorized to construct their respective input vectors. Then, the query input vector and the similar input vector are combined into a similar vector pair, and the query input vector and each dissimilar input vector are combined into dissimilar vector pairs. After the vector pairs are formed, the two vectors of each pair are input into the dual encoder of the first text retrieval model for encoding, resulting in two dense encoded feature vectors: for a similar vector pair, the dense vector representation of the query text and the dense vector representation of the similar query text. Cosine similarity is then calculated for the two vectors of each pair to obtain the corresponding third and fourth cosine similarity scores. With two encoded feature vectors, the cosine similarity score between them can be calculated, which satisfies the formula:

[0102]

[0103] After calculating the cosine similarity score for each similar vector pair and each dissimilar vector pair, loss calculation can be performed based on the third and fourth cosine similarity scores to determine the loss parameters corresponding to the current second contrastive learning sample data. The calculation process of the loss parameters can satisfy the following formula:

[0104]

[0105] In the formula, the two score terms are the third cosine similarity score (for the similar vector pair) and the fourth cosine similarity score (for each dissimilar vector pair). During contrastive learning training, the model parameters of the dual encoder are updated by minimizing the aforementioned loss function. By minimizing the distance between similar vector pairs (similar query texts) and maximizing the distance between dissimilar vector pairs (dissimilar query texts), the vector retrieval model can learn how to map query texts expressing the same or similar intent to nearby positions in the vector space. This helps improve the model's robustness to query text, i.e., its ability to handle slight modifications to the query text.

[0106] In an exemplary embodiment, inputting similar vector pairs into a first text retrieval model to obtain the third cosine similarity score of the encoded feature vectors of the similar vector pairs includes:

[0107] The similar vector pairs are input into the dual encoder of the first text retrieval model for encoding to obtain the third encoded feature vector corresponding to the query input vector and the fourth encoded feature vector corresponding to the similar input vector.

[0108] Determine the encoded cosine similarity between the third and fourth encoded feature vectors.

[0109] The encoded cosine similarity is scaled based on a preset scaling temperature coefficient to obtain the third cosine similarity score of the encoded feature vectors of the similar vector pairs.

[0110] For example, this application uses a dual encoder to encode vector pairs. After the similar vector pairs are obtained, they can be input into the dual encoder of the first text retrieval model for encoding. This process is shown in Figure 5. The dual encoder is implemented using a transformer, which encodes the query input vector and the similar input vector of the similar vector pair. After calculation by the transformer, the corresponding dense vectors are obtained. For the query text, the output vector of the last transformer layer corresponding to the starting character "cls" is used as the dense vector representation of the query text, i.e., the third encoded feature vector. For the similar query text, the output vector of the last transformer layer corresponding to the starting character "cls" is used as the dense vector representation of the similar query text, i.e., the fourth encoded feature vector. The cosine similarity score between the third and fourth encoded feature vectors in the vector pair is then calculated:

[0111]

[0112] In the formula, the two terms are the temperature coefficient used for scaling and the encoded cosine similarity. Similarly, for dissimilar vector pairs, the corresponding fourth cosine similarity scores can be calculated using the above method, and the corresponding loss parameters can then be calculated. In this embodiment, a dual encoder is used to estimate the cosine similarity of similar vector pairs, effectively ensuring the accuracy and efficiency of model training.

[0113] In an exemplary embodiment, the method further includes: obtaining a document query request, extracting the input query text of the document query request, and inputting the input query text into a second text retrieval model for text retrieval processing to obtain a target document that matches the input query text.

[0114] For example, this application also includes a text retrieval process based on a trained second text retrieval model. When a user needs to retrieve relevant document text, they can submit a document query request containing input query text to the server, and the server equipped with the second text retrieval model can process the document query request. First, the corresponding input query text needs to be extracted from the document query request. Then, the input query text is input into the second text retrieval model, which can convert the input query text into a corresponding query vector and calculate the similarity between the query vector and the document text feature vectors of each document in the database, thereby outputting the target document that matches the input query text. In this embodiment, processing the text query request through the second text retrieval model can effectively ensure the efficiency and accuracy of request processing.

[0115] Furthermore, the input query text is input into the second text retrieval model for text retrieval processing to obtain the target document text that matches the input query text. This includes: inputting the input query text into the second text retrieval model for vector encoding processing to obtain the input query vector corresponding to the input query text; determining the cosine similarity score between the input query vector and different document vectors in the document database to obtain the target document corresponding to the document vector with the highest cosine similarity score.

[0116] For example, the utilization process of the second text retrieval model can be divided into two steps: indexing and retrieval. The first step is the indexing process, which mainly uses the second text retrieval model to index the document database, converting N document texts in the database into fixed-dimensional document vector representations. In the document retrieval stage, the input query text can be input into a second text retrieval model for vector encoding to obtain the input query vector q corresponding to the input query text. Then, q is compared with the document vectors in the aforementioned document database. For each vector in the array, calculate its cosine similarity score, which ranges from 0 to 1. The specific calculation process is as follows:

[0117] $$\mathrm{sim}(q, d_i) = \frac{q \cdot d_i}{\lVert q \rVert \, \lVert d_i \rVert}, \quad i = 1, \dots, N$$

[0118] where $\lVert q \rVert$ represents the modulus of the input query vector $q$, and $\lVert d_i \rVert$ represents the modulus of the document vector $d_i$ of the i-th document in the document database. A higher cosine similarity score indicates a greater similarity between the two texts; conversely, a lower score indicates a lower similarity. The document with the highest similarity score is selected and returned. The specific processing flow is shown in Figure 6. In this embodiment, by pre-establishing the document vectors of the document database during the indexing stage, and then using these document vectors to query the target document during the query stage, the query efficiency can be effectively improved.
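
As a minimal illustration of the retrieval stage described above, the following Python sketch computes the cosine similarity score between a query vector and each pre-indexed document vector and returns the highest-scoring document. The function names `cosine_similarity` and `retrieve`, the dictionary-based index, and the toy vectors are assumptions made for illustration only; in practice the vectors would be produced by the second text retrieval model.

```python
import math

def cosine_similarity(a, b):
    """Return (a . b) / (|a| * |b|) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)

def retrieve(query_vector, document_index):
    """Return (doc_id, score) for the document vector most similar to the query."""
    best_id, best_score = None, float("-inf")
    for doc_id, doc_vector in document_index.items():
        score = cosine_similarity(query_vector, doc_vector)
        if score > best_score:
            best_id, best_score = doc_id, score
    return best_id, best_score

# Toy index built once at indexing time (hypothetical vectors).
document_index = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.6, 0.8, 0.0],
}
best_id, best_score = retrieve([0.5, 0.9, 0.0], document_index)
```

Because the document vectors are computed once and stored, only the query has to be encoded at query time; a production system would typically replace the linear scan with a vector index.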

[0119] This application also provides an application scenario, which is illustrated by taking the above-mentioned model data processing method as an example. The model data processing method specifically includes:

[0120] When users wish to iterate on the document search function of an information sharing platform, they can train a document search model using the model data processing method described in this application, and then deploy the document search model within the platform to optimize the platform's document search functionality. Platform users can then invoke the document search model via commands, and subsequently use the document search model to perform relevant processing actions during the document search process.

[0121] The overall training process for the document search model is shown in Figure 7. The first step is to construct the training data required for contrastive learning. This can be done using historical data from the information sharing platform. For each historical query text input by a user, the accurate query result corresponding to that query text is taken as the positive sample document text, and other irrelevant document texts are taken as negative sample document texts. The first contrastive learning sample data is then constructed, consisting of the query text, the positive sample document text, and the list of negative sample document texts. Furthermore, since this application also needs to model the correlation between similar query texts, a large language model can be used to construct, for each query text, its corresponding similar query texts. The similar query texts of the other query texts can then form the list of dissimilar query texts corresponding to that query text.
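
The construction of the first contrastive learning sample data from historical data can be sketched as follows. This is a minimal sketch under stated assumptions: the history is assumed to be a list of (query text, positive document id) pairs, every other document in the corpus is assumed to be irrelevant to the query, and negatives are drawn by random sampling. The function name `build_first_samples` and the dictionary layout are hypothetical and do not come from the application.

```python
import random

def build_first_samples(history, corpus, num_negatives=2, seed=0):
    """Build (query, positive, negatives) samples.

    history: list of (query_text, positive_doc_id) pairs (assumed log format).
    corpus:  dict mapping doc_id -> document text.
    """
    rng = random.Random(seed)
    samples = []
    for query_text, positive_id in history:
        # Assumption: every document other than the positive one is irrelevant.
        candidates = [doc_id for doc_id in corpus if doc_id != positive_id]
        negative_ids = rng.sample(candidates, min(num_negatives, len(candidates)))
        samples.append({
            "query": query_text,
            "positive": corpus[positive_id],
            "negatives": [corpus[doc_id] for doc_id in negative_ids],
        })
    return samples
```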

[0122] The first training stage is the query text-document text contrastive training process, carried out using the first contrastive learning sample data. Specifically, the query text, positive sample document text, and negative sample document texts in the first contrastive learning sample data are vectorized to obtain the query input vector corresponding to the query text, the positive sample input vector corresponding to the positive sample document text, and the negative sample input vectors corresponding to the negative sample document texts. A positive sample vector pair consisting of the query input vector and the positive sample input vector, and negative sample vector pairs each consisting of the query input vector and one of the negative sample input vectors, are constructed. The positive sample vector pair and each negative sample vector pair are input into the initial text vector retrieval model to obtain the first cosine similarity score of the encoded feature vectors of the positive sample vector pair and the corresponding second cosine similarity score of the encoded feature vectors of each negative sample vector pair. Loss calculation is performed based on the first and second cosine similarity scores to obtain loss parameters. The loss parameters are used to minimize the distance between the vectors in the positive sample vector pair and maximize the distance between the vectors in each negative sample vector pair. The model parameters of the initial text vector retrieval model are updated based on the loss parameters to obtain the first text retrieval model.
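
The loss calculation over the first and second cosine similarity scores can be illustrated with the sketch below. This paragraph does not give the exact loss formula, so the sketch assumes an InfoNCE-style softmax cross-entropy, a common choice for contrastive retrieval training; the function name `contrastive_loss` is hypothetical. The loss shrinks as the positive pair's score rises relative to the negative pairs' scores.

```python
import math

def contrastive_loss(positive_score, negative_scores):
    """Assumed InfoNCE-style loss for one query.

    loss = -log( exp(s_pos) / (exp(s_pos) + sum_j exp(s_neg_j)) )
    """
    scores = [positive_score] + list(negative_scores)
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum_exp - positive_score
```

For example, a positive score far above the negative scores yields a loss near zero, while a positive score far below them yields a large loss, which is what drives the vectors of positive pairs together and those of negative pairs apart.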

[0123] After the first stage of training is completed, a second stage of training is required, which combines the training process of query text-document text and query text-similar query text. Specifically, the first contrastive learning sample data is input into the first text retrieval model to obtain the first contrastive loss parameter corresponding to the first contrastive learning sample data; the second contrastive learning sample data is input into the first text retrieval model to obtain the second contrastive loss parameter corresponding to the second contrastive learning sample data; loss calculation is performed based on the first contrastive loss parameter, the second contrastive loss parameter, and preset weight parameters to obtain the total contrastive loss parameter; the model parameters of the first text retrieval model are updated based on the total contrastive loss parameter to obtain the second text retrieval model.
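
One way to combine the two contrastive loss parameters is sketched below. This paragraph does not state how the preset weight parameters enter the calculation, so the sketch assumes a simple weighted sum in which a single hypothetical weight scales the query-to-similar-query loss; the function name and the default value 0.5 are likewise assumptions.

```python
def total_contrastive_loss(first_losses, second_losses, weight=0.5):
    """Assumed combination: mean(first) + weight * mean(second).

    first_losses:  per-sample losses on the first contrastive learning sample data.
    second_losses: per-sample losses on the second contrastive learning sample data.
    weight:        hypothetical preset weight for the second loss term.
    """
    first = sum(first_losses) / len(first_losses)
    second = sum(second_losses) / len(second_losses)
    return first + weight * second
```

Under this assumption, a weight of zero reduces the second stage to the first-stage objective, and larger weights place more emphasis on matching similar query texts.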

[0124] The resulting second text retrieval model is a machine learning model that can be deployed to a specified platform environment, used to retrieve relevant document text during the inference phase. After deploying the second text retrieval model to the platform, the platform's documents can be converted into vector form and saved through indexing. Subsequent user queries will directly compare the vector of the user's input query text with the vector of the document text stored in the database to determine the user's query results.

[0125] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0126] Based on the same inventive concept, this application also provides a model data processing apparatus for implementing the model data processing method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more model data processing apparatus embodiments provided below can be found in the limitations of the model data processing method described above, and will not be repeated here.

[0127] In one exemplary embodiment, as shown in Figure 8, a model data processing device is provided, comprising:

[0128] The first sample construction module 801 is used to obtain first contrastive learning sample data consisting of query text, positive sample document text, and a list of negative sample document text. The query text has a semantic relationship with the positive sample document text, but the query text does not have a semantic relationship with the negative sample document text in the negative sample document text list.

[0129] The second sample construction module 803 is used to perform sample generation processing based on the query text, and construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts. The query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts.

[0130] The first model training module 805 is used to train the initial text vector retrieval model based on the first contrastive learning sample data to obtain the first text retrieval model.

[0131] The second model training module 807 is used to fine-tune the first text retrieval model based on the first and second contrastive learning sample data to obtain the second text retrieval model. The second text retrieval model is used to query target documents that semantically match the input query text.

[0132] In one embodiment, the second sample construction module 803 is specifically used to: fill the query text into the prompt word template to obtain text generation prompt words; input the text generation prompt words into a large language model, and perform text generation processing through the large language model to obtain similar query texts corresponding to the query text; extract dissimilar query texts that are not similar to the query text from each of the first contrastive learning sample data to form a list of dissimilar query texts; and summarize the query text, similar query texts, and dissimilar query text lists to form the second contrastive learning sample data.
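
The sample construction performed by this module can be sketched as follows. The prompt wording, the function names, and the `generate_text` callable (a stand-in for a call to a large language model) are all hypothetical; no real model API is implied. The dissimilar list is assumed here to be the query texts of the other samples.

```python
# Hypothetical prompt word template; the actual wording is not given in the text.
PROMPT_TEMPLATE = (
    "Rewrite the following search query so that it keeps the same meaning "
    "but uses different wording:\n{query}"
)

def echo_generator(prompt):
    """Stand-in for a large language model call (illustration only)."""
    return "paraphrase: " + prompt.splitlines()[-1]

def build_second_samples(query_texts, generate_text=echo_generator):
    """Build (query, similar list, dissimilar list) samples."""
    samples = []
    for query in query_texts:
        prompt = PROMPT_TEMPLATE.format(query=query)
        similar = [generate_text(prompt)]
        # Assumption: query texts of the other samples are treated as dissimilar.
        dissimilar = [other for other in query_texts if other != query]
        samples.append({"query": query, "similar": similar, "dissimilar": dissimilar})
    return samples
```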

[0133] In one embodiment, a data augmentation module is further included, configured to: replace the query text in the first contrastive learning sample data with the corresponding similar query text to obtain augmented sample data; and update the set of the first contrastive learning sample data based on the augmented sample data. The first model training module 805 is specifically configured to: train the initial text vector retrieval model based on the updated set of the first contrastive learning sample data to obtain the first text retrieval model.
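
The augmentation step can be sketched as follows, assuming each sample is a dictionary with `query`, `positive`, and `negatives` keys and that the similar query texts are available as a mapping from query text to a list of similar texts. These names and the data layout are hypothetical.

```python
def augment_first_samples(first_samples, similar_queries):
    """Return the original samples plus copies whose query is a similar query text.

    first_samples:   list of dicts with "query", "positive", "negatives" keys.
    similar_queries: dict mapping a query text to a list of similar query texts.
    """
    augmented = []
    for sample in first_samples:
        for similar_text in similar_queries.get(sample["query"], []):
            # Keep the same positive and negative documents; swap only the query.
            augmented.append({**sample, "query": similar_text})
    return list(first_samples) + augmented
```

Each augmented sample reuses the document labels of its source sample, so the training set grows without additional annotation.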

[0134] In one embodiment, the first model training module 805 is specifically used to: vectorize the query text, positive sample document text, and negative sample document texts in the first contrastive learning sample data to obtain the query input vector corresponding to the query text, the positive sample input vector corresponding to the positive sample document text, and the negative sample input vectors corresponding to the negative sample document texts; construct a positive sample vector pair composed of the query input vector and the positive sample input vector, and negative sample vector pairs each composed of the query input vector and one of the negative sample input vectors; input the positive sample vector pair and each negative sample vector pair into the initial text vector retrieval model to obtain the first cosine similarity score of the encoded feature vectors of the positive sample vector pair and the corresponding second cosine similarity score of the encoded feature vectors of each negative sample vector pair; perform loss calculation based on the first cosine similarity score and the second cosine similarity scores to obtain loss parameters, which are used to minimize the distance between the vectors in the positive sample vector pair and maximize the distance between the vectors in each negative sample vector pair; and update the model parameters of the initial text vector retrieval model based on the loss parameters to obtain the first text retrieval model.

[0135] In one embodiment, the first model training module 805 is specifically used to: perform word segmentation on the query text in the first contrastive learning sample data to obtain the word segmentation result of the query text; and perform indexing on the query text based on the word segmentation result to obtain the query input vector corresponding to the query text.
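
Word segmentation followed by indexing can be sketched as below. The sketch assumes a simple regular-expression word splitter and a vocabulary with hypothetical `[PAD]` and `[UNK]` entries; a real system would more likely use the tokenizer that ships with the underlying encoder, and Chinese text would need a dedicated segmenter.

```python
import re

def segment(text):
    """Assumed segmentation: lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_vocabulary(texts):
    """Assign an integer index to every token seen in the given texts."""
    vocab = {"[PAD]": 0, "[UNK]": 1}  # hypothetical special tokens
    for text in texts:
        for token in segment(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def to_input_vector(text, vocab, max_length=8):
    """Map tokens to indices, then truncate or pad to a fixed length."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in segment(text)][:max_length]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))
```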

[0136] In one embodiment, the first model training module 805 is specifically used to: input the positive sample vector pair into the dual encoder of the initial text vector retrieval model for encoding processing to obtain the first encoded feature vector corresponding to the query input vector and the second encoded feature vector corresponding to the positive sample input vector; determine the cosine similarity between the first encoded feature vector and the second encoded feature vector; and scale the cosine similarity based on a preset scaling temperature coefficient to obtain the first cosine similarity score of the encoded feature vectors of the positive sample vector pair.
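
The temperature scaling step can be sketched as follows. The sketch assumes that scaling means dividing the cosine similarity by the temperature coefficient, which is the usual convention in contrastive learning; the default value 0.05 and the function name are hypothetical. The two inputs stand for the encoded feature vectors produced by the dual encoder, which is not modeled here.

```python
import math

def scaled_similarity_score(query_features, document_features, temperature=0.05):
    """Cosine similarity of two encoded feature vectors divided by a temperature."""
    dot = sum(x * y for x, y in zip(query_features, document_features))
    norm_q = math.sqrt(sum(x * x for x in query_features))
    norm_d = math.sqrt(sum(y * y for y in document_features))
    if norm_q == 0.0 or norm_d == 0.0:
        raise ValueError("encoded feature vectors must be non-zero")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    return (dot / (norm_q * norm_d)) / temperature
```

A smaller temperature stretches the score range, so that differences between positive and negative pairs have a stronger effect on a softmax-based loss.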

[0137] In one embodiment, the second model training module 807 is specifically used for: inputting first contrastive learning sample data into the first text retrieval model to obtain the first contrastive loss parameter corresponding to the first contrastive learning sample data; inputting second contrastive learning sample data into the first text retrieval model to obtain the second contrastive loss parameter corresponding to the second contrastive learning sample data; performing loss calculation processing based on the first contrastive loss parameter, the second contrastive loss parameter, and preset weight parameters to obtain the total contrastive loss parameter; and updating the model parameters of the first text retrieval model based on the total contrastive loss parameter to obtain the second text retrieval model.

[0138] In one embodiment, the second model training module 807 is specifically used to: vectorize the query text, similar query text, and dissimilar query text in the second contrastive learning sample data to obtain the query input vector corresponding to the query text, the similar input vector corresponding to the similar query text, and the dissimilar input vector corresponding to the dissimilar query text; construct similar vector pairs composed of the query input vector and the similar input vector, and dissimilar vector pairs composed of the query input vector and each dissimilar input vector; input the similar vector pairs and each dissimilar vector pair into the first text retrieval model to obtain the third cosine similarity score of the encoded feature vector of the similar vector pair, and the fourth cosine similarity score of the encoded feature vector of each dissimilar vector pair; perform loss calculation based on the third cosine similarity and the fourth cosine similarity to obtain the second contrastive loss parameter, which is used to minimize the distance between vectors in the similar vector pair and maximize the distance between vectors in the dissimilar vector pair.

[0139] In one embodiment, the second model training module 807 is specifically used to: input the similar vector pair into the dual encoder of the first text retrieval model for encoding processing to obtain the third encoded feature vector corresponding to the query input vector and the fourth encoded feature vector corresponding to the similar input vector; determine the encoding cosine similarity between the third encoded feature vector and the fourth encoded feature vector; and scale the encoding cosine similarity based on a preset scaling temperature coefficient to obtain the third cosine similarity score of the encoded feature vectors of the similar vector pair.

[0140] In one embodiment, the system further includes a text retrieval module, configured to: obtain a document query request; extract the input query text of the document query request; input the input query text into a second text retrieval model for text retrieval processing; and obtain a target document that matches the input query text.

[0141] In one embodiment, the text retrieval module is further configured to: input the input query text into a second text retrieval model for vector encoding processing to obtain the input query vector corresponding to the input query text; determine the cosine similarity score between the input query vector and different document vectors in the document database to obtain the target document corresponding to the document vector with the highest cosine similarity score.

[0142] Each module in the aforementioned model data processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.

[0143] In one exemplary embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 9. This computer device includes a processor, memory, input/output (I/O) interfaces, and a communication interface. The processor, memory, and I/O interfaces are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores data related to model data processing. The I/O interfaces are used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements a model data processing method.

[0144] In one exemplary embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 10. The computer device includes a processor, memory, input/output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input/output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input/output interfaces. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input/output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, Near Field Communication (NFC), or other technologies. When the computer program is executed by the processor, it implements a model data processing method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen; buttons, trackballs, or touchpads set on the casing of the computer device; or an external keyboard, touchpad, or mouse.

[0145] Those skilled in the art will understand that the structure shown in Figure 10 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0146] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0147] In one embodiment, a computer-readable storage medium is provided storing a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0148] In one embodiment, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, causing the computer device to perform the steps in the above method embodiments.

[0149] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0150] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc., and are not limited to these.

[0151] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this application.

[0152] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A model data processing method, characterized in that the method includes: obtaining first contrastive learning sample data consisting of a query text, a positive sample document text, and a list of negative sample document texts, wherein the query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document texts in the list of negative sample document texts; performing sample generation processing based on the query text to construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts, wherein the query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts; training an initial text vector retrieval model based on the first contrastive learning sample data to obtain a first text retrieval model; and fine-tuning the first text retrieval model based on the first contrastive learning sample data and the second contrastive learning sample data to obtain a second text retrieval model, wherein the second text retrieval model is used to query target documents that semantically match an input query text.

2. The method according to claim 1, characterized in that the step of performing sample generation processing based on the query text to construct second contrastive learning sample data consisting of the query text, a list of similar query texts, and a list of dissimilar query texts includes: filling the query text into a prompt word template to obtain text generation prompt words; inputting the text generation prompt words into a large language model, and performing text generation processing through the large language model to obtain similar query texts corresponding to the query text; extracting dissimilar query texts that are not similar to the query text from each of the first contrastive learning sample data to form the list of dissimilar query texts; and combining the query text, the list of similar query texts, and the list of dissimilar query texts to form the second contrastive learning sample data.

3. The method according to claim 2, characterized in that the method further includes: replacing the query text in the first contrastive learning sample data with the corresponding similar query text to obtain augmented sample data; and updating the set of the first contrastive learning sample data based on the augmented sample data; and the step of training the initial text vector retrieval model based on the first contrastive learning sample data to obtain the first text retrieval model includes: training the initial text vector retrieval model based on the updated set of the first contrastive learning sample data to obtain the first text retrieval model.

4. The method according to claim 1, characterized in that the step of training the initial text vector retrieval model based on the first contrastive learning sample data to obtain the first text retrieval model includes: vectorizing the query text, the positive sample document text, and the negative sample document texts in the first contrastive learning sample data to obtain a query input vector corresponding to the query text, a positive sample input vector corresponding to the positive sample document text, and negative sample input vectors corresponding to the negative sample document texts; constructing a positive sample vector pair consisting of the query input vector and the positive sample input vector, and negative sample vector pairs each consisting of the query input vector and one of the negative sample input vectors; inputting the positive sample vector pair and each of the negative sample vector pairs into the initial text vector retrieval model to obtain a first cosine similarity score of the encoded feature vectors of the positive sample vector pair and a corresponding second cosine similarity score of the encoded feature vectors of each negative sample vector pair; performing loss calculation based on the first cosine similarity score and the second cosine similarity scores to obtain loss parameters, wherein the loss parameters are used to minimize the distance between the vectors in the positive sample vector pair and to maximize the distance between the vectors in each negative sample vector pair; and updating the model parameters of the initial text vector retrieval model based on the loss parameters to obtain the first text retrieval model.

5. The method according to claim 4, characterized in that preprocessing the query text in the first contrastive learning sample data to obtain the query input vector corresponding to the query text includes: performing word segmentation on the query text in the first contrastive learning sample data to obtain a word segmentation result of the query text; and indexing the query text based on the word segmentation result to obtain the query input vector corresponding to the query text.

6. The method according to claim 4, characterized in that inputting the positive sample vector pair into the initial text vector retrieval model to obtain the first cosine similarity score of the encoded feature vectors of the positive sample vector pair includes: inputting the positive sample vector pair into the dual encoder of the initial text vector retrieval model for encoding processing to obtain a first encoded feature vector corresponding to the query input vector and a second encoded feature vector corresponding to the positive sample input vector; determining the cosine similarity between the first encoded feature vector and the second encoded feature vector; and scaling the cosine similarity based on a preset scaling temperature coefficient to obtain the first cosine similarity score of the encoded feature vectors of the positive sample vector pair.

7. The method according to claim 1, characterized in that the step of fine-tuning the first text retrieval model based on the first contrastive learning sample data and the second contrastive learning sample data to obtain the second text retrieval model includes: inputting the first contrastive learning sample data into the first text retrieval model to obtain a first contrastive loss parameter corresponding to the first contrastive learning sample data; inputting the second contrastive learning sample data into the first text retrieval model to obtain a second contrastive loss parameter corresponding to the second contrastive learning sample data; performing loss calculation based on the first contrastive loss parameter, the second contrastive loss parameter, and preset weight parameters to obtain a total contrastive loss parameter; and updating the model parameters of the first text retrieval model based on the total contrastive loss parameter to obtain the second text retrieval model.

8. The method according to claim 7, characterized in that the step of inputting the second contrastive learning sample data into the first text retrieval model to obtain the second contrastive loss parameter corresponding to the second contrastive learning sample data includes: vectorizing the query text, the similar query text, and the dissimilar query texts in the second contrastive learning sample data to obtain a query input vector corresponding to the query text, a similar input vector corresponding to the similar query text, and dissimilar input vectors corresponding to the dissimilar query texts; constructing a similar vector pair consisting of the query input vector and the similar input vector, and dissimilar vector pairs each consisting of the query input vector and one of the dissimilar input vectors; inputting the similar vector pair and each of the dissimilar vector pairs into the first text retrieval model to obtain a third cosine similarity score of the encoded feature vectors of the similar vector pair and a corresponding fourth cosine similarity score of the encoded feature vectors of each dissimilar vector pair; and performing loss calculation based on the third cosine similarity score and the fourth cosine similarity scores to obtain the second contrastive loss parameter, wherein the second contrastive loss parameter is used to minimize the distance between the vectors in the similar vector pair and to maximize the distance between the vectors in each dissimilar vector pair.

9. The method according to claim 7, characterized in that inputting the similar vector pair into the first text retrieval model to obtain the third cosine similarity score of the encoded feature vectors of the similar vector pair includes: inputting the similar vector pair into the dual encoder of the first text retrieval model for encoding processing to obtain a third encoded feature vector corresponding to the query input vector and a fourth encoded feature vector corresponding to the similar input vector; determining the encoding cosine similarity between the third encoded feature vector and the fourth encoded feature vector; and scaling the encoding cosine similarity based on a preset scaling temperature coefficient to obtain the third cosine similarity score of the encoded feature vectors of the similar vector pair.

10. The method according to any one of claims 1 to 9, characterized in that the method further includes: obtaining a document query request and extracting the input query text of the document query request; and inputting the input query text into the second text retrieval model for text retrieval processing to obtain the target document that matches the input query text.

11. The method according to claim 10, characterized in that the step of inputting the input query text into the second text retrieval model for text retrieval processing to obtain the target document that matches the input query text includes: inputting the input query text into the second text retrieval model for vector encoding processing to obtain the input query vector corresponding to the input query text; and determining the cosine similarity score between the input query vector and each document vector in the document database, and obtaining the target document corresponding to the document vector with the highest cosine similarity score.
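The retrieval step of claim 11 can be sketched as a brute-force search over document vectors. This is a minimal illustration only: the names `retrieve_target_document` and `document_vectors` are hypothetical, the document database is modeled as an in-memory dictionary, and the input query vector is assumed to have already been produced by the second text retrieval model.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_target_document(
    query_vector: Sequence[float],
    document_vectors: Dict[str, Sequence[float]],
) -> Tuple[str, float]:
    """Return the id and score of the document most similar to the query."""
    if not document_vectors:
        raise ValueError("document database is empty")
    # Score every document once, then keep the highest-scoring entry.
    scores = {
        doc_id: cosine_similarity(query_vector, vector)
        for doc_id, vector in document_vectors.items()
    }
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Usage with toy two-dimensional vectors (illustrative values only).
database = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [0.7, 0.7]}
target_id, target_score = retrieve_target_document([0.9, 0.1], database)
```

For a large document database, an exhaustive scan of this kind would typically be replaced by an approximate nearest-neighbor index; the claim itself does not specify the search strategy.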

12. A model data processing device, characterized in that the device includes: a first sample construction module, used to obtain first contrastive learning sample data consisting of a query text, a positive sample document text, and a list of negative sample document texts, wherein the query text has a semantic relationship with the positive sample document text, and the query text does not have a semantic relationship with the negative sample document texts in the list of negative sample document texts; a second sample construction module, used to perform sample generation processing based on the query text and construct second contrastive learning sample data consisting of the query text, similar query texts, and a list of dissimilar query texts, wherein the query text is semantically similar to the similar query texts, and the query text is semantically dissimilar to the dissimilar query texts in the list of dissimilar query texts; a first model training module, used to train the initial text vector retrieval model based on the first contrastive learning sample data to obtain the first text retrieval model; and a second model training module, used to fine-tune the first text retrieval model based on the first contrastive learning sample data and the second contrastive learning sample data to obtain a second text retrieval model, wherein the second text retrieval model is used to query target documents that semantically match the input query text.

13. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 11.

14. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 11.

15. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 11.