A text retrieval method and apparatus
By constructing a target knowledge base containing knowledge documents, reference question texts, and keyword information, and utilizing text feature similarity and keyword matching, the problem of low retrieval accuracy in existing technologies is solved, achieving higher relevance and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING YUANLI WEILAI SCI & TECH CO LTD
- Filing Date
- 2026-03-31
- Publication Date
- 2026-06-30
Figure CN122309668A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a text retrieval method. This application also relates to a data processing apparatus, a computing device, a computer-readable storage medium, and a computer program product.
Background Technology
[0002] A knowledge base can be understood as a system for storing, managing, and retrieving information. It is typically used to support decision-making and problem-solving, and can be widely applied in areas such as enterprise knowledge management, intelligent question answering, and information retrieval. For example, in practical applications, a knowledge base can store multiple document fragments. When a user's query text is received, the knowledge base can retrieve document fragments related to the query text, and a corresponding response text can be generated based on the retrieved document fragments.
[0003] However, document retrieval through knowledge bases currently relies mainly on keyword retrieval. Keyword retrieval makes it difficult to accurately and comprehensively understand the semantic content and user intent behind the user's query text, resulting in low retrieval accuracy and inaccurate responses, and thus a poorer user experience.
Summary of the Invention
[0004] In view of this, embodiments of this application provide a text retrieval method. This application also relates to a knowledge base construction method, a text retrieval device, a computing device, a computer-readable storage medium, and a computer program product, to solve the aforementioned problems existing in the prior art.
[0005] According to a first aspect of the embodiments of this application, a text retrieval method is provided, including: obtaining the query question text; determining at least one target question text that matches the query question text from each reference question text, and determining the first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base; determining at least one second knowledge document that matches the query question text from the knowledge documents of the target knowledge base; and, based on each first knowledge document and each second knowledge document, determining the target knowledge document corresponding to the query question text.
[0006] According to a second aspect of the embodiments of this application, a knowledge base construction method is provided, including: Obtain the reference question text and secondary keyword information corresponding to each knowledge document; Construct a target knowledge base based on each knowledge document, the corresponding reference question text for each knowledge document, and the secondary keyword information.
[0007] According to a third aspect of the embodiments of this application, a text retrieval device is provided, comprising: an acquisition module configured to obtain the query question text; a first retrieval module configured to determine at least one target question text that matches the query question text from each reference question text, and to determine the first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base; a second retrieval module configured to determine at least one second knowledge document that matches the query question text from the knowledge documents of the target knowledge base; and a determination module configured to determine the target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document.
[0008] According to a fourth aspect of the embodiments of this application, a computing device is provided, comprising: Memory and processor; The memory is used to store computer programs / instructions, and the processor is used to execute the computer programs / instructions, which, when executed by the processor, implement the steps of the above-described text retrieval method.
[0009] According to a fifth aspect of the embodiments of this application, a computer-readable storage medium is provided that stores a computer program / instructions, which, when executed by a processor, implement the steps of the above-described text retrieval method.
[0010] According to a sixth aspect of the embodiments of this application, a computer program product is provided, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described text retrieval method.
[0011] The text retrieval method provided in this application can obtain a query question text; determine at least one target question text that matches the query question text from each reference question text, and determine a first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base; determine at least one second knowledge document that matches the query question text from each knowledge document in the target knowledge base; and determine a target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document.
[0012] In one embodiment of this application, the target knowledge base stores not only the knowledge documents but also the reference question texts corresponding to them. When retrieving target knowledge documents that match the query question text, both the knowledge documents themselves and their corresponding reference question texts can therefore be consulted, which improves the relevance between the retrieved target knowledge documents and the query question text.
Attached Figure Description
[0013] Figure 1 is a flowchart of a text retrieval method provided in an embodiment of this application; Figure 2 is a schematic diagram of the data flow of a text retrieval method provided in one embodiment of this application; Figure 3 is a flowchart of a knowledge base construction method provided in one embodiment of this application; Figure 4 is a schematic diagram of the structure of a text retrieval device provided in one embodiment of this application; Figure 5 is a structural block diagram of a computing device provided in one embodiment of this application.
Detailed Implementation
[0014] Many specific details are set forth in the following description to provide a full understanding of this application. However, this application can be implemented in many other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application; therefore, this application is not limited to the specific embodiments disclosed below.
[0015] The terminology used in one or more embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the scope of one or more embodiments of this application. The singular forms “a,” “an,” and “the” used in one or more embodiments of this application and in the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” used in one or more embodiments of this application refers to and includes any or all possible combinations of one or more associated listed items.
[0016] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this application, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this application, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."
[0017] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0018] First, the terms and concepts involved in one or more embodiments of this application will be explained.
[0019] A query is a statement or request initiated by a user from another computing device to a knowledge base that stores multiple documents. It is used to read, filter, or aggregate data. In practical applications, a query can be used to retrieve documents from the database that match the query and return the retrieved documents to the user. Alternatively, it can be used to generate a response message (such as a reply text) corresponding to the query and return it to the user.
[0020] Elasticsearch (ES) is a distributed full-text search and analysis engine built on Apache Lucene. ES features real-time performance, a distributed architecture, and versatility. Real-time performance means it can store, retrieve, and analyze data in near real-time. A distributed architecture means ES supports high availability and horizontal scalability. Versatility means that in addition to full-text search, ES can be widely used in scenarios such as log analysis, real-time monitoring, and intelligent analysis.
[0021] Thrift: Apache Thrift is an open-source, cross-language remote procedure call (RPC) framework developed by Facebook. It enables cross-language communication and efficient serialization, and is often used in large-scale distributed architectures.
[0022] Embedding is a technique that maps discrete data (such as words, users, etc.) to a low-dimensional, continuous, dense vector space.
[0023] Rerank: Rerank refers to the process of re-ranking the initially selected candidate results in search, recommendation, or RAG (Retrieval-Augmented Generation) systems.
[0024] As described in the background section, current document retrieval methods using knowledge bases typically rely on keyword search. However, keyword search makes it difficult to accurately and comprehensively understand the semantic content and user intent corresponding to the user's query text, resulting in low retrieval accuracy and potentially inaccurate responses, thus impacting user experience. While it's possible to vectorize the knowledge content as a whole, this approach fails to fully exploit the multi-dimensional semantic features of the knowledge, leading to low recall rates for answers that directly match the question.
[0025] To address the aforementioned issues, this application provides a text retrieval method, and also relates to a knowledge base construction method, a text retrieval device, a computing device, a computer-readable storage medium, and a computer program product, which will be described in detail in the following embodiments.
[0026] Figure 1 is a flowchart of a text retrieval method provided in an embodiment of this application. As shown in Figure 1, the method specifically includes the following steps:
Step 102: Obtain the query question text.
[0027] It should be noted that the executing entity for the text retrieval method and the knowledge base construction method provided in this specification can be any computing device with computing capabilities, such as a server or a terminal. This specification does not impose specific restrictions, and the executing entity can be configured according to actual needs. Furthermore, the executing entity for the knowledge base construction method can be the same as or different from the executing entity for the text retrieval method; this specification also does not impose specific restrictions, and the choice can be made based on the actual situation. For ease of description, the following detailed description of the technical solution in this specification uses a server as the executing entity.
[0028] In one or more embodiments of this specification, a user can send a text retrieval request to a server via another computing device. The text retrieval request may carry the query text, and the server can then receive the text retrieval request and obtain the query text.
[0029] Step 104: Determine at least one target question text that matches the query question text from each reference question text, and determine the first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base.
[0030] In one or more embodiments of this specification, a target knowledge base can be pre-constructed. This target knowledge base stores multiple knowledge documents and corresponding reference question texts for each knowledge document. The reference question texts are the questions corresponding to each knowledge document, or, in other words, the questions that are most likely to be queried for each knowledge document. Alternatively, the reference question texts can be standard question texts that match the knowledge documents. These standard question texts can be pre-set, or the matching degree between the reference question texts and the knowledge documents can be greater than a preset matching degree. The matching degree between the reference question texts and the knowledge documents can be calculated using a pre-trained language model.
[0031] It should be noted that this specification does not limit the methods for obtaining the knowledge documents and their corresponding reference question texts. For example, internet technology can be used to obtain papers, manuals, guidelines, etc., from the internet as knowledge documents. Specifically, the target knowledge base can be a knowledge base for a specific field, such as education, in which case the knowledge documents in the target knowledge base can be text content such as papers, articles, and novels. Knowledge documents can also be generated using pre-trained language models such as large language models. Correspondingly, the reference question texts for the knowledge documents can be set manually, generated using pre-trained language models such as large language models, or obtained from the historical search records of the knowledge documents to serve as reference query questions.
[0032] In one or more embodiments of this specification, for each knowledge document in the target knowledge base, the knowledge document and preset prompt information can be input into a preset pre-trained language model. The preset prompt information is used to instruct the preset pre-trained model to generate at least one reference question text corresponding to the knowledge document. The reference question text is a question with a higher frequency of being raised for the knowledge document than a preset frequency, thereby obtaining the reference question text corresponding to each knowledge document. For example, the preset prompt information can be: Please generate possible questions for the knowledge document, or it can be: Please generate high-frequency questions for the knowledge document.
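The prompt-based generation of reference question texts described above can be sketched as follows. This is only an illustration under stated assumptions: the `generate` callable stands in for whatever pre-trained language model is used, and `PROMPT_TEMPLATE`, `generate_reference_questions`, `fake_model`, and the one-question-per-line output format are hypothetical names and conventions that do not come from the application.

```python
import re
from typing import Callable, List

# Hypothetical prompt wording, adapted from the example prompt in the text.
PROMPT_TEMPLATE = (
    "Please generate high-frequency questions for the knowledge document, "
    "one question per line.\n\nKnowledge document:\n{document}"
)


def generate_reference_questions(
    document: str,
    generate: Callable[[str], str],
    max_questions: int = 3,
) -> List[str]:
    """Ask a language model for reference questions about one knowledge document."""
    raw_output = generate(PROMPT_TEMPLATE.format(document=document))
    questions = []
    for line in raw_output.splitlines():
        # Strip an optional list marker such as "1.", "2)", "-" or "*".
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip()
        if cleaned:
            questions.append(cleaned)
    return questions[:max_questions]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): returns canned text."""
    return "1. How do I reset my password?\n- Where is the reset link sent?\n"


print(generate_reference_questions("Password reset guide", fake_model))
```

Passing the model as a callable keeps the sketch independent of any particular model API; a real deployment would replace `fake_model` with an actual call.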
[0033] In one or more embodiments of this specification, feature vectors of each knowledge document can be extracted using a preset text encoder as third text features corresponding to each knowledge document. Furthermore, feature vectors corresponding to reference question texts of each knowledge document can be extracted using the same preset text encoder to obtain second text features corresponding to each reference question text.
[0034] Therefore, based on the above, each knowledge document, the third text feature corresponding to each knowledge document, the reference question text corresponding to the knowledge document, and the second text feature corresponding to each reference question text can be stored in the target knowledge base.
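One possible in-memory layout for such a knowledge base entry is sketched below. The names `KnowledgeEntry`, `build_knowledge_base`, and `toy_encode` are hypothetical, and `toy_encode` is a deliberately crude stand-in for the preset text encoder, which the application does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class KnowledgeEntry:
    doc_id: str
    document: str
    doc_feature: Vector  # "third text feature" of the knowledge document
    reference_questions: List[str] = field(default_factory=list)
    question_features: List[Vector] = field(default_factory=list)  # "second text features"


def build_knowledge_base(
    documents: Dict[str, str],
    questions_by_doc: Dict[str, List[str]],
    encode: Callable[[str], Vector],
) -> List[KnowledgeEntry]:
    """Encode each document and its reference questions with the same encoder."""
    entries = []
    for doc_id, text in documents.items():
        questions = questions_by_doc.get(doc_id, [])
        entries.append(
            KnowledgeEntry(
                doc_id=doc_id,
                document=text,
                doc_feature=encode(text),
                reference_questions=list(questions),
                question_features=[encode(q) for q in questions],
            )
        )
    return entries


def toy_encode(text: str) -> Vector:
    """Stand-in encoder (assumption): character count and word count."""
    return [float(len(text)), float(len(text.split()))]


kb = build_knowledge_base(
    {"d1": "Password reset guide"},
    {"d1": ["How do I reset my password?"]},
    toy_encode,
)
print(kb[0].doc_feature, len(kb[0].question_features))
```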
[0035] In one or more embodiments of this specification, determining at least one target question text that matches the query question text from each reference question text includes: Obtain the first text features of the query question text, and obtain the second text features of each reference question text; Calculate the first similarity between the first text feature and each of the second text features; Based on each first similarity score, at least one target question text that matches the query question text is determined.
[0036] In one or more embodiments of this specification, in order to improve the similarity between the retrieved knowledge documents and the query question text, a reference question text that is theoretically most likely to be asked can be preset for each knowledge document, so that relevance retrieval can be performed from the dimension of the reference question text.
[0037] Specifically, the first text features of the query question text can be extracted by a preset text encoder, and the second text features of the reference question texts corresponding to each knowledge document can be obtained from the target knowledge base. The first similarity between the first text features and each second text feature can then be calculated. Based on each first similarity, at least one target question text matching the query question text can be determined from the reference question texts. Finally, the first knowledge document corresponding to each target question text can be determined.
[0038] In one or more embodiments of this specification, determining at least one target question text that matches the query question text based on each first similarity score includes: The reference question text corresponding to the first similarity score that is greater than the first preset similarity threshold is taken as the target question text that matches the query question text.
[0039] In practical applications, a first preset similarity threshold can be set, which can be adjusted based on actual needs. Then, the reference question text corresponding to a first similarity score greater than the first preset similarity threshold can be used as the target question text to match the query question text.
[0040] In one or more embodiments of this specification, after obtaining the target question text, the knowledge document corresponding to the target question text can be used as the first knowledge document that matches the query question text.
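The thresholding step and the mapping from target question texts back to first knowledge documents might look like the sketch below. It assumes the first similarities have already been computed; the function name and the choice to keep the best-scoring question per document are illustrative assumptions, not something the application states.

```python
from typing import Dict


def select_first_documents(
    first_similarity: Dict[str, float],
    question_to_doc: Dict[str, str],
    threshold: float,
) -> Dict[str, float]:
    """Return {doc_id: best first similarity} for questions above the threshold."""
    first_docs: Dict[str, float] = {}
    for question, score in first_similarity.items():
        if score > threshold:  # strictly greater, as in the text
            doc_id = question_to_doc[question]
            # Assumption: when several questions of one document match,
            # keep the highest similarity for that document.
            first_docs[doc_id] = max(score, first_docs.get(doc_id, score))
    return first_docs


similarities = {"q1": 0.9, "q2": 0.7, "q3": 0.4}
owners = {"q1": "d1", "q2": "d1", "q3": "d2"}
print(select_first_documents(similarities, owners, threshold=0.6))
```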
[0041] Step 106: Determine at least one second knowledge document from the knowledge documents of the target knowledge base that matches the query question text.
[0042] In one or more embodiments of this specification, determining at least one second knowledge document matching the query question text from the knowledge documents of the target knowledge base includes: Obtain the first text features of the query question text, and obtain the third text features of each knowledge document; Calculate the second similarity between the first text feature and each of the third text features; Based on each second similarity score, at least one second knowledge document matching the query question text is determined.
[0043] As mentioned above, the target knowledge base stores the third text features corresponding to each knowledge document. Therefore, the first text features of the query question text can be extracted by a preset text encoder, and then, based on the third text features and the first text features, the second knowledge document matching the query question text can be determined from each knowledge document.
[0044] In practical applications, for each knowledge document, the similarity between the first text feature and the corresponding third text feature can be calculated as the second similarity for that knowledge document, thereby obtaining the second similarity for each knowledge document.
[0045] It should be noted that, in one or more embodiments of this specification, when calculating the similarity between text features, cosine similarity, Euclidean distance, etc. can be calculated. This specification does not limit the specific method used to calculate similarity.
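The two measures named above can be written in plain Python as follows. Note that Euclidean distance is a distance, so smaller values mean more similar; the `distance_to_similarity` conversion shown here is one common choice and is an assumption, not something the application prescribes.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_similarity(distance: float) -> float:
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed mapping)."""
    return 1.0 / (1.0 + distance)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```

Whichever measure is chosen, the preset similarity thresholds need to be tuned for that measure, since the two produce values on different scales.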
[0046] In one or more embodiments of this specification, determining at least one second knowledge document that matches the query question text based on each second similarity score includes: The knowledge document corresponding to the second similarity value that is greater than the second preset similarity threshold is taken as the second knowledge document that matches the query question text.
[0047] In practical applications, a second preset similarity threshold can be set. This second preset similarity threshold can be the same as or different from the first preset similarity threshold, depending on the actual situation.
[0048] In one or more embodiments of this specification, for each knowledge document in the target knowledge base, it can be determined whether the second similarity corresponding to the knowledge document is greater than a second preset similarity threshold. If it is determined that the second similarity corresponding to the knowledge document is greater than the second preset similarity threshold, the knowledge document can be considered as a second knowledge document matching the query question text. Conversely, if it is determined that the second similarity corresponding to the knowledge document is less than or equal to the second preset similarity threshold, the knowledge document is not considered a second knowledge document matching the query question text.
[0049] Step 108: Based on each first knowledge document and each second knowledge document, determine the target knowledge document corresponding to the query question text.
[0050] Based on the aforementioned steps 102-106, each first knowledge document and each second knowledge document can be obtained. It should be understood that the first knowledge documents and the second knowledge documents may overlap; that is, the same knowledge document can be both a first knowledge document and a second knowledge document. Therefore, when determining the target knowledge document corresponding to the query question text based on each first and second knowledge document, the knowledge documents that appear among both the first and the second knowledge documents can be used as the target knowledge documents corresponding to the query question text.
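Selecting the documents common to both retrieval paths amounts to an intersection. A minimal sketch, assuming documents are identified by string IDs (the function name and the order-preserving behavior are illustrative choices):

```python
from typing import List


def intersect_documents(first_docs: List[str], second_docs: List[str]) -> List[str]:
    """Documents found by both retrieval paths, in first-path order, no repeats."""
    second_set = set(second_docs)
    seen = set()
    common = []
    for doc_id in first_docs:
        if doc_id in second_set and doc_id not in seen:
            seen.add(doc_id)
            common.append(doc_id)
    return common


print(intersect_documents(["d1", "d2", "d3"], ["d2", "d3", "d4"]))
```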
[0051] Furthermore, in one or more embodiments of this specification, determining the target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document includes: Obtain a first sorting sequence and a second sorting sequence, wherein the first sorting sequence is obtained based on a first similarity between each first knowledge document and the query question text, and the second sorting sequence is obtained based on a second similarity between each second knowledge document and the query question text; For each candidate knowledge document, the target similarity between the candidate knowledge document and the query question text is determined based on the first ranking of the candidate knowledge document in the first ranking sequence and the second ranking of the candidate knowledge document in the second ranking sequence, wherein the candidate knowledge document is any one of each first knowledge document and each second knowledge document; Based on the target similarity of each knowledge document, the target knowledge document corresponding to the query question text is determined.
[0052] Based on the foregoing, the first text features of the query question text and the second text features of the reference question texts corresponding to each knowledge document can be obtained. The first similarity between the first text features and each second text feature can then be calculated, yielding the first knowledge documents and the first similarity corresponding to each of them. Correspondingly, the third text features of each knowledge document itself can be obtained and the second similarity between the third text features and the first text features calculated, yielding the second knowledge documents and the second similarity corresponding to each of them.
[0053] Therefore, in one or more embodiments of this specification, the first knowledge documents can be sorted based on the first similarity corresponding to each first knowledge document, for example, sorting the first knowledge documents in descending order of the first similarity to obtain a first sorting sequence. Correspondingly, the second knowledge documents can be sorted based on the second similarity corresponding to each second knowledge document, for example, sorting the second knowledge documents in descending order of the second similarity to obtain a second sorting sequence. That is to say, the first sorting sequence in this specification can be understood as a knowledge document sequence obtained by sorting the first knowledge documents based on the first similarity between the first knowledge documents and the query question text, and the second sorting sequence can be understood as a knowledge document sequence obtained by sorting the second knowledge documents based on the second similarity between the second knowledge documents and the query question text.
[0054] It should be understood that in practical applications, knowledge documents can also be sorted in ascending order of similarity. This specification does not impose specific restrictions, and the order can be determined based on the actual situation.
[0055] In practical applications, after obtaining the first and second sorting sequences, the first ranking and the second ranking of each candidate knowledge document can be determined. Then, based on the first ranking and the second ranking corresponding to the candidate knowledge document, the target similarity between the candidate knowledge document and the query question text can be calculated.
[0056] It should be noted that the first ranking can be understood as the position of the candidate knowledge document in the first ranking sequence. Correspondingly, the second ranking can be understood as the position of the candidate knowledge document in the second ranking sequence. For example, assuming the first ranking sequence is: knowledge document 1, knowledge document 2, knowledge document 3, and assuming the candidate knowledge document is knowledge document 2, then the first ranking of the candidate knowledge document is 2.
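Building a sorting sequence and reading off a ranking can be sketched as below, assuming similarities are held in a dictionary keyed by document ID. The function names are hypothetical, and returning `None` for a document that is absent from a sequence is a convention chosen for this sketch.

```python
from typing import Dict, List, Optional


def sorting_sequence(similarity_by_doc: Dict[str, float]) -> List[str]:
    """Document IDs sorted in descending order of similarity."""
    return sorted(similarity_by_doc, key=lambda d: similarity_by_doc[d], reverse=True)


def ranking_of(doc_id: str, sequence: List[str]) -> Optional[int]:
    """1-based position of a document in a sorting sequence, or None if absent."""
    return sequence.index(doc_id) + 1 if doc_id in sequence else None


first_sequence = sorting_sequence({"doc2": 0.8, "doc1": 0.9, "doc3": 0.7})
print(first_sequence)
print(ranking_of("doc2", first_sequence))
```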
[0057] The candidate knowledge document can be any one of the first knowledge documents and the second knowledge documents. Alternatively, it can be any one of the first and second knowledge documents after removing duplicates.
[0058] In one or more embodiments of this specification, for each candidate knowledge document, determining the target similarity between the candidate knowledge document and the query question text based on the candidate knowledge document's first ranking in the first ranking sequence and its second ranking in the second ranking sequence includes: calculating the sum of the reciprocals corresponding to the first ranking and the second ranking, and using the sum of the reciprocals as the target similarity between the candidate knowledge document and the query question text.
[0059] In practical applications, Formula 1 can be used to calculate the target similarity between a candidate knowledge document and the query question text. Formula 1 is shown below:
$$S(d) = \sum_{i} \frac{1}{k + r_i(d)} \qquad \text{(Formula 1)}$$
In Formula 1, $S(d)$ denotes the target similarity; $k$ is a preset constant, specifically 60; $d$ denotes the candidate knowledge document; and $r_i(d)$ denotes the ranking of the candidate knowledge document in the $i$-th ranking sequence. A ranking sequence that does not contain the candidate knowledge document contributes no term to the sum. It should be noted that the ranking sequences used in Formula 1 are sorted in descending order of similarity.
[0060] For example, with similarity ordered from highest to lowest, suppose the first sorted sequence is: knowledge document 1, knowledge document 2, knowledge document 3; the second sorted sequence is: knowledge document 2, knowledge document 3, knowledge document 4; and k is 60. The values below are truncated to four decimal places. When the candidate knowledge document is knowledge document 1, its first ranking in the first sorted sequence is 1, and it does not appear in the second sorted sequence, so it has no second ranking. The target similarity corresponding to knowledge document 1 is therefore 1/(60 + 1), approximately 0.0163. When the candidate knowledge document is knowledge document 2, its first ranking is 2 and its second ranking is 1. The target similarity corresponding to knowledge document 2 is therefore 1/(60 + 2) + 1/(60 + 1), approximately 0.0325. When the candidate knowledge document is knowledge document 3, its first ranking is 3 and its second ranking is 2. The target similarity corresponding to knowledge document 3 is therefore 1/(60 + 3) + 1/(60 + 2), approximately 0.0320. When the candidate knowledge document is knowledge document 4, it does not appear in the first sorted sequence, so it has no first ranking, and its second ranking is 3. The target similarity corresponding to knowledge document 4 is therefore 1/(60 + 3), approximately 0.0158.
[0061] In one or more embodiments of this specification, when the sorting sequences include a first sorting sequence and a second sorting sequence, the reciprocal of the sum of the first ranking and a preset constant can be calculated, and the reciprocal of the sum of the second ranking and the preset constant can be calculated. The two reciprocals are then added, and their sum is used as the target similarity between the candidate knowledge document and the query question text.
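The rank-based fusion described above can be sketched in a few lines; it resembles what is commonly called reciprocal rank fusion, although the application itself does not use that name. The function name `fuse_rankings` is hypothetical, and the sketch assumes each sorting sequence is a list of document IDs in descending order of similarity.

```python
from typing import Dict, List, Sequence


def fuse_rankings(sequences: Sequence[List[str]], k: int = 60) -> Dict[str, float]:
    """Sum 1 / (k + ranking) over every sorting sequence that contains a document."""
    target_similarity: Dict[str, float] = {}
    for sequence in sequences:
        for ranking, doc_id in enumerate(sequence, start=1):
            target_similarity[doc_id] = (
                target_similarity.get(doc_id, 0.0) + 1.0 / (k + ranking)
            )
    return target_similarity


first_sequence = ["doc1", "doc2", "doc3"]
second_sequence = ["doc2", "doc3", "doc4"]
scores = fuse_rankings([first_sequence, second_sequence], k=60)

# List candidates from highest to lowest target similarity.
for doc_id in sorted(scores, key=lambda d: scores[d], reverse=True):
    print(doc_id, scores[doc_id])
```

A document that appears in only one sequence simply receives one term, which is why documents found by both retrieval paths tend to score highest.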
[0062] Therefore, the target knowledge document can be selected from the candidate knowledge documents based on the target similarity between each candidate knowledge document and the query question text.
[0063] In practical applications, candidate knowledge documents with a target similarity greater than the fourth preset similarity threshold can be used as target knowledge documents. That is, for each candidate knowledge document, if the target similarity corresponding to that candidate knowledge document is determined to be greater than the fourth preset similarity threshold, then that candidate knowledge document is determined to be the target knowledge document corresponding to the query question text.
[0064] It should be noted that the fourth preset similarity threshold can be set based on actual needs. It can be the same as or different from the aforementioned first preset similarity threshold and second preset similarity threshold, and the third preset similarity threshold described below. This specification does not impose any specific restrictions.
[0065] Continuing with the previous example, the target similarity of knowledge document 1 is 0.0163, the target similarity of knowledge document 2 is 0.0325, the target similarity of knowledge document 3 is 0.0320, and the target similarity of knowledge document 4 is 0.0158. Assuming the fourth preset similarity threshold is 0.03, the target knowledge documents are knowledge document 2 and knowledge document 3, since only their target similarities exceed the threshold.
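The threshold step can be sketched as below. The function name `select_target_documents` is hypothetical, and the threshold of 0.03 is an illustrative assumption chosen so that knowledge documents 2 and 3 pass. With two sorting sequences and k = 60, no fused score can exceed 2/61 (about 0.0328), so a usable threshold has to lie on that scale.

```python
from typing import Dict, List


def select_target_documents(target_similarity: Dict[str, float], threshold: float) -> List[str]:
    """Keep the candidates whose target similarity is strictly greater than the threshold."""
    return [doc_id for doc_id, score in target_similarity.items() if score > threshold]


# Target similarities as quoted in the worked example.
scores = {"doc1": 0.0163, "doc2": 0.0325, "doc3": 0.0320, "doc4": 0.0158}

selected = select_target_documents(scores, threshold=0.03)
print(selected)
```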
[0066] In one or more embodiments of this specification, determining the target knowledge document corresponding to the query question text based on the target similarity corresponding to each knowledge document includes: using each knowledge document whose target similarity is greater than the third preset similarity threshold as a reference knowledge document; generating a target score for each reference knowledge document using a preset scoring model; and determining the target knowledge document corresponding to the query question text based on each target score.
[0067] In practical applications, to further verify how well the selected knowledge documents match the query question text, the knowledge documents can be further filtered using a preset scoring model.
[0068] Specifically, knowledge documents with a target similarity greater than the third preset similarity threshold can be used as reference knowledge documents. Then, each reference knowledge document and the query question text can be input into the preset scoring model, which outputs a relevance score between each reference knowledge document and the query question text, serving as the target score. More specifically, a preset prompt, each reference knowledge document, and the query question text can be input into the preset scoring model. The prompt instructs the model to assess the relevance between each reference knowledge document and the query question text, yielding the target score for each reference knowledge document.
[0069] It should be noted that the third preset similarity threshold can be set based on actual needs. It can be the same as or different from the aforementioned first preset similarity threshold or second preset similarity threshold. This specification does not impose any specific restrictions.
[0070] It should also be noted that the preset scoring model is used to generate a relevance score between the knowledge document and the query question text. This preset scoring model can be any pre-trained language model, such as a deep semantic model or a large language model, etc.
[0071] In one or more embodiments of this specification, the target knowledge document corresponding to the query question text can be determined based on each target score. Specifically, the reference knowledge document corresponding to a target score greater than a preset score threshold can be used as the target knowledge document corresponding to the query question text.
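The scoring step can be sketched as follows, under stated assumptions. The prompt wording, the 0 to 10 scale, the score threshold, and all function names are hypothetical. `toy_scoring_model` is a word-overlap stand-in used only so the sketch runs; it is not a language model, and a real preset scoring model would replace it.

```python
from typing import Callable, Dict, List

# Hypothetical prompt; the specification does not give the actual wording.
PROMPT_TEMPLATE = (
    "Rate from 0 to 10 how relevant the document is to the question.\n"
    "Question: {query}\n"
    "Document: {document}\n"
    "Score:"
)


def build_prompt(query: str, document: str) -> str:
    """Combine the preset prompt, the query question text and one reference document."""
    return PROMPT_TEMPLATE.format(query=query, document=document)


def toy_scoring_model(prompt: str) -> float:
    """Stand-in scorer: share of question words found in the document, scaled to 0..10."""
    fields = dict(line.split(": ", 1) for line in prompt.splitlines() if ": " in line)
    question_words = set(fields["Question"].lower().split())
    document_words = set(fields["Document"].lower().split())
    return 10.0 * len(question_words & document_words) / max(len(question_words), 1)


def filter_by_target_score(
    query: str,
    reference_documents: Dict[str, str],
    scoring_model: Callable[[str], float],
    score_threshold: float,
) -> List[str]:
    """Return the reference documents whose target score exceeds the threshold."""
    targets = []
    for doc_id, text in reference_documents.items():
        if scoring_model(build_prompt(query, text)) > score_threshold:
            targets.append(doc_id)
    return targets


reference_documents = {
    "doc2": "reset your password from the account settings page",
    "doc3": "shipping takes three to five business days",
}
targets = filter_by_target_score(
    "how to reset my password", reference_documents, toy_scoring_model, score_threshold=3.0
)
print(targets)
```

Passing the model in as a callable keeps the filtering logic independent of whichever pre-trained model is used to produce the score.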
[0072] The above text retrieval method can obtain a query question text; determine at least one target question text that matches the query question text from each reference question text, and determine a first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base; determine at least one second knowledge document that matches the query question text from each knowledge document in the target knowledge base; and determine a target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document.
[0073] By using the text retrieval method described above, since the target knowledge base not only stores knowledge documents, but also reference question texts corresponding to the knowledge documents, when retrieving target knowledge documents similar to the query question text, one can refer not only to the knowledge document itself, but also to the reference question texts corresponding to the knowledge document, thereby improving the relevance between the retrieved target knowledge documents and the query question text.
[0074] Furthermore, since the target knowledge base provided in this specification not only contains the text features of the knowledge documents themselves, but also the text features corresponding to the reference question texts of the knowledge documents, it can also be guaranteed that the target knowledge base can add new knowledge simply, efficiently and without redundancy.
[0075] Furthermore, to improve the relevance between the retrieved target knowledge documents and the query question text, keyword matching can be performed. Specifically, determining the target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document includes: extracting first keyword information from the query question text; determining, based on the second keyword information corresponding to each knowledge document and the first keyword information, at least one third knowledge document that matches the query question text; and determining, based on each first knowledge document, each second knowledge document, and each third knowledge document, the target knowledge document corresponding to the query question text.
[0076] In practical applications, keywords can be extracted from each knowledge document manually or using any pre-trained language model, serving as the second keyword information for that knowledge document, and can then be stored in the target knowledge base. At retrieval time, keywords are extracted from the query question text as the first keyword information, and the second keyword information of each knowledge document is retrieved from the target knowledge base. The matching degree between the first keyword information and each piece of second keyword information can then be calculated, and based on this matching degree, at least one third knowledge document matching the query question text can be obtained.
[0077] The keyword information can include multiple keywords, which can be key entities, key locations, key times, etc., and can be set based on the actual situation. This specification does not impose specific restrictions.
[0078] It should be noted that when calculating the matching degree between the first keyword information and each of the second keyword information, a pre-set text encoder can be used to extract the feature vectors corresponding to the first keyword information and each of the second keyword information. Then, the similarity between the feature vectors of the first keyword information and the feature vectors of each of the second keyword information is calculated as the matching degree between the first keyword information and each of the second keyword information. Alternatively, any pre-trained language model can be used to parse and generate matching scores between the first keyword information and each of the second keyword information to obtain the matching degree between them. In other words, the matching degree represents the similarity or correlation between the first keyword information and the second keyword information.
[0079] In one or more embodiments of this specification, when obtaining at least one third knowledge document matching the query question text based on the matching degree, a knowledge document whose matching degree (that is, its similarity or matching score) is greater than a preset matching threshold can be used as a third knowledge document matching the query question text.
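A minimal sketch of the keyword route follows. The text computes the matching degree from encoder feature vectors or from a language model. Here a Jaccard overlap between keyword sets is swapped in as a simple stand-in, which is an assumption for illustration only. The function names, keywords, and threshold are hypothetical.

```python
from typing import Dict, List, Set


def matching_degree(first_keywords: Set[str], second_keywords: Set[str]) -> float:
    """Jaccard overlap between two keyword sets; 0.0 when both sets are empty."""
    union = first_keywords | second_keywords
    return len(first_keywords & second_keywords) / len(union) if union else 0.0


def third_knowledge_documents(
    first_keywords: Set[str],
    second_keywords_by_doc: Dict[str, Set[str]],
    matching_threshold: float,
) -> List[str]:
    """Return the documents whose matching degree exceeds the preset threshold."""
    return [
        doc_id
        for doc_id, second_keywords in second_keywords_by_doc.items()
        if matching_degree(first_keywords, second_keywords) > matching_threshold
    ]


# Second keyword information, stored per knowledge document in the knowledge base.
second_keywords_by_doc = {
    "docA": {"refund", "order", "policy"},
    "docB": {"shipping", "order"},
    "docC": {"login"},
}
# First keyword information, extracted from the query question text.
first_keywords = {"refund", "order"}

third_docs = third_knowledge_documents(first_keywords, second_keywords_by_doc, 0.5)
print(third_docs)
```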
[0080] It should be noted that when determining the target knowledge document corresponding to the query question text based on each first knowledge document, each second knowledge document, and each third knowledge document, the aforementioned method of determining the target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document can be referred to, and will not be elaborated here.
[0081] Based on the above, this specification provides a data flow diagram of a text retrieval method, as shown in Figure 2. Figure 2 is a data flow diagram illustrating a text retrieval method provided in an embodiment of this application. As can be seen, after obtaining the query question text, a first knowledge document corresponding to the query question text can be retrieved from the target knowledge base based on the reference question text corresponding to the knowledge document. A second knowledge document corresponding to the query question text can be retrieved from the target knowledge base based on the knowledge document itself. Furthermore, a third knowledge document corresponding to the query question text can be retrieved from the target knowledge base based on the keyword information corresponding to the knowledge document. Then, at least one target knowledge document corresponding to the query question text can be determined from the first, second, and third knowledge documents.
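The merging of the three retrieval routes into a single candidate pool can be sketched as below. This covers only the pooling step, with hypothetical route names and document identifiers. How the final target knowledge documents are then chosen from the pool is described earlier in the text and is not reproduced here.

```python
from typing import Dict, List


def collect_candidates(routes: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each retrieved document to the routes that returned it, in first-seen order."""
    candidates: Dict[str, List[str]] = {}
    for route_name, documents in routes.items():
        for doc_id in documents:
            hits = candidates.setdefault(doc_id, [])
            if route_name not in hits:  # ignore duplicates within one route
                hits.append(route_name)
    return candidates


routes = {
    "reference_question": ["doc1", "doc2"],  # first knowledge documents
    "document": ["doc2", "doc3"],            # second knowledge documents
    "keyword": ["doc2", "doc4"],             # third knowledge documents
}

candidates = collect_candidates(routes)
for doc_id, hit_routes in candidates.items():
    print(doc_id, hit_routes)
```

Recording which routes returned each document is a design choice of this sketch. It makes it easy to see that a document found by all three routes, such as `doc2` here, has broader support than one found by a single route.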
[0082] The above text retrieval method retrieves knowledge documents from three aspects, which greatly improves the relevance between the retrieved target knowledge documents and the query text, avoids users obtaining knowledge documents with low relevance, or avoids generating response text with low relevance for users, thereby improving the user experience.
[0083] Furthermore, this specification also provides a method for constructing a knowledge base. Figure 3 is a flowchart of a knowledge base construction method provided in an embodiment of this application. As shown in Figure 3, the method specifically includes the following steps: Step 302: Obtain the reference question text and second keyword information corresponding to each knowledge document.
[0084] It should be noted that the methods for obtaining knowledge documents, the corresponding reference question texts for each knowledge document, and the second keyword information can be found in the text retrieval methods described above, and will not be elaborated here.
[0085] Step 304: Construct the target knowledge base based on each knowledge document, the corresponding reference question text for each knowledge document, and the second keyword information.
[0086] In one or more embodiments of this specification, each knowledge document, the corresponding reference question text for each knowledge document, and the second keyword information can be stored in the target knowledge base to achieve the construction of the target knowledge base.
[0087] Furthermore, in one or more embodiments of this specification, the third text features corresponding to the knowledge documents, the second text features of the reference question texts corresponding to each knowledge document, etc., can also be stored in the target knowledge base to improve retrieval efficiency and retrieval accuracy.
[0088] By using the above knowledge base construction method, the target knowledge base stores not only the knowledge documents but also the reference question texts and the keyword information corresponding to the knowledge documents. Therefore, when retrieving target knowledge documents similar to the query question text from the target knowledge base, one can refer not only to the knowledge document itself, but also to its reference question texts and keyword information, thereby greatly improving the relevance between the retrieved target knowledge documents and the query question text.
[0089] In practical applications, there may be a lack of knowledge documents. Therefore, when constructing the target knowledge base, historical query records in a specific domain can be referenced, or knowledge documents can be synthesized using a large language model. Furthermore, before storing the current knowledge document and its related information in the target knowledge base, it can first be determined whether the target knowledge base already contains the current knowledge document and/or its related information. If it does not, the current knowledge document and/or its related information can be stored in the target knowledge base.
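Steps 302 and 304, together with the existence check described above, can be sketched as follows. The class and field names are hypothetical, and treating two entries as duplicates when their identifier or document text is identical is an assumption. The specification does not say how existence is determined.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KnowledgeEntry:
    document: str                   # the knowledge document itself
    reference_questions: List[str]  # reference question texts for the document
    keywords: List[str]             # second keyword information


@dataclass
class TargetKnowledgeBase:
    entries: Dict[str, KnowledgeEntry] = field(default_factory=dict)

    def add(self, doc_id: str, entry: KnowledgeEntry) -> bool:
        """Store the entry unless the identifier or the document text already exists."""
        if doc_id in self.entries:
            return False
        if any(existing.document == entry.document for existing in self.entries.values()):
            return False
        self.entries[doc_id] = entry
        return True


knowledge_base = TargetKnowledgeBase()
entry = KnowledgeEntry(
    document="Refunds are issued within seven days of approval.",
    reference_questions=["How long does a refund take?"],
    keywords=["refund", "seven days"],
)
first_add = knowledge_base.add("doc1", entry)
duplicate_add = knowledge_base.add("doc1-copy", entry)  # same text, different identifier
print(first_add, duplicate_add, len(knowledge_base.entries))
```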
[0090] It should be noted that the text retrieval method and knowledge base construction method provided in this specification can be applied to different fields, such as the education field, e-commerce field, medical field, etc., as mentioned above.
[0091] Corresponding to the above method embodiments, this application also provides an embodiment of a text retrieval device. Figure 4 is a schematic diagram of the structure of a text retrieval device provided in one embodiment of this application. As shown in Figure 4, the device includes: an acquisition module 402, configured to obtain the query question text; a first retrieval module 404, configured to determine at least one target question text that matches the query question text from each reference question text, and to determine the first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base; a second retrieval module 406, configured to determine at least one second knowledge document from the knowledge documents of the target knowledge base that matches the query question text; and a determination module 408, configured to determine the target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document.
[0092] Optionally, the first retrieval module 404 is further configured to obtain a first text feature of the query question text and a second text feature of each reference question text; Calculate the first similarity between the first text feature and each of the second text features; Based on each first similarity score, at least one target question text that matches the query question text is determined.
[0093] Optionally, the first retrieval module 404 is further configured to use the reference question text corresponding to the first similarity greater than the first preset similarity threshold as the target question text that matches the query question text.
[0094] Optionally, the second retrieval module 406 is further configured to obtain a first text feature of the query question text and a third text feature of each knowledge document; Calculate the second similarity between the first text feature and each of the third text features; Based on each second similarity score, at least one second knowledge document matching the query question text is determined.
[0095] Optionally, the second retrieval module 406 is further configured to use the knowledge document corresponding to the second similarity greater than the second preset similarity threshold as the second knowledge document that matches the query question text.
[0096] Optionally, the determining module 408 is further configured to obtain a first sorting sequence and a second sorting sequence, wherein the first sorting sequence is obtained based on a first similarity between each first knowledge document and the query question text, and the second sorting sequence is obtained based on a second similarity between each second knowledge document and the query question text; For each candidate knowledge document, the target similarity between the candidate knowledge document and the query question text is determined based on the first ranking of the candidate knowledge document in the first ranking sequence and the second ranking of the candidate knowledge document in the second ranking sequence, wherein the candidate knowledge document is any one of each first knowledge document and each second knowledge document; Based on the target similarity of each knowledge document, the target knowledge document corresponding to the query question text is determined.
[0097] Optionally, the determining module 408 is further configured to calculate the sum of the reciprocals of the first ranking and the second ranking, and use the sum of the reciprocals as the target similarity between the candidate knowledge document and the query question text.
[0098] Optionally, the determining module 408 is further configured to use each knowledge document whose target similarity is greater than the third preset similarity threshold as a reference knowledge document; generate a target score for each reference knowledge document using a preset scoring model; and determine the target knowledge document corresponding to the query question text based on each target score.
[0099] Optionally, the determining module 408 is further configured to extract the first keyword information from the query question text; Based on the second keyword information corresponding to each knowledge document and the first keyword information, determine at least one third knowledge document that matches the query question text; Based on each first knowledge document, each second knowledge document, and each third knowledge document, determine the target knowledge document corresponding to the query question text.
[0100] The aforementioned text retrieval device can acquire a query question text; determine at least one target question text matching the query question text from various reference question texts, and determine a first knowledge document corresponding to each target question text, wherein the reference question texts are questions corresponding to pre-set knowledge documents and are stored in a pre-built target knowledge base; determine at least one second knowledge document matching the query question text from various knowledge documents in the target knowledge base; and determine a target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document.
[0101] With the aforementioned text retrieval device, since the target knowledge base not only stores knowledge documents but also corresponding reference question texts, when retrieving target knowledge documents similar to the query question text, one can refer not only to the knowledge document itself but also to the corresponding reference question text, thereby improving the relevance between the retrieved target knowledge documents and the query question text.
[0102] The above is an illustrative scheme of a text retrieval device according to this embodiment. It should be noted that the technical solution of this text retrieval device and the technical solution of the text retrieval method described above belong to the same concept. For details not described in detail in the technical solution of the text retrieval device, please refer to the description of the technical solution of the text retrieval method described above.
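The two retrieval modules can be sketched in software as follows. This is an illustration under stated assumptions, not the device itself: cosine similarity is assumed as the similarity measure, the text features are hand-written two-dimensional vectors rather than encoder output, and the class and method names are hypothetical. The determination module 408 is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query: Vector, features: Dict[str, Vector], threshold: float) -> List[Tuple[str, float]]:
    """Return (key, similarity) pairs above the threshold, most similar first."""
    hits = [(key, cosine(query, vector)) for key, vector in features.items()]
    return sorted((hit for hit in hits if hit[1] > threshold), key=lambda hit: -hit[1])


class TextRetrievalDevice:
    def __init__(self, question_features, question_to_doc, document_features):
        self.question_features = question_features  # second text features
        self.question_to_doc = question_to_doc      # reference question -> knowledge document
        self.document_features = document_features  # third text features

    def first_retrieval(self, query: Vector, threshold: float) -> List[str]:
        """Module 404: match reference questions, then map them to their documents."""
        ranked: List[str] = []
        for question, _ in retrieve(query, self.question_features, threshold):
            doc_id = self.question_to_doc[question]
            if doc_id not in ranked:
                ranked.append(doc_id)
        return ranked

    def second_retrieval(self, query: Vector, threshold: float) -> List[str]:
        """Module 406: match the knowledge documents themselves."""
        return [doc_id for doc_id, _ in retrieve(query, self.document_features, threshold)]


device = TextRetrievalDevice(
    question_features={"q1": [0.9, 0.1], "q2": [0.0, 1.0], "q3": [0.8, 0.6]},
    question_to_doc={"q1": "doc1", "q2": "doc2", "q3": "doc2"},
    document_features={"doc1": [0.6, 0.8], "doc2": [1.0, 0.2], "doc3": [-1.0, 0.0]},
)
query_feature = [1.0, 0.0]  # first text feature of the query question text
first_docs = device.first_retrieval(query_feature, threshold=0.7)
second_docs = device.second_retrieval(query_feature, threshold=0.5)
print(first_docs, second_docs)
```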
[0103] Figure 5 shows a structural block diagram of a computing device 500 according to an embodiment of this application. The components of the computing device 500 include, but are not limited to, a memory 510 and a processor 520. The processor 520 is connected to the memory 510 via a bus 530, and a database 550 is used to store data.
[0104] The computing device 500 also includes an access device 540, which enables the computing device 500 to communicate via one or more networks 560. Examples of these networks include Public Switched Telephone Network (PSTN), Local Area Network (LAN), Wide Area Network (WAN), Personal Area Network (PAN), or combinations of communication networks such as the Internet. The access device 540 may include one or more of any type of wired or wireless network interface (e.g., a network interface card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Wi-MAX (Worldwide Interoperability for Microwave Access) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.
[0105] In one embodiment of this application, the aforementioned components of the computing device 500 and other components not shown in Figure 5 can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 5 is for illustrative purposes only and is not intended to limit the scope of this application. Those skilled in the art can add or replace other components as needed.
[0106] The computing device 500 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or personal computers (PCs). The computing device 500 can also be a mobile or stationary server.
[0107] The processor 520 is used to execute a computer program / instructions which, when executed by the processor, implement the steps of the above-described text retrieval method.
[0108] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the text retrieval method described above belong to the same concept. For details not described in detail in the technical solution of the computing device, please refer to the description of the technical solution of the text retrieval method described above.
[0109] An embodiment of this specification also provides a computer-readable storage medium storing a computer program / instructions that, when executed by a processor, implement the steps of the text retrieval method described above.
[0110] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the text retrieval method described above belong to the same concept. For details not described in detail in the technical solution of the storage medium, please refer to the description of the technical solution of the text retrieval method described above.
[0111] An embodiment of this specification also provides a computer program product, including a computer program / instructions that, when executed by a processor, implement the steps of the above-described text retrieval method.
[0112] The above is an illustrative scheme of a computer program product according to this embodiment. It should be noted that the technical solution of this computer program product and the technical solution of the above-described text retrieval method belong to the same concept. For details not described in detail in the technical solution of the computer program product, please refer to the description of the technical solution of the above-described text retrieval method.
[0113] The foregoing has described specific embodiments of this application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0114] The computer instructions include computer program code, which may be in the form of source code, object code, executable file, or certain intermediate forms. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, and software distribution media, etc. It should be noted that the content included in the computer-readable medium may be appropriately added or removed according to the requirements of patent practice. For example, in some regions, according to patent practice, computer-readable media may not include electrical carrier signals and telecommunication signals.
[0115] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.
[0116] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.
[0117] The preferred embodiments disclosed above are merely illustrative of this application. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this application. These embodiments are selected and specifically described in this application to better explain the principles and practical applications of this application, thereby enabling those skilled in the art to better understand and utilize this application. This application is limited only by the claims and their full scope and equivalents.
Claims
1. A text retrieval method, characterized in that it comprises: obtaining a query question text; determining, from each reference question text, at least one target question text that matches the query question text, and determining the first knowledge document corresponding to each target question text, wherein the reference question text is a question corresponding to a pre-set knowledge document and is stored in a pre-built target knowledge base; determining, from the knowledge documents of the target knowledge base, at least one second knowledge document that matches the query question text; and determining, based on each first knowledge document and each second knowledge document, the target knowledge document corresponding to the query question text.
2. The method as described in claim 1, characterized in that, Determining at least one target question text that matches the query question text from each reference question text includes: Obtain the first text features of the query question text, and obtain the second text features of each reference question text; Calculate the first similarity between the first text feature and each of the second text features; Based on each first similarity score, at least one target question text that matches the query question text is determined.
3. The method as described in claim 2, characterized in that, Based on each first similarity score, at least one target question text matching the query question text is determined, including: The reference question text corresponding to the first similarity score that is greater than the first preset similarity threshold is taken as the target question text that matches the query question text.
4. The method as described in claim 1, characterized in that, Determining at least one second knowledge document from the knowledge documents of the target knowledge base that matches the query question text, including: Obtain the first text features of the query question text, and obtain the third text features of each knowledge document; Calculate the second similarity between the first text feature and each of the third text features; Based on each second similarity score, at least one second knowledge document matching the query question text is determined.
5. The method as described in claim 4, characterized in that, Based on each second similarity score, at least one second knowledge document matching the query question text is determined, including: The knowledge document corresponding to the second similarity value that is greater than the second preset similarity threshold is taken as the second knowledge document that matches the query question text.
6. The method according to any one of claims 1 to 5, characterized in that, Based on each first knowledge document and each second knowledge document, determine the target knowledge document corresponding to the query question text, including: Obtain a first sorting sequence and a second sorting sequence, wherein the first sorting sequence is obtained based on a first similarity between each first knowledge document and the query question text, and the second sorting sequence is obtained based on a second similarity between each second knowledge document and the query question text; For each candidate knowledge document, the target similarity between the candidate knowledge document and the query question text is determined based on the first ranking of the candidate knowledge document in the first ranking sequence and the second ranking of the candidate knowledge document in the second ranking sequence, wherein the candidate knowledge document is any one of each first knowledge document and each second knowledge document; Based on the target similarity of each knowledge document, the target knowledge document corresponding to the query question text is determined.
7. The method as described in claim 6, characterized in that determining, for each candidate knowledge document, the target similarity between the candidate knowledge document and the query question text based on its first ranking in the first ranking sequence and its second ranking in the second ranking sequence includes: calculating the sum of the reciprocals of the first ranking and the second ranking, and using the sum of the reciprocals as the target similarity between the candidate knowledge document and the query question text.
8. The method as described in claim 6, characterized in that determining the target knowledge document corresponding to the query question text based on the target similarity corresponding to each knowledge document includes: using each knowledge document whose target similarity is greater than the third preset similarity threshold as a reference knowledge document; generating a target score for each reference knowledge document using a preset scoring model; and determining the target knowledge document corresponding to the query question text based on each target score.
9. The method as described in claim 1, characterized in that, Based on each first knowledge document and each second knowledge document, determine the target knowledge document corresponding to the query question text, including: Extract the first keyword information from the query text; Based on the second keyword information corresponding to each knowledge document and the first keyword information, determine at least one third knowledge document that matches the query question text; Based on each first knowledge document, each second knowledge document, and each third knowledge document, determine the target knowledge document corresponding to the query question text.
10. A method for constructing a knowledge base, characterized by comprising: obtaining a reference question text and second keyword information corresponding to each knowledge document; and constructing a target knowledge base based on each knowledge document and the reference question text and second keyword information corresponding to each knowledge document.
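One possible in-memory layout for the target knowledge base of claim 10 is sketched below. The claim does not say how the reference question texts and keyword information are obtained, so they are taken as given inputs; `KnowledgeEntry`, `build_knowledge_base`, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class KnowledgeEntry:
    """One knowledge document together with its retrieval metadata."""
    document: str
    reference_questions: List[str]
    keywords: Set[str]


def build_knowledge_base(
    documents: Dict[str, str],
    reference_questions: Dict[str, List[str]],
    keywords: Dict[str, Set[str]],
) -> Dict[str, KnowledgeEntry]:
    """Associate each document with its reference questions and keywords."""
    knowledge_base: Dict[str, KnowledgeEntry] = {}
    for doc_id, text in documents.items():
        knowledge_base[doc_id] = KnowledgeEntry(
            document=text,
            # Documents without metadata get empty containers (assumption).
            reference_questions=list(reference_questions.get(doc_id, [])),
            keywords=set(keywords.get(doc_id, ())),
        )
    return knowledge_base


kb = build_knowledge_base(
    documents={
        "doc_a": "To reset a password, open the login page and choose Reset.",
        "doc_b": "Invoices are issued on the first day of each month.",
    },
    reference_questions={"doc_a": ["How do I reset my password?"]},
    keywords={"doc_a": {"password", "reset"}},
)
```

Keeping the reference questions and keywords alongside each document lets the question-matching, document-matching, and keyword-matching paths all resolve to the same document identifier.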
11. A text retrieval device, characterized by comprising: an acquisition module configured to obtain a query question text; a first retrieval module configured to determine, from the reference question texts, at least one target question text that matches the query question text, and to determine the first knowledge document corresponding to each target question text, wherein each reference question text is a question corresponding to a preset knowledge document and is stored in a pre-built target knowledge base; a second retrieval module configured to determine, from the knowledge documents of the target knowledge base, at least one second knowledge document that matches the query question text; and a determination module configured to determine the target knowledge document corresponding to the query question text based on each first knowledge document and each second knowledge document.
12. A computing device, characterized by comprising: a memory and a processor; wherein the memory is configured to store a computer program/instructions, and the processor is configured to execute the computer program/instructions, which, when executed by the processor, implement the steps of the method according to any one of claims 1 to 10.
13. A computer-readable storage medium storing a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.
14. A computer program product comprising a computer program/instructions, characterized in that the computer program/instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 10.