Candidate text determination method and apparatus

By combining semantic retrieval and the BM25 algorithm to filter candidate texts, the problem of low recall in question answering systems is solved, and the accuracy of candidate texts and the performance of question answering systems are improved.

CN117009488B · Active · Publication Date: 2026-06-19 · BEIJING KINGSOFT DIGITAL ENTERTAINMENT CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING KINGSOFT DIGITAL ENTERTAINMENT CO LTD
Filing Date
2021-04-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing question-answering systems, the low recall rate of information retrieval leads to inaccurate answers and affects system performance. This is mainly because the performance of the retrieval model depends on the training conditions, resulting in inaccurate semantic vectors and the recall of irrelevant text.

Method used

The method computes the similarity between the semantic vector of the question to be answered and the semantic vectors of the texts in the text library, and separately scores each text using word segmentation and word unit weights; texts with high similarity scores are selected as candidate texts. Combining semantic retrieval with the BM25 algorithm improves the accuracy of the candidate texts.

Benefits of technology

It improves the recall rate of information retrieval and the reliability of semantic retrieval, enhances the performance of the question-answering system, and ensures the accuracy of answers.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a method and apparatus for determining candidate texts. The method includes: determining the semantic vector of the question to be answered based on an acquired question; acquiring the semantic vectors of multiple texts in a text library; determining a first candidate text semantically related to the question to be answered from the text library based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the multiple texts; performing word segmentation on the question to be answered to obtain multiple first word units; determining a similarity score of each text relative to the question to be answered based on the weight value of each first word unit and the relevance value between each first word unit and each text in the text library; determining texts with similarity scores greater than a second threshold as second candidate texts; and determining candidate texts based on the first and second candidate texts. Using two methods to determine candidate texts improves the accuracy of the determined candidate texts.

Description

[0001] This application is a divisional application of application number 202110484317.7, filed on April 30, 2021, entitled "A text processing method and apparatus".

Technical Field

[0002] This application relates to the field of artificial intelligence, and in particular to a method and apparatus for determining candidate text, a computing device, and a computer-readable storage medium.

Background Technology

[0003] In a question-answering system, after obtaining a question, information retrieval is performed to obtain text that is relevant to the question. Then, the answer to the question is determined from the obtained text. If the text obtained through information retrieval is irrelevant, it will affect the accuracy of the determined answer and thus affect the performance of the question-answering system. Therefore, information retrieval is crucial.

[0004] In existing technologies, to improve the recall rate of information retrieval and make the retrieved text more relevant to the question, semantic retrieval is typically used to determine text that is semantically related to the question. Specifically, a retrieval model can be used to determine the semantic vector of the question to be answered and the semantic vectors of multiple texts in a text library. The similarity between the semantic vectors of the texts and the semantic vectors of the question to be answered is then determined. If the similarity is high, it indicates that the question to be answered and the text are semantically close. Therefore, texts with a high similarity to the semantic vectors of the question to be answered can be identified as texts that are semantically related to the question to be answered.

[0005] However, in the above method, the semantic vector obtained by vectorizing the question to be answered is determined only by the retrieval model, and the performance of the retrieval model depends on how it was trained. Therefore, if the determined semantic vector cannot accurately represent the question to be answered, the text determined from that semantic vector may be irrelevant to the question. That is, semantic retrieval may recall irrelevant text, and an answer determined from irrelevant text may be inaccurate, which affects the performance of the question-answering system.

Summary of the Invention

[0006] In view of this, embodiments of this application provide a method and apparatus for determining candidate text, a computing device, and a computer-readable storage medium to address the technical deficiencies existing in the prior art.

[0007] According to a first aspect of the embodiments of this application, a method for determining candidate text is provided, including:

[0008] Based on the obtained question to be answered, determine the semantic vector of the question to be answered, and obtain the semantic vectors of multiple texts in the text library;

[0009] Based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the multiple texts, a first candidate text that is semantically related to the question to be answered is determined from the text library;

[0010] The question to be answered is segmented into words to obtain multiple first word units of the question to be answered;

[0011] Based on the weight value of each first word unit and the relevance value between each first word unit and each text in the text library, the similarity score of each text relative to the question to be answered is determined, and the texts with similarity scores greater than the second threshold are determined as the second candidate texts;

[0012] Based on the first candidate text and the second candidate text, candidate texts are determined.
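The steps of the first aspect can be sketched as a small pipeline. This is only an illustrative outline: the function name, the `semantic_score` and `lexical_score` callables, and `first_threshold` are hypothetical, and merging the two candidate sets by union is an assumption, since the passage above does not say how the first and second candidate texts are combined.

```python
from typing import Callable, List, Sequence


def determine_candidates(
    question: str,
    library: Sequence[str],
    semantic_score: Callable[[str, str], float],
    lexical_score: Callable[[str, str], float],
    first_threshold: float,
    second_threshold: float,
) -> List[str]:
    """Combine semantic and word-unit-based retrieval (union is an assumption)."""
    # First candidate texts: texts semantically related to the question.
    first = [t for t in library if semantic_score(question, t) > first_threshold]
    # Second candidate texts: similarity score greater than the second threshold.
    second = [t for t in library if lexical_score(question, t) > second_threshold]
    # Merge, keeping first candidates ahead of second ones and dropping duplicates.
    return list(dict.fromkeys(first + second))
```

A text recalled by only one of the two routes still reaches the final candidate list under this merge rule; an intersection would be the stricter alternative.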

[0013] According to a second aspect of the embodiments of this application, a candidate text determination apparatus is provided, comprising:

[0014] The first determining module is configured to determine the semantic vector of the question to be answered based on the acquired question to be answered, and to acquire the semantic vectors of multiple texts in the text library;

[0015] The second determining module is configured to determine a first candidate text semantically related to the question to be answered from the text library based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the plurality of texts;

[0016] The word segmentation module is configured to perform word segmentation on the question to be answered, thereby obtaining multiple first word units of the question to be answered.

[0017] The third determining module is configured to determine the similarity score of each text relative to the question to be answered based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, and determine the text with a similarity score greater than a second threshold as the second candidate text;

[0018] The fourth determining module is configured to determine candidate texts based on the first candidate text and the second candidate text.

[0019] According to a third aspect of the embodiments of this application, a computing device is provided, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor executes the instructions to implement the steps of the candidate text determination method.

[0020] According to a fourth aspect of the embodiments of this application, a computer-readable storage medium is provided that stores computer instructions, which, when executed by a processor, implement the steps of the candidate text determination method.

[0021] According to a fifth aspect of the present application, a chip is provided that stores computer instructions, which, when executed by the chip, implement the steps of the candidate text determination method.

[0022] In this embodiment, based on the acquired question to be answered, the semantic vector of the question to be answered is determined, and the semantic vectors of multiple texts in the text library are obtained. Based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the multiple texts, a first candidate text semantically related to the question to be answered is determined from the text library. The question to be answered is segmented into multiple first word units. Based on the weight value of each first word unit and the relevance value between each first word unit and each text in the text library, a similarity score of each text relative to the question to be answered is determined, and texts with similarity scores greater than a second threshold are determined as second candidate texts. Based on the first and second candidate texts, candidate texts are determined. Because the candidate texts are determined from first and second candidate texts obtained in two different ways, one based on the semantic vectors of the question and the texts and the other based on the question and the texts themselves, the accuracy of the determined candidate texts is improved, thereby improving the performance of the question-answering system.

Attached Figure Description

[0023] Figure 1 is a structural block diagram of a computing device provided in an embodiment of this application;

[0024] Figure 2 is a flowchart of a text processing method provided in an embodiment of this application;

[0025] Figure 3 is a schematic diagram of a text processing method provided in an embodiment of this application;

[0026] Figure 4 is a schematic diagram illustrating a method for determining candidate text provided in an embodiment of this application;

[0027] Figure 5 is a schematic diagram of a graph network provided in an embodiment of this application;

[0028] Figure 6 is a flowchart of another text processing method provided in the embodiments of this application;

[0029] Figure 7 is a schematic diagram of the structure of a text processing device provided in an embodiment of this application.

Detailed Implementation

[0030] Many specific details are set forth in the following description to provide a full understanding of this application. However, this application can be implemented in many other ways different from those described herein, and those skilled in the art can make similar extensions without departing from the spirit of this application; therefore, this application is not limited to the specific embodiments disclosed below.

[0031] The terminology used in one or more embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the scope of one or more embodiments of this application. The singular forms “a,” “an,” and “the” used in one or more embodiments of this application and in the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” used in one or more embodiments of this application refers to and includes any or all possible combinations of one or more associated listed items.

[0032] It should be understood that although the terms first, second, etc., may be used to describe various information in one or more embodiments of this application, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, first may also be referred to as second without departing from the scope of one or more embodiments of this application, and similarly, second may also be referred to as first. Depending on the context, the word "if" as used herein may be interpreted as "in response to a determination".

[0033] First, the terminology used in one or more embodiments of the present invention will be explained.

[0034] Information retrieval: A method for retrieving information.

[0035] Semantic retrieval: A method of retrieving information based on semantics.

[0036] DPR (Dense Passage Retrieval) model: This model performs semantic retrieval and outputs candidate texts related to the input question.

[0037] Recall: The ratio of the number of relevant texts retrieved to the number of relevant texts actually existing in the text database. Relevant texts are those that are truly relevant to the question to be answered.
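The definition of recall above can be written out as a short calculation. This is a minimal sketch; the function name and the use of sets of text identifiers are illustrative assumptions.

```python
from typing import Iterable, Set


def recall(retrieved: Iterable[str], relevant: Set[str]) -> float:
    """Fraction of the truly relevant texts that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant texts")
    # Count only retrieved texts that are actually relevant.
    hits = len(set(retrieved) & relevant)
    return hits / len(relevant)
```

For instance, if four texts in the library are truly relevant and the retrieval returns two of them plus one irrelevant text, this function returns 0.5.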

[0038] Adjacency matrix: A matrix that represents the adjacency relationship between nodes. The adjacency matrix of an undirected graph is symmetric.

[0039] Text filtering network: A network that filters input text to identify text that meets certain requirements.

[0040] Graph Neural Networks: A type of deep learning network that processes graph data.

[0041] The BM25 algorithm is an extension of the binary independence model and can be used to rank searches based on relevance.

[0042] Semantic vector: A vector used to represent the semantic features of text.

[0043] Hidden feature vector: A feature vector obtained by combining contextual information; it is a vector representation.

[0044] Word embedding refers to the process of embedding a high-dimensional space containing the number of all words into a continuous vector space with a much lower dimension, where each word or phrase is mapped to a vector in the real number field.

[0045] word2vec is a method for word embedding, developed by Mikolov based on the Bengio Neural Network Language Model (NNLM). It's an efficient word vector training method that allows for word embedding of text to obtain word vectors.

[0046] Word vectors: A representation of words designed to enable computers to process them.

[0047] The BERT model (Bidirectional Encoder Representations from Transformers) is a bidirectional attention neural network model.

[0048] The first word unit: the word unit obtained after segmenting the question to be answered.

[0049] The second word unit: the word unit obtained after segmenting the candidate text.

[0050] First feature vector: The vector representation obtained by combining the word vectors of the first word unit with the word vectors of other first word units in the question to be answered.

[0051] Second feature vector: The vector representation of the second word unit combined with the word vectors of other second word units in the corresponding candidate text.

[0052] This application provides a text processing method and apparatus, a computing device and a computer-readable storage medium, which will be described in detail in the following embodiments.

[0053] Figure 1 shows a structural block diagram of a computing device 100 according to an embodiment of this application. The components of the computing device 100 include, but are not limited to, a memory 110 and a processor 120. The processor 120 is connected to the memory 110 via a bus 130, and a database 150 is used to store data.

[0054] The computing device 100 also includes an access device 140, which enables the computing device 100 to communicate via one or more networks 160. Examples of these networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 140 may include one or more of any type of wired or wireless network interface (e.g., a Network Interface Card (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Wi-MAX interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so on.

[0055] In one embodiment of this application, the aforementioned components of the computing device 100, as well as other components not shown in Figure 1, can also be connected to each other, for example, via a bus. It should be understood that the block diagram of the computing device shown in Figure 1 is for illustrative purposes only and is not intended to limit the scope of this application. Those skilled in the art can add or replace other components as needed.

[0056] The computing device 100 can be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smartphones), wearable computing devices (e.g., smartwatches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs. The computing device 100 can also be a mobile or stationary server.

[0057] The processor 120 can execute the steps of the text processing method shown in Figure 2. Figure 2 shows a flowchart of a text processing method according to an embodiment of this application, including steps 202 to 206.

[0058] Step 202: Based on the obtained question to be answered, determine the semantic vector of the question to be answered, multiple candidate texts, and the semantic vector of the multiple candidate texts, wherein each candidate text is a text in the text library that is semantically related to the question to be answered.

[0059] In practical applications, after obtaining the question to be answered, the semantic vectors of the question and the text in the text library can be determined by the retrieval model. Text with a high similarity to the semantic vector of the question can be considered to be text with a meaning close to the question. Therefore, text with a high similarity to the semantic vector of the question can be identified as text related to the question to be answered. In this case, a large amount of text can usually be obtained. However, since the vectorization of the question to be answered and the vectorization of the text in the text library are both determined by the retrieval model, and the performance of the retrieval model depends on the training situation, the determined semantic vector is uncontrollable. It may not accurately represent the question to be answered, or it may not accurately represent the text in the text library. The text determined by using such inaccurate semantic vectors may be irrelevant to the question to be answered. That is, semantic retrieval may recall irrelevant text, which can be considered as reducing the retrieval recall rate. Moreover, the answer determined based on text that is irrelevant to the question to be answered may be inaccurate, which will also affect the performance of the question answering system.

[0060] Therefore, this application provides a text processing method that, after initially obtaining candidate texts, further filters these candidate texts, removing those irrelevant to the question to be answered, and obtaining target texts highly relevant to the question. In other words, the text processing method provided in this application performs further filtering based on the recall from large-scale semantic retrieval, filtering out irrelevant texts. This method improves the recall rate of the retrieval, enhances the reliability of semantic retrieval, and results in more accurate answers determined based on these target texts, thus improving the performance of the question-answering system.

[0061] As an example, the semantic vector of the question to be answered is a feature vector that can be used to characterize the semantics of the question to be answered, and the semantic vector of the candidate text is a feature vector that can be used to characterize the semantics of the candidate text.

[0062] As an example, a question to be answered is a question that requires a corresponding answer. For example, a question to be answered could be "What is the smallest natural number?", or "What is the smallest prime number?", or "Which countries are included in the four ancient civilizations?", and so on.

[0063] In a first possible implementation, the specific implementation of determining the semantic vector of the question to be answered, multiple candidate texts, and the semantic vectors of the multiple candidate texts based on the acquired question to be answered may include: extracting features from the question to be answered to determine the semantic vector of the question to be answered; obtaining the semantic vectors of multiple texts in the text library; determining the similarity score of each text relative to the question to be answered based on the semantic vector of the question to be answered and the semantic vectors of the multiple texts; determining the multiple candidate texts based on the similarity score of each text relative to the question to be answered, and obtaining the semantic vectors of the multiple candidate texts.

[0064] The similarity score can be used to represent the similarity between the text and the question to be answered. The higher the similarity score, the more similar the text and the question to be answered are, and the lower the similarity score, the less similar the text and the question to be answered are.

[0065] In other words, features can be extracted from the question to be answered to obtain the semantic vector of the question to be answered, features can be extracted from the text in the text library to obtain the semantic vector of each text, and candidate texts that are semantically related to the question to be answered can be determined from the text library based on the similarity between the semantic vector of the question to be answered and the semantic vector of the text.

[0066] In some embodiments, the question to be answered and texts from a text library can be input into a semantic retrieval model to determine multiple candidate texts. The semantic retrieval model may include a feature extraction module and a text retrieval module. The feature extraction module can extract features from the question to be answered and each text in the text library to obtain the semantic vector of the question to be answered and the semantic vector of each text in the text library. Then, based on the semantic vector of the question to be answered and the semantic vector of each text, the text retrieval module can determine candidate texts semantically related to the question to be answered.

[0067] As an example, the feature extraction module may include a word embedding layer and an encoding layer. The word embedding layer is used to perform word embedding processing on the input text to obtain word vectors, and the encoding layer is used to encode the input word vectors to obtain semantic vectors.

[0068] In practice, the input question to be answered and multiple texts from the text library can be segmented separately to obtain multiple first word units for the question to be answered and multiple second word units for each text. As an example, the question to be answered and multiple texts can be segmented separately according to a pre-compiled vocabulary. For instance, in the pre-compiled vocabulary, if the text is Chinese, a single character or a punctuation mark can be considered as a word unit. If the text is in a foreign language, a single word or a punctuation mark can be considered as a word unit. If the text includes numbers, a single number can be considered as a word unit.

[0069] For example, suppose the question to be answered is "What is the smallest natural number?"; word segmentation of this question yields multiple first word units such as [smallest, of, natural number, is, how many]. If the question to be answered is "What is the smallest prime number?", word segmentation yields multiple first word units such as [What, is, the, smallest, prime, number]. If the text is "0 is the smallest natural number", word segmentation of this text yields multiple second word units such as [0, is, smallest, of, natural number]. If the text is "Natural numbers are integers greater than or equal to 0", word segmentation yields multiple second word units such as [natural number, is, greater than, or, equal to, 0, of, integer].
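The per-character and per-word rules described for the pre-compiled vocabulary can be approximated with a regular expression. This is a rough sketch under stated assumptions: the function name is hypothetical, "foreign language" words are taken to be runs of ASCII letters, a "number" is taken to be a run of digits, and every other non-space character (a Chinese character or a punctuation mark) is its own word unit. A real vocabulary may also contain multi-character units such as "natural number", which this sketch does not model.

```python
import re
from typing import List

# One alternative per rule: a Latin word, a run of digits, or any other
# single non-space character (e.g. a Chinese character or punctuation mark).
_UNIT_PATTERN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def segment(text: str) -> List[str]:
    """Split text into word units following the simplified rules above."""
    return _UNIT_PATTERN.findall(text)
```

Applied to an English question, this yields one unit per word plus a unit for the question mark; applied to Chinese text, it yields one unit per character.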

[0070] In the specific implementation, after segmenting the question to be answered, word embedding can be performed on each first word unit of the question and each second word unit of the text in the text library. Each word unit is mapped to a low-dimensional vector space to obtain the word vector of each word unit. For ease of description, the first word unit and the second word unit are collectively referred to as word units.

[0071] As an example, word embedding can be performed on each first word unit of the question to be answered using one-hot encoding to obtain the word vector of each first word unit, and word embedding can be performed on each second word unit to obtain the word vector of each second word unit.
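One-hot encoding of word units can be illustrated as follows. The helper names are hypothetical, and building the vocabulary from the word units in order of first appearance is an assumption made only to keep the example self-contained.

```python
from typing import Dict, List, Sequence


def build_vocab(units: Sequence[str]) -> Dict[str, int]:
    """Assign each distinct word unit an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for unit in units:
        # setdefault leaves the existing index untouched for repeated units.
        vocab.setdefault(unit, len(vocab))
    return vocab


def one_hot(unit: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector with a single 1 at the index of the given word unit."""
    vec = [0] * len(vocab)
    vec[vocab[unit]] = 1
    return vec
```

Each vector has as many dimensions as there are distinct word units in the vocabulary, with exactly one nonzero entry.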

[0072] As another example, word2vec encoding can be used to perform word embedding on each first word unit of the question to be answered, to obtain the word vector of each first word unit, and word embedding can be performed on each second word unit to obtain the word vector of each second word unit.

[0073] In the specific implementation, after word embedding is performed through the word embedding layer to obtain word vectors, the word vectors of each first word unit and each second word unit can be input into the encoding layer for encoding. This yields the vector representation of each first word unit combined with the word vectors of other first word units in the question to be answered, i.e., the first feature vector of each first word unit. Similarly, it yields the vector representation of each second word unit combined with the word vectors of other second word units in the corresponding text, i.e., the second feature vector of each second word unit. Concatenating the first feature vectors of multiple first word units in the question to be answered yields the semantic vector of the question to be answered. Likewise, concatenating the second feature vectors of multiple second word units in the same text yields the semantic vector of the text.

[0074] In some embodiments, after the semantic vector of the question to be answered and the semantic vectors of the texts in the text library are obtained through the feature extraction module, the semantic vector of the question to be answered and the semantic vector of each text can be input into the text retrieval module to determine the similarity score between the semantic vector of the question to be answered and the semantic vector of each text. Multiple similarity scores are thus obtained, and candidate texts can then be determined from the multiple texts in the text library based on these similarity scores.

[0075] As an example, the text retrieval module can multiply the semantic vector of the question to be answered with the semantic vector of each text, and then normalize the product to obtain the similarity score between the question to be answered and each text, thus obtaining multiple similarity scores.
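The multiply-then-normalize step can be sketched as below. The passage does not specify the normalization, so this example assumes cosine similarity, that is, the dot product divided by the product of the two vector lengths; the function name is hypothetical, and the semantic vectors are assumed to have the same dimension.

```python
import math
from typing import List, Sequence


def similarity_scores(
    question_vec: Sequence[float],
    text_vecs: Sequence[Sequence[float]],
) -> List[float]:
    """Score each text vector against the question vector (cosine, assumed)."""
    q_norm = math.sqrt(sum(x * x for x in question_vec))
    scores = []
    for vec in text_vecs:
        # Multiply the two semantic vectors (dot product).
        dot = sum(q * t for q, t in zip(question_vec, vec))
        t_norm = math.sqrt(sum(x * x for x in vec))
        # Normalize the product; zero vectors have no direction, so score 0.
        scores.append(dot / (q_norm * t_norm) if q_norm and t_norm else 0.0)
    return scores
```

With this choice the scores fall in the range from -1 to 1, so a per-text threshold can be applied directly.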

[0076] It should be noted that the feature extraction module mentioned above is only one example of this application. In other embodiments, the feature extraction module can be any structure that includes word segmentation, word embedding, and encoding functions, and this application does not limit this. For example, the feature extraction module can adopt the structure of the BERT model. In addition, the semantic retrieval model can be the DPR model, through which multiple candidate texts semantically related to the question to be answered can be obtained.

[0077] In one implementation, after determining multiple similarity scores, it is also necessary to determine candidate texts based on the similarity scores. Therefore, the specific implementation of determining the multiple candidate texts based on the similarity score of each text relative to the question to be answered may include: selecting multiple texts with similarity scores greater than a second threshold as the multiple candidate texts.

[0078] It should be noted that the second threshold can be set by the user according to actual needs, or it can be set by the device default; this application embodiment does not limit this. For example, the second threshold can be 0.8.

[0079] For example, since a higher similarity score indicates a greater semantic relevance between the text and the question to be answered, and a lower similarity score indicates a less semantic relevance between the text and the question to be answered, if the similarity score is greater than the second threshold, the similarity can be considered high enough, that is, the semantic relevance between the text and the question to be answered is high enough, and the text can be identified as a candidate text.
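The threshold comparison can be expressed in a few lines. This is a minimal sketch; the function name is hypothetical, and the default of 0.8 simply mirrors the example value given for the second threshold.

```python
from typing import List, Sequence


def select_candidates(
    texts: Sequence[str],
    scores: Sequence[float],
    second_threshold: float = 0.8,
) -> List[str]:
    """Keep texts whose similarity score is strictly greater than the threshold."""
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    return [t for t, s in zip(texts, scores) if s > second_threshold]
```

A text scoring exactly the threshold value is not selected, matching the "greater than" wording.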

[0080] For example, see Figure 3, which is a schematic diagram of a text processing method provided in an embodiment of this application. After the question to be answered is input into the semantic retrieval model, the feature extraction module can output the semantic vector of the question to be answered and the semantic vectors of multiple texts. The text retrieval module can then obtain 1000 candidate texts and the 1000 semantic vectors of those candidate texts.

[0081] Furthermore, after identifying multiple candidate texts, the BM25 algorithm can be used to perform an initial sorting of the candidate texts, retaining the top N candidate texts and deleting the candidate texts that are ranked lower. After the initial screening, the number of candidate texts obtained will be reduced, which can reduce the amount of computation in the text screening network.
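The initial screening step, keeping the top N candidates by BM25 score, can be sketched as follows. The function name is hypothetical, and the BM25 scores are assumed to have been computed already, one per candidate text.

```python
from typing import List, Sequence


def keep_top_n(
    candidates: Sequence[str],
    bm25_scores: Sequence[float],
    n: int,
) -> List[str]:
    """Sort candidates by BM25 score, highest first, and retain the top n."""
    if len(candidates) != len(bm25_scores):
        raise ValueError("candidates and scores must have the same length")
    if n < 0:
        raise ValueError("n must be non-negative")
    # Sort candidate indices by descending score, then cut the list at n.
    order = sorted(range(len(candidates)), key=lambda i: bm25_scores[i], reverse=True)
    return [candidates[i] for i in order[:n]]
```

Reducing, say, 1000 recalled candidates to a much smaller N before the text filtering network runs is what lowers the downstream computation.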

[0082] In this implementation, multiple candidate texts related to the question to be answered are identified from the text library through semantic retrieval. Through the semantic retrieval model, multiple candidate texts with high relevance to the question to be answered can be recalled.

[0083] In this embodiment, by extracting features from the question to be answered and the text, a semantic vector representing the semantics of the question to be answered and a semantic vector representing the semantics of the text are determined. Candidate texts semantically related to the question to be answered are determined based on the similarity between the semantic vectors. The semantic vector of the question to be answered is not a concatenation of word vectors of multiple first word units, but is obtained by combining the semantic information of the whole text with each first word unit, which can more accurately represent the question to be answered. The semantic vector of the candidate text is not a concatenation of word vectors of multiple second word units, but is obtained by combining the semantic information of the whole text with each second word unit, which can more accurately represent the candidate text, thus improving the accuracy and recall of retrieval.

[0084] In the second possible implementation, multiple candidate texts can be identified from the text in the text library using the BM25 algorithm. Then, feature extraction is performed on the question to be answered and the identified candidate texts to determine the semantic vector of the question to be answered and the semantic vectors of the multiple candidate texts.

[0085] In some embodiments, the specific implementation of determining multiple candidate texts from a text library using the BM25 algorithm may include: performing word segmentation on the question to be answered to obtain multiple first word units of the question; determining the relevance value between each first word unit and each text, which yields multiple relevance values for each first word unit, each relevance value corresponding to one text; determining the weight value of each first word unit; determining, based on the weight value of each first word unit and the multiple relevance values of each first word unit, the similarity score of each text relative to the question to be answered, resulting in multiple similarity scores; and comparing the multiple similarity scores with a second threshold and determining the texts with similarity scores greater than the second threshold as the multiple candidate texts.

[0086] As an example, the question to be answered can be segmented based on a pre-compiled vocabulary. For instance, suppose the question to be answered is "What is the smallest natural number?", then segmenting the question to be answered can yield multiple first word units, namely [smallest, of, natural number, is, how many].
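One common way to segment text against a pre-compiled vocabulary is forward maximum matching. The sketch below assumes that strategy, which is not specified above. It also assumes that the original question is the Chinese sentence "最小的自然数是多少", a back-translation of the example question, and the vocabulary contents are hypothetical.

```python
def segment(text, vocabulary):
    """Greedy forward maximum matching against a fixed vocabulary.

    Characters that match no vocabulary entry become single-character units.
    """
    max_len = max(map(len, vocabulary))
    units = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                units.append(piece)
                i += size
                break
    return units


# Hypothetical pre-compiled vocabulary.
vocabulary = {"最小", "的", "自然数", "是", "多少"}
units = segment("最小的自然数是多少", vocabulary)
print(units)
```

The resulting units correspond to [smallest, of, natural number, is, how many] in the example above.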

[0087] As an example, taking the first word unit q_i and text d as an example, the implementation of determining the relevance value of the first word unit q_i to text d may include: determining the frequency of occurrence of q_i in text d, the average length of all texts in the text library, and the length of text d; and determining the relevance value of q_i to text d based on this frequency, the average length, and the length of text d.

[0088] For example, the relevance value of the first word unit q_i to text d can be determined by the following formula (1):

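[0089] $$R(q_i, d) = \frac{f_i \cdot (k_1 + 1)}{f_i + k_1 \cdot \left(1 - b + b \cdot \frac{dl}{\mathrm{avg}(dl)}\right)} \tag{1}$$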

[0090] Where R(q_i, d) represents the relevance value of the first word unit q_i to text d; f_i represents the frequency of occurrence of the first word unit q_i in text d; k1 and b are both adjustment factors, usually set according to experience (generally k1 = 2 and b = 0.75); dl represents the length of text d; and avg(dl) represents the average length of all texts in the text library.

[0091] Using the above formula (1), the relevance value of each first word unit relative to each text can be determined.

[0092] As an example, taking the first word unit q_i as an example, the implementation of determining the weight value of q_i may include: determining the total number of texts in the text library and the number of texts in the text library that include q_i; and determining the weight value of q_i based on the total number and the number of texts that include q_i.

[0093] For example, the weight value of the first word unit q_i can be determined by the following formula (2):

[0094] $$W_i = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \tag{2}$$

[0095] Where W_i represents the weight value of the first word unit q_i, N represents the total number of texts in the text library, and n(q_i) represents the number of texts that include the first word unit q_i.

[0096] The weight value of each first word unit can be determined using the above formula (2).

[0097] As an example, taking text d as an example, after determining the relevance value of each first word unit relative to text d and the weight value of each first word unit, the similarity score of text d relative to the question to be answered can be determined by the following formula (3):

[0098] $$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d) \tag{3}$$

[0099] Where Q represents the question to be answered, Score(Q, d) represents the similarity score of text d relative to the question to be answered Q, n represents the number of first word units in the question to be answered, W_i represents the weight value of the first word unit q_i, and R(q_i, d) represents the relevance value of the first word unit q_i to text d.

[0100] Using the above formula (3), the similarity score of each text relative to the question to be answered can be determined.
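As a minimal sketch, the scoring procedure of formulas (1) to (3) can be written as follows. The sketch assumes the commonly used BM25 forms for the relevance value and the weight value, built from the symbols defined above (f_i, k1, b, dl, avg(dl), N, n(q_i)). The function name, the five-text library, and the value 0.2 used for the second threshold are hypothetical.

```python
import math


def bm25_scores(question_units, texts, k1=2.0, b=0.75):
    """Score every text against the question.

    texts maps a text id to the list of word units of that text.
    """
    total = len(texts)                                        # N
    avg_len = sum(len(u) for u in texts.values()) / total     # avg(dl)

    # Weight value of each first word unit (assumed form of formula (2)).
    weights = {}
    for q in question_units:
        n_q = sum(1 for units in texts.values() if q in units)
        weights[q] = math.log((total - n_q + 0.5) / (n_q + 0.5))

    scores = {}
    for text_id, units in texts.items():
        length_norm = k1 * (1 - b + b * len(units) / avg_len)
        score = 0.0
        for q in question_units:
            f = units.count(q)                                # f_i
            # Relevance value (assumed form of formula (1)).
            relevance = f * (k1 + 1) / (f + length_norm)
            # Weighted sum over the first word units, formula (3).
            score += weights[q] * relevance
        scores[text_id] = score
    return scores


# Hypothetical text library, already segmented into word units.
texts = {
    "text 1": ["natural number", "is", "non-negative", "integer"],
    "text 2": ["0", "is", "smallest", "natural number"],
    "text 3": ["prime", "number", "is", "greater", "than", "1"],
    "text 4": ["tissue", "paper", "is", "soft"],
    "text 5": ["Li Bai", "is", "poet"],
}
scores = bm25_scores(["smallest", "natural number"], texts)

second_threshold = 0.2  # assumed value
second_candidates = [tid for tid, s in scores.items() if s > second_threshold]
print(second_candidates)
```

With this form of the weight value, a word unit that appears in more than half of the texts receives a negative weight, which is one reason the algorithm may be adapted in practice.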

[0101] After determining the similarity score of each text relative to the question to be answered, texts with similarity scores greater than the second threshold can be identified as candidate texts. Then, both the candidate texts and the question to be answered are input into the feature extraction model for feature extraction, which yields the semantic vector of each candidate text and the semantic vector of the question to be answered.

[0102] It should be noted that the above implementation process of determining multiple candidate texts using the BM25 algorithm is only an example. In actual implementation, the BM25 algorithm can be adapted and used, and this application embodiment does not limit this. In addition, the implementation process of determining candidate texts based on similarity scores and extracting features from candidate texts and questions to be answered is the same as the previous implementation method. For specific implementation details, please refer to the relevant description in the first implementation method, and this application embodiment will not repeat them here.

[0103] In this implementation, multiple candidate texts related to the question to be answered are identified from the text library using the BM25 retrieval method, which can recall multiple candidate texts that are highly relevant to the question to be answered.

[0104] In a third possible implementation, a first candidate text can be obtained from a text library using a semantic retrieval model, and a second candidate text can be obtained from the text library using the BM25 retrieval algorithm. Multiple candidate texts are then determined based on the first and second candidate texts. Furthermore, the semantic vector of the question to be answered and the semantic vectors of the multiple candidate texts are obtained.

[0105] It should be noted that the implementation process of obtaining the first candidate text from the text library through semantic retrieval is the same as the implementation process of determining candidate text in the first implementation method described above. For specific implementation details, please refer to the relevant description in the first implementation method; this application embodiment does not limit this process. Furthermore, the implementation process of obtaining the second candidate text from the text library through the BM25 retrieval algorithm is the same as the implementation process of determining candidate text in the second implementation method described above. For specific implementation details, please refer to the relevant description in the second implementation method; this application embodiment does not limit this process.

[0106] In some embodiments, the intersection of the first candidate text and the second candidate text can be used to determine multiple candidate texts, that is, texts that appear repeatedly in the first and second candidate texts can be identified as candidate texts. For example, assuming the first candidate text includes text 1, text 2, and text 4, and the second candidate text includes text 1, text 3, and text 4, then text 1 and text 4 can be identified as candidate texts. Candidate texts identified in this way are more relevant to the question to be answered than candidate texts identified using only one retrieval method; that is, the identified candidate texts are more accurate.

[0107] In other embodiments, the union of the first candidate text and the second candidate text can be determined as the plurality of candidate texts. For example, assuming the first candidate text includes text 1, text 2, and text 4, and the second candidate text includes text 1, text 3, and text 4, then text 1, text 2, text 3, and text 4 can be determined as candidate texts. This allows for the acquisition of as much text as possible related to the question to be answered, avoiding the omission of relevant text.
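The two combination strategies above map directly onto set intersection and set union. The following sketch uses the example texts from the preceding paragraphs; the function name `merge_candidates` and its `mode` parameter are hypothetical.

```python
def merge_candidates(first, second, mode="union"):
    """Combine first and second candidate texts by intersection or union."""
    first_set, second_set = set(first), set(second)
    if mode == "intersection":
        merged = first_set & second_set
    elif mode == "union":
        merged = first_set | second_set
    else:
        raise ValueError("mode must be 'intersection' or 'union'")
    return sorted(merged)


first = ["text 1", "text 2", "text 4"]
second = ["text 1", "text 3", "text 4"]
print(merge_candidates(first, second, mode="intersection"))
print(merge_candidates(first, second, mode="union"))
```

The intersection favors accuracy, while the union favors recall.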

[0108] In addition, during implementation, when determining the first candidate text through the semantic retrieval model, the semantic vector of the question to be answered and the semantic vector of the first candidate text can be obtained. The semantic vector of the second candidate text can be obtained through feature extraction.

[0109] As an example, if the final list of candidate texts includes texts that are not part of the first candidate text, feature extraction can be performed on these texts to obtain their semantic vectors. This allows the semantic vectors of the candidate texts to be obtained. For instance, assuming the first candidate texts include text 1, text 2, and text 4, and the candidate texts include text 1, text 3, and text 4, the feature extraction module of the semantic retrieval model can obtain the semantic vectors of text 1, text 2, and text 4. Since text 3 is not part of the first candidate texts, feature extraction can be performed on text 3 to determine its semantic vector. Thus, the semantic vectors of the three candidate texts can be determined.

[0110] As another example, if the final determined candidate texts are the intersection of the first candidate text and the second candidate text, i.e., there is no text that does not belong to the first candidate text, then the semantic vector of the first candidate text determined by the semantic retrieval model can be used as the semantic vector of the candidate texts. For example, assuming the first candidate texts include text 1, text 2, and text 4, and the candidate texts include text 1 and text 4, the feature extraction module of the semantic retrieval model can obtain the semantic vectors of text 1, text 2, and text 4, thus directly obtaining the semantic vectors of the two candidate texts.
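The reuse of semantic vectors described in the two examples above can be sketched as a lookup with a fallback. This sketch assumes a hypothetical `extract` callable standing in for the feature extraction step, and the vector values are placeholders.

```python
def candidate_vectors(candidates, first_vectors, extract):
    """Reuse vectors already produced by semantic retrieval; extract the rest."""
    vectors = {}
    for text_id in candidates:
        if text_id in first_vectors:
            vectors[text_id] = first_vectors[text_id]
        else:
            vectors[text_id] = extract(text_id)
    return vectors


extracted = []  # records which texts needed feature extraction


def extract(text_id):
    """Hypothetical stand-in for the feature extraction model."""
    extracted.append(text_id)
    return [0.0, 0.0]


# Vectors already obtained for the first candidate texts (placeholder values).
first_vectors = {"text 1": [0.9, 0.1], "text 2": [0.2, 0.8], "text 4": [0.5, 0.5]}
vectors = candidate_vectors(["text 1", "text 3", "text 4"], first_vectors, extract)
print(extracted)
```

In this example only text 3 goes through feature extraction. When the candidate texts are the intersection of the two sets, no additional extraction is needed.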

[0111] For example, see Figure 4, which is a schematic diagram illustrating a method for determining candidate texts provided in an embodiment of this application. In Figure 4, the semantic retrieval model can determine N first candidate texts and their semantic vectors, and the BM25 retrieval algorithm can determine M second candidate texts. Feature extraction is then performed on the M second candidate texts to obtain their semantic vectors. Assuming there are no duplicate texts between the first and second candidate texts, the M+N texts can be taken as the candidate texts, and their semantic vectors can be used as the semantic vectors of the candidate texts.

[0112] In this implementation, multiple candidate texts related to the question to be answered are identified from the text library by combining semantic retrieval and BM25 retrieval, which can improve the accuracy of the recalled candidate texts.

[0113] Step 204: Construct an adjacency matrix based on the association between the question to be answered and the multiple candidate texts, wherein the adjacency matrix is used to characterize the relevance between the question to be answered and the multiple candidate texts, as well as the relevance among the multiple candidate texts.

[0114] In the embodiments of this application, after determining multiple candidate texts, multiple candidate texts need to be filtered. Since filtering based solely on the association between candidate texts and the question to be answered may be too simplistic, the association between candidate texts can also be considered. Furthermore, an adjacency matrix can be used to represent the association between candidate texts and between candidate texts and the question to be answered.

[0115] Furthermore, before constructing the adjacency matrix based on the association between the question to be answered and the multiple candidate texts, the process also includes:

[0116] Obtain the keywords of the question to be answered and the keywords of each candidate text;

[0117] If the first candidate text contains a keyword corresponding to the keyword of the question to be answered, it is determined that the first candidate text and the question to be answered are related, wherein the first candidate text is any one of the plurality of candidate texts;

[0118] If the first candidate text contains a keyword corresponding to a keyword of the second candidate text, it is determined that the first candidate text and the second candidate text are related, wherein the second candidate text is any candidate text other than the first candidate text among the plurality of candidate texts;

[0119] The relationship between the question to be answered and itself is determined to be relevant, and the relationship between each candidate text and itself is determined to be relevant; or, the relationship between the question to be answered and itself is determined to be irrelevant, and the relationship between each candidate text and itself is determined to be irrelevant.

[0120] Keywords can be important words in the question to be answered, or important words in the candidate text. Furthermore, the number of keywords in the question to be answered can be one, two, or more. The number of keywords in the candidate text can also be one, two, or more.

[0121] A corresponding keyword can be the keyword itself, or a similar word, synonym, alternative name, etc. For example, if the keyword is "tissue paper," the corresponding keywords could be "toilet paper," "roll paper," or "facial tissue." If the keyword is "natural number," the corresponding keyword could be "non-negative integer." If the keyword is "Li Bai," the corresponding keywords could be "the Poet Immortal" or "the Qinglian Hermit."

[0122] In other words, before constructing the adjacency matrix, it is necessary to determine the relationships between the question to be answered and multiple candidate texts, as well as the relationships among the candidate texts themselves. Specifically, the keywords of the question to be answered and the keywords of each candidate text can be obtained. If the first candidate text contains a keyword corresponding to a keyword of the question to be answered, it can be considered that the first candidate text may be similar to the central idea expressed by the question to be answered, and thus the relationship between the first candidate text and the question to be answered is determined to be relevant. If the first candidate text contains a keyword corresponding to a keyword of a second candidate text, it can be considered that the first candidate text may be similar to the central idea expressed by the second candidate text, and thus the relationship between the first candidate text and the second candidate text is determined to be relevant. Furthermore, the relationship between the question to be answered and itself can be determined as relevant or irrelevant, and the relationship between each candidate text and itself can be determined as relevant or irrelevant.

[0123] In some embodiments, keywords can be extracted from the question to be answered and the candidate text using an entity extraction algorithm. For example, if the question to be answered is "What is the smallest natural number?", the extracted keywords would be "smallest" and "natural number". If the candidate text is "Natural numbers are integers greater than or equal to 0", the extracted keywords would be "natural number", "greater than or equal to", and "0".

[0124] In some embodiments, if the question to be answered includes a keyword, as long as the first candidate text contains a corresponding keyword, it can be determined that the question to be answered and the first candidate text are related; if the second candidate text includes a keyword, as long as the first candidate text contains a corresponding keyword, it can be determined that the second candidate text and the first candidate text are related.

[0125] As an example, if the question to be answered includes multiple keywords, as long as the first candidate text contains a corresponding keyword of one of the keywords, it can be determined that the question to be answered and the first candidate text are related; if the second candidate text includes multiple keywords, as long as the first candidate text contains a corresponding keyword of one of the keywords, it can be determined that the second candidate text and the first candidate text are related.

[0126] For example, suppose the keywords of the question to be answered include "smallest" and "natural number", and the first candidate text is "natural numbers are non-negative integers", which includes the keyword "natural number"; then it can be determined that the first candidate text and the question to be answered are related. Suppose the keywords of the second candidate text are "0" and "natural number", and the first candidate text is "natural numbers are non-negative integers", which includes the keyword "natural number"; then it can be determined that the first candidate text and the second candidate text are related.

[0127] As another example, if the question to be answered includes multiple keywords, the first candidate text needs to contain a corresponding keyword for each of those keywords before the question to be answered and the first candidate text are determined to be related; similarly, if the second candidate text includes multiple keywords, the first candidate text needs to contain a corresponding keyword for each of those keywords before the second candidate text and the first candidate text are determined to be related. This improves the accuracy of the determined relationships.

[0128] For example, suppose the keywords of the question to be answered include "smallest" and "natural number," and first candidate text 1 is "natural numbers are non-negative integers," which includes only the keyword "natural number." Then it can be determined that first candidate text 1 and the question to be answered are unrelated. Suppose first candidate text 2 is "0 is the smallest non-negative integer," which includes the keyword "smallest" and "non-negative integer," the corresponding keyword of "natural number." That is, first candidate text 2 includes a corresponding keyword for each keyword of the question to be answered, so it can be determined that first candidate text 2 and the question to be answered are related. Suppose the keywords of the second candidate text are "0" and "natural number," and the first candidate text is "natural numbers start from 0," which includes the keywords "natural number" and "0." Then it can be determined that the first candidate text and the second candidate text are related.
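The two matching rules (a corresponding keyword for any one keyword, or a corresponding keyword for every keyword) can be sketched as follows. This is a simplified illustration that assumes substring matching and a hypothetical synonym table; the function name `is_related` and the `require_all` flag are not part of the described method.

```python
def is_related(source_keywords, target_text, synonyms, require_all=False):
    """Decide whether target_text is related to the source of the keywords.

    A keyword is covered when the keyword itself or one of its
    corresponding keywords occurs in target_text.
    """
    if not source_keywords:
        return False

    def covered(keyword):
        forms = [keyword] + list(synonyms.get(keyword, ()))
        return any(form in target_text for form in forms)

    check = all if require_all else any
    return check(covered(keyword) for keyword in source_keywords)


# Hypothetical table of corresponding keywords.
synonyms = {"natural number": ["non-negative integer"]}
question_keywords = ["smallest", "natural number"]
text_a = "natural numbers are non-negative integers"
text_b = "0 is the smallest non-negative integer"

print(is_related(question_keywords, text_a, synonyms))                    # any-keyword rule
print(is_related(question_keywords, text_a, synonyms, require_all=True))  # every-keyword rule
print(is_related(question_keywords, text_b, synonyms, require_all=True))
```

Under the any-keyword rule the first text is related to the question; under the every-keyword rule only the second text is.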

[0129] It should be noted that if the first candidate text contains a keyword corresponding to a keyword in the second candidate text, then the first candidate text may be an explanation of the keyword in the second candidate text. For example, suppose the second candidate text includes keyword B, and keyword B exists in the second candidate text as a hyperlink. Clicking this hyperlink will take you to the first candidate text. In this case, it can be assumed that the first candidate text contains a keyword corresponding to a keyword in the second candidate text.

[0130] In this embodiment of the application, before constructing the adjacency matrix, the association between the question to be answered and each candidate text, as well as the association among the candidate texts, can be determined based on the keywords. Because the adjacency matrix is constructed from both kinds of association, the accuracy of text filtering can be further improved.

[0131] In one possible implementation, the specific implementation of constructing an adjacency matrix based on the association between the question to be answered and the multiple candidate texts may include: using the question to be answered and the multiple candidate texts as nodes, using the nodes as rows and columns, and arranging the row nodes and column nodes in the same order, determining the elements at each position based on the association between the row nodes and column nodes corresponding to each position, thereby obtaining the adjacency matrix.

[0132] In other words, in the constructed adjacency matrix, the element at each position is determined based on the relationship between the row node and the column node at that position. The row nodes and column nodes are the question to be answered and the multiple candidate texts, and the order of the row nodes is the same as the order of the column nodes.

[0133] As an example, for ease of description, the question to be answered and the multiple candidate texts can be referred to as nodes. These nodes can be randomly numbered and arranged into rows and columns according to their numbers. Then, the element in the i-th row and j-th column of the adjacency matrix is determined based on the association between the node in the i-th row and the node in the j-th column. Here, i and j are both integers greater than 0.

[0134] For example, suppose the question to be answered is numbered 1, candidate text 1 is numbered 2, and candidate text 2 is numbered 3. Then the row nodes of the adjacency matrix are arranged in order from 1 to 3, and the column nodes are also arranged in order from 1 to 3.

[0135] In this embodiment of the application, the relationship between the question to be answered and the candidate text can be represented in the form of an adjacency matrix, which facilitates device processing.

[0136] In one implementation, the specific implementation of determining the element at each position based on the row and column relationships corresponding to each position may include:

[0137] If the row node and column node corresponding to the target position are related, then the element of the target position is determined to be 1, where the target position is any position in the adjacency matrix; if the row node and column node corresponding to the target position are not related, then the element of the target position is determined to be 0.

[0138] As an example, for ease of device identification, relevant elements can be represented by the value 1, and irrelevant elements by the value 0. If the relationship between the i-th row node and the j-th column node is relevant, then the element in the i-th row and j-th column is 1; if the relationship between the i-th row node and the j-th column node is irrelevant, then the element in the i-th row and j-th column is 0.

[0139] For example, suppose there are three candidate texts, the question to be answered is numbered 1, candidate text 1 is numbered 2, candidate text 2 is numbered 3, and candidate text 3 is numbered 4. The question to be answered is related to candidate text 1, so the elements in the first row, second column and the second row, first column are both 1; the question to be answered is not related to candidate text 2, so the elements in the first row, third column and the third row, first column are both 0; the question to be answered is related to candidate text 3, so the elements in the first row, fourth column and the fourth row, first column are both 1; candidate text 1 is related to candidate text 2, so the elements in the second row, third column and the third row, second column are both 1; candidate text 1 is related to candidate text 3, so the elements in the second row, fourth column and the fourth row, second column are both 1; candidate text 2 is not related to candidate text 3, so the elements in the third row, fourth column and the fourth row, third column are both 0; and the question to be answered is related to itself and each candidate text is related to itself, so the elements in the first row, first column; the second row, second column; the third row, third column; and the fourth row, fourth column are all 1. The following adjacency matrix can thus be obtained:

[0140] $$A = \begin{pmatrix} 1 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}$$
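As an illustrative sketch, the adjacency matrix of the example above can be built from the list of related node pairs. The function name `build_adjacency` and the `self_related` flag (which selects whether each node is treated as related to itself) are hypothetical.

```python
def build_adjacency(nodes, related_pairs, self_related=True):
    """Build a symmetric 0/1 adjacency matrix with rows and columns in node order."""
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for a, b in related_pairs:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1
    # The diagonal holds the relation of each node to itself.
    for i in range(size):
        matrix[i][i] = 1 if self_related else 0
    return matrix


nodes = ["question", "text 1", "text 2", "text 3"]
related_pairs = [
    ("question", "text 1"),
    ("question", "text 3"),
    ("text 1", "text 2"),
    ("text 1", "text 3"),
]
adjacency = build_adjacency(nodes, related_pairs)
for row in adjacency:
    print(row)
```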

[0141] In another possible implementation, the specific implementation of constructing an adjacency matrix based on the association between the question to be answered and the multiple candidate texts may include: using the question to be answered and the multiple candidate texts as nodes, connecting different nodes that are related to each other to obtain a graph network; and constructing the adjacency matrix based on the graph network.

[0142] In this implementation, if the relationship between different nodes is related, then there can be an edge between the different nodes. With the question to be answered and multiple candidate texts as nodes and the relationship as edges, a graph network can be constructed, and then an adjacency matrix can be constructed based on the graph network.

[0143] For example, assuming the relationship between the question to be answered and candidate text 1 is relevant, then there is an edge between the question to be answered node and the candidate text 1 node; the relationship between the question to be answered and candidate text 2 is irrelevant, and there is no edge between the question to be answered node and the candidate text 2 node; the relationship between the question to be answered and candidate text 3 is relevant, and there is an edge between the question to be answered node and the candidate text 3 node; the relationship between candidate text 1 and candidate text 2 is relevant, and there is an edge between the candidate text 1 node and the candidate text 2 node; the relationship between candidate text 1 and candidate text 3 is relevant, and there is an edge between the candidate text 1 node and the candidate text 3 node; the relationship between candidate text 2 and candidate text 3 is irrelevant, and there is no edge between the candidate text 2 node and the candidate text 3 node. Therefore, the graph network shown in Figure 5 can be obtained.

[0144] In one implementation, the specific implementation of constructing the adjacency matrix based on the graph network may include: using the nodes in the graph network as rows and columns, with the row nodes arranged in the same order as the column nodes, determining the elements at each position based on whether there are edges between the row nodes and column nodes corresponding to each position, thereby obtaining the adjacency matrix.

[0145] In other words, in the constructed adjacency matrix, the element at each position is determined by the relationship between the row node and the column node at that position. The row node and the column node are nodes in the graph network, and the order of the row node is the same as the order of the column node.

[0146] As an example, nodes in the graph network can be randomly numbered and arranged into rows and columns based on their numbers. The element in the i-th row and j-th column of the adjacency matrix is then determined by the association between the node in the i-th row and the node in the j-th column. Here, i and j are both integers greater than 0.

[0147] For example, suppose the question to be answered is numbered 1, candidate text 1 is numbered 2, and candidate text 2 is numbered 3. Then the row nodes of the adjacency matrix are arranged in order from 1 to 3, and the column nodes are also arranged in order from 1 to 3.

[0148] In this embodiment of the application, the relationship between the question to be answered and the candidate text can be represented in the form of an adjacency matrix, which facilitates device processing.

[0149] In one implementation, the specific implementation of determining the element of each position based on whether there are edges between the row nodes and column nodes corresponding to each position may include: if the row nodes and column nodes corresponding to the target position are not the same nodes and there are edges in the graph network, then the element of the target position is determined to be 1, wherein the target position is any position in the adjacency matrix; if the row nodes and column nodes corresponding to the target position are not the same nodes and there are no edges in the graph network, then the element of the target position is determined to be 0; if the row nodes and column nodes corresponding to the target position are the same nodes, then the element of the target position is determined to be 1 or 0.

[0150] In other words, for ease of device identification, relevance can be represented by the value 1, and irrelevance by the value 0. If the row node and column node corresponding to the target position are not the same node and there is an edge between them in the graph network, the row node and column node are considered relevant, and the element at the target position can be determined as 1. If the row node and column node corresponding to the target position are not the same node and there is no edge between them in the graph network, the row node and column node are considered irrelevant, and the element at the target position can be determined as 0. If the row node and column node corresponding to the target position are the same node, there is no edge from that node to itself in the graph network, and the element at the target position can be determined as either 1 or 0.

[0151] As an example, when i and j are different, if there is an edge between the i-th row node and the j-th column node in the graph network, then the element in the i-th row and j-th column is 1; if there is no edge between the i-th row node and the j-th column node in the graph network, then the element in the i-th row and j-th column is 0. When i and j are the same, the element in the i-th row and j-th column can be determined as either 1 or 0.

[0152] For example, suppose there are three candidate texts, the question to be answered is numbered 1, candidate text 1 is numbered 2, candidate text 2 is numbered 3, and candidate text 3 is numbered 4. There is an edge between the question to be answered node and the candidate text 1 node, so the elements in the first row, second column and the second row, first column are both 1; there is no edge between the question to be answered node and the candidate text 2 node, so the elements in the first row, third column and the third row, first column are both 0; there is an edge between the question to be answered node and the candidate text 3 node, so the elements in the first row, fourth column and the fourth row, first column are both 1; there is an edge between the candidate text 1 node and the candidate text 2 node, so the elements in the second row, third column and the third row, second column are both 1; there is an edge between the candidate text 1 node and the candidate text 3 node, so the elements in the second row, fourth column and the fourth row, second column are both 1; there is no edge between the candidate text 2 node and the candidate text 3 node, so the elements in the third row, fourth column and the fourth row, third column are both 0. Furthermore, the elements at the diagonal positions of the adjacency matrix are determined to be 1. The adjacency matrix can thus be obtained.
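A minimal sketch of this graph-based route is shown below. It assumes the graph network is stored as a mapping from each node to the set of its neighbors; the function names and the `diagonal` parameter (the value placed where the row node and the column node are the same node) are hypothetical.

```python
def build_graph(nodes, related_pairs):
    """Connect different nodes that are related to each other."""
    graph = {node: set() for node in nodes}
    for a, b in related_pairs:
        if a != b:  # a node has no edge to itself
            graph[a].add(b)
            graph[b].add(a)
    return graph


def graph_to_adjacency(graph, order, diagonal=1):
    """Read the adjacency matrix off the graph; rows and columns follow order."""
    return [
        [diagonal if row == col else int(col in graph[row]) for col in order]
        for row in order
    ]


order = ["question", "text 1", "text 2", "text 3"]
edges = [
    ("question", "text 1"),
    ("question", "text 3"),
    ("text 1", "text 2"),
    ("text 1", "text 3"),
]
graph = build_graph(order, edges)
matrix = graph_to_adjacency(graph, order, diagonal=1)
for row in matrix:
    print(row)
```

Passing `diagonal=0` gives the alternative in which each node is treated as unrelated to itself.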

[0153] For example, see Figure 3, in which the adjacency matrix is constructed based on the question to be answered and the candidate texts.

[0154] In this embodiment, the association between multiple candidate texts and the association between the question to be answered and the candidate texts can be determined based on the keywords of the question to be answered and the keywords of the candidate texts. An adjacency matrix can be constructed based on these associations, that is, the associations can be represented in the form of an adjacency matrix. In this way, the relationships among the candidate texts are considered together with their relationships to the question to be answered, so the extracted associations are richer. Using the adjacency matrix as the input of the text filtering network enables the network to extract richer associations, thereby improving the accuracy of text filtering.

[0155] Step 206: Input the semantic vector of the question to be answered, the semantic vectors of the multiple candidate texts, and the adjacency matrix into the text filtering network to determine the target text.

[0156] As an example, the target text can be the text that remains after filtering and is highly relevant to the question to be answered; it does not include irrelevant text, that is, text unrelated to the question to be answered.

[0157] As an example, the text filtering network can be a graph neural network. Exemplarily, the text filtering network can be a graph convolutional neural network, a graph autoencoder neural network, etc., and this application embodiment does not limit it in this way.

[0158] In one implementation, the specific implementation of inputting the semantic vector of the question to be answered, the semantic vectors of the plurality of candidate texts, and the adjacency matrix into a text filtering network to determine the target text may include: inputting the adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of the plurality of candidate texts into the text filtering network to obtain a relevance score of each candidate text relative to the question to be answered; and determining the candidate text with a relevance score greater than a first threshold as the target text.

[0159] The first threshold can be set by the user according to actual needs, or it can be set by the device default; this application embodiment does not limit this. For example, the first threshold can be 0.8.

[0160] The relevance score indicates the relevance between the candidate text and the question to be answered. A higher relevance score indicates a higher relevance between the candidate text and the question to be answered, while a lower relevance score indicates a lower relevance between the candidate text and the question to be answered.

[0161] As an example, the adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of multiple candidate texts can be input into a text filtering network. The text filtering network learns the relationship between the question to be answered and the multiple candidate texts, and updates the semantic vectors of the multiple candidate texts based on the relationship, the semantic vector of the question to be answered, and the semantic vectors of the multiple candidate texts. The updated semantic vectors are then converted into a relevance score for each candidate text relative to the question to be answered. If the relevance score of a candidate text is greater than a first threshold, it can be considered that the candidate text is highly relevant to the question to be answered, and the candidate text can be identified as the target text.

[0162] In this embodiment, a text filtering network can be used to determine the relevance score of each candidate text to the question to be answered, and then the target text can be determined from multiple candidate texts based on the relevance score. This allows for the selection of target texts that are more relevant to the question to be answered from multiple candidate texts, and enables rapid reordering and filtering of a large number of candidate texts.

[0163] In one implementation, the specific implementation of inputting the adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of the plurality of candidate texts into a text filtering network to obtain the relevance score of each candidate text relative to the question to be answered may include:

[0164] The semantic vector of the question to be answered and the semantic vector of the multiple candidate texts are concatenated to obtain the concatenated semantic vector;

[0165] The concatenated semantic vector and the adjacency matrix are input into the hidden layer of the text filtering network to obtain a hidden layer feature vector group. The hidden layer feature vector group includes a hidden layer feature vector obtained by combining the semantic vectors of the question to be answered with the semantic vectors of the multiple candidate texts, and a hidden layer feature vector obtained by combining the semantic vectors of each candidate text with the semantic vectors of other candidate texts and the question to be answered.

[0166] The hidden layer feature vector group is input into a fully connected layer to obtain the relevance score of each candidate text relative to the question to be answered.

[0167] In some embodiments, the semantic vector of the question to be answered and the semantic vectors of multiple candidate texts can be concatenated to obtain a concatenated semantic vector. This concatenated semantic vector and the adjacency matrix are then input into the hidden layer of the text filtering network. Multiple convolution operations can be performed in the hidden layer to combine the semantic vector of the question to be answered and the semantic vectors of the candidate texts in the concatenated semantic vector to obtain a hidden layer feature vector group. This hidden layer feature vector group is then input into a fully connected layer to obtain the relevance score of each candidate text relative to the question to be answered.

[0168] As an example, suppose there are 9 candidate texts, the semantic vector of the question to be answered is a 300-dimensional vector, and the semantic vector of each candidate text is also a 300-dimensional vector. Then, concatenating the semantic vector of the question to be answered and the semantic vectors of the multiple candidate texts results in a 10×300 concatenated semantic vector, where each row is the semantic vector of one node. After this concatenated semantic vector is input into the hidden layer, it can be multiplied by its transpose, i.e., the 10×300 matrix is multiplied by the 300×10 matrix, resulting in a 10×10 first matrix. The element in the i-th row and j-th column of this first matrix is the dot product of the semantic vector of the i-th node and the semantic vector of the j-th node, that is, a value combining the semantic vectors of the two nodes.

[0169] As an example, the adjacency matrix is also a 10×10 matrix. Then, this first matrix is merged with the adjacency matrix by multiplying the element in the i-th row and j-th column of the first matrix with the element in the i-th row and j-th column of the adjacency matrix. This results in a 10×10 second matrix in which the elements at positions corresponding to unrelated row and column nodes are 0. The second matrix is then normalized row-wise to ensure that the elements in each row are of the same magnitude, thus obtaining the weight of each node.

[0170] As an example, multiplying the second matrix by the concatenated semantic vector, that is, multiplying the 10×10 matrix by the 10×300 matrix, yields a 10×300 third matrix. In this third matrix, the i-th row represents the hidden layer feature vector obtained after the i-th row node combines the semantic vectors of the other nodes, and the j-th element in the i-th row represents the value of that hidden layer feature vector in the j-th dimension.

[0171] As an example, the third matrix can also be called the hidden layer feature vector group. This hidden layer feature vector group is input into a fully connected layer. The fully connected layer contains a pre-defined transformation matrix, which can be a 300×1 matrix. Multiplying the third matrix by the transformation matrix yields a 10×1 target matrix. In this target matrix, the element in each row represents the relevance score of the node in that row. Since the row nodes correspond to the question to be answered and the multiple candidate texts, the relevance score of each candidate text relative to the question to be answered can be obtained.
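The matrix steps in the above example (first, second, third and target matrices) can be sketched as follows. This is only an illustrative sketch under stated assumptions: the row-wise normalization is assumed here to divide each row by the sum of the absolute values of its elements, which the text does not specify; the function names and the tiny 3×2 input are hypothetical; and no learned parameters or activation functions are modeled.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def relevance_scores(x, adjacency, transform):
    # First matrix: pairwise dot products between node semantic vectors.
    first = matmul(x, transpose(x))
    # Second matrix: element-wise product with the adjacency matrix,
    # which zeroes the entries between unrelated nodes.
    second = [[f * a for f, a in zip(f_row, a_row)]
              for f_row, a_row in zip(first, adjacency)]
    # Row-wise normalization (assumed here: divide by the row's L1 norm).
    weights = []
    for row in second:
        total = sum(abs(v) for v in row)
        weights.append([v / total if total else 0.0 for v in row])
    # Third matrix: hidden layer feature vector group.
    third = matmul(weights, x)
    # Target matrix: one score per node via the d x 1 transformation matrix.
    target = matmul(third, transform)
    return [row[0] for row in target]


# Toy input: 3 nodes (row 0 is the question), 2-dimensional semantic vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
transform = [[0.5], [0.5]]  # hypothetical stand-in for the 300 x 1 matrix
scores = relevance_scores(x, adjacency, transform)
print(scores)
```

With 9 candidate texts and 300-dimensional vectors, `x` would be 10×300, `adjacency` 10×10 and `transform` 300×1, giving the 10×1 target matrix described above; the first entry of the result corresponds to the question node itself.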

[0172] In some embodiments, after determining the relevance score of each candidate text relative to the question to be answered, candidate texts with relevance scores greater than a first threshold can be marked as relevant, that is, these candidate texts are determined to be relevant to the question to be answered.

[0173] In this embodiment, the candidate text is filtered by a text filtering network to obtain the target text. This not only considers the relationship between the question to be answered and the candidate text, but also the relationship between the candidate texts. The extracted relationships are richer, and the relationships are represented by an adjacency matrix. Combined with semantic vectors, the text filtering network can learn richer relationships, thereby improving the accuracy of text filtering.

[0174] For example, see Figure 3: the semantic vector of the question to be answered and the semantic vectors of 1000 candidate texts are concatenated to obtain a concatenated semantic vector. The concatenated semantic vector and the adjacency matrix are then input into the text filtering network, which can output the relevance score of each candidate text relative to the question to be answered, thereby determining 10 target texts.

[0175] In one implementation, after determining the candidate texts with relevance scores greater than a preset threshold as the target texts, the method further includes: if there are multiple target texts, sorting the target texts in descending order of relevance scores, and outputting the sorted target texts in order.

[0176] The preset threshold can be set by the user according to actual needs, or it can be set by the device default; this application embodiment does not limit this. For example, the preset threshold can be 0.85.

[0177] In practice, when there are multiple target texts, they can be sorted in descending order of relevance score, and the sorted target texts can be output in order for the user to view. When there is only one target text, that target text can be output for the user to view.
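The descending ordering of target texts can be sketched as follows. The function name `rank_target_texts`, the placeholder text names and the scores are hypothetical and serve only to illustrate the ordering.

```python
def rank_target_texts(target_scores):
    """Return (text, score) pairs sorted by relevance score, highest first."""
    return sorted(target_scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical target texts that have already passed the threshold.
target_scores = {"text A": 0.86, "text B": 0.97, "text C": 0.91}
ranked = rank_target_texts(target_scores)
for text, score in ranked:
    print(text, score)
```

When there is only one target text, the same function simply returns a single-element list.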

[0178] In this embodiment, unlike existing technologies that require re-extracting features from the question and candidate texts after processing through a semantic retrieval model, and then re-ranking multiple candidate texts based on the re-extracted semantic vectors, the semantic vectors of the question and candidate texts obtained from the semantic retrieval model are input into a text filtering network. This reduces the process of re-obtaining the semantic vectors of the question and texts as in existing technologies, allowing for rapid re-ranking of candidate texts to obtain the target text. Furthermore, the text filtering network can constrain the recall results of the semantic retrieval model, preventing irrelevant texts from being recalled.

[0179] Furthermore, the above method can be used to filter text and obtain target text relevant to the question to be answered. Then, the target answer can be obtained based on the question to be answered and the target text. As an example, the question to be answered and the target text sorted according to relevance scores can be input into a reading comprehension model, which can then output the target answer to the question to be answered.

[0180] Furthermore, the training method for the text filtering network is as follows:

[0181] Obtain a sample question, multiple sample texts, and a sample tag for each sample text, wherein the sample tag for each sample text is used to characterize the relevance of the sample text to the sample question;

[0182] Determine the semantic vector of the sample question and the semantic vector of each sample text, and construct an adjacency matrix based on the sample question and multiple sample texts;

[0183] The semantic vector of the sample question, the semantic vector of each sample text, and the adjacency matrix are input into the text filtering network. The text filtering network is processed through its hidden layer to obtain a hidden layer feature vector group. The hidden layer feature vector group includes the hidden layer feature vector obtained by combining the semantic vector of the sample question with the semantic vector of the multiple sample texts, and the hidden layer feature vector obtained by combining the semantic vector of each sample text with the semantic vector of other sample texts and the sample question.

[0184] The hidden layer feature vector group is input into a fully connected layer to obtain the relevance score of each sample text to the sample question.

[0185] The predicted label for each sample text is determined based on its relevance score to the sample question.

[0186] The text filtering network is trained based on the loss value determined from the predicted label and the sample label of each sample text, until the training stopping condition is met.

[0187] The sample labels can include relevant and irrelevant.

[0188] In some embodiments, sample questions and multiple sample texts can be obtained from a sample library, and each sample text in the sample library has a corresponding sample tag. The sample tag for each sample text can also be obtained.

[0189] In the specific implementation, we can first obtain the sample question, multiple sample texts, and sample labels for each sample text. We then perform feature extraction on the sample question and sample texts to determine the semantic vectors of the sample question and each sample text. Based on the correlation between the sample texts and the sample question, as well as the correlation between the sample texts, we construct an adjacency matrix. We input the semantic vectors of the sample question, each sample text, and the adjacency matrix into the hidden layer of the text filtering network. In the hidden layer, we can perform multiple convolution operations to combine the semantic vectors of the sample question and the sample texts in the concatenated semantic vectors to obtain a hidden layer feature vector group. We input this hidden layer feature vector group into a fully connected layer to obtain the relevance score of each sample text relative to the sample question. The predicted labels of sample texts with relevance scores greater than a first threshold are determined as relevant, and the predicted labels of sample texts with relevance scores less than or equal to the first threshold are determined as irrelevant. This determines the predicted label of each sample text. Then, we determine the loss value based on the predicted label and the sample label of each sample text. We train the text filtering network based on this loss value until the training stopping condition is met.

[0190] It should be noted that the specific implementation of determining the semantic vector of the sample question and the semantic vector of each sample text is the same as the specific implementation of determining the semantic vector of the question to be answered and the semantic vector of each candidate text in step 202. The implementation process can be found in the relevant description of step 202, and will not be repeated here in this embodiment. The specific implementation of constructing an adjacency matrix based on the sample question and multiple sample texts is the same as the specific implementation of constructing an adjacency matrix based on the association relationship between the question to be answered and multiple candidate texts. The implementation process can be found in the relevant description of step 204. The specific implementation of inputting the semantic vector of the sample question, the semantic vector of each sample text, and the adjacency matrix into the text filtering network until the predicted label of each sample text is determined is the same as the specific implementation of determining the target text in this step. The implementation process can be found in the relevant description of this step, and will not be repeated here in this embodiment.

[0191] In one possible implementation, training the text filtering network based on the predicted label and the loss value of the sample text until the training stopping condition is met may specifically include: stopping the training of the text filtering network if the loss value is less than or equal to a third threshold; and continuing the training of the text filtering network if the loss value is greater than the third threshold.

[0192] It should be noted that the third threshold can be set by the user according to actual needs, or it can be set by default by the computing device. This application embodiment does not limit this.

[0193] In other words, if the loss value is greater than the third threshold, it indicates that the difference between the predicted label and the sample label is relatively large, and the performance of the text filtering network is not yet good enough. Therefore, it is necessary to continue training the text filtering network. If the loss value is less than or equal to the third threshold, it indicates that the difference between the predicted label and the sample label is relatively small, and the performance of the text filtering network is already relatively good. The training of the text filtering network can then be considered complete and can be stopped.
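The loss-based stopping condition can be sketched as follows. This is a minimal sketch, not the training procedure of this application: `train_step` is a hypothetical callable that performs one training iteration and returns the loss value, the toy step simply halves a stored loss, and `max_steps` is a safety cap added for the sketch that does not appear in the text.

```python
def train_until_loss_threshold(train_step, third_threshold, max_steps=10_000):
    """Call train_step() until the returned loss is <= third_threshold.

    max_steps is a safety cap added for this sketch only.
    """
    loss = float("inf")
    steps = 0
    while steps < max_steps:
        loss = train_step()
        steps += 1
        if loss <= third_threshold:
            break  # loss is small enough: stop training
    return loss, steps


# Toy stand-in for one training iteration: the loss halves on each call.
state = {"loss": 1.0}


def toy_train_step():
    state["loss"] /= 2
    return state["loss"]


final_loss, steps_run = train_until_loss_threshold(toy_train_step, third_threshold=0.1)
print(final_loss, steps_run)
```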

[0194] As an example, the loss value can be determined based on the predicted label and the sample label of each sample text. For multiple sample texts, multiple loss values can be obtained. These multiple loss values can be weighted and summed to obtain the loss value corresponding to this training. The parameters of the text filtering network can be adjusted based on this loss value to achieve the training of the text filtering network.
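The weighted summation of per-sample loss values can be sketched as follows. The text does not name a loss function, so binary cross-entropy computed on the relevance score is an assumption of this sketch, as are the equal default weights and the function names.

```python
import math


def binary_cross_entropy(score, label, eps=1e-7):
    """Per-sample loss (assumed form); label is 1 for relevant, 0 for irrelevant."""
    p = min(max(score, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def weighted_training_loss(scores, labels, weights=None):
    """Weighted sum of the per-sample losses for one training iteration."""
    if weights is None:
        # Assumption: equal weights when none are given.
        weights = [1.0 / len(scores)] * len(scores)
    return sum(w * binary_cross_entropy(s, y)
               for w, s, y in zip(weights, scores, labels))


# Two hypothetical sample texts: one relevant (label 1), one irrelevant (label 0).
loss = weighted_training_loss([0.9, 0.6], [1, 0])
print(loss)
```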

[0195] The embodiments in this specification determine the specific training status of the text filtering network based on the loss value, and adjust the parameters of the text filtering network in reverse according to the loss value if the training is not qualified, so as to improve the text filtering ability of the text filtering network. The training speed is high and the training effect is good.

[0196] In another possible implementation, training the text filtering network based on the predicted label and the loss value of the sample text until the training stopping condition is met may include: for each training iteration of the text filtering network based on the predicted label and the loss value of the sample text, incrementing the training iteration count by one; if the training iteration count is less than or equal to a fourth threshold, continuing to train the text filtering network; if the training iteration count is greater than the fourth threshold, stopping the training of the text filtering network.

[0197] It should be noted that the fourth threshold can be set by the user according to actual needs, or it can be set by default by the computing device. This application embodiment does not limit this.

[0198] In other words, each training iteration of the text filtering network based on the predicted label and the loss value of each sample text can be considered as one iteration of training. Based on the predicted label and sample label obtained from this iteration, the model is iterated and trained again, and the number of iterations is recorded. If the number of iterations is less than or equal to the fourth threshold, it means that the model has not been trained enough and needs to be trained again. If the number of iterations is greater than the fourth threshold, it means that enough training has been done and the model's performance has basically stabilized, and training can be stopped.
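The iteration-count stopping condition can be sketched as follows. `train_step` is a hypothetical callable standing in for one training iteration. The sketch follows the wording above literally: training continues while the count is less than or equal to the fourth threshold and stops once the count exceeds it.

```python
def train_with_iteration_limit(train_step, fourth_threshold):
    """Run train_step() until the iteration count exceeds fourth_threshold."""
    iteration_count = 0
    while True:
        train_step()
        iteration_count += 1  # one more training iteration completed
        if iteration_count > fourth_threshold:
            break  # enough iterations: stop training
    return iteration_count


# Toy stand-in that only records that an iteration took place.
calls = []
total_iterations = train_with_iteration_limit(lambda: calls.append(1), fourth_threshold=3)
print(total_iterations)
```

Read literally, a fourth threshold of N therefore permits N + 1 iterations, because the count must exceed N before training stops.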

[0199] As an example, the loss value can be determined based on the predicted label and the sample label of each sample text. For multiple sample texts, multiple loss values can be obtained. These multiple loss values can be weighted and summed to obtain the loss value corresponding to this training. The parameters of the text filtering network can be adjusted based on this loss value to achieve the training of the text filtering network.

[0200] It should be noted that the preset number of times can be set by the user according to actual needs, or it can be set by default by the computing device. This application embodiment does not limit this.

[0201] In the embodiments of this specification, determining whether the text filtering network training is complete based on the number of iterations can reduce unnecessary iterations and improve the efficiency of text filtering network training.

[0202] In this embodiment, based on the acquired question to be answered, the semantic vector of the question to be answered, multiple candidate texts, and the semantic vectors of the multiple candidate texts are determined. Each candidate text is a text in a text library that is semantically related to the question to be answered. An adjacency matrix is constructed based on the association between the question to be answered and the multiple candidate texts. The adjacency matrix is used to characterize the relevance between the question to be answered and the multiple candidate texts, as well as the relevance among the multiple candidate texts. The semantic vector of the question to be answered, the semantic vectors of the multiple candidate texts, and the adjacency matrix are input into a text filtering network to determine the target text. After determining multiple candidate texts, the above method can further filter the candidate texts through a text filtering network, deleting candidate texts that are irrelevant to the question to be answered, and obtaining target texts that are highly relevant to the question to be answered. This reduces the recall of irrelevant texts, improves the retrieval recall rate, and because the target texts are more relevant to the question to be answered, the accuracy of the answers determined based on these target texts is higher, thus improving the performance of the question-answering system.

[0203] Figure 6 shows a flowchart of another text processing method provided in an embodiment of this application. The text processing method is described using the question to be answered "What is the smallest natural number?" as an example, and includes steps 602 to 628.

[0204] Step 602: Obtain the question to be answered.

[0205] In this embodiment, the question to be answered is "What is the smallest natural number?".

[0206] Step 604: Extract features from the question to be answered to determine the semantic vector of the question to be answered.

[0207] Continuing with the example above, performing word segmentation on the question to be answered yields multiple first word units [smallest, of, natural number, is, what]. Each first word unit can be embedded using word2vec encoding, mapping it to a low-dimensional vector space to obtain its word vector. These word vectors are then input into an encoding layer for further encoding, resulting in a vector representation of each first word unit combined with the word vectors of the other first word units in the question, i.e., the first feature vector of each first word unit. Concatenating the first feature vectors of the multiple first word units yields the semantic vector of the question.

[0208] Step 606: Obtain the semantic vectors of multiple texts in the text library.

[0209] For example, assuming the text is "0 is the smallest natural number", word segmentation of this text yields multiple second word units [0, is, smallest, of, natural number]. Each second word unit can be embedded using word2vec encoding, mapping it to a low-dimensional vector space to obtain its word vector. These word vectors are then input into an encoding layer for encoding, resulting in a vector representation of each second word unit combined with the word vectors of the other second word units in the text, i.e., the second feature vector of each second word unit. Concatenating the second feature vectors of the multiple second word units yields the semantic vector of the text.

[0210] Step 608: Based on the semantic vector of the question to be answered and the semantic vectors of the multiple texts, determine the similarity score of each text relative to the question to be answered.

[0211] Step 610: Select multiple texts with similarity scores greater than the second threshold as the multiple candidate texts, and obtain the semantic vectors of the multiple candidate texts.

[0212] It should be noted that steps 602-610 above are a further description of step 202, and their implementation process is the same as that of step 202. For specific implementation details, please refer to the relevant description of step 202. This embodiment will not repeat them here. In addition, in this embodiment, only semantic retrieval is used as an example to illustrate the process of determining candidate texts from the text library. In actual implementation, candidate texts can also be determined by BM25 or other retrieval algorithms, and this application does not limit this.
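Steps 608 and 610 can be sketched as follows. The text does not fix the similarity measure, so cosine similarity is an assumption of this sketch; the function names and the two-dimensional toy vectors are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_candidates(question_vec, text_vecs, second_threshold):
    """Return (index, score) for texts whose similarity exceeds the threshold."""
    scores = [cosine_similarity(question_vec, v) for v in text_vecs]
    return [(i, s) for i, s in enumerate(scores) if s > second_threshold]


# Toy semantic vectors for the question and three texts in the text library.
question_vec = [1.0, 0.0]
text_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
candidates = select_candidates(question_vec, text_vecs, second_threshold=0.5)
print(candidates)
```

The semantic vectors of the selected candidate texts can then be looked up by the returned indices, so they do not need to be recomputed in later steps.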

[0213] Step 612: Obtain the keywords of the question to be answered and the keywords of each candidate text.

[0214] Continuing with the example above, if the question to be answered is "What is the smallest natural number?", the keywords extracted would be "smallest" and "natural number". If the candidate text is "0 is the smallest natural number", the keywords extracted would be "natural number", "0", and "smallest".

[0215] Step 614: If the first candidate text contains a keyword corresponding to the keyword of the question to be answered, determine that the first candidate text and the question to be answered are related, wherein the first candidate text is any one of the plurality of candidate texts.

[0216] For example, if the keywords of the question to be answered include "smallest" and "natural number", and the first candidate text is "natural number is a non-negative integer", which includes the keyword "natural number", then it can be determined that the first candidate text and the question to be answered are related.

[0217] Step 616: If the first candidate text contains a keyword corresponding to the keyword of the second candidate text, determine that the association between the first candidate text and the second candidate text is related, wherein the second candidate text is any candidate text other than the first candidate text among the plurality of candidate texts.

[0218] For example, if the keywords of the second candidate text are "0" and "natural number", and the first candidate text is "natural number is a non-negative integer", which includes the keyword "natural number", then it can be determined that the first candidate text and the second candidate text are related.

[0219] Step 618: Determine that the relationship between the question to be answered and itself is relevant, and determine that the relationship between each candidate text and itself is relevant.

[0220] Step 620: Using the question to be answered and the multiple candidate texts as nodes, and the nodes as rows and columns, with the row nodes and column nodes arranged in the same order, determine the elements of each position based on the association relationship between the row nodes and column nodes corresponding to each position, and obtain the adjacency matrix.

[0221] For example, suppose there are three candidate texts: the question to be answered is numbered 1, candidate text 1 is numbered 2, candidate text 2 is numbered 3, and candidate text 3 is numbered 4. If the question to be answered is related to candidate text 1, then the elements in the first row, second column and the second row, first column are both 1; if the question to be answered is not related to candidate text 2, then the elements in the first row, third column and the third row, first column are both 0; if the question to be answered is related to candidate text 3, then the elements in the first row, fourth column and the fourth row, first column are both 1; if candidate text 1 is related to candidate text 2, then the elements in the second row, third column and the third row, second column are both 1; if candidate text 1 is related to candidate text 3, then the elements in the second row, fourth column and the fourth row, second column are both 1; if candidate text 2 is not related to candidate text 3, then the elements in the third row, fourth column and the fourth row, third column are both 0; and since the question to be answered is related to itself and each candidate text is related to itself, the elements in the first row, first column; the second row, second column; the third row, third column; and the fourth row, fourth column are all 1. That is, the adjacency matrix can be obtained through the above method.

[0222] It should be noted that steps 612-620 above are a sub-description of step 204, and their implementation process is the same as that of step 204. For specific implementation, please refer to the relevant description of step 204. This embodiment will not repeat it here.
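Steps 612 to 620 can be sketched as follows. As a simplification assumed for this sketch, two nodes are treated as related when their keyword sets share at least one keyword; the function name and the keyword sets of candidate texts 1 and 3 are hypothetical.

```python
def build_keyword_adjacency(question_keywords, candidate_keywords):
    """Build the adjacency matrix from keyword overlap.

    Node 0 is the question to be answered; nodes 1..n are the candidate
    texts. Two different nodes are related (element 1) if their keyword
    sets intersect; every node is related to itself.
    """
    nodes = [set(question_keywords)] + [set(k) for k in candidate_keywords]
    n = len(nodes)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or nodes[i] & nodes[j]:
                matrix[i][j] = 1
    return matrix


question_keywords = ["smallest", "natural number"]
candidate_keywords = [
    ["natural number", "non-negative integer"],  # candidate text 1 (assumed keywords)
    ["natural number", "0", "smallest"],         # candidate text 2
    ["prime number"],                            # candidate text 3 (hypothetical, unrelated)
]
adjacency = build_keyword_adjacency(question_keywords, candidate_keywords)
for row in adjacency:
    print(row)
```

Because set intersection is symmetric, the resulting matrix is symmetric, consistent with rows and columns being arranged in the same node order.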

[0223] Step 622: Concatenate the semantic vector of the question to be answered and the semantic vectors of the multiple candidate texts to obtain a concatenated semantic vector.

[0224] Step 624: Input the concatenated semantic vector and the adjacency matrix into the hidden layer of the text filtering network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group includes the hidden layer feature vector obtained by combining the semantic vectors of the question to be answered with the semantic vectors of the multiple candidate texts, and the hidden layer feature vector obtained by combining the semantic vectors of each candidate text with the semantic vectors of other candidate texts and the question to be answered.

[0225] Step 626: Input the hidden layer feature vector group into the fully connected layer to obtain the relevance score of each candidate text relative to the question to be answered.

[0226] Step 628: Identify candidate texts with relevance scores greater than the first threshold as the target text.

[0227] For example, if the question to be answered is "What is the smallest natural number?", and the candidate texts include candidate text 1 "Natural numbers are non-negative integers", candidate text 2 "0 is the smallest natural number", and candidate text 3 "Natural numbers are integers greater than or equal to 0", assuming that the relevance score of candidate text 1 to the question to be answered is 0.6, the relevance score of candidate text 2 to the question to be answered is 0.9, and the relevance score of candidate text 3 to the question to be answered is 0.85, and the first threshold is 0.8, then candidate text 2 and candidate text 3 can be determined as the target texts.

[0228] It should be noted that steps 622-628 above are a sub-description of step 206, and their implementation process is the same as that of step 206. For specific implementation, please refer to the relevant description of step 206. This embodiment will not repeat it here.
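The thresholding in step 628 can be sketched using the scores from the example above. The function name `select_target_texts` is hypothetical; the relevance scores and the first threshold of 0.8 are taken from that example.

```python
def select_target_texts(relevance_scores, first_threshold):
    """Keep candidate texts whose relevance score exceeds the first threshold."""
    return [text for text, score in relevance_scores.items()
            if score > first_threshold]


relevance_scores = {
    "Natural numbers are non-negative integers": 0.6,
    "0 is the smallest natural number": 0.9,
    "Natural numbers are integers greater than or equal to 0": 0.85,
}
targets = select_target_texts(relevance_scores, first_threshold=0.8)
print(targets)
```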

[0229] The text processing method provided in this application, after determining multiple candidate texts, can further filter the candidate texts through a text filtering network, delete candidate texts that are not related to the question to be answered, and obtain target texts that are highly relevant to the question to be answered. This reduces the recall of irrelevant texts, improves the retrieval recall rate, and since the target texts are more relevant to the question to be answered, the accuracy of the answers determined based on these target texts is higher, thus improving the performance of the question answering system.

[0230] Corresponding to the above method embodiments, this application also provides text processing apparatus embodiments. Figure 7 shows a schematic diagram of the structure of a text processing apparatus according to an embodiment of this application. As shown in Figure 7, the device 700 includes:

[0231] The first determining module 702 is configured to determine the semantic vector of the question to be answered, multiple candidate texts, and the semantic vector of the multiple candidate texts based on the acquired question to be answered, wherein each candidate text is a text in a text library that is semantically related to the question to be answered;

[0232] The construction module 704 is configured to construct an adjacency matrix based on the association between the question to be answered and the plurality of candidate texts, wherein the adjacency matrix is used to characterize the relevance between the question to be answered and the plurality of candidate texts, as well as the relevance among the plurality of candidate texts;

[0233] The second determining module 706 is configured to input the semantic vector of the question to be answered, the semantic vectors of the plurality of candidate texts, and the adjacency matrix into a text filtering network to determine the target text.

[0234] Optionally, the construction module 704 is also configured to:

[0235] Obtain the keywords of the question to be answered and the keywords of each candidate text;

[0236] If the first candidate text contains a keyword corresponding to the keyword of the question to be answered, it is determined that the first candidate text and the question to be answered are related, wherein the first candidate text is any one of the plurality of candidate texts;

[0237] If the first candidate text contains a keyword corresponding to the keyword of the second candidate text, it is determined that the first candidate text and the second candidate text are related, wherein the second candidate text is any candidate text other than the first candidate text among the plurality of candidate texts;

[0238] The relationship between the question to be answered and itself is determined to be relevant, and the relationship between each candidate text and itself is determined to be relevant; or, the relationship between the question to be answered and itself is determined to be irrelevant, and the relationship between each candidate text and itself is determined to be irrelevant.
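
The keyword-based association test above can be sketched as follows. This is only an illustration under stated assumptions: "a keyword corresponding to" is read here as an exact keyword match, keyword extraction is assumed to have been done already, and the keyword sets and the name `are_related` are hypothetical.

```python
def are_related(keywords_a, keywords_b):
    """Treat two nodes as related when they share at least one keyword (exact match assumed)."""
    return bool(set(keywords_a) & set(keywords_b))


# Hand-picked keywords, used purely for illustration.
question_keywords = {"smallest", "natural number"}
text_keywords = {
    "candidate text 1": {"natural number", "non-negative integer"},
    "candidate text 2": {"0", "smallest", "natural number"},
    "unrelated text": {"prime number", "divisor"},
}

for name, keywords in text_keywords.items():
    print(name, are_related(question_keywords, keywords))
```

Whether a node is related to itself is a separate design choice, as paragraph [0238] states, so it is deliberately left out of `are_related`.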

[0239] Optionally, the construction module 704 is configured to:

[0240] Using the question to be answered and the multiple candidate texts as nodes, and the nodes as rows and columns, with the row nodes and column nodes arranged in the same order, the elements at each position are determined based on the association relationship between the row nodes and column nodes corresponding to each position, thus obtaining the adjacency matrix.

[0241] Optionally, the construction module 704 is configured to:

[0242] If the row node and column node corresponding to the target position are related, then the element of the target position is determined to be 1, wherein the target position is any position in the adjacency matrix;

[0243] If the row node and column node corresponding to the target position are unrelated, then the element at the target position is determined to be 0.
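
A minimal sketch of this 0/1 construction is shown below. The node names, the `related_pairs` set, and the helper `is_related` are assumptions made for the example; here each node is treated as related to itself, which is one of the two options the description allows.

```python
def build_adjacency_matrix(nodes, is_related):
    """Rows and columns follow the same node order; an element is 1 if the pair is related, else 0."""
    return [[1 if is_related(row, col) else 0 for col in nodes] for row in nodes]


nodes = ["Q", "T1", "T2"]  # the question first, then the candidate texts
related_pairs = {("Q", "T1"), ("T1", "T2")}  # hypothetical associations


def is_related(a, b):
    # Self-relations are treated as related in this sketch.
    return a == b or (a, b) in related_pairs or (b, a) in related_pairs


matrix = build_adjacency_matrix(nodes, is_related)
print(matrix)  # [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
```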

[0244] Optionally, the construction module 704 is configured to:

[0245] Using the question to be answered and the multiple candidate texts as nodes, different nodes with related relationships are connected to obtain a graph network;

[0246] The adjacency matrix is constructed based on the graph network.

[0247] Optionally, the construction module 704 is configured to:

[0248] Using the nodes in the graph network as rows and columns, with the row nodes arranged in the same order as the column nodes, the elements at each position are determined based on whether there are edges between the row and column nodes corresponding to each position, thus obtaining the adjacency matrix.

[0249] Optionally, the construction module 704 is configured to:

[0250] If the row node and column node corresponding to the target position are not the same node and there is an edge between them in the graph network, then the element of the target position is determined to be 1, wherein the target position is any position in the adjacency matrix;

[0251] If the row node and column node corresponding to the target position are not the same node and there is no edge between them in the graph network, then the element of the target position is determined to be 0;

[0252] If the row node and column node corresponding to the target position are the same node, then the element at the target position is determined to be 1 or 0.
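
The graph-based variant can be sketched in the same way. In this illustration the graph network is represented by a node list and an undirected edge list, and the diagonal value is a parameter because the description allows either 1 or 0; the names `adjacency_from_graph` and `self_value` are assumptions of the sketch.

```python
def adjacency_from_graph(nodes, edges, self_value=1):
    """Build a symmetric 0/1 adjacency matrix from undirected edges.

    The diagonal (same row node and column node) is set to self_value, which must be 0 or 1.
    """
    if self_value not in (0, 1):
        raise ValueError("self_value must be 0 or 1")
    index = {node: i for i, node in enumerate(nodes)}
    size = len(nodes)
    matrix = [[0] * size for _ in range(size)]
    for i in range(size):
        matrix[i][i] = self_value
    for a, b in edges:
        if a != b:  # edges only connect different nodes
            matrix[index[a]][index[b]] = 1
            matrix[index[b]][index[a]] = 1
    return matrix


nodes = ["Q", "T1", "T2", "T3"]
edges = [("Q", "T1"), ("Q", "T2"), ("T2", "T3")]  # hypothetical graph network
graph_matrix = adjacency_from_graph(nodes, edges, self_value=0)
print(graph_matrix)  # [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]]
```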

[0253] Optionally, the second determining module 706 is configured as follows:

[0254] The adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of the multiple candidate texts are input into the text filtering network to obtain the relevance score of each candidate text relative to the question to be answered.

[0255] Candidate texts with relevance scores greater than the first threshold are identified as the target texts.

[0256] Optionally, the second determining module 706 is configured as follows:

[0257] The semantic vector of the question to be answered and the semantic vectors of the multiple candidate texts are concatenated to obtain the concatenated semantic vector;

[0258] The concatenated semantic vector and the adjacency matrix are input into the hidden layer of the text filtering network to obtain a hidden layer feature vector group. The hidden layer feature vector group includes a hidden layer feature vector obtained by combining the semantic vectors of the question to be answered with the semantic vectors of the multiple candidate texts, and a hidden layer feature vector obtained by combining the semantic vectors of each candidate text with the semantic vectors of other candidate texts and the question to be answered.

[0259] The hidden layer feature vector group is input into a fully connected layer to obtain the relevance score of each candidate text relative to the question to be answered.
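
The three steps above can be sketched as a single forward pass. This is a rough illustration, not the network described in this application: it assumes a graph-convolution-style hidden layer (row-normalized adjacency matrix, one weight matrix, ReLU) and a sigmoid output, none of which is specified in this passage. All names and the toy weights are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def row_normalize(adj):
    """Divide each row by its sum so that a node averages over its related nodes."""
    normalized = []
    for row in adj:
        total = sum(row)
        normalized.append([value / total if total else 0.0 for value in row])
    return normalized


def hidden_layer(features, adj, weight):
    """Mix each node's vector with those of its related nodes, project, then apply ReLU."""
    mixed = matmul(row_normalize(adj), features)
    projected = matmul(mixed, weight)
    return [[max(0.0, value) for value in row] for row in projected]


def fully_connected(hidden, weight, bias):
    """Map each hidden feature vector to a relevance score in (0, 1)."""
    scores = []
    for row in hidden:
        z = sum(h * w for h, w in zip(row, weight)) + bias
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def relevance_scores(question_vec, candidate_vecs, adj, w_hidden, w_out, b_out):
    # Row 0 is the question; the remaining rows are the candidate texts.
    features = [question_vec] + candidate_vecs
    hidden = hidden_layer(features, adj, w_hidden)
    return fully_connected(hidden, w_out, b_out)[1:]  # drop the question's own row


# Toy inputs: candidate 1 is linked to the question, candidate 2 is isolated.
adj = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
scores = relevance_scores(
    question_vec=[1.0, 0.0],
    candidate_vecs=[[0.9, 0.1], [0.0, 1.0]],
    adj=adj,
    w_hidden=[[1.0, 0.0], [0.0, 1.0]],
    w_out=[1.0, -1.0],
    b_out=0.0,
)
print(scores)
```

In a trained network the weights would be learned; the identity matrix and hand-set output weights are used here only so the data flow is easy to follow.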

[0260] Optionally, the second determining module 706 is also configured to:

[0261] If there are multiple target texts, sort the target texts in descending order of relevance score, and output the sorted target texts in order.
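
As a brief illustration, the ordering step can be expressed with Python's built-in `sorted`. The tuple layout `(text, score)` and the function name are assumptions made for the sketch.

```python
def rank_target_texts(scored_targets):
    """Sort (text, relevance_score) pairs from highest to lowest score."""
    return sorted(scored_targets, key=lambda item: item[1], reverse=True)


ranked = rank_target_texts([("candidate text 3", 0.85), ("candidate text 2", 0.9)])
print(ranked)  # [('candidate text 2', 0.9), ('candidate text 3', 0.85)]
```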

[0262] Optionally, the first determining module 702 is configured as follows:

[0263] Feature extraction is performed on the question to be answered to determine the semantic vector of the question to be answered;

[0264] Obtain the semantic vectors of multiple texts in the text library;

[0265] Based on the semantic vector of the question to be answered and the semantic vectors of the multiple texts, a similarity score for each text relative to the question to be answered is determined;

[0266] The multiple candidate texts are determined based on the similarity score of each text relative to the question to be answered, and the semantic vectors of the multiple candidate texts are obtained.

[0267] Optionally, the first determining module 702 is configured as follows:

[0268] Multiple texts with similarity scores greater than a second threshold are selected as candidate texts.
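
A minimal sketch of this retrieval step is given below. It assumes cosine similarity as the similarity score, which is one common choice and is not named in this passage; the vectors, text ids, and function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_candidates(question_vec, text_vecs, second_threshold):
    """Keep the texts whose similarity to the question exceeds the second threshold."""
    scores = {tid: cosine_similarity(question_vec, vec) for tid, vec in text_vecs.items()}
    return {tid: score for tid, score in scores.items() if score > second_threshold}


library = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}  # toy semantic vectors
candidates = select_candidates([1.0, 0.0], library, second_threshold=0.5)
print(sorted(candidates))  # ['a', 'c']
```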

[0269] Optionally, the device further includes a training module, which is configured to:

[0270] Obtain a sample question, multiple sample texts, and a sample label for each sample text, wherein the sample label for each sample text is used to characterize the relevance of the sample text to the sample question;

[0271] Determine the semantic vector of the sample question and the semantic vector of each sample text, and construct an adjacency matrix based on the sample question and multiple sample texts;

[0272] The semantic vector of the sample question, the semantic vector of each sample text, and the adjacency matrix are input into the text filtering network and processed by its hidden layer to obtain a hidden layer feature vector group. The hidden layer feature vector group includes the hidden layer feature vector obtained by combining the semantic vector of the sample question with the semantic vectors of the multiple sample texts, and the hidden layer feature vector obtained by combining the semantic vector of each sample text with the semantic vectors of the other sample texts and the sample question.

[0273] The hidden layer feature vector group is input into a fully connected layer to obtain the relevance score of each sample text to the sample question.

[0274] The predicted label for each sample text is determined based on its relevance score to the sample question.

[0275] The text filtering network is trained based on the loss value between the predicted label and the sample label of each sample text until a training stop condition is reached.
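
The prediction and loss steps can be sketched as follows. The loss function is not named in this passage, so binary cross-entropy computed on the relevance scores is used here purely as an assumption, as is the 0.5 cutoff for turning a score into a predicted label. All names are hypothetical.

```python
import math


def predict_label(score, cutoff=0.5):
    """Turn a relevance score into a predicted label: 1 (relevant) or 0 (irrelevant)."""
    return 1 if score > cutoff else 0


def binary_cross_entropy(scores, labels, eps=1e-12):
    """Mean binary cross-entropy between relevance scores and 0/1 sample labels."""
    total = 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(scores)


sample_labels = [0, 1, 1]
good_scores = [0.1, 0.9, 0.8]
bad_scores = [0.9, 0.2, 0.3]

print([predict_label(s) for s in good_scores])  # [0, 1, 1]
print(binary_cross_entropy(good_scores, sample_labels) < binary_cross_entropy(bad_scores, sample_labels))  # True
```

Scores that agree with the sample labels give a lower loss, which is the quantity compared against the stopping threshold during training.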

[0276] Optionally, the training module is configured as follows:

[0277] If the loss value is less than or equal to the third threshold, training of the text filtering network is stopped;

[0278] If the loss value is greater than the third threshold, the text filtering network continues to be trained.

[0279] Optionally, the training module is configured as follows:

[0280] Each time the text filtering network is trained once based on the loss value between the predicted label and the sample label of each sample text, the number of training iterations is incremented by one.

[0281] If the number of iterations is less than or equal to the fourth threshold, continue training the text filtering network;

[0282] If the number of iterations exceeds the fourth threshold, training of the text filtering network is stopped.
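
The two stopping rules described above (loss value against the third threshold, iteration count against the fourth threshold) can be sketched with one loop. `train_step` is a hypothetical callable that performs one training pass and returns the loss; the description presents the two rules as alternatives, so either argument may be left as `None`.

```python
def train_until_stopped(train_step, loss_threshold=None, max_iterations=None):
    """Run train_step() repeatedly and return (iterations, last_loss) when a stop rule fires.

    loss_threshold plays the role of the third threshold, max_iterations the fourth.
    """
    if loss_threshold is None and max_iterations is None:
        raise ValueError("at least one stop rule is required")
    iterations = 0
    while True:
        loss = train_step()
        iterations += 1
        # Stop when the loss is less than or equal to the third threshold.
        if loss_threshold is not None and loss <= loss_threshold:
            return iterations, loss
        # Stop when the iteration count exceeds the fourth threshold.
        if max_iterations is not None and iterations > max_iterations:
            return iterations, loss


# Stand-in for a real training pass: replay a fixed sequence of loss values.
losses = iter([0.9, 0.5, 0.2, 0.05])
result = train_until_stopped(lambda: next(losses), loss_threshold=0.1)
print(result)  # (4, 0.05)
```

Following the wording literally, training continues while the count is less than or equal to the fourth threshold, so with a threshold of N the loop stops after pass N + 1.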

[0283] In this embodiment, based on the acquired question to be answered, the semantic vector of the question to be answered, multiple candidate texts, and the semantic vectors of the multiple candidate texts are determined. Each candidate text is a text in a text library that is semantically related to the question to be answered. An adjacency matrix is constructed based on the association between the question to be answered and the multiple candidate texts. The adjacency matrix is used to characterize the relevance between the question to be answered and the multiple candidate texts, as well as the relevance among the multiple candidate texts. The semantic vector of the question to be answered, the semantic vectors of the multiple candidate texts, and the adjacency matrix are input into a text filtering network to determine the target text. After multiple candidate texts are determined, the above method can further filter them through the text filtering network, deleting candidate texts that are irrelevant to the question to be answered and obtaining target texts that are highly relevant to it. This reduces the number of irrelevant texts recalled and improves the retrieval recall rate. Because the target texts are more relevant to the question to be answered, the answers determined from them are more accurate, which improves the performance of the question-answering system.

[0284] The above is an illustrative scheme of a text processing device according to this embodiment. It should be noted that the technical solution of this text processing device and the technical solution of the above-described text processing method belong to the same concept. For details not described in detail in the technical solution of the text processing device, please refer to the description of the technical solution of the above-described text processing method.

[0285] It should be noted that each component in the device claim should be understood as a functional module necessary to implement each step of the program flow or method, and the functional modules are not actual functional divisions or separations. A device claim defined by such a set of functional modules should be understood as a functional module architecture that implements the solution primarily through the computer program described in the specification, and not as a physical device that implements the solution primarily through hardware.

[0286] In one embodiment of this application, a computing device is also provided, including a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor executes the instructions to implement the steps of the text processing method described above.

[0287] The above is an illustrative scheme of a computing device according to this embodiment. It should be noted that the technical solution of this computing device and the technical solution of the above-described text processing method belong to the same concept. For details not described in detail in the technical solution of the computing device, please refer to the description of the technical solution of the above-described text processing method.

[0288] An embodiment of this application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the text processing method as described above.

[0289] The above is an illustrative scheme of a computer-readable storage medium according to this embodiment. It should be noted that the technical solution of this storage medium and the technical solution of the above-described text processing method belong to the same concept, and all details not described in detail in the technical solution of the storage medium can be found in the description of the technical solution of the above-described text processing method.

[0290] This application discloses a chip that stores computer instructions, which, when executed by a processor, implement the steps of the text processing method described above.

[0291] The foregoing has described specific embodiments of this application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

[0292] The computer instructions include computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunication signals, software distribution media, and the like. It should be noted that the content covered by the computer-readable medium may be expanded or reduced as appropriate according to the requirements of legislation and patent practice in the jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

[0293] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0294] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0295] The preferred embodiments disclosed above are merely illustrative of this application. The optional embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this application. These embodiments are selected and specifically described in this application to better explain the principles and practical applications of this application, thereby enabling those skilled in the art to better understand and utilize this application. This application is limited only by the claims and their full scope and equivalents.

Claims

1. A candidate text determination method, characterized in that the method includes: based on the obtained question to be answered, determining the semantic vector of the question to be answered, and obtaining the semantic vectors of multiple texts in the text library; based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the multiple texts, determining from the text library a first candidate text that is semantically related to the question to be answered; performing word segmentation on the question to be answered to obtain multiple first word units of the question to be answered; based on the weight value of each first word unit and the relevance value between each first word unit and each text in the text library, determining the similarity score of each text relative to the question to be answered, and determining the texts with similarity scores greater than the second threshold as the second candidate texts; and determining candidate texts based on the first candidate text and the second candidate text, wherein the candidate texts are the intersection or union of the first candidate text and the second candidate text.

2. The candidate text determination method of claim 1, wherein the number of candidate texts is multiple, and after determining the candidate texts based on the first candidate text and the second candidate text, the method further includes: determining the semantic vectors of the multiple candidate texts; constructing an adjacency matrix based on the association between the question to be answered and the multiple candidate texts, wherein the adjacency matrix is used to characterize the relevance between the question to be answered and the multiple candidate texts, as well as the relevance among the multiple candidate texts; and inputting the semantic vector of the question to be answered, the semantic vectors of the multiple candidate texts, and the adjacency matrix into the text filtering network to determine the target text.

3. The candidate text determination method of claim 2, wherein, before constructing the adjacency matrix based on the association between the question to be answered and the multiple candidate texts, the method further includes: obtaining the keywords of the question to be answered and the keywords of each candidate text; if the first candidate text contains a keyword corresponding to the keyword of the question to be answered, determining that the first candidate text and the question to be answered are related, wherein the first candidate text is any one of the plurality of candidate texts; if the first candidate text contains a keyword corresponding to the keyword of the second candidate text, determining that the first candidate text and the second candidate text are related, wherein the second candidate text is any candidate text other than the first candidate text among the plurality of candidate texts; and determining that the relationship between the question to be answered and itself is relevant and that the relationship between each candidate text and itself is relevant, or determining that the relationship between the question to be answered and itself is irrelevant and that the relationship between each candidate text and itself is irrelevant.

4. The candidate text determination method as described in claim 3, characterized in that, Constructing an adjacency matrix based on the association between the question to be answered and the multiple candidate texts includes: Using the question to be answered and the multiple candidate texts as nodes, and the nodes as rows and columns, with the row nodes and column nodes arranged in the same order, the elements at each position are determined based on the association relationship between the row nodes and column nodes corresponding to each position, thus obtaining the adjacency matrix.

5. The candidate text determination method of claim 4, wherein, The element at each position is determined based on the association between the row node and column node corresponding to each position, including: If the row node and column node corresponding to the target position are related, then the element of the target position is determined to be 1, wherein the target position is any position in the adjacency matrix; If the row node and column node corresponding to the target position are unrelated, then the element at the target position is determined to be 0.

6. The candidate text determination method of claim 2, wherein constructing an adjacency matrix based on the association between the question to be answered and the multiple candidate texts includes: using the question to be answered and the multiple candidate texts as nodes, connecting different nodes that are related to obtain a graph network; and using the nodes in the graph network as rows and columns, with the row nodes arranged in the same order as the column nodes, determining the element at each position based on whether there is an edge between the row node and column node corresponding to that position, thus obtaining the adjacency matrix.

7. The candidate text determination method of claim 6, wherein, The element at each position is determined based on whether there are edges between the row and column nodes corresponding to each position, including: If the row node and column node corresponding to the target position are not the same node and there is an edge in the graph network, then the element of the target position is determined to be 1, wherein the target position is any position in the adjacency matrix; If the row node and column node corresponding to the target position are not the same node and there is no edge in the graph network, then the element of the target position is determined to be 0; If the row node and column node corresponding to the target position are the same node, then the element at the target position is determined to be 1 or 0.

8. The candidate text determination method of claim 2, wherein, The semantic vector of the question to be answered, the semantic vectors of the multiple candidate texts, and the adjacency matrix are input into a text filtering network to determine the target text, including: The adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of the multiple candidate texts are input into the text filtering network to obtain the relevance score of each candidate text relative to the question to be answered. Candidate texts with relevance scores greater than a first threshold are identified as the target texts. If there are multiple target texts, sort the target texts in descending order of relevance score, and output the sorted target texts in order.

9. The candidate text determination method as described in claim 8, characterized in that, The adjacency matrix, the semantic vector of the question to be answered, and the semantic vectors of the multiple candidate texts are input into a text filtering network to obtain a relevance score for each candidate text relative to the question to be answered, including: The semantic vector of the question to be answered and the semantic vector of the multiple candidate texts are concatenated to obtain the concatenated semantic vector; The concatenated semantic vector and the adjacency matrix are input into the hidden layer of the text filtering network to obtain a hidden layer feature vector group. The hidden layer feature vector group includes a hidden layer feature vector obtained by combining the semantic vectors of the question to be answered with the semantic vectors of the multiple candidate texts, and a hidden layer feature vector obtained by combining the semantic vectors of each candidate text with the semantic vectors of other candidate texts and the question to be answered. The hidden layer feature vector group is input into a fully connected layer to obtain the relevance score of each candidate text relative to the question to be answered.

10. The candidate text determination method of claim 2, wherein the training method for the text filtering network is as follows: obtaining a sample question, multiple sample texts, and a sample label for each sample text, wherein the sample label for each sample text is used to characterize the relevance of the sample text to the sample question; determining the semantic vector of the sample question and the semantic vector of each sample text, and constructing an adjacency matrix based on the sample question and the multiple sample texts; inputting the semantic vector of the sample question, the semantic vector of each sample text, and the adjacency matrix into the text filtering network, and processing them through the hidden layer of the text filtering network to obtain a hidden layer feature vector group, wherein the hidden layer feature vector group includes the hidden layer feature vector obtained by combining the semantic vector of the sample question with the semantic vectors of the multiple sample texts, and the hidden layer feature vector obtained by combining the semantic vector of each sample text with the semantic vectors of the other sample texts and the sample question; inputting the hidden layer feature vector group into a fully connected layer to obtain the relevance score of each sample text relative to the sample question; determining the predicted label for each sample text based on its relevance score relative to the sample question; and training the text filtering network based on the loss value between the predicted label and the sample label of each sample text until a training stop condition is reached.

11. The candidate text determination method of claim 10, wherein training the text filtering network based on the loss value between the predicted label and the sample label of each sample text until a training stop condition is reached includes: if the loss value is less than or equal to the third threshold, stopping the training of the text filtering network, and if the loss value is greater than the third threshold, continuing to train the text filtering network; or, each time the text filtering network is trained once based on the loss value between the predicted label and the sample label of each sample text, incrementing the number of iterations by one, and if the number of iterations is less than or equal to the fourth threshold, continuing to train the text filtering network, and if the number of iterations is greater than the fourth threshold, stopping the training of the text filtering network.

12. The candidate text determination method of claim 1, wherein, Based on the obtained question to be answered, the semantic vector of the question to be answered is determined, and the semantic vectors of multiple texts in the text library are obtained, including: The question to be answered and multiple texts in the text library are segmented into words to obtain multiple first word units of the question to be answered and multiple second word units of each text. For each first word unit of the question to be answered and each second word unit of the text, word embedding processing is performed to map each first word unit and each second word unit into a low-dimensional vector space to obtain the word vector of each first word unit and each second word unit; The word vector of each first word unit and the word vector of each second word unit are input into the encoding layer for encoding processing to obtain the first feature vector of each first word unit and the second feature vector of each second word unit; The first feature vectors of multiple first word units of the question to be answered are concatenated to obtain the semantic vector of the question to be answered, and the second feature vectors of multiple second word units of the same text are concatenated to obtain the semantic vector of the multiple texts.

13. The candidate text determination method of claim 1, wherein, Based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the plurality of texts, a first candidate text semantically related to the question to be answered is determined from the text library, including: The semantic vector of the question to be answered is multiplied by the semantic vectors of the multiple texts, and the product is normalized to obtain the similarity between the question to be answered and each text. Based on the similarity between the question to be answered and each of the texts, a first candidate text that is semantically related to the question to be answered is determined from the text library.

14. The candidate text determination method of claim 1, wherein, The question to be answered is segmented into words to obtain multiple first word units of the question to be answered, including: The question to be answered is segmented according to a pre-compiled vocabulary to obtain multiple first word units of the question to be answered.
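
As an illustration of segmentation against a pre-compiled vocabulary, the sketch below uses forward maximum matching. The matching strategy is an assumption, since the claim does not name one; the vocabulary, the unsegmented input string, and the name `segment` are hypothetical.

```python
def segment(text, vocabulary, max_word_len=None):
    """Greedy forward maximum matching: at each position take the longest vocabulary entry.

    Characters that start no vocabulary entry are emitted as single-character units.
    """
    if max_word_len is None:
        max_word_len = max((len(word) for word in vocabulary), default=1)
    units = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                units.append(piece)
                i += length
                break
    return units


vocabulary = {"smallest", "natural", "number"}
units = segment("smallestnaturalnumber", vocabulary)
print(units)  # ['smallest', 'natural', 'number']
```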

15. The candidate text determination method of claim 1, wherein, Based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, the similarity score of each text relative to the question to be answered is determined. Before determining texts with similarity scores greater than a second threshold as second candidate texts, the method further includes: For each first word unit, determine the frequency of each first word unit in any text, the average length of all texts in the text library, and the length of any text; Based on the frequency, the average length, and the length of any text, determine the relevance value between each first word unit and any text; Determine the total number of all texts in the text library, and the number of texts in the text library that include any first word unit; The weight value of any first word unit is determined based on the total number and the number of texts including any first word unit.
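
The quantities named in this claim match the inputs of the BM25 scoring function, which the summary of this application mentions. The sketch below uses a common BM25 formulation; the parameters `k1` and `b`, their default values, the smoothed form of the weight, and measuring text length in word units are assumptions of the sketch rather than details given in the claim.

```python
import math


def weight_value(total_texts, texts_with_unit):
    """Weight of a word unit from the total number of texts and the number containing it."""
    return math.log((total_texts - texts_with_unit + 0.5) / (texts_with_unit + 0.5) + 1.0)


def relevance_value(freq, text_len, avg_len, k1=1.5, b=0.75):
    """Relevance of a word unit to one text from its frequency and the text lengths."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * text_len / avg_len))


def similarity_scores(question_units, library):
    """library maps a text id to that text's list of word units (already segmented)."""
    total = len(library)
    avg_len = sum(len(units) for units in library.values()) / total
    scores = {}
    for text_id, units in library.items():
        score = 0.0
        for unit in question_units:
            containing = sum(1 for other in library.values() if unit in other)
            score += weight_value(total, containing) * relevance_value(
                units.count(unit), len(units), avg_len
            )
        scores[text_id] = score
    return scores


library = {
    "t1": ["natural", "numbers", "are", "non-negative", "integers"],
    "t2": ["0", "is", "the", "smallest", "natural", "number"],
    "t3": ["prime", "numbers", "have", "two", "divisors"],
}
bm25 = similarity_scores(["smallest", "natural", "number"], library)
print(max(bm25, key=bm25.get))  # t2
```

Texts whose score exceeds the second threshold would then be kept as second candidate texts.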

16. The candidate text determination method of claim 1, wherein, Based on the first candidate text and the second candidate text, candidate texts are determined, including: The intersection of the first candidate text and the second candidate text is determined as the candidate text, or the union of the first candidate text and the second candidate text is determined as the candidate text.
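
The two combination options can be sketched with Python sets. The text ids and the `mode` parameter are hypothetical names introduced for the example.

```python
def combine_candidates(first_candidates, second_candidates, mode="union"):
    """Combine the semantic-retrieval and word-unit-scoring results by intersection or union."""
    first, second = set(first_candidates), set(second_candidates)
    if mode == "intersection":
        return first & second
    if mode == "union":
        return first | second
    raise ValueError("mode must be 'intersection' or 'union'")


first = ["text 1", "text 2", "text 3"]   # from semantic vector similarity
second = ["text 2", "text 3", "text 4"]  # from word unit weights and relevance values

print(sorted(combine_candidates(first, second, mode="intersection")))  # ['text 2', 'text 3']
print(sorted(combine_candidates(first, second, mode="union")))  # ['text 1', 'text 2', 'text 3', 'text 4']
```

The intersection keeps only texts that both methods agree on, while the union keeps every text that either method selected.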

17. A candidate text determination device, characterized in that, The device includes: The first determining module is configured to determine the semantic vector of the question to be answered based on the acquired question to be answered, and to acquire the semantic vectors of multiple texts in the text library; The second determining module is configured to determine a first candidate text semantically related to the question to be answered from the text library based on the similarity between the semantic vector of the question to be answered and the semantic vectors of the plurality of texts; The word segmentation module is configured to perform word segmentation on the question to be answered, thereby obtaining multiple first word units of the question to be answered. The third determining module is configured to determine the similarity score of each text relative to the question to be answered based on the weight value of each first word unit and the relevance value of each first word unit to each text in the text library, and determine the text with a similarity score greater than a second threshold as the second candidate text; The fourth determining module is configured to determine candidate texts based on the first candidate text and the second candidate text, wherein the candidate texts are the intersection or union of the first candidate text and the second candidate text.

18. A computing device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein, The processor implements the steps of the method of any of claims 1-16 when executing the instructions.

19. A computer-readable storage medium storing computer instructions, wherein, The instructions, when executed by the processor, implement the steps of the method of any of claims 1-16.

20. A computer program product, characterised in that, The computer program product comprises computer instructions which, when executed by the processor, implement the steps of the method of any of claims 1-16.