A Chinese question and answer information extraction method, system, device and storage medium
By combining multi-label type classification and deep learning models, the problems of multiple entities and multi-layer inference in complex Chinese questions are solved, a more accurate answer search path is achieved, and the intelligence of the Chinese question answering system is improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI'AN UNIVERSITY OF ARCHITECTURE AND TECHNOLOGY
- Filing Date
- 2022-05-19
- Publication Date
- 2026-06-12
AI Technical Summary
Existing knowledge graph-based question answering systems struggle to effectively handle the relationships between multiple entities and multiple inference steps in complex Chinese questions, resulting in incomplete answer search paths.
A multi-label type segmentation strategy is adopted, combining the BERT model and the XGBOOST model for question classification and entity linking, and the BERT model is used for semantic relation extraction to optimize question understanding and answer search path.
It improves the accuracy and rationality of Chinese question-and-answer systems, optimizes the answer search path for complex questions, and enhances the practical intelligence of machine question-and-answer systems.
Figure CN115357692B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of information extraction, and relates to a method, system, device and storage medium for extracting Chinese question and answer information.
Background Technology
[0002] With the rapid development of computer technology and the widespread application of the Internet, the resources people obtain through the network have exploded. How to quickly and accurately extract the required information from the ever-growing massive amount of data has become a major problem for researchers. A large proportion of this massive amount of data is unstructured free text, which contains a lot of valuable information. It is becoming increasingly difficult to extract this information manually, which requires machines to automatically help us analyze and extract valuable information. For example, in a question-answering system based on knowledge graphs, for the question "Who is the wife of star player A?", the mention of "A" may correspond to two entities in the knowledge graph: "<person_(position)>" and "<person_(position)>". In this sentence, the subject entity is obviously the former. Against this background, information extraction technology has emerged.
[0003] Information extraction technology can extract key information units from lengthy natural language expressions and present them in a structured form, helping us to automatically classify, extract, and reconstruct massive amounts of information, improving information processing speed and facilitating information access. However, early rule-based methods required the manual design of numerous rules, demanded specialized background knowledge, were prone to errors, and had poor domain transferability. Later, machine learning-based methods emerged that did not require rule design, but mostly required manual feature selection, the effectiveness of which directly affected the extraction results. Recent deep learning technologies offer advantages such as a rich variety of model frameworks and the ability to automatically extract features. Information extraction mainly includes three parts: named entity recognition, entity linking, and entity relationship extraction. The extracted information units mainly include entities, relationships, and events. Existing knowledge graph-based question-answering systems only handle structured, single-entity Chinese questions; that is, questions whose answer search path can be linked to the knowledge graph through a single triple. They cannot effectively solve complex problems such as multiple parallel entities and multi-layered inference steps in complex Chinese questions.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of the prior art and provide a Chinese question-and-answer information extraction method, system, device and storage medium, which optimizes the representation of the original dataset corpus after layer-by-layer transformation and abstraction, and helps to improve the practical intelligence of machine question-and-answer systems.
[0005] To achieve the above objectives, the present invention employs the following technical solution:
[0006] A method for extracting Chinese question-and-answer information includes the following steps:
[0007] S1. Classify Chinese questions, establish a BERT model based on the classification, input the Chinese questions into the BERT model, and output the Chinese question classification result labels;
[0008] S2. Based on the Chinese question classification result labels, perform named entity recognition on the questions to find the entities corresponding to the questions;
[0009] S3. Characterize the entities by features, use the XGBOOST model to calculate a score for each related entity based on these features, and take the entity with the highest score as the final linked entity;
[0010] S4. Construct the candidate relation set corresponding to the entity into sentence form, set up a semantic relation similarity task between each such sentence and the corresponding Chinese question, train the BERT model on this task, and finally output the highest-scoring relation as the optimal semantic relation path of the question-answer pair.
[0011] Preferably, in S1, Chinese questions are divided into two categories: direct result-oriented and indirect result-oriented.
[0012] Preferably, in S1, the BERT model includes three layers of representation vectors: word vector embedding, position vector embedding, and segment vector embedding. A question sentence sequence is used as the input to the model. Through the three layers of representation vectors of the BERT model, the probability of each category is predicted, and the category with the highest probability is used as the final output category label ClassLabel.
[0013] Furthermore, in the BERT model, each position i of the input is described by three quantities: the i-th input character (a portion of the characters is randomly masked), the i-th embedding vector, and the i-th feature vector after processing by the BERT model.
[0014] Preferably, in S3, the features are divided into the initial score of entity mentions, the length of entity mentions, the ratio of the length of entity mentions to the length of the question, the ranking of the entity, the reciprocal of the ranking of the entity, the semantic similarity between the question and the entity, the semantic similarity between the question and the entity suffix, the Jaccard coefficient between the question and the entity suffix, the maximum semantic similarity between the question and the entity candidate relation, and the maximum Jaccard coefficient between the question and the entity candidate relation.
[0015] Preferably, in S3, the score is the predicted probability that the candidate's label is the correct label.
[0016] Preferably, in S4, the candidate relation set is constructed into a sentence form, and its end is filled with the generalized tag <pad> so that it keeps the same length as the Chinese question.
[0017] A Chinese question-and-answer information extraction system, comprising:
[0018] The classification module is used to classify Chinese questions. Based on the classification, a BERT model is built. The Chinese questions are input into the BERT model, and the classification result labels of the Chinese questions are output.
[0019] The entity acquisition module is used to perform named entity recognition on questions based on the Chinese question classification result labels, and find the entity corresponding to the question.
[0020] The entity link confirmation module is used to segment entities by features. It uses the XGBOOST model to calculate a score for each related entity based on the segmented features, and the entity with the highest score is taken as the entity obtained by the final entity link.
[0021] The optimal semantic relation acquisition module is used to construct the candidate relation set corresponding to the entity into a sentence form, calculate the semantic relation similarity between the sentence and the corresponding Chinese question, train the BERT model for the task, and finally output the highest score as the optimal semantic relation path of the question-answer pair.
[0022] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the Chinese question-and-answer information extraction method as described in any of the preceding claims.
[0023] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the Chinese question-and-answer information extraction method described in any of the preceding claims.
[0024] Compared with the prior art, the present invention has the following beneficial effects:
[0025] Existing knowledge graph-based question-answering systems mostly handle questions with a simple structure that can be answered by querying a single triple path, which makes them ill-suited to complex Chinese questions involving multiple entities and multi-layered inference steps. To address this, the invention proposes a multi-label type classification strategy that provides a new way to understand and express Chinese questions. Questions are first categorized into direct result-oriented and indirect result-oriented types. After classification, a multi-feature ranking entity linking method based on the XGBOOST model is designed and combined with deep learning language models, establishing a new model framework that models and trains the question and the answer search path in both entity linking and relation extraction. This step selects the optimal entity for each mention identified in the question. Finally, a question and semantic relation extraction model based on the BERT language model is designed to obtain the final answer search path and thus find the answer. This significantly improves the accuracy and rationality of the Chinese question-answering system and remedies the incomplete search paths for complex questions in existing Chinese knowledge graph question-answering systems.
Attached Figure Description
[0026] Figure 1 is a structural diagram of the BERT single-sentence classification model of the present invention;
[0027] Figure 2 is a diagram of the multi-feature entity linking model of the present invention;
[0028] Figure 3 is a structural diagram of the BERT-based semantic relation extraction model for questions according to the present invention;
[0029] Figure 4 is a visual representation of the evaluation indicators for the question classification experiment of the present invention;
[0030] Figure 5 is a diagram of the evaluation of the multi-feature ranking fitting model of the present invention.
Detailed Implementation
[0031] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative effort are within the scope of protection of the present invention.
[0032] It should be noted that the terms "front," "back," "left," "right," "up," and "down" used in the following description refer to the directions shown in the attached figures, while the terms "inside" and "outside" refer to the directions toward or away from the geometric center of a specific component, respectively.
[0033] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and / or" as used herein includes any and all combinations of one or more of the associated listed items.
[0034] The Chinese question information extraction model method of the present invention includes the following steps:
[0035] (1) Chinese question classification based on BERT model multi-label strategy
[0036] a. Partitioning the Chinese question dataset using a multi-label strategy
[0037] 1) Direct Result-Oriented Problems
[0038] Direct result-oriented problems, also known as ordinary single-step problems, can be understood as problems with only one layer of linear structure. The output of the answer to a typical question is a relation or attribute value within a single triple corresponding to the entity in the question. For example, the question "Who wrote the poem 'River Snow'?" corresponds to the knowledge information in the knowledge graph that is a single triple (River Snow - Author - Liu Zongyuan (one of the "Eight Great Masters of Tang and Song Prose")), thus classifying it as a direct result-oriented problem.
[0039] Furthermore, this invention addresses questions such as "Who is the current president of Tsinghua University?" This question is also a direct result-oriented question. According to the sentence components to which the answer belongs, it is an object-oriented question. In order to reflect multi-label classification and more quickly locate the answer to a direct result-oriented question, we further divide this type of direct result-oriented question into subject-oriented questions and object-oriented questions based on the different positions of the answer.
[0040] 2) Indirect Result-Oriented Problems
[0041] Indirect result-oriented problems, also known as special multi-step problems, inquire about a specific relationship or attribute of an entity. However, the answer cannot be obtained solely from the entities adjacent to the given entity; multiple hops between entities are required to reach the range of candidate answers. An indirect result-oriented question provides the entity, and the relations traversed over multiple hops form a relational path; the answer is found through knowledge reasoning along this multi-hop path. For example, the question "Which folk festival uses the lantern riddle game to represent the festive atmosphere?" involves the entities "folk custom" and "lantern riddle." These are linked to two triples in the knowledge graph: (Lantern Festival (Chinese traditional festival) - category - folk custom) and (Lantern Festival (Chinese traditional festival) - festival activities - lantern riddle). Reasoning over these two triples is required to arrive at the answer, which is itself a node. This type of problem is called a complex fact-based problem. This invention further subdivides indirect result-oriented problems into multi-level inference problems and similar-item merging problems.
[0042] b. Construct a BERT-based Chinese question classification model
[0043] BERT is a novel bidirectional Transformer-based language model that, compared to unidirectional language models, can understand context more deeply. Therefore, this invention constructs a BERT-based Chinese question classification model. A sequence of question sentences is used as input to the model, and the model utilizes BERT's three layers of representation vectors: character vector embedding, position vector embedding, and segment vector embedding. For each position i, the model works with the i-th input character (a portion of the characters is randomly masked), the i-th embedding vector, and the i-th feature vector after processing by the fine-tuned BERT model. Finally, the probability of each category is predicted, and the category with the highest probability is selected as the final output category label, ClassLabel.
[0044] (2) Named entity recognition based on BERT-BiLSTM-CRF hybrid model
[0045] Based on the question classification obtained in step (1), named entity recognition is performed on the question to find the entity corresponding to the question.
[0046] a. Constructing a hybrid BERT-BiLSTM-CRF model
[0047] Named entity recognition based on the hybrid model BERT-BiLSTM-CRF first applies a BERT pre-trained model to the Chinese named entity recognition task, constructing a BERT-BiLSTM-CRF named entity recognition model. The BERT pre-trained model is used to obtain word vectors and extract text features, which are then used as input to the BiLSTM-CRF model. The BiLSTM-CRF model focuses on further performing downstream classification and prediction tasks, ultimately locating the entities within the user's question.
[0048] (3) Entity linking method based on multi-feature ranking of XGBOOST model
[0049] a. Entity selection from multiple feature perspectives
[0050] Entity linking refers to linking mentions of identified topic entities in a question to unique entities in the knowledge base. Because an identified mention cannot always be linked directly to one specific entity, one mention often corresponds to multiple entities. Therefore, this invention designs the following 10 features in three groups to complete the task of ranking candidate entities.
[0051] 1) Entity mention feature
[0052] S1: Initial score of the entity mention. A mention extracted by the mention recognition model receives an initial score S1 = 1. Because the extracted boundaries are often wrong, such a mention can only serve as a candidate, and further candidates are generated by expanding or deleting characters on its left and right. Each added or deleted character deducts 0.1 points. The maximum expansion is 5 characters, and the minimum deletion is 1 character.
[0053] S2: Length of the entity mention. The number of characters in the mention corresponding to the entity.
[0054] S3: The ratio of the length of the entity mention to the length of the question. The proportion of the number of characters in the entity mention to the total number of characters in the question.
[0055] 2) Entity features
[0056] S4: Entity Ranking. The entity mention triple in the knowledge graph contains the specific ranking of each entity corresponding to the mention, i.e., priority 0, 1, 2, ... For example, "<person_(position)>" and "<person_(position)>" correspond to rankings of 1 and 5, respectively.
[0057] S5: The reciprocal of the rank corresponding to the entity. If it is 0, set it to 1; otherwise, set it to 1 / rank.
[0058] S6: Semantic similarity between questions and entities. The similarity measure here uses a relation similarity extraction model; the higher the value, the higher the semantic similarity.
[0059] S7: Semantic similarity between questions and entity suffixes. Entity suffixes refer to the part of the entity name in parentheses in the entity knowledge triple; this information can be used to complete entity disambiguation tasks.
[0060] S8: Jaccard coefficient for questions and entity suffixes. Here, the Jaccard coefficient refers to the ratio of the number of characters in the intersection to the number of characters in the union of two strings. The larger the coefficient value, the higher the degree of character overlap.
[0061] 3) Relationship characteristics
[0062] S9: Maximum semantic similarity between the question and the entity's candidate relations. This is the similarity value of the candidate relation that is semantically most similar to the question, taken over all candidate relations of the entity.
[0063] S10: Maximum Jaccard coefficient between the question and the entity's candidate relations. This is the Jaccard coefficient of the candidate relation with the greatest character overlap with the question, taken over all candidate relations of the entity.
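The purely string-based features in the list above (S1 to S3, S5, and the Jaccard coefficient behind S8 and S10) can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the Jaccard coefficient is computed over character sets, and the entity suffix is assumed to be enclosed in ASCII parentheses.

```python
def jaccard(a, b):
    """Character-level Jaccard coefficient: |A intersect B| / |A union B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def mention_features(question, mention, chars_changed=0):
    """S1-S3 for one candidate mention.

    chars_changed is the number of boundary characters added or deleted
    relative to the mention returned by the recognition model.
    """
    return {
        "S1": round(1.0 - 0.1 * chars_changed, 2),  # 0.1 deducted per character
        "S2": len(mention),                         # mention length
        "S3": len(mention) / len(question),         # share of the question
    }


def reciprocal_rank(rank):
    """S5: 1 when the rank is 0, otherwise 1 / rank."""
    return 1.0 if rank == 0 else 1.0 / rank


def entity_suffix(entity):
    """Text inside the parentheses of an entity name, or '' if there is none."""
    start, end = entity.find("("), entity.rfind(")")
    return entity[start + 1:end] if 0 <= start < end else ""


if __name__ == "__main__":
    question = "Which festival is the Lantern Festival?"
    entity = "Lantern Festival (Chinese traditional festival)"
    print(mention_features(question, "Lantern Festival"))
    print(reciprocal_rank(5))
    print(jaccard(question, entity_suffix(entity)))  # S8-style overlap
```

The semantic-similarity features (S6, S7, S9) depend on the trained relation similarity model and are therefore not covered by this sketch.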
[0064] b. Based on the multi-feature ranking method designed in step a above, an entity linking method based on XGBOOST multi-feature ranking is designed. The model is fitted and trained on the ten features in three groups. During prediction, its binary classification output is used to calculate a score for each entity (the predicted probability that the entity's label is the correct label), and the entity with the highest score is selected as the final linked entity.
[0065] (4) Question and semantic relation extraction method based on BERT model
[0066] The question and semantic relation extraction method determines the semantic relation between the entity and the question after entity linking has been completed in step (3).
[0067] a. Construct candidate triplet sentence pairs
[0068] The triples given in the dataset are transformed into sentence structures. The beginning and end of the sentence are marked with the special tags [CLS] and [SEP], respectively, such as "[CLS]<name_(position)>---<wife>---<zodiac sign>---<pad>[SEP]", in which "<pad>" indicates a special tag used to generalize an entity.
[0069] b. Establish a BERT-based model for extracting questions and semantic relations.
[0070] 1) The candidate triples from step a are formed into short sentences, each of which is paired with the question to create a sentence-level semantic similarity task. For example, the question "Which country is the land of a thousand islands?" and the triple path "<pad>--<Alternative Name>--<Land of a Thousand Islands>" constitute a sentence pair.
[0071] 2) Fine-tune the BERT model, train it on the task set, and finally output the highest score as the optimal semantic relationship path of the question-answer pair.
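A minimal sketch of how a candidate triple could be turned into a short sentence and paired with the question, assuming the usual BERT sentence-pair layout `[CLS] A [SEP] B [SEP]`. The helper names and the use of `None` to mark the generalized slot are illustrative assumptions; the separator `--` follows the example path in the text.

```python
from typing import Optional, Tuple

PAD = "<pad>"  # special tag that generalizes the unknown entity


def triple_to_sentence(subject: Optional[str], relation: Optional[str],
                       obj: Optional[str]) -> str:
    """Render a triple as a short sentence; a None slot becomes <pad>."""
    parts = [PAD if part is None else f"<{part}>" for part in (subject, relation, obj)]
    return "--".join(parts)


def build_pair(question: str, triple: Tuple[Optional[str], Optional[str], Optional[str]]) -> str:
    """Concatenate question and triple sentence in sentence-pair layout."""
    return f"[CLS]{question}[SEP]{triple_to_sentence(*triple)}[SEP]"


if __name__ == "__main__":
    triple = (None, "Alternative Name", "Land of a Thousand Islands")
    print(build_pair("Which country is the land of a thousand islands?", triple))
```

In practice a BERT tokenizer inserts the special tokens itself; the string form here only makes the pairing explicit.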
[0072] The specific process is as follows:
[0073] (1) Chinese question classification based on BERT model multi-label strategy
[0074] This invention establishes a BERT-based question classification model to classify and train different question types, thereby improving the accuracy of obtaining answers and reducing the dataset of candidate answers.
[0075] a. Classification of direct / indirect result-oriented questions
[0076] Questions whose answers correspond to only a single triple query are classified as direct result-oriented questions, while those whose answers require multiple triple queries are classified as indirect result-oriented questions. This invention trains on the questions and corresponding SPARQL queries given in the training corpus. The number of variables within the curly braces of each SPARQL query is used as the classification criterion. Questions with three or fewer variables within the curly braces are classified as direct result-oriented questions and labeled 0. Questions with more than three variables within the curly braces are classified as indirect result-oriented questions and labeled 1. A BERT model is then used to train a binary classification model. Table 1 below shows examples of the two types of questions.
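The labeling rule can be sketched as follows. This is an illustration under stated assumptions: every term inside the braces (entity, relation, or variable) is counted, terms contain no whitespace, and the `.` separator between triple patterns is written as its own token. The function names are hypothetical.

```python
import re


def count_terms_in_braces(sparql):
    """Count the terms inside the outermost { ... } of a SPARQL query."""
    match = re.search(r"\{(.*)\}", sparql, flags=re.DOTALL)
    if match is None:
        raise ValueError("no graph pattern in braces found")
    # Ignore the '.' tokens that separate triple patterns.
    return len([tok for tok in match.group(1).split() if tok != "."])


def direct_or_indirect(sparql):
    """Return 0 (direct result-oriented) or 1 (indirect result-oriented)."""
    return 0 if count_terms_in_braces(sparql) <= 3 else 1


if __name__ == "__main__":
    single = "select ?x where { <River_Snow> <author> ?x . }"
    multi = "select ?y where { <A> <wife> ?x . ?x <zodiac_sign> ?y . }"
    print(direct_or_indirect(single), direct_or_indirect(multi))
```

The resulting 0 / 1 labels would serve as training targets for the binary BERT classifier.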
[0077] Table 1 Examples of Direct / Indirect Result-Oriented Questions
[0078]
[0079] For single-sentence classification tasks, the basic BERT model framework is fine-tuned. Specifically, the last-layer output for the first token, [CLS], is used directly as the fused representation of the entire sentence and is then passed through a multilayer perceptron for classification. The model structure is shown in Figure 1.
[0080] The final calculation formula is:
[0081] $P = \mathrm{softmax}(W C + b), \quad P \in \mathbb{R}^{K}$ (1)
[0082] Here softmax is the activation function that calculates the probability distribution over the classes, $C$ is the last-layer output for the [CLS] token, $W$ is the weight matrix of the hidden layer, $b$ is the bias, and $K$ represents the number of categories.
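As a worked illustration of this classification head (a softmax over a weighted, biased projection of the [CLS] output), the following sketch uses plain Python lists. The vector and weight values are made up, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias):
    """Project the [CLS] vector to K logits and return (label, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


if __name__ == "__main__":
    # Toy 2-dimensional [CLS] vector and K = 2 categories.
    label, probs = classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
    print(label, probs)
```

The label with the highest probability corresponds to the output category label, ClassLabel.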
[0083] b. Question sentence component classification
[0084] Sentence component classification applies to single-hop questions, where the answer corresponds to the subject, predicate, or object of the triple. When the topic entity of a question is known, it is still unknown whether that entity occupies the subject or the object position in the knowledge base triple, so this invention divides direct result-oriented questions according to the component being asked for, such as subject-seeking and object-seeking questions. Based on the position of the question mark in the triple of the SPARQL statement for direct result-oriented questions, the data for single-hop questions is divided into three categories, labeled 0, 1, and 2 respectively. Data samples are shown in Table 2. A three-class classification model is then trained based on BERT.
[0085] Table 2 Classification of Component Types in Inquiry Sentences
[0086]
[0087] c. Classification of multi-level inference problems
[0088] Multilevel inference problems refer to queries involving multiple triple queries, with the triples exhibiting a progressive relationship. These complex problems typically contain multiple relational attributes within their queries. Based on whether the triples in the SPARQL statement exhibit a progressive relationship, all data is segmented into chained and non-chained problems. Since single-hop problems may also have multiple entities within the query, multi-hop problems are not directly categorized into chained and multi-entity problems.
[0089] A binary classification model is then trained. Table 3 shows examples of chain question classification.
[0090] Table 3 Examples of Chain Question Classification
[0091]
[0092] (2) Named entity recognition based on BERT-BiLSTM-CRF hybrid model
[0093] a. Constructing a BERT pre-trained model
[0094] Given a Chinese question such as "What is the capital of Shaanxi Province?", the sentence is first converted into an input sequence before being fed into the BERT model. The sequence is represented by the three embeddings (character, position, and segment). After this character-level input sequence passes through multiple Transformer encoder layers, the output is the corresponding sequence of semantic vector representations.
[0095] b. Establish a BiLSTM-CRF model
[0096] 1) In an ordinary LSTM model, the hidden layer at each time step receives only the state from the previous time step. Semantic feature vectors can therefore be computed only in the forward direction, and the output state of the hidden layer at the next time step is not available. In text processing tasks, considering contextual information greatly aids in correctly parsing sentences. For example, in the sentence "Where is the Shaanxi Provincial People's Hospital in Beilin District?", a unidirectional LSTM model used for entity recognition might output "Shaanxi Province" as a location entity, because in the forward direction alone "People's Hospital" has no necessary connection to "Shaanxi Province". Therefore, a second LSTM layer that reads the sentence in the opposite direction is added to the unidirectional one, forming a bidirectional LSTM (BiLSTM). This allows the model to learn feature information in both the forward and backward directions; the output of the BiLSTM is the concatenation of the output vectors from the two directions into a new vector.
[0097] 2) To ensure that the predicted label sequence is valid, a CRF (Conditional Random Field) layer is added on top of the BiLSTM in 1). The BiLSTM output provides the emission score, and the CRF introduces a second score, the transition score, which is the score of moving from the label of the previous state to the label of the current state. The CRF considers the global context, accumulating the previous scores into the score of the current position. To obtain the final output, the optimal score must be decoded backwards to find the best path, which is taken as the correct label output. The Viterbi algorithm is a dynamic programming algorithm that finds the most likely sequence of hidden states (the Viterbi path) for an observed sequence. Therefore, the Viterbi algorithm is used to decode the score result in reverse, and the calculation formula is as follows:
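A minimal sketch of Viterbi decoding over emission and transition scores, assuming the common formulation in which the score of a label path is the sum of its emission scores and the transition scores between consecutive labels. The score values are toy numbers and the function name is hypothetical.

```python
def viterbi_decode(emissions, transitions):
    """Return (best label path, best path score).

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        step_scores, step_back = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at this position.
            best_prev = max(range(n_labels), key=lambda i: scores[i] + transitions[i][j])
            step_scores.append(scores[best_prev] + transitions[best_prev][j] + emission[j])
            step_back.append(best_prev)
        scores = step_scores
        backpointers.append(step_back)
    # Decode in reverse from the best final label.
    best_last = max(range(n_labels), key=scores.__getitem__)
    path = [best_last]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path, scores[best_last]


if __name__ == "__main__":
    emissions = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    transitions = [[0.0, -5.0], [-5.0, 0.0]]  # switching labels is penalized
    print(viterbi_decode(emissions, transitions))
```

In this toy example, picking the highest emission at each position independently would give the path 0, 1, 0, but the transition penalty makes the globally best path 0, 0, 0, which is the kind of invalid-sequence correction the CRF layer is meant to provide.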
[0098] (3) Entity linking method based on multi-feature ranking of XGBOOST model
[0099] a. Dataset preprocessing
[0100] Each sample group in the training dataset includes a question, the entity mention in the question, and, for each possible candidate entity, the values of the ten features in the three groups described above. In the training set, each sample is labeled, with 1 for correct entities and 0 for incorrect entities.
[0101] b. Constructing an XGBOOST binary classification model
[0102] The objective function of XGBOOST is:
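[0103] $Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C$ (2)
[0104] In equation (2), $l$ is the loss function; $\Omega(f_t)$ is a regularization term used to control the complexity of the tree; $C$ is a constant term; and $f_t(x_i)$ is the predicted value of the new tree. The overall prediction is obtained by summing the outputs of the $K$ trees, $\hat{y}_i = \sum_{k=1}^{K} f_k(x_i)$.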
[0105] The XGBoost tree is constructed by extending a node into two branches, and this process of splitting nodes layer by layer eventually forms the entire tree. There are two ways to split tree nodes: a greedy algorithm that enumerates all different tree structures, and an approximate algorithm. The greedy algorithm starts at tree depth 0. For each node, it traverses all features. For a given feature, it first sorts it by its value, then linearly scans the feature to determine the best split point. Finally, after splitting all features, it selects the feature with the highest gain. The gain is calculated as follows:
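[0106] $Gain = \frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$ (3), where $G_L$, $H_L$ and $G_R$, $H_R$ are the sums of the first- and second-order gradients of the loss over the samples in the left and right child nodes, and $\lambda$ and $\gamma$ are regularization coefficients.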
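The greedy split search described above (sort one feature by value, scan linearly, keep the split with the highest gain) can be sketched as follows. The gain used here is the standard XGBoost split gain from the library's documentation, which is assumed to correspond to equation (3); `lam` and `gamma` are the regularization coefficients, and the gradient and Hessian values are toy numbers.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left and right children."""
    def leaf_score(g, h):
        return g * g / (h + lam)

    return 0.5 * (
        leaf_score(g_left, h_left)
        + leaf_score(g_right, h_right)
        - leaf_score(g_left + g_right, h_left + h_right)
    ) - gamma


def best_split(values, grads, hessians, lam=1.0, gamma=0.0):
    """Sort one feature by value and scan every split position.

    Returns (best gain, threshold); the threshold is None if no split exists.
    """
    order = sorted(range(len(values)), key=values.__getitem__)
    g_total, h_total = sum(grads), sum(hessians)
    g_left = h_left = 0.0
    best = (float("-inf"), None)
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hessians[i]
        if values[i] == values[nxt]:
            continue  # no threshold separates equal feature values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left, lam, gamma)
        if gain > best[0]:
            best = (gain, (values[i] + values[nxt]) / 2)
    return best


if __name__ == "__main__":
    print(best_split([1, 2, 3, 4], [-1.0, -1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]))
```

Repeating this scan for every feature and keeping the feature with the highest gain gives the split chosen by the greedy algorithm for one node.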
[0107] The model outputs a sequence of probabilities. The `round(value)` function is applied to these outputs, yielding a label sequence containing only 0s and 1s. For each question in the corpus, the candidate entities labeled 1 are considered optimal and form the final result set for entity linking. The entity disambiguation model is shown in Figure 2.
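The post-processing of the probability outputs can be illustrated with a short sketch. The candidate names and probabilities are made up, and both helper names are hypothetical; the second helper shows the alternative of taking the single highest-scoring candidate per question, as in the ranking step.

```python
def label_candidates(probabilities):
    """Turn probabilities into 0 / 1 labels with Python's built-in round()."""
    # Note: Python rounds exact halves to the nearest even integer, so 0.5 -> 0.
    return [round(p) for p in probabilities]


def link_entity(candidates):
    """Pick the candidate entity with the highest probability.

    candidates is a list of (entity, probability) pairs for one question.
    """
    entity, _ = max(candidates, key=lambda pair: pair[1])
    return entity


if __name__ == "__main__":
    scored = [("entity_a", 0.91), ("entity_b", 0.12), ("entity_c", 0.50)]
    print(label_candidates([p for _, p in scored]))
    print(link_entity(scored))
```

Taking the maximum guarantees exactly one linked entity per question, whereas rounding can return zero or several candidates labeled 1.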
[0108] (4) A method for extracting semantic relations of questions based on the BERT model
[0109] The question-path semantic similarity model is used to determine which relation is semantically most relevant to the question after entity linking has been completed, because an identified entity has a set of multiple candidate relations. For example, for the question "What is the birthday of tennis champion Li Mou?", according to the corresponding SPARQL query, the candidate relation set for the entity includes "nationality", "husband", "major achievements", "date of birth", etc. The question and each candidate relation are therefore treated as two concatenated sentences, and a BERT-based binary classification model for the semantic similarity between question and relation is designed. First, word vector embedding is performed on the question and the candidate relation set to obtain the question sequence and the relation sequence. These sequences are then fed into the BERT model for processing, and the result is classified through a multilayer perceptron. The model structure is shown in Figure 3.
[0110] The final calculation formula is:
[0111] $P = \mathrm{softmax}(W C + b), \quad P \in \mathbb{R}^{K}$ (4)
[0112] Here softmax is the activation function that calculates the probability distribution over the classes, $C$ is the last-layer output for the [CLS] token, $W$ is the weight matrix of the hidden layer, $b$ is the bias, and $K$ represents the number of categories.
[0113] The model is trained using a 1:5 ratio of positive to negative examples, with each positive example labeled 1 and its five negative examples labeled 0. The trained question-relation semantic similarity model is used to calculate the similarity between the question and each candidate relation (the probability of being classified as label 1), and the relations are then sorted by this value. The relation with the highest similarity is selected to search for the final answer.
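A minimal sketch of the 1:5 sampling and the final ranking step. The sampling strategy (uniform random choice among the entity's other candidate relations), the fixed seed, and the function names are assumptions made for illustration; the similarity scores in the example are made up.

```python
import random


def build_training_pairs(question, gold_relation, candidate_relations, n_neg=5, seed=0):
    """One positive pair (label 1) plus up to n_neg sampled negatives (label 0)."""
    rng = random.Random(seed)
    negatives = [rel for rel in candidate_relations if rel != gold_relation]
    sampled = rng.sample(negatives, min(n_neg, len(negatives)))
    return [(question, gold_relation, 1)] + [(question, rel, 0) for rel in sampled]


def rank_relations(scores):
    """Sort relations by descending probability of label 1."""
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    relations = ["nationality", "husband", "major achievements", "date of birth",
                 "occupation", "height", "coach"]
    for pair in build_training_pairs("What is the birthday of tennis champion Li Mou?",
                                     "date of birth", relations):
        print(pair)
    print(rank_relations({"nationality": 0.10, "date of birth": 0.90, "husband": 0.30})[0])
```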
[0114] Experimental description
[0115] This invention uses publicly available evaluation data from CCKS2019-CKBQA. CCKS2019-CKBQA includes three question-answering datasets, all of which were manually constructed and labeled. The Institute of Computer Technology at Peking University provided 3 / 4 of the data, which is open-domain question-answering data, and Hundsun Technologies Inc. provided the remaining 1 / 4, which is financial-domain question-answering data. Since complex questions constitute a large proportion of the CCKS2019-CKBQA data, this invention also includes a portion of the NLPCC 2016 evaluation dataset, which mainly consists of simple questions. A total of 3200 question-answering data points were selected as the sample set for this experiment, and the sample set was divided into training, validation, and test sets in a 5:2:1 ratio.
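As a quick check of the split arithmetic, the following sketch computes the subset sizes for a given total and ratio. The function name and the choice to assign any rounding remainder to the training set are assumptions.

```python
def split_sizes(total, ratio=(5, 2, 1)):
    """Sizes of the training, validation, and test sets for a given ratio."""
    unit = total / sum(ratio)
    sizes = [int(unit * part) for part in ratio]
    sizes[0] += total - sum(sizes)  # any rounding remainder goes to training
    return sizes


if __name__ == "__main__":
    train, valid, test = split_sizes(3200)
    print(train, valid, test)
```

For 3200 samples this gives 2000 training, 800 validation, and 400 test samples.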
[0116] Question Classification Experiment
[0117] The BERT pre-trained model used in this invention is "BERT-Base, Chinese", implemented with the TensorFlow framework. It has a 12-layer encoder, and the hidden state output of each layer has a dimension of 768. The maximum length of a Chinese question is 60. The model parameters are updated and fine-tuned with the Adam optimization algorithm, with an initial learning rate of 2e-5. Batch training is used with a batch size of 32, and the dropout ratio is left at its default of 0.1. The maximum number of iterations is 100; during training the model is saved every 50 steps and validated on a development set. The performance of the proposed question classification method is evaluated with the macro-averaged precision P. Figure 4 shows the evaluation metrics of the question classification experiment. The accuracy of the direct/indirect guided question classification is only 88.14%. Analysis of the result set shows that some indirect guided questions can in fact be solved with the method for direct guided questions: an alias mention can be linked to the main entity through entity linking, without needing an extra triple. Many such alias questions exist, which explains the lower performance. The accuracy of each of the other models is above 93%.
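For reference, the evaluation metric can be sketched as follows, assuming that the macro metric P is the per-class precision averaged over all classes. The function name, the convention of counting a never-predicted class as 0, and the toy labels are assumptions, not values from the experiment.

```python
def macro_precision(y_true, y_pred):
    """Average of per-class precision over all classes seen in the data.

    A class that is never predicted contributes a precision of 0.0
    (an assumed convention).
    """
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        # Gold labels of the samples that were predicted as class c.
        gold_of_predicted = [t for t, p in zip(y_true, y_pred) if p == c]
        if not gold_of_predicted:
            per_class.append(0.0)
        else:
            correct = sum(1 for t in gold_of_predicted if t == c)
            per_class.append(correct / len(gold_of_predicted))
    return sum(per_class) / len(classes)

# Toy labels for the direct/indirect guided classification task.
y_true = ["direct", "direct", "indirect", "indirect"]
y_pred = ["direct", "indirect", "indirect", "indirect"]
score = macro_precision(y_true, y_pred)
print(round(score, 4))
```

In this toy case the precision is 1 for "direct" and 2/3 for "indirect", so the macro average is 5/6.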
[0118] Multi-feature ranking fitting experiment
[0119] This experiment uses the three groups of features proposed in step (2) above: topic entity mention features, entity features, and relation features. In the training data, the correct entity is labeled 1 and the remaining candidate entities are labeled 0. The XGBOOST model is fitted to these features to complete the binary classification task. The trained model is then used on the training, validation, and test sets to score each candidate entity (the probability of classifying it as label 1), and the entity ranked first is selected. In addition, the XGBOOST model is compared with other classification models, namely KNN (nearest neighbor classifier) and SVM (support vector machine classifier), on the feature fusion and ranking classification task. The evaluation metrics used are precision, recall, and F1 score. The comparison of experimental results is shown in Figure 5: XGBoost outperforms the other two classification models in precision, recall, and F1 score.
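The selection of the first-ranked entity can be sketched as below. The scores stand in for the probability of label 1 produced by the fitted classifier; the entity identifiers, the scores, and the function names are hypothetical, and no XGBoost call is made here so that the sketch stays self-contained.

```python
def top_entity(scored_candidates):
    """Return the candidate entity with the highest score.

    scored_candidates is a list of (entity_id, probability_of_label_1).
    """
    return max(scored_candidates, key=lambda pair: pair[1])[0]

def top1_accuracy(questions):
    """Fraction of questions whose first-ranked entity is the gold entity.

    questions is a list of (gold_entity_id, scored_candidates).
    """
    hits = sum(1 for gold, cands in questions if top_entity(cands) == gold)
    return hits / len(questions)

# Hypothetical scores for two questions with three candidates each.
questions = [
    ("<A_(athlete)>", [("<A_(athlete)>", 0.91), ("<A_(singer)>", 0.07), ("<A_(film)>", 0.02)]),
    ("<B_(city)>", [("<B_(company)>", 0.55), ("<B_(city)>", 0.40), ("<B_(river)>", 0.05)]),
]
linked = [top_entity(cands) for _, cands in questions]
accuracy = top1_accuracy(questions)
print(linked, accuracy)
```

The second toy question shows a linking error: the gold entity is ranked second, so the top-1 accuracy over the two questions is 0.5.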
[0120] The following are apparatus embodiments of the present invention, which can be used to execute the method embodiments of the present invention. For details not disclosed in the apparatus embodiments, please refer to the method embodiments of the present invention.
[0121] In another embodiment of the present invention, a Chinese question-and-answer information extraction system is provided. This system can be used to implement the above-mentioned Chinese question-and-answer information extraction method. Specifically, the system includes a classification module, an entity acquisition module, an entity link confirmation module, and an optimal semantic relationship acquisition module.
[0122] The classification module is used to classify Chinese questions. Based on the classification, a BERT model is built. The Chinese questions are input into the BERT model, and the output is the Chinese question classification result label.
[0123] The entity acquisition module is used to perform named entity recognition on questions based on the Chinese question classification result labels, and find the entity corresponding to the question.
[0124] The entity link confirmation module is used to perform feature division on the entities. It uses the XGBOOST model to calculate a score for each related entity based on the divided features, and the entity with the highest score is taken as the final linked entity.
[0125] The optimal semantic relation acquisition module is used to construct the candidate relation set corresponding to the entity into sentence form, and to perform the semantic relation similarity calculation task between the sentence and the corresponding Chinese question. A BERT model is trained on this task, and the highest-scoring candidate is output as the optimal semantic relation path of the question-answer pair.
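The cooperation of the four modules can be outlined as a simple pipeline. This is only a structural sketch: the class name, the callable signatures, and the stub implementations are hypothetical placeholders for the BERT classifier, the named entity recognizer, the XGBOOST-based entity linker, and the BERT-based relation scorer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class QAPipeline:
    classify: Callable[[str], str]                # classification module
    recognize: Callable[[str, str], List[str]]    # entity acquisition module
    link: Callable[[str, List[str]], str]         # entity link confirmation module
    best_relation: Callable[[str, str], str]      # optimal semantic relation module

    def run(self, question: str) -> Dict[str, str]:
        label = self.classify(question)
        candidates = self.recognize(question, label)
        entity = self.link(question, candidates)
        relation = self.best_relation(question, entity)
        return {"label": label, "entity": entity, "relation": relation}

# Stub modules with fixed outputs, for illustration only.
pipeline = QAPipeline(
    classify=lambda q: "direct",
    recognize=lambda q, label: ["<Li Mou_(athlete)>", "<Li Mou_(actor)>"],
    link=lambda q, cands: cands[0],
    best_relation=lambda q, entity: "date of birth",
)
result = pipeline.run("What is the birthday of tennis champion Li Mou?")
print(result)
```

Keeping each module behind a plain callable makes it possible to replace a stub with a trained model without changing the control flow.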
[0126] In another embodiment of the present invention, a terminal device is provided, comprising a processor and a memory. The memory stores a computer program that includes program instructions, and the processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor is the computing core and control core of the terminal. It is suitable for implementing one or more instructions, and specifically for loading and executing one or more instructions to realize the corresponding method flow or function. The processor described in this embodiment can be used to run a Chinese question-and-answer information extraction method, including: S1, classifying Chinese questions, establishing a BERT model according to the classification, inputting the Chinese questions into the BERT model, and outputting Chinese question classification result labels; S2, performing named entity recognition on the questions based on the Chinese question classification result labels to find the entities corresponding to the questions; S3, performing feature division on the entities, using the XGBOOST model to calculate a score for each related entity according to the divided features, and taking the entity with the highest score as the final linked entity; S4, constructing the candidate relation set corresponding to the entity into sentence form, performing the semantic relation similarity calculation task between the sentence and the corresponding Chinese question, training a BERT model on the task, and finally outputting the highest-scoring candidate as the optimal semantic relation path of the question-and-answer pair.
[0127] In another embodiment, the present invention also provides a computer-readable storage medium (Memory), which is a memory device in a terminal device for storing programs and data. It is understood that the computer-readable storage medium here may include both the built-in storage medium in the terminal device and extended storage media supported by the terminal device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, the storage space also stores one or more instructions suitable for loading and execution by a processor, which may be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here may be high-speed RAM or non-volatile memory, such as at least one disk storage device.
[0128] One or more instructions stored in the computer-readable storage medium can be loaded and executed by the processor to implement the corresponding steps of the Chinese question-and-answer information extraction method in the above embodiments. Specifically, the one or more instructions in the computer-readable storage medium are loaded by the processor to execute the following steps: S1, classify Chinese questions, establish a BERT model according to the classification, input the Chinese questions into the BERT model, and output Chinese question classification result labels; S2, perform named entity recognition on the questions based on the Chinese question classification result labels to find the entities corresponding to the questions; S3, perform feature division on the entities, use the XGBOOST model to calculate a score for each related entity according to the divided features, and take the entity with the highest score as the final linked entity; S4, construct the candidate relation set corresponding to the entity into sentence form, perform the semantic relation similarity calculation task between the sentence and the corresponding Chinese question, train a BERT model on the task, and finally output the highest-scoring candidate as the optimal semantic relation path of the question-and-answer pair.
[0129] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0130] This application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0131] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0132] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0133] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus.
[0134] It should be understood that the above description is for illustrative purposes and not for limitation. Many embodiments and applications beyond the provided examples will be apparent to those skilled in the art upon reading the above description. Therefore, the scope of this teaching should not be determined by reference to the above description, but rather by reference to the foregoing claims and the full scope of their equivalents. For purposes of completeness, all articles and references, including patent applications and publications, are incorporated herein by reference. The omission of any aspect of the subject matter disclosed herein in the foregoing claims is not intended as a waiver of that subject matter, nor should it be construed as an indication that the applicant has not considered that subject matter as part of the disclosed inventive subject matter.
Claims
1. A method for extracting Chinese question-and-answer information, characterized in that it includes the following steps: S1, classify Chinese questions, build a BERT model based on the classification, input the Chinese questions into the BERT model, and output the Chinese question classification result labels; S2, based on the Chinese question classification result labels, perform named entity recognition on the questions to find the entities corresponding to the questions; S3, perform feature division on the entities, use the XGBOOST model to calculate a score for each related entity based on the divided features, and take the entity with the highest score as the final linked entity; S4, construct the candidate relation set corresponding to the entity into sentence form, perform the semantic relation similarity calculation task between the sentence and the corresponding Chinese question, train the BERT model on the task, and finally output the highest-scoring candidate as the optimal semantic relation path of the question-answer pair; wherein the process of constructing the candidate relation set corresponding to the entity into sentence form is as follows: construct candidate triplet sentence pairs, construct the candidate relation set into sentence form, and use the <pad> wildcard token at the end so that the sentence keeps the same length as the Chinese question.
2. The Chinese question-and-answer information extraction method according to claim 1, characterized in that, in S1, Chinese questions are divided into two categories: direct result-oriented and indirect result-oriented.
3. The Chinese question-and-answer information extraction method according to claim 1, characterized in that, in S1, the BERT model comprises three layers of representation vectors: word embedding, position embedding, and segment embedding. A question sequence is taken as input to the model, passed through the three layers of representation vectors of the BERT model, and finally processed... to predict the probability of each category; the category with the highest probability is selected as the final output category label, ClassLabel.
4. The Chinese question-and-answer information extraction method according to claim 2, characterized in that, in the BERT model, the three symbols denote, respectively, the i-th character (a portion of the characters being randomly masked), the i-th embedding vector, and the i-th feature vector after processing by the BERT model.
5. The Chinese question-and-answer information extraction method according to claim 1, characterized in that, in S3, the features are divided into the initial score of entity mentions, the length of entity mentions, the ratio of the length of entity mentions to the length of the question, the ranking of the entity, the reciprocal of the ranking of the entity, the semantic similarity between the question and the entity, the semantic similarity between the question and the entity suffix, the Jaccard coefficient between the question and the entity suffix, the maximum semantic similarity between the question and the entity candidate relation, and the maximum Jaccard coefficient between the question and the entity candidate relation.
6. The Chinese question-and-answer information extraction method according to claim 1, characterized in that, in S3, the score is calculated as the probability that the entity is classified with the correct label.
7. A Chinese question-and-answer information extraction system, characterized in that it includes: a classification module, used to classify Chinese questions, build a BERT model based on the classification, input the Chinese questions into the BERT model, and output the Chinese question classification result labels; an entity acquisition module, used to perform named entity recognition on the questions based on the Chinese question classification result labels and find the entities corresponding to the questions; an entity link confirmation module, used to perform feature division on the entities, use the XGBOOST model to calculate a score for each related entity based on the divided features, and take the entity with the highest score as the final linked entity; and an optimal semantic relation acquisition module, used to construct the candidate relation set corresponding to the entity into sentence form, perform the semantic relation similarity calculation task between the sentence and the corresponding Chinese question, train the BERT model on the task, and finally output the highest-scoring candidate as the optimal semantic relation path of the question-answer pair; wherein the process of constructing the candidate relation set corresponding to the entity into sentence form is as follows: construct candidate triplet sentence pairs, construct the candidate relation set into sentence form, and use the <pad> wildcard token at the end so that the sentence keeps the same length as the Chinese question.
8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the Chinese question-and-answer information extraction method as described in any one of claims 1 to 6.
9. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the Chinese question-and-answer information extraction method as described in any one of claims 1 to 6.