Method and system for identifying entity annotation errors in a knowledge graph on a corpus of documents

By using the SentencePiece word segmenter and the voting mechanism of a deep learning network model, entity annotation errors in the literature dataset are identified and corrected, solving the problem of low entity annotation accuracy and achieving efficient knowledge graph construction.

CN115130465B · Active · Publication Date: 2026-06-16 · ZHEJIANG UNIV CITY COLLEGE

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHEJIANG UNIV CITY COLLEGE
Filing Date
2022-07-18
Publication Date
2026-06-16


Abstract

The application provides a method for identifying entity annotation errors in a knowledge graph on a literature dataset, comprising the following steps: performing data preprocessing on an entity-annotated literature dataset; selecting a preset number of pre-trained models that use the SentencePiece word segmenter; establishing a corresponding number of deep learning network models based on the selected pre-trained models and training them, with the models and parameters from the entire training process recorded and saved as candidate judge models; selecting 2k models from the candidate judge models as judge models based on model accuracy and setting confidence parameters for them, where k is the number of selected pre-trained models; using the selected judge models, based on a voting mechanism, to select disputed entities in the literature dataset; and searching the literature dataset for the top n entities whose text overlap with a disputed entity exceeds a preset overlap threshold, scoring the disputed entity according to overlap and frequency, and identifying disputed entities with scores below a discrimination threshold as erroneous entities.

Description

Technical Field

[0001] This invention relates to the field of computer natural language processing technology, and in particular to a method and system for identifying entity annotation errors on a knowledge graph in a document dataset.

Background Technology

[0002] Knowledge graphs have proven effective in modeling structured information and conceptual knowledge. Constructing a knowledge graph typically involves two tasks: Named Entity Recognition (NER) and Relation Extraction (RE). NER identifies named entities from text data, while RE extracts the relationships between discrete named entities, connecting them to form a network of knowledge. High-quality entity annotation is a crucial step in building a knowledge graph, and ensuring the accuracy of entity recognition is fundamental to RE. However, with the increasing size of databases across various domains, maintaining a dataset and ensuring the accuracy of its entity annotation is no easy task.

Summary of the Invention

[0003] Based on the above background, this invention proposes a method for identifying entity annotation errors in knowledge graphs on literature datasets. This method can be used to construct high-quality knowledge graphs in professional fields, and specifically adopts the following technical solution:

[0004] The first aspect of this invention is a method for identifying entity annotation errors in a knowledge graph on a document dataset, comprising the following steps:

[0005] S1. Perform data preprocessing on the document dataset with entity annotation;

[0006] S2. Select a preset number of pre-trained models using the SentencePiece word segmenter;

[0007] S3. Based on the selected pre-trained models, establish a corresponding number of deep learning network models for training, and record and save the models and parameters throughout the training process as candidate judge models.

[0008] S4. Based on the model accuracy, select 2k models from the candidate judge models as judge models, and set confidence parameters for them, where k is the number of pre-trained models selected.

[0009] S5. Based on the voting mechanism, the selected judge model is used to select the disputed entities in the text dataset;

[0010] S6. Search the text dataset for the top n entities whose text information overlaps with the disputed entity by more than a preset overlap threshold. Score the disputed entities based on overlap and frequency, and identify disputed entities with scores less than the discrimination threshold as erroneous entities.

[0011] Furthermore, in step S1, the data preprocessing includes handling the entity nesting problem in the literature dataset, specifically by converting traditional BIO tags into a machine reading comprehension tag format comprising the context, whether the text contains the entity, the entity label, the entity start position, the entity end position, the text-and-entity identifier qas_id, and the question query.

[0012] Furthermore, in step S2, the pre-trained models of the SentencePiece word segmenter include XLNet, ELMo, RoBERTa, and ALBERT models.

[0013] Furthermore, step S3 specifically includes:

[0014] S31. Load each pre-trained model through the BertModel and BertPreTrainedModel modules to form multiple upstream neural networks;

[0015] S32. Input the preprocessed data into the multiple upstream neural networks respectively to obtain the semantic representation of multiple contexts, and then set multiple downstream neural networks corresponding to the upstream neural networks through multiple fully connected layers to form multiple deep learning network models.

[0016] S33. Record and save the parameters learned by each deep learning network model in each epoch, and obtain the model and parameters throughout the training process as the candidate judge model.

[0017] Furthermore, in step S4, the formula for calculating the confidence parameter is:

[0018] $T = \mathrm{Softmax}(P_1, P_2, \ldots, P_{2k})$

[0019] where $P_i$ is the accuracy of the i-th judge model and $T$ is the confidence parameter.

[0020] Furthermore, step S5 specifically includes:

[0021] S51. Input each entity in the literature dataset into the judge models, obtain the entities whose predicted labels do not match their annotated labels, and record them as disputed entities to be voted on.

[0022] S52. Based on the confidence parameter of each judge model, vote on the disputed entities to be voted on, and select the disputed entities based on a preset score threshold, where the confidence parameter of each judge model is the number of votes it casts for each entity.

[0023] Furthermore, step S6 specifically includes:

[0024] S61. Search the text dataset for the top n entities whose text information overlaps with the disputed entity by more than a preset overlap threshold, and use them as query entities;

[0025] S62. Based on the overlap $D_i$ and entity frequency $F_i$ of the n query entities, and the frequency $\mu$ of the disputed entity itself in the literature dataset, score the disputed entity as follows:

[0026] $\mathrm{Score}_i = F_i / \mu \times D_i, \quad i = 1, 2, \ldots, n$

[0027] S63. Perform n calculations to obtain the score set $(\mathrm{Score}_1, \mathrm{Score}_2, \ldots, \mathrm{Score}_n)$ corresponding to the disputed entity. If any score in the score set is less than the discrimination threshold, the disputed entity is judged to be an erroneous entity.

[0028] Furthermore, the method of the present invention also includes:

[0029] S0. Collect literature data in a specific field to form a literature dataset, and perform entity annotation on the literature dataset. Specifically, this includes: cutting a whole article into text pieces of less than 256 characters, using the BIO annotation method, and manually annotating each text piece with entities.

[0030] A second aspect of the present invention is a knowledge graph entity annotation error recognition system on a document dataset, comprising:

[0031] The data preprocessing module is used to preprocess the entity-annotated literature dataset.

[0032] The pre-trained model configuration module is used to configure a preset number of pre-trained models using the SentencePiece word segmenter.

[0033] The model training module is used to build a corresponding number of deep learning network models based on the selected pre-trained models for training, and to record and save the models and parameters throughout the training process as candidate judge models.

[0034] The judge model generation module is used to select 2k models from the candidate judge models based on the model accuracy and set confidence parameters for them, where k is the number of pre-trained models selected.

[0035] The disputed entity selection module is used to select disputed entities in the text dataset based on a voting mechanism and a selected judge model.

[0036] The error finding module is used to search the text dataset for the top n entities whose text information overlaps with the disputed entity by more than a preset overlap threshold. The disputed entities are scored based on overlap and frequency, and disputed entities with scores less than the discrimination threshold are identified as error entities.

[0037] Furthermore, the system also includes:

[0038] The annotation generation module is used to perform entity annotation on a literature dataset consisting of collected literature data in a specific field. Specifically, it includes: cutting an entire article into text segments of less than 256 characters, using the BIO annotation method, and manually annotating each text segment with entities.

[0039] The beneficial effects of this invention lie in its original method and system for identifying entity annotation errors in knowledge graphs on literature datasets. It combines named entity recognition and machine reading comprehension from the field of natural language processing to address the frequent entity nesting problem in literature datasets. It proposes, for the first time, a dataset maintenance method that saves the training results of multiple deep learning models and retains the two most accurate checkpoints of each model as "judges" to determine whether errors exist in the dataset, together with a method for setting confidence parameters. This ensures that the "judges" have different levels of credibility and familiarity with the semantic information of the text during the error correction process, while also ensuring a sufficient number of "judges." The method and system of this invention perform well on the DiaKG medical literature dataset. Furthermore, this method can be easily extended to other literature datasets, enabling more efficient construction of high-quality knowledge graphs in various fields.

Attached Figure Description

[0040] Figure 1 is a schematic diagram of the basic process of an embodiment of the method of the present invention.

[0041] Figure 2 is a schematic diagram illustrating a specific process of an embodiment of the present invention.

Detailed Implementation

[0042] To further understand the present invention, preferred embodiments of the present invention are described below in conjunction with examples. However, it should be understood that these descriptions are only for further illustrating the features and advantages of the present invention, and not for limiting the scope of the claims of the present invention.

[0043] This invention focuses on the named entity recognition and error correction stages in the knowledge graph construction task of literature datasets. While conventional named entity recognition in natural language processing typically doesn't encounter entity nesting issues, specialized literature datasets often present situations where a single text contains multiple entities. Furthermore, domain-specific abbreviations are difficult to find in dictionaries, and Chinese literature databases frequently suffer from mixed Chinese and English content. Therefore, this invention assumes these problems will be encountered during its description, and the method employed addresses these issues while being applicable to literature databases that do not exhibit these problems.

[0044] Deep learning has a wide range of applications, such as computer vision, natural language processing, and speech analysis. This invention uses cutting-edge deep learning pre-trained models, such as XLNet, RoBERTa, and ALBERT, and proposes for the first time a multi-model "voting" error correction method, which saves time and manpower costs in the data annotation process.

[0045] It should be noted that when implementing the solution of this invention, the selection of deep learning pre-trained models is not necessarily limited to the models listed in this invention. Those skilled in the art can follow the latest pre-trained models released in the field of deep learning to select a model suitable for their own dataset. The hyperparameters given in this specification can also be adjusted based on the practitioner's own understanding of the problem.

[0046] In the field of deep learning, some techniques and methods have become highly modularized. Therefore, it is understandable for those skilled in the art that certain well-known structures and their descriptions are omitted in the accompanying drawings.

[0047] The method and corresponding system of the present invention are described in further detail below with reference to Figures 1-2 and specific embodiments.

[0048] Referring to Figures 1-2, in one illustrated embodiment, a method for identifying entity annotation errors in a knowledge graph on a document dataset includes the following steps:

[0049] The first step was to collect and build the DiaKG dataset, a medical literature dataset on diabetes. The dataset consists of 41 diabetes guidelines and consensus statements, all from authoritative Chinese journals, covering the broadest range of research content and hot topics in recent years, including clinical research, drug use, clinical cases, diagnosis, and treatment methods. The text information was then annotated, specifically as follows:

[0050] A complete article is divided into text segments of less than 256 characters each. AI experts and domain experts use the BIO annotation method to annotate each text segment with entities, forming a literature dataset with entity annotation.
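The segmentation in this step can be sketched as follows. This is a minimal illustration, not the procedure used to build DiaKG: the function name `split_into_segments`, the set of sentence-ending punctuation marks, and the greedy packing strategy are all assumptions, since the text only requires that each segment be shorter than 256 characters.

```python
import re

MAX_LEN = 256  # each segment must be shorter than this many characters


def split_into_segments(article: str, max_len: int = MAX_LEN) -> list[str]:
    """Greedily pack sentences into segments of fewer than max_len characters."""
    # Split right after sentence-ending punctuation (assumed delimiter set).
    sentences = [s for s in re.split(r"(?<=[。！？!?.;；])", article) if s]
    segments, current = [], ""
    for sentence in sentences:
        # Hard-wrap any single sentence that is too long on its own.
        while len(sentence) >= max_len:
            if current:
                segments.append(current)
                current = ""
            segments.append(sentence[: max_len - 1])
            sentence = sentence[max_len - 1:]
        if len(current) + len(sentence) < max_len:
            current += sentence
        else:
            segments.append(current)
            current = sentence
    if current:
        segments.append(current)
    return segments
```

Concatenating the returned segments reproduces the original article, so character offsets used later for entity annotation can be mapped back to the source text.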

[0051] It should be noted that the above steps are merely an example of generating an entity-annotated literature dataset, and are not essential steps of this invention. The method of this invention is applicable to all entity-annotated literature datasets generated using similar or other means.

[0052] The second step is to preprocess the entity-annotated literature dataset.

[0053] Taking the DiaKG medical literature dataset as an example, this dataset contains a total of 22,050 entities, and its categories include:

[0054] "Disease", "Class", "Reason", "Pathogenesis", "Symptom", "Test", "Test_items", "Test_Value", "Drug", "Total", "Frequency", "Method", "Treatment", "Operaction", "ADE", "Anatomy", "Level".

[0055] Among these, entities are nested with each other. For example, in the case of "type 2 diabetes", "type 2 diabetes" is an entity of the "Disease" category, and "type 2" is an entity of the "Class" category. It can be found that two entities of different categories appear in the same text. This situation is called entity nesting, which is very common in literature datasets and is a problem that must be addressed.

[0056] Furthermore, this dataset contains many domain-specific terms and English abbreviations. For example, "HbA1c" belongs to the "Test_items" category and refers to the glycated hemoglobin test in the medical field. It is difficult for researchers outside the medical field to understand its meaning, and there is no vocabulary that perfectly corresponds to this term.

[0057] Therefore, preprocessing is required to address the entity nesting problem in the literature dataset. Entity nesting is resolved using machine reading comprehension methods, converting traditional Named Entity Recognition (BIO) tags into a machine reading comprehension tag format, including context, whether the entity is contained (impossible), entity label (entity_label), entity start position (start_position), entity end position (end_position), text and entity identifier (qas_id), and query.

[0058] In the above dataset example, there are a total of 17 entity categories. Therefore, 17 queries are set for each context text piece. The queries mainly help the machine establish the query scope and determine whether there are related entities in this text piece. At the same time, the queries contain text information, which can help the model converge faster.

[0059] The query settings can be referenced from Wikipedia, or researchers can customize the questions based on their own understanding of the dataset. For example, for the entity "Disease", the query could be set to "Does the following text contain descriptions of diseases, such as type 1 diabetes, type 2 diabetes, etc." The specific preprocessing format is shown in Table 1 below:

[0060] Table 1

[0061]

[0062] Because the text "The second blood draw should be exactly 2 hours after taking the glucose solution, and the blood glucose level should be measured by taking a blood sample from the forearm (time is calculated from the first sip of glucose until exactly 2 hours later, which is 2hPG)" does not contain the entity "Disease", its settings for entity_label="Disease" are start_position=[], end_position=[], and impossible=true. However, the text does contain the entity "Test_items", so impossible=false. Impossible can help the machine quickly filter out unimportant data during training, saving time. The specific composition of qas_id is "text id" + "." + "entity id".
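The record layout described above can be sketched in code. The example below is an illustration under stated assumptions: annotations are supplied as `(start, end, label)` character spans with inclusive offsets (a hypothetical input format that can represent nested entities), the helper name `build_mrc_records` is invented for this sketch, and only the "Disease" query wording comes from the text; the other queries are placeholders.

```python
def build_mrc_records(text_id, context, spans, queries):
    """Build one machine-reading-comprehension record per entity category.

    spans:   list of (start, end, label) with inclusive character offsets
    queries: {entity_label: query text}; one record is produced per label
    """
    records = []
    for entity_id, (label, query) in enumerate(queries.items(), start=1):
        starts = [s for s, _, l in spans if l == label]
        ends = [e for _, e, l in spans if l == label]
        records.append({
            "context": context,
            "entity_label": label,
            "query": query,
            "start_position": starts,
            "end_position": ends,
            "impossible": not starts,            # True when the label is absent
            "qas_id": f"{text_id}.{entity_id}",  # "text id" + "." + "entity id"
        })
    return records


# Nested example: "2型糖尿病" is a Disease and its prefix "2型" is a Class.
queries = {
    "Disease": "Does the following text contain descriptions of diseases, "
               "such as type 1 diabetes, type 2 diabetes, etc.",
    "Class": "placeholder query for Class",  # hypothetical wording
    "Drug": "placeholder query for Drug",    # hypothetical wording
}
records = build_mrc_records(
    text_id=7,
    context="2型糖尿病患者",
    spans=[(0, 4, "Disease"), (0, 1, "Class")],
    queries=queries,
)
```

Because each category gets its own record, the overlapping "Disease" and "Class" spans are stored side by side without conflict, which is the property the text relies on to handle entity nesting.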

[0063] After preprocessing, when feeding the query and context into the deep learning neural network for training, the query and context are combined into the format [CLS]+query+[SEP]+context+[SEP], with the labels start_position and end_position. This method can store all possible entity labels for a piece of text information, effectively solving the problem of entity nesting.
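As a rough illustration of the input format above, the sketch below assembles one training example. It assumes character-level tokens purely for simplicity (a real tokenizer would produce subword pieces and require offset mapping), and the function name `assemble_input` and the binary label and mask layout are assumptions rather than details given in the text.

```python
def assemble_input(query, context, start_position, end_position):
    """Build [CLS]+query+[SEP]+context+[SEP] with per-token labels and a mask."""
    tokens = ["[CLS]", *query, "[SEP]", *context, "[SEP]"]
    offset = len(query) + 2  # index of the first context token
    n = len(tokens)
    start_labels = [0] * n
    end_labels = [0] * n
    for s in start_position:
        start_labels[s + offset] = 1  # shift context offsets past the query
    for e in end_position:
        end_labels[e + offset] = 1
    # The mask is 1 only on context tokens, so the query and the special
    # tokens do not contribute to the loss.
    mask = [1 if offset <= i < offset + len(context) else 0 for i in range(n)]
    return tokens, start_labels, end_labels, mask


tokens, start_labels, end_labels, mask = assemble_input(
    query="疾病", context="2型糖尿病患者", start_position=[0], end_position=[4]
)
```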

[0064] The third step is to select a preset number of pre-trained models using the SentencePiece word segmenter.

[0065] After data preprocessing, labeled input data is obtained. The diabetes literature dataset contains many domain-specific terms and English abbreviations, so this Chinese literature dataset is in practice a mixture of Chinese and English. For example, "2hPG" in the context above would be mapped to the out-of-vocabulary (OOV) identifier "unknown" by the usual BERT vocabulary.

[0066] Therefore, a pre-trained model using the SentencePiece word segmenter, such as RoBERTa, ALBERT, XLNet, ELMo, etc., should be selected. The advantage of this byte-level BPE vocabulary is that it can encode any input text and will not encounter out-of-vocabulary words.
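The difference between a whole-word vocabulary and subword segmentation can be illustrated with a toy example. This is not SentencePiece's actual algorithm: it is a greedy longest-match segmenter with single-character fallback, and the vocabulary and piece set below are hypothetical. It only shows why a subword scheme with a fallback never has to emit an unknown token.

```python
def whole_word_lookup(words, vocab):
    """Map each word to itself if known, otherwise to the unknown token."""
    return [w if w in vocab else "[UNK]" for w in words]


def greedy_subword_segment(word, pieces):
    """Greedy longest-match segmentation with single-character fallback."""
    out, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always matches.
        for j in range(len(word), i, -1):
            if word[i:j] in pieces or j == i + 1:
                out.append(word[i:j])
                i = j
                break
    return out


vocab = {"血糖"}                     # hypothetical whole-word vocabulary
pieces = {"2h", "PG", "Hb", "A1c"}   # hypothetical subword pieces

unknown = whole_word_lookup(["血糖", "2hPG"], vocab)
segmented = greedy_subword_segment("2hPG", pieces)
```

With the whole-word vocabulary, "2hPG" collapses into the unknown token and its content is lost, whereas the subword segmenter keeps it as a sequence of known pieces.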

[0067] This specification briefly introduces RoBERTa, ALBERT, and XLNet to help those implementing this invention select a model. RoBERTa introduces dynamic masking on top of BERT, meaning that the mask positions are computed on the fly during training rather than fixed in advance; it is also pre-trained on more data. ALBERT addresses the problem of excessively large parameter counts by factorizing the word-embedding parameters, so that the hidden layer dimension need not equal the word vector dimension: the word vector dimension is reduced and then projected to the hidden dimension through a fully connected layer. ALBERT also replaces the Next Sentence Prediction (NSP) task of the original BERT with the more demanding Sentence Order Prediction (SOP) task, enabling the pre-trained model to learn subtler semantic differences and discourse coherence. XLNet uses Transformer-XL as its backbone and adopts an autoregressive language model structure that still captures bidirectional context, in which each token is predicted from the tokens that precede it in the factorization order. This approach avoids the artificial mask tokens that BERT introduces during pre-training.

[0068] The fourth step is to build a corresponding number of deep learning network models based on the selected pre-trained models, train them, and record and save the models and parameters throughout the training process as candidate judge models.

[0069] After obtaining the preprocessed data and selecting pre-trained models, the BertModel and BertPreTrainedModel modules are imported from the transformers package to load the selected pre-trained models, forming multiple upstream neural networks. Then, the preprocessed data is input into each of these upstream neural networks to obtain semantic representations of multiple contexts. Multiple downstream neural networks, corresponding to the upstream neural networks, are then configured through multiple fully connected layers, forming multiple deep learning network models. Finally, the parameters learned by each deep learning network model in each epoch are recorded and saved, yielding the models and parameters from the entire training process, which serve as the candidate judge models.

[0070] In this step, the data passes through an upstream neural network to obtain textual semantic information, which is then fed into a downstream network. Finally, it passes through two fully connected layers, outputting the entity's start position (start_prediction) and end position (end_prediction). The loss is calculated using the labels start_position and end_position and their corresponding masks start_position_mask and end_position_mask. The BCEWithLogitsLoss module in PyTorch is used to obtain start_loss and end_loss. Different weights can be set for start_loss and end_loss; here, 0.5 is used as a reference, meaning the start and end positions have equal weight in the loss calculation. The formula for calculating the total loss (total_loss) is obtained as follows:

[0071] start_loss=BCEWithLogitsLoss(start_prediction, start_position)*start_position_mask

[0072] end_loss=BCEWithLogitsLoss(end_prediction, end_position)*end_position_mask

[0073] total_loss=(start_loss+end_loss) / 2
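The loss formulas above can be made concrete with a small dependency-free sketch. It mirrors the element-wise quantity that PyTorch's BCEWithLogitsLoss computes, written in plain Python so it runs without PyTorch. One assumption is labeled in the code: the masked losses are averaged over the unmasked positions, since the text multiplies by the mask without stating how the result is reduced to a scalar. The function names are hypothetical.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy on a raw (pre-sigmoid) logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def masked_loss(logits, targets, mask):
    """Mean BCE over positions where mask == 1 (assumed reduction)."""
    total = sum(bce_with_logits(x, y) * m for x, y, m in zip(logits, targets, mask))
    count = sum(mask)
    return total / count if count else 0.0


def total_loss(start_logits, start_pos, start_mask,
               end_logits, end_pos, end_mask,
               w_start=0.5, w_end=0.5):
    """Weighted sum of start and end losses; 0.5 each matches the text."""
    return (w_start * masked_loss(start_logits, start_pos, start_mask)
            + w_end * masked_loss(end_logits, end_pos, end_mask))
```

With both weights at 0.5 this reduces to (start_loss + end_loss) / 2, matching the formula for total_loss; other weightings can be tried by changing `w_start` and `w_end`.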

[0074] Of course, the semantic information learned by the same pre-trained model is not the same in different rounds; different pre-trained models also learn different semantic information; therefore, each pre-trained model must be trained separately, and the two models with the highest accuracy should be selected and retained.
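Selecting the two most accurate checkpoints per pre-trained model can be sketched as below. The tuple layout, the function name `select_judges`, and the accuracy values are hypothetical and serve only to illustrate how k pre-trained models yield 2k judges.

```python
from collections import defaultdict


def select_judges(checkpoints, per_backbone=2):
    """Keep the most accurate checkpoints of each pre-trained model.

    checkpoints: iterable of (backbone, epoch, accuracy) tuples
    """
    by_backbone = defaultdict(list)
    for backbone, epoch, accuracy in checkpoints:
        by_backbone[backbone].append((accuracy, epoch))
    judges = []
    for backbone, runs in by_backbone.items():
        best = sorted(runs, reverse=True)[:per_backbone]  # highest accuracy first
        judges.extend((backbone, epoch, acc) for acc, epoch in best)
    return judges


# Illustrative per-epoch accuracies (made-up numbers).
checkpoints = [
    ("RoBERTa", 1, 0.80), ("RoBERTa", 2, 0.86), ("RoBERTa", 3, 0.84),
    ("ALBERT", 1, 0.78), ("ALBERT", 2, 0.81), ("ALBERT", 3, 0.83),
    ("XLNet", 1, 0.82), ("XLNet", 2, 0.85), ("XLNet", 3, 0.79),
]
judges = select_judges(checkpoints)  # 3 backbones x 2 checkpoints = 6 judges
```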

[0075] The fifth step is to select 2k models from the candidate judge models based on the model accuracy, and set confidence parameters for them, where k is the number of pre-trained models selected.

[0076] In this example, six "judges" are set up. The two models with the highest accuracy, selected from the training results using RoBERTa, ALBERT, and XLNet as pre-trained models, are chosen as "judges." Based on the accuracy [P1, P2, P3, P4, P5, P6], different confidence parameters are set using softmax to ensure that when evaluating data with prediction errors, the better the model is trained, the greater its influence. In this example, the confidence parameter is calculated using the following formula:

[0077] $T = \mathrm{Softmax}(P_1, P_2, \ldots, P_{2k})$

[0078] where $P_i$ is the accuracy of the i-th judge model and $T$ is the confidence parameter.
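A minimal sketch of the confidence-parameter calculation follows. The function name and the six accuracy values are hypothetical; the softmax itself is the standard definition, with the maximum subtracted before exponentiation for numerical stability.

```python
import math


def confidence_parameters(accuracies):
    """Softmax over judge accuracies; a higher accuracy gives a larger weight."""
    m = max(accuracies)
    exps = [math.exp(p - m) for p in accuracies]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


# Six judges with made-up accuracies P1..P6.
weights = confidence_parameters([0.86, 0.84, 0.83, 0.81, 0.85, 0.82])
```

Because accuracies lie between 0 and 1, the ratio between any two softmax weights is at most e, so the resulting confidence parameters preserve the ranking of the judges while differing only moderately in size.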

[0079] The sixth step involves using a voting mechanism and the selected judge model to identify the disputed entities in the text dataset.

[0080] First, each entity in the literature dataset is input into the judge models, and the entities whose predicted labels do not match their original labels are recorded as disputed entities to be voted on. Then, based on the confidence parameter of each judge model, votes are cast on the disputed entities to be voted on, and disputed entities are selected based on a preset score threshold. The confidence parameter of each judge model is the number of votes it casts for each entity.

[0081] In this example, six judging models "vote" for entities. The confidence parameter of each judging model is the number of "votes" for each entity. Each judging model votes for entities whose predicted results do not match the labeled results. Entities whose final scores exceed the set threshold are called "controversial" entities. In practice, a threshold of 3.5 performs best, identifying 93% of erroneous entities without generating too many entries, thus preventing the discriminator from taking too long to judge.
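The weighted vote can be sketched as follows. The data layout, the function name `select_disputed`, and all numbers are hypothetical. Note in particular that the text does not state how the confidence parameters are scaled relative to the reported threshold of 3.5; softmax weights sum to 1, so the threshold in this sketch is an illustrative value on that scale rather than the one from the text.

```python
def select_disputed(gold_labels, judge_predictions, weights, threshold):
    """Return entities whose weighted disagreement score exceeds the threshold.

    gold_labels:       {entity_id: annotated label}
    judge_predictions: list of {entity_id: predicted label}, one per judge
    weights:           one confidence parameter per judge
    """
    disputed = {}
    for entity_id, gold in gold_labels.items():
        # A judge votes (with its weight) when its prediction disagrees
        # with the annotated label.
        score = sum(w for preds, w in zip(judge_predictions, weights)
                    if preds.get(entity_id) != gold)
        if score > threshold:
            disputed[entity_id] = score
    return disputed


gold = {"e1": "Disease", "e2": "Drug"}
predictions = [
    {"e1": "Disease", "e2": "Test"},
    {"e1": "Disease", "e2": "Test"},
    {"e1": "Class", "e2": "Drug"},
]
disputed = select_disputed(gold, predictions, weights=[0.4, 0.35, 0.25], threshold=0.5)
```

Here only "e2" is flagged: two judges carrying a combined weight of 0.75 disagree with its annotation, while "e1" is questioned by a single judge with weight 0.25.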

[0082] Step 7: Search the text dataset for the top n entities whose text information overlaps with the disputed entity by more than a preset overlap threshold. Score the disputed entities based on overlap and frequency, and identify disputed entities with scores less than the discrimination threshold as erroneous entities.

[0083] First, the top n entities in the text dataset whose text information overlaps with the disputed entity by more than a preset overlap threshold are selected as query entities. Then, based on the overlap $D_i$ and entity frequency $F_i$ of the n query entities, and the frequency $\mu$ of the disputed entity itself in the literature dataset, the disputed entity is scored as follows: $\mathrm{Score}_i = F_i / \mu \times D_i$, $i = 1, 2, \ldots, n$. Finally, n calculations are performed to obtain the score set $(\mathrm{Score}_1, \mathrm{Score}_2, \ldots, \mathrm{Score}_n)$ corresponding to the disputed entity. If any score in the score set is less than the discrimination threshold, the disputed entity is judged to be an erroneous entity.

[0084] Specifically, in this example, the entities selected by the judge models through "voting" as the most disputed are obtained and recorded. At this point these entities are only "controversial": many of them carry correct labels but were misjudged because of the models' limited ability, so further screening is required. The discriminator used in this step has a time complexity of O(n × total × log(length)), where n is the number of "controversial" entities, total is the total number of data entries, and length is the length of a single data entry. The threshold in the previous step should therefore not be set too low, or the discrimination process will take too long. Based on the text of each "controversial" entity, the discriminator searches the dataset for the top five entities with a textual overlap greater than 90%; if fewer than five exist, only those with an overlap greater than 90% are used. Using the overlap D and frequency F of each such entity and the frequency μ of the "controversial" entity itself in the dataset, the scoring formula above yields min(num, 5) scores, where num is the number of entities with an overlap greater than 90%. In practice, a score < 0.045 indicates that the "controversial" entity does not conform to the norm of the overall dataset. In experiments, the discriminator achieved a discrimination accuracy of up to 98%.
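The discriminator's scoring step can be sketched as below, under several assumptions that the text leaves open: the overlap measure is taken to be the similarity ratio of Python's `difflib.SequenceMatcher`, the "top" entities are ranked by overlap, and the frequencies F and μ are supplied by the caller because the text does not specify how they are counted. The function names and example values are hypothetical.

```python
from difflib import SequenceMatcher


def overlap(a: str, b: str) -> float:
    """Assumed overlap measure: difflib similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def score_disputed(disputed_text, mu, candidates, top_n=5, min_overlap=0.9):
    """Score a disputed entity against its most similar query entities.

    candidates: {entity_text: frequency F}
    mu:         frequency of the disputed entity itself
    Returns at most top_n scores, each computed as F_i / mu * D_i.
    """
    matches = []
    for text, freq in candidates.items():
        d = overlap(disputed_text, text)
        if d > min_overlap:
            matches.append((d, freq))
    matches.sort(reverse=True)  # highest overlap first (assumed ranking)
    return [freq / mu * d for d, freq in matches[:top_n]]


def is_erroneous(scores, threshold=0.045):
    """Flag the entity if any score falls below the discrimination threshold."""
    return any(s < threshold for s in scores)


candidates = {"type 2 diabetes": 200, "type 2 diabete": 1, "insulin": 50}
scores = score_disputed("type 2 diabetes", mu=4, candidates=candidates)
```

The number of scores returned equals min(num, 5), where num is the number of candidates above the overlap threshold, which matches the count described in the text.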

[0085] In the implementation of the method of the present invention, after identifying erroneous entities, AI experts and domain experts can further review and correct the errors on the original dataset to obtain a more accurate dataset.

[0086] Another embodiment of the present invention provides a knowledge graph entity annotation error recognition system on a literature dataset, comprising:

[0087] The annotation generation module is used to perform entity annotation on a literature dataset consisting of collected literature data in a specific field. Specifically, it includes: cutting an entire article into text segments of less than 256 characters, using the BIO annotation method, and manually annotating each text segment with entities.

[0088] The data preprocessing module is used to preprocess the entity-annotated literature dataset.

[0089] The pre-trained model configuration module is used to configure a preset number of pre-trained models using the SentencePiece word segmenter.

[0090] The model training module is used to build a corresponding number of deep learning network models based on the selected pre-trained models for training, and to record and save the models and parameters throughout the training process as candidate judge models.

[0091] The judge model generation module is used to select 2k models from the candidate judge models based on the model accuracy and set confidence parameters for them, where k is the number of pre-trained models selected.

[0092] The disputed entity selection module is used to select disputed entities in the text dataset based on a voting mechanism and a selected judge model.

[0093] The error finding module is used to search the text dataset for the top n entities whose text information overlaps with the disputed entity by more than a preset overlap threshold. The disputed entities are scored based on overlap and frequency, and disputed entities with scores less than the discrimination threshold are identified as error entities.

[0094] The specific implementation of each module in the above system can be found in the steps described in the aforementioned method embodiments, and will not be explained in detail here.

[0095] When the above system is applied, the original dataset is continuously improved and corrected through repeated cycles of using the system to identify erroneous entities and manual review. As a result, the training results of each model in the system become better and better, and the identified erroneous entities become more and more accurate. During this process, the hyperparameters of the models in the system can be adjusted to set more stringent discriminators.

[0096] With the method and system of this invention, researchers no longer need to repeatedly check the entire literature dataset line by line to correct errors. Instead, they only need to wait for the system to output the specific erroneous entities and then confirm the corresponding modifications to the dataset, which reduces the burden of maintaining the entity annotations of a large literature-dataset knowledge graph.

[0097] The above description of the embodiments is only for the purpose of helping to understand the method and core ideas of the present invention. It should be noted that those skilled in the art can make several improvements and modifications to the present invention without departing from the principles of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims

1. A method for identifying entity annotation errors in a knowledge graph on a literature dataset, characterized in that it includes the following steps: S1. Perform data preprocessing on the literature dataset with entity annotation; S2. Select a preset number of pre-trained models using the SentencePiece word segmenter; S3. Based on the selected pre-trained models, establish a corresponding number of deep learning network models for training, and record and save the models and parameters throughout the training process as candidate judge models; S4. Based on the model accuracy, select 2k models from the candidate judge models as judge models, and set confidence parameters for them, where k is the number of pre-trained models selected; S5. Based on the voting mechanism, use the selected judge models to select the disputed entities in the literature dataset, specifically including: S51. Input each entity in the literature dataset into the judge models, obtain the entities whose predicted labels do not match their annotated labels, and record them as disputed entities to be voted on; S52. Based on the confidence parameter of each judge model, vote on the disputed entities to be voted on, and select the disputed entities based on a preset score threshold, where the confidence parameter of each judge model is the number of votes it casts for each entity; S6. Search the literature dataset for the top n entities whose text information overlaps with the disputed entity by more than a preset overlap threshold, score the disputed entities based on overlap and frequency, and identify disputed entities with scores less than the discrimination threshold as erroneous entities.

2. The method for identifying entity annotation errors in a knowledge graph on a literature dataset as described in claim 1, characterized in that, in step S1, the data preprocessing includes handling the entity nesting problem in the literature dataset, specifically by converting traditional BIO tags into a machine reading comprehension (MRC) tag format comprising: the context, whether the context contains entities, the entity label, the entity start position, the entity end position, the text identifier, the entity identifier qas_id, and the question (query).
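As an illustrative sketch of the conversion named in claim 2, one record can be produced per entity label and text piece. The field names other than qas_id and query, the qas_id layout, and the example questions are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


def bio_to_mrc(tokens: List[str], tags: List[str], text_id: str,
               queries: Dict[str, str]) -> List[dict]:
    """Convert one BIO-tagged text piece into MRC-style records.

    One record is emitted per label in `queries` (label -> question).
    Labels that appear in `tags` but not in `queries` are dropped.
    """
    spans = defaultdict(list)  # label -> list of (start, end) token indices
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        continues_span = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues_span:
            spans[label].append((start, i - 1))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]

    records = []
    for idx, (entity_label, query) in enumerate(queries.items()):
        found = spans.get(entity_label, [])
        records.append({
            "context": " ".join(tokens),
            "impossible": not found,           # True when no entity of this label
            "entity_label": entity_label,
            "start_position": [s for s, _ in found],
            "end_position": [e for _, e in found],
            "text_id": text_id,
            "qas_id": f"{text_id}.{idx}",      # hypothetical identifier layout
            "query": query,
        })
    return records
```

Because each label gets its own record with independent start and end positions, spans of different labels may overlap in the MRC format, which a single BIO tag sequence cannot express.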

3. The method for identifying entity annotation errors in a knowledge graph on a literature dataset as described in claim 1, characterized in that, in step S2, the pre-trained models that use the SentencePiece word segmenter include the XLNet, ELMo, RoBERTa, and ALBERT models.

4. The method for identifying entity annotation errors in a knowledge graph on a literature dataset as described in claim 1, characterized in that step S3 specifically includes:
S31. Load each pre-trained model through the BertModel and BertPreTrainedModel modules to form multiple upstream neural networks;
S32. Input the preprocessed data into each of the upstream neural networks to obtain multiple contextual semantic representations, and attach to each upstream neural network a corresponding downstream neural network built from multiple fully connected layers, thereby forming multiple deep learning network models;
S33. Record and save the parameters learned by each deep learning network model in each epoch, so that the models and parameters from the entire training process serve as the candidate judge models.

5. The method for identifying entity annotation errors in a knowledge graph on a literature dataset as described in claim 1, characterized in that, in step S4, the confidence parameter is calculated by the formula: [formula not reproduced], where T is the confidence parameter and the accompanying symbol denotes the accuracy of the corresponding individual judge model.

6. The method for identifying entity annotation errors in a knowledge graph on a literature dataset as described in claim 1, characterized in that step S6 specifically includes:
S61. Search the literature dataset for the top n entities whose text overlap with the disputed entity exceeds a preset overlap threshold, and use them as query entities;
S62. Score the disputed entity based on the overlap and the entity frequency of each of the n query entities and on the frequency of the disputed entity itself in the literature dataset, the scoring formula being: [formula not reproduced];
S63. Perform the calculation n times to obtain the score set corresponding to the disputed entity; if any score in the score set is less than the discrimination threshold, the disputed entity is judged to be an erroneous entity.
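As an illustrative sketch of the retrieval in step S61 only, the query entities might be gathered as follows. The overlap measure is not defined in this text, so the similarity ratio from Python's difflib is used purely as a stand-in; excluding the disputed entity's own surface form is likewise an assumption. The scoring of step S62 is left out because its formula is not reproduced here.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Tuple


def top_n_query_entities(
    disputed_text: str,
    entity_texts: List[str],
    n: int,
    overlap_threshold: float,
) -> List[Tuple[str, float, int]]:
    """Return up to n (entity_text, overlap, frequency) triples.

    `entity_texts` lists every annotated entity mention in the dataset,
    with repeats, so that frequencies can be counted.
    """
    frequency = Counter(entity_texts)
    candidates = []
    for text, count in frequency.items():
        if text == disputed_text:
            continue  # assumption: the disputed entity is not its own query entity
        # Stand-in overlap measure: difflib similarity ratio in [0, 1].
        overlap = SequenceMatcher(None, disputed_text, text).ratio()
        if overlap > overlap_threshold:
            candidates.append((text, overlap, count))
    # Highest overlap first; ties broken alphabetically for determinism.
    candidates.sort(key=lambda c: (-c[1], c[0]))
    return candidates[:n]
```

The returned overlap and frequency values, together with the frequency of the disputed entity itself, are the inputs that step S62 combines into a score.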

7. The method for identifying entity annotation errors in a knowledge graph on a literature dataset as described in any one of claims 1-6, characterized in that it further comprises: S0. Collect literature data in a specific field to form a literature dataset, and perform entity annotation on the literature dataset, specifically including: cutting each whole article into text pieces of fewer than 256 characters, and manually annotating the entities in each text piece using the BIO annotation method.
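As an illustrative sketch of the cutting operation in step S0, an article can be packed greedily into pieces of fewer than 256 characters. Preferring sentence boundaries and falling back to a hard cut for over-long sentences are assumptions; the claim only requires the length limit.

```python
import re
from typing import List


def cut_article(article: str, max_chars: int = 256) -> List[str]:
    """Split `article` into pieces strictly shorter than `max_chars`.

    Sentences (ending in Western or Chinese terminal punctuation) are kept
    whole where possible; concatenating the pieces restores the article.
    """
    limit = max_chars - 1
    # Each segment is one sentence plus any trailing whitespace.
    segments = re.findall(r".+?(?:[.!?。！？]+\s*|\Z)", article, flags=re.S)
    pieces, current = [], ""
    for segment in segments:
        if current and len(current) + len(segment) > limit:
            pieces.append(current)
            current = ""
        while len(segment) > limit:  # hard cut for a sentence that is too long
            pieces.append(segment[:limit])
            segment = segment[limit:]
        current += segment
    if current:
        pieces.append(current)
    return pieces
```

Each resulting piece would then be annotated manually with BIO tags before preprocessing.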

8. A system for identifying entity annotation errors in a knowledge graph on a literature dataset, characterized in that it comprises:
a data preprocessing module, used to preprocess the entity-annotated literature dataset;
a pre-trained model configuration module, used to configure a preset number of pre-trained models that use the SentencePiece word segmenter;
a model training module, used to establish a corresponding number of deep learning network models based on the selected pre-trained models and train them, and to record and save the models and parameters throughout the training process as candidate judge models;
a judge model generation module, used to select 2k models from the candidate judge models as judge models based on model accuracy and to set a confidence parameter for each of them, where k is the number of selected pre-trained models;
a disputed entity selection module, used to select disputed entities in the literature dataset based on a voting mechanism using the selected judge models, specifically including: inputting the labeled entities of the literature dataset into the judge models, obtaining the entities whose predicted labels do not match their annotated labels, and recording them as disputed entities to be voted on; voting on the disputed entities to be voted on according to the confidence parameter of each judge model; and selecting the disputed entities according to a preset score threshold, wherein the confidence parameter of each judge model is the number of votes that model has for each entity;
an error detection module, used to search the literature dataset for the top n entities whose text overlap with a disputed entity exceeds a preset overlap threshold, to score the disputed entity based on overlap and frequency, and to identify disputed entities whose scores are less than a discrimination threshold as erroneous entities.

9. The system for identifying entity annotation errors in a knowledge graph on a literature dataset as described in claim 8, characterized in that it further comprises: an annotation generation module, used to perform entity annotation on a literature dataset formed from collected literature data in a specific field, specifically including: cutting each whole article into text pieces of fewer than 256 characters, and manually annotating the entities in each text piece using the BIO annotation method.