A method and system for recognizing named entities in electronic medical records based on dependency syntax structure

By employing a dependency syntax structure approach, this method reconstructs dependency structures using graph convolutional neural networks and the BERT model, and combines this with a joint optimization loss function. This addresses the issue of low accuracy in named entity recognition in Chinese electronic medical records and improves the accuracy of entity boundary recognition.

CN116306643B · Active · Publication Date: 2026-06-16 · ZHONGKE FANYU TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ZHONGKE FANYU TECH
Filing Date
2022-12-28
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing Chinese electronic medical record named entity recognition methods have low accuracy in medical texts, mainly due to the diversity of professional vocabulary caused by medical expertise and the differences in doctors' writing habits, as well as the high cost of annotated corpora and insufficient semantic encoding information.

Method used

We employ a dependency syntax structure approach, combining a graph convolutional neural network and the pre-trained Chinese language model BERT with a graph autoencoder (GAE) to reconstruct the dependency structure connections between words. We train the model by jointly optimizing the loss function and integrate the loss of the dependency edge prediction model to improve the model's sensitivity to structural information.

Benefits of technology

It significantly improves the accuracy of named entity recognition in Chinese electronic medical record texts, especially in the recognition of entity boundaries with inconsistent expressions, and enhances the model's ability to utilize syntactic structure information.


Abstract

The application belongs to the technical field of named entity recognition for electronic medical records, and specifically provides a method and system for named entity recognition in electronic medical records based on dependency syntax structure. The dependency syntax structure information of the text is encoded by a graph neural network, and the loss of a dependency edge (dependency relationship) prediction model is incorporated at the same time. By maximizing the margin between the model that incorporates structural information and the model that does not, the model is forced to associate its decisions with the incorporated structural information, which makes it sensitive to that information. Using the structural information of the text improves the model's performance in named entity recognition on Chinese electronic medical record text. In particular, for similar entity mentions with different expressions, the incorporated dependency syntax structure information lets the model judge entity boundaries accurately, improving the recognition accuracy for these entities.

Description

Technical Field

[0001] This invention relates to the field of named entity recognition technology for electronic medical records, and more specifically, to a method and system for named entity recognition of electronic medical records with dependency syntax structure.

Background Technology

[0002] With the proliferation of electronic medical records and the potential demand for medical information services and decision support, the automatic extraction and processing of medical information has become a key focus of deep learning research. Named entity recognition algorithms for electronic medical records are an important component of clinical electronic medical record and clinical decision support systems. In recent years, with the development of machine learning technology, deep learning-based named entity recognition methods have received considerable attention. These include Bidirectional Long Short-Term Memory (Bi-LSTM) networks, Conditional Random Fields (CRF), and the pre-trained language model BERT, all of which have been applied to this problem and achieved good performance.

[0003] However, due to the special nature of the medical industry, using deep learning methods to extract entities and relations from medical texts has the following problems: 1) Due to the professional nature of medical knowledge, a large number of professional terms are included, and doctors have different writing habits, so different medical texts may have different expressions; 2) Due to the high cost of annotating electronic medical record corpora, there are few accurately annotated corpora, and the semantic encoding information obtained when using simple models to model on a small amount of corpus is not rich enough.

[0004] The aforementioned problems result in unsatisfactory performance of general-domain named entity recognition methods on medical electronic texts. To address this issue, researchers have proposed improvements at the feature encoding layer for Chinese electronic medical record named entity recognition tasks. This involves encoding multiple features to enhance the model's semantic representation, such as incorporating glyphs, pronunciations, and sentence structure features into the text representation. Typically, these methods encode one or more of the multiple features and then fuse them with character or word-level features of the text, using a cross-entropy loss function to optimize and train the model. However, these methods have the following drawbacks:

[0005] (1) The incorporated features mainly serve to enrich the model's semantic representation of the text. They do not readily encode the connection relationships between characters / words, so this information cannot be used to address the difficulty of determining boundaries when named entity recognition is performed as sequence labeling.

[0006] (2) Few studies have focused on the loss function used to train this type of model. The commonly used cross-entropy loss function optimizes only the posterior likelihood of the target task and does not explicitly model the role of the incorporated information, so the model does not use the incorporated information to make decisions.

Summary of the Invention

[0007] This invention addresses the technical problem that existing named entity recognition methods for Chinese electronic medical records, which integrate multiple features, have low accuracy.

[0008] This invention provides a method for named entity recognition in electronic medical records with dependency syntax structures, comprising the following steps:

[0009] S1: Obtain the training set of Chinese electronic medical record texts;

[0010] S2: Use a Chinese dependency parser to obtain dependency syntax structure information of the text in the training set;

[0011] S3: Use the Chinese pre-trained language model BERT to obtain the embedding vector of each character in the text and combine them into a text sequence representation;

[0012] S4: Feed the character embedding vectors and dependency syntax structure information into the graph convolutional neural network layer for encoding training, to obtain the text feature representation vector and dependency edge type vector that integrate text dependency structure information;

[0013] S5: Feed the text feature sequence into the CRF layer to decode the entity labels and predict the entity label sequence;

[0014] S6: Feed the dependency edge type vector into the fully connected layer to perform dependency relationship label classification and predict the dependency edge type;

[0015] S7: Jointly optimize the loss of the prediction results of S5 and S6 to obtain the recognition model. Chinese electronic medical record text can then be input into the recognition model for recognition.

[0016] Preferably, S4 specifically includes:

[0017] The embedding vectors of characters and dependency syntax structure information are fed into the graph convolutional neural network layer. By reconstructing the dependency structure connection relationship between characters and words, the dependency structure connection information between characters is encoded to obtain the text feature representation vector and dependency edge type vector that integrate the text dependency structure information.
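
As an illustration of the encoding step described above, the following is a minimal sketch of one graph convolution over a character-level dependency graph. The patent does not state the propagation rule, so the widely used form ReLU(D^{-1/2}(A + I)D^{-1/2} · H · W) is an assumption here, and the names `normalize_adjacency`, `matmul` and `gcn_layer` as well as the toy inputs are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square 0/1 adjacency matrix."""
    n = len(adj)
    # Self-loops let each character keep part of its own embedding.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, features, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    a_norm = normalize_adjacency(adj)
    hidden = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy input (assumed): 3 characters, undirected edges 0-1 and 1-2,
# 2-dimensional stand-ins for the BERT character embeddings.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the output shows pure neighbour mixing
out = gcn_layer(adj, features, weight)
```

With the identity weight matrix, each output row is a degree-normalized mixture of a character's own embedding and those of its dependency neighbours, which is the sense in which the representation "integrates" structure.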

[0018] Preferably, the dependency structure connection relationship in S4 specifically includes:

[0019] A word vector is obtained by adding the vectors of the characters that make up the word. Characters that make up the same word share the dependency edges of the word.
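
One possible reading of this rule is sketched below: a word-level dependency edge is copied to every pair of characters drawn from the head word and the dependent word, and a word vector is the element-wise sum of its character vectors. The function names and the toy sentence are hypothetical, and the exact edge-sharing scheme is an assumption rather than something the text specifies.

```python
def char_edges_from_word_edges(words, word_edges):
    """Expand word-level dependency edges to character level.

    words: list of word strings in sentence order.
    word_edges: list of (head_word_idx, dep_word_idx, relation) triples.
    Every character of the head word is linked to every character of the
    dependent word with the same relation label.
    """
    spans, start = [], 0
    for w in words:
        spans.append(range(start, start + len(w)))
        start += len(w)
    char_edges = [
        (h, d, rel)
        for head, dep, rel in word_edges
        for h in spans[head]
        for d in spans[dep]
    ]
    return spans, char_edges

def word_vector(char_vectors, span):
    """Word vector as the element-wise sum of its character vectors."""
    return [sum(vals) for vals in zip(*(char_vectors[i] for i in span))]

# Hypothetical toy sentence: word "AB" (2 characters) modifies word "C".
words = ["AB", "C"]
word_edges = [(1, 0, "ATT")]  # word 1 is the head of word 0
spans, char_edges = char_edges_from_word_edges(words, word_edges)
char_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
vec_ab = word_vector(char_vectors, spans[0])
```

Here both characters of "AB" inherit the ATT edge to "C", and `vec_ab` is the sum of the two character vectors.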

[0020] Preferably, S4 specifically includes:

[0021] (1) The dependency connection relationship between characters / words is reconstructed by using the graph autoencoder (GAE) method to obtain the adjacency matrix;

[0022] (2) In order for the model to learn the connection relationship type between encoded characters / words, a classification task for edge vectors is constructed to output the prediction of the connection edge type.
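
The reconstruction in item (1) can be sketched as follows. The patent names the graph autoencoder (GAE) method but does not give its decoder or loss, so the inner-product decoder and binary cross-entropy used here (the common GAE formulation) are assumptions, as are all names and the toy data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reconstruct_adjacency(z):
    """Inner-product decoder: A_hat[i][j] = sigmoid(z_i . z_j)."""
    return [[sigmoid(sum(a * b for a, b in zip(zi, zj))) for zj in z] for zi in z]

def reconstruction_loss(adj, adj_hat, eps=1e-12):
    """Mean binary cross-entropy between the true and reconstructed adjacency."""
    n = len(adj)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(adj_hat[i][j], eps), 1.0 - eps)
            total -= adj[i][j] * math.log(p) + (1 - adj[i][j]) * math.log(1.0 - p)
    return total / (n * n)

# Toy graph (assumed): characters 0 and 1 are connected, 2 is isolated;
# the diagonal carries self-loops.
adj = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
z_good = [[3.0, 0.0], [3.0, 0.0], [0.0, 3.0]]  # encodings that respect the graph
z_bad = [[0.0, 3.0], [3.0, 0.0], [3.0, 0.0]]   # encodings that contradict it
loss_good = reconstruction_loss(adj, reconstruct_adjacency(z_good))
loss_bad = reconstruction_loss(adj, reconstruct_adjacency(z_bad))
```

Encodings that place connected characters close together yield the lower loss, so minimizing it pushes the encoder to preserve the dependency connections.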

[0023] Preferably, S7 specifically includes: weighted summation of the loss of the dependency edge prediction model and the loss of the named entity recognition model, followed by loss calculation and gradient backpropagation, and then optimization of the model parameters to obtain the recognition model.

[0024] Preferably, the specific process of loss calculation and gradient backpropagation in S7 is as follows:

[0025] $L = L_E(\theta) + L_{DR}(\theta)$

[0026] $L_E(\theta) = -\log P_\theta(Y_E \mid S^{(\text{GCN-}L)}, A)$

[0027] $L_{DR}(\theta) = -\log P_\theta(Y_{DR} \mid E^{(\text{GCN-}L)}, A)$

[0028] Here, $Y_E$ is the entity label sequence, $L_E(\theta)$ is the entity prediction loss, $Y_{DR}$ denotes the dependency relationship type labels, $L_{DR}(\theta)$ is the dependency classification loss, $A$ denotes the graph convolutional neural network (parameters), $P_\theta$ is the predicted probability, and $\theta$ denotes the parameters of the entire network to be optimized.
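
A small numeric sketch of how the two terms combine is given below. The weights `w_entity` and `w_dep` are hypothetical parameters reflecting the "weighted summation" mentioned for S7; with both set to 1.0 the function reduces to the unweighted sum in the formula. Treating the dependency labels as independent given the edge vectors is also an assumption.

```python
import math

def joint_loss(p_entity_seq, p_dep_labels, w_entity=1.0, w_dep=1.0):
    """Return (L, L_E, L_DR) with L = w_entity * L_E + w_dep * L_DR.

    p_entity_seq: model probability of the gold entity label sequence.
    p_dep_labels: model probabilities of each gold dependency edge label.
    """
    loss_e = -math.log(p_entity_seq)                    # L_E = -log P(Y_E | ...)
    loss_dr = -sum(math.log(p) for p in p_dep_labels)   # L_DR = -log P(Y_DR | ...)
    return w_entity * loss_e + w_dep * loss_dr, loss_e, loss_dr

# Toy probabilities (assumed values, not results from the patent).
total, loss_e, loss_dr = joint_loss(0.5, [0.5, 0.25])
```

Because both terms depend on the shared encoder parameters, gradients from the dependency term also update the representation used for entity prediction.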

[0029] Preferably, the joint optimization loss in S7 specifically includes:

[0030] Incorporating the loss of the dependency edge prediction model forces the model to associate its decisions with the incorporated structural information, making the model sensitive to structural information.

[0031] This invention also provides a named entity recognition system for electronic medical records with dependency syntax structures. The system is used to implement a named entity recognition method for electronic medical records with dependency syntax structures, specifically including:

[0032] The sample acquisition module is used to acquire the training set of Chinese electronic medical record texts;

[0033] The dependency parsing module is used to obtain the dependency parsing structure information of the text in the training set using a Chinese dependency parser;

[0034] The Chinese training module is used to obtain the embedding vector of each character in the text using the pre-trained Chinese language model BERT and combine them into a text sequence representation.

[0035] The encoding module is used to feed the character embedding vectors and dependency syntax structure information into the graph convolutional neural network layer for encoding training, so as to obtain the text feature representation vector and dependency edge type vector that integrate the text dependency structure information.

[0036] The entity prediction module is used to feed the text feature sequence into the CRF layer, decode the entity labels, and predict the entity label sequence.

[0037] The dependency edge prediction module is used to feed the dependency edge type vector into the fully connected layer, perform dependency relationship label classification, and predict the dependency edge type.

[0038] The optimization module is used to jointly optimize the loss of the prediction results to obtain the recognition model. The Chinese electronic medical record text can be input into the recognition model for recognition.

[0039] The present invention also provides an electronic device, including a memory and a processor, wherein the processor is used to implement the steps of a dependency syntax structure-based electronic medical record named entity recognition method when executing a computer management program stored in the memory.

[0040] The present invention also provides a computer-readable storage medium storing a computer management program thereon, wherein the computer management program, when executed by a processor, implements the steps of a dependency syntax-based electronic medical record named entity recognition method.

[0041] Beneficial Effects: This invention provides a method and system for named entity recognition in electronic medical records based on dependency syntax structure. The dependency syntax structure information of the text is encoded through a graph neural network, and the loss of a dependency edge (dependency relation) prediction model is incorporated at the same time. By maximizing the margin between models with and without incorporated structural information, the method forces the model to associate its decisions with the incorporated structural information, making the model more sensitive to structural information. This solution improves named entity recognition performance on Chinese electronic medical record texts by leveraging the structural information of the text. Especially for similar entity references with inconsistent expressions, the model can use the incorporated dependency syntax structure information to accurately determine entity boundaries, thus improving the recognition accuracy for these entities.

Attached Figure Description

[0042] Figure 1 is a flowchart of a method for named entity recognition in electronic medical records with dependency syntax structure provided by the present invention;

[0043] Figure 2 is a schematic diagram of the hardware structure of a possible electronic device provided by the present invention;

[0044] Figure 3 is a schematic diagram of the hardware structure of a possible computer-readable storage medium provided by the present invention;

[0045] Figure 4 is a schematic diagram of the GAE training method provided by the present invention;

[0046] Figure 5 is the dependency parsing annotation relation graph provided by the present invention.

Detailed Implementation

[0047] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.

[0048] This invention employs a structural information integration approach that directly constructs structural information connection relationships. It utilizes an external structural parser to obtain text dependency structure information, constructs a text structural information encoding model through a graph convolutional neural network, and builds a structure-sensitive loss function to enhance the model's representation of the semantic and dependency structure information of the text. On this basis, it proposes a method for named entity recognition in electronic medical records based on dependency syntax structures. As shown in Figure 1, the specific steps are as follows:

[0049] S1: Obtain the training set of Chinese electronic medical record texts and preprocess the texts.

[0050] S2: Use the dependency parser (publicly available from Stanford CoreNLP) to obtain the dependency parsing information of the text.

[0051] S3: Use the pre-trained Chinese language model BERT to obtain the embedding vector of each character in the text, thus obtaining the text sequence representation. In the notation for this representation, L is the number of BERT layers, and i and n are the character indices of the embedding vectors.

[0052] S4: The character embedding vectors and dependency syntax structure information are fed into the graph convolutional neural network layer. By reconstructing the dependency structure connections between characters and words, the dependency structure connection information between characters is encoded to obtain the text feature representation vectors $S^{(\text{GCN-}L)}$, which integrate text dependency structure information, and the dependency edge type vectors $E^{(\text{GCN-}L)}$. Each contains one vector per character: the text feature vector of the i-th character and the dependency edge type vector of the i-th character, respectively.

[0053] S4.1 The word vector is obtained by adding the character vectors that make up the word. Characters that make up the same word share the dependency edges of the word.

[0054] S4.2 Training Method:

[0055] (1) Employing a graph autoencoder (GAE) method, as shown in Figure 4, the dependency connections between characters / words are reconstructed to obtain the adjacency matrix;

[0056] (2) In order for the model to learn the connection relationship type between encoded characters / words, a classification task for edge vectors is constructed to output the prediction of the connection edge type.

[0057] S5: The output feature sequence of the graph convolutional neural network layer, i.e., the text feature sequence $S^{(\text{GCN-}L)}$ that incorporates text dependency structure information, is fed into the CRF layer for entity label decoding to predict the entity label sequence. This constitutes the named entity recognition model.

[0058] S6: The dependency edge type vectors $E^{(\text{GCN-}L)}$ output by the graph convolutional neural network layer are fed into a fully connected layer for dependency edge type (dependency relation label) classification. This constitutes the dependency edge type prediction model. As shown in Figure 5, there are a total of 14 types of dependency parsing annotation relations, which will not be elaborated here.

[0059] S7: Based on the prediction results of S5 and S6, a joint optimization loss is performed to obtain the recognition model. The Chinese electronic medical record text can then be input into the recognition model for identification. Specifically, the loss of the dependency edge prediction model and the loss of the named entity recognition model are weighted and summed, then the loss is calculated and gradients are backpropagated. Finally, the model parameters are optimized to obtain the recognition model.
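
The entity label decoding of step S5 can be illustrated with a Viterbi search over per-character tag scores and tag-to-tag transition scores. The patent names the CRF layer but not the decoding algorithm, so the use of Viterbi (the standard choice for linear-chain CRFs), the BIO tag set, and every score below are assumptions for illustration.

```python
def viterbi_decode(emissions, transitions):
    """Return (best_score, best_path) for a linear-chain model.

    emissions[t][k]: score of tag k at position t.
    transitions[j][k]: score of moving from tag j to tag k.
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for k in range(n_tags):
            # Best previous tag for ending in tag k at this position.
            best_prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][k])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k] + emit[k])
        score = new_score
        backpointers.append(pointers)
    last = max(range(n_tags), key=lambda k: score[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path

TAGS = ["O", "B", "I"]          # hypothetical BIO tag set
FORBIDDEN = -1e4                # large penalty for an invalid transition
transitions = [
    [0.0, 0.0, FORBIDDEN],      # from O: an I tag may not follow O
    [0.0, 0.0, 0.0],            # from B
    [0.0, 0.0, 0.0],            # from I
]
emissions = [                   # made-up scores for a 4-character text
    [0.1, 2.0, 0.0],
    [0.0, 0.5, 1.5],
    [0.2, 0.0, 1.0],
    [2.0, 0.0, 0.3],
]
best_score, best_path = viterbi_decode(emissions, transitions)
labels = [TAGS[i] for i in best_path]
```

The transition scores are what let the decoder enforce well-formed entity boundaries, which is where better character representations from the graph layer would take effect.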

[0060] The loss consists of two parts: entity prediction loss and dependency classification loss.

[0061] $L = L_E(\theta) + L_{DR}(\theta)$

[0062] $L_E(\theta) = -\log P_\theta(Y_E \mid S^{(\text{GCN-}L)}, A)$

[0063] $L_{DR}(\theta) = -\log P_\theta(Y_{DR} \mid E^{(\text{GCN-}L)}, A)$

[0064] Here, $Y_E$ is the entity label sequence, $L_E(\theta)$ is the entity prediction loss, $Y_{DR}$ denotes the dependency relationship type labels, $L_{DR}(\theta)$ is the dependency classification loss, $A$ denotes the graph convolutional neural network (parameters), $P_\theta$ is the predicted probability, and $\theta$ denotes the parameters of the entire network to be optimized.

[0065] The joint optimization loss incorporates the loss of the dependency edge prediction model, which forces the model to associate its decisions with the incorporated structural information and makes the model sensitive to structural information. Through joint training that minimizes both losses, the model, when encoding character vectors, attends to the connection information between characters and also learns the types of connection relationships between them. It can then use this information to make decisions when predicting entity labels.
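
To make the dependency classification term concrete, the sketch below passes one edge type vector through a fully connected layer and a softmax over the 14 relation types mentioned in step S6, then takes the negative log-probability of the gold label. The layer shape, the toy weights, and all function names are hypothetical; only the count of 14 relation types comes from the description.

```python
import math

NUM_RELATIONS = 14  # number of dependency relation types given in the description

def linear(x, weight, bias):
    """Fully connected layer: one output score per relation type."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def edge_type_loss(edge_vector, gold_label, weight, bias):
    """Return (-log P(gold_label | edge_vector), probabilities)."""
    probs = softmax(linear(edge_vector, weight, bias))
    return -math.log(probs[gold_label]), probs

# Toy parameters (assumed): 3-dimensional edge vectors, weights that map
# input dimension d to relation d and give every other relation a zero score.
dim = 3
weight = [[1.0 if c == d else 0.0 for d in range(dim)] for c in range(NUM_RELATIONS)]
bias = [0.0] * NUM_RELATIONS
edge_vector = [0.0, 4.0, 0.0]
loss, probs = edge_type_loss(edge_vector, gold_label=1, weight=weight, bias=bias)
predicted = max(range(NUM_RELATIONS), key=lambda c: probs[c])
```

Summing this quantity over all edges gives one way to compute the dependency classification loss that is minimized jointly with the entity prediction loss.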

[0066] In a specific implementation scenario, here are two examples of syntactic structure parsing results for text:

[0067] Medical Record A: "...Cervical spine oblique X-ray examination performed in the outpatient clinic..."

[0068] Medical Record B: "...Cervical spine oblique DR X-ray examination performed in the outpatient clinic..."

[0069] Although the word segmentation methods for the entities 【Cervical Spine Bilateral Oblique View】 and 【Cervical Spine Bilateral Oblique View DR View】 are different, and the semantic representations at the individual character / word level are inconsistent, from the perspective of sentence dependency structure analysis, the dependency relationships within the two entities and between the preceding and following words are the same. In particular, for the dependency relationships within the entities (ATT: attributive-head relation), entity boundary recognition errors can be effectively avoided.

[0070] By designing a structural contrast loss function, the model can focus on the structural information of the text and make full use of it when making decisions. This alleviates the difficulty of determining the boundaries of similar entities in the electronic medical record named entity recognition task and significantly improves the model's recognition performance.

[0071] Figure 2 is a schematic diagram of an embodiment of the electronic device provided by the present invention. As shown in Figure 2, an embodiment of the present invention provides an electronic device, including a memory 1310, a processor 1320, and a computer program 1311 stored in the memory 1310 and executable on the processor 1320. When the processor 1320 executes the computer program 1311, it performs the following steps: S1: obtaining a training set of Chinese electronic medical record texts;

[0072] S2: using a Chinese dependency parser to obtain dependency syntax structure information of the text in the training set;

[0073] S3: using the pre-trained Chinese language model BERT to obtain the embedding vector of each character in the text and combining them into a text sequence representation;

[0074] S4: feeding the character embedding vectors and dependency syntax structure information into the graph convolutional neural network layer for encoding training, to obtain the text feature representation vector and dependency edge type vector that integrate text dependency structure information;

[0075] S5: feeding the text feature sequence into the CRF layer to decode the entity labels and predict the entity label sequence;

[0076] S6: feeding the dependency edge type vector into the fully connected layer to perform dependency relationship label classification and predict the dependency edge type;

[0077] S7: jointly optimizing the loss of the prediction results of S5 and S6 to obtain the recognition model. Chinese electronic medical record text can then be input into the recognition model for recognition.

[0078] Figure 3 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention. As shown in Figure 3, this embodiment provides a computer-readable storage medium 1400, on which a computer program 1411 is stored. When the computer program 1411 is executed by a processor, it performs the following steps: S1: obtaining a training set of Chinese electronic medical record texts;

[0079] S2: using a Chinese dependency parser to obtain dependency syntax structure information of the text in the training set;

[0080] S3: using the Chinese pre-trained language model BERT to obtain the embedding vector of each character in the text and combining them into a text sequence representation;

[0081] S4: feeding the character embedding vectors and dependency syntax structure information into the graph convolutional neural network layer for encoding training, to obtain the text feature representation vector and dependency edge type vector that integrate text dependency structure information;

[0082] S5: feeding the text feature sequence into the CRF layer to decode the entity labels and predict the entity label sequence;

[0083] S6: feeding the dependency edge type vector into the fully connected layer to perform dependency relationship label classification and predict the dependency edge type;

[0084] S7: jointly optimizing the loss of the prediction results of S5 and S6 to obtain the recognition model. Chinese electronic medical record text can then be input into the recognition model for recognition.

[0085] It should be noted that the descriptions of each embodiment in the above embodiments have different focuses. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0086] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0087] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.

[0088] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.

[0089] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.

[0090] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention.

[0091] Obviously, those skilled in the art can make various modifications and variations to this invention without departing from its spirit and scope. Therefore, if these modifications and variations fall within the scope of the claims of this invention and their equivalents, this invention also intends to include these modifications and variations.

Claims

1. A method for named entity recognition in electronic medical records with dependency syntax, characterized in that it includes the following steps:
S1: obtain the training set of Chinese electronic medical record texts;
S2: use a Chinese dependency parser to obtain dependency syntax structure information of the text in the training set;
S3: use the Chinese pre-trained language model BERT to obtain the embedding vector of each character in the text and combine them into a text sequence representation;
S4: feed the character embedding vectors and dependency syntax structure information into the graph convolutional neural network layer for encoding training, to obtain a text feature representation vector that integrates text dependency structure information and a dependency edge type vector;
S5: feed the text feature sequence into the CRF layer to decode the entity labels and predict the entity label sequence;
S6: feed the dependency edge type vector into the fully connected layer to perform dependency relationship label classification and predict the dependency edge type;
S7: jointly optimize the loss of the prediction results of S5 and S6 to obtain the recognition model, into which Chinese electronic medical record text can be input for recognition;
wherein S4 specifically includes: the embedding vectors of characters and dependency syntax structure information are fed into the graph convolutional neural network layer, and by reconstructing the dependency structure connection relationship between characters and words, the dependency structure connection information between characters is encoded to obtain the text feature representation vector and dependency edge type vector that integrate the text dependency structure information;
S4 specifically includes: (1) the dependency connection relationship between characters / words is reconstructed by using the graph autoencoder (GAE) method to obtain the adjacency matrix; (2) in order for the model to learn the connection relationship type between encoded characters / words, a classification task for edge vectors is constructed to output the prediction of the connection edge type;
S7 specifically includes: weighted summation of the loss of the dependency edge prediction model and the loss of the named entity recognition model, followed by loss calculation and gradient backpropagation, and then optimization of the model parameters to obtain the recognition model;
the joint optimization loss in S7 specifically includes: incorporating the loss of the dependency edge prediction model forces the model to associate its decisions with the incorporated structural information, making the model sensitive to structural information.

2. The method for named entity recognition of electronic medical records with dependency syntax structure according to claim 1, characterized in that the dependency structure connection relationships in S4 specifically include: a word vector is obtained by adding the vectors of the characters that make up the word, and characters that make up the same word share the dependency edges of the word.

3. The method for named entity recognition of electronic medical records with dependency syntax structure according to claim 1, characterized in that the specific process of loss calculation and gradient backpropagation in S7 is as follows:
$L = L_E(\theta) + L_{DR}(\theta)$
$L_E(\theta) = -\log P_\theta(Y_E \mid S^{(\text{GCN-}L)}, A)$
$L_{DR}(\theta) = -\log P_\theta(Y_{DR} \mid E^{(\text{GCN-}L)}, A)$
where $Y_E$ is the entity label sequence, $L_E(\theta)$ is the entity prediction loss, $Y_{DR}$ denotes the dependency relationship type labels, $L_{DR}(\theta)$ is the dependency classification loss, $A$ denotes the graph convolutional neural network, $P_\theta$ denotes the predicted probability, and $\theta$ represents the parameters of the entire network that need to be optimized.

4. A named entity recognition system for electronic medical records with a dependency syntax structure, characterized in that the system is used to implement the electronic medical record named entity recognition method with dependency syntax structure as described in any one of claims 1-3, specifically including:
a sample acquisition module, used to acquire the training set of Chinese electronic medical record texts;
a dependency parsing module, used to obtain the dependency parsing structure information of the text in the training set using a Chinese dependency parser;
a Chinese training module, used to obtain the embedding vector of each character in the text using the pre-trained Chinese language model BERT and combine them into a text sequence representation;
an encoding module, used to feed the character embedding vectors and dependency syntax structure information into the graph convolutional neural network layer for encoding training, so as to obtain the text feature representation vector and dependency edge type vector that integrate the text dependency structure information;
an entity prediction module, used to feed the text feature sequence into the CRF layer, decode the entity labels, and predict the entity label sequence;
a dependency edge prediction module, used to feed the dependency edge type vector into the fully connected layer, perform dependency relationship label classification, and predict the dependency edge type;
an optimization module, used to jointly optimize the loss of the prediction results to obtain the recognition model, into which Chinese electronic medical record text can be input for recognition.

5. An electronic device, characterized in that it includes a memory and a processor, wherein the processor is used to implement the steps of the electronic medical record named entity recognition method with dependency syntax structure as described in any one of claims 1-3 when executing a computer management program stored in the memory.

6. A computer-readable storage medium, characterized in that it stores a computer management program which, when executed by a processor, implements the steps of the electronic medical record named entity recognition method with dependency syntax structure as described in any one of claims 1-3.