Medical record structured analysis method based on medical field entities

A structured-analysis technology based on medical field entities, applied in unstructured text data retrieval, special data processing applications, instruments, etc., to improve accuracy, avoid ambiguity, and improve recognition results.

Active Publication Date: 2019-07-19

AI-Extracted Technical Summary

Problems solved by technology

[0015] The purpose of the present invention is to provide a method for structured analysis of medical records base...


The invention discloses a medical record structured analysis method based on medical field entities. The method comprises the steps of: 1) building a medical entity and attribute category table for common medical record text and mapping entities to their attributes; 2) identifying the medical entities in the medical record text with a Bert_BiLSTM_CRF model; 3) segmenting the medical record text according to semantics to form events; 4) recombining the events; 5) constructing an attribute recognition model and extracting the attributes in the segmented events; 6) connecting the medical entities of the events in the same sentence by means of a knowledge graph to obtain the relationships between the entities; and 7) customizing different attribute recognition models for different types of medical record text segments, and finally accumulating the structured analysis results in text order to form the final structured analysis of the medical record.

Application Domain

Semantic analysis, Special data processing applications (+2 more)

Technology Topic

Structured analysis, Knowledge graph (+6 more)



Example Embodiment

[0028] The present invention will be further described in detail below with reference to the drawings and embodiments.
[0029] Figure 1 is a framework diagram of the overall implementation of the medical record structured analysis method based on medical field entities in this application. The method includes the following steps:
[0030] Step 1: Medical researchers select the entities in the medical field, which mainly include diseases, symptoms, drugs, examinations, signs, and treatments. Table 1 is the framework of medical record entities and their corresponding attributes defined in this application.
[0031] Table 1:
[0033] Step 2: Build a mapping table between entities and attributes. The attributes are likewise set by practitioners with medical experience according to business needs, and include: location, occurrence time, duration, frequency, size, quantity, degree, inducing factors (triggers), aggravating factors, mitigating factors, nature, color, smell, state, stage/type, dosage, efficacy, administration method, therapeutic effect, examination description value, etc. See Figure 2 for the specific mapping.
[0034] Step 3: Build the Bert_BiLSTM_CRF model to identify medical entities in the medical record text. The entities fall mainly into six categories: diseases, symptoms, drugs, examinations, signs, and treatments. Bert_BiLSTM_CRF has three parts: Bert as the pre-trained input layer, BiLSTM as the intermediate layer, and CRF as the top output layer. The details are as follows:
[0035] We first introduce the Bert pre-trained model. Google's Bert is trained as a deep bidirectional Transformer encoder representation, which uses context from both the left and the right in every layer. A trained Bert model can be transferred to other tasks with only small additions, and it has achieved the best results in 11 tasks and competitions in the field of natural language processing. The Bert_BiLSTM_CRF model introduced here adds BiLSTM_CRF on top of bert-base-chinese. The input text is first preprocessed: paragraph start and end markers and a paragraph id are added, the text is split into characters, and each character is mapped to its id and converted into a vector. The position of each character in the text is recorded and converted into a vector, and the paragraph id is converted into a vector as well. The character vectors, position vectors, and paragraph vectors are fed into the deep bidirectional Transformer model, whose output vectors serve as the input of BiLSTM_CRF. Finally, the BiLSTM_CRF model predicts the category of each character, and the per-character labels are combined into spans, which form the entity recognition result.
[0036] A brief introduction to the Bert model. Bert stands for Bidirectional Encoder Representations from Transformers. When a bidirectional representation model processes a word, it can use both the preceding and the following words. This bidirectionality arises because Bert differs from traditional language models: rather than predicting the most probable current word given all previous words, it randomly masks some words and predicts them from all the unmasked words. Bert can be regarded as a model that combines the advantages of OpenAI's GPT and ELMo. ELMo uses two independently trained LSTMs to obtain bidirectional information, while GPT uses the newer Transformer with a classic language model objective and obtains only unidirectional information. The main goal of Bert is to improve the pre-training task on the basis of GPT so as to exploit both the Transformer's deep model and bidirectional information.
[0037] Input representation: The model input consists of two natural sentences, sentence A and sentence B. Each character and special symbol is first converted into an embedding vector. The special token [SEP] is inserted between the two sentences to separate them, and another [SEP] is appended at the end. The special token [CLS] is placed at the very beginning of the A/B pair and can be regarded as a representation of the entire input sequence. Finally, position encoding is required by the Transformer architecture itself: a method based entirely on attention cannot encode the positional relationship between words the way a CNN or RNN does, although it is precisely this property that lets it model the relationship between two words regardless of the distance between them. Therefore, for the Transformer to perceive the positional relationship between words, position encoding is used to add position information to each word.
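The input layout described above can be illustrated with a minimal sketch. It only builds the token sequence, the sentence (segment) ids, and the position ids; the embedding lookup itself is omitted. The function name `build_bert_input` and the example sentences are illustrative assumptions, not part of the original method.

```python
from typing import List, Tuple


def build_bert_input(sentence_a: str, sentence_b: str) -> Tuple[List[str], List[int], List[int]]:
    """Return (tokens, segment_ids, position_ids) for a sentence pair."""
    # Character-level split, as the text describes for Chinese input.
    tokens_a = list(sentence_a)
    tokens_b = list(sentence_b)
    # [CLS] opens the sequence; [SEP] separates the sentences and closes the sequence.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # One position id per token, so the Transformer can see word order.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


# Hypothetical example: "头痛" (headache) and "发热" (fever).
tokens, segment_ids, position_ids = build_bert_input("头痛", "发热")
print(tokens)
print(segment_ids)
print(position_ids)
```

In a real model each of the three lists would index its own embedding table, and the three resulting vectors would be summed per token.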
[0038] The core of Bert is the pre-training process. In simple terms, the model draws two sentences from the data set, where sentence B has a 50% probability of being the actual next sentence of sentence A, and converts the two sentences into the input representation described above. Then 15% of the words in the input sequence are randomly masked, and the Transformer is asked to perform two tasks: predict the masked words, and predict the probability that sentence B is the next sentence of sentence A. For entity recognition, the token representations produced by the Bert model are retained and used as the input of the sequence labeling model, so that transfer learning is carried out on top of the Bert model to achieve entity recognition.
[0039] The Chinese pre-trained model bert-base-chinese selected here has 12 layers, 768 hidden state nodes, and 12 self-attention heads. Put simply, an attention mechanism maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is a weighted sum of the values, and the weights are computed from the query and the keys. In self-attention, query, key, and value are all equal to the input sequence x, and the number of heads h means that h different linear transformations are used. The d-dimensional query, key, and value are mapped into d_k, d_k, and d_v dimensions respectively and passed through the attention mechanism, producing h outputs of d_v dimensions each (h×d_v in total), which are concatenated and passed through one more linear transformation to obtain the final output. The specific formula is as follows:
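[0040] $$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\ KW_i^{K},\ VW_i^{V})$$
[0041] $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_n)\,W^{O}$$
[0042] Here each W denotes a weight matrix (W_i^Q, W_i^K, and W_i^V are the projections used by head i, and W^O is the output projection), and n denotes the number of heads.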
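The multi-head computation can be sketched in plain Python as follows. The text does not spell out the Attention function; the sketch assumes the standard scaled dot-product form from the Transformer architecture, softmax(QK^T / sqrt(d_k))V. The helper names and the tiny hand-written weight matrices are illustrative assumptions; a real model learns these weights.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention (assumed standard form)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head(x: Matrix, heads: List[Tuple[Matrix, Matrix, Matrix]], w_o: Matrix) -> Matrix:
    """Self-attention: Q = K = V = x, projected once per head, then concatenated."""
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs for each position, then apply W^O.
    concat = [sum((head[i] for head in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)


# Two tokens with d = 2, two heads with d_k = d_v = 1 (illustrative values only).
x = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),
]
w_o = [[1.0, 0.0], [0.0, 1.0]]
output = multi_head(x, heads, w_o)
print(output)
```

The output has one row per input token and as many columns as W^O, which is why the concatenated h×d_v vector can be mapped back to the model dimension.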
[0043] BiLSTM+CRF model: This is the current mainstream entity recognition model. BiLSTM captures the left and right context of each position in the sequence and outputs predicted label probabilities for each input character. The CRF layer on top scores label sequences for the entire text and, subject to its constraints, selects the most plausible label path. This path is the prediction of the entity recognition model, and the required target entity categories are extracted from it.
[0044] A brief introduction to the BiLSTM+CRF model. The model has two parts: the first is the bidirectional long short-term memory network (BiLSTM), and the second is the CRF layer. BiLSTM takes into account the context of each unit x in the input sequence, and the added CRF takes into account the dependencies between tags.
[0045] Part one: BiLSTM builds on LSTM, in which the input sequence passes through the forget gate, input gate, and output gate to produce a hidden state vector. A bidirectional LSTM considers the influence of the sequence in both the forward and the backward direction: a forward LSTM and a backward LSTM are combined into a BiLSTM. For example, to encode the sentence "我爱中国" ("I love China"), the forward LSTM_L reads "我", "爱", "中", and "国" in turn to obtain four vectors {h_L0, h_L1, h_L2, h_L3}, and the backward LSTM_R reads "国", "中", "爱", and "我" in turn to obtain four vectors {h_R0, h_R1, h_R2, h_R3}. Finally, the forward and backward hidden vectors are concatenated to obtain {[h_L0, h_R3], [h_L1, h_R2], [h_L2, h_R1], [h_L3, h_R0]}, i.e. {h_0, h_1, h_2, h_3}, which are then connected to the label sequence to train the model.
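The index alignment in the concatenation step is easy to get wrong, so here is a minimal sketch of it. The backward LSTM emits its states in reading order (last character first), so position i of the sentence pairs forward state i with backward state n-1-i. The function name and the one-dimensional toy vectors are illustrative assumptions.

```python
from typing import List

Vector = List[float]


def concat_bidirectional(forward: List[Vector], backward: List[Vector]) -> List[Vector]:
    """Pair forward state i with backward state n-1-i and concatenate them."""
    n = len(forward)
    if len(backward) != n:
        raise ValueError("forward and backward sequences must have equal length")
    return [forward[i] + backward[n - 1 - i] for i in range(n)]


# Toy hidden states for a four-character sentence (values are placeholders).
h_forward = [[0.1], [0.2], [0.3], [0.4]]   # h_L0 .. h_L3
h_backward = [[0.9], [0.8], [0.7], [0.6]]  # h_R0 .. h_R3, in backward reading order
h = concat_bidirectional(h_forward, h_backward)
print(h)
```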
[0046] A brief introduction to LSTM: The key to LSTM is the cell state, which works like a conveyor belt running along the entire chain with only a few linear interactions, so information can easily flow along it unchanged. LSTM can remove information from or add information to the cell state through structures called gates. A gate is a way of letting information through selectively; it consists of a sigmoid neural network layer and a pointwise multiplication. The sigmoid layer outputs a value between 0 and 1 describing how much of each component may pass: 0 means "let nothing through" and 1 means "let everything through". LSTM has three gates to protect and control the cell state: the forget gate, the input gate, and the output gate. The first step in LSTM is to decide what information to discard from the cell state; this decision is made by the forget gate layer. The next step is to decide what new information to store in the cell state, which has two parts: first, a sigmoid layer called the input gate layer decides which values to update; then a tanh layer creates a vector of new candidate values. These two pieces are then combined to update the state. Finally, the output is determined from the cell state: a sigmoid layer decides which parts of the cell state to output, and the cell state is passed through tanh (giving values between -1 and 1) and multiplied by the output of the sigmoid gate, so that only the selected parts are output. The specific formula is as follows:
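The formula itself is not reproduced in this text. The standard LSTM formulation, which matches the gate-by-gate description above, is commonly written as below. The symbols are the conventional ones and are an assumption rather than the notation of the original filing: x_t is the input at step t, h_{t-1} the previous hidden state, C_t the cell state, σ the sigmoid function, and ⊙ pointwise multiplication.

$$
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{(candidate values)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(C_t) && \text{(hidden state)}
\end{aligned}
$$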
[0048] CRF model: The CRF here uses the BMIOS tagging scheme, with one label per character: B marks the beginning of a word, M a character in the middle of a word, I the end of a word, S a single character that forms a word on its own, and O any character that is not part of a target entity. The CRF computation involves two parts: the emission probability matrix and the transition probability matrix. In BiLSTM+CRF, the output of the BiLSTM layer is the score of every tag for each character, which serves as the emission probability of that character for each tag. In the CRF transition probability matrix A, the entry A_{i,j} is the probability of transitioning from tag i to tag j. For an output tag sequence y corresponding to the input sequence X, a score is defined, and each score corresponds to one complete path. The Viterbi algorithm is used at prediction time to find the optimal path, which is the final prediction for the output sequence.
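The decoding step can be sketched as follows. The score definition is not reproduced in this text, so the sketch assumes the usual BiLSTM+CRF path score: the sum of the emission scores of the chosen tags plus the sum of the transition scores between consecutive tags. For brevity only three of the five tags are used, and all numeric scores are made-up illustrative values.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]],
            tags: List[str]) -> List[str]:
    """Return the tag path with the highest total emission + transition score."""
    # best[t][tag] is the best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["B", "I", "O"]  # B = beginning, I = end (as defined in the text), O = outside
# Emission scores for the three characters of "无头痛" (no headache); values are invented.
emissions = [
    {"B": 0.2, "I": 0.1, "O": 1.0},
    {"B": 0.8, "I": 0.9, "O": 0.3},
    {"B": 0.5, "I": 0.6, "O": 0.4},
]
# Transition scores; -10.0 stands in for a forbidden transition such as O -> I.
transitions = {
    "B": {"B": -10.0, "I": 1.0, "O": -10.0},
    "I": {"B": 0.0, "I": -10.0, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.0},
}
print(viterbi(emissions, transitions, tags))
```

Choosing the best tag independently at each position would label the second character I, which cannot follow O; the transition scores steer the decoder to the consistent path O, B, I.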
[0049] Step 4: Segment the medical record text according to semantics to form events, where an event represents a relatively complete unit of semantic content.
[0050] First, the text is given an initial, surface-level segmentation at common Chinese and English punctuation marks; the fragments delimited by punctuation are the smallest units.
[0051] Second, the dictionaries of the various entity types are imported. Because time expressions have their own particular patterns, the time recognition model is embedded in the word segmentation model so that times and entities are identified during word segmentation. The smallest events are segmented, and each event is kept together with its entities and the identifier of the sentence it belongs to.
[0052] After segmentation, new events are formed according to the following criteria:
[0053] First, examine the punctuation mark that ends each segmented event. A period marks the end of a sentence, so the next event starts a new sentence; record the sentence identifier. Any other punctuation mark that indicates a semantic pause delimits an event, and the sentence identifier is attached to that event.
[0054] Then determine whether events need to be combined. If the first event in a sentence does not contain an entity, the next event is appended to it to form one complete event, following the principle of forward maximum matching, and this continues until an event containing an entity is reached. At any other position, if the next event does not contain an entity, it is appended to the current event to form a new event, and this is repeated until the next event contains an entity. All sentences are divided into events in this way. Each event is then treated as the text range within which entities and attributes are matched.
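The segmentation and merging rules can be sketched as below. This is one reading of the criteria, stated here as an assumption: a new event starts only when the event built so far already contains an entity and the next fragment also contains one, which covers both the leading-fragment rule and the trailing-fragment rule. Entity detection is reduced to dictionary substring lookup, the punctuation sets are simplified, and merged fragments are rejoined with a Chinese comma; all names and the sample sentence are hypothetical.

```python
import re
from typing import List, Set, Tuple

Event = Tuple[int, str]  # (sentence identifier, event text)


def split_events(text: str) -> List[Event]:
    """Split at periods into sentences, then at pause punctuation into minimal events."""
    events: List[Event] = []
    sentence_id = 0
    for sentence in text.split("。"):
        if not sentence.strip():
            continue
        for fragment in re.split(r"[，,；;]", sentence):
            if fragment.strip():
                events.append((sentence_id, fragment.strip()))
        sentence_id += 1
    return events


def has_entity(fragment: str, entities: Set[str]) -> bool:
    return any(entity in fragment for entity in entities)


def merge_events(events: List[Event], entities: Set[str]) -> List[Event]:
    """Merge entity-free fragments into a neighbouring event of the same sentence."""
    merged: List[Event] = []
    for sentence_id, fragment in events:
        if merged and merged[-1][0] == sentence_id:
            prev_id, prev_text = merged[-1]
            if not has_entity(prev_text, entities) or not has_entity(fragment, entities):
                merged[-1] = (prev_id, prev_text + "，" + fragment)
                continue
        merged.append((sentence_id, fragment))
    return merged


# Hypothetical record: "3 days ago after catching a chill, developed cough,
# with fever, highest temperature 38.5℃. Denies hypertension."
text = "患者3天前受凉后，出现咳嗽，伴发热，体温最高38.5℃。否认高血压。"
entity_dictionary = {"咳嗽", "发热", "高血压"}
merged_events = merge_events(split_events(text), entity_dictionary)
for event in merged_events:
    print(event)
```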
[0055] Step 5: Extract the attributes in each event to form entity-attribute pairs.
[0056] Attribute recognition: status, i.e. present, absent, or uncertain. Medical records often contain expressions such as "denies XX disease" or "no XX symptoms", so status is particularly important in medical record analysis. Here, common negation words drawn from experience with medical history writing are collected into a dictionary. After the word segmentation model has segmented the text, these words are matched greedily, one by one, to the entities in the same event that can carry a status attribute (diseases and symptoms).
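A minimal sketch of dictionary-based status assignment follows. It assumes the event has already been segmented into tokens, and it reads "greedy matching" as applying a cue found in the event to every disease or symptom entity in that event. The cue words, the separate list of uncertainty cues, and the function name are illustrative assumptions; the text itself only mentions a dictionary of negation words.

```python
from typing import Dict, List, Tuple

NEGATION_CUES = {"否认", "无", "未见"}   # hypothetical: "denies", "no", "not seen"
UNCERTAIN_CUES = {"可疑", "待排"}        # hypothetical: "suspected", "to be ruled out"
STATUS_ENTITY_TYPES = {"disease", "symptom"}


def assign_status(tokens: List[str], entities: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each disease/symptom entity in one event to present, absent or uncertain."""
    # Matching whole tokens avoids false hits inside words such as "无力" (weakness).
    if any(token in NEGATION_CUES for token in tokens):
        status = "absent"
    elif any(token in UNCERTAIN_CUES for token in tokens):
        status = "uncertain"
    else:
        status = "present"
    return {text: status for text, entity_type in entities
            if entity_type in STATUS_ENTITY_TYPES}


# "Denies hypertension, diabetes"
print(assign_status(["否认", "高血压", "、", "糖尿病"],
                    [("高血压", "disease"), ("糖尿病", "disease")]))
# "Limb weakness": the token "无力" is a symptom, not a negation cue.
print(assign_status(["四肢", "无力"], [("无力", "symptom")]))
```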
[0057] Attribute recognition: occurrence time and duration. Because the time recognition model is embedded in the word segmentation model, the times in each event have already been recognized in the preceding event segmentation step; what remains is to decide whether a time is an occurrence time or a duration. By convention, an occurrence time is a point in time and a duration is a time period. Based on the difference between a time period and a time point, the two can be distinguished with regular-expression patterns. Since the time model cannot identify certain hospital-specific time words, such as discharge and admission, additional regular-expression rules are added to identify them.
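The distinction between a time point and a time period can be sketched with a few regular expressions. The patterns below are illustrative assumptions covering only a handful of Chinese time formats; the actual rule set is not given in the text.

```python
import re
from typing import Optional

# Hospital-specific words treated as time points (hypothetical list: admission, discharge).
HOSPITAL_TIME_WORDS = {"入院", "出院"}
# Time points: a calendar date, or an amount of time followed by "前" (ago).
TIME_POINT = re.compile(
    r"\d{4}年\d{1,2}月(?:\d{1,2}日)?|\d{4}-\d{1,2}-\d{1,2}|\d+(?:小时|天|周|个月|年)前"
)
# Durations: an amount of time, optionally followed by "余" (more than).
DURATION = re.compile(r"\d+(?:小时|天|周|个月|年)余?")


def classify_time(expression: str) -> Optional[str]:
    """Return 'occurrence_time', 'duration', or None if no rule applies."""
    if expression in HOSPITAL_TIME_WORDS or TIME_POINT.fullmatch(expression):
        return "occurrence_time"
    if DURATION.fullmatch(expression):
        return "duration"
    return None


for sample in ["2019年7月", "3天前", "3天", "2年余", "入院"]:
    print(sample, classify_time(sample))
```

Time points are checked first so that a date such as "2019年7月" is not mistaken for a period of 2019 years.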
[0058] Attribute recognition: location. Some body parts are obtained from the domain dictionary during word segmentation; in addition, regular-expression rules are used to expand them, merging adjacent body parts and combining a body part with nearby position words to generate a new location.
[0059] Attribute recognition: frequency. Some frequencies are obtained from the domain dictionary during word segmentation; in addition, regular-expression rules built from the expressions commonly used with each entity type identify further frequencies. Frequency expressions differ considerably between entity types: for symptoms the form is generally "X times", while for drugs it is generally "X times/day", and so on.
[0060] Attribute recognition: size and quantity. These are recognized by pattern matching. Size falls into two categories: an adjective describing the size of an object, or a value plus a unit of measurement. The unit of measurement is identified to locate the attribute, which is then extracted by pattern matching. The unit category decides the attribute: if the unit measures volume or mass, the attribute is recorded as quantity; otherwise it is recorded as size. The unit therefore has to be identified within the size expression. Because the units in the dictionary are generally single units and do not include compound units joined by symbols such as / or *, rules are added to identify compound units before the size attribute is identified.
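The value-plus-unit branch of this rule can be sketched as follows. The unit lists are small illustrative assumptions, adjective-style sizes are not covered, and a compound unit is classified by its first component, which is a simplifying assumption rather than something the text specifies.

```python
import re
from typing import List, Tuple

VOLUME_MASS_UNITS = {"ml", "l", "mg", "g", "kg"}  # hypothetical unit dictionary
LENGTH_UNITS = {"mm", "cm", "m"}                  # hypothetical unit dictionary

# Longest units first, so that "mm" is tried before "m" and "kg" before "g".
_UNIT = "|".join(sorted(VOLUME_MASS_UNITS | LENGTH_UNITS, key=len, reverse=True))
# A number, then a unit, optionally extended into a compound unit with / or *.
MEASURE = re.compile(rf"(\d+(?:\.\d+)?)\s*((?:{_UNIT})(?:[/*](?:{_UNIT}))*)(?![a-z])")


def extract_measures(text: str) -> List[Tuple[str, str, str]]:
    """Return (value, unit, attribute) triples, attribute being 'quantity' or 'size'."""
    results = []
    for match in MEASURE.finditer(text.lower()):
        value, unit = match.group(1), match.group(2)
        first_unit = re.split(r"[/*]", unit)[0]
        attribute = "quantity" if first_unit in VOLUME_MASS_UNITS else "size"
        results.append((value, unit, attribute))
    return results


print(extract_measures("肿块约2.5cm"))   # a lump of about 2.5 cm
print(extract_measures("呕血约200ml"))   # vomited about 200 ml of blood
print(extract_measures("剂量5mg/kg"))    # dose of 5 mg/kg
```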
[0061] Attribute recognition: inducing factors (triggers), aggravating factors, and mitigating factors. Triggers are recognized with regular-expression rules and part-of-speech information, and generally appear near symptom and disease entities. When a trigger is followed by a change in symptoms, it is an aggravating factor if it causes the symptoms or disease to worsen, and a mitigating factor if it causes them to lessen.
[0062] Attribute recognition: administration method and dosage of drugs. Regular-expression rules are derived from a large number of drug instructions and cases, and these attributes are extracted with those rules.
[0063] Attribute recognition: degree, color, nature, temperature. After dictionary-based word segmentation, words with the corresponding part of speech are selected as the attribute.
[0064] Different entities have different attributes. The entity at the center of the event is determined, and candidate entity-attribute pairs are formed according to the entity-attribute mapping table.
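Forming candidate pairs from the mapping table can be sketched as a dictionary lookup. The actual mapping is defined in Figure 2, which is not reproduced here, so the table contents below are purely illustrative assumptions, as are the function and variable names.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical excerpt of an entity-type -> allowed-attribute-types table.
ENTITY_ATTRIBUTES: Dict[str, Set[str]] = {
    "symptom": {"location", "occurrence_time", "duration", "frequency", "degree", "status"},
    "drug": {"dosage", "administration_method", "frequency"},
}


def candidate_pairs(entities: List[Tuple[str, str]],
                    attributes: List[Tuple[str, str]]) -> List[Tuple[str, str, str]]:
    """Pair each entity with every attribute in the event that its type allows.

    entities:   (entity text, entity type)
    attributes: (attribute type, attribute text)
    """
    return [
        (entity_text, attr_type, attr_text)
        for entity_text, entity_type in entities
        for attr_type, attr_text in attributes
        if attr_type in ENTITY_ATTRIBUTES.get(entity_type, set())
    ]


# Event: cough for 3 days, took amoxicillin 0.5g (hypothetical).
pairs = candidate_pairs(
    [("咳嗽", "symptom"), ("阿莫西林", "drug")],
    [("duration", "3天"), ("dosage", "0.5g")],
)
print(pairs)
```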
[0065] Entities can themselves contain attributes. For example, symptoms and diseases often contain a body part or a nature descriptor, so body part and nature attributes are extracted again from within these two entity types.
[0066] Step 6: Add logical checks to eliminate entity-attribute pairs that do not conform to medical logic.
[0067] In the matching above, every attribute in an event is paired with every entity in the event. To further reduce incorrect entity-attribute pairs, the following processing is applied. When the attribute is a nature, color, size, or quantity, it is assumed by default to belong to only one entity: the entity closest to the attribute is kept as the valid pair, and the pairs with the remaining entities are eliminated. If an event contains multiple time attributes, each entity may correspond to several of them, but some times belong not to an entity but to another attribute, so the times must be matched: if a time belongs to an aggravating or mitigating factor, it is excluded from the entity's attribute pairs. Pairs are also eliminated according to medical logic: only symptom entities such as bleeding, vomiting, nodules, and lumps can have a quantity, while most entities, such as fever and chest pain, cannot. A check is therefore added so that the quantity attribute is removed from any symptom other than bleeding, vomiting, nodules, lumps, and the like.
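Two of these checks, the nearest-entity rule and the quantity whitelist, can be sketched as below. Positions are character offsets within the event, the whitelist entries are illustrative assumptions, and the time-matching rule is left out of the sketch.

```python
from typing import List, Tuple

SINGLE_ENTITY_ATTRIBUTES = {"nature", "color", "size", "quantity"}
# Hypothetical whitelist of symptoms that may carry a quantity:
# hematemesis, vomiting, nodule, lump.
QUANTITY_SYMPTOMS = {"呕血", "呕吐", "结节", "肿块"}


def filter_pairs(entities: List[Tuple[str, int]],
                 attributes: List[Tuple[str, str, int]]) -> List[Tuple[str, str, str]]:
    """Keep only plausible (entity, attribute type, attribute text) pairs.

    entities:   (entity text, position in the event)
    attributes: (attribute type, attribute text, position in the event)
    """
    if not entities:
        return []
    pairs = []
    for attr_type, attr_text, attr_pos in attributes:
        if attr_type in SINGLE_ENTITY_ATTRIBUTES:
            # Such attributes belong to one entity only: take the nearest one.
            targets = [min(entities, key=lambda entity: abs(entity[1] - attr_pos))]
        else:
            targets = entities
        for entity_text, _ in targets:
            if attr_type == "quantity" and entity_text not in QUANTITY_SYMPTOMS:
                continue  # e.g. fever or chest pain cannot have a quantity
            pairs.append((entity_text, attr_type, attr_text))
    return pairs


filtered = filter_pairs(
    [("发热", 0), ("呕血", 6)],
    [("quantity", "200ml", 9), ("duration", "3天", 3)],
)
print(filtered)
```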
[0068] Step 7: Use the knowledge graph to link the entities within a sentence to one another.
[0069] The constructed knowledge graph is used to obtain the correspondence between entities. The knowledge graph contains synonyms and abbreviations of entities as well as established medical relationships between them, and it is used to obtain the relationships between the entities in a sentence.
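A toy version of this lookup is sketched below. The knowledge graph is reduced to two dictionaries, one for synonyms and abbreviations and one for relations between canonical entity names; their contents, the relation labels, and the function names are all illustrative assumptions.

```python
from itertools import permutations
from typing import Dict, List, Tuple

# Hypothetical graph fragments: surface form -> canonical name, and known relations.
SYNONYMS: Dict[str, str] = {"高血压病": "高血压", "HTN": "高血压"}
RELATIONS: Dict[Tuple[str, str], str] = {
    ("高血压", "头痛"): "has_symptom",
    ("高血压", "硝苯地平"): "treated_by",
}


def normalize(entity: str) -> str:
    """Map a synonym or abbreviation to its canonical name."""
    return SYNONYMS.get(entity, entity)


def link_entities(entities: List[str]) -> List[Tuple[str, str, str]]:
    """Return (head, relation, tail) for every ordered entity pair found in the graph."""
    links = []
    for head, tail in permutations(entities, 2):
        relation = RELATIONS.get((normalize(head), normalize(tail)))
        if relation is not None:
            links.append((head, relation, tail))
    return links


# Entities found in one sentence: HTN (hypertension), headache, cough.
print(link_entities(["HTN", "头痛", "咳嗽"]))
```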
[0070] Step 8: Build customized recognition models for the different text segments of the medical record; the set of text types is extensible.
[0071] The chief complaint, history of present illness, past history, personal history, family history, physical examination, diagnosis, and other text types are each structured separately, and the results are then arranged in order to form the structured analysis of the whole text. The physical examination section receives special handling: when the input text is a physical examination, entities recognized as symptoms are converted to signs, which to some extent resolves the confusion caused by the high similarity between sign and symptom terms.
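The per-section dispatch and the physical examination rule can be sketched as follows. The recognizers are stand-ins for the customized models, which the text does not specify; the section names, the toy recognizer, and the result layout are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)
Recognizer = Callable[[str], List[Entity]]


def analyze_record(sections: List[Tuple[str, str]],
                   recognizers: Dict[str, Recognizer]) -> List[dict]:
    """Run the recognizer registered for each section and keep document order."""
    results = []
    for section_type, text in sections:
        entities = recognizers[section_type](text)
        if section_type == "physical_examination":
            # In the physical examination, symptom-like findings are recorded as signs.
            entities = [(name, "sign" if etype == "symptom" else etype)
                        for name, etype in entities]
        results.append({"section": section_type, "entities": entities})
    return results


def toy_recognizer(text: str) -> List[Entity]:
    """Stand-in model: tags the word '压痛' (tenderness) as a symptom if present."""
    return [("压痛", "symptom")] if "压痛" in text else []


structured = analyze_record(
    [("chief_complaint", "腹部压痛2天"), ("physical_examination", "右下腹压痛")],
    {"chief_complaint": toy_recognizer, "physical_examination": toy_recognizer},
)
print(structured)
```

Supporting a new text type then amounts to registering one more recognizer under its section name.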
[0072] This application may have various modifications and variations. Any modification, equivalent replacement, improvement, etc., made within the spirit and principle of this application shall fall within its scope of protection.

