Cardiovascular disease medical record structuring system based on NLP
A structuring technology for cardiovascular disease medical records, applied in the fields of natural language processing and deep learning. It addresses the poor generalization and transferability and the low accuracy of general semantic representation models, and achieves strong generalization and transferability, high accuracy, and improved portability and computational efficiency.
Examples
Embodiment 1
[0063] As shown in Figure 1, the present invention proposes an NLP-based structuring system for cardiovascular disease medical records. It uses NLP techniques to convert unstructured medical records and extract information from them, producing structured text files. The system includes:
[0064] A text format conversion module which, as shown in Figure 2, converts the cardiovascular disease medical record files uploaded by users, performing format conversion on demand. Word and PDF files are supported, and the text file output by the conversion is denoted F. The file type is determined from the file suffix. If the specified file is a Word file (suffix .docx or .doc), the third-party Python library docx2txt is used to parse the text in the Word file and convert it into a string that can be manipulated in Python. If the specified file is a PDF file (suffix .pdf), the user needs to specify whether the PDF content is a text version or an image version. F...
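The suffix-based dispatch described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `detect_file_type` and `convert_to_text` and the `pdf_is_image` flag are hypothetical, and the PDF branch is left unimplemented because the source text is truncated at that point. The only third-party call used is `docx2txt.process`, which returns the text of a Word document as a string.

```python
from pathlib import Path

# Suffixes taken from the description; both Word suffixes are routed to
# docx2txt, as the source text states.
WORD_SUFFIXES = {".docx", ".doc"}
PDF_SUFFIXES = {".pdf"}


def detect_file_type(path: str) -> str:
    """Classify a file as 'word' or 'pdf' from its suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    if suffix in WORD_SUFFIXES:
        return "word"
    if suffix in PDF_SUFFIXES:
        return "pdf"
    raise ValueError(f"Unsupported file type: {suffix!r}")


def convert_to_text(path: str, pdf_is_image: bool = False) -> str:
    """Return the text content F of an uploaded record (hypothetical helper)."""
    kind = detect_file_type(path)
    if kind == "word":
        import docx2txt  # third-party; imported lazily so the sketch loads without it
        return docx2txt.process(path)
    # The user states whether the PDF is a text or an image version; the
    # handling of each case is truncated in the source, so it is not guessed here.
    version = "image" if pdf_is_image else "text"
    raise NotImplementedError(f"PDF handling ({version} version) is not specified")
```

Keeping the type check in its own function makes the suffix rule easy to test without touching any real files.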
Embodiment 2
[0087] Unlike Embodiment 1, and as shown in Figure 4, the data augmentation method based on language-model text generation (DAGA) used for named entity recognition in this embodiment proceeds as follows:
[0088] 1.1. Perform label linearization on the original manually annotated NER training data. That is, mix the characters of the text with the original sequence labels, placing the label that corresponds to each character of an entity in front of that character. For example, "diagnosing angina pectoris" becomes, after linearization, "diagnosing B-disease-and-diagnosis heart I-disease-and-diagnosis angina I-disease-and-diagnosis pain", where "disease and diagnosis" is the entity category. This yields a new linearized labeled data set, denoted D_man.
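Label linearization as described in step 1.1 can be sketched in a few lines. The function names are hypothetical, the tokens reuse the translated example from the text, and the label string `disease-and-diagnosis` is an illustrative spelling of the entity category. The inverse function `delinearize` is an assumption added for illustration: it shows how labels could be read back from a linearized sequence, which the visible text does not describe.

```python
def linearize(tokens, labels):
    """Insert each non-O label immediately before the token it annotates."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    out = []
    for token, label in zip(tokens, labels):
        if label != "O":
            out.append(label)
        out.append(token)
    return out


def delinearize(sequence, label_prefixes=("B-", "I-")):
    """Recover (tokens, labels) from a linearized sequence (assumed inverse step)."""
    tokens, labels = [], []
    pending = "O"  # label waiting to be attached to the next token
    for item in sequence:
        if item.startswith(label_prefixes):
            pending = item
        else:
            tokens.append(item)
            labels.append(pending)
            pending = "O"
    return tokens, labels


tokens = ["diagnosing", "heart", "angina", "pain"]
labels = ["O", "B-disease-and-diagnosis",
          "I-disease-and-diagnosis", "I-disease-and-diagnosis"]
linearized = linearize(tokens, labels)
```

Because unlabeled tokens carry no marker, the linearized sequence stays close to natural text, which is what lets a language model be trained on it directly.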
[0089] 1.2. From the proprietary medical corpus W_medical, screen out the corpus W_cardio related to cardiovascular disease; based on the existing medical entity dictionary (including entity categories such as disease and diagnosis, operation, drug, anatomical part...
Embodiment 3
[0094] Unlike Embodiments 1 and 2, the lexicon enhancement method (LexiconEnhanced) used for named entity recognition in this embodiment is shown in Figure 3 and Figure 5. The specific process is:
[0095] 2.1. Construct a sequence of character-vocabulary pairs. For a given input Chinese sentence s_c = {c_1, c_2, ..., c_n}, for each character c_i in the sentence, use the dictionary of the medical-domain word vectors Med-WordVec to match the candidate words that contain the character, and pair the character with its matched words, expressed as:
[0096] s_cw = {(c_1, ws_1), (c_2, ws_2), ..., (c_n, ws_n)}
[0097] Here, c_i denotes the i-th character in the sentence and ws_i denotes the set of words that contain that character. For example, in "diagnosing angina pectoris", for the character "heart", the character-vocabulary pair sequence is {("heart", "heart"), ("heart", "heart disea...
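The construction of the character-vocabulary pair sequence in step 2.1 can be sketched as a substring match against a word list. This is a sketch under stated assumptions: `build_char_word_pairs` is a hypothetical name, the small lexicon stands in for the Med-WordVec dictionary (which is not reproduced here), and the Chinese sentence 诊断心绞痛 is an assumed original for the translated example "diagnosing angina pectoris".

```python
from typing import Iterable, List, Tuple


def build_char_word_pairs(
    sentence: str, lexicon: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Pair each character c_i with ws_i, the lexicon words in the sentence that cover it."""
    words = set(lexicon)
    max_len = max((len(w) for w in words), default=0)
    n = len(sentence)
    matched: List[List[str]] = [[] for _ in range(n)]

    # Try every substring up to the longest lexicon word; a match is
    # recorded for every character position the substring spans.
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            candidate = sentence[start:end]
            if candidate in words:
                for k in range(start, end):
                    if candidate not in matched[k]:
                        matched[k].append(candidate)

    return list(zip(sentence, matched))


# Toy lexicon (assumption): "diagnose", "angina pectoris", "colic".
toy_lexicon = {"诊断", "心绞痛", "绞痛"}
pairs = build_char_word_pairs("诊断心绞痛", toy_lexicon)
```

With this toy lexicon the character 心 is paired with the single word 心绞痛, while 绞 and 痛 are each paired with both 心绞痛 and 绞痛. A production system with a large dictionary would more likely use a trie than the nested scan shown here.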


