NER-oriented Chinese clinical text data enhancement method and device

A data enhancement technology for Chinese clinical text. It addresses problems of existing generation methods, such as aggravated error accumulation, violation of medical logic, and ignoring the semantic characteristics of medical entities, with the stated effect of exploring the potential of the model and increasing the difficulty of the learning task.

Active Publication Date: 2022-08-05
ZHEJIANG LAB

AI Technical Summary

Problems solved by technology

However, enhancement methods based on language model generation mostly predict text over single characters or subword sequences, whereas most medical entities are composed of fixed semantic units. When such a general method is applied directly to the medical field, the unique semantic characteristics of medical entities are ignored, so the generated data may not conform to the characteristics of medical terms or may violate medical logic, thereby affecting the accuracy of the NER model.
[0009] Common generative models mostly use left-to-right decoding, which can use only the historical information already generated and not the future information not yet generated, so the generated samples are biased to some degree. At the same time, as the generated sequence becomes longer, single-direction generation tends to aggravate error accumulation: for example, if an unreasonable word is generated somewhere in the middle, it biases the subsequent predictions and lowers the quality of the whole generated sample.



Examples


Detailed Description of Embodiments

[0059] The specific embodiments of the present invention will be further described in detail below with reference to the accompanying drawings.

[0060] As shown in Figure 1, the main process of the NER-oriented Chinese clinical text data enhancement method provided by the present invention is described in detail as follows:

[0061] 1. Data preprocessing:

[0062] The data preprocessing process mainly includes word segmentation for unlabeled data and label linearization for labeled data.

[0063] Unlabeled data is mainly used for language model learning in the pre-training stage. It is segmented into words using an existing medical dictionary combined with rules.
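The dictionary-based part of the segmentation in [0063] can be illustrated with a minimal sketch. It assumes forward maximum matching, which the source does not name, and it leaves out the rule component entirely. The function name `segment` and the four-term `MEDICAL_DICT` are hypothetical and stand in for a real medical dictionary.

```python
def segment(text, dictionary, max_len=None):
    """Forward maximum matching: at each position take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical mini dictionary of medical terms (assumption, not from the source).
MEDICAL_DICT = {"高血压", "糖尿病", "头痛", "阿司匹林"}

tokens = segment("患者有高血压和糖尿病", MEDICAL_DICT)
print(tokens)
```

Characters that are not covered by the dictionary fall back to single-character tokens, so the concatenation of the tokens always reproduces the input text.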

[0064] Labeled data is mainly used for training and optimizing the generative model in the fine-tuning stage. The main processing flow is as follows:

[0065] Entity segmentation:

[0066] Based on the existing medical dictionary and combined with the knowle...
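The label linearization mentioned in [0062] can be sketched as a round trip between annotated spans and a single tagged string. The excerpt does not specify the linearized format, so the angle-bracket markers, the `(start, end, label)` span layout, and the names `linearize` and `parse` are assumptions made for illustration only. Entities are assumed not to overlap.

```python
import re


def linearize(text, entities):
    """Wrap each (start, end, label) span in label markers; end is exclusive."""
    out, cursor = [], 0
    for start, end, label in sorted(entities):
        out.append(text[cursor:start])
        out.append(f"<{label}>{text[start:end]}</{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Matches one marked entity; the closing marker must repeat the opening label.
TAG = re.compile(r"<(\w+)>(.*?)</\1>")


def parse(linearized):
    """Recover the plain text and the entity spans from a linearized string."""
    parts, entities, cursor, length = [], [], 0, 0
    for m in TAG.finditer(linearized):
        before = linearized[cursor:m.start()]
        parts.append(before)
        length += len(before)
        label, surface = m.group(1), m.group(2)
        entities.append((length, length + len(surface), label))
        parts.append(surface)
        length += len(surface)
        cursor = m.end()
    parts.append(linearized[cursor:])
    return "".join(parts), entities


text = "患者有高血压和糖尿病"
entities = [(3, 6, "DISEASE"), (7, 10, "DISEASE")]
linearized = linearize(text, entities)
print(linearized)
print(parse(linearized))
```

Parsing the linearized string returns the original text and spans, which is the property a generation pipeline would rely on to turn generated text back into annotated data.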



Abstract

The invention discloses an NER-oriented Chinese clinical text data enhancement method and device. Data preprocessing yields unlabeled data and labeled data that has undergone label linearization. Using the unlabeled data, part of the information in a text is masked and the masked part is predicted from the retained information; at the same time, an entity word-level judgment task is introduced, and fragment-based language model pre-training is carried out. In the fine-tuning stage, multiple decoding mechanisms are introduced: the relationship between text vectors and text data is obtained from the pre-trained fragment-based language model, and the linearized data with entity labels is converted into text vectors. In the prediction stage of the text generation model, text data is generated through forward decoding and reverse decoding, and the labels in the generated text are parsed to obtain enhanced data with annotation information. The method further improves data diversity and the quality of the enhanced data, so that the model can generate more high-quality enhanced data.
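The fragment-based masking in the abstract can be pictured with a small sketch that hides whole segments rather than isolated characters. The abstract gives no masking ratio or mask symbol, so the `[MASK]` token, the default `mask_ratio` of 0.15, and the function name `mask_fragments` are assumptions. The entity word-level judgment task is not covered here.

```python
import random

MASK = "[MASK]"  # assumed mask symbol, not taken from the source


def mask_fragments(tokens, mask_ratio=0.15, rng=None):
    """Mask whole segments so that a multi-character term is hidden as one unit."""
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_ratio))
    chosen = set(rng.sample(range(len(tokens)), n_to_mask))
    masked_chars, targets = [], []
    for idx, token in enumerate(tokens):
        for ch in token:
            if idx in chosen:
                masked_chars.append(MASK)
                targets.append(ch)    # character the model has to predict
            else:
                masked_chars.append(ch)
                targets.append(None)  # position is not scored
    return masked_chars, targets


# Segments as produced by a dictionary-based segmenter (hypothetical input).
tokens = ["患", "者", "有", "高血压", "和", "糖尿病"]
masked, targets = mask_fragments(tokens, mask_ratio=0.3, rng=random.Random(0))
print(masked)
print(targets)
```

Because the selection is made over segments, every character of a chosen term is masked together, which is the difference from character-level masking that the abstract points at.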

Description

Technical field

[0001] The invention relates to the field of text data enhancement, in particular to an NER-oriented Chinese clinical text data enhancement method and device.

Background technique

[0002] Named entity recognition (NER) is a basic task in the field of natural language processing. It is a sequence labeling problem: similar to a classification task, a category is assigned to each unit in a text sequence (Chinese NER is usually processed at the level of single characters or subwords). The categories usually include "non-entity", "entity beginning word", "entity middle word" and "entity ending word", and the entity-related categories vary with the entity types to be predicted.

[0003] With the advancement of medical informatization construction, the amount of medical text data shows an explosive growth trend. The extraction and utilization of information in unstructured medical ...
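The per-character categories described in [0002] can be made concrete with a short sketch. The tag names `O`, `B-`, `I-` and `E-` are one common way to write "non-entity", "entity beginning word", "entity middle word" and "entity ending word"; the source does not give tag names, and the function `tag_characters` and the `DISEASE` label are hypothetical. A single-character entity receives only a `B-` tag here, which is an assumption, since the source does not say how such entities are tagged.

```python
def tag_characters(text, entities):
    """Assign one tag per character from (start, end, label) spans; end is exclusive."""
    tags = ["O"] * len(text)  # "non-entity" by default
    for start, end, label in entities:
        tags[start] = f"B-{label}"        # entity beginning
        for i in range(start + 1, end - 1):
            tags[i] = f"I-{label}"        # entity middle
        if end - start > 1:
            tags[end - 1] = f"E-{label}"  # entity ending
    return tags


text = "患者有高血压"
tags = tag_characters(text, [(3, 6, "DISEASE")])
for ch, tag in zip(text, tags):
    print(ch, tag)
```

The entity-related tags carry the entity type as a suffix, which reflects the remark that these categories vary with the entity types to be predicted.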


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F40/117, G06F40/166, G06F40/274, G06F40/284, G06F40/30, G06N3/04, G06N3/08
CPC: G06F40/117, G06F40/166, G06F40/284, G06F40/274, G06F40/30, G06N3/08, G06N3/044
Inventor: 李劲松, 史黎鑫, 辛然, 杨宗峰, 田雨, 周天舒
Owner: ZHEJIANG LAB