
Method, system, device and medium for generating training data set based on labeled text

A technique for generating training data sets from labeled text, applied in the field of training set data generation. It addresses inconsistent labeling, labeling deviation, and the high cost of manual data labeling, with the aim of ensuring labeling consistency, improving accuracy, and reducing cost.

Active Publication Date: 2021-08-27
上海森亿医疗科技有限公司
Cites: 3, Cited by: 0

AI Technical Summary

Problems solved by technology

[0003] 1) For enterprises, the cost of large-scale manual data labeling is extremely high; 2) for labelers, medical data requires professional medical knowledge and basic linguistic knowledge; 3) manual labeling is heavy, tedious work, and the same texts recur many times during labeling; labelers cannot remember exactly how each repeated text was labeled before, so earlier and later labels become inconsistent; 4) in large-scale, multi-person collaborative labeling, different labelers understand the same sentence differently, which produces many labeling deviations, so labeling consistency cannot be guaranteed, and this seriously affects the training of downstream models.

Examples

Detailed Description of Embodiments

[0022] The implementation of the present application is described below through specific examples, and those skilled in the art can easily understand other advantages and effects of the present application from the content disclosed in this specification. The present application can also be implemented or applied through other different specific implementation modes, and various modifications or changes can be made to the details in this specification based on different viewpoints and applications without departing from the spirit of the present application. It should be noted that, in the case of no conflict, the following embodiments and the features in the embodiments can be combined with each other.

[0023] It should be noted that the diagrams provided in the following embodiments only schematically illustrate the basic idea of the application, although only the components related to the application are shown in the drawings rather than the number, shape and shape ...

Abstract

This application provides a method, system, device and medium for generating a training data set based on labeled text. Multiple texts to be labeled are obtained, and each original long text is split into multiple short texts (sentences), which are then deduplicated and cleaned. The processed short texts are stored in a database, where each is assigned a unique database id. A forward maximum matching algorithm applied at the sentence level is used to obtain the corresponding matching information in the database. Entity/association annotation is performed on each short text to generate a unique annotation id, and the mapping between the corresponding database id and annotation id is obtained for each short text. Based on the matching information and the mapping information, the short texts are spliced back into long texts containing entity/association annotations, which serve as training set data. This application can greatly reduce the cost of manual labeling for enterprises, ensure that repeated text is labeled consistently, reduce the interference that inconsistent corpora cause during model training, and improve the accuracy of model learning.
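The pipeline in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the in-memory `SentenceStore` stands in for the database, the delimiter set in `split_sentences`, the whitespace-only cleaning, and the annotation format (character offsets plus a label) are all assumptions, and every function name here is hypothetical.

```python
import re

# Assumed sentence delimiters; the source does not say how sentences are split.
_SENTENCE = re.compile(r"[^.;。；\n]+[.;。；]?")


def split_sentences(text):
    """Split a long text into cleaned short texts (whitespace normalized)."""
    parts = (" ".join(m.group().split()) for m in _SENTENCE.finditer(text))
    return [p for p in parts if p]


class SentenceStore:
    """In-memory stand-in for the database: one unique id per distinct short text."""

    def __init__(self):
        self.ids = {}  # short text -> database id

    def add(self, sentence):
        # Deduplication: a repeated short text gets the id it already has.
        if sentence not in self.ids:
            self.ids[sentence] = len(self.ids) + 1
        return self.ids[sentence]


def forward_max_match(text, store):
    """Scan left to right, always taking the longest stored short text
    that starts at the current position. Returns (start, end, db_id) tuples."""
    max_len = max((len(s) for s in store.ids), default=0)
    matches, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            db_id = store.ids.get(text[i:j])
            if db_id is not None:
                matches.append((i, j, db_id))
                i = j
                break
        else:
            i += 1  # no stored short text starts here; skip one character
    return matches


def splice(text, matches, db_to_ann, annotations):
    """Project short-text annotations back onto the long text by shifting
    each span's offsets by the start of the matched short text."""
    entities = []
    for start, _end, db_id in matches:
        ann_id = db_to_ann.get(db_id)
        if ann_id is None:
            continue  # this short text has not been annotated yet
        for s, e, label in annotations[ann_id]:
            entities.append((start + s, start + e, label))
    return {"text": text, "entities": entities}


# Usage: the repeated sentence is annotated once and reused at both positions.
long_text = "No fever. Cough for 3 days. No fever."
store = SentenceStore()
for sentence in split_sentences(long_text):
    store.add(sentence)

# Hypothetical annotations keyed by annotation id, offsets relative to the short text.
annotations = {"a1": [(3, 8, "SYMPTOM")], "a2": [(0, 5, "SYMPTOM")]}
db_to_ann = {1: "a1", 2: "a2"}  # database id -> annotation id mapping

sample = splice(long_text, forward_max_match(long_text, store), db_to_ann, annotations)
```

In this sketch the short text "No fever." is stored and annotated once, yet both of its occurrences in the long text receive the same span, which is the consistency property the abstract describes for repeated text.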

Description

Technical field

[0001] This application relates to the technical field of training set data generation, in particular to a method, system, device and medium for generating training data sets based on labeled text.

Background

[0002] The lack of training data is a perennial problem in natural language processing (NLP). Scarce labeled data, noisy samples, and data deviation are all common, and they are even more pronounced in vertical domains such as medicine. Current industry labeling practice mainly has the following problems:

[0003] 1) For enterprises, the cost of large-scale manual data labeling is extremely high; 2) for labelers, medical data requires professional medical knowledge and basic linguistic knowledge; 3) manual labeling is heavy, tedious work, and the same texts recur many times during labeling, and the labelers cannot remember the accur...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F40/117, G06F40/232, G06F40/279, G06F40/295
CPC: G06F40/117, G06F40/232, G06F40/279, G06F40/295
Inventor: 张少典, 顾根, 刘霄晨
Owner: 上海森亿医疗科技有限公司