Active learning based data automatic marking method

An automatic labeling technique based on active learning that addresses the complexity, heavy manual workload, and high cost of data labeling, and reduces the time and overhead of the labeling process.

Active Publication Date: 2017-08-18
芽米科技(广州)有限公司
6 Cites, 49 Cited by

AI Technical Summary

Problems solved by technology

The manual labeling process has the following problems: 1. The entire process is very complicated, and when the amount of data is particularly large it requires a great deal of manual labor; 2. Because of the labelers' limited energy, subjectivity, and other factors, the labeled data cannot be guaranteed to be 100% accurate; that is, the quality of the labels cannot be judged.
Manually labeled samples are limited in number, and there is no guarantee that all of them are correct.
Moreover, in practical problems some sample data, such as the genetic composition data used in genetic analysis, are expensive to label, so in general the number of unlabeled samples far exceeds the number of labeled samples.

Examples

Embodiment 1

[0048] Referring to Figure 1, this embodiment provides a text data labeling system based on active learning, which specifically includes:

[0049] Step 101: Process the labeled text data set flag and the unlabeled text data set imflag. Cluster the labeled text data set flag and record the value of the center point of each cluster;

[0050] Clustering the labeled text data set flag means grouping its samples according to the label value of each text data sample, so that samples that are similar and of the same category fall into one cluster. The number of clusters in the experiment is k, the set of clusters is expressed as {f_1, f_2, ..., f_k}, and the center point of each cluster is computed, giving {a_1, a_2, ..., a_k}.
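The label-based clustering in [0050] can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each text sample has already been converted to a numeric feature vector, and it assumes the cluster center a_i is the arithmetic mean of the vectors in cluster f_i. The function name cluster_centers_by_label and the example labels are hypothetical.

```python
from collections import defaultdict


def cluster_centers_by_label(samples, labels):
    """Group feature vectors by label and return {label: center}.

    Assumption: the center of a cluster is the component-wise mean
    of its vectors; the source does not state how it is computed.
    """
    groups = defaultdict(list)
    for vec, lab in zip(samples, labels):
        groups[lab].append(vec)

    centers = {}
    for lab, vecs in groups.items():
        dim = len(vecs[0])
        centers[lab] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return centers


# Hypothetical labeled set "flag": four 2-D vectors in k = 2 categories.
flag = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
flag_labels = ["sports", "sports", "finance", "finance"]

centers = cluster_centers_by_label(flag, flag_labels)
```

Because the grouping key is the label itself, the number of clusters k equals the number of distinct labels in flag.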

[0051] The unlabeled data set imflag is processed with a cluster-based linear scan search, a search method intended to reduce the number of calculations for the very ...
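Paragraph [0051] is truncated in this excerpt, so the exact search procedure is not available. One plausible reading, offered only as a sketch, is that each unlabeled sample is compared against the k cluster centers rather than against every labeled sample, which cuts the number of distance computations per sample to k. The Euclidean distance, the function name nearest_center, and the example data below are all assumptions.

```python
import math


def nearest_center(vec, centers):
    """Linearly scan the cluster centers and return (label, distance)
    for the closest one. Euclidean distance is an assumption."""
    best_label, best_dist = None, math.inf
    for label, center in centers.items():
        dist = math.dist(vec, center)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical centers (as computed from the labeled set) and
# a hypothetical unlabeled set "imflag".
centers = {"sports": [2.0, 0.0], "finance": [0.0, 3.0]}
imflag = [[1.5, 0.5], [0.0, 2.0]]

assignments = [nearest_center(v, centers)[0] for v in imflag]
```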

Abstract

The invention discloses an automatic data labeling method based on active learning and belongs to the field of active learning. The method comprises the following steps: 101, processing the labeled data and the unlabeled data; 102, classifying the unlabeled data with several different classifiers; 103, selecting the data with low difference entropy; 104, manually labeling the data with low difference entropy; and 105, self-checking the manual labeling results. To address the problem of how to keep manually labeled data as accurate as possible while reducing the amount of data that must be labeled by hand, the invention combines an active learning method with a self-checking function in an automatic data labeling system, thereby reducing the workload and improving the accuracy of the manually labeled data.
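Steps 102 and 103 can be illustrated with a small sketch. The excerpt does not define "difference entropy", so the code below assumes it is the Shannon entropy of the label votes cast by the different classifiers for one sample (often called vote entropy). The classifier outputs are hard-coded hypothetical values, and the names vote_entropy, predictions, and ranked are illustrative only.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (base 2) of the label votes for one sample.

    0 means all classifiers agree; larger values mean more disagreement.
    Assumption: this stands in for the patent's "difference entropy".
    """
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Step 102 (hypothetical output): labels predicted by three different
# classifiers for three unlabeled documents.
predictions = {
    "doc_a": ["sports", "sports", "sports"],
    "doc_b": ["sports", "finance", "sports"],
    "doc_c": ["sports", "finance", "tech"],
}

# Step 103: rank documents by entropy, lowest first, as the abstract
# describes selecting data with low difference entropy.
entropies = {doc: vote_entropy(v) for doc, v in predictions.items()}
ranked = sorted(entropies, key=entropies.get)
```

How many of the ranked samples are passed on to manual labeling in step 104 is not stated in this excerpt, so no cutoff is applied here.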

Description

Technical field

[0001] The invention relates to the field of active learning, in particular to an automatic data labeling method based on active learning.

Background

[0002] With the advent of the big data era, a new profession has emerged on the Internet: the data labeler. The job of a data labeler is to use automated tools to crawl and collect data from the Internet, including text, pictures, and voice, and then to organize and label the captured data. The specific workflow is as follows: first, the labelers are trained, and the sample data to be labeled and the labeling rules are determined; then the sample data are labeled according to the pre-arranged rules; finally, the labeled results are merged. However, this labeling process has the following problems: 1. The entire process is very complicated, and when the amount of data is particularly large it requires a great deal of manual labor; 2. During the data labeling process, due to the limited energy of the labeling personnel or t...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06K9/62
CPC: G06F18/217, G06F18/24133, G06F18/23213, G06F18/214
Inventor: 王进, 张登峰, 卜亚楠, 李颖, 范磊, 李智星, 欧阳卫华, 孙开伟, 陈乔松, 邓欣, 胡峰, 雷大江
Owner: 芽米科技(广州)有限公司