Tag noise correction based crowd-sourced tagging data quality improvement method

A method for improving the quality of crowdsourced labeling data based on label noise correction, applied in the field of data labeling. It addresses two problems: the quality of crowdsourced labels is not guaranteed, and label integration algorithms that ignore the feature information of the data cannot further improve data quality. The effect is improved label quality.

Inactive Publication Date: 2016-03-23
张静

AI Technical Summary

Problems solved by technology

However, crowdsourced annotation has inherent defects: the annotators are all ordinary Internet users, and compared with traditional expert annotation, the quality of their annotations is not guaranteed.
[0004] One of the main reasons why the above label integration algorithm cannot further improve data quality is that it uses only the label information from multiple uncertain annotators and ignores the feature information of the data itself.


Examples

Embodiment Construction

[0016] To describe the present invention more specifically, a specific embodiment is described in detail below with reference to the accompanying drawings.

[0017] Step (1): Crowdsourced label integration

[0018] (1-1) Run a label integration algorithm on the initial crowdsourced dataset D. The most commonly used algorithm is majority voting. For each sample i in the dataset, the algorithm counts the labels that the sample received from multiple annotators; if category c_k has the largest number of labels, the integrated label of the sample is c_k. If more than one category ties for the largest count, one of the tied categories is selected at random as the integrated label of the sample.
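
The majority voting rule in (1-1) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `majority_vote`, the dictionary layout of the crowd labels, and the example data are all assumptions introduced here, and each sample is assumed to have at least one label.

```python
import random
from collections import Counter

def majority_vote(labels_per_sample, rng=None):
    """Integrate crowd labels per sample; ties are broken at random.

    labels_per_sample maps a sample id to the list of labels it received
    from different annotators (hypothetical layout, assumed non-empty).
    """
    rng = rng or random.Random()
    integrated = {}
    for sample_id, labels in labels_per_sample.items():
        counts = Counter(labels)
        top = max(counts.values())
        # Every category that reaches the largest count is a candidate.
        candidates = sorted(c for c, n in counts.items() if n == top)
        if len(candidates) == 1:
            integrated[sample_id] = candidates[0]
        else:
            integrated[sample_id] = rng.choice(candidates)
    return integrated

# Hypothetical crowd labels for three samples.
crowd_labels = {
    "s1": ["cat", "cat", "dog"],
    "s2": ["dog", "dog", "dog"],
    "s3": ["cat", "dog"],  # tie, resolved by random choice
}
integrated = majority_vote(crowd_labels, rng=random.Random(0))
print(integrated)
```

Passing a seeded `random.Random` makes the tie-breaking reproducible, which helps when the integrated labels feed later steps.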

[0019] (1-2) For any sample i in the integrated dataset D_I, compare its integrated label with the label that annotator j gave to sample i. The labeling quality of annotator j is then calculated as:

[0020]

[0021] in I yes D I Th...

Abstract

The invention relates to a tag noise correction based crowd-sourced tagging data quality improvement method. The method comprises the following steps: running a tag integration algorithm in an initial crowd-sourced tagging data set to form a data set after tag integration, and estimating tagger quality and integrated tag quality information of samples in the process; performing multi-round K-fold cross validation by utilizing the data set after tag integration, and constructing a high-quality data set; determining a tag noise set in combination with the tagger quality and the tag quality of the samples by utilizing a prediction probability of a class tag of each sample in the multi-round K-fold cross validation process; and training a classification model by utilizing the high-quality data set generated in the multi-round K-fold cross validation process, and performing prediction and replacement on the class tag of each sample in the tag noise data set by using the model. With the tag noise correction method, the quantity of potential noise tag samples in the data set after original tag integration is reduced, so that the data quality is improved.
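
The cross-validation and replacement steps in the abstract can be illustrated with a simplified Python sketch. Several parts are assumptions made here rather than details from the patent: the classifier is a toy nearest-centroid model, a sample is treated as noisy when held-out predictions disagree with its integrated label in more than `threshold` of the rounds, and annotator quality and label quality are not used at all, although the abstract says the method combines them. All function and parameter names are hypothetical.

```python
import random
from collections import defaultdict

def train_centroids(X, y):
    """Toy classifier: one mean vector (centroid) per class."""
    sums, counts = {}, defaultdict(int)
    for x, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(x)
        sums[label] = [s + v for s, v in zip(sums[label], x)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}

def predict(centroids, x):
    """Return the class whose centroid is nearest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

def correct_label_noise(X, y, k=5, rounds=3, threshold=0.5, seed=0):
    """Multi-round K-fold filtering followed by label replacement (sketch)."""
    rng = random.Random(seed)
    n = len(X)
    disagreements = [0] * n
    for _ in range(rounds):
        order = list(range(n))
        rng.shuffle(order)
        folds = [order[f::k] for f in range(k)]
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in order if i not in held]
            model = train_centroids([X[i] for i in train_idx],
                                    [y[i] for i in train_idx])
            # Each sample is held out exactly once per round.
            for i in held_out:
                if predict(model, X[i]) != y[i]:
                    disagreements[i] += 1
    # Assumed rule: noisy if the prediction disagrees in most rounds.
    noisy = [i for i in range(n) if disagreements[i] / rounds > threshold]
    clean = [i for i in range(n) if disagreements[i] / rounds <= threshold]
    # Train on the high-quality set, then relabel the noise set.
    model = train_centroids([X[i] for i in clean], [y[i] for i in clean])
    corrected = list(y)
    for i in noisy:
        corrected[i] = predict(model, X[i])
    return corrected, noisy

# Hypothetical data: two well-separated clusters; index 4 carries a wrong label.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [0.0, 0.0], [0.3, 0.1],
     [5.0, 5.1], [5.2, 5.0], [5.1, 4.9], [4.9, 5.0], [5.0, 5.3]]
y = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
corrected, noisy = correct_label_noise(X, y)
print(noisy, corrected)
```

In practice the toy classifier would be replaced by a stronger model, and the noise decision would also weigh the estimated annotator quality and integrated label quality, as the abstract describes.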

Description

Technical field

[0001] The invention belongs to the technical field of data labeling, and in particular relates to a method for improving the quality of crowdsourced labeling data based on label noise correction.

Background art

[0002] Obtaining high-quality labeled data is a fundamental task in information retrieval, machine learning, and data mining. Taking supervised learning as an example, the whole learning process consists of training a model on a dataset of moderate size with class labels, so as to obtain a model that can accurately predict unlabeled samples. Traditionally, the class labels in the training data are provided by experts in the application domain. Expert-provided class labels are highly accurate, which is conducive to building high-quality models. However, such expert annotation is costly. With the development of intelligent computing technology, more and more labeling requirem...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/00
CPC: G06F2218/12
Inventor: 张静
Owner: 张静