Weakly supervised text classification method and device based on active learning

An active learning and text classification technology, applied in the fields of special data processing applications, instruments, and electronic digital data processing. It addresses an inefficient corpus annotation process and annotation quality that is difficult to ensure, both of which constrain the efficiency and accuracy of text classification. Its effects include enriching the semantic representation of samples, expanding the sample size, and improving generalization ability and robustness.

Active Publication Date: 2019-07-02
安徽省泰岳祥升软件有限公司

AI Technical Summary

Problems solved by technology

[0005] This application provides a weakly supervised text classification method and device based on active learning, to solve the problem that the existing corpus annotation process is inefficient and its annotation quality is difficult to guarantee, which restricts the efficiency and accuracy of text classification.



Examples


example 1

[0028] Example 1: I hope you keep this book, even if you grow up year after year, even if it sits on the shelf for a long time, gathering dust. As soon as you open it again, you'll be glad you didn't lose it.

[0029] In the field of natural language processing, corpus annotation adds explanatory and symbolic annotation information to a text corpus, such as category annotation, part-of-speech annotation, entity relationship annotation, and word sense disambiguation. Generally, an annotated corpus carries annotation information such as category tags and part-of-speech tags, while an unlabeled corpus does not. The samples in the sample set described in this application are unlabeled samples.

[0030] It should be noted that the embodiment of the present application uses category marking as an example to illustrate the idea and implementation of the technical solution of the present application, and the category marking does not constitute a limitation...

example 1-1

[0061] Example 1-1: I hope you keep this book, even if you grow up year after year, even if it sits on the shelf for a long time, gathering dust.

example 1-2

[0062] Example 1-2: As soon as you open it again, you'll be glad you didn't lose it.

[0063] Example 1-1 and Example 1-2 are then used as two new samples in forming the initial training set. Compared with Example 1, Example 1-1 and Example 1-2 are shorter, which enriches the text granularity of the training set.
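The splitting of Example 1 into Example 1-1 and Example 1-2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `split_sample` and `expand_samples`, the punctuation set used as sentence boundaries, and the `keep_original` switch are all assumptions, since the excerpt does not say how sentences are delimited or whether the original long sample stays in the training set.

```python
import re

# Assumed sentence boundaries: Latin and CJK end-of-sentence punctuation.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sample(text):
    """Return the sentence-level pieces of one sample."""
    return [m.strip() for m in _SENTENCE.findall(text) if m.strip()]


def expand_samples(samples, keep_original=True):
    """Add sentence-level pieces of each sample as new samples.

    keep_original is a hypothetical switch: when True, the long sample
    is kept alongside its pieces; when False, only the pieces are kept.
    """
    expanded = []
    for text in samples:
        pieces = split_sample(text)
        if keep_original or len(pieces) <= 1:
            expanded.append(text)
        if len(pieces) > 1:
            expanded.extend(pieces)
    return expanded


example_1 = (
    "I hope you keep this book, even if you grow up year after year, "
    "even if it sits on the shelf for a long time, gathering dust. "
    "As soon as you open it again, you'll be glad you didn't lose it."
)
new_samples = split_sample(example_1)
for piece in new_samples:
    print(piece)
```

Applied to Example 1, `split_sample` yields the two shorter texts given as Example 1-1 and Example 1-2.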

[0064] Step 123, based on the TF-IDF algorithm, obtain the feature word vector of each sample in the initial training set.

[0065] Based on the TF-IDF algorithm, the category discrimination of each word in a sample is calculated relative to the sample to which the word belongs, and at least one feature word is selected from all the words contained in the sample. A pre-trained word vector model is then used to obtain the vector representation of each feature word, that is, the feature word vector. This is prior art well known to those skilled in the art and is not described in detail in this embodiment.
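A rough sketch of Step 123 is shown below, under stated assumptions. It uses the plain, unsmoothed form tf × log(N / df) as the TF-IDF score, picks the `top_k` highest-scoring words per sample as feature words, and looks them up in a toy dictionary that stands in for the pre-trained word vector model. The names `tfidf_scores`, `feature_word_vectors`, `top_k`, and `toy_embeddings`, as well as the token lists and vector values, are hypothetical; the source does not specify the exact weighting variant or how many feature words are kept.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """docs: list of token lists. Returns one {token: tf-idf} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many samples each token occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    scores = []
    for doc in docs:
        if not doc:
            scores.append({})
            continue
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return scores


def feature_word_vectors(doc_scores, embeddings, top_k=2):
    """Select the top_k words by TF-IDF and map them to their vectors."""
    # Ties are broken alphabetically so the selection is deterministic.
    ranked = sorted(doc_scores, key=lambda t: (-doc_scores[t], t))[:top_k]
    return {w: embeddings[w] for w in ranked if w in embeddings}


# Toy tokenized samples and a toy stand-in for a pre-trained word vector model.
docs = [
    ["keep", "this", "book"],
    ["open", "this", "shelf"],
    ["keep", "the", "shelf"],
]
toy_embeddings = {"book": [0.1, 0.9], "keep": [0.7, 0.2], "shelf": [0.4, 0.4]}

scores = tfidf_scores(docs)
features = feature_word_vectors(scores[0], toy_embeddings)
print(features)
```

In the first toy sample, "book" occurs in no other sample and therefore scores highest; "keep" and "this" tie and the alphabetical tie-break selects "keep".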

[0066] Step 124, using the fe...


Abstract

The invention discloses a weakly supervised text classification method and device based on active learning. The method first extracts, from an unlabeled sample set, first samples that serve as the cluster centers of sample clusters. An initial training set is formed from the first samples, and a reference model is trained on it to obtain an initial classification model; building the initial training set from the first samples reduces the number of training samples while ensuring the accuracy of the classification model at the initial stage. The classification model is then used repeatedly to obtain the initial classification and confidence of the remaining samples in the sample set, so that manual labeling is not needed. Second samples are extracted from the remaining samples according to the confidence, and data enhancement processing is performed on them to update the training set, which improves the generalization capability and robustness of the model. Finally, the classification model is trained with the updated target training set until it meets a preset condition, thereby realizing multi-round active training of the classification model.
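The confidence-based extraction of second samples might look like the sketch below. It rests on assumptions that this excerpt does not confirm: that the samples extracted are those the classification model predicts with high confidence, that the predicted class is used as their label, and that a fixed threshold decides the cut. The function name `select_second_samples`, the threshold value of 0.9, and the probability values are hypothetical.

```python
def select_second_samples(probabilities, threshold=0.9):
    """Pick remaining samples whose top predicted class is confident enough.

    probabilities: {sample_id: {label: probability}} as produced by some
    classification model (assumed input format).
    Returns a list of (sample_id, predicted_label, confidence) tuples.
    """
    selected = []
    for sample_id, dist in probabilities.items():
        label = max(dist, key=dist.get)  # initial classification
        confidence = dist[label]         # confidence of that classification
        if confidence >= threshold:      # assumed selection rule
            selected.append((sample_id, label, confidence))
    return selected


# Toy model outputs for three remaining (unlabeled) samples.
toy_probabilities = {
    "s1": {"pos": 0.95, "neg": 0.05},
    "s2": {"pos": 0.55, "neg": 0.45},
    "s3": {"pos": 0.08, "neg": 0.92},
}
second_samples = select_second_samples(toy_probabilities)
print(second_samples)
```

In a multi-round setting, the selected samples would be augmented and added to the training set before the model is retrained; samples below the threshold, such as `s2` here, stay in the remaining pool for later rounds.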

Description

Technical field

[0001] The present application relates to the technical field of text classification, in particular to a weakly supervised text classification method and device based on active learning.

Background

[0002] In the field of natural language processing, text classification is an important type of text data processing task. It refers to the process of automatically determining the category of a text according to its content under a given classification system.

[0003] In a text classification method based on machine learning, the training corpus must first be obtained and annotated, and the text classifier is then trained on the annotated corpus so that it can classify unknown text. The classification accuracy of a text classifier depends on the quality of the annotated corpus. Existing corpus annotation tasks are generally completed manually by annotators, who are usually required to have a ...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F17/27, G06K9/62
CPC: G06F40/279, G06F40/30, G06F18/217, G06F18/24, Y02D10/00
Inventors: 李健铨, 陈玮, 陈夏飞
Owner: 安徽省泰岳祥升软件有限公司