Short text classification method based on multiple weak supervision integration

A short text classification method, applicable to text database clustering/classification, unstructured text data retrieval, special data processing applications, and related areas. It improves performance and efficiency, alleviates data sparsity, and addresses imbalanced classification problems.

Active Publication Date: 2020-07-24
湖南董因信息技术有限公司

AI Technical Summary

Problems solved by technology

[0009] In view of this, the present invention provides a short text classification method based on multiple weak supervision integration, which addresses the label bottleneck, data sparsity, and class imbalance problems of short text classification together.
To suit the particularities of short texts, the method introduces three sources of weak supervision into short text annotation: keyword matching, regular expressions, and distant-supervision clustering. It also proposes a multiple weak supervision integration method, which combines the discrete labels output directly by the individual weak supervision sources into probabilistic labels in order to address the imbalanced classification problem.
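
The integration of discrete votes into a probabilistic label can be illustrated with a minimal sketch. It assumes that each weak supervision source votes independently given the true class, that each source has a known accuracy strictly between 0 and 1, and that its errors are spread uniformly over the other classes. The function name `integrate_votes`, the `ABSTAIN` value, and the accuracy figures are hypothetical and are not taken from the patent text, which does not detail its integration model in this excerpt.

```python
import math

ABSTAIN = -1  # hypothetical marker: the source gives no label for this text


def integrate_votes(votes, accuracies, n_classes, prior=None):
    """Return P(class | votes), assuming votes are conditionally
    independent given the true class (an assumption of this sketch)."""
    if prior is None:
        prior = [1.0 / n_classes] * n_classes
    log_post = [math.log(p) for p in prior]
    for vote, acc in zip(votes, accuracies):
        if vote == ABSTAIN:
            continue  # an abstaining source contributes no evidence
        for y in range(n_classes):
            # Correct with probability acc; otherwise the error is spread
            # uniformly over the remaining classes.
            p = acc if vote == y else (1.0 - acc) / (n_classes - 1)
            log_post[y] += math.log(p)
    # Normalise in log space for numerical stability.
    m = max(log_post)
    unnorm = [math.exp(v - m) for v in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Three sources: two vote for class 1, one abstains.
agree = integrate_votes([1, 1, ABSTAIN], [0.8, 0.7, 0.6], n_classes=2)
# Two sources disagree: the more accurate one carries more weight.
conflict = integrate_votes([0, 1, ABSTAIN], [0.8, 0.7, 0.6], n_classes=2)
# No source votes: the result falls back to the prior.
silent = integrate_votes([ABSTAIN, ABSTAIN, ABSTAIN], [0.8, 0.7, 0.6], n_classes=2)
```

Unlike a majority vote, the output keeps the degree of agreement between sources, so a text labeled by two conflicting sources receives a softer label than one on which all sources agree.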

Examples

Detailed Description of the Embodiments

[0037] The technical solutions in the embodiments of the present invention will be clearly and completely described below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

[0038] As shown in Figure 1, a short text classification method based on multiple weak supervision integration includes the following steps:

[0039] Step 1, obtain the original data set and knowledge base, and perform data preprocessing;

[0040] Step 2, apply multiple weak supervision methods to extract knowledge from the preprocessed data;

[0041] Step 3, encode the extracted knowledge as labeling functions and use them for data labeling;

[0042] Step 4,...
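
Step 3 can be pictured with a minimal sketch of labeling functions built from keyword matching and regular expressions. The spam/ham task, the keyword set, the patterns, and all function names below are hypothetical illustrations and do not come from the patent; each function returns either a class label or an abstain value when its rule does not fire.

```python
import re

ABSTAIN = -1      # hypothetical marker: the rule does not apply to this text
HAM, SPAM = 0, 1  # hypothetical classes, for illustration only

PROMO_KEYWORDS = {"free", "winner", "prize"}
URL_RE = re.compile(r"https?://\S+")
GREETING_RE = re.compile(r"^(hi|hello|hey)\b", re.IGNORECASE)


def lf_keyword_promo(text):
    """Keyword matching: promotional words suggest SPAM."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return SPAM if tokens & PROMO_KEYWORDS else ABSTAIN


def lf_regex_url(text):
    """Regular expression: an embedded URL suggests SPAM."""
    return SPAM if URL_RE.search(text) else ABSTAIN


def lf_regex_greeting(text):
    """Regular expression: a leading greeting suggests HAM."""
    return HAM if GREETING_RE.search(text) else ABSTAIN


LABELING_FUNCTIONS = [lf_keyword_promo, lf_regex_url, lf_regex_greeting]


def apply_labeling_functions(texts, lfs):
    """Build a label matrix: one row per text, one column per function."""
    return [[lf(text) for lf in lfs] for text in texts]


texts = [
    "You are a WINNER, claim your free prize at http://example.com",
    "Hi, are we still meeting at noon?",
    "Running late, see you soon",
]
label_matrix = apply_labeling_functions(texts, LABELING_FUNCTIONS)
# Fraction of texts that received at least one non-abstain vote.
coverage = sum(any(v != ABSTAIN for v in row) for row in label_matrix) / len(texts)
```

The third text receives no vote at all, which shows why several complementary sources are combined: each rule covers only part of the data.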

Abstract

The invention discloses a short text classification method based on multiple weak supervision integration. The method comprises the following steps: obtaining an original data set and a knowledge base, and carrying out data preprocessing; carrying out knowledge extraction on the preprocessed data; representing the extracted knowledge as labeling functions, and using the labeling functions for data labeling; carrying out label integration through a conditional independence model; training a classification model based on a fully connected neural network; evaluating and optimizing the classification model to obtain an optimal model; and performing short text classification using the optimal model. In this method, explicit and implicit knowledge are expressed by combining keyword matching, regular expressions, and distant-supervision clustering. The probabilistic labels generated by the label integration mechanism allow unlabeled data to be labeled automatically, which alleviates the data sparsity problem of short texts and addresses their imbalanced classification problem.
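
As a rough illustration of how a classifier can be trained on probabilistic rather than hard labels, the sketch below fits a single fully connected layer with a softmax output by minimising cross-entropy against soft targets. It is a simplification under stated assumptions: the patent's network architecture, features, and optimiser are not given in this excerpt, and the toy feature vectors, soft labels, learning rate, and function names here are all hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W, b):
    """One fully connected layer followed by softmax."""
    logits = [sum(w * xd for w, xd in zip(row, x)) + bk for row, bk in zip(W, b)]
    return softmax(logits)


def soft_cross_entropy(pred, target):
    """Cross-entropy against a probabilistic (soft) target distribution."""
    return -sum(t * math.log(max(p, 1e-12)) for p, t in zip(pred, target))


def mean_loss(X, Q, W, b):
    return sum(soft_cross_entropy(forward(x, W, b), q) for x, q in zip(X, Q)) / len(X)


def train(X, Q, n_classes, lr=0.1, epochs=500):
    """Plain stochastic gradient descent, starting from zero weights."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, q in zip(X, Q):
            p = forward(x, W, b)
            for k in range(n_classes):
                g = p[k] - q[k]  # gradient of the loss w.r.t. logit k
                for d in range(n_features):
                    W[k][d] -= lr * g * x[d]
                b[k] -= lr * g
    return W, b


# Hypothetical bag-of-words features and probabilistic labels.
X = [[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Q = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]]

loss_before = mean_loss(X, Q, [[0.0] * 3 for _ in range(2)], [0.0, 0.0])
W, b = train(X, Q, n_classes=2)
loss_after = mean_loss(X, Q, W, b)
```

Because the targets are distributions, an uncertain label such as `[0.6, 0.4]` pulls the model less strongly toward one class than a confident label such as `[0.9, 0.1]`.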

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and in particular relates to a short text classification method based on multiple weak supervision integration.

Background art

[0002] With the rise of the mobile Internet, the development of instant messaging has not only driven a surge in short text, but has also made the research and application of short text classification increasingly important.

[0003] Supervised machine learning mainly relies on manually labeled data and good feature representations. Good feature representations can be learned automatically with deep learning. However, because of the thousands of parameters that need to be learned, supervised deep learning still depends on a large amount of labeled data. In practice, the training data for supervised learning is still dominated by manual annotation. Manual labeling is very expensive and time-consuming. Furthermore, as real-world applications continue to ch...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F40/279, G06F40/289
CPC: G06F16/35
Inventor: 修保新
Owner: 湖南董因信息技术有限公司