Multi-label text classification method based on statistics and pre-trained language model

A language model and text classification technology, applied to text database clustering/classification, semantic analysis, and character and pattern recognition. It addresses the problems of manual feature design, the strong influence of feature quality on classification performance, and the high cost of acquiring labeled data sets, with the effect of improving accuracy.

Active Publication Date: 2021-01-12
UNIV OF ELECTRONICS SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0007] (1) Multi-label text classification methods based on traditional machine learning require manual feature design, which is time-consuming and labor-intensive, and the quality of the features strongly affects classification performance.
[0008] (2) Most existing deep-learning-based methods use CNNs, RNNs, etc. to extract semantic information. Although these achieve good results, there is still a gap compared with using pre-trained language models to extract semantic information.
[0009] (3) Both kinds of methods require a large-scale labeled data set. Deep-learning-based multi-label text classification in particular places high demands on the label accuracy and scale of the training data, while for many application domains the cost of acquiring a large-scale, high-accuracy labeled data set is very high.

Method used



Embodiment Construction

[0042] The technical solution of the present invention will be further described below in conjunction with the accompanying drawings.

[0043] As shown in figure 1, the multi-label text classification method based on statistics and a pre-trained language model of the present invention comprises the following steps:

[0044] S1. Preprocess the training corpus to be classified. The specific implementation: obtain the corpus data set OrgData to be labeled, remove stop words (for example the Chinese particles "le" and "a") and special symbols, and save the result as NewData.
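The S1 preprocessing step can be sketched as follows. This is a minimal illustration, assuming a simple English stop-word list and regex tokenization; the `STOP_WORDS` set and the sample corpus are hypothetical, not the patent's actual data.

```python
import re

# Hypothetical stop-word list; the patent uses Chinese particles such as "le" and "a".
STOP_WORDS = {"le", "a", "the", "of", "and"}

def preprocess(org_data):
    """Remove stop words and special symbols from each document in OrgData."""
    new_data = []
    for doc in org_data:
        # Keep only word characters, dropping punctuation and special symbols.
        tokens = re.findall(r"\w+", doc.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        new_data.append(" ".join(tokens))
    return new_data

OrgData = ["The price of the fabric, and its quality!"]
NewData = preprocess(OrgData)
print(NewData)  # → ['price fabric its quality']
```

The cleaned corpus `NewData` is then passed to the label acquisition model in S2.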

[0045] S2. Establish a label acquisition model based on statistical methods and a language model. The label acquisition model comprises, connected in sequence, a keyword layer, an input coding layer, a pre-trained language model layer, and a similarity analysis layer, as shown in figure 2.

[0046] Keyword layer: obtain the top k keywords by statistical methods ...
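The keyword layer's top-k selection can be illustrated with TF-IDF scoring. The patent does not specify which statistic it uses, so treat TF-IDF here as an assumed stand-in for "statistical methods", and the smoothing constants as illustrative choices.

```python
import math
from collections import Counter

def top_k_keywords(docs, k=3):
    """Score each term by smoothed TF-IDF and return the top-k terms per document."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    results = []
    for doc in tokenized:
        tf = Counter(doc)
        scores = {
            t: (tf[t] / len(doc)) * math.log((n + 1) / (df[t] + 1) + 1)
            for t in tf
        }
        ranked = sorted(scores.items(), key=lambda item: -item[1])
        results.append([t for t, _ in ranked[:k]])
    return results

docs = ["apple apple banana", "banana cherry cherry"]
print(top_k_keywords(docs, k=1))  # → [['apple'], ['cherry']]
```

Terms that occur often in one document but rarely across the corpus score highest, which is why the corpus-wide "banana" loses to the document-specific terms above.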



Abstract

The invention discloses a multi-label text classification method based on statistics and a pre-trained language model. The method comprises the following steps: S1, preprocessing the training corpus to be classified; S2, establishing a label acquisition model based on a statistical method and a language model; S3, processing the obtained label data; S4, establishing a multi-label classification model based on a pre-trained language model and training it with the obtained label data; S5, performing multi-label classification on the text data to be classified using the trained multi-label text classification model. The method combines a statistical method with a pre-trained-language-model label acquisition method, uses the ALBERT language model to obtain the semantic coding information of the text, requires no manual labeling of the data set, and can improve label acquisition accuracy.
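The similarity analysis step of the label acquisition pipeline can be sketched as follows: compare a text's semantic vector against each candidate label's vector and keep every label above a threshold. The two-dimensional vectors and the 0.5 threshold below are toy stand-ins; in the patent the encodings come from the ALBERT language model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_labels(text_vec, label_vecs, threshold=0.5):
    """Return every label whose vector is similar enough to the text vector.

    Multi-label: a text may match zero, one, or several labels.
    """
    return [name for name, vec in label_vecs.items()
            if cosine(text_vec, vec) >= threshold]

labels = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
print(assign_labels([0.9, 0.8], labels))  # → ['sports', 'finance']
```

Because every label is tested independently against the threshold, this naturally produces multi-label output rather than forcing a single class.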

Description

Technical field
[0001] The invention relates to a multi-label text classification method based on statistics and a pre-trained language model.
Background technique
[0002] Since 2013, deep learning theory based on neural networks has made great progress and has been widely applied to image processing and natural language processing, giving rise to many research and application directions. Text classification is one of the most important tasks in natural language processing, with many real-life applications such as public-opinion monitoring, tag recommendation, and information search. Traditional single-label text classification algorithms struggle with the diversity of real-world text, so multi-label text classification has become a popular research direction within natural language processing. [0003] Current multi-label text classification methods fall into two main categories: [0004] The fi...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/35; G06F40/216; G06F40/30; G06F40/126; G06K9/62; G06N3/04
CPC: G06F16/355; G06F40/216; G06F40/30; G06F40/126; G06N3/045; G06F18/2415
Inventor: 廖伟智, 周佳瑞, 阴艳超, 曹阳
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA