Multi-label text classification method based on statistics and pre-trained language model

A language model and text classification technology, applied to text database clustering/classification, semantic analysis, and character and pattern recognition. It addresses the problems of manual feature design, the strong influence of feature quality on classification performance, and the high cost of acquiring labeled data sets, with the effect of improving accuracy.

Active Publication Date: 2021-01-12
UNIV OF ELECTRONICS SCI & TECH OF CHINA

AI Technical Summary

Problems solved by technology

[0007] (1) Multi-label text classification methods based on traditional machine learning require manual feature design, which is time-consuming and labor-intensive, and the quality of the features strongly affects classification performance.
[0008] (2) Most existing deep-learning-based methods use CNNs, RNNs, etc. to extract semantic information. Although these achieve good results, there is still a gap compared with using pre-trained language models to extract semantic information.
[0009] (3) Both kinds of methods require a large-scale labeled data set. Deep-learning-based multi-label text classification in particular places high demands on the label accuracy and scale of the training data, while for many application domains the cost of acquiring a large-scale, high-accuracy labeled data set is very high.

Method used



Embodiment Construction

[0042] The technical solution of the present invention will be further described below in conjunction with the accompanying drawings.

[0043] As shown in figure 1, the multi-label text classification method based on statistics and a pre-trained language model of the present invention comprises the following steps:

[0044] S1. Preprocess the training corpus to be classified. The specific implementation: obtain the corpus data set OrgData to be labeled, remove stop words (for example the Chinese particles "le" and "a") and special symbols, and save the result as NewData.
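The S1 preprocessing step can be sketched as follows. This is a minimal illustration, assuming a simple English stop-word list and regex tokenization; the `STOP_WORDS` set and the sample corpus are hypothetical, not the patent's actual data.

```python
import re

# Hypothetical stop-word list; the patent uses Chinese particles such as "le" and "a".
STOP_WORDS = {"le", "a", "the", "of", "and"}

def preprocess(org_data):
    """Remove stop words and special symbols from each document in OrgData."""
    new_data = []
    for doc in org_data:
        # Keep only word characters, dropping punctuation and special symbols.
        tokens = re.findall(r"\w+", doc.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        new_data.append(" ".join(tokens))
    return new_data

OrgData = ["The price of the fabric, and its quality!"]
NewData = preprocess(OrgData)
print(NewData)  # → ['price fabric its quality']
```

The cleaned corpus `NewData` is then passed to the label acquisition model in S2.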

[0045] S2. Establish a label acquisition model based on statistical methods and a language model. The label acquisition model comprises, connected in sequence, a keyword layer, an input coding layer, a pre-trained language model layer, and a similarity analysis layer, as shown in figure 2.

[0046] Keyword layer: obtain the top k keywords by statistical methods ...
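The keyword layer's top-k selection can be illustrated with TF-IDF scoring. The patent does not specify which statistic it uses, so treat TF-IDF here as an assumed stand-in for "statistical methods", and the smoothing constants as illustrative choices.

```python
import math
from collections import Counter

def top_k_keywords(docs, k=3):
    """Score each term by smoothed TF-IDF and return the top-k terms per document."""
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    results = []
    for doc in tokenized:
        tf = Counter(doc)
        scores = {
            t: (tf[t] / len(doc)) * math.log((n + 1) / (df[t] + 1) + 1)
            for t in tf
        }
        ranked = sorted(scores.items(), key=lambda item: -item[1])
        results.append([t for t, _ in ranked[:k]])
    return results

docs = ["apple apple banana", "banana cherry cherry"]
print(top_k_keywords(docs, k=1))  # → [['apple'], ['cherry']]
```

Terms that occur often in one document but rarely across the corpus score highest, which is why the corpus-wide "banana" loses to the document-specific terms above.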



Abstract

The invention discloses a multi-label text classification method based on statistics and a pre-trained language model. The method comprises the following steps: S1, preprocessing the training corpus to be classified; S2, establishing a label acquisition model based on a statistical method and a language model; S3, processing the obtained label data; S4, establishing a multi-label classification model based on a pre-trained language model and training it with the obtained label data; S5, performing multi-label classification on the text data to be classified using the trained multi-label text classification model. The method combines a statistical method with a pre-trained-language-model label acquisition method, uses the ALBERT language model to obtain the semantic coding information of the text, requires no manual labeling of the data set, and can improve label acquisition accuracy.
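The similarity analysis step of the label acquisition pipeline can be sketched as follows: compare a text's semantic vector against each candidate label's vector and keep every label above a threshold. The two-dimensional vectors and the 0.5 threshold below are toy stand-ins; in the patent the encodings come from the ALBERT language model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_labels(text_vec, label_vecs, threshold=0.5):
    """Return every label whose vector is similar enough to the text vector.

    Multi-label: a text may match zero, one, or several labels.
    """
    return [name for name, vec in label_vecs.items()
            if cosine(text_vec, vec) >= threshold]

labels = {"sports": [1.0, 0.0], "finance": [0.0, 1.0]}
print(assign_labels([0.9, 0.8], labels))  # → ['sports', 'finance']
```

Because every label is tested independently against the threshold, this naturally produces multi-label output rather than forcing a single class.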

Description

Technical field
[0001] The invention relates to a multi-label text classification method based on statistics and a pre-trained language model.
Background technique
[0002] Since 2013, deep learning theory based on neural networks has made great progress and has been widely applied to image processing and natural language processing, giving rise to many research and application directions. Text classification is one of the most important tasks in natural language processing, with many real-life applications such as public-opinion monitoring, tag recommendation, and information search. Traditional single-label text classification algorithms struggle with the diversity of real-world text, so multi-label text classification has become a popular research direction within natural language processing. [0003] Current multi-label text classification methods fall into two main categories: [0004] The fi...

Claims


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F16/35; G06F40/216; G06F40/30; G06F40/126; G06K9/62; G06N3/04
CPC: G06F16/355; G06F40/216; G06F40/30; G06F40/126; G06N3/045; G06F18/2415
Inventor: 廖伟智, 周佳瑞, 阴艳超, 曹阳
Owner: UNIV OF ELECTRONICS SCI & TECH OF CHINA