Semi-supervised text classification model training method, text classification method, system, device and medium

A text classification and model training technology, applied in the field of deep learning, addresses three problems: existing semi-supervised methods cannot be applied directly to text classification, the accuracy of the trained model is degraded, and the confidence of the model is not taken into account. It thereby alleviates the shortage of labeled samples, avoids that impact on training, and improves model performance.

Active Publication Date: 2020-09-29
上海携旅信息技术有限公司

AI Technical Summary

Problems solved by technology

[0004] However, both of the above semi-supervised methods are designed for image data processing and cannot be applied directly to text classification to improve classification accuracy when labeled samples are scarce.
In addition, the ab


Examples


Embodiment 1

[0068] This embodiment provides a semi-supervised text classification model training method. As shown in Figure 1, the method includes the following steps:

[0069] S101. Obtain an initial sample set, where the initial sample set includes a labeled sample set {x_i} (i = 1, ..., n) and an unlabeled sample set {u_i} (i = 1, ..., m). Here x_i denotes the i-th labeled sample, u_i denotes the i-th unlabeled sample, n is the number of labeled samples, and m is the number of unlabeled samples. In this embodiment, a labeled sample is a sample annotated with a classification label, and an unlabeled sample is a sample without one.

[0070] S102. Perform data cleaning on each labeled sample x_i and each unlabeled sample u_i. For example, if the text classification model is to be trained for a particular language (such as Chinese), delete the words in the sample that are not in that language. In addition, cleaning processing such as stop word filtering ca...
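The cleaning in step S102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the sample has already been split into words, treats a word as Chinese only if every character falls in the basic CJK Unified Ideographs range, and uses a tiny hypothetical stop-word list. The names `clean_sample`, `CHINESE_WORD` and `STOP_WORDS` are introduced here for illustration only.

```python
import re

# Hypothetical stop-word list (assumption); a real system would load a curated one.
STOP_WORDS = {"的", "了", "是"}

# Assumption: a word counts as Chinese if it consists only of characters
# in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).
CHINESE_WORD = re.compile(r"[\u4e00-\u9fff]+")


def clean_sample(words):
    """Keep only words in the target language that are not stop words."""
    return [
        word
        for word in words
        if CHINESE_WORD.fullmatch(word) and word not in STOP_WORDS
    ]


if __name__ == "__main__":
    sample = ["酒店", "的", "hotel", "很", "好", "123"]
    print(clean_sample(sample))
```

The same function would be applied to both the labeled samples x_i and the unlabeled samples u_i, so that both sets are cleaned in the same way before any further processing.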

Embodiment 2

[0110] This embodiment provides a text classification method. As shown in Figure 2, the method includes the following steps:

[0111] S201, acquiring the target text to be classified;

[0112] S202, inputting the target text into the target text classification model trained according to the aforementioned text classification model training method, obtaining the predicted probability that the target text belongs to each classification label, and taking the classification label with the highest predicted probability as the classification result of the target text.

[0113] Since the accuracy of the target text classification model trained according to the foregoing text classification model training method is high, the classification result obtained in this embodiment is more accurate.
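The label selection in step S202 amounts to an argmax over the predicted probabilities. The sketch below assumes the trained model has already produced one probability per classification label; the function name `classify` and the example labels are hypothetical and stand in for whatever the trained model and label set actually are.

```python
def classify(probabilities, labels):
    """Return the label whose predicted probability is the largest."""
    if not labels or len(probabilities) != len(labels):
        raise ValueError("need exactly one probability per label")
    best_index = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best_index]


if __name__ == "__main__":
    # Hypothetical model output for one target text (assumption).
    predicted = [0.1, 0.7, 0.2]
    print(classify(predicted, ["refund", "booking", "complaint"]))
```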

Embodiment 3

[0115] This embodiment provides a semi-supervised text classification model training system. As shown in Figure 3, the system 10 includes: an initial sample set acquisition module 101, a cleaning module 102, an enhancement module 103, a model processing module 104, a new sample construction module 105, a label estimation module 106, a verification module 107, a trusted sample acquisition module 108, a confidence sample set construction module 109, an expansion module 110 and a model training module 111. Each module is described in detail below:

[0116] The initial sample set acquisition module 101 is used to obtain the initial sample set, which includes a labeled sample set {x_i} (i = 1, ..., n) and an unlabeled sample set {u_i} (i = 1, ..., m). Here x_i denotes the i-th labeled sample, u_i denotes the i-th unlabeled sample, n is the number of labeled samples, and m is the number of unlabeled samples. In this embodiment, a labeled sample refers to a sample l...

Abstract

The invention provides a semi-supervised text classification model training method and system, a text classification method and system, a device and a medium. The training method comprises the steps of: acquiring an initial sample set; enhancing each unlabeled sample to obtain a data-enhanced sample; inputting the unlabeled sample and the data-enhanced sample into a text classification model to obtain, for each, an embedded vector and a predicted probability of belonging to each classification label; for each unlabeled sample, taking the mean of the embedded vectors of the unlabeled sample and its corresponding data-enhanced sample as a new sample; for each unlabeled sample, sharpening the mean of the predicted probabilities of the unlabeled sample and its corresponding data-enhanced sample for each classification label to obtain the label estimation result of the new sample; verifying whether the new sample is trusted and, if so, marking it as a trusted new sample; constructing a confidence sample set from the labeled samples and the trusted new samples, and expanding it to obtain a target sample set; and training the text classification model on the target sample set. The method improves text classification accuracy when labeled samples are scarce.
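The new-sample construction and label estimation steps of the abstract can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the abstract does not specify the sharpening function, so the sketch uses temperature sharpening (raising each probability to the power 1/T and renormalizing), and the names `mean_vectors`, `sharpen`, `build_new_sample` and the default `temperature=0.5` are hypothetical. The credibility check and the expansion step are omitted because their criteria are not given in this excerpt.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def sharpen(probabilities, temperature=0.5):
    """Temperature sharpening (assumed form): p_k^(1/T) / sum_j p_j^(1/T)."""
    powered = [p ** (1.0 / temperature) for p in probabilities]
    total = sum(powered)
    return [p / total for p in powered]


def build_new_sample(embeddings, predictions, temperature=0.5):
    """Combine an unlabeled sample with its data-enhanced counterpart(s).

    embeddings:  embedded vectors of the unlabeled sample and its enhanced samples
    predictions: their predicted probabilities over the classification labels
    Returns (new sample embedding, label estimation result).
    """
    new_embedding = mean_vectors(embeddings)
    label_estimate = sharpen(mean_vectors(predictions), temperature)
    return new_embedding, label_estimate


if __name__ == "__main__":
    # Hypothetical model outputs for one unlabeled sample and one enhanced sample.
    embedding, label = build_new_sample(
        embeddings=[[1.0, 3.0], [3.0, 5.0]],
        predictions=[[0.6, 0.4], [0.8, 0.2]],
    )
    print(embedding, label)
```

With a temperature below 1, sharpening pushes the averaged distribution toward its most likely label, so the estimated label is more decisive than the raw mean while still summing to 1.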

Description

Technical field

[0001] The present invention relates to the field of deep learning, and in particular to a semi-supervised text classification model training method, a text classification method, and a corresponding system, device and medium.

Background

[0002] Machine learning methods attempt to use a task's historical data to improve performance on that task. To learn well, machine learning methods such as supervised learning usually require the historical data to be explicitly labeled (labeled data) and require a large amount of it. In many real-world tasks, however, obtaining labeled data takes considerable human and material resources, so labeled data is usually scarce, while large amounts of unlabeled historical data (unlabeled data) are easy to obtain. How to use a large amount of unlabeled data to help improve the performance obtained by using only a small amount of labeled data has become an important t...

Application Information

IPC(8): G06F16/35, G06F40/216, G06K9/62
CPC: G06F16/355, G06F40/216, G06F18/2415, G06F18/214, Y02D10/00
Inventor: 刘江宁, 鞠剑勋, 李健
Owner: 上海携旅信息技术有限公司