Imbalanced text classification method introducing keyword features

A text classification and keyword technology, applied in text database clustering/classification, unstructured text data retrieval, semantic analysis, etc. It addresses the inability of existing methods to handle limited sample diversity and underfitting in sparse categories, and thereby mitigates category imbalance.

Pending Publication Date: 2022-08-05
THE 28TH RES INST OF CHINA ELECTRONICS TECH GROUP CORP

AI Technical Summary

Problems solved by technology

However, neither of these two methods (resampling and class weighting) can solve the underfitting problem of sparse categories. Suppose a sparse category has only one sample, but that sample contains many words, and this combination of words appears only in this category. Neither resampling nor class weighting can make up for the lack of sample diversity in such a category.



Examples


Embodiment

[0107] The invention provides an imbalanced text classification method that introduces keyword features. First, a hierarchical classification system with 32 leaf categories is defined for the military news field. Next, the keywords of each category are extracted using normalized pointwise mutual information and improved information gain. Finally, keyword features are fused with neural network semantic features for training. Through these steps, the invention addresses text classification under class imbalance. As shown in Figure 1, the method includes the following steps:

[0108] Step 1 includes:

[0109] Step 1-1: Define a hierarchical classification system that describes the hierarchical relationships between categories, for example "collaboration-verbal-expression of willingness-substantial cooperation", where labels at different levels are separated by "-". Provide text-level classification functions for news in the fields of politics, military,...
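
The label convention in Step 1-1 can be illustrated with a minimal Python sketch. The helper names `parse_label`, `ancestors`, and `build_tree` are hypothetical and do not come from the patent; the sketch also assumes that level names never contain the "-" separator themselves.

```python
def parse_label(label, sep="-"):
    """Split a hierarchical label into its levels, root first."""
    return [part.strip() for part in label.split(sep) if part.strip()]


def ancestors(label, sep="-"):
    """Return every prefix path of the label, from the root down to the leaf."""
    levels = parse_label(label, sep)
    return [sep.join(levels[:i]) for i in range(1, len(levels) + 1)]


def build_tree(labels, sep="-"):
    """Build a nested dict whose keys are level names (hypothetical layout)."""
    tree = {}
    for label in labels:
        node = tree
        for level in parse_label(label, sep):
            node = node.setdefault(level, {})
    return tree


# Example label quoted in Step 1-1.
label = "collaboration-verbal-expression of willingness-substantial cooperation"
levels = parse_label(label)
paths = ancestors(label)
tree = build_tree([label])
```

Here `levels` holds the four level names in order and `paths` holds the four prefix paths, so a classifier could, under this assumption, be trained or evaluated at any depth of the hierarchy.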



Abstract

The invention provides an imbalanced text classification method that introduces keyword features. The method comprises the following steps: first, defining a hierarchical classification system for the field of military news, including 32 leaf categories; second, extracting the keywords of each category using normalized pointwise mutual information and improved information gain; and third, training by fusing the keyword features with the neural network semantic features. For the imbalanced text classification problem, the class label distribution is introduced into the text classification model as prior information, and both the text content and the class keyword information are used during training. Normalized pointwise mutual information and improved information gain serve as the statistics for selecting category keywords. The keyword information and the text information are used jointly to train the text classification model, which mitigates the problem of class imbalance in text classification.
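
The keyword selection and feature fusion described in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented procedure: it uses the textbook document-level definition NPMI(w, c) = log(p(w, c) / (p(w) p(c))) / (-log p(w, c)), it omits the "improved information gain" statistic because the excerpt does not define it, and it assumes that fusion means concatenating a per-category keyword-match vector with a semantic vector. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter, defaultdict


def npmi_scores(docs):
    """docs: list of (tokens, category). Returns {category: {word: NPMI}}."""
    n = len(docs)
    cat_count, word_count = Counter(), Counter()
    joint_count = defaultdict(Counter)
    for tokens, cat in docs:
        cat_count[cat] += 1
        for w in set(tokens):  # document-level counts: each word once per doc
            word_count[w] += 1
            joint_count[cat][w] += 1
    scores = defaultdict(dict)
    for cat, words in joint_count.items():
        for w, c_wc in words.items():
            p_wc = c_wc / n
            if p_wc == 1.0:
                # Word and category cover every document: NPMI is 0/0, use 0.
                scores[cat][w] = 0.0
                continue
            pmi = math.log(p_wc / ((word_count[w] / n) * (cat_count[cat] / n)))
            scores[cat][w] = pmi / -math.log(p_wc)
    return scores


def top_keywords(docs, k=2):
    """Pick the k highest-NPMI words per category (ties broken alphabetically)."""
    scores = npmi_scores(docs)
    return {c: sorted(ws, key=lambda w: (-ws[w], w))[:k] for c, ws in scores.items()}


def keyword_features(tokens, category_keywords, categories):
    """One value per category: share of that category's keywords found in the text."""
    present = set(tokens)
    feats = []
    for cat in categories:
        kws = category_keywords.get(cat, [])
        hits = sum(1 for kw in kws if kw in present)
        feats.append(hits / len(kws) if kws else 0.0)
    return feats


def fuse(semantic_vec, keyword_vec):
    """Assumed fusion: plain concatenation of the two feature vectors."""
    return list(semantic_vec) + list(keyword_vec)


# Toy corpus (made up for illustration).
docs = [
    (["missile", "test", "launch"], "weapons"),
    (["missile", "launch"], "weapons"),
    (["joint", "exercise", "test"], "exercise"),
    (["treaty", "talks"], "diplomacy"),
]
keywords = top_keywords(docs, k=2)
categories = sorted(keywords)
kw_vec = keyword_features(["missile", "launch", "talks"], keywords, categories)
fused = fuse([0.12, -0.40, 0.33], kw_vec)  # semantic vector is a placeholder
```

In a real system the placeholder semantic vector would come from a neural encoder such as the BERT model mentioned in the description, and the fused vector would feed the classification layer. Concatenation is only one possible fusion choice.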

Description

Technical field

[0001] The invention relates to a text classification method, in particular to an imbalanced text classification method that introduces keyword features.

Background

[0002] Category imbalance means that the corpus has a long-tail distribution, so the amount of data varies greatly between categories. Existing text classification models are essentially optimization procedures that minimize a certain loss function. In imbalanced text classification, because the sample size of each category is different, the loss from large categories accounts for a high proportion of the total, and the model eventually becomes biased towards the large categories.

[0003] For the common text classification problem, the existing solution is to obtain a vector representation of the text from the pre-trained language model BERT, and then fine-tune with the cross-entropy loss function. For the problem of class imbalance, it is mainly imp...
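
The bias described in the background can be made concrete with a small Python sketch. It computes how much of the total cross-entropy loss each class contributes, and then derives inverse-frequency class weights, one common form of the class weighting that the summary mentions. The function names, the toy label distribution, and the specific weighting formula n / (k * count) are assumptions for illustration, not details taken from the patent.

```python
import math
from collections import Counter


def loss_share_by_class(labels, probs_true):
    """Share of total cross-entropy contributed by each class.

    probs_true[i] is the model's predicted probability for the true
    class of sample i.
    """
    totals = Counter()
    for y, p in zip(labels, probs_true):
        totals[y] += -math.log(p)
    grand = sum(totals.values())
    return {c: t / grand for c, t in totals.items()}


def inverse_frequency_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


# Toy long-tail data: 99 samples of class "A", 1 sample of class "B",
# with the model equally unsure (p = 0.5) about every sample.
labels = ["A"] * 99 + ["B"]
probs = [0.5] * 100
shares = loss_share_by_class(labels, probs)
weights = inverse_frequency_weights(labels)
```

With equal per-sample loss, class "A" accounts for about 99% of the total, so gradient updates are dominated by it. The weights rebalance that contribution, but, as the summary notes, they cannot add diversity to a category that has a single sample.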


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F40/258, G06F40/279, G06F40/30, G06K9/62
CPC: G06F16/35, G06F40/258, G06F40/279, G06F40/30, G06F18/214
Inventor: 徐建, 张桂林, 阮国庆, 李晓冬, 王羽
Owner: THE 28TH RES INST OF CHINA ELECTRONICS TECH GROUP CORP