Two-stage text feature selection method under unbalanced data set

A feature selection method and technology, applied in the fields of digital data processing, natural language data processing and special data processing applications, which addresses the problems that existing methods ignore the correlation between features and categories and, on unbalanced data sets, are biased toward the majority class, thereby improving classification accuracy through a reasonable design.

Active Publication Date: 2020-05-12
SHANDONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] Most existing feature selection methods extract global features relative to the entire document set and therefore ignore the correlation between features and individual categories. In addition, existing feature selection methods work well on balanced data sets, but on unbalanced data sets they tend to favor the majority class and overlook the features of minority classes, while most data sets distributed on the network are unprocessed, unbalanced data sets.




Embodiment Construction

[0049] The present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments:

[0050] A two-stage text feature selection method under an unbalanced data set, as shown in Figure 2, comprising a local feature selection method and a global feature selection method;

[0051] The local feature selection method, that is, selecting local feature words with a word-frequency-based CHI feature selection method, specifically includes the following steps:

[0052] Step S11: Obtain text data with category labels and use it as the training sample set D = {d_1, d_2, ..., d_t};

[0053] Step S12: Preprocess the text data in the training sample set to obtain the category label set C = {c_1, c_2, ..., c_m}; perform word segmentation and stop-word removal by category, so that each category c_i forms an initial feature set T_i = {t_i1, t_i2, ..., t_ik}, 1 ≤ i ≤ m;

[0054] Step S13: Calculate the initial feature set T...
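Steps S11 to S13 can be illustrated with a minimal Python sketch. This is only one reading of the procedure: the tokenizer, stop-word list, top_k cutoff and, in particular, the term-frequency weighting applied to the standard chi-square statistic are assumptions, since the exact word-frequency-based CHI formula of step S13 is not reproduced in this excerpt.

```python
# Sketch of steps S11-S13 (illustrative only; the patent's exact
# word-frequency-based CHI formula is not reproduced in this excerpt).
from collections import Counter, defaultdict

def preprocess(docs):
    """docs: list of (text, label). Whitespace tokenization and a toy
    stop-word list stand in for the word segmentation of step S12."""
    stop_words = {"the", "a", "of"}          # placeholder stop-word list
    tokenized = []
    for text, label in docs:
        tokens = [w for w in text.lower().split() if w not in stop_words]
        tokenized.append((tokens, label))
    return tokenized

def chi_square_tf(tokenized, top_k=100):
    """Per-category CHI scores weighted by within-category term frequency
    (the weighting is an assumption about the 'word-frequency-based CHI')."""
    n_docs = len(tokenized)
    doc_freq = defaultdict(Counter)          # doc_freq[c][t] = docs in c containing t
    term_freq = defaultdict(Counter)         # term_freq[c][t] = occurrences of t in c
    docs_per_cat = Counter()
    for tokens, label in tokenized:
        docs_per_cat[label] += 1
        term_freq[label].update(tokens)
        doc_freq[label].update(set(tokens))

    selected = {}
    for c in docs_per_cat:
        scores = {}
        cat_tokens = sum(term_freq[c].values())
        for t in doc_freq[c]:
            a = doc_freq[c][t]                                   # in c, contains t
            b = sum(doc_freq[o][t] for o in doc_freq if o != c)  # not in c, contains t
            c_ = docs_per_cat[c] - a                             # in c, lacks t
            d = n_docs - docs_per_cat[c] - b                     # not in c, lacks t
            denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
            chi = n_docs * (a * d - b * c_) ** 2 / denom if denom else 0.0
            scores[t] = chi * (term_freq[c][t] / cat_tokens)     # tf weighting (assumed)
        selected[c] = [t for t, _ in sorted(scores.items(),
                                            key=lambda kv: kv[1],
                                            reverse=True)[:top_k]]
    return selected
```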



Abstract

The invention discloses a two-stage text feature selection method under an unbalanced data set, and belongs to the field of text feature selection in natural language processing. According to the invention, word segmentation preprocessing is carried out on the training set data according to category labels; an initial feature set T_i is formed for each category, a CHI method based on word frequency is used to carry out first-stage local feature selection on T_i, and an improved IG method is then used to carry out second-stage global feature selection on the result of the first stage. The invention provides a feature selection method that considers both global and local features and ensures the proportion of features from minority-class samples in the final feature set. A comparison experiment with three related feature selection methods on a labelled news corpus provided by the Sogou laboratory shows that the method is superior to the compared methods in precision, recall and F1 value, and improves the classification accuracy on unbalanced data sets.
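As a companion to the local stage sketched above, the second-stage global selection described in the abstract can be outlined as follows. The patent's "improved IG" method is not detailed in this excerpt, so the sketch falls back to the standard information-gain formula applied to the union of locally selected features; all identifiers, cutoffs and the combination step are illustrative assumptions.

```python
# Illustrative second-stage global selection; standard IG is used here as a
# stand-in for the patent's 'improved IG', whose details are not given above.
import math
from collections import Counter, defaultdict

def information_gain(tokenized, candidate_terms, top_k=500):
    """Rank candidate terms (a set) by information gain over all categories."""
    n_docs = len(tokenized)
    cat_counts = Counter(label for _, label in tokenized)
    docs_with_t = defaultdict(Counter)   # docs_with_t[t][c] = docs of class c containing t
    df = Counter()                       # df[t] = number of documents containing t
    for tokens, label in tokenized:
        for t in set(tokens) & candidate_terms:
            docs_with_t[t][label] += 1
            df[t] += 1

    # Entropy of the class distribution, H(C)
    h_c = -sum((n / n_docs) * math.log2(n / n_docs) for n in cat_counts.values())

    def cond_entropy(counts, total):
        if total == 0:
            return 0.0
        return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

    scores = {}
    for t in candidate_terms:
        p_t = df[t] / n_docs
        present = docs_with_t[t]
        absent = Counter({c: cat_counts[c] - present[c] for c in cat_counts})
        scores[t] = h_c - (p_t * cond_entropy(present, df[t])
                           + (1 - p_t) * cond_entropy(absent, n_docs - df[t]))
    return [t for t, _ in sorted(scores.items(),
                                 key=lambda kv: kv[1], reverse=True)[:top_k]]

# Hypothetical usage combining the two stages:
# local = chi_square_tf(tokenized, top_k=100)
# candidates = set().union(*local.values())
# global_features = information_gain(tokenized, candidates, top_k=500)
```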

Description

Technical field

[0001] The invention belongs to the field of text feature selection in natural language processing, and in particular relates to a two-stage text feature selection method under an unbalanced data set.

Background technique

[0002] Text classification refers to the process of having a computer automatically assign a given text to one or several predefined categories. Text classification is mainly divided into five steps: obtaining the training set, text preprocessing, feature extraction, document representation, and the classification algorithm. A general data set can generate tens of thousands of features after preprocessing, and a large data set can even generate millions of features. High-dimensional features not only increase computation time but also reduce the accuracy of text classification. Effective feature extraction can reduce the feature dimension and improve the accuracy of text classification, so feature extraction is o...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F40/289; G06F16/35; G06K9/62
CPC: G06F16/355; G06F18/2113; G06F18/2411; Y02D10/00
Inventors: 赵卫东, 赵嘉莹, 王铭
Owner: SHANDONG UNIV OF SCI & TECH