Two-stage text feature selection method under unbalanced data set

A feature selection method in the fields of digital data processing, natural language data processing, and special data processing applications. It addresses the problems of existing feature selection techniques, such as ignoring feature correlation and biasing toward the majority class, and it improves classification accuracy with a reasonable design.
CN111144106A · Active · Publication Date: 2020-05-12 · SHANDONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
SHANDONG UNIV OF SCI & TECH
Publication Date
2020-05-12


Abstract

The invention discloses a two-stage text feature selection method for unbalanced data sets, belonging to the field of text feature selection in natural language processing. The training set data is first segmented into words according to category labels, and an initial feature set Ti is formed for each category. In the first stage, a word-frequency-based CHI method performs local feature selection on each Ti. In the second stage, an improved IG method performs global feature selection on the result of the first stage. The method therefore considers both global and local features and ensures that features from small-class samples are represented in the final feature set. In a comparison experiment against three related feature selection methods on a labeled news corpus provided by Sogou Laboratory, the method outperforms the compared methods in precision, recall, and F1 value, improving classification accuracy on the unbalanced data set.
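
The two-stage flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas for the word-frequency-based CHI or the improved IG, so this sketch substitutes the standard chi-square statistic (on document counts) and the standard information gain; it is not the patented method. The function names and the parameters `k_local` and `m_global` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square for a term/class 2x2 table (not the patent's variant).

    a: docs in the class that contain the term
    b: docs outside the class that contain the term
    c: docs in the class without the term
    d: docs outside the class without the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def entropy(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k)


def information_gain(term, docs, labels, classes):
    """Standard information gain of a term (not the patent's improved IG)."""
    with_t = Counter(l for d, l in zip(docs, labels) if term in d)
    without_t = Counter(l for d, l in zip(docs, labels) if term not in d)
    n = len(docs)
    n_t = sum(with_t.values())
    h_class = entropy([labels.count(c) for c in classes])
    h_cond = (n_t / n) * entropy([with_t[c] for c in classes]) \
        + ((n - n_t) / n) * entropy([without_t[c] for c in classes])
    return h_class - h_cond


def two_stage_select(docs, labels, k_local, m_global):
    """Stage 1: top k_local terms per class; stage 2: top m_global overall."""
    docs = [set(d) for d in docs]      # each document as a set of tokens
    labels = list(labels)
    classes = sorted(set(labels))
    candidates = set()
    # Stage 1 (local): score each class's own vocabulary T_i separately, so
    # every class, including a small one, contributes candidate features.
    for cls in classes:
        in_cls = [d for d, l in zip(docs, labels) if l == cls]
        out_cls = [d for d, l in zip(docs, labels) if l != cls]
        scores = {}
        for term in set().union(*in_cls):
            a = sum(term in d for d in in_cls)
            b = sum(term in d for d in out_cls)
            scores[term] = chi_square(a, b, len(in_cls) - a, len(out_cls) - b)
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        candidates.update(ranked[:k_local])
    # Stage 2 (global): re-rank the pooled candidates over all classes.
    ig = {t: information_gain(t, docs, labels, classes) for t in candidates}
    return sorted(ig, key=lambda t: (-ig[t], t))[:m_global]


if __name__ == "__main__":
    # Toy unbalanced corpus: four "finance" documents, one "sports" document.
    toy_docs = [
        ["stock", "market", "rise"],
        ["stock", "bank", "rate"],
        ["market", "bank", "stock"],
        ["stock", "trade", "market"],
        ["goal", "match", "team"],
    ]
    toy_labels = ["finance"] * 4 + ["sports"]
    print(two_stage_select(toy_docs, toy_labels, k_local=2, m_global=3))
```

Because the first stage ranks terms within each category before pooling, the minority class in the toy corpus still contributes candidates to the second stage, which is the property the abstract emphasizes. Ties are broken alphabetically here only to keep the output deterministic.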

Description

technical field

[0001] The invention belongs to the field of text feature selection in natural language processing, and in particular relates to a two-stage text feature selection method under an unbalanced data set.

Background technique

[0002] Text classification is the process by which a computer automatically assigns a given text to one or more predefined categories. It consists of five main steps: obtaining the training set, text preprocessing, feature extraction, document representation, and applying the classification algorithm. A typical data set can generate tens of thousands of features after preprocessing, and a large data set can generate millions. High-dimensional features not only increase computation time but also reduce the accuracy of text classification. Effective feature extraction can reduce the feature dimension and improve the accuracy of text classification, so feature extraction is o...

Claims
