Two-stage text feature selection method under unbalanced data set

A feature selection method and technology, applied in the fields of digital data processing, natural language data processing and special data processing applications, which addresses the problems that existing methods ignore the correlation between features and categories and, on unbalanced data sets, are biased toward the majority class, thereby improving classification accuracy through a reasonable design.

Active Publication Date: 2020-05-12
SHANDONG UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] Most existing feature selection methods extract global features relative to the entire document set and therefore ignore the correlation between features and individual categories. In addition, existing feature selection methods work well on balanced data sets, but on unbalanced data sets they tend to favor the majority class and overlook the features of minority classes, while most data sets distributed on the network are unprocessed, unbalanced data sets.




Embodiment Construction

[0049] The present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments:

[0050] A two-stage text feature selection method under an unbalanced data set, as shown in Figure 2, comprising a local feature selection method and a global feature selection method;

[0051] The local feature selection method, that is, selecting local feature words with a word-frequency-based CHI feature selection method, specifically includes the following steps:

[0052] Step S11: Obtain text data with category labels and use it as the training sample set D = {d_1, d_2, ..., d_t};

[0053] Step S12: Preprocess the text data in the training sample set to obtain the category label set C = {c_1, c_2, ..., c_m}; perform word segmentation and stop-word removal by category, so that each category c_i forms an initial feature set T_i = {t_i1, t_i2, ..., t_ik}, 1 ≤ i ≤ m;

[0054] Step S13: Calculate the initial feature set T...
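Steps S11 to S13 can be illustrated with a minimal Python sketch. This is only one reading of the procedure: the tokenizer, stop-word list, top_k cutoff and, in particular, the term-frequency weighting applied to the standard chi-square statistic are assumptions, since the exact word-frequency-based CHI formula of step S13 is not reproduced in this excerpt.

```python
# Sketch of steps S11-S13 (illustrative only; the patent's exact
# word-frequency-based CHI formula is not reproduced in this excerpt).
from collections import Counter, defaultdict

def preprocess(docs):
    """docs: list of (text, label). Whitespace tokenization and a toy
    stop-word list stand in for the word segmentation of step S12."""
    stop_words = {"the", "a", "of"}          # placeholder stop-word list
    tokenized = []
    for text, label in docs:
        tokens = [w for w in text.lower().split() if w not in stop_words]
        tokenized.append((tokens, label))
    return tokenized

def chi_square_tf(tokenized, top_k=100):
    """Per-category CHI scores weighted by within-category term frequency
    (the weighting is an assumption about the 'word-frequency-based CHI')."""
    n_docs = len(tokenized)
    doc_freq = defaultdict(Counter)          # doc_freq[c][t] = docs in c containing t
    term_freq = defaultdict(Counter)         # term_freq[c][t] = occurrences of t in c
    docs_per_cat = Counter()
    for tokens, label in tokenized:
        docs_per_cat[label] += 1
        term_freq[label].update(tokens)
        doc_freq[label].update(set(tokens))

    selected = {}
    for c in docs_per_cat:
        scores = {}
        cat_tokens = sum(term_freq[c].values())
        for t in doc_freq[c]:
            a = doc_freq[c][t]                                   # in c, contains t
            b = sum(doc_freq[o][t] for o in doc_freq if o != c)  # not in c, contains t
            c_ = docs_per_cat[c] - a                             # in c, lacks t
            d = n_docs - docs_per_cat[c] - b                     # not in c, lacks t
            denom = (a + b) * (c_ + d) * (a + c_) * (b + d)
            chi = n_docs * (a * d - b * c_) ** 2 / denom if denom else 0.0
            scores[t] = chi * (term_freq[c][t] / cat_tokens)     # tf weighting (assumed)
        selected[c] = [t for t, _ in sorted(scores.items(),
                                            key=lambda kv: kv[1],
                                            reverse=True)[:top_k]]
    return selected
```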



Abstract

The invention discloses a two-stage text feature selection method under an unbalanced data set, and belongs to the field of text feature selection in natural language processing. According to the invention, word segmentation preprocessing is carried out on the training set data according to category labels; an initial feature set T_i is formed for each category, a CHI method based on word frequency is used to carry out first-stage local feature selection on T_i, and an improved IG method is then used to carry out second-stage global feature selection on the result of the first stage. The invention provides a feature selection method that considers both global and local features and ensures the proportion of features from minority-class samples in the final feature set. A comparison experiment with three related feature selection methods on a labelled news corpus provided by the Sogou laboratory shows that the method is superior to the compared methods in precision, recall and F1 value, and improves the classification accuracy on unbalanced data sets.
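As a companion to the local stage sketched above, the second-stage global selection described in the abstract can be outlined as follows. The patent's "improved IG" method is not detailed in this excerpt, so the sketch falls back to the standard information-gain formula applied to the union of locally selected features; all identifiers, cutoffs and the combination step are illustrative assumptions.

```python
# Illustrative second-stage global selection; standard IG is used here as a
# stand-in for the patent's 'improved IG', whose details are not given above.
import math
from collections import Counter, defaultdict

def information_gain(tokenized, candidate_terms, top_k=500):
    """Rank candidate terms (a set) by information gain over all categories."""
    n_docs = len(tokenized)
    cat_counts = Counter(label for _, label in tokenized)
    docs_with_t = defaultdict(Counter)   # docs_with_t[t][c] = docs of class c containing t
    df = Counter()                       # df[t] = number of documents containing t
    for tokens, label in tokenized:
        for t in set(tokens) & candidate_terms:
            docs_with_t[t][label] += 1
            df[t] += 1

    # Entropy of the class distribution, H(C)
    h_c = -sum((n / n_docs) * math.log2(n / n_docs) for n in cat_counts.values())

    def cond_entropy(counts, total):
        if total == 0:
            return 0.0
        return -sum((n / total) * math.log2(n / total) for n in counts.values() if n)

    scores = {}
    for t in candidate_terms:
        p_t = df[t] / n_docs
        present = docs_with_t[t]
        absent = Counter({c: cat_counts[c] - present[c] for c in cat_counts})
        scores[t] = h_c - (p_t * cond_entropy(present, df[t])
                           + (1 - p_t) * cond_entropy(absent, n_docs - df[t]))
    return [t for t, _ in sorted(scores.items(),
                                 key=lambda kv: kv[1], reverse=True)[:top_k]]

# Hypothetical usage combining the two stages:
# local = chi_square_tf(tokenized, top_k=100)
# candidates = set().union(*local.values())
# global_features = information_gain(tokenized, candidates, top_k=500)
```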

Description

Technical field

[0001] The invention belongs to the field of text feature selection in natural language processing, and in particular relates to a two-stage text feature selection method under an unbalanced data set.

Background technique

[0002] Text classification refers to the process of having a computer automatically assign a given text to one or several predefined categories. Text classification is mainly divided into five steps: obtaining the training set, text preprocessing, feature extraction, document representation, and the classification algorithm. A general data set can generate tens of thousands of features after preprocessing, and a large data set can even generate millions of features. High-dimensional features not only increase computation time but also reduce the accuracy of text classification. Effective feature extraction can reduce the feature dimension and improve the accuracy of text classification, so feature extraction is o...


Application Information

Patent Type & Authority: Application (China)
IPC(8): G06F40/289; G06F16/35; G06K9/62
CPC: G06F16/355; G06F18/2113; G06F18/2411; Y02D10/00
Inventors: 赵卫东, 赵嘉莹, 王铭
Owner: SHANDONG UNIV OF SCI & TECH