An Adaptive Feature Selection Method Based on Chi-Square Statistics

An adaptive feature selection method, applied in computing, special-purpose data processing, and natural language data processing. It addresses shortcomings of existing approaches: ignoring the positive and negative correlation between feature items and categories, considering only document frequency, and amplifying the weights of low-frequency words.

Inactive Publication Date: 2019-02-26
BEIJING UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0004] As one of the commonly used text feature selection methods, the CHI method is simple to implement and has low time complexity, but it also has several shortcomings that make its classification performance less than ideal.
The shortcomings of the CHI algorithm fall mainly into two areas. First, CHI considers only the document frequency of feature items and ignores their word frequency, so the weight of low-frequency words is amplified. Second, it amplifies the weight of feature items that often appear in other classes.
To address these deficiencies, many researchers have proposed improvements, which can be grouped into two kinds. The first introduces several adjustment parameters to reduce the reliance on low-frequency words, but it does not consider the positive and negative correlation between feature items and categories.
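
As an illustration of the two shortcomings, the following is a minimal sketch of the standard chi-square score computed from a 2x2 table of document counts. The function name and the example counts are assumptions made for this sketch and do not come from the patent text.

```python
def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a class, from document counts.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Term concentrated inside the class (positive correlation).
print(chi_square(40, 10, 10, 40))  # 36.0
# Term concentrated outside the class (negative correlation): same score.
print(chi_square(10, 40, 40, 10))  # 36.0
```

Only document counts enter the formula, so how many times a term occurs within a document has no effect. Because the difference a*d - b*c is squared, a term that mostly appears in other classes receives the same score as one that mostly appears in the class itself.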

Examples

Embodiment Construction

[0043] The present invention is realized by the following technical means:

[0044] An adaptive text feature selection method based on chi-square statistics. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on chi-square statistics is performed: a word frequency factor α and an inter-class variance β are defined and introduced into the CHI algorithm, and an appropriate scale factor μ is added to it. Finally, in combination with the classic KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI adapts to different corpora and maintains high classification accuracy.
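
The excerpt names α, β, and μ but does not give their formulas, so the following is only a sketch under stated assumptions. Here α is assumed to be the share of a term's occurrences that fall in the target class, β the population variance of the term's per-class frequencies, and μ a weight on their product. The `evaluate` callback is hypothetical: it stands in for "select features with this μ, classify with KNN, and return an evaluation score".

```python
from statistics import pvariance


def frequency_factor(tf_by_class, cls):
    """alpha (assumed form): share of the term's occurrences that fall in cls."""
    total = sum(tf_by_class.values())
    return tf_by_class[cls] / total if total else 0.0


def interclass_variance(tf_by_class):
    """beta (assumed form): population variance of per-class term frequency."""
    return pvariance(list(tf_by_class.values()))


def improved_chi(chi, alpha, beta, mu):
    """Assumed combination: mu scales how strongly alpha and beta reweight CHI."""
    return chi * (1.0 + mu * alpha * beta)


def pick_mu(candidates, evaluate):
    """Return the candidate mu for which evaluate(mu) is highest."""
    return max(candidates, key=evaluate)


# Hypothetical term frequencies of one term in two classes.
tf = {"sports": 8, "finance": 2}
alpha = frequency_factor(tf, "sports")
beta = interclass_variance(tf)
score = improved_chi(10.0, alpha, beta, mu=0.5)

# Stand-in for a KNN-based evaluation; a real run would train and test a classifier.
best_mu = pick_mu([0.1, 0.5, 1.0], lambda mu: -(mu - 0.5) ** 2)
```

The search over candidate values is what makes the method adaptive in this sketch: the same scoring code is reused on each corpus, and only μ changes according to the evaluation result.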

[0045] The above-mentioned adaptive text feature selection method based on chi-square statistics for text classification includes the following steps:

[0046] Step 1. Download the Chinese corpus released by Fudan University from the Internet, comprising a training text set and a test text set;

[0047] Step...

Abstract

The invention discloses an adaptive feature selection method based on chi-square statistics and relates to the field of computer text data processing. First, a training text set and a test text set are preprocessed, including word segmentation and stop-word removal. Then, adaptive text feature selection based on chi-square statistics is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and an appropriate scaling factor is added to it. Finally, the scaling factor is adjusted automatically according to the evaluation metrics of the classic KNN algorithm, so that the improved CHI adapts to different text corpora and achieves higher classification accuracy. Experimental results show that, compared with the conventional CHI method, classification accuracy improves on both a balanced corpus and an unbalanced corpus.

Description

Technical field

[0001] The present invention relates to the field of computer text data processing, and in particular to an adaptive text feature selection method based on chi-square statistics (χ², CHI).

Background technique

[0002] In today's era of big data, mining the potential value of data is essential. As a technology for discovering that value, data mining has attracted great attention. Text data accounts for a large proportion of big data, and text classification, as a data mining method for organizing and managing text data effectively, has gradually become a research hot spot. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, and spam filtering. Text Classification (TC) refers to the process of automatically assigning unknown texts to one or more categories based on their content under a predetermined category system. Commonly used text classification methods, such as K-Ne...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F17/27
CPC: G06F16/00, G06F40/205, G06F2216/03
Inventor: 汪友生, 樊存佳, 王雨婷
Owner: BEIJING UNIV OF TECH