Chi square statistic based self-adaption feature selection method

A self-adaptive feature selection method, applied in computing, special data processing applications, natural language data processing, etc. It addresses problems of the conventional CHI method such as unsatisfactory classification results, failure to consider the positive and negative correlation between feature items and categories, and inflated weights.

Status: Inactive
Publication Date: 2016-04-20
BEIJING UNIV OF TECH
Cites: 4; Cited by: 38

AI Technical Summary

Problems solved by technology

[0004] The CHI method is one of the most commonly used text feature selection methods. It is simple to implement and has low time complexity, but it also has shortcomings that make the classification results less than ideal.
The shortcomings of the CHI algorithm fall into two main groups. First, CHI considers only the document frequency of a feature item and ignores its word frequency, resulting in the weight of low-frequency words being ...
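
For reference, the classic CHI score criticized here is usually computed from a 2×2 contingency table of document counts. The sketch below uses the standard textbook formula; the function name `chi_square` and the parameter names `a`, `b`, `c`, `d` are illustrative and do not come from the patent text.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Classic chi-square score of a term t for a category k.

    a: documents in k that contain t
    b: documents outside k that contain t
    c: documents in k that do not contain t
    d: documents outside k that do not contain t
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: a row or column of the table is empty
    return n * (a * d - c * b) ** 2 / denominator


# Term positively associated with the category (mostly inside it).
positive = chi_square(a=40, b=10, c=10, d=40)
# Term negatively associated with the category (mostly outside it).
negative = chi_square(a=10, b=40, c=40, d=10)
print(positive, negative)
```

Both calls return the same score because the difference `a * d - c * b` is squared, so the sign of the correlation is lost. The inputs are document counts only, so how often a term occurs inside each document never enters the formula.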


Examples

Embodiment Construction

[0043] The present invention is realized by adopting the following technical means:

[0044] An adaptive text feature selection method based on chi-square statistics. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on chi-square statistics is performed: a word frequency factor α and an inter-class variance β are defined and introduced into the CHI algorithm, and an appropriate scale factor μ is added to it. Finally, in combination with the classic KNN algorithm, the scale factor μ is adjusted automatically so that the improved CHI adapts to different corpora and ensures higher classification accuracy.
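
The adaptive step in [0044] can be pictured as a search over candidate values of μ that keeps the value giving the best KNN classification accuracy. The sketch below is a minimal illustration under assumptions: this excerpt does not give the formulas for α and β or say how μ enters the score, so scoring and classification are hidden behind a caller-supplied `evaluate` function. The names `select_scale_factor`, `evaluate`, `toy_evaluate`, and the candidate grid are hypothetical.

```python
from typing import Callable, Iterable, Tuple


def select_scale_factor(
    candidates: Iterable[float],
    evaluate: Callable[[float], float],
) -> Tuple[float, float]:
    """Return (best_mu, best_accuracy) over the candidate scale factors.

    evaluate(mu) is assumed to score features with the improved CHI using
    this mu, keep the top-ranked features, run KNN, and return the accuracy.
    """
    best_mu, best_accuracy = None, float("-inf")
    for mu in candidates:
        accuracy = evaluate(mu)
        if accuracy > best_accuracy:
            best_mu, best_accuracy = mu, accuracy
    if best_mu is None:
        raise ValueError("candidates must not be empty")
    return best_mu, best_accuracy


# Toy stand-in for a real KNN evaluation (assumption): accuracy peaks at mu = 0.5.
def toy_evaluate(mu: float) -> float:
    return 0.9 - (mu - 0.5) ** 2


best_mu, best_accuracy = select_scale_factor([0.1, 0.3, 0.5, 0.7, 0.9], toy_evaluate)
print(best_mu, best_accuracy)
```

Keeping the evaluation behind a callback means the same loop can be reused for any corpus; only `evaluate` needs to know about the feature scores and the classifier.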

[0045] The above-mentioned adaptive text feature selection method based on chi-square statistics is used for text classification, including the following steps:

[0046] Step 1, download the Chinese corpus released by Fudan University from the Internet - the training text set ...
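
The preprocessing named in [0044], word segmentation followed by stop-word removal, might look like the sketch below. It is only an illustration: real Chinese text needs a dedicated word segmenter, which is replaced here by a whitespace split on text assumed to be segmented already. The function name `preprocess` and the tiny stop-word list are hypothetical.

```python
from typing import Iterable, List, Set

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS: Set[str] = {"的", "了", "是", "在"}


def preprocess(text: str, stop_words: Iterable[str] = STOP_WORDS) -> List[str]:
    """Split pre-segmented text on whitespace and drop stop words."""
    stop = set(stop_words)
    return [token for token in text.split() if token not in stop]


tokens = preprocess("文本 分类 是 数据 挖掘 的 方法")
print(tokens)
```

The same function would be applied to both the training and the test text sets so that feature scores and classification see identical token streams.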

Abstract

The invention discloses an adaptive feature selection method based on the chi-square statistic and relates to the field of computer text data processing. First, a training text set and a test text set are preprocessed, including word segmentation and stop-word removal. Then adaptive text feature selection based on the chi-square statistic is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and an appropriate scaling factor is added to it. Finally, the scaling factor is adjusted automatically using the evaluation indexes of the classical KNN algorithm, so that the improved CHI adapts to different text corpora and higher classification accuracy is ensured. Experimental results show that, compared with the conventional CHI method, classification accuracy is improved on both a balanced corpus and an unbalanced corpus.

Description

Technical field

[0001] The invention relates to the field of computer text data processing, in particular to an adaptive text feature selection method based on the chi-square statistic (χ², CHI).

Background

[0002] In today's era of big data, mining the potential value of data is very important, and data mining, as a technology for discovering that value, has attracted great attention. Text data accounts for a considerable proportion of big data, and text classification, a data mining method for effectively organizing and managing text data, has gradually become a research hot spot. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, and spam filtering. Text classification (TC) is the process of automatically assigning unknown texts to one or more classes according to their content under a predetermined classification system. Commonly used text cl...

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/00, G06F40/205, G06F2216/03
Inventors: 汪友生, 樊存佳, 王雨婷
Owner: BEIJING UNIV OF TECH