Chi square statistic based self-adaption feature selection method

An adaptive feature selection method, applicable to computing, special-purpose data processing, and natural language data processing. It addresses unsatisfactory classification results caused by the CHI method, which does not consider the positive and negative correlation between feature items and categories and inflates the weights of some feature items.

Inactive Publication Date: 2016-04-20

BEIJING UNIV OF TECH

Cites: 4; cited by: 38


AI Technical Summary

Problems solved by technology

[0004] CHI is one of the most commonly used text feature selection methods. It is simple to implement and has low time complexity, but it also has several shortcomings that keep classification results from being ideal.

The shortcomings of the CHI algorithm fall into two main areas. First, CHI considers only the document frequency of a feature item and ignores its word frequency, which amplifies the weight of low-frequency words. Second, it amplifies the weight of feature items that rarely appear in the given class but often appear in other classes.
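Both shortcomings are visible in the classical CHI score itself. The sketch below uses the standard textbook form, chi2(t, c) = N(AD - BC)^2 / ((A + C)(B + D)(A + B)(C + D)), where A, B, C, and D are document counts. The function name `chi_square` and the toy counts are illustrative assumptions, not taken from the patent.

```python
def chi_square(a, b, c, d):
    """Classical chi-square score for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term

    All inputs are document counts, so how often the term occurs
    inside a document (its word frequency) never enters the score.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Term that appears mostly inside the class (positive correlation).
positive = chi_square(40, 10, 10, 40)

# Mirror case: term that appears mostly outside the class (negative correlation).
negative = chi_square(10, 40, 40, 10)
```

Both calls return 36.0. Because the difference AD - BC is squared, a term that is characteristic of the other classes scores exactly as high as one that is characteristic of the target class, which is the correlation problem described above.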

Many researchers have tried to address these deficiencies, and the improvements fall into two groups. The first introduces several adjustment parameters to reduce the reliance on low-frequency words, but it does not consider the positive and negative correlation between feature items and categories.


Embodiment Construction

[0043] The present invention is realized by adopting the following technical means:

[0044] The method is an adaptive text feature selection method based on the chi-square statistic. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Second, adaptive text feature selection based on the chi-square statistic is performed: a word frequency factor α and an inter-class variance β are defined and introduced into the CHI algorithm, and an appropriate scale factor μ is added to it. Finally, the scale factor μ is adjusted automatically in combination with the classic KNN algorithm, so that the improved CHI adapts to different corpora and ensures higher classification accuracy.
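The automatic adjustment of μ can be pictured as a search over candidate values, scored by KNN accuracy. The following is only a minimal sketch under stated assumptions: this page does not give the improved CHI formula, so the scoring function is passed in as a hypothetical `score_fn(feature_index, mu)`, and the plain grid search, the function names, and the Euclidean KNN are all illustrative choices rather than the patent's procedure.

```python
import math
from collections import Counter


def knn_predict(train_vecs, train_labels, vec, k=3):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    order = sorted(range(len(train_vecs)),
                   key=lambda i: math.dist(train_vecs[i], vec))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_vecs, train_labels, test_vecs, test_labels, k=3):
    """Fraction of test vectors whose KNN prediction matches the true label."""
    hits = sum(knn_predict(train_vecs, train_labels, v, k) == y
               for v, y in zip(test_vecs, test_labels))
    return hits / len(test_labels)


def project(vecs, keep):
    """Keep only the selected feature columns of each vector."""
    return [[v[j] for j in keep] for v in vecs]


def tune_mu(score_fn, candidates, n_features, top_n, train, test, k=3):
    """Pick the scale factor whose selected features give the best KNN accuracy.

    score_fn(j, mu) is a hypothetical stand-in for the improved CHI score
    of feature j under scale factor mu; the real formula is not shown here.
    """
    (train_vecs, train_labels), (test_vecs, test_labels) = train, test
    best_mu, best_acc = None, -1.0
    for mu in candidates:
        # Rank features by score and keep the top_n highest.
        ranked = sorted(range(n_features),
                        key=lambda j: score_fn(j, mu), reverse=True)
        keep = ranked[:top_n]
        acc = knn_accuracy(project(train_vecs, keep), train_labels,
                           project(test_vecs, keep), test_labels, k)
        if acc > best_acc:
            best_mu, best_acc = mu, acc
    return best_mu, best_acc
```

Passing the scoring function as a parameter keeps the search loop independent of how α, β, and μ are actually combined, so the same loop could be reused once the real formula is available.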

[0045] The above-mentioned adaptive text feature selection method based on chi-square statistics is used for text classification, including the following steps:

[0046] Step 1, download the Chinese corpus released by Fudan University from the Internet - the training text set ...


Abstract

The invention discloses an adaptive feature selection method based on the chi-square statistic and relates to the field of computer text data processing. First, the training text set and the test text set are preprocessed, including word segmentation and stop-word removal. Next, adaptive text feature selection based on the chi-square statistic is performed: a word frequency factor and an inter-class variance are defined and introduced into the CHI algorithm, and an appropriate scaling factor is added to it. Finally, the scaling factor is adjusted automatically using the evaluation metrics of the classical KNN algorithm, so that the improved CHI adapts to different text corpora and ensures higher classification accuracy. Experimental results show that, compared with the conventional CHI method, classification accuracy improves on both balanced and unbalanced corpora.

Description

Technical field

[0001] The invention relates to the field of computer text data processing, in particular to an adaptive text feature selection method based on the chi-square statistic (χ², CHI).

Background

[0002] In today's era of big data, mining the potential value of data is very important, and data mining, as a technology for discovering that value, has attracted great attention. Text data accounts for a considerable proportion of big data, and text classification, as a data mining method for effectively organizing and managing text data, has gradually become a research hot spot. It is widely used in information filtering, information organization and management, information retrieval, digital libraries, and spam filtering. Text classification (TC) is the process of automatically assigning unknown texts to one or more classes according to their content under a predetermined classification system. Commonly used text cl...


Application Information


IPC(8): G06F17/30, G06F17/27

CPC: G06F16/00, G06F40/205, G06F2216/03

Inventors: 汪友生, 樊存佳, 王雨婷

Owner BEIJING UNIV OF TECH
