Feature selection method based on document frequency of within-class and between-class and term frequency statistics

A feature selection method using term frequency statistics, applied in computing, special-purpose data processing, and instruments. It addresses interference from negatively correlated feature words and the neglect of feature words' frequency distribution, and it improves accuracy.

Inactive Publication Date: 2018-09-04
HUBEI UNIV OF TECH
4 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

However, the traditional IG method has deficiencies: (1) it does not consider the term frequency distribution of feature words in each class; (2) negatively correlated feature words interfere with selection; (3) It ca




Embodiment Construction

[0030] To help those skilled in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described here only illustrate and explain the present invention and are not intended to limit it.

[0031] Referring to Figure 1 and Figure 2, the present invention provides a feature selection method based on within-class and between-class document frequency and term frequency statistics, comprising the following steps:

[0032] Step 1: After word segmentation and stop-word removal, each text in the training set is represented by its terms; these terms form the original feature space. Input all original feature words of the training set, where the feature words in the original feature space are denoted t_k, 0≤k≤N, and N is the total number of feature words in the original fe...
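The preprocessing in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes the documents have already been segmented into token lists by some word segmenter, and the stop-word set and the function name `build_feature_space` are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical stop-word list, for illustration only.
STOP_WORDS: Set[str] = {"的", "了", "是"}


def build_feature_space(segmented_docs: Iterable[List[str]],
                        stop_words: Set[str] = STOP_WORDS) -> List[str]:
    """Return the distinct terms left after stop-word removal, in first-seen order."""
    seen: Dict[str, None] = {}  # a dict preserves insertion order
    for tokens in segmented_docs:
        for tok in tokens:
            tok = tok.strip()
            if tok and tok not in stop_words:
                seen.setdefault(tok, None)
    return list(seen)


# Two already-segmented example documents (assumed input format).
docs = [
    ["文本", "的", "分类"],
    ["特征", "选择", "是", "文本", "分类", "的", "关键"],
]
features = build_feature_space(docs)
print(features)
```

The resulting list plays the role of the original feature space; each entry corresponds to one feature word t_k.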



Abstract

The invention discloses a feature selection method based on within-class and between-class document frequency and term frequency statistics (DFCTFS). The method jointly considers a feature term's document frequency, term frequency, between-class concentration, and within-class dispersion to construct a DFCTFS-based feature selection evaluation function. After text preprocessing, this evaluation function is applied to the original feature space of the training set to select a given proportion of feature terms in each class, which form that class's feature term bank; the feature term bank of the training set is the union of the per-class banks. The method can select feature terms that are concentrated in the documents of one class, evenly distributed across that class's documents, and frequent, and it can thereby improve Chinese text classification.
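The per-class selection and union described in the abstract can be sketched as below. The evaluation function itself is not reproduced in this excerpt, so the sketch takes precomputed scores as input; the example scores are made-up numbers, and the function name `select_feature_bank`, the rounding with `ceil`, and the alphabetical tie-break are all assumptions.

```python
import math
from typing import Dict, Set


def select_feature_bank(scores_by_class: Dict[str, Dict[str, float]],
                        ratio: float) -> Set[str]:
    """Keep the top `ratio` share of terms in each class, then take the union."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    bank: Set[str] = set()
    for scores in scores_by_class.values():
        k = math.ceil(ratio * len(scores))  # assumed rounding rule
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        bank.update(ranked[:k])
    return bank


# Made-up per-class scores standing in for the evaluation function's output.
example_scores = {
    "sports": {"比赛": 0.9, "球队": 0.7, "市场": 0.1, "今天": 0.05},
    "finance": {"市场": 0.8, "股票": 0.75, "比赛": 0.2, "今天": 0.05},
}
bank = select_feature_bank(example_scores, ratio=0.5)
print(sorted(bank))
```

Selecting per class before taking the union means a term that scores well in only one class still enters the final bank, which a single global ranking would not guarantee.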

Description

Technical field

[0001] The invention belongs to the technical field of Chinese text classification and relates to a feature selection method, in particular to a feature selection method based on within-class and between-class document frequency and term frequency statistics.

Background

[0002] Chinese text classification generally proceeds as follows: text preprocessing, feature selection, construction of a text representation model, classification with a classification algorithm, and evaluation of the classification model. Feature selection is a key step in Chinese text classification. It selects important features from the high-dimensional original feature space to form a lower-dimensional space, thereby improving classification accuracy and efficiency.

[0003] Traditional feature selection methods include document frequency (DF), mutual information (MI), information gain (IG), and chi-square statistics (CHI). The method of featur...
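Two of the traditional measures named above, DF and IG, can be sketched as follows. This uses the standard textbook definition of information gain over term presence and absence; it is not taken from the patent text, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Set


def entropy(counts: Iterable[int]) -> float:
    """Shannon entropy (base 2) of a class-count distribution."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def document_frequency(term: str, docs: Sequence[Set[str]]) -> int:
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if term in doc)


def information_gain(term: str, docs: Sequence[Set[str]],
                     labels: Sequence[str]) -> float:
    """IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    n = len(docs)
    all_counts = Counter(labels)
    with_t = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    without_t = all_counts - with_t  # keeps only positive counts
    n_t = sum(with_t.values())
    conditional = ((n_t / n) * entropy(with_t.values())
                   + ((n - n_t) / n) * entropy(without_t.values()))
    return entropy(all_counts.values()) - conditional


# Toy corpus: each document is reduced to its set of terms.
corpus = [{"比赛", "球队"}, {"比赛", "今天"}, {"股票", "今天"}, {"股票", "市场"}]
classes = ["sports", "sports", "finance", "finance"]
ig_match = information_gain("比赛", corpus, classes)
ig_today = information_gain("今天", corpus, classes)
print(ig_match, ig_today, document_frequency("今天", corpus))
```

Because each document is reduced to a set of terms, this IG only sees whether a term occurs in a document, not how many times it occurs.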


Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F40/216, G06F40/289
Inventor: 邵雄凯赵婧刘建舟王春枝华满阳邹陈亮亮
Owner: HUBEI UNIV OF TECH