Feature extraction method for text categorization based on improved mutual information and entropy

A feature extraction technique for text classification, applicable to special data processing applications, instruments, electronic digital data processing, and similar fields, that addresses the need to further improve the accuracy and recall of text classification.

Inactive Publication Date: 2014-03-26
NANJING UNIV OF POSTS & TELECOMM
2 Cites, 24 Cited by

AI Technical Summary

Problems solved by technology

[0014] The purpose of the present invention is to provide a text classification feature extraction method based on improved mutual inf...


Examples

Embodiment Construction

[0043] For convenience of description, we assume the following application example: an overwhelming amount of news appears on the Internet every day, and we want to determine the main topic of an online news document, that is, to determine the document's category. During document classification, the feature extraction method proposed by the present invention can be used to extract features and determine text vectors, after which a classifier performs the classification.

[0044] The specific embodiment of the present invention is:

[0045] (1) Manually collect a number of articles in each category from the Internet to serve as the training data set for the text classification system;

[0046] (2) Preprocess these articles: after word segmentation, remove stop words to obtain feature words; count the term frequency and inverse document frequency of each word; calculate the weight of each feature word according to TF-IDF; and express each article as a two-tuple as a mult...
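The preprocessing in step (2) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the stop-word list, the whitespace tokenizer standing in for word segmentation, the function names, and the choice of `tf * log(N / df)` as the TF-IDF variant are all assumptions made for illustration. Each article is represented here as a mapping from feature word to weight, which is one plausible reading of the (feature, weight) pairs the step mentions.

```python
import math
from collections import Counter

# Hypothetical stop-word list; a real system would use a full list
# appropriate to the language of the corpus.
STOP_WORDS = {"the", "a", "of", "is", "in", "and", "to"}


def tokenize(text):
    """Whitespace split as a stand-in for word segmentation, then drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Return one {feature_word: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common TF-IDF variant, assumed here for illustration.
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each word.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the team won the match", "the market fell", "team market"]
vectors = tf_idf_vectors(docs)
```

A word that occurs in every document gets weight zero under this variant, which is why smoothed IDF formulas are often used in practice.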

Abstract

The invention provides a feature extraction method for text categorization that addresses the need to further increase the accuracy and recall of text categorization. The feature extraction method is a strategic method. It draws on the concept of entropy from statistical thermodynamics, where entropy describes the degree of disorder of a system and has found important applications in cybernetics, probability theory, number theory, astrophysics, bioscience, information theory, and other fields. Entropy can likewise be used in text categorization: a feature is regarded as an event and the category set of the text as a system, so entropy can measure the degree of disorder between features and categories, which is then converted into a measure of how closely the features and categories are related. Building on improved mutual information and combining it with the concept of entropy, the method proposes a new feature evaluation function. Feature extraction based on this function selects a superior feature subset for representing the text and building a categorizer, thereby increasing the accuracy and recall of text categorization.
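The abstract does not state the improved mutual information formula or the new evaluation function, so the following Python sketch shows only the two standard textbook quantities it builds on: pointwise mutual information between a feature and a category, and the entropy of the category distribution among documents that contain the feature. The function names and the toy data are hypothetical. A lower entropy means the feature is concentrated in fewer categories, which corresponds to a closer relation between feature and category.

```python
import math
from collections import Counter, defaultdict


def feature_statistics(labeled_docs):
    """Collect counts from (set_of_terms, category) pairs."""
    n_docs = len(labeled_docs)
    cat_count = Counter(cat for _, cat in labeled_docs)
    term_count = Counter()
    joint = defaultdict(Counter)  # joint[term][category] = document count
    for terms, cat in labeled_docs:
        for term in set(terms):
            term_count[term] += 1
            joint[term][cat] += 1
    return n_docs, cat_count, term_count, joint


def pointwise_mi(term, cat, stats):
    """Standard MI(t, c) = log( P(t, c) / (P(t) * P(c)) )."""
    n_docs, cat_count, term_count, joint = stats
    co_occurrences = joint.get(term, Counter())[cat]
    if co_occurrences == 0:
        return float("-inf")  # term never appears in this category
    p_tc = co_occurrences / n_docs
    p_t = term_count[term] / n_docs
    p_c = cat_count[cat] / n_docs
    return math.log(p_tc / (p_t * p_c))


def category_entropy(term, stats):
    """Entropy H(C | t) = -sum_c P(c | t) * log P(c | t)."""
    _, _, term_count, joint = stats
    total = term_count[term]
    if total == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return sum(-(c / total) * math.log(c / total)
               for c in joint[term].values())


# Toy corpus: hypothetical data for illustration only.
corpus = [
    ({"goal", "team"}, "sports"),
    ({"goal", "match"}, "sports"),
    ({"stock", "market"}, "finance"),
    ({"team", "market"}, "finance"),
]
stats = feature_statistics(corpus)
```

In this toy corpus "goal" occurs only in sports documents, so its category entropy is zero, while "team" is split evenly across both categories and has the maximum entropy for two classes.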

Description

Technical field

[0001] The invention relates to the technical field of text mining, in particular to a feature extraction method for text classification based on improved mutual information and entropy.

Background technique

[0002] With the development of computer technology and the spread of the Internet, we live in an information age in which the volume of online text is growing rapidly. Manually screening texts for classification is no longer practical, and there is an urgent need for fast, efficient techniques for collecting data and organizing the required information; text classification technology arose from this need. Text classification refers to the process of assigning texts to the corresponding predefined categories according to their content under a given classification system. The process of text classification essentially identifies the pattern features of the text, and the key technologies include text preprocessing, feature extr...

Application Information

IPC(8): G06F17/27
Inventor 成卫青, 唐旋, 范恒亮, 杨庚, 梁胜
Owner NANJING UNIV OF POSTS & TELECOMM