Text classification feature screening method based on feature distribution information

A feature screening and text classification technology, applied in special data processing applications, instruments, electrical digital data processing, etc.

Inactive Publication Date: 2013-05-15
NORTHWESTERN POLYTECHNICAL UNIV
4 Cites 30 Cited by

AI Technical Summary

Problems solved by technology

[0021] The invention aims to overcome the poor accuracy of existing text classification feature screening methods.




Embodiment Construction

[0036] The concrete steps of the inventive method are as follows:

[0037] 1. Concepts related to the present invention.

[0038] Tf*idf (term frequency-inverse document frequency): a statistical measure of how important a word is to a document within a document set or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
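The definition above can be illustrated with a minimal sketch. It assumes the common textbook form, relative term frequency multiplied by log(N / df); the patent's own normalized variant is not shown in this excerpt, so the function name `tf_idf` and the toy documents are illustrative assumptions only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: tf*idf} dict per tokenized document.

    Assumption: tf is the relative frequency of the term in the document
    and idf is log(N / df). This is the textbook form, not necessarily
    the normalized variant used in the patent.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus of three tokenized documents.
docs = [
    ["stock", "market", "stock"],
    ["market", "news"],
    ["football", "news"],
]
scores = tf_idf(docs)
```

In this toy corpus, "stock" scores higher than "market" in the first document because it occurs more often there and appears in fewer documents overall.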

[0039] Intra-class distribution: the distribution of a feature word across the documents of a given class. If the word is evenly distributed over the documents of the class, its intra-class dispersion in that class is low; conversely, if it is concentrated in a few documents and absent from the others, its intra-class dispersion in that class is high.
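One way to make this notion concrete is sketched below. The excerpt does not give the patent's dispersion formula, so this sketch assumes a stand-in: the coefficient of variation (population standard deviation divided by the mean) of the word's relative frequency across the documents of one class. The function name `intra_class_dispersion` is hypothetical.

```python
import statistics

def intra_class_dispersion(term, class_docs):
    """Coefficient of variation of a term's relative frequency in one class.

    Assumption: this is an illustrative stand-in, not the patent's formula.
    """
    # Relative frequency of the term in each non-empty document of the class.
    freqs = [doc.count(term) / len(doc) for doc in class_docs if doc]
    mean = statistics.mean(freqs) if freqs else 0.0
    if mean == 0.0:
        # Term absent from the class: report 0 by convention.
        return 0.0
    return statistics.pstdev(freqs) / mean

# Hypothetical classes: "a" is spread evenly in one, concentrated in the other.
even_class = [["a", "b"], ["a", "c"], ["a", "d"]]
concentrated_class = [["a", "a"], ["b", "c"], ["d", "e"]]

even = intra_class_dispersion("a", even_class)
concentrated = intra_class_dispersion("a", concentrated_class)
```

Under this assumed measure, the evenly spread word yields a dispersion of zero, while the word confined to a single document yields a larger value, matching the low and high cases described above.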

[0040] Inter-class distribution: refers ...


Abstract

The invention discloses a text classification feature screening method based on feature distribution information. The method addresses the technical problem that existing text classification feature screening methods have poor accuracy. The technical scheme is as follows: first preprocess each document in the document set; represent the whole document set as a vector space model (VSM); construct a feature dictionary; for each class Ci, count the document frequency DF(t, Ci) of documents containing feature t; calculate the normalized tf*idf value of the feature in each class Ci, and then calculate its intra-class dispersion D_Intra in each class Ci and its average inter-class dispersion D_InterAvg; calculate the weight wi(tk) of each feature tk of the text feature space in each class Ci; and sort all features in descending order of their weight over the whole document set, preferentially keeping the top-ranked features during feature screening. By applying feature distribution information to the feature screening process, the method improves the efficiency and accuracy of text classification.
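Two of the steps in the abstract, counting the per-class document frequency DF(t, Ci) and keeping the top-ranked features after a descending sort, can be sketched as follows. The weight formula itself is not given in this excerpt, so the sketch takes the weights as an input; the function names, class labels, and weight values are hypothetical.

```python
from collections import Counter, defaultdict

def class_document_frequency(docs, labels):
    """DF(t, Ci): number of documents in class Ci that contain feature t."""
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        # set() ensures each feature is counted once per document.
        df[label].update(set(tokens))
    return df

def select_top_features(weights, k):
    """Sort features by weight in descending order and keep the first k."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:k]

# Hypothetical tokenized documents and their class labels.
docs = [
    ["stock", "market", "stock"],
    ["stock", "news"],
    ["football", "news"],
]
labels = ["finance", "finance", "sports"]
df = class_document_frequency(docs, labels)

# Placeholder weights; in the method these would come from the
# tf*idf and dispersion calculations, whose formulas are not shown here.
weights = {"stock": 0.9, "news": 0.1, "football": 0.8, "market": 0.4}
kept = select_top_features(weights, 2)
```

Passing the weights in as a plain dictionary keeps the ranking step independent of how the weights are computed, so any weighting scheme can be plugged in.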

Description

Technical field

[0001] The invention relates to a text classification feature screening method, in particular to a text classification feature screening method based on feature distribution information.

Background technique

[0002] With the development of information and network technology, a large number of electronic documents such as news, emails, and Weibo posts are generated on the Internet every day. As a method for efficiently classifying and managing large numbers of documents, automatic text classification has been widely used in many fields.

[0003] With the explosive growth of information volume, one of the main problems faced by automatic text classification is how to deal with the high-dimensional text vector feature space generated by a large amount of text data. An excessively large text vector feature space has two adverse effects on text classification methods: (1) Many mature methods cannot be optimized in high-dimensional space, and thus cannot be applied to te...


Application Information

IPC(8): G06F17/30
Inventor: 李思男, 李战怀, 李宁
Owner NORTHWESTERN POLYTECHNICAL UNIV