Feature item selection and weight calculation based text classification method

A feature item selection and weight calculation technology, applied in computing, special data processing applications, instruments, etc., that addresses problems such as misjudgment and improves the efficiency and accuracy of text classification.

Status: Inactive. Publication date: 2013-02-13
UNIV OF ELECTRONIC SCI & TECH OF CHINA
3 Cites, 72 Cited by

AI Technical Summary

Problems solved by technology

[0008] It can be seen that the traditional TF-IDF only considers the distribution of feature items in the text set, but ignores the distribution ratio of feature items between different text ca

Examples


Embodiment

[0067] The present invention is described below with a simplified embodiment.

[0068] In this embodiment, the resources are video text resources. Text introductions and text annotations of video resources are downloaded by web crawlers from major websites, giving 9 video texts in total. The video text data are analyzed and organized, then classified into 3 categories to form the corpus training set. Each video text in the training set is segmented with a word segmentation tool, stop words are removed, and the word frequency of each feature item is counted.
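
The preprocessing in [0068] can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes the texts have already been segmented into tokens, and the stop-word list, the function names, and the toy data are all hypothetical.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"the", "a", "of", "and", "in"}


def term_frequencies(tokens, stop_words=STOP_WORDS):
    """Count how often each non-stop-word token occurs in one text."""
    return Counter(t for t in tokens if t not in stop_words)


def corpus_term_frequencies(corpus):
    """Map each text id to its term-frequency Counter.

    `corpus` maps a text id (e.g. "T11") to an already segmented token list.
    """
    return {text_id: term_frequencies(tokens) for text_id, tokens in corpus.items()}


# Made-up toy data, not taken from Table 1.
corpus = {
    "T11": ["the", "match", "goal", "goal", "team"],
    "T21": ["a", "film", "actor", "film"],
}
freqs = corpus_term_frequencies(corpus)
print(freqs["T11"]["goal"])  # 2
```

A `Counter` returns 0 for unseen feature items, which is convenient when filling a word frequency table such as Table 1.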

[0069] Table 1 is a statistical table of word frequency of feature items of video text.

[0070]

[0071] Table 1

[0072] Here, T11~T13 are the three texts of category 1, T21~T23 are the three texts of category 2, and T31~T33 are the three texts of category 3. t1, t2, t3, t4, t5 are some of the feature items in the T11~T33 text set. Analyzing the feature items in Table 1, the weight distributio...

Abstract

The invention discloses a text classification method based on feature item selection and weight calculation. For a corpus training set obtained through analysis and organization, the method starts from the traditional TF-IDF (term frequency-inverse document frequency) feature item weight and adjusts it in two ways: the CHI (chi-square) statistic accounts for the relevance between feature items and the different categories, and the information entropy of each feature item accounts for how evenly the feature item is distributed within a category. Feature items are then evaluated and selected according to their inter-category weight difference. The texts in the training set and the text to be classified are each expressed as vectors over the feature items in the resulting feature subspace, and the category of the text to be classified is determined by calculating its similarity to the texts in the training set. Because feature item selection and weight calculation consider both the inter-category and intra-category distribution of feature items on top of TF-IDF, feature items are selected more accurately and the feature dimensionality is effectively reduced, which improves the efficiency and accuracy of text classification.
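
The two statistics named in the abstract can be illustrated with their standard textbook definitions. The sketch below is an assumption-laden illustration only: the extract does not give the patent's actual adjustment formula, so the code computes the usual chi-square statistic between a term and a category and the Shannon entropy of a term's frequencies across the texts of one category, without combining them into a weight. Function names and data are hypothetical.

```python
import math


def chi_square(docs, labels, term, category):
    """Standard chi-square statistic between a term and a category.

    docs: list of token lists; labels: parallel list of category labels.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_category = label == category
        if has_term and in_category:
            a += 1  # in category, contains term
        elif has_term:
            b += 1  # other category, contains term
        elif in_category:
            c += 1  # in category, lacks term
        else:
            d += 1  # other category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def intra_category_entropy(counts):
    """Shannon entropy (in bits) of a term's frequencies over one category's texts.

    Higher values mean the term is spread more evenly inside the category.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(x / total) * math.log2(x / total) for x in counts if x > 0)


# Made-up toy data: two categories with two texts each.
docs = [["goal", "team"], ["goal"], ["film"], ["film", "actor"]]
labels = [1, 1, 2, 2]
print(chi_square(docs, labels, "goal", 1))  # 4.0
```

A term that occurs in every text of one category and nowhere else, like "goal" here, gets the largest chi-square value for that category, while a term concentrated in a single text of a category has entropy 0.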

Description

Technical field

[0001] The invention belongs to the technical field of text classification in information resource management, and specifically relates to a text classification method based on feature item selection and weight calculation.

Background technique

[0002] Amid the explosive growth of Internet information resources, text is the most widely used form of information, because text is the carrier of information and most other forms of information (images, sounds) can be annotated with text. To discover information and resources quickly and effectively, text classification technology has emerged as an important means of organizing and managing text information.

[0003] Given a set of classification categories, text classification assigns texts to one or more predefined categories according to their content or attributes.

[0004] At present, the main text representation method used in the field of text classification is VSM (Vector Space Model), that is, after...
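
As a rough illustration of the vector space model with traditional TF-IDF weights, the sketch below turns token lists into sparse vectors and labels a new text by its most similar training text under cosine similarity. It makes several assumptions that the extract does not confirm: the weight is the plain `tf * log(N / df)` form, the decision rule is a single nearest neighbor, and all names and data are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(train_tokens):
    """Return (per-text TF-IDF dicts, idf dict) for a list of token lists."""
    n = len(train_tokens)
    df = Counter()
    for tokens in train_tokens:
        df.update(set(tokens))  # document frequency: one count per text
    idf = {t: math.log(n / count) for t, count in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in train_tokens
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify(tokens, vectors, idf, labels):
    """Assign the label of the most similar training text (assumed 1-NN rule)."""
    query = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    best = max(range(len(vectors)), key=lambda i: cosine(query, vectors[i]))
    return labels[best]


# Made-up toy training set.
train = [["goal", "team", "goal"], ["film", "actor"], ["song", "singer", "song"]]
labels = ["sports", "film", "music"]
vectors, idf = tfidf_vectors(train)
print(classify(["goal", "match"], vectors, idf, labels))  # sports
```

Note that with this plain weight a term occurring in every training text gets weight 0 regardless of how it is distributed across categories, which is the kind of limitation of traditional TF-IDF that the invention sets out to address.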

Application Information

IPC(8): G06F17/30, G06F17/21
Inventor 孙健, 梁雪芬, 艾丽丽, 隆克平, 徐杰, 王晓丽, 张毅, 姚洪哲, 李乾坤, 陈小英, 陈旭
Owner UNIV OF ELECTRONIC SCI & TECH OF CHINA