Text feature extracting method based on inter-class distinctness and intra-class high representation degree

A feature extraction technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems that existing methods cannot select feature words with high category representativeness and require a large amount of calculation, and it achieves simple calculation with improved computing speed and efficiency.

Active Publication Date: 2016-08-24
CHENGDU WANGAN TECH DEV

AI Technical Summary

Problems solved by technology

[0005] To solve the problems that existing feature selection methods for text classification cannot select feature words with high category representativeness and require a large amount of calculation, the present invention provides a text feature extraction method based on inter-class distinctness and intra-class high representation degree that requires only a small amount of calculation.



Examples


Embodiment 1

[0076] Assume that there are three preset categories, A, B, and C, each containing 10 articles. Suppose feature word 1 appears in 5 of the 10 articles in category A, and likewise in 5 of the 10 articles in each of categories B and C. The distribution of the remaining feature words across the categories is shown in Table 4 below:

[0077]-[0078]

                  Category A      Category B      Category C
                  (10 articles)   (10 articles)   (10 articles)
Feature word 1          5               5               5
Feature word 2          2               8               9
Feature word 3         10               3               1
Feature word 4          5               2               7
Feature word 5          1               6               8
Feature word 6          2               7               3

[0079] Table 4

[0080] According to Table 4, calculate the correlation R between each word and each predetermin...
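
As a first step toward working with Table 4, the counts can be converted into per-category document fractions. The excerpt is cut off before R is defined, so the sketch below does not compute R. It only computes the share of each category's articles that contain each feature word, which is an assumed reading of the "intra-class distribution rate" mentioned in the abstract. The names `TABLE_4`, `DOCS_PER_CATEGORY`, and `distribution_rate` are illustrative and do not come from the patent.

```python
# Counts from Table 4: for each feature word, the number of articles
# in each category (A, B, C) that contain it.
TABLE_4 = {
    "feature word 1": {"A": 5, "B": 5, "C": 5},
    "feature word 2": {"A": 2, "B": 8, "C": 9},
    "feature word 3": {"A": 10, "B": 3, "C": 1},
    "feature word 4": {"A": 5, "B": 2, "C": 7},
    "feature word 5": {"A": 1, "B": 6, "C": 8},
    "feature word 6": {"A": 2, "B": 7, "C": 3},
}

# Each category holds 10 articles in Embodiment 1.
DOCS_PER_CATEGORY = 10


def distribution_rate(counts, docs_per_category=DOCS_PER_CATEGORY):
    """Fraction of each category's articles containing the word.

    This definition is an assumption; the excerpt does not give the
    patent's own formula.
    """
    return {cat: n / docs_per_category for cat, n in counts.items()}


rates = {word: distribution_rate(counts) for word, counts in TABLE_4.items()}

for word, per_category in rates.items():
    print(word, per_category)
```

Under this reading, feature word 1 has the same fraction in every category, so it says nothing about which category an article belongs to, while feature word 3 is concentrated in category A.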



Abstract

The invention discloses a text feature extraction method based on inter-class distinctness and intra-class high representation degree. The method comprises the following steps: preprocessing the training-set text; calculating the class distinctness of each feature word with an improved feature selection method, so as to select feature words that are more representative of their class and highly distinct across classes; and further screening the selected feature words by their intra-class distribution rate and information gain (IG). Feature selection is thus carried out twice, yielding feature words with high intra-class information entropy and a high intra-class distribution rate, which improves classification efficiency and accuracy. The calculation is also simple, which improves text classification speed and accuracy.
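
The second screening step relies on information gain. A minimal sketch of the standard textbook definition, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t), is given below, using document counts in the style of Table 4 from Embodiment 1. It is not the patent's improved method, whose formulas are not included in this excerpt. The function names `entropy` and `information_gain` are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(doc_counts, docs_per_category):
    """Standard information gain of a term over the categories.

    doc_counts[c]        -- documents in category c that contain the term
    docs_per_category[c] -- total documents in category c
    """
    total = sum(docs_per_category.values())
    with_term = sum(doc_counts.values())
    without_term = total - with_term

    # Entropy of the category distribution before observing the term.
    h_c = entropy([n / total for n in docs_per_category.values()])

    # Conditional entropies given term present / term absent.
    h_present = 0.0
    if with_term:
        h_present = entropy([doc_counts[c] / with_term for c in doc_counts])
    h_absent = 0.0
    if without_term:
        h_absent = entropy(
            [(docs_per_category[c] - doc_counts[c]) / without_term
             for c in doc_counts]
        )

    return (h_c
            - (with_term / total) * h_present
            - (without_term / total) * h_absent)


sizes = {"A": 10, "B": 10, "C": 10}
ig_word_1 = information_gain({"A": 5, "B": 5, "C": 5}, sizes)
ig_word_3 = information_gain({"A": 10, "B": 3, "C": 1}, sizes)
print(ig_word_1, ig_word_3)
```

A word that is spread evenly over the categories, such as feature word 1, leaves the category distribution unchanged and so scores essentially zero, while a word concentrated in one category, such as feature word 3, scores higher.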

Description

Technical field

[0001] The invention belongs to the technical field of text mining, and in particular relates to a text feature extraction method based on inter-class distinctness and intra-class high representation degree.

Background

[0002] With the rapid growth of Internet information resources, text classification has emerged as an important means of organizing and managing text information so that the required information and resources can be found quickly and effectively. Text classification assigns the texts to be processed to one or more predefined categories according to their content or attributes. In this field, texts are now commonly represented with the vector space model (VSM). To avoid the curse of dimensionality in the feature items generated when the VSM space is built, the feature selection algorithm becomes particularly important.

[0003] The...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 黄筱聪, 朱永强
Owner: CHENGDU WANGAN TECH DEV