Improved feature selection method for text classification

A feature selection method for text classification, applied in special data processing applications and electrical digital data processing. It addresses the poor accuracy and weakly representative features of existing feature selection methods, improving classification accuracy and avoiding the curse of dimensionality.

Active Publication Date: 2016-08-24
CHENGDU WANGAN TECH DEV

AI Technical Summary

Problems solved by technology

[0010] To address the shortcomings of existing feature selection methods for text classification, such as poor accuracy and weakly representative features, the present invention proposes a text classification method based on improved feature selection.


Examples

Embodiment 1

[0053] Step 1: Use a web crawler or manual collection to obtain a number of representative articles in multiple fields from the Internet. Analyze and organize these articles and group them by category into a training corpus, which serves as the training sample set for the text classification system.

[0054] Segment the acquired text and remove stop words.
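The preprocessing step above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes English text and uses a regular expression in place of a real word segmenter (Chinese text would need a dedicated segmentation tool), and the `preprocess` function and `STOP_WORDS` list are hypothetical names introduced here.

```python
import re

# Hypothetical stop-word list for illustration only; a real system would
# load a much larger, language-specific list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Split text into lowercase word tokens and drop stop words."""
    # Assumption: a simple regex stands in for a real word segmenter.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]


tokens = preprocess("The price of the stock is rising in the market")
print(tokens)
```

The order of the surviving words is preserved here; later steps that only need word presence can convert the list to a set.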

[0055] Suppose the training set $E$ contains 3 categories $C_1$, $C_2$, $C_3$. The training set can then be expressed as:

[0056] $\{E \mid \{C_1 \mid d_{11}, d_{12}, d_{13}, \ldots\}, \{C_2 \mid d_{21}, d_{22}, d_{23}, \ldots\}, \{C_3 \mid d_{31}, d_{32}, d_{33}, \ldots\}\}$

[0057] After text preprocessing, the training set becomes:

[0058] $\{E \mid \{C_1 \mid t_{11}, t_{12}, t_{13}, \ldots\}, \{C_2 \mid t_{21}, t_{22}, t_{23}, \ldots\}, \{C_3 \mid t_{31}, t_{32}, t_{33}, \ldots\}\}$

[0059] where $t_{ij}$ represents the set of words left after word segmentation and stop-word removal are applied to the text $d_{ij}$ ($i = 1, 2, 3$; $j = 1, 2, \ldots$).
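One possible way to hold this nested structure in code is a mapping from category to a list of per-document word sets. The sketch below is only an illustration: the toy documents, the `STOP_WORDS` list, and the names `E`, `T`, and `to_word_set` are assumptions made here, and whitespace splitting stands in for a real word segmenter.

```python
# Hypothetical toy training set E: category label -> list of raw documents d_ij.
E = {
    "C1": ["the match ended in a draw", "the team won the final match"],
    "C2": ["the stock market fell", "the bank raised the interest rate"],
    "C3": ["the new phone has a fast chip"],
}

STOP_WORDS = {"the", "a", "in", "has"}  # illustrative only


def to_word_set(document, stop_words=STOP_WORDS):
    """Return t_ij: the set of words left after splitting and stop-word removal."""
    return {word for word in document.lower().split() if word not in stop_words}


# Same nesting as E, with each document d_ij replaced by its word set t_ij.
T = {category: [to_word_set(doc) for doc in docs] for category, docs in E.items()}

for category, word_sets in T.items():
    print(category, [sorted(ws) for ws in word_sets])
```

Storing each $t_{ij}$ as a set matches the description of a "set of words" and makes later document-frequency counting a simple membership test.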

[0060] Step 2: Suppose there are only three w...

Abstract

The invention discloses an improved feature selection method for text classification. The method comprises the steps of: obtaining a training set of texts; performing word segmentation and stop-word removal on the training set; dividing the full word set into a low-frequency set and a high-frequency set according to the document frequency of each feature word; performing feature selection on the low-frequency word set using the information gain value, and on the high-frequency word set using an improved χ² statistic; and combining the feature words from the two parts to form the final classification feature word set. Because feature selection is carried out twice, more representative classification feature words can be selected, which improves classification efficiency and accuracy.
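The two-stage selection described in the abstract might be sketched as below. This is a hedged illustration under several assumptions: the excerpt does not give the patent's improved χ² formula, so the standard χ² statistic (maximised over categories) is used in its place; information gain is the textbook definition; and the function names, the `df_threshold` split point, and the `k_low` and `k_high` cut-offs are all hypothetical.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(word, docs, labels):
    """IG(w) = H(C) - P(w) H(C|w) - P(not w) H(C|not w)."""
    n = len(docs)
    prior = Counter(lab for lab, _ in docs)
    with_w = Counter(lab for lab, ws in docs if word in ws)
    without_w = Counter(lab for lab, ws in docs if word not in ws)
    n_w = sum(with_w.values())
    return (entropy([prior[l] for l in labels])
            - n_w / n * entropy([with_w[l] for l in labels])
            - (n - n_w) / n * entropy([without_w[l] for l in labels]))


def chi_square(word, docs, labels):
    """Standard chi-square statistic, maximised over categories.

    Assumption: stands in for the patent's improved statistic.
    """
    n = len(docs)
    best = 0.0
    for label in labels:
        a = sum(1 for lab, ws in docs if lab == label and word in ws)
        b = sum(1 for lab, ws in docs if lab != label and word in ws)
        c = sum(1 for lab, ws in docs if lab == label and word not in ws)
        d = n - a - b - c
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        if denom:
            best = max(best, n * (a * d - c * b) ** 2 / denom)
    return best


def select_features(docs, df_threshold, k_low, k_high):
    """Split words by document frequency, rank each part, merge the top words."""
    labels = sorted({lab for lab, _ in docs})
    df = Counter(w for _, ws in docs for w in set(ws))
    low = [w for w in df if df[w] < df_threshold]
    high = [w for w in df if df[w] >= df_threshold]
    # Ties are broken alphabetically so the result is deterministic.
    top_low = sorted(low, key=lambda w: (-information_gain(w, docs, labels), w))[:k_low]
    top_high = sorted(high, key=lambda w: (-chi_square(w, docs, labels), w))[:k_high]
    return set(top_low) | set(top_high)


# Toy corpus: (category, word set) pairs, invented for illustration.
docs = [
    ("sport", {"match", "team", "win"}),
    ("sport", {"match", "goal", "team"}),
    ("finance", {"stock", "market", "team"}),
    ("finance", {"stock", "bank", "rate"}),
]
selected = select_features(docs, df_threshold=2, k_low=2, k_high=2)
print(sorted(selected))
```

In this toy corpus, "team" is frequent but appears in both categories, so its χ² score is lower than that of "match" and "stock" and it is dropped from the high-frequency part.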

Description

Technical field

[0001] The invention belongs to the technical field of text mining, and in particular relates to an improved feature selection method for text classification.

Background

[0002] With the development of information technology, the amount of information in the world is growing at an alarming rate. How to process a large number of text documents quickly and effectively has become a research hot spot. Traditional information retrieval technology can no longer meet people's growing needs, and text classification technology has emerged in response. Text classification can largely solve the problem of large, disorganized collections of text documents, and it helps people search, query, and filter document information, improving the efficiency with which information is used. Text classification is also an important means of text mining.

[0003] Text classification is to assign ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 朱永强, 黄筱聪
Owner: CHENGDU WANGAN TECH DEV