Text classification method based on chi square statistics and SMO algorithm

A text classification technology, applied in text database clustering/classification, computing, unstructured text data retrieval, etc., that addresses the problem of excessive features and noise and improves classification accuracy and efficiency.

Status: Inactive
Publication Date: 2014-08-20
SHANGHAI UNIV
3 Cites, 31 Cited by
AI Technical Summary

Problems solved by technology

[0007] The main purpose of the present invention is to address the deficiencies of the prior art by providing a text classification method based on chi-square statistics and the SMO algorithm. The method overcomes the excess of features and noise caused by using all words as features, and improves the accuracy and efficiency of text classification.

Examples

Embodiment Construction

[0034] The present invention will be further described below in conjunction with the accompanying drawings and specific examples.

[0035] As shown in Figure 1, the text classification method of the present invention, based on chi-square statistics and the SMO algorithm, comprises the following steps:

[0036] (1) Collect Internet texts and divide them into training texts and test texts: collect texts from the Internet and assign class labels; the texts that have been class-labeled serve as training texts, and the texts to be classified serve as test texts;

[0037] (2) Preprocess the training texts to obtain the training text vocabulary. As shown in Figure 2, the steps are as follows:

[0038] a) Open the training documents and perform word segmentation on each training text;

[0039] b) For each word in the training text, judge whether it is a Chinese character, a letter, or a number; if so, c...
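The character check described in step b) can be sketched as follows. This is only a minimal illustration under stated assumptions: the function names `is_valid_char`, `is_valid_word`, and `partition_words` are hypothetical, "Chinese character" is assumed to mean the CJK Unified Ideographs block (U+4E00 to U+9FFF), and "letter" and "number" are assumed to mean ASCII letters and digits. Because the source text is truncated before it states what happens to a word that passes the check, the sketch only separates passing words from failing ones and does not act on either group.

```python
def is_valid_char(ch: str) -> bool:
    """True if ch is a Chinese character, an ASCII letter, or an ASCII digit."""
    # Assumption: "Chinese character" = CJK Unified Ideographs block.
    if "\u4e00" <= ch <= "\u9fff":
        return True
    # Assumption: "letter" and "number" = ASCII letters and digits.
    return ch.isascii() and ch.isalnum()


def is_valid_word(word: str) -> bool:
    """True if the word is non-empty and every character passes the check."""
    return bool(word) and all(is_valid_char(ch) for ch in word)


def partition_words(words):
    """Split segmented words into those that pass the check and those that do not."""
    passed = [w for w in words if is_valid_word(w)]
    failed = [w for w in words if not is_valid_word(w)]
    return passed, failed


if __name__ == "__main__":
    # Hypothetical output of a word segmenter for one training text.
    segmented = ["文本", "分类", "SMO", "2014", "，", "a-b"]
    passed, failed = partition_words(segmented)
    print("passed:", passed)
    print("failed:", failed)
```

Word segmentation itself (step a) is not shown; any segmenter that returns a list of words per text could feed `partition_words`.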

Abstract

The invention discloses a text classification method based on chi-square statistics and the SMO algorithm. The method comprises the following steps: first, the training texts are preprocessed by word segmentation and stop-word removal, and the chi-square statistic is then used as the criterion for selecting a set number of words as feature words; next, the feature weights of the training texts and the test texts are computed respectively; the feature vectors of each training text and each test text are converted into training document vector models and test document vector models; finally, a trained classifier classifies the feature vectors of the test texts to obtain the classification result of each test text. With this method, the excess of features and noise caused by using all words as features can be overcome, and text classification accuracy and efficiency can be improved.
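The feature selection step summarized above can be sketched as follows. The page does not reproduce the patent's exact formula, so this sketch assumes the standard chi-square statistic commonly used for text feature selection: for a term t and class c, with A documents in c containing t, B documents outside c containing t, C documents in c without t, D documents outside c without t, and N = A + B + C + D, the score is N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)). Scoring each term by its maximum over classes and keeping the top k terms are also assumptions, and the names `chi_square` and `select_features` are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # Term occurs in every document or in none: no discriminative power.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Rank terms by chi-square and return the top k terms plus all scores.

    docs:   list of token lists (already segmented, stop words removed)
    labels: class label for each document
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()                   # documents containing each term
    class_doc_freq = defaultdict(Counter)  # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    scores = {}
    for term, df in doc_freq.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = class_doc_freq[label][term]  # in class, has term
            b = df - a                       # outside class, has term
            c = size - a                     # in class, lacks term
            d = n_docs - size - b            # outside class, lacks term
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best  # assumption: take the maximum over classes

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


if __name__ == "__main__":
    docs = [["goal", "match", "the"], ["goal", "team", "the"],
            ["stock", "market", "the"], ["stock", "price", "the"]]
    labels = ["sports", "sports", "finance", "finance"]
    selected, scores = select_features(docs, labels, k=2)
    print(selected)
```

In this toy corpus a word such as "the", which appears in every document, scores zero and is dropped, which illustrates how selection by chi-square reduces the feature count and the noise compared with using all words as features.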

Description

Technical field

[0001] The invention relates to the technical field of automatic computer processing of natural language, in particular to a text classification method based on chi-square statistics and the SMO algorithm.

Background technique

[0002] In recent years, with the rapid development and popularization of Internet technology, the electronic resource information on the network has increased dramatically. Faced with such a large amount of data, how to effectively organize and manage this massive information, and how to quickly and accurately obtain the information a user is really interested in, has become a major problem. Most network information is stored in the form of text, so the mining of text data has high potential value. As a typical text mining technology, text classification technology can organize and process a large amount of text information, facilitate information retrieval and analysis, and facilitate users to quickly and accurat...

Application Information

IPC(8): G06F17/30, G06F17/27
CPC: G06F16/35
Inventors: 武星, 裴孟齐
Owner: SHANGHAI UNIV