Characteristic extraction method of text classification on the basis of mutual information

A feature extraction and mutual information technology, applied in special data processing applications, instruments, electrical digital data processing and related fields, that can solve problems such as the difficulty of processing text.

Inactive Publication Date: 2016-06-22
SYSU CMU SHUNDE INT JOINT RES INST +1
10 Cites, 34 Cited by

AI Technical Summary

Problems solved by technology

The difficulty of text classification is that the content of text is natural


Examples


Embodiment Construction

[0025] Specific embodiments of the present invention will be described below.

[0026] The feature extraction method for text classification based on mutual information provided by the present invention comprises the following steps:

[0027] 1) Use a web crawler to collect a number of articles in each category from the Internet as the training data set for the text classification system;

[0028] 2) Preprocess the training text: segment the training data set into words. The word segmentation tool used is Jieba (结巴, "stutter") segmentation, an open-source Chinese word segmentation module written in Python. Afterwards, filter out stop words according to the stop-word lexicon, and use the Jieba module to tag the part of speech of each word in the segmented text.

[0029] 3) Extract features from the preprocessed text: from the text preprocessed in step 2), keep only the words whose parts of speech are nouns or verbs; this is the initial feature extraction. Calculate the remaining te...
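The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text of step 3) is cut off, so the exact scoring formula of the invention is not visible here, and the sketch substitutes the standard pointwise mutual information between a term and a class, log(P(t, c) / (P(t) P(c))), estimated from document counts. The input is assumed to be already segmented and tagged (for example, pairs like those produced by Jieba's part-of-speech tagger); the names `STOP_WORDS`, `KEPT_POS_PREFIXES`, `initial_features`, `mutual_information` and `select_features` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical inputs: each document is (label, [(word, pos_flag), ...]),
# i.e. text that has already been segmented and part-of-speech tagged.
STOP_WORDS = {"的", "了", "是"}      # illustrative stop-word lexicon
KEPT_POS_PREFIXES = ("n", "v")      # assumption: noun/verb flags start with n or v


def initial_features(tagged_tokens):
    """Drop stop words and keep only nouns and verbs."""
    return [
        word for word, flag in tagged_tokens
        if word not in STOP_WORDS and flag.startswith(KEPT_POS_PREFIXES)
    ]


def mutual_information(docs):
    """Score each (term, class) pair with pointwise mutual information."""
    n_docs = len(docs)
    class_count = Counter()              # documents per class
    term_count = Counter()               # documents containing the term
    joint_count = defaultdict(Counter)   # term -> class -> documents
    for label, tagged_tokens in docs:
        class_count[label] += 1
        for term in set(initial_features(tagged_tokens)):
            term_count[term] += 1
            joint_count[term][label] += 1
    scores = {}
    for term, per_class in joint_count.items():
        for label, joint in per_class.items():
            # log(P(t, c) / (P(t) * P(c))) with document-frequency estimates
            scores[(term, label)] = math.log(
                joint * n_docs / (term_count[term] * class_count[label])
            )
    return scores


def select_features(docs, top_k):
    """Keep the top_k terms ranked by their maximum score over classes."""
    best = {}
    for (term, _label), score in mutual_information(docs).items():
        best[term] = max(score, best.get(term, float("-inf")))
    return sorted(best, key=lambda t: (-best[t], t))[:top_k]
```

Ranking each term by its maximum score over classes is one common convention; averaging over classes is another, and the source does not say which (if either) the invention uses.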

Abstract

The invention discloses a feature extraction method for text classification based on mutual information. Text preprocessing mainly comprises the following steps: removing document markers, removing stop words, word segmentation, part-of-speech tagging, word-frequency statistics, data cleaning and the like, followed by extracting feature words according to a feature extraction algorithm. In the text classification stage, model parameters are trained on the vectorized training set with a support vector machine algorithm, and the text to be classified is then classified by machine learning. Applying the scheme of the invention effectively avoids bringing noise features into the machine learning process during feature extraction for text classification, improves text classification precision, greatly reduces the size of the feature library, and lowers memory usage.
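As a rough illustration of the "vectorized training set" mentioned in the abstract, the sketch below turns segmented documents into term-frequency vectors over a list of selected feature words. The source does not specify the weighting scheme or any library, so plain term counts, the function name `vectorize`, and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def vectorize(tokens, vocabulary):
    """Map a token list to a term-frequency vector ordered like vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical feature words and segmented documents, for illustration only.
vocabulary = ["球队", "股票", "上涨"]
training_docs = [
    (["球队", "比赛", "球队"], "sports"),
    (["股票", "上涨"], "finance"),
]
X = [vectorize(tokens, vocabulary) for tokens, _ in training_docs]
y = [label for _, label in training_docs]
```

Vectors and labels in this shape could then be handed to a support vector machine implementation, for example scikit-learn's `LinearSVC` via `fit(X, y)`; that choice of library is an assumption, not something the source states.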

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and specifically relates to a feature extraction method for text classification based on mutual information.

Background technique

[0002] With the rapid development of Internet, multimedia and storage technologies, more and more information (especially multimedia information) is being generated, disseminated and accumulated. The Internet makes information dissemination easier, and individual users can find and download the information they want very conveniently. Larger hard drives can store more information. Even without counting the information resources on the World Wide Web, the files accumulated on a single PC may amount to tens of gigabytes. How to manage this information effectively and use it conveniently is a major problem for individual users. According to statistics, although there is more and more multimedia information on the Internet, in the foreseeable future, text inf...


Application Information

IPC(8): G06F17/27, G06K9/62
CPC: G06F40/289, G06F18/2411
Inventor: 赵秉新, 印鉴
Owner: SYSU CMU SHUNDE INT JOINT RES INST