Characteristic extraction method of text classification on the basis of mutual information

Feature extraction and mutual information technology, applied in fields such as special data processing applications, instruments, and electrical digital data processing, can address problems such as the difficulty of processing text.

Inactive Publication Date: 2016-06-22

SYSU CMU SHUNDE INT JOINT RES INST +1


Problems solved by technology

The difficulty of text classification is that text content is natural language, which is hard for computers to process semantically.

Embodiment Construction

[0025] Specific embodiments of the present invention will be described below.

[0026] The feature extraction method for text classification based on mutual information provided by the present invention comprises the following steps:

[0027] 1) Use web crawlers to collect a number of articles in various categories from the Internet as the training data set for the text classification system;

[0028] 2) Preprocess the training text: segment the training data set into words. The word segmentation tool used is Jieba (literally "stutter" segmentation), an open-source Chinese word segmentation module written in Python. Stop words are then filtered out according to a stop-word lexicon, and the Jieba module is used to tag the part of speech of each word in the segmented text.

[0029] 3) Extract features from the preprocessed text: from the text preprocessed in step 2), keep only the words whose part of speech is noun or verb; this is the initial feature extraction. Calculate the remaining te...
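Steps 2) and 3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation. The input is hand-written (word, flag) pairs standing in for part-of-speech-tagged output; in practice such pairs could come from Jieba's `jieba.posseg.cut`, which is not imported here. The helper names `initial_features` and `mutual_information`, the sample data, and the rule that noun flags start with `n` and verb flags with `v` are all hypothetical. Because the source text is cut off before it states how the remaining terms are scored, the formula used, log(P(t, c) / (P(t) P(c))) over document frequencies, is an assumption based on the common definition of mutual information between a term t and a category c.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

TaggedDoc = List[Tuple[str, str]]  # (word, part-of-speech flag) pairs


def initial_features(doc: TaggedDoc, stop_words: Set[str]) -> List[str]:
    """Drop stop words, then keep only nouns and verbs.

    Assumption: noun flags start with 'n' and verb flags start with 'v'.
    """
    return [
        word
        for word, flag in doc
        if word not in stop_words and flag.startswith(("n", "v"))
    ]


def mutual_information(
    docs: List[List[str]], labels: List[str]
) -> Dict[Tuple[str, str], float]:
    """Score every observed (term, category) pair.

    Assumed formula: log(P(t, c) / (P(t) * P(c))), with probabilities
    estimated from document counts (a term counts once per document).
    """
    n_docs = len(docs)
    term_df: Counter = Counter()   # documents containing the term
    joint_df: Counter = Counter()  # documents of a category containing the term
    class_df = Counter(labels)     # documents per category
    for words, label in zip(docs, labels):
        for term in set(words):
            term_df[term] += 1
            joint_df[(term, label)] += 1
    return {
        (term, label): math.log(n_tc * n_docs / (term_df[term] * class_df[label]))
        for (term, label), n_tc in joint_df.items()
    }


# Hypothetical tagged training documents and labels.
stop_words = {"是", "的", "了"}
tagged_docs: List[TaggedDoc] = [
    [("足球", "n"), ("是", "v"), ("比赛", "vn"), ("的", "uj")],
    [("球队", "n"), ("赢", "v"), ("比赛", "vn")],
    [("股票", "n"), ("上涨", "v"), ("了", "ul")],
    [("股票", "n"), ("比赛", "vn")],
]
labels = ["sports", "sports", "finance", "finance"]

docs = [initial_features(doc, stop_words) for doc in tagged_docs]
scores = mutual_information(docs, labels)

# Print pairs from highest to lowest score.
for (term, label), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{label}\t{score:.3f}")
```

In this toy data, a word that occurs in both categories (比赛) scores lower for either category than a word confined to one category (股票), which is the behavior a mutual-information filter relies on to discard noisy features.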


Abstract

The invention discloses a feature extraction method for text classification based on mutual information. Text preprocessing mainly comprises the following steps: removing document markup, removing stop words, word segmentation, part-of-speech tagging, word frequency statistics, data cleaning and the like, after which feature words are extracted according to a feature algorithm. In the text classification stage, model parameters are trained on the vectorized training set with a support vector machine algorithm, and the text to be classified is then classified by the trained model. When the scheme of the invention is applied to feature extraction for text classification, it effectively avoids bringing noise features into the machine learning process, improves text classification precision, greatly reduces the size of the feature library, and lowers memory usage.
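The "vectorized training set" mentioned above can be illustrated with a small sketch. It assumes a simple term-count representation over a fixed list of selected feature words; the source does not state which weighting is used, and the function name `vectorize` and the sample words are hypothetical. Training the support vector machine itself is left out because it would need a third-party library.

```python
from collections import Counter
from typing import List


def vectorize(words: List[str], feature_words: List[str]) -> List[int]:
    """Map a tokenized document to term counts over a fixed feature vocabulary."""
    counts = Counter(words)
    # Words outside the feature vocabulary are ignored; absent features get 0.
    return [counts[feature] for feature in feature_words]


feature_words = ["足球", "比赛", "股票"]  # hypothetical selected feature words
document = ["股票", "上涨", "股票", "比赛"]  # hypothetical segmented document
vector = vectorize(document, feature_words)
print(vector)  # [0, 1, 2]
```

A smaller feature vocabulary gives shorter vectors, which is consistent with the reduced feature library size and memory usage described above.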

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing, and specifically relates to a feature extraction method for text classification based on mutual information.

Background art

[0002] With the rapid development of Internet, multimedia and storage technologies, more and more information (especially multimedia information) is being generated, disseminated and accumulated. The Internet makes information dissemination easier, and individual users can find and download the information they want very conveniently. Larger hard drives can store more information. Even leaving aside the information resources on the World Wide Web, the files accumulated on a single PC may amount to tens of gigabytes. How to manage and use this information effectively and conveniently is a major problem for individual users. According to statistics, although there is more and more multimedia information on the Internet, in the foreseeable future, text inf...


Application Information


IPC(8): G06F17/27, G06K9/62

CPC: G06F40/289, G06F18/2411

Inventors: 赵秉新, 印鉴

Owner: SYSU CMU SHUNDE INT JOINT RES INST
