Text classification method and device based on Hadoop

A text classification technology, applied in special data processing applications, instruments, and electrical digital data processing, that addresses unbalanced classifier training data. It overcomes the unbalanced distribution of training data, improves training efficiency, and improves upload speed.

Active Publication Date: 2014-05-21

云宏信息科技股份有限公司

Cites: 2; Cited by: 25


AI Technical Summary

Problems solved by technology

[0008] The purpose of the present invention is to provide a Hadoop-based text classification method and device that can solve the problem of unbalanced data for training classifiers.

Embodiment Construction

[0034] In the following, the present invention will be further described in conjunction with the drawings and specific embodiments.

[0035] As shown in Figure 1, a Hadoop-based text classification method comprises the following steps:

[0036] Step S1: perform word segmentation on the training text and save each segmented text into a corresponding text file in a training data set. Open-source word segmentation packages such as IK and ICTCLAS can be used to segment the Chinese training text automatically and to remove punctuation and stop words. Stop words here are words that appear frequently but carry no practical meaning, such as "and", "的", and "得". The entries obtained after segmentation are separated by spaces and written to the local training data set. For example, the sentence "explain machine learning concept" will become three words "explain", "machine learning" and "concept" af...
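The post-processing described in Step S1 (dropping punctuation and stop words, then joining entries with spaces) might look roughly like the sketch below. It assumes the text has already been split into entries by a segmenter such as IK or ICTCLAS; the function names `clean_tokens` and `to_training_line`, the stop-word set, and the punctuation set are hypothetical and are not taken from the patent.

```python
# Hypothetical stop-word list; the source only names "and", "的", "得" as examples.
STOP_WORDS = {"and", "的", "得"}

# Hypothetical punctuation set covering common ASCII and Chinese marks.
PUNCTUATION = set(",.!?;:\"'()，。！？；：、“”‘’（）")


def clean_tokens(tokens):
    """Drop empty, punctuation-only, and stop-word entries from a segmented text."""
    kept = []
    for token in tokens:
        token = token.strip()
        if not token:
            continue  # skip whitespace-only entries
        if all(ch in PUNCTUATION for ch in token):
            continue  # skip entries made only of punctuation
        if token.lower() in STOP_WORDS:
            continue  # skip stop words
        kept.append(token)
    return kept


def to_training_line(tokens):
    """Join the surviving entries with single spaces, as Step S1 describes."""
    return " ".join(clean_tokens(tokens))


# Illustrative input: a real Chinese entry has no internal space, so the
# English stand-in "machine_learning" is written as a single entry here.
segmented = ["explain", "的", "machine_learning", "，", "concept", "。"]
print(to_training_line(segmented))
```

Each resulting line would then be written to one text file of the training data set; the file layout itself is not specified in the visible part of the source.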

Abstract

The invention relates to a text classification method and device based on Hadoop. The method comprises the following steps: word-segmented texts are stored in a training data set in which every class contains the same number of texts; the text files of the training data set are written into a SequenceFile; a MapReduce module counts the entries contained in the texts and the classes of the entries; a TF-IDF weighting module calculates the TF-IDF value of each entry of each text; each text is converted into a one-dimensional vector for a Bayes classifier according to the TF-IDF values; the text length of each text file is counted and the one-dimensional vectors are weighted according to those lengths; a classification model is obtained; and the text to be classified is classified with the classification model. The method can solve the problem of unbalanced data in traditional classifier training.
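The TF-IDF and length-weighting steps of the abstract can be illustrated with a small single-machine sketch. It is not the patent's MapReduce implementation, and the visible text does not give the formulas, so two assumptions are made and marked in the code: IDF is taken as log(N / df), and the length weight is taken as the average text length divided by the text's own length. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of entry lists. Returns (vocabulary, length-weighted TF-IDF vectors)."""
    if not docs:
        return [], []
    n_docs = len(docs)

    # Document frequency: in how many texts each entry appears.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    vocab = sorted(df)

    avg_len = sum(len(tokens) for tokens in docs) / n_docs
    vectors = []
    for tokens in docs:
        if not tokens:
            vectors.append([0.0] * len(vocab))  # an empty text maps to the zero vector
            continue
        counts = Counter(tokens)
        # ASSUMPTION: the length weight scales long texts down and short texts up.
        length_weight = avg_len / len(tokens)
        # ASSUMPTION: TF is the raw count and IDF is log(N / df).
        vectors.append(
            [counts[w] * math.log(n_docs / df[w]) * length_weight for w in vocab]
        )
    return vocab, vectors


docs = [
    ["hadoop", "cluster", "hadoop"],
    ["bayes", "classifier"],
    ["hadoop", "bayes", "text", "text"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

In a Hadoop job the document-frequency counting would typically be split into a map phase that emits entries and a reduce phase that sums them; the sketch keeps everything in memory only to show the arithmetic.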

Description

Technical field

[0001] The present invention relates to text classification technology.

Background art

[0002] The Naive Bayes classification algorithm is widely used in classification tasks because of its high accuracy and because it is easy to understand and implement. It is currently one of the most widely used text classification methods.

[0003] In recent years, with the development of information technology, document classification has shown new characteristics, mainly in the following three aspects. First, massive amounts of new data need to be processed every day, usually at the TB level or above, and the amount of data is growing rapidly. Second, the existing data that can be used to train classifiers is often imbalanced. Not only is there an imbalance between different categories of training data, but there is also an imbalance between di...
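The abstract states that the training data set holds the same number of texts for every class. The visible text does not say how that balance is reached; the sketch below assumes one common approach, random downsampling of every class to the size of the smallest class. The function name `balance_by_downsampling` and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(labeled_texts, seed=0):
    """labeled_texts: list of (label, text) pairs.

    Returns a list in which every label appears the same number of times.
    ASSUMPTION: balance is achieved by downsampling to the smallest class.
    """
    by_label = defaultdict(list)
    for label, text in labeled_texts:
        by_label[label].append(text)
    if not by_label:
        return []

    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    balanced = []
    for label in sorted(by_label):
        for text in rng.sample(by_label[label], target):
            balanced.append((label, text))
    return balanced


data = [
    ("sports", "text a"),
    ("sports", "text b"),
    ("sports", "text c"),
    ("finance", "text d"),
]
print(balance_by_downsampling(data))
```

Downsampling discards data from the larger classes; oversampling the smaller classes is an alternative with the opposite trade-off.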

Application Information

IPC(8): G06F17/30

CPC: G06F16/355

Inventor: 万睿, 张国强, 谢浩安

Owner: 云宏信息科技股份有限公司
