Text classification method and device based on Hadoop

A text classification technology, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the problem of unbalanced classifier training data, with the effects of overcoming the unbalanced distribution of training data, improving training efficiency, and improving upload speed.

Active Publication Date: 2014-05-21
云宏信息科技股份有限公司
2 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

[0008] The purpose of the present invention is to propose a text classification method ...



Embodiment Construction

[0034] In the following, the present invention will be further described in conjunction with the drawings and specific embodiments.

[0035] As shown in Figure 1, a Hadoop-based text classification method comprises the following steps:

[0036] Step S1: perform word segmentation on the text used for training, and save each segmented text to a corresponding text file in a training data set. Open-source word segmentation packages such as IK and ICTCLAS can be used to segment the Chinese training text automatically and to remove punctuation and stop words. Stop words here are words that appear frequently but carry no practical meaning, such as "and", "的", and "得". The entries obtained after word segmentation are separated by spaces and written to the local training data set. For example, the sentence "explain machine learning concept" becomes the three entries "explain", "machine learning", and "concept" af...
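The clean-up part of Step S1 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the text has already been segmented (for example by IK or ICTCLAS), and the stop-word list, the punctuation set, the function names, and the Chinese sample tokens are all hypothetical.

```python
# Hypothetical stop-word list; a real system would load a much larger one.
STOP_WORDS = {"and", "的", "得"}
# Hypothetical punctuation set covering common ASCII and Chinese marks.
PUNCTUATION = set(",.;:!?\"'()，。；：！？、（）")


def clean_tokens(tokens):
    """Drop punctuation and stop words from an already segmented token list."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]


def to_training_line(tokens):
    """Join the remaining entries with single spaces, as Step S1 describes."""
    return " ".join(clean_tokens(tokens))


# Assumed segmenter output for a sentence meaning "explain the concept of machine learning".
segmented = ["解释", "机器学习", "的", "概念", "。"]
print(to_training_line(segmented))
```

Each resulting line would then be written to the text file that corresponds to the source document in the training data set.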


Abstract

The invention relates to a Hadoop-based text classification method and device. The method comprises the following steps: word-segmented texts are stored in a training data set in which each class contains the same number of texts; the text files of the training data set are written into a SequenceFile; a MapReduce module counts the entries contained in the texts and the classes of those entries; a TF-IDF weighting module calculates the TF-IDF value of each entry of each text; each text is converted into a one-dimensional vector for a Bayes classifier according to the TF-IDF values; the text length of each text file is counted, and the one-dimensional vectors are weighted according to those lengths; a classification model is obtained; and the text to be classified is classified with the classification model. The method addresses the unbalanced training data encountered when training a traditional classifier.
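The TF-IDF step can be illustrated with a small single-process sketch. It is not a MapReduce job and not the patent's exact formula: the function name `tf_idf_vectors` is hypothetical, the textbook definition idf = log(N / df) is assumed, and dividing the term count by the document length is only one possible reading of "weighting according to the lengths".

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc)
        vec = {}
        for term, count in counts.items():
            tf = count / length                # assumption: length-normalised term frequency
            idf = math.log(n_docs / df[term])  # assumption: textbook IDF, no smoothing
            vec[term] = tf * idf
        vectors.append(vec)
    return vectors


docs = [["a", "b"], ["a", "c", "c"]]
vectors = tf_idf_vectors(docs)
```

A term that occurs in every document, such as "a" here, gets a weight of zero under this unsmoothed IDF, which is why practical systems often add smoothing.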

Description

Technical field

[0001] The present invention relates to text classification technology.

Background art

[0002] The Naive Bayesian classification algorithm is widely used in classification tasks because of its high accuracy and because it is easy to understand and implement. It is currently one of the most widely used text classification methods.

[0003] In recent years, with the development of information technology, document classification has shown new characteristics, mainly in the following three aspects. First, massive amounts of new data need to be processed every day, usually at the TB level or above, and the amount of data is growing rapidly. Second, the existing data available for training classifiers is often imbalanced. Not only is there an imbalance between different categories of training data, but there is also an imbalance between di...
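The abstract requires that each class in the training data set contain the same number of texts, but this excerpt does not say how that is achieved. One common approach, shown here purely as an assumption, is to downsample every class to the size of the smallest one. The function name `balance_by_downsampling` and the sample labels are hypothetical.

```python
import random
from collections import defaultdict


def balance_by_downsampling(samples, seed=0):
    """samples: list of (label, text) pairs. Returns a list with equal counts per label."""
    by_label = defaultdict(list)
    for label, text in samples:
        by_label[label].append(text)
    if not by_label:
        return []
    # Every class is cut down to the size of the smallest class.
    target = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    balanced = []
    for label, texts in by_label.items():
        for text in rng.sample(texts, target):
            balanced.append((label, text))
    return balanced


samples = [("sports", "a"), ("sports", "b"), ("sports", "c"), ("tech", "d")]
balanced = balance_by_downsampling(samples)
```

Downsampling discards data from the larger classes; oversampling the smaller classes is an alternative with the opposite trade-off.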


Application Information

IPC(8): G06F17/30
CPC: G06F16/355
Inventor: 万睿, 张国强, 谢浩安
Owner: 云宏信息科技股份有限公司