Big data text classifying method based on cloud computing

A text classification and big data technology, applied in text database clustering/classification, calculation, electrical digital data processing, etc. It addresses the problems of limited training data and limited calculation speed, improving the efficiency and accuracy of the classifier.

Inactive Publication Date: 2015-06-10
INNER MONGOLIA UNIV OF SCI & TECH
Cites: 3; Cited by: 30

AI Technical Summary

Problems solved by technology

[0008] The present invention provides a large data text classification method based on cloud computing...

Embodiment Construction

[0033] In order to make the objects and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the examples. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0034] As shown in Figure 1, the embodiment of the present invention provides a cloud computing-based big data text classification method, comprising the following steps:

[0035] S1, the preprocessing stage. This stage uploads local files to the HDFS distributed file system, formats small files and merges them into one large file, assigns a global ID to each text, and performs word segmentation, stop word removal, and per-file word frequency statistics.

[0036] (1) File upload and merge

[0037] Upload the text set to be processed from the local file system to the distributed file system HDFS. Hadoop defines an abstract ...
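The preprocessing functions listed in S1 can be illustrated with a small local sketch. It is not the patented Hadoop implementation: it runs in memory rather than on HDFS, the whitespace-style tokenizer stands in for word segmentation, and the names `STOP_WORDS`, `merge_small_files`, `tokenize`, and `preprocess` are hypothetical, as is the stop-word list itself.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def merge_small_files(files):
    """Merge (name, text) pairs into one list of (global_id, text) records.

    Stands in for merging many small files into one large file; each text
    is flattened to a single line and given a global ID.
    """
    return [(doc_id, text.replace("\n", " "))
            for doc_id, (_, text) in enumerate(files)]


def tokenize(text):
    """Lowercase and split on non-letters (a stand-in for word segmentation)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


def preprocess(files):
    """Return {global_id: Counter of term frequencies} after stop-word removal."""
    result = {}
    for doc_id, text in merge_small_files(files):
        tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
        result[doc_id] = Counter(tokens)
    return result


files = [("a.txt", "The cat sat in the hat"), ("b.txt", "A dog and a cat")]
counts = preprocess(files)
```

In a Hadoop setting, the per-document counting would typically be expressed as a map step keyed by the global ID, which is why the ID is assigned before tokenization in this sketch.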

Abstract

The invention discloses a cloud computing-based big data text classification method. The method comprises the following steps: preprocessing the training texts with class labels and those without class labels separately to obtain the corresponding training data sets; performing feature selection on each training data set to obtain the corresponding dimensionally reduced training data sets; weighting each reduced training data set according to the TF-IDF model and converting it into the corresponding one-dimensional vectors; applying the Bayesian algorithm to the one-dimensional vectors with class labels to obtain the prior probability of each class and the prior probability that each entry belongs to each class, thereby initializing the parameters of a Bayesian classifier; using the EM algorithm to optimize the parameters of the Bayesian classifier so as to obtain a classification model; and classifying the to-be-classified texts with the classification model. By combining the traditional naive Bayesian classification technique with Hadoop and the EM algorithm, the method alleviates the calculation speed and training data limitations encountered in practical applications and improves the efficiency and accuracy of the classifier.
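The initialize-then-refine idea in the abstract can be sketched as a semi-supervised multinomial naive Bayes trained with EM. This is a minimal single-machine illustration under assumptions, not the patented Hadoop implementation: documents are given as term-weight vectors (raw counts here, though TF-IDF weights could be substituted), Laplace smoothing with `alpha=1.0` is assumed, and the names `m_step`, `e_step`, `train_em_nb`, and `predict` are hypothetical.

```python
import math


def m_step(X, R, n_classes, alpha=1.0):
    """Estimate class priors and per-class term probabilities from soft labels R."""
    n_docs, n_terms = len(X), len(X[0])
    priors, cond = [], []
    for c in range(n_classes):
        weight = sum(R[d][c] for d in range(n_docs))
        priors.append((weight + alpha) / (n_docs + alpha * n_classes))
        term_mass = [sum(R[d][c] * X[d][t] for d in range(n_docs))
                     for t in range(n_terms)]
        total = sum(term_mass)
        cond.append([(m + alpha) / (total + alpha * n_terms) for m in term_mass])
    return priors, cond


def e_step(X, priors, cond):
    """Return posterior class probabilities for each document (computed in log space)."""
    R = []
    for x in X:
        log_p = [math.log(p) + sum(w * math.log(q) for w, q in zip(x, cq))
                 for p, cq in zip(priors, cond)]
        top = max(log_p)
        exp_p = [math.exp(v - top) for v in log_p]
        norm = sum(exp_p)
        R.append([v / norm for v in exp_p])
    return R


def train_em_nb(X_lab, y_lab, X_unl, n_classes, n_iter=10):
    """Initialize from labeled data, then refine with unlabeled data via EM."""
    R_lab = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in y_lab]
    priors, cond = m_step(X_lab, R_lab, n_classes)
    for _ in range(n_iter):
        R_unl = e_step(X_unl, priors, cond)            # E-step: soft-label unlabeled docs
        priors, cond = m_step(X_lab + X_unl, R_lab + R_unl, n_classes)  # M-step
    return priors, cond


def predict(x, priors, cond):
    """Return the index of the most probable class for one document vector."""
    post = e_step([x], priors, cond)[0]
    return max(range(len(post)), key=post.__getitem__)


# Toy vocabulary (illustrative only): [ball, goal, vote, law]
X_lab, y_lab = [[2, 1, 0, 0], [0, 0, 1, 2]], [0, 1]
X_unl = [[1, 2, 0, 0], [0, 0, 2, 1], [3, 0, 0, 0], [0, 0, 0, 3]]
priors, cond = train_em_nb(X_lab, y_lab, X_unl, n_classes=2)
```

Labeled documents keep their hard labels in every iteration, while the unlabeled documents contribute fractional counts weighted by their current posteriors; that is one common way to combine the two sets, chosen here for brevity.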

Description

Technical field

[0001] The invention relates to the field of data mining, in particular to a cloud computing-based big data text classification method.

Background art

[0002] The naive Bayesian classification algorithm has attracted attention for its simplicity and efficiency. Compared with other classification algorithms, it has lower time complexity and higher efficiency. It is widely used in various classification tasks and is currently one of the most widely used text classification methods.

[0003] In recent years, with the development of information technology, the task of document classification has shown new characteristics, which are mainly reflected in the following two aspects. First, data is growing explosively, and a large amount of information is represented by unstructured or semi-structured files such as text. Second, a good classifier needs a large number of labeled samples for training, but the information provided by the labeled samples may be subj...

Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventors: 孟海东, 肖银龙, 宋宇辰, 任敬佩
Owner: INNER MONGOLIA UNIV OF SCI & TECH