A MapReduce parallelized big data text classification method

A text classification technology, applied in the computer field, that addresses the poor classification performance and low discrimination of existing methods and improves both.

Active Publication Date: 2019-02-01
杭州亚龙智能科技有限公司

AI Technical Summary

Problems solved by technology

[0008] To overcome the poor classification performance and low discrimination of existing big data text classification methods, the present invention provides a MapReduce parallelized big data text classification method with good classification performance and high discrimination.


Examples

Embodiment Construction

[0028] The present invention will be further described below in conjunction with the accompanying drawings.

[0029] Referring to Figures 1 to 4, a MapReduce parallelized big data text classification method implements each step of text classification in parallel on the Hadoop platform, based on the Bayesian algorithm and the characteristics of the MapReduce programming model. Text classification comprises four steps: data preprocessing, feature extraction, text vector representation, and classification. The specific process is as follows:

[0030] Step 1: data preprocessing. This comprises two processes: word segmentation and stop-word removal.

[0031] Step 2: feature extraction. The training data set is processed to select the feature items (words) that are most discriminative and most representative.

[0032] Step 3: implementation of the Bayesian algorithm. Classify the test data set using ...
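As a rough illustration of how the preprocessing in Step 1 fits the MapReduce model, the sketch below simulates a map, shuffle, and reduce pass in a single Python process rather than on Hadoop. It is not the patent's implementation: the stop-word list, the regex tokenizer (a stand-in for real word segmentation), and the function names `tokenize`, `map_phase`, `reduce_phase`, and `run_job` are all assumptions made for this example. The job counts how often each term occurs in each class, which is the kind of statistic a Bayesian classifier later needs.

```python
import re
from itertools import groupby

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words.

    This is a stand-in for real word segmentation, which the source
    does not describe in this excerpt.
    """
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def map_phase(label, text):
    """Mapper: emit ((label, term), 1) for every token that survives preprocessing."""
    for term in tokenize(text):
        yield (label, term), 1


def reduce_phase(key, values):
    """Reducer: sum the counts emitted for one (label, term) key."""
    return key, sum(values)


def run_job(documents):
    """Simulate map -> shuffle/sort -> reduce in one process."""
    mapped = [pair for label, text in documents for pair in map_phase(label, text)]
    mapped.sort(key=lambda kv: kv[0])  # sorting plays the role of the shuffle
    return dict(
        reduce_phase(key, (count for _, count in group))
        for key, group in groupby(mapped, key=lambda kv: kv[0])
    )


# Toy labeled documents, invented for illustration only.
docs = [
    ("sports", "The team won the match"),
    ("sports", "A match of the season"),
    ("finance", "The bank raised the rate"),
]
counts = run_job(docs)
```

On a real cluster the mapper and reducer would run as separate tasks over input splits; the single-process version only shows the data flow between the two phases.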


Abstract

A MapReduce parallelized big data text classification method comprises the following steps. First, a benchmark data set for text classification is established and preprocessed, including word segmentation, stop-word removal, and stemming; the benchmark data set is randomly divided into training text and test text, and a vector space model is used to build a text representation model of the data set. Second, based on the text representation model, CDMT is used to perform feature selection on the benchmark data set. Third, a Bayes classifier is trained on the benchmark data set to obtain the classification result. The invention provides a MapReduce parallelized big data text classification method with good classification performance and high discrimination.
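The training and classification step can be sketched with a small multinomial naive Bayes classifier. This is a minimal single-process sketch under stated assumptions, not the patented method: the multinomial model and add-one (Laplace) smoothing are choices made for this example, the functions `train` and `classify` are hypothetical names, and the toy documents are invented. The CDMT feature selection step is left out because this excerpt does not define it, so every term is kept as a feature.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Collect the counts a multinomial naive Bayes model needs.

    labeled_docs is an iterable of (label, tokens) pairs, where tokens
    is a list of already preprocessed terms.
    """
    class_docs = Counter()               # number of documents per class
    term_counts = defaultdict(Counter)   # term frequencies per class
    vocab = set()
    for label, tokens in labeled_docs:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify(tokens, model):
    """Return the class with the highest log posterior (add-one smoothing assumed)."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        denom = sum(term_counts[label].values()) + len(vocab)
        for term in tokens:
            if term in vocab:  # terms never seen in training are ignored
                score += math.log((term_counts[label][term] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training text, invented for illustration only.
training = [
    ("sports", ["team", "won", "match"]),
    ("sports", ["match", "season"]),
    ("finance", ["bank", "raised", "rate"]),
    ("finance", ["rate", "market"]),
]
model = train(training)
prediction = classify(["match", "team"], model)
```

Working in log space avoids numeric underflow when documents contain many terms. The per-class counts collected in `train` are sums over documents, which is why this kind of training is commonly expressed as a MapReduce counting job.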

Description

Technical field

[0001] The present invention relates to the field of computers, and more specifically to machine learning and methods for classifying big data text.

Background

[0002] With the spread of Internet applications, transmitting information online has become more convenient, and the amount of information on the Internet is growing at an unprecedented rate. Studying text classification methods is therefore of great significance. In the past, classification was done manually; accuracy was high but efficiency was low. Because manual classification relies on personal experience, different people may classify the same data differently, and even the same person may classify it differently each time. Given the huge amount of data on the Internet today, handing classification over to manual work is clearly unrealistic. Therefore, aut...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/35
CPC: G06F16/353
Inventor: 朱信忠, 徐慧英, 赵建民, 陈远超
Owner: 杭州亚龙智能科技有限公司