A MapReduce parallelized big data text classification method

A text classification technology, applied in the computer field, that addresses the poor classification performance and low discrimination of existing methods and improves both discrimination and classification performance.

Active Publication Date: 2019-02-01

杭州亚龙智能科技有限公司



AI Technical Summary

Problems solved by technology

[0008] To overcome the poor classification performance and low discrimination of existing big data text classification methods, the present invention provides a MapReduce parallelized big data text classification method with good classification performance and high discrimination.



Embodiment Construction

[0028] The present invention will be further described below in conjunction with the accompanying drawings.

[0029] Referring to Figures 1 to 4, a MapReduce parallelized big data text classification method implements each step of text classification in parallel on the Hadoop platform, based on the Bayesian algorithm and the characteristics of the MapReduce programming model. Text classification comprises four steps: data preprocessing, feature extraction, text vector representation, and classification. The specific process is as follows:

[0030] Step 1: data preprocessing. This comprises two processes: word segmentation and stop word removal.

[0031] Step 2: feature extraction. The training data set is processed to select the feature items (words) that are the most discriminative and the most representative.

[0032] Step 3: implementation of the Bayesian algorithm. Classify the test data set using ...
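The steps above can be sketched in a few lines. The following is a minimal single-process illustration, not the patent's Hadoop implementation: it imitates the map and reduce phases with plain Python functions and uses a multinomial naive Bayes classifier with add-one smoothing, which is an assumption because the visible text only says "Bayesian algorithm". The stop word list, the whitespace tokenizer (a stand-in for real word segmentation), and all function names are hypothetical.

```python
import math
from collections import defaultdict

# Illustrative stop word list; a real system would use a full list.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def preprocess(text):
    """Step 1: whitespace tokenization plus stop word removal."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def map_train(label, text):
    """Mapper: emit one document count per document and one count per token."""
    yield ("__doc__", label), 1
    for word in preprocess(text):
        yield (label, word), 1

def reduce_counts(pairs):
    """Reducer: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def train(docs):
    """Run the map and reduce phases over (label, text) pairs."""
    pairs = [kv for label, text in docs for kv in map_train(label, text)]
    counts = reduce_counts(pairs)
    doc_counts = {k[1]: v for k, v in counts.items() if k[0] == "__doc__"}
    word_counts = {k: v for k, v in counts.items() if k[0] != "__doc__"}
    vocab = {word for _, word in word_counts}
    return doc_counts, word_counts, vocab

def classify(text, model):
    """Pick the label with the highest log posterior (add-one smoothing)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in doc_counts.items():
        total = sum(c for (lab, _), c in word_counts.items() if lab == label)
        score = math.log(n / n_docs)
        for word in preprocess(text):
            if word in vocab:  # words never seen in training are ignored
                count = word_counts.get((label, word), 0)
                score += math.log((count + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, invented for illustration only.
TRAINING_DOCS = [
    ("sports", "the team won the match"),
    ("sports", "a great goal and match"),
    ("tech", "the new phone is fast"),
    ("tech", "fast chip and new laptop"),
]
MODEL = train(TRAINING_DOCS)
```

Because each mapper call depends only on its own document and the reducer only sums per key, the counting stage could in principle be distributed across nodes, which is the property a MapReduce job relies on. The feature extraction of Step 2 is omitted here; every token that survives preprocessing is treated as a feature.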


Abstract

A MapReduce parallelized big data text classification method comprises the following steps. First, a benchmark data set for text classification is established and preprocessed, including word segmentation, stop word removal, and stemming; the benchmark data set is randomly divided into training texts and test texts, and a vector space model is used to build a text representation model of the data set. Second, based on the text representation model, CDMT is used to perform feature selection on the benchmark data set. Third, a Bayes classifier is trained on the benchmark data set to obtain a classification result. The invention provides a MapReduce parallelized big data text classification method with good classification performance and higher discrimination.
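The random split and the vector space model mentioned in the abstract can be illustrated as follows. This is a hedged sketch under stated assumptions: the function names, the 30% default test ratio, the fixed seed, and the use of raw term counts as vector weights are all hypothetical choices, since the abstract does not specify them. CDMT feature selection is not shown because the visible text does not define it.

```python
import random
from collections import Counter

def split_corpus(docs, test_ratio=0.3, seed=0):
    """Randomly divide documents into (training, test) lists."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def build_vocabulary(token_lists):
    """Collect every distinct term and fix an order for the vector axes."""
    return sorted({tok for toks in token_lists for tok in toks})

def to_vector(tokens, vocabulary):
    """Represent a document as term counts, one dimension per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]
```

For example, with the vocabulary `["a", "b", "c"]`, the token list `["b", "b", "c"]` maps to the vector `[0, 2, 1]`. A weighting scheme such as TF-IDF could replace the raw counts without changing the overall structure.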

Description

Technical field

[0001] The present invention relates to the field of computers, and more specifically to machine learning and methods for classifying big data text.

Background

[0002] With the popularity of Internet applications, transmitting information online has become more convenient, and the amount of information on the Internet is growing at an unprecedented rate. Studying text classification methods is therefore highly significant. In the past, classification was done manually; accuracy was high, but efficiency was low. Because manual classification relies on personal experience, different people may classify the same data differently, and even the same person may classify it differently on different occasions. Given the huge amount of data on the Internet today, handing the classification work over to people is clearly unrealistic. Therefore, aut...


Application Information


Patent Type & Authority: Patents (China)

IPC(8): G06F16/35

CPC: G06F16/353

Inventor: 朱信忠, 徐慧英, 赵建民, 陈远超

Owner: 杭州亚龙智能科技有限公司
