Spark platform based high efficiency text classification method

A high-efficiency text classification technology, applied in the field of big data processing. It addresses the problems of being unable to use PCs, low resource utilization, and increased network transmission, and achieves the effects of improving cluster resource utilization, promoting improvement of Spark platform based algorithms, and improving classification accuracy.

Inactive Publication Date: 2016-07-06
HUNAN UNIV
Cites: 4; cited by: 48

AI Technical Summary

Problems solved by technology

[0027] At present, most machine learning algorithms are still serial. When the amount of data is small, serial algorithms suffice; but with the advent of cloud computing and the era of big data, data is growing exponentially, and traditional serial algorithms clearly cannot meet the processing requirements. In addition, the resource utilization of earlier grid computing and parallel computing is not high, resulting in hig

Embodiment Construction

[0049] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0050] Referring to Figure 1, the high-efficiency text classification method based on the Spark platform in this embodiment comprises the following steps:

[0051] Step 101: Build the HDFS file system and the Spark platform on virtual machines hosted on the physical server, and upload the data set to the HDFS file system.

[0052] Step 102: Submit jobs to the Spark cluster through the client. Spark reads data from the HDFS file system, converts the input data into a resilient distributed dataset (RDD), and starts a certain number of partitions according to the number of partitions in the RDD set by th...
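
The partition-based processing described in Step 102 can be illustrated with a minimal sketch. It is plain Python and does not use Spark or HDFS; it only mimics the pattern of splitting input records into a user-chosen number of partitions, running one task per partition, and merging the partial results. The function names, the round-robin assignment, and the sample documents are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter
from typing import List


def split_into_partitions(records: List[str], num_partitions: int) -> List[List[str]]:
    """Distribute records round-robin across num_partitions lists (assumed policy)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions: List[List[str]] = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def count_words_in_partition(partition: List[str]) -> Counter:
    """Per-partition task: count lowercased, whitespace-separated tokens."""
    counts: Counter = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials: List[Counter]) -> Counter:
    """Combine the per-partition results into one global count."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical input standing in for lines read from a distributed file system.
documents = [
    "spark reads data from hdfs",
    "hdfs stores the data set",
    "spark runs tasks on partitions",
]

partitions = split_into_partitions(documents, num_partitions=2)
word_counts = merge_counts([count_words_in_partition(p) for p in partitions])
```

In this sketch each partition is processed independently, so the per-partition tasks could in principle run in parallel on different workers; only the final merge needs all partial results.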

Abstract

The present invention provides a Spark based high-efficiency text classification method. The method comprises: constructing an HDFS file system and a Spark platform with virtual machines on a physical server, and uploading a data set into the HDFS file system; enabling the Spark platform to read data from the HDFS file system, converting the data into an RDD, and storing the RDD in memory; dividing all tasks into different stages, and then running each task; preprocessing the RDD; performing training; and testing a classification model. The method compensates for the defects of the naive Bayes model and further improves processing speed; it promotes data mining and machine learning, and the conversion from conventional data mining algorithms to parallel data mining algorithms; it improves the classification precision of the Bayes algorithm; it promotes the improvement of Spark platform based algorithms; and finally, it improves cluster resource utilization.
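
The preprocess, train, and test stages named in the abstract can be sketched with a standard multinomial naive Bayes classifier. This is a minimal plain-Python illustration using add-one (Laplace) smoothing; it is not the improved Bayes variant of the invention, whose details are not given in this text, and it does not run on Spark. The tokenizer, the function names, and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Preprocessing step (assumed): lowercase and split on whitespace."""
    return text.lower().split()


def train(samples: List[Tuple[str, str]]) -> Dict:
    """Collect per-class document counts, per-class word counts, and the vocabulary."""
    class_doc_counts: Counter = Counter()
    class_word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, text in samples:
        tokens = tokenize(text)
        class_doc_counts[label] += 1
        class_word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "class_doc_counts": class_doc_counts,
        "class_word_counts": class_word_counts,
        "vocabulary": vocabulary,
    }


def predict(model: Dict, text: str) -> str:
    """Return the class with the highest log posterior under add-one smoothing."""
    total_docs = sum(model["class_doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label in sorted(model["class_doc_counts"]):
        word_counts = model["class_word_counts"][label]
        total_words = sum(word_counts.values())
        score = math.log(model["class_doc_counts"][label] / total_docs)
        for token in tokenize(text):
            if token not in model["vocabulary"]:
                continue  # words never seen in training are ignored
            score += math.log((word_counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model: Dict, samples: List[Tuple[str, str]]) -> float:
    """Testing step: fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for label, text in samples if predict(model, text) == label)
    return correct / len(samples)


# Toy data, invented for illustration only.
training_set = [
    ("sports", "the team won the match"),
    ("sports", "a great goal in the final match"),
    ("tech", "the cluster runs spark jobs"),
    ("tech", "spark reads data from the cluster"),
]
test_set = [
    ("tech", "spark cluster data"),
    ("sports", "goal in the match"),
]

model = train(training_set)
test_accuracy = accuracy(model, test_set)
```

The training statistics are plain counts, which is what makes this family of models convenient to parallelize: counts computed on separate data partitions can be summed to give the same model as a single serial pass.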

Description

Technical field

[0001] The invention relates to the technical field of big data processing, in particular to a high-efficiency text classification method based on the Spark platform.

Background

[0002] With the rapid development of information technology and the increasingly widespread use of the Internet, the Internet has become the most important source of information. Especially with the advent of the era of cloud computing and big data, the data on the Internet is growing exponentially. This data has the following characteristics: large volume, high dimensionality, complex and irregular structure, and a large amount of noise, yet it holds considerable commercial value. Facing such huge and complex information, how to quickly organize, manage, and utilize it, and how to mine valuable information from it, are very important challenges.

[0003] Most of the data today is stored on the Internet in the form of text. Text classification technology is an important basis fo...

Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventors: 唐卓, 鲁彬, 李肯立, 李巧巧, 陈建国, 熊燎特
Owner: HUNAN UNIV