A large-scale data distributed clustering processing method based on MapReduce

A distributed clustering technology for large-scale data, applied in special data processing applications, database models, relational databases, and similar fields. It addresses the reduced overall efficiency of parallel clustering, time-consuming similarity calculation, and high computational overhead, and achieves more efficient parallel clustering, fewer clustering iterations, and fast convergence.

Active Publication Date: 2019-06-25
北京点为信息科技有限公司

AI Technical Summary

Problems solved by technology

Canopy-Kmeans combines the K-Means method with the Canopy method: it uses Canopy's calculation of object similarity to preprocess the data. Its advantage is that it supplies initial cluster center points, which helps avoid falling into a local optimum; its disadvantage is that calculating the distances (similarity) between objects is relatively time-consuming.
The method based on data density calculates the density of all data and then selects the data with the highest density as the cluster center points. This avoids the problems of random selection and is more accurate, but the traditional calculation is costly and can easily overload the clustering nodes, reducing the overall efficiency of parallel clustering.




Embodiment Construction

[0037] The specific implementation manners of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0038] As shown in Figure 1, the Hadoop distributed cluster environment in this embodiment has 3 servers, which constitute 3 nodes: one master node (Master), which issues commands and distributes tasks, and 2 slave nodes (slaves), which receive the tasks distributed by the master node and process them as the master node requests. All nodes are connected through high-speed Ethernet. The master node starts the entire cluster environment in response to the user's application request. The slave nodes and the master node together form the main body of the Hadoop distributed cluster's parallel system and are responsible for the processing and operation of the entire cluster. As shown in Figure 2, in this embodiment: 1) receive data to be processed according to u...
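The division of labor between the master node and the slave nodes can be illustrated with one K-Means round written in map/reduce style. This is only an in-process sketch, not the patent's Hadoop job: the function names `map_assign`, `reduce_update`, and `kmeans_round` are hypothetical, each inner list of `partitions` stands in for the data held by one slave node, and plain Python function calls stand in for task distribution by the master.

```python
from collections import defaultdict
from math import dist  # Python 3.8+


def map_assign(partition, centers):
    """Map step (one slave's share of the data): emit (nearest-center index, point)."""
    for point in partition:
        nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
        yield nearest, point


def reduce_update(pairs, centers):
    """Reduce step: replace each center with the mean of the points assigned to it."""
    groups = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    new_centers = list(centers)  # a center with no points keeps its old position
    for idx, pts in groups.items():
        new_centers[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new_centers


def kmeans_round(partitions, centers):
    """One round as the master would coordinate it: map every partition, then reduce."""
    pairs = [pair for part in partitions for pair in map_assign(part, centers)]
    return reduce_update(pairs, centers)


# Two partitions, as if split across the 2 slave nodes (toy data, not from the patent).
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centers = kmeans_round(partitions, [(1.0, 1.0), (9.0, 9.0)])
print(centers)
```

In a real Hadoop deployment the map calls would run on the slave nodes and only the per-center partial results would travel back over the network; the sketch collapses that into a single process to keep the data flow visible.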


Abstract

The present invention provides a MapReduce-based distributed clustering processing method for large-scale data, which comprises: sampling the large-scale data according to an equal-scale, non-repetition principle; inputting the sampled data into a MapReduce distributed parallel framework and calculating the local density and the average density of the sampled data; finding all sampled data whose local density is greater than the average density to serve as the candidate point set for the initial cluster center points of each cluster, and feeding the candidate point set back to a master node, wherein adjacent candidate points whose distance from each other is greater than twice a set range are selected as the initial cluster center points; using the MapReduce distributed parallel framework to perform the parallel clustering task, wherein the average value of the distance between the data is calculated for each cluster in order to update the cluster center points; having the child nodes apply an error sum of squares criterion function to determine whether to continue iterating; and having the child nodes cluster the large-scale data according to the cluster center points. The invention thereby implements parallel clustering, reducing the number of clustering iterations while increasing clustering accuracy and the efficiency of parallel clustering.
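The seeding steps in the abstract can be sketched in a few lines of single-machine Python. Several details here are assumptions rather than the patent's definitions: local density is taken to be the number of other points within `radius` (the "set range"), the "greater than twice the set range" rule is applied greedily in order of decreasing density, and the names `sample_without_repetition`, `local_densities`, `select_initial_centers`, and `sse` are hypothetical.

```python
import random
from math import dist  # Python 3.8+


def sample_without_repetition(data, fraction, seed=0):
    """Draw a fixed fraction of the data without replacement."""
    rng = random.Random(seed)
    return rng.sample(data, max(1, int(len(data) * fraction)))


def local_densities(points, radius):
    """Assumed density: number of other points within `radius` of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def select_initial_centers(points, radius):
    """Keep points denser than average, then pick ones more than 2*radius apart."""
    dens = local_densities(points, radius)
    average = sum(dens) / len(dens)
    ranked = sorted(zip(points, dens), key=lambda pd: -pd[1])  # stable sort
    candidates = [p for p, d in ranked if d > average]
    centers = []
    for p in candidates:  # greedy pass: an assumption, see lead-in
        if all(dist(p, c) > 2 * radius for c in centers):
            centers.append(p)
    return centers


def sse(points, centers):
    """Error sum of squares: squared distance of each point to its nearest center."""
    return sum(min(dist(p, c) for c in centers) ** 2 for p in points)


# Toy data (not from the patent): two tight groups and one outlier.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
initial = select_initial_centers(points, radius=1.5)
print(initial, sse(points, initial))
```

On the toy data the outlier has no neighbours, so its density falls below the average and it never becomes a candidate; the iteration would then stop once the change in `sse` between rounds drops below a chosen tolerance.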

Description

Technical field

[0001] The invention belongs to the technical field of parallel clustering, in particular to a large-scale data distributed clustering processing method based on MapReduce.

Background

[0002] With the rapid development of information technology, the scale of data continues to increase, and the use of parallel mechanisms to effectively mine and analyze large-scale data sets can promote the development and progress of Internet technology. Cluster analysis is an important data processing technology, and one of the important topics in the field of machine learning and artificial intelligence. It is widely used in data mining, information retrieval and other research. The main task is to divide the data set into multiple subsets, so that the similarity between the data objects in the subset is high, and the difference between the data objects in different subsets is relatively large. Due to the increase of data scale, the traditional stand-alone cluste...


Application Information

Patent Type & Authority Patents(China)
IPC IPC(8): G06F16/26
CPC G06F16/285
Inventor 高天寒, 孔雪
Owner 北京点为信息科技有限公司