A large-scale data distributed clustering processing method based on MapReduce

A distributed clustering technology for large-scale data, applied in special data processing applications, database models, relational databases, etc. It addresses the problems of reduced overall parallel clustering efficiency, time-consuming similarity computation, and high computational overhead, and achieves improved parallel clustering efficiency, fewer clustering iterations, and fast convergence.

CN107291847B · Active · Publication Date: 2019-06-25 · 北京点为信息科技有限公司

Examples

Detailed Description of the Embodiments

[0037] Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0038] As shown in Figure 1, the Hadoop distributed cluster environment in this embodiment consists of 3 servers forming 3 nodes: one master node (Master), which issues commands and distributes tasks, and 2 slave nodes (Slave), which receive the tasks distributed by the master node and run them as the master node requests. All nodes are connected through high-speed Ethernet. The master node starts the entire cluster environment in response to the user's application request. Together, the slave nodes and the master node form the main body of the parallel system in the Hadoop distributed cluster environment and are responsible for all processing and operation of the cluster. As shown in Figure 2, in this embodiment: 1) receive data to be processed according to u...
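The division of labor between the master node and the two slave nodes can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: it does not use any Hadoop API, the function names (`split_for_workers`, `map_partition`, `reduce_partials`, `cluster`) are hypothetical, the center update is the usual k-means mean of the assigned points, and the stopping rule compares the sum of squared errors between consecutive iterations against an assumed tolerance `tol`.

```python
import math

def split_for_workers(data, n_workers=2):
    """Master side: deal the records round-robin into one split per slave."""
    return [data[i::n_workers] for i in range(n_workers)]

def map_partition(split, centers):
    """Slave side: assign each point to its nearest center and return
    per-cluster partial sums, counts, and squared error for this split."""
    dim = len(centers[0])
    partial = {}
    for p in split:
        k = min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
        sums, count, sse = partial.get(k, ([0.0] * dim, 0, 0.0))
        sums = [s + x for s, x in zip(sums, p)]
        partial[k] = (sums, count + 1, sse + math.dist(p, centers[k]) ** 2)
    return partial

def reduce_partials(partials, centers):
    """Master side: merge the partial results into new centers and a total SSE.
    The SSE is measured against the centers used for the assignment."""
    dim = len(centers[0])
    merged = {}
    for partial in partials:
        for k, (sums, count, sse) in partial.items():
            m_sums, m_count, m_sse = merged.get(k, ([0.0] * dim, 0, 0.0))
            merged[k] = (
                [a + b for a, b in zip(m_sums, sums)],
                m_count + count,
                m_sse + sse,
            )
    # A cluster that received no points keeps its previous center.
    new_centers = [
        tuple(s / merged[k][1] for s in merged[k][0]) if k in merged else centers[k]
        for k in range(len(centers))
    ]
    total_sse = sum(v[2] for v in merged.values())
    return new_centers, total_sse

def cluster(data, centers, tol=1e-6, max_iter=100):
    """Iterate map and reduce steps until the SSE stops changing (assumed rule)."""
    splits = split_for_workers(data)
    prev_sse = float("inf")
    sse = prev_sse
    for _ in range(max_iter):
        partials = [map_partition(s, centers) for s in splits]
        centers, sse = reduce_partials(partials, centers)
        if abs(prev_sse - sse) < tol:
            break
        prev_sse = sse
    return centers, sse
```

Only the small per-cluster partial sums travel from the slaves to the master in this sketch, which is the usual reason for aggregating locally before the reduce step.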

Abstract

The invention provides a large-scale data distributed clustering processing method based on MapReduce. The method includes: sampling the large-scale data in equal proportion without repetition; inputting the sampled data into the MapReduce distributed parallel framework and calculating the local density of each sampled data point and the average density; finding all sampled data points whose local density is greater than the average density, taking them as the set of candidate points for the initial cluster center of each cluster, and feeding the set back to the master node; selecting as initial cluster centers all candidate points for which the distance between every two adjacent candidate points is greater than 2 times the set range; using the MapReduce distributed parallel framework to run the clustering task in parallel, calculating the average distance between data in each cluster to update the cluster center; applying the sum-of-squared-errors criterion function on the child nodes to determine whether to continue iterating; and having each child node cluster the large-scale data based on the cluster centers. The invention realizes parallel clustering, reduces the number of clustering iterations, and improves clustering accuracy and parallel clustering efficiency.
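The selection of initial cluster centers described in the abstract can be sketched as follows. This is a single-machine illustration under several assumptions that the text does not confirm: the local density of a point is taken to be the number of other sampled points within a cutoff distance, the "set range" is taken to be that same cutoff, and candidates are examined greedily from densest to sparsest. All function and parameter names are hypothetical.

```python
import math
import random

def sample_without_repetition(data, ratio, seed=0):
    """Draw a fixed fraction of the data, each record at most once."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * ratio))
    return rng.sample(data, k)

def local_densities(points, cutoff):
    """Assumed definition: number of other points within `cutoff` of each point."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= cutoff)
        for i, p in enumerate(points)
    ]

def initial_centers(points, cutoff):
    """Pick initial centers among the points denser than average."""
    dens = local_densities(points, cutoff)
    avg = sum(dens) / len(dens)
    # Candidate set: points whose local density exceeds the average density,
    # visited from densest to sparsest (ties keep their input order).
    candidates = sorted(
        (i for i, d in enumerate(dens) if d > avg),
        key=lambda i: -dens[i],
    )
    centers = []
    for i in candidates:
        # Keep a candidate only if it lies more than 2 * cutoff away
        # from every center accepted so far.
        if all(math.dist(points[i], c) > 2 * cutoff for c in centers):
            centers.append(points[i])
    return centers
```

A possible use is `initial_centers(sample_without_repetition(data, 0.1), cutoff)`, where the sampling ratio and `cutoff` are tuning choices the abstract leaves unspecified. The pairwise density computation is quadratic in the sample size, which is one reason to run it on a sample rather than on the full data set.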

Description

Technical Field

[0001] The invention belongs to the technical field of parallel clustering, and in particular relates to a large-scale data distributed clustering processing method based on MapReduce.

Background

[0002] With the rapid development of information technology, data continues to grow in scale, and using parallel mechanisms to mine and analyze large-scale data sets effectively can promote the development of Internet technology. Cluster analysis is an important data processing technique and one of the important topics in machine learning and artificial intelligence; it is widely used in data mining, information retrieval, and other research. Its main task is to divide a data set into multiple subsets so that data objects within a subset are highly similar while data objects in different subsets differ substantially. Due to the increase of data scale, the traditional stand-alone cluste...


Application Information

Patent Timeline
25 Jun 2019: Publication of CN107291847B

IPC: G06F16/26
CPC: G06F16/285
Inventors: 高天寒; 孔雪