MapReduce-based distributed cluster processing method for large-scale data

The invention relates to large-scale data and distributed clustering technology, applied in the fields of electrical digital data processing, special data processing applications, and database models. It addresses three problems of existing methods: reduced overall efficiency of parallel clustering, time-consuming similarity calculation, and high computing overhead. Its effects are higher parallel clustering efficiency, fewer clustering iterations, and faster convergence.

Active Publication Date: 2017-10-24
北京点为信息科技有限公司
3 Cites, 24 Cited by

AI Technical Summary

Problems solved by technology

Canopy-Kmeans combines the K-Means method with the Canopy method: it uses Canopy to calculate the similarity between objects and to preprocess the data. Its advantage is that it supplies initial cluster center points, which helps avoid falling into a local optimum. Its disadvantage is that calculating the distances between objects makes the similarity computation relatively time-consuming.
Methods based on data density calculate the density of all data and then select the data with the highest density as cluster center points. This avoids the problems of random selection and is more accurate, but the computational cost of the traditional approach is also high, and it tends to place a heavy load on the clustering nodes, which reduces the overall efficiency of parallel clustering.

Examples

Detailed Description of the Embodiments

[0037] Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

[0038] As shown in Figure 1, the Hadoop distributed cluster environment in this embodiment consists of 3 servers forming 3 nodes: one master node (Master), which issues commands and distributes tasks, and 2 slave nodes (slaves), which receive the tasks distributed by the master node and run them as the master node requests. All nodes are connected through high-speed Ethernet. The master node starts the entire cluster environment in response to the user's application request. Together, the slave nodes and the master node form the main body of the parallel system of the Hadoop distributed cluster environment and are responsible for the processing and operation of the entire Hadoop distributed cluster. As shown in Figure 2, in this embodiment: 1) receive data to be processed according to u...
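The division of work between one master node and two slave nodes can be illustrated with a minimal single-process sketch. It does not use the Hadoop API: the map and reduce steps are simulated with plain Python functions, and all names (`map_assign`, `reduce_update`, `kmeans_mapreduce`, `n_workers`, `tol`) are hypothetical. The center update shown is the standard K-Means coordinate mean, which is an assumption, since the exact update rule of this embodiment is not visible in the excerpt. Iteration stops when the sum of squared errors no longer changes.

```python
import math
from collections import defaultdict


def map_assign(partition, centers):
    """Map step on one slave: emit (nearest-center index, point) pairs."""
    for p in partition:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        yield idx, p


def reduce_update(pairs, centers):
    """Reduce step: group points by center index and average each group.

    Assumption: standard K-Means mean update, not necessarily the patent's rule.
    """
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)
    for idx, pts in groups.items():
        new_centers[idx] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return new_centers, groups


def sse(groups, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(
        math.dist(p, centers[idx]) ** 2
        for idx, pts in groups.items()
        for p in pts
    )


def kmeans_mapreduce(data, centers, n_workers=2, tol=1e-9, max_iter=100):
    """Master loop: split data across workers, run map and reduce, check SSE."""
    partitions = [data[i::n_workers] for i in range(n_workers)]
    prev = math.inf
    cur = 0.0
    for _ in range(max_iter):
        pairs = [kv for part in partitions for kv in map_assign(part, centers)]
        centers, groups = reduce_update(pairs, centers)
        cur = sse(groups, centers)
        if abs(prev - cur) <= tol:  # SSE has stopped changing: converged
            break
        prev = cur
    return centers, cur


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    final_centers, final_sse = kmeans_mapreduce(points, [(0.0, 0.0), (10.0, 10.0)])
    print(final_centers, final_sse)
```

In a real Hadoop job the partitions would be input splits processed on separate slave nodes, and the master would redistribute the updated centers before each round. The sketch keeps only the data flow.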

Abstract

The invention provides a MapReduce-based distributed clustering processing method for large-scale data. The method comprises the following steps: the large-scale data is sampled in equal proportion without repetition; the sampled data is input into a MapReduce distributed parallel framework, and the local density and average density of the sampled data are calculated; all sampled data whose local density is greater than the average density is taken as the candidate point set for the initial cluster center points of the clusters, and the candidate point set is fed back to the master node; from among all candidate points, those for which the distance between every two adjacent candidates is twice the set range are selected to serve as the initial cluster center points; the MapReduce distributed parallel framework is used to perform the parallel clustering task, and for each cluster the average of the distances between the data is calculated to update the cluster center points; a sum-of-squared-error criterion function is used to judge whether the child nodes continue iterating; and all child nodes cluster the large-scale data according to the cluster center points. The method realizes parallel clustering, reduces the number of clustering iterations, and improves clustering accuracy and parallel clustering efficiency.
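The sampling and initial-center steps of the abstract can be sketched as follows. This is a single-machine illustration under stated assumptions, not the patented implementation: local density is taken to be the number of other points within a cutoff `radius` (standing in for the "set range"), and the spacing rule is read as "keep a candidate only if it lies at least twice the radius from every center already chosen". The function names and the `fraction`, `radius`, and `seed` parameters are hypothetical.

```python
import math
import random


def sample_without_repetition(data, fraction, seed=0):
    """Draw a fixed proportion of the data, each item at most once."""
    rng = random.Random(seed)
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)


def local_densities(points, radius):
    """Assumed density definition: number of other points within `radius`."""
    return [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


def candidate_centers(points, radius):
    """Keep points denser than the average, densest first."""
    dens = local_densities(points, radius)
    avg = sum(dens) / len(dens)
    ranked = sorted(zip(dens, points), key=lambda t: -t[0])  # stable sort
    return [p for d, p in ranked if d > avg]


def pick_initial_centers(candidates, radius):
    """Assumed spacing rule: chosen centers are at least 2 * radius apart."""
    centers = []
    for c in candidates:
        if all(math.dist(c, s) >= 2 * radius for s in centers):
            centers.append(c)
    return centers


if __name__ == "__main__":
    blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
    blob_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
    points = blob_a + blob_b + [(10.0, 0.0)]  # last point is an isolated outlier
    cands = candidate_centers(points, radius=0.5)
    print(pick_initial_centers(cands, radius=0.5))
```

The pairwise density loop is quadratic in the number of points, which is why the abstract computes it on a sample and distributes the work across nodes rather than running it on the full data set.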

Description

Technical field

[0001] The invention belongs to the technical field of parallel clustering, and in particular relates to a MapReduce-based distributed clustering processing method for large-scale data.

Background

[0002] With the rapid development of information technology, the scale of data continues to grow, and using parallel mechanisms to mine and analyze large-scale data sets effectively can promote the development of Internet technology. Cluster analysis is an important data processing technique and one of the important topics in machine learning and artificial intelligence. It is widely used in data mining, information retrieval, and other research. Its main task is to divide a data set into multiple subsets so that data objects within a subset are highly similar, while data objects in different subsets differ substantially. Due to the increase of data scale, the traditional stand-alone cluste...

Application Information

IPC(8): G06F17/30
CPC: G06F16/285
Inventors: 高天寒, 孔雪
Owner: 北京点为信息科技有限公司