GMM clustering method for high-dimensional massive data under the Hadoop framework

A clustering method for massive data, applied in the computer field. It addresses the poor scalability of existing methods, their failure to handle the curse of dimensionality in large-scale data processing, and their inability to process massive data effectively. It relieves the I/O bottleneck and improves clustering efficiency.

Active Publication Date: 2015-09-30
西安电子科技大学青岛计算技术研究院

AI Technical Summary

Problems solved by technology

One prior method cannot handle massive data effectively during clustering, and its efficiency is limited by its time and space complexity.
Another prior clustering method does not address the curse of dimensionality in large-scale data processing, and it scales poorly.

Examples

Embodiment Construction

[0053] The present invention will be further described below in conjunction with the accompanying drawings.

[0054] Referring to Figure 1, the present invention comprises the following steps:

[0055] Step 1, set up a local area network

[0056] Connect multiple computers to the same LAN. Each computer acts as a node, and together the nodes form a cluster whose members can communicate with one another.

[0057] Step 2, build the Hadoop platform

[0058] Configure the Hadoop 0.20.2 files on each node in the cluster. Setting the attribute parameters dfs.namenode and dfs.datanode in these files gives the cluster one name node and multiple data nodes. Setting the attribute parameters mapred.jobtracker and mapred.tasktracker gives the cluster one scheduling node and multiple task nodes. Together these settings establish an open-source Hadoop platform.
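As a rough illustration of the configuration step above, the following Python sketch generates Hadoop-style XML configuration fragments. It is a sketch under stated assumptions, not the patent's actual files: the keys `fs.default.name` and `mapred.job.tracker` are the standard Hadoop 0.20.x property names and are only assumed to correspond to the name-node and scheduling-node settings described in the text, while the host name `master`, the ports, and the helper `hadoop_site_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical host and ports; every node would point at the same master.
core_site = hadoop_site_xml({"fs.default.name": "hdfs://master:9000"})      # name node address
mapred_site = hadoop_site_xml({"mapred.job.tracker": "master:9001"})        # scheduling node address

print(core_site)
print(mapred_site)
```

In this sketch every node receives the same two fragments, so the data nodes and task nodes learn where the single name node and scheduling node live.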

[0059] The specific steps to establish the Hadoop platform are as follows: first, install the Ubuntu 10.04 operating system for each ...

Abstract

The invention discloses a Gaussian Mixture Model (GMM) clustering method for high-dimensional massive data under the Hadoop framework. To address the shortcomings of existing clustering algorithms, the method structures the clustering of massive data as a problem on a distributed platform. It comprises the following steps: 1. constructing a local area network; 2. establishing a Hadoop platform; 3. uploading data to the cluster; 4. performing initial clustering; 5. calculating the parameters and the discriminant function of each cluster; 6. determining whether clustering is complete; 7. clustering again; 8. calculating the mean and the weight of each class in the new clustering; 9. calculating the variance of each class in the new clustering; and 10. outputting the clustering results. The method exploits the MapReduce model of the Hadoop framework: the parallelizable parts of the clustering are processed in parallel by Map tasks, and two Map/Reduce tasks compute the mean and the variance respectively. This yields efficient and accurate clustering with good scalability and fault tolerance.
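Steps 8 and 9 of the abstract, where one Map/Reduce task computes the mean and weight of each class and a second computes the variance, can be sketched in plain Python. This is a minimal in-memory illustration, not the patent's Hadoop implementation: `run_map_reduce`, `mean_and_weight_job`, and `variance_job` are hypothetical names, the variance is assumed to be per-dimension (diagonal) and normalized by the class size, and the labeled points are made-up sample data.

```python
from collections import defaultdict


def run_map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one Map/Reduce task (not Hadoop itself)."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)          # shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


def mean_and_weight_job(labeled_points):
    """Task 1: per-class weight (share of all points) and mean vector."""
    total = len(labeled_points)

    def mapper(record):
        label, point = record
        yield label, point                     # key each point by its class

    def reducer(label, points):
        count = len(points)
        dims = len(points[0])
        mean = [sum(p[d] for p in points) / count for d in range(dims)]
        return {"weight": count / total, "mean": mean}

    return run_map_reduce(labeled_points, mapper, reducer)


def variance_job(labeled_points, stats):
    """Task 2: per-class, per-dimension variance, using the means from task 1."""
    def mapper(record):
        label, point = record
        mean = stats[label]["mean"]
        yield label, [(x - m) ** 2 for x, m in zip(point, mean)]

    def reducer(label, squared_devs):
        n = len(squared_devs)
        dims = len(squared_devs[0])
        return [sum(v[d] for v in squared_devs) / n for d in range(dims)]

    return run_map_reduce(labeled_points, mapper, reducer)


# Made-up sample: (class label, point) pairs as they might look after step 7.
points = [(0, [1.0, 2.0]), (0, [3.0, 4.0]), (1, [10.0, 10.0]), (1, [10.0, 14.0])]
stats = mean_and_weight_job(points)
variances = variance_job(points, stats)
print(stats)
print(variances)
```

The variance task needs the means before it can start, which is one plausible reason for splitting the computation into two tasks: each task makes a single pass over the data, and only the small per-class statistics are passed between them.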

Description

Technical field

[0001] The invention belongs to the field of computer technology, and further relates to a Gaussian Mixture Model (GMM) clustering method for high-dimensional massive data under the Hadoop framework in the field of data mining. The invention clusters high-dimensional massive data conveniently and efficiently, and overcomes the inefficiency and the curse of dimensionality that arise when massive data is processed on a single machine.

Technical background

[0002] MapReduce (MR) is a computing framework widely used in massive data processing: "Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51 (1): 107-113". The framework was invented by Google and is a parallel programming model that has emerged in recent years. It puts parallelization, fault tolerance, data distribution, load balancing, etc. in a library, and boils down all operations of the syste...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 崔江涛, 李林, 司蓁, 彭延国, 史玮, 陈煜, 崔小利, 王博
Owner: 西安电子科技大学青岛计算技术研究院