High-dimensional mass data GMM (Gaussian Mixture Model) clustering method under Hadoop framework

A clustering technology for massive data, applied in the computer field. It addresses the poor scalability of existing methods, their failure to handle the curse of dimensionality in large-scale data processing, and their inability to deal effectively with massive data. Its effect is to relieve the I/O bottleneck and improve clustering efficiency.

Active Publication Date: 2013-05-01
西安电子科技大学青岛计算技术研究院

AI Technical Summary

Problems solved by technology

The shortcomings of this method are: in the clustering process, it cannot effectively deal with massive data, and the algorithm efficiency is limited by time and space complexity.
However, the disa...



Examples


Embodiment Construction

[0053] The present invention will be further described below in conjunction with the drawings.

[0054] Referring to figure 1, the present invention includes the following steps:

[0055] Step 1. Build a local area network

[0056] Connect multiple computers to the same local area network, and each computer acts as a node to establish a cluster that can communicate with each other.

[0057] Step 2. Establish Hadoop platform

[0058] Configure the Hadoop 0.20.2 files on each node in the cluster. Set the attribute parameters dfs.namenode and dfs.datanode in the files so that the cluster includes one name node and multiple data nodes, and set the attribute parameters mapred.jobtracker and mapred.tasktracker so that the cluster includes one scheduling node and multiple task nodes. This establishes an open-source Hadoop platform.
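As a rough illustration of the configuration step in [0058], the sketch below renders the three site files that a Hadoop 0.20.x cluster typically reads. It is an assumption-laden example, not the patent's own configuration: the property names fs.default.name, dfs.replication and mapred.job.tracker are the ones commonly used with Hadoop 0.20.x and may not correspond exactly to the attribute parameters named above, and the hostnames master, slave1 and slave2, the ports, and the helper hadoop_site_xml are hypothetical.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Render a Hadoop *-site.xml document from a dict of property name -> value."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


MASTER = "master"               # hypothetical hostname of the name node / scheduling node
WORKERS = ["slave1", "slave2"]  # hypothetical hostnames of the data nodes / task nodes

# core-site.xml: where the name node's file system lives (assumed port 9000).
core_site = hadoop_site_xml({"fs.default.name": f"hdfs://{MASTER}:9000"})

# hdfs-site.xml: how many copies of each block the data nodes keep (assumed value).
hdfs_site = hadoop_site_xml({"dfs.replication": len(WORKERS)})

# mapred-site.xml: where the scheduling node (JobTracker) listens (assumed port 9001).
mapred_site = hadoop_site_xml({"mapred.job.tracker": f"{MASTER}:9001"})

# conf/slaves: one worker hostname per line.
slaves_file = "\n".join(WORKERS) + "\n"
```

Every node would receive the same three files, so a single name node and scheduling node is fixed by pointing all nodes at the same master host, while the worker list determines the data nodes and task nodes.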

[0059] The specific steps for establishing the Hadoop platform are as follows: first install the Ubuntu 10.04 operating system on each node in the clust...


Abstract

The invention discloses a GMM (Gaussian Mixture Model) clustering method for high-dimensional mass data under the Hadoop framework. Aimed mainly at the defects of existing clustering algorithms, the method clusters high-dimensional mass data by formulating the clustering problem on a distributed platform. It comprises the following steps: 1. constructing a local area network; 2. establishing a Hadoop platform; 3. uploading data to the cluster; 4. performing initial clustering; 5. calculating the parameters and the discriminant function of each cluster; 6. determining whether clustering is complete; 7. clustering again; 8. calculating the mean value and the weight of each class in the new clustering; 9. calculating the variance of each class in the new clustering; and 10. outputting the clustering results. By exploiting the MapReduce operation model in the Hadoop framework, using Map tasks to process the parallelizable parts of the clustering, and adopting two Map/Reduce tasks to calculate the mean value and the variance respectively, the method achieves efficient and accurate clustering with better scalability and fault tolerance.
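The two Map/Reduce tasks mentioned in the abstract (one for the mean value and weight, one for the variance) can be sketched in plain Python as follows. This is a minimal in-memory illustration under stated assumptions, not the patented implementation: it assumes hard assignment of each point to one cluster, a diagonal covariance (one variance per dimension), and a discriminant equal to the log of the weight times the Gaussian density. The names run_job, mean_weight_job, variance_job, discriminant and reassign are hypothetical, and run_job merely imitates the map, group-by-key, reduce flow without using Hadoop.

```python
import math
from collections import defaultdict


def run_job(records, mapper, reducer):
    """In-memory stand-in for one Map/Reduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


def mean_weight_job(labeled_points):
    """Job 1: mean vector and mixing weight of each cluster."""
    total = len(labeled_points)

    def mapper(record):
        label, x = record
        yield label, x

    def reducer(label, values):
        count = len(values)
        dim = len(values[0])
        mean = [sum(x[d] for x in values) / count for d in range(dim)]
        return {"mean": mean, "weight": count / total}

    return run_job(labeled_points, mapper, reducer)


def variance_job(labeled_points, stats):
    """Job 2: per-dimension variance of each cluster, using the means from job 1."""

    def mapper(record):
        label, x = record
        mean = stats[label]["mean"]
        yield label, [(xi - mi) ** 2 for xi, mi in zip(x, mean)]

    def reducer(label, values):
        count = len(values)
        return [sum(v[d] for v in values) / count for d in range(len(values[0]))]

    return run_job(labeled_points, mapper, reducer)


def discriminant(x, mean, var, weight, eps=1e-9):
    """Assumed discriminant: log(weight * diagonal Gaussian density at x)."""
    score = math.log(weight)
    for xi, mi, vi in zip(x, mean, var):
        vi = max(vi, eps)  # guard against zero variance
        score -= 0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return score


def reassign(points, stats, variances):
    """Map-only step: label each point with the cluster of largest discriminant."""
    def best(x):
        return max(stats, key=lambda k: discriminant(
            x, stats[k]["mean"], variances[k], stats[k]["weight"]))
    return [(best(x), x) for x in points]


labeled = [(0, [0.0, 0.0]), (0, [2.0, 2.0]), (1, [10.0, 10.0]), (1, [12.0, 12.0])]
stats = mean_weight_job(labeled)
variances = variance_job(labeled, stats)
relabeled = reassign([[1.5, 0.5], [10.0, 12.0]], stats, variances)
```

Splitting the computation into two jobs reflects the dependency in the data: the variance of a cluster cannot be accumulated until its mean is known, so the second job reads the output of the first.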

Description

Technical field

[0001] The invention belongs to the field of computer technology, and further relates to a Gaussian Mixture Model (GMM) clustering method for high-dimensional massive data under the Hadoop framework in the data mining field. The invention can conveniently and efficiently cluster high-dimensional massive data, and overcomes the low efficiency and curse-of-dimensionality problems of massive data processing in stand-alone mode.

Technical background

[0002] MapReduce (MR), a computing framework widely used in massive data processing, was invented by Google (Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters[J]. Communications of the ACM, 2008, 51(1): 107-113). This recently emerged parallel programming model puts parallelization, fault tolerance, data distribution, load balancing, etc. in a library, and reduces all of the system's operations on data to two steps: The Map (...


Application Information

IPC(8): G06F17/30
Inventor: 崔江涛, 李林, 司蓁, 彭延国, 史玮, 陈煜, 崔小利, 王博
Owner 西安电子科技大学青岛计算技术研究院