Massive high-dimension data clustering method for MapReduce platform

A clustering technology for high-dimensional data, classified under electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as poor performance and insufficient memory, and achieves high scalability, good customizability, and a reduced data size.
CN102222092B · Inactive · Publication Date: 2013-02-27 · FUDAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
FUDAN UNIV
Publication Date
2013-02-27
Estimated Expiration
Not applicable · inactive patent

Abstract

The invention belongs to the technical fields of cloud computing and data mining, and discloses a method for clustering massive high-dimensional data on a MapReduce platform. Each dimension of the raw data is split into intervals, and clustering operates on the resulting small non-empty grid cells instead of on the raw data points, which reduces the data scale. The clustering is built on an open-source implementation of MapReduce, so the whole process runs in parallel on a distributed cluster and is not bound by the storage and computation limits of a single-machine algorithm. The clustering follows the idea of the K-medoids algorithm, and an efficient method for computing Euclidean distance is proposed. The method is suited to processing massive high-dimensional data. A user can tune the algorithm manually according to the computing capacity of the cluster, the expected running time, and the required clustering accuracy, so the needs of different users can be met.
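The grid-based reduction described in the abstract can be illustrated with a minimal single-process sketch. It rests on several assumptions: the function names (`map_point`, `reduce_cells`, `assign`, `k_medoids`) are hypothetical, plain Python functions stand in for Hadoop map and reduce tasks, every dimension uses a fixed cell width, and distances are measured between cell indices with the standard `math.dist`. The efficient Euclidean distance computation that the patent proposes is not described in this excerpt and is not reproduced here.

```python
import math
from collections import defaultdict

def map_point(point, mins, widths):
    """Map step: emit (grid cell index, 1) for one raw data point."""
    cell = tuple(int((x - lo) // w) for x, lo, w in zip(point, mins, widths))
    return cell, 1

def reduce_cells(pairs):
    """Reduce step: sum the counts per cell. Empty cells never appear."""
    counts = defaultdict(int)
    for cell, n in pairs:
        counts[cell] += n
    return dict(counts)

def assign(cells, medoids):
    """Attach every non-empty cell to its nearest medoid."""
    clusters = {m: [] for m in medoids}
    for c in cells:
        nearest = min(medoids, key=lambda m: math.dist(c, m))
        clusters[nearest].append(c)
    return clusters

def k_medoids(cells, medoids, max_iter=20):
    """K-medoids over grid cells, weighting each cell by its point count."""
    for _ in range(max_iter):
        clusters = assign(cells, medoids)
        # New medoid: the member cell with the lowest weighted total distance.
        new_medoids = [
            min(members,
                key=lambda cand: sum(cells[c] * math.dist(c, cand)
                                     for c in members))
            for members in clusters.values() if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(cells, medoids)
```

Because the clustering step only sees non-empty cells, its cost depends on the number of occupied cells, not on the number of raw points. A coarser cell width shrinks the input further at the expense of accuracy, which is one way to picture the manual tuning the abstract mentions.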

Description

Technical field

[0001] The invention belongs to the technical field of cloud computing and data mining, and in particular relates to a method for clustering massive high-dimensional data using the MapReduce distributed computing framework.

Background art

[0002] The analysis of high-dimensional data has always been a difficult problem in data mining. Once the dimensionality is high enough, many clustering methods that work well on low-dimensional data are no longer applicable. For massive high-dimensional data, analysis and mining are further constrained by the limits of memory and hard disk capacity.

[0003] In recent years, research on MapReduce and its open-source implementation, Hadoop, has been very active. Many single-machine algorithms have been re-implemented on Hadoop, which gives them the high availability and scalability needed to process massive data.

[0004] Mahout is an Apache open-source project built on Hadoop, which provides implementations of some ...
