Massive high-dimension data clustering method for MapReduce platform

A clustering technique for high-dimensional data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as poor performance and insufficient memory, and achieves high scalability, good customizability, and a reduced data size.

Inactive Publication Date: 2011-10-19
FUDAN UNIV
2 Cites, 33 Cited by

AI Technical Summary

Problems solved by technology

[0005] However, the coordinate-based clustering methods in Mahout, such as K-means, are often oriented to low-dimensional data...

Examples

Specific embodiment

[0048] The input data consists of 2000 music feature files extracted from 2000 Chinese songs. Each song is divided into about 5000 frames, and each frame has 26 attributes represented as floating-point numbers. All frames are to be clustered into 1500 classes. We treat these roughly 10 million frames as a set of points, with the 26 attributes of each point serving as its 26-dimensional coordinates, and perform the clustering in the following steps:
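As a rough check on the scale described above, the arithmetic can be sketched as follows. The frame count per song is approximate in the text, and the 4-byte float width is an assumption made here only to estimate the raw size.

```python
# Back-of-the-envelope sizing for the data set described in [0048].
songs = 2000
frames_per_song = 5000      # "about 5000" in the text, so this is approximate
dims = 26                   # attributes per frame
bytes_per_float = 4         # assumption: 32-bit floats; the source does not say

points = songs * frames_per_song
raw_bytes = points * dims * bytes_per_float

print(f"points to cluster: {points}")
print(f"raw coordinate data: {raw_bytes / 1e9:.2f} GB")
```

Under these assumptions the raw coordinates alone come to about a gigabyte, which is the kind of size that motivates reducing the data before clustering.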

[0049] (1) First, cut each dimension into 10 intervals (N=10), collect all non-empty 26-dimensional grids, and remove duplicates. Since N=10, the coordinate value of a grid in each dimension is an integer in the range [0,9].
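Step (1) can be sketched in Python as below. This is a minimal single-machine illustration, not the patent's implementation: the source does not say how interval boundaries are chosen, so equal-width intervals between each dimension's minimum and maximum are assumed, and the names `to_cell` and `non_empty_cells` are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

Cell = Tuple[int, ...]

def to_cell(point: Sequence[float], lows: Sequence[float],
            highs: Sequence[float], n: int = 10) -> Cell:
    """Map a point to integer grid coordinates in [0, n-1] per dimension.

    Assumption: equal-width intervals between each dimension's min and max.
    """
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        idx = 0 if hi == lo else int((x - lo) / (hi - lo) * n)
        # Clamp so that x == hi falls into the last interval, not interval n.
        cell.append(min(max(idx, 0), n - 1))
    return tuple(cell)

def non_empty_cells(points: Iterable[Sequence[float]], n: int = 10) -> Set[Cell]:
    """Return the set of distinct grid cells that contain at least one point."""
    pts = list(points)
    dims = len(pts[0])
    lows = [min(p[d] for p in pts) for d in range(dims)]
    highs = [max(p[d] for p in pts) for d in range(dims)]
    # Using a set removes duplicate cells automatically.
    return {to_cell(p, lows, highs, n) for p in pts}

sample = [(0.0, 0.0), (0.05, 0.02), (1.0, 1.0), (0.5, 0.99)]
cells = non_empty_cells(sample)
print(sorted(cells))
```

In the example, the first two points fall into the same cell, so four points reduce to three cells. The reduction is larger when many points share a cell.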

[0050] (2) Randomly select 1500 grids from all the 26-dimensional grids output in step (1) as the initial center points.
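A minimal sketch of step (2), assuming the grids from step (1) are available as a collection of integer tuples. The function name `pick_initial_centers` and the optional `seed` parameter are hypothetical and added only to make the example reproducible.

```python
import random

def pick_initial_centers(cells, k, seed=None):
    """Choose k distinct grids uniformly at random as initial centers."""
    pool = sorted(cells)  # sort so that a fixed seed gives a repeatable choice
    if k > len(pool):
        raise ValueError("k exceeds the number of non-empty grids")
    rng = random.Random(seed)
    return rng.sample(pool, k)  # sampling without replacement

# Toy example: 100 two-dimensional grids, 5 initial centers.
toy_cells = {(i, j) for i in range(10) for j in range(10)}
centers = pick_initial_centers(toy_cells, 5, seed=0)
print(centers)
```

In the embodiment described above, `k` would be 1500 and each grid would have 26 coordinates.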

[0051] (3) Cluster all 26-dimensional grids output from step (1) on the MapReduce distributed platform. When calculating the distance, use the ASCII codes of 0, 2, 4...50 in the...
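One clustering iteration in the map/reduce style can be sketched as below. This is a single-process illustration under stated assumptions: it uses the plain Euclidean distance from Python's `math.dist`, because the source's efficient distance computation is truncated above and is not reproduced here, and the function names are hypothetical. The reduce step follows the K-medoids idea of choosing the cluster member with the smallest total distance to the other members.

```python
import math
from collections import defaultdict

def map_assign(cell, centers):
    """Map step: emit (index of the nearest center, cell)."""
    nearest = min(range(len(centers)), key=lambda i: math.dist(cell, centers[i]))
    return nearest, cell

def reduce_medoid(members):
    """Reduce step: pick the member with the smallest total distance to the rest.

    This is quadratic in the cluster size, which is acceptable for a sketch.
    """
    return min(members, key=lambda m: sum(math.dist(m, o) for o in members))

def iterate(cells, centers):
    """Run one assign-and-update round; the shuffle is emulated with a dict."""
    groups = defaultdict(list)
    for cell in cells:
        key, value = map_assign(cell, centers)
        groups[key].append(value)
    # A center that attracted no grids keeps its old position.
    return [reduce_medoid(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]

grid_cells = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
new_centers = iterate(grid_cells, [(0, 1), (9, 8)])
print(new_centers)
```

On a real cluster, `map_assign` and `reduce_medoid` would correspond to the mapper and reducer of a MapReduce job, and `iterate` would be repeated until the centers stop changing.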

Abstract

The invention belongs to the technical fields of cloud computing and data mining, and specifically discloses a clustering method for massive high-dimensional data on a MapReduce platform. In the method, each dimension of the raw data is split, and clustering is performed on the resulting small non-empty grids instead of on the points of the raw data, which reduces the data scale. The clustering is implemented on an open-source implementation of MapReduce, so that the whole clustering process can run in parallel on a distributed cluster, overcoming the storage and computation limits of single-machine algorithms. The clustering process adopts the idea of the K-medoids algorithm, and an efficient Euclidean distance computation method is proposed. The method is applied to the processing of massive high-dimensional data. A user can tune the algorithm manually according to the computational capability of the cluster, the expected running time, and the required clustering accuracy, so that the needs of different users are satisfied.

Description

Technical field

[0001] The invention belongs to the technical field of cloud computing and data mining, and in particular relates to a method for clustering massive high-dimensional data using the MapReduce distributed computing framework.

Background art

[0002] The analysis of high-dimensional data has always been a difficult problem in data mining. Once the number of dimensions becomes large enough, many clustering methods that are effective for low-dimensional data are no longer applicable. For massive high-dimensional data, analysis and mining are further constrained by the limits of memory and hard disk.

[0003] In recent years, research on MapReduce and its open-source implementation Hadoop has been very active. Many stand-alone algorithms have been re-implemented on Hadoop, which provides high availability and scalability for algorithms that process massive data.

[0004] Mahout is an open source project based on Hadoop under Apache, which provides the implementation of some ...

Application Information

IPC(8): G06F17/30
Inventor: 廖松博, 何震瀛, 汪卫
Owner: FUDAN UNIV