Massive high-dimension data clustering method for MapReduce platform

A clustering method for high-dimensional data, applied in electrical digital data processing, special data processing applications, instruments, etc. It addresses problems such as poor performance and insufficient memory, and offers high scalability, good customizability, and a reduced data size.

Inactive Publication Date: 2011-10-19

FUDAN UNIV

2 Cites, 33 Cited by

Problems solved by technology

[0005] However, the clustering methods for coordinate data in Mahout, such as K-means, are mostly designed for low-dimensional data. When they are applied to massive high-dimensional data, problems such as insufficient memory and poor performance often arise.

Specific embodiment

[0048] The input data consists of 2000 music feature files, extracted from 2000 Chinese songs. Each song is divided into about 5000 frames; each frame has 26 attributes represented as floating-point numbers, and all frames are to be clustered into 1500 classes. We treat these roughly 10 million frames as a set of points, with the 26 attributes of each point serving as its 26-dimensional coordinates, and cluster them in the following steps:

[0049] (1) First, split each dimension into 10 intervals (N=10), collect all non-empty 26-dimensional grids, and remove duplicates. Since N=10, the coordinate of a grid in each dimension is an integer in the range [0,9].

[0050] (2) Randomly select 1500 grids from all the 26-dimensional grids output by step (1) as the initial center points.

[0051] (3) Cluster all 26-dimensional grids output from step (1) on the MapReduce distributed platform. When calculating the distance, use the ASCII codes of 0, 2, 4...50 in the...
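
Steps (1) and (2) can be illustrated with a minimal single-machine sketch. It is not the patented implementation: the text does not say how each dimension is cut, so the sketch assumes equal-width intervals between the per-dimension minimum and maximum, and the function names (`to_grid`, `non_empty_grids`, `pick_initial_centers`) are hypothetical.

```python
import random

def to_grid(point, lows, highs, n=10):
    """Map a point to integer grid coordinates in [0, n-1] per dimension.

    Assumption: equal-width intervals between each dimension's min and max.
    """
    grid = []
    for x, lo, hi in zip(point, lows, highs):
        idx = 0 if hi == lo else int((x - lo) / (hi - lo) * n)
        grid.append(min(max(idx, 0), n - 1))  # clamp the maximum into the last interval
    return tuple(grid)

def non_empty_grids(points, n=10):
    """Step (1): return the set of distinct grids that contain at least one point."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    # A set removes duplicates: many points collapse into one grid.
    return {to_grid(p, lows, highs, n) for p in points}

def pick_initial_centers(grids, k, seed=0):
    """Step (2): randomly choose k distinct grids as initial centers."""
    rng = random.Random(seed)
    return rng.sample(sorted(grids), k)

# Tiny 2-dimensional example (the embodiment uses 26 dimensions and k=1500).
points = [(0.0, 0.0), (0.05, 0.01), (1.0, 1.0), (0.5, 0.95)]
grids = non_empty_grids(points, n=10)
centers = pick_initial_centers(grids, k=2)
```

In this toy input the first two points fall into the same grid, so four points reduce to three grids. That reduction in data size is the effect the method relies on.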

Abstract

The invention belongs to the technical fields of cloud computing and data mining, and discloses a clustering method for massive high-dimensional data on a MapReduce platform. In the method, each dimension of the raw data is split into intervals, and clustering is performed on the resulting small non-empty grids instead of on the points in the raw data, which reduces the data scale. The clustering is implemented with an open-source implementation of MapReduce, so the whole clustering process can run in parallel on a distributed cluster, overcoming the storage and computation limits of a single-machine algorithm. The clustering process adopts the idea of the K-medoids algorithm, and an efficient method for computing Euclidean distance is proposed. The method is suited to processing massive high-dimensional data. A user can tune the algorithm manually according to the computational capability of the cluster, the expected running time, and the required clustering accuracy, so the needs of different users can be met.
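
As a rough illustration of the K-medoids idea applied to grids, the sketch below runs one iteration in a map/reduce shape on a single machine. It is an assumption-laden sketch, not the method claimed in the patent: it uses plain Euclidean distance via `math.dist` rather than the efficient distance computation the abstract mentions, it picks each new center as the member with the smallest total distance to the other members, and the names `map_assign`, `reduce_medoid`, and `kmedoids_round` are hypothetical.

```python
import math
from collections import defaultdict

def map_assign(grid, centers):
    """Map step: emit (index of the nearest center, grid)."""
    nearest = min(range(len(centers)), key=lambda i: math.dist(grid, centers[i]))
    return nearest, grid

def reduce_medoid(members):
    """Reduce step: choose the member with the smallest total distance to the rest."""
    return min(members, key=lambda c: sum(math.dist(c, other) for other in members))

def kmedoids_round(grids, centers):
    """Run one assignment-and-update iteration and return the new centers."""
    groups = defaultdict(list)
    for grid in grids:
        key, value = map_assign(grid, centers)
        groups[key].append(value)  # stands in for the shuffle phase
    # Assumption: a center that attracted no grids is kept unchanged.
    return [reduce_medoid(groups[i]) if groups[i] else centers[i]
            for i in range(len(centers))]

# Toy 2-dimensional grids with integer coordinates in [0, 9].
grids = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 8), (8, 9)]
new_centers = kmedoids_round(grids, [(0, 1), (9, 9)])
```

Because every center is itself a grid from the input, the centers stay on integer coordinates, which is what distinguishes this update from the mean-based update of K-means. On a real cluster, `map_assign` and `reduce_medoid` would correspond to the mapper and reducer of one MapReduce job per iteration.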

Description

Technical field

[0001] The invention belongs to the technical field of cloud computing and data mining, and in particular relates to a method for clustering massive high-dimensional data using a MapReduce distributed computing framework.

Background

[0002] The analysis of high-dimensional data has always been a difficult problem in data mining. When the dimension reaches a certain height, many clustering methods that are effective for low-dimensional data are no longer applicable. For massive high-dimensional data, analysis and mining are further constrained by the limits of memory and hard disk.

[0003] In recent years, research on MapReduce and its open-source version Hadoop has been very active. Many stand-alone algorithms are re-implemented on Hadoop, which provides high availability and scalability for various algorithms to process massive data.

[0004] Mahout is an open source project based on Hadoop under Apache, which provides the implementation of some ...

Application Information

IPC(8): G06F17/30

Inventor: 廖松博, 何震瀛, 汪卫

Owner: FUDAN UNIV
