Hierarchical clustering method based on Hadoop and HBase

A hierarchical clustering algorithm technology, applied in special data processing applications, instruments, electrical digital data processing, etc., to achieve the effect of improving scalability and big data processing capabilities

Active Publication Date: 2017-03-08
SOUTH CHINA UNIV OF TECH

AI Technical Summary

Problems solved by technology

Currently, however, no hierarchical clustering algorithm supports the single-linkage, complete-linkage and average-linkage clustering methods together on the Hadoop and HBase platforms.

Examples

Embodiment Construction

[0020] In order to make the technical solutions and advantages of the present invention clearer, further detailed description will be given below in conjunction with the accompanying drawings, but the implementation and protection of the present invention are not limited thereto. It should be pointed out that, if there are symbols or processes in the following that are not specifically described in detail, those skilled in the art can understand or implement them with reference to the prior art.

[0021] 1. Parallel calculation algorithm of distance matrix

[0022] The parallel calculation algorithm of the distance matrix aims to speed up the calculation of the distance matrix and to import it quickly into HBase. During clustering, the hierarchical clustering algorithm relies on a distance matrix with a space complexity of O(n^2). In this method, a Hadoop-based parallel computing algorithm is designed and implemented for the calculation of the distance matrix, s...
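The idea of computing the distance matrix in parallel can be illustrated with a minimal sketch. It is not the patented Hadoop job: a local thread pool stands in for Hadoop map tasks, Euclidean distance is assumed because the excerpt does not name a metric, and the names `pair_blocks`, `map_block` and `distance_matrix` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from math import dist


def pair_blocks(n, block_size):
    """Split all (i, j) pairs with i < j into fixed-size blocks, one per task."""
    pairs = list(combinations(range(n), 2))
    return [pairs[k:k + block_size] for k in range(0, len(pairs), block_size)]


def map_block(points, block):
    """Map step: compute the distance for every pair in one block.

    Euclidean distance is an assumption made for this sketch.
    """
    return [((i, j), dist(points[i], points[j])) for i, j in block]


def distance_matrix(points, block_size=2, workers=4):
    """Compute the upper triangle of the distance matrix block by block.

    The thread pool is only a local stand-in for parallel map tasks.
    """
    blocks = pair_blocks(len(points), block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda block: map_block(points, block), blocks)
        return {pair: d for part in results for pair, d in part}


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
matrix = distance_matrix(points)
```

Only the upper triangle is stored, since the distance between clusters i and j equals the distance between j and i. Each block is independent of the others, which is what makes the computation easy to distribute.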

Abstract

The invention discloses a hierarchical clustering method based on Hadoop and HBase. The method comprises the following steps: computing a distance matrix with Hadoop, then importing the result into HBase by converting it into HFile files and bulk loading them. HBase stores the distance matrix in two main tables: one ordered by cluster ID pair, and the other ordered by the distance between clusters, so that the two nearest clusters can be conveniently retrieved and merged in each iteration. Finally, a multi-threaded algorithm combined with caching processes the distance matrix in HBase in a unified manner to implement the hierarchical clustering algorithm, with multiple adjustable parameters reserved. The algorithm supports the three clustering methods of single-linkage, complete-linkage and average-linkage at the same time. The scheme provided by the invention uses the parallel computation capacity of Hadoop and the mass data storage capacity of HBase, thereby improving the big data processing capacity and the scalability of the hierarchical clustering algorithm.
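The two-table layout and the three linkage methods can be sketched in memory as follows. This is an illustration under stated assumptions, not the patented implementation: a dictionary stands in for the table keyed by cluster ID pair, a heap stands in for the table ordered by distance, the merged-cluster distances use the standard Lance-Williams style updates (an assumption, since the abstract does not give the update rule), and the multi-threading and caching are omitted. The names `merge_distance` and `hierarchical_clustering` are hypothetical.

```python
import heapq


def merge_distance(method, d_ak, d_bk, size_a, size_b):
    """Distance from cluster k to the cluster formed by merging a and b."""
    if method == "single":
        return min(d_ak, d_bk)
    if method == "complete":
        return max(d_ak, d_bk)
    if method == "average":
        return (size_a * d_ak + size_b * d_bk) / (size_a + size_b)
    raise ValueError(f"unknown linkage method: {method}")


def hierarchical_clustering(distances, n, method="single"):
    """Agglomerative clustering over {(i, j): d} with i < j for n points.

    Returns a list of merges (a, b, distance, new_cluster_id).
    """
    by_pair = dict(distances)  # stand-in for the table keyed by ID pair
    by_distance = [(d, a, b) for (a, b), d in by_pair.items()]
    heapq.heapify(by_distance)  # stand-in for the table ordered by distance
    sizes = {i: 1 for i in range(n)}
    next_id = n
    merges = []
    while len(sizes) > 1:
        d, a, b = heapq.heappop(by_distance)
        if a not in sizes or b not in sizes:
            continue  # stale entry: one of the clusters was already merged
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        del by_pair[(a, b)]
        for k in sizes:
            # Replace the rows for a and b with one row for the new cluster.
            d_ak = by_pair.pop((min(a, k), max(a, k)))
            d_bk = by_pair.pop((min(b, k), max(b, k)))
            new_d = merge_distance(method, d_ak, d_bk, size_a, size_b)
            by_pair[(k, next_id)] = new_d
            heapq.heappush(by_distance, (new_d, k, next_id))
        sizes[next_id] = size_a + size_b
        merges.append((a, b, d, next_id))
        next_id += 1
    return merges


# Three points on a line at positions 0, 1 and 5.
example = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 4.0}
single = hierarchical_clustering(example, 3, "single")
complete = hierarchical_clustering(example, 3, "complete")
average = hierarchical_clustering(example, 3, "average")
```

The pair-keyed structure answers "what is the distance between clusters a and k", while the distance-ordered structure answers "which pair is closest right now". Keeping both is what lets each iteration find the nearest pair without scanning the whole matrix.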

Description

Technical field

[0001] The invention relates to the technical fields of hierarchical clustering algorithms, Hadoop and HBase, and in particular to the design and implementation of a hierarchical clustering method based on Hadoop and HBase.

Background

[0002] As a simple and widely accepted clustering algorithm, the hierarchical clustering algorithm has been applied in many fields, such as information retrieval and bioinformatics. Its advantage is that it expresses the clustering results in more detail: it organizes the relationships between clusters into a dendrogram, so users can see clearly how each cluster was merged with the others. Many other clustering algorithms do not give such results. Moreover, unlike other clustering algorithms such as k-means, the hierarchical clustering algorithm does not require the user to specify the number of clusters in advance. Although the hierarchical clustering algorithm has...

Application Information

IPC(8): G06F17/30
CPC: G06F16/2282, G06F16/25, Y02D10/00
Inventors: 刘发贵, 周晓场
Owner: SOUTH CHINA UNIV OF TECH