Distributed density peak value clustering algorithm based on z value

A density peak clustering technique, applied in the field of big data processing, that addresses problems such as increased computing overhead, highly random seed objects, and unbalanced load in computer clusters

Inactive Publication Date: 2017-12-26
SHENYANG POLYTECHNIC UNIV

AI Technical Summary

Problems solved by technology

To improve the efficiency of the algorithm, the paper "EDDPC: An Efficient Distributed Density Central Clustering Algorithm" uses Voronoi partitioning to divide the data set into disjoint groups, which are then sent to different machines for execution. This grouping method has two shortcomings. First, the seed objects are chosen with a large degree of randomness, which may cause unbalanced load across the computer cluster.
Secondly, when calculating the density value ρ and the repulsion group value δ, a large number of redundant copies are still made, which increases the computational overhead.
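
The load-balance concern can be illustrated with a minimal sketch. It is not the EDDPC implementation: the function name `voronoi_partition`, the synthetic data, and the choice of four seeds are all assumptions made for illustration. Points are assigned to their nearest randomly drawn seed, and the resulting group sizes show how uneven the partition can become on skewed data.

```python
import random
from collections import Counter

def voronoi_partition(points, seeds):
    """Assign each point to the index of its nearest seed (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(seeds)), key=lambda i: sqdist(p, seeds[i])) for p in points]

rng = random.Random(0)
# Hypothetical skewed data: one dense blob plus a sparse uniform background.
points = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(900)]
points += [(rng.uniform(-5, 5), rng.uniform(-5, 5)) for _ in range(100)]

# Seeds drawn at random from the data; each seed defines one Voronoi group.
seeds = rng.sample(points, 4)
labels = voronoi_partition(points, seeds)

# Group sizes stand in for the per-machine load.
group_sizes = Counter(labels)
print(sorted(group_sizes.values(), reverse=True))
```

Rerunning with a different random seed changes the group sizes, which is the sensitivity to seed choice that the text describes.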

Examples


Embodiment Construction

[0037] The specific embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings.

[0038] Referring to Figures 1 to 4, high-dimensional big data is clustered using the z-value-based density peak clustering method, which includes the following steps:

[0039] Step 1: Data set selection.

[0040] This implementation uses three data sets: KDD'99_10%, FCoverType and Facial. The KDD'99_10% data set consists of 494,021 data points with 42 attributes, such as connection time and transmitted data volume; this implementation extracts 34 real-valued attributes from them. FCoverType consists of 581,012 data points with 54 attributes, including latitude and longitude. The Facial data set consists of 27,936 face images, each of which contains 300 pixels.

[0041] Step 2: Construction of software and hardware environment.

[0042] Step 2.1: Build a hardware computing platform.

[0043] Under the Ubu...


Abstract

A distributed density peak value clustering algorithm based on a z value is disclosed. The algorithm comprises the following steps: (1) preparing a data set; (2) constructing the software and hardware environments; (3) preprocessing the data; (4) sampling the data set; (5) determining the cut-off distance parameter value of the density value formula in the z-value-based density clustering algorithm and selecting a subgroup quantile; (6) sending points in the data set to different groups according to the size of their z values; (7) calculating the density value in the z-value-based distributed density peak value clustering algorithm; (8) calculating the global outliers in the z-value-based distributed density peak value clustering algorithm; and (9) using the z-value-based density peak value clustering algorithm to cluster big data in a Hadoop environment. By exploiting the properties of the z value and applying a filtering strategy during data interaction among subgroups, the algorithm avoids a large number of ineffective distance calculations, reduces data transmission cost, and effectively improves execution efficiency.
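
Steps (5) and (6) can be sketched as follows. This excerpt does not define the z value, so the sketch assumes it is a Z-order (Morton) code obtained by interleaving the bits of quantized coordinates, and it assumes the subgroup quantiles are equal-count cuts of the sorted z values. The names `z_value`, `quantize` and `split_by_z` are hypothetical, and the sampling step is omitted.

```python
def z_value(cell, bits):
    """Interleave the bits of integer grid coordinates into one Morton code."""
    z = 0
    dims = len(cell)
    for b in range(bits):
        for d, c in enumerate(cell):
            # Bit b of dimension d lands at position b * dims + d.
            z |= ((c >> b) & 1) << (b * dims + d)
    return z

def quantize(point, lo, hi, bits):
    """Map real coordinates in [lo, hi] to integer cells in [0, 2**bits - 1]."""
    top = (1 << bits) - 1
    return tuple(min(top, int((x - lo) / (hi - lo) * (top + 1))) for x in point)

def split_by_z(points, n_groups, lo=0.0, hi=1.0, bits=8):
    """Sort points by z value and cut the order at equal-count quantiles."""
    ordered = sorted(points, key=lambda p: z_value(quantize(p, lo, hi, bits), bits))
    n = len(ordered)
    cuts = [n * k // n_groups for k in range(n_groups + 1)]
    return [ordered[cuts[k]:cuts[k + 1]] for k in range(n_groups)]

# Hypothetical usage: 10 points in the unit square split into 4 groups.
sample = [(i / 10, (i * 7 % 10) / 10) for i in range(10)]
groups = split_by_z(sample, 4)
print([len(g) for g in groups])
```

Cutting the sorted order at quantiles keeps group sizes within one point of each other regardless of how the data is distributed, which is the load-balancing property that random seeds do not guarantee.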

Description

Technical field

[0001] The invention relates to the field of big data processing and to distributed density clustering algorithms, in particular to a distributed density peak clustering algorithm based on z value.

Background technique

[0002] Cluster analysis is one of the widely studied problems in the fields of data mining and pattern recognition. The density peak clustering algorithm (Density Peaks Clustering, DPC) published in the academic journal "Science" is a typical density-based clustering algorithm. The algorithm clusters the data set according to the property that each cluster has a density maximum point. The algorithm can find clusters of any shape and does not depend on the dimension of the data set; the implementation of the algorithm only needs to calculate two attribute values of each point: (1) the density value ρ (by a certain range (2) repulsion group value δ (characterized by the minimum value of the distance from a point whose density value is...
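
Because the paragraph above is cut off, the following sketch relies on the commonly published DPC formulation rather than on this patent's text: ρ counts the points closer than a cut-off distance, and δ is the distance to the nearest point of higher density, with the largest distance used for a point that has no denser neighbour. The function name `dpc_rho_delta`, the sample points and the cut-off value are assumptions for illustration, and the brute-force pairwise distances are only practical for small inputs.

```python
import math

def dpc_rho_delta(points, d_c):
    """Return (rho, delta) for each point under the standard DPC definitions."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # rho_i: number of other points strictly closer than the cut-off distance d_c.
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        # Distances from point i to every point with strictly higher density.
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # A point with no denser neighbour takes its largest distance instead.
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# Hypothetical data: a small cluster around (0.5, 0.5) and one distant point.
pts = [(0, 0), (0, 1), (1, 0), (0.5, 0.5), (10, 10)]
rho, delta = dpc_rho_delta(pts, d_c=0.8)
for p, r, d in zip(pts, rho, delta):
    print(p, r, round(d, 3))
```

Points with both a high ρ and a high δ are candidate cluster centres, while a low ρ combined with a high δ marks a likely outlier.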


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F17/30
CPC: G06F16/285, G06F18/23
Inventor: 段勇, 卢晶
Owner: SHENYANG POLYTECHNIC UNIV