Improved k-means clustering method based on distributed computing platform

A k-means clustering method that draws on distributed computing and clustering technology. It addresses three problems: cluster centers that are not guaranteed to be relatively evenly distributed, the randomness of initial-center selection, and distances among the selected centers that are not guaranteed to be similar.

Inactive Publication Date: 2016-12-07
SHANGHAI LINGKE SAFETY GUARD TECH
0 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

Prior-art methods select the candidate point set with the maximum minimum-spanning-tree weight sum, but they do not ensure that the cluster centers are relatively evenly distributed. To address this, the present invention sets a threshold: if the distances between the points in the selected cluster-center point set cannot be guaranteed to be similar, that set is removed, and the point set with the largest minimum-spanning-tree weight sum among the remaining sets is selected as the cluster-center point set. The distances between its points are then checked in the same way, and the process is repeated as needed. This can effectively reduce the number of iterations.
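The threshold-based fallback described above can be sketched as follows. This is a minimal illustration, not the patented procedure: the source does not state how "similar distances" is measured, so the test used here (longest minus shortest minimum-spanning-tree edge no greater than the threshold) is an assumption, and the names `Candidate`, `edges_are_similar`, and `select_centers` are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical container: (indices of the candidate centers, weights of their MST edges).
Candidate = Tuple[Sequence[int], Sequence[float]]


def edges_are_similar(edge_weights: Sequence[float], threshold: float) -> bool:
    """Assumed uniformity test: longest and shortest MST edge differ by at most `threshold`."""
    if len(edge_weights) < 2:
        return True  # zero or one edge is trivially uniform
    return max(edge_weights) - min(edge_weights) <= threshold


def select_centers(candidates: List[Candidate], threshold: float) -> Optional[Sequence[int]]:
    """Try candidates from largest to smallest MST weight sum; return the first that passes."""
    ranked = sorted(candidates, key=lambda c: sum(c[1]), reverse=True)
    for centers, edge_weights in ranked:
        if edges_are_similar(edge_weights, threshold):
            return centers
    return None  # no candidate satisfied the threshold


# Made-up candidates for illustration only.
candidates = [
    ([0, 5, 9], [1.0, 9.0]),  # weight sum 10.0, but edges are very uneven
    ([2, 4, 7], [4.0, 5.0]),  # weight sum 9.0, edges are similar
    ([1, 3, 8], [2.0, 2.5]),  # weight sum 4.5, edges are similar
]
chosen = select_centers(candidates, threshold=2.0)
```

With these made-up values the heaviest candidate is rejected because its edges differ by 8.0, so the second-heaviest set is chosen. Sorting once and scanning is equivalent to repeatedly removing the current maximum and re-selecting from the remainder.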
Prior-art two-pass selection does not handle randomness well. The present invention selects the initial centers using Kruskal's algorithm and repeats the selection n times to reduce the effects of randomness as far as possible, which further reduces the number of iterations.
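As a rough sketch of the repeated selection, the code below draws k random points n times, computes the minimum-spanning-tree weight sum of each draw with Kruskal's algorithm, and keeps the draw with the largest sum. Euclidean edge weights, the fixed random seed, and the function names `kruskal_mst_weight` and `best_of_n_draws` are assumptions made for illustration; the source does not specify them.

```python
import itertools
import math
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def kruskal_mst_weight(points: List[Point]) -> float:
    """Weight sum of the minimum spanning tree over the complete graph on `points`."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by (assumed) Euclidean length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
    )
    total = 0.0
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two components, so it belongs to the MST
            parent[root_i] = root_j
            total += weight
    return total


def best_of_n_draws(data: List[Point], k: int, n: int, seed: int = 0) -> Tuple[List[Point], float]:
    """Draw k points n times and keep the draw with the largest MST weight sum."""
    rng = random.Random(seed)
    draws = [rng.sample(data, k) for _ in range(n)]
    best = max(draws, key=kruskal_mst_weight)
    return best, kruskal_mst_weight(best)


triangle_weight = kruskal_mst_weight([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)])
data = [(float(x), float(y)) for x in range(5) for y in range(5)]
centers, weight = best_of_n_draws(data, k=3, n=10)
```

For the 3-4-5 right triangle the tree keeps the two shorter edges, so `triangle_weight` is 7.0. A larger weight sum means the sampled points are more spread out, which is why the maximum is preferred as a starting set.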

Examples

Embodiment Construction

[0077] The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments only illustrate the invention and are not intended to limit its scope. After reading this disclosure, modifications by those skilled in the art to various equivalent forms of the invention all fall within the scope defined by the appended claims of the present application.

[0078] Basic idea: The present invention is an improved k-means clustering method based on a distributed computing platform. Because the k-means algorithm selects its initial centers at random, the final cluster centers tend to converge to a local optimum. In addition, massive data is processed slowly, too many iterations are needed, and the relationships between vectors are not considered. The distributed computing platform Spark is therefore introduced to solve the p...

Abstract

The invention discloses an improved k-means clustering method based on a distributed computing platform. It introduces the distributed computing platform Spark to address the slow processing of massive data, Kruskal's algorithm to address the excessive number of iterations, and the Tanimoto distance to address the failure to consider correlation among the features of a vector. First, a minimum spanning tree is constructed over k randomly selected points using Kruskal's algorithm and its weight sum is obtained; this process is repeated n times. Then, from the n weight sums, the maximum is selected, subject to the condition that the lengths of the edges formed by the k points do not differ much. In this way, a relatively uniform distribution of cluster centers can be guaranteed. Finally, clustering is performed using the k-means algorithm improved with the Tanimoto distance.
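The final clustering step can be illustrated with a single k-means iteration that uses a Tanimoto distance in place of the Euclidean one. The abstract does not give the formula, so this sketch assumes the common real-valued form, 1 - (a·b) / (|a|² + |b|² - a·b); the function names and sample data are hypothetical, and the code runs on one machine rather than on Spark.

```python
from typing import List, Sequence

Vector = Sequence[float]


def tanimoto_distance(a: Vector, b: Vector) -> float:
    """Assumed form: 1 - (a.b) / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    if denom == 0.0:  # both vectors are all zeros; treat them as identical
        return 0.0
    return 1.0 - dot / denom


def assign(points: List[Vector], centers: List[Vector]) -> List[int]:
    """Label each point with the index of its nearest center under Tanimoto distance."""
    return [
        min(range(len(centers)), key=lambda c: tanimoto_distance(p, centers[c]))
        for p in points
    ]


def update(points: List[Vector], labels: List[int], centers: List[Vector]) -> List[List[float]]:
    """Move each center to the mean of its members; keep it unchanged if it has none."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centers.append(list(old))
            continue
        new_centers.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(old))]
        )
    return new_centers


# Made-up data: two points near the x-axis, two near the y-axis.
points = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]
labels = assign(points, centers)
centers = update(points, labels, centers)
```

In a full run, `assign` and `update` would alternate until the labels stop changing, starting from the centers chosen by the minimum-spanning-tree selection.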

Description

Technical field

[0001] The invention relates to an improved k-means clustering method suited to the distributed computing platform Spark in machine learning, and belongs to the technical field of data mining.

Background

[0002] The rapid development of Internet and information technology has led to a sharp increase in information resources, causing a serious information-overload problem. How to extract hidden, useful information from massive data has attracted growing attention, and machine learning technology has emerged in response. Cluster analysis is an important part of it: it groups a collection of abstract or physical objects into multiple classes, so that objects in the same class are highly similar and objects in different classes are as dissimilar as possible. In the field of machine learning, partition-type clustering, density-type clustering, and network-type clustering algorithms...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06K9/62, G06F17/30
CPC: G06F16/35, G06F2216/03, G06F18/23213
Inventor: 纪小展, 张成, 徐平平, 戴磊
Owner: SHANGHAI LINGKE SAFETY GUARD TECH