Mass data clustering analysis method and device
A cluster analysis technique for massive data, applied in the field of data analysis. It addresses problems such as the inability to identify, and achieves the effects of ensuring load balancing and improving computing efficiency.
Example Embodiment
[0058] Example 1
[0059] A method for cluster analysis of massive data includes the following steps:
[0060] S1. Process the original data with the GeoHash encoding algorithm based on overlapping partitions, and determine the partition corresponding to each data item in the original data;
[0061] S2. Within each partition, cluster the data of that partition in parallel, and save the cluster IDs;
[0062] S3. Merge the partition results to obtain the global cluster IDs.
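Step S3 can be illustrated with a minimal sketch. The source does not spell out the merging rule here, so the following assumes one common approach: two local clusters are merged whenever the same point (an overlap point assigned to several partitions) belongs to both. The function name `merge_local_clusters` and the record layout `(point_id, partition_id, local_cluster_id)` are hypothetical, and noise points are assumed to have been filtered out beforehand.

```python
from typing import Dict, Hashable, Iterable, Tuple

LocalCluster = Tuple[int, int]  # (partition_id, local_cluster_id)


def merge_local_clusters(
    records: Iterable[Tuple[Hashable, int, int]]
) -> Dict[Hashable, int]:
    """Map each point ID to a global cluster ID.

    Assumption (not from the source): local clusters that share at least
    one point are treated as the same global cluster. A stricter variant
    would require the shared point to be a DBSCAN core point.
    """
    parent: Dict[LocalCluster, LocalCluster] = {}

    def find(x: LocalCluster) -> LocalCluster:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: LocalCluster, b: LocalCluster) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller key as root so the result is deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    rows = list(records)
    first_seen: Dict[Hashable, LocalCluster] = {}
    for point_id, partition_id, local_id in rows:
        key = (partition_id, local_id)
        find(key)  # register the local cluster
        if point_id in first_seen:
            # The point lies in an overlap region: link both local clusters.
            union(first_seen[point_id], key)
        else:
            first_seen[point_id] = key

    global_ids: Dict[LocalCluster, int] = {}
    result: Dict[Hashable, int] = {}
    for point_id, partition_id, local_id in rows:
        root = find((partition_id, local_id))
        result[point_id] = global_ids.setdefault(root, len(global_ids))
    return result


if __name__ == "__main__":
    # Point "b" appears in partitions 0 and 1, so their clusters are merged.
    rows = [("a", 0, 0), ("b", 0, 0), ("b", 1, 0), ("c", 1, 0), ("d", 1, 1)]
    print(merge_local_clusters(rows))
```

Union-find is used here because chains of overlaps (partition 0 with 1, 1 with 2, and so on) collapse into one global cluster without repeated passes over the data.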
[0063] The purpose of the present invention is to realize a DBSCAN algorithm based on parallel computing and to solve the problem that traditional density clustering algorithms cannot analyze massive data. The invention proposes efficient overlapping partitioning and cluster merging strategies that can quickly split the data and merge the clusters. The method fully considers load balancing and can run efficiently in a distributed framework, thereby supporting massive data clustering...
Example Embodiment
[0064] Example 2
[0065] This embodiment builds on Embodiment 1. The GeoHash encoding based on overlapping partitions is named the OverLap-GeoHash algorithm. Across the whole procedure, the DBSCAN step has the highest time complexity and space complexity. By the barrel principle (the slowest partition determines the overall running time), the data should be divided as evenly as possible to keep parallel clustering efficient.
[0066] The GeoHash algorithm is a spatial encoding algorithm, often used for two-dimensional latitude and longitude data, which maps such data to a one-dimensional value or string. This invention extends it to multi-dimensional data and combines it with the overlapping partition strategy, with certain improvements. Each data item is mapped to a one-dimensional value, which is the ID code of its partition. If the point to be encoded is an overlap point, it is mapped to multiple values, each value corr...
Example Embodiment
[0075] Example 3
[0076] This embodiment builds on Embodiment 2. In step S1, processing the original data with the GeoHash encoding algorithm and determining the partition corresponding to each data item in the original data includes the following steps:
[0077] S101. Initialize the Hash value to the binary number 0 and the iteration count to 0; the total number of iterations N and the upper and lower bounds of each dimension are given;
[0078] S102. For any data item D, the selected dimension is the iteration count modulo the number of dimensions. When the value of D in that dimension is not greater than the midpoint of the dimension's upper and lower bounds, the Hash value is shifted left by one bit, the upper bound of the dimension is updated to that midpoint, and the iteration count is incremented by 1; when the value of D in that dimension is greater than the midpoint of the upper and lower bounds of the dimens...
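The iteration in S101 and S102 can be sketched as follows. The text above is cut off before the "greater than" branch is fully described, so that branch follows the standard GeoHash convention as an assumption: shift left, set the lowest bit to 1, and raise the lower bound to the midpoint. The function name `geohash_partition_id` is hypothetical, and the overlap handling of OverLap-GeoHash is not modelled.

```python
from typing import Sequence


def geohash_partition_id(
    point: Sequence[float],
    lower: Sequence[float],
    upper: Sequence[float],
    n_iterations: int,
) -> int:
    """Encode a multi-dimensional point as an integer partition ID."""
    lo = list(lower)   # working copy of the per-dimension lower bounds
    hi = list(upper)   # working copy of the per-dimension upper bounds
    hash_value = 0     # S101: Hash value starts as binary 0
    n_dims = len(point)

    for iteration in range(n_iterations):
        dim = iteration % n_dims           # S102: cycle through dimensions
        mid = (lo[dim] + hi[dim]) / 2.0
        hash_value <<= 1                   # shift left by one bit
        if point[dim] <= mid:
            hi[dim] = mid                  # keep the lower half
        else:
            # Assumption: standard GeoHash behaviour for the upper half.
            hash_value |= 1
            lo[dim] = mid
    return hash_value


if __name__ == "__main__":
    # Unit square, four iterations: two bits per dimension, interleaved.
    print(bin(geohash_partition_id((0.2, 0.9), (0.0, 0.0), (1.0, 1.0), 4)))
```

With N iterations the space is split into at most 2^N cells, so N controls how finely the data is partitioned before the parallel clustering step.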