Mass data clustering analysis method and device

A clustering analysis technology for massive data, applied in the field of data analysis. It addresses the inability of traditional density clustering algorithms to analyze massive data, while ensuring load balancing and improving computing efficiency.

Inactive Publication Date: 2020-01-21
CHENGDU SEFON SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to provide a massive data clustering analysis method and device, which solves the problem that with the development of the era of big data, the characteristics of data and the amount of d



Examples


Example Embodiment

[0058] Example 1

[0059] A method for clustering analysis of massive data includes the following steps:

[0060] S1. Process the original data with the GeoHash coding algorithm based on overlapping partitions, and determine the partition corresponding to each data point in the original data;

[0061] S2. Within each partition, cluster the data in parallel and save the local cluster IDs;

[0062] S3. Merge the partition results to obtain the global cluster IDs.
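Step S3 can be illustrated with a minimal sketch. The excerpt does not spell out the merging rule, so the code assumes the common approach for overlap-partitioned clustering: a point that lies in an overlap region is clustered in several partitions, and any local clusters that contain the same point are unified with a union-find structure. The function name `merge_clusters` and the `(point_id, partition_id, local_cluster_id)` input format are hypothetical, and noise handling is omitted.

```python
from typing import Dict, Hashable, Iterable, Tuple

ClusterKey = Tuple[int, int]  # (partition_id, local_cluster_id)


def merge_clusters(
    labels: Iterable[Tuple[Hashable, int, int]]
) -> Dict[Hashable, int]:
    """Map each point ID to a global cluster ID.

    ASSUMPTION: local clusters sharing at least one point are merged.
    A stricter rule (e.g. only when the shared point is a core point)
    may be what the patent intends; the excerpt does not say.
    """
    parent: Dict[ClusterKey, ClusterKey] = {}

    def find(key: ClusterKey) -> ClusterKey:
        parent.setdefault(key, key)
        while parent[key] != key:
            parent[key] = parent[parent[key]]  # path halving
            key = parent[key]
        return key

    def union(a: ClusterKey, b: ClusterKey) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    first_seen: Dict[Hashable, ClusterKey] = {}
    for point_id, partition_id, cluster_id in labels:
        key = (partition_id, cluster_id)
        find(key)  # register the local cluster
        if point_id in first_seen:
            # Overlap point: seen in another partition, so link clusters.
            union(first_seen[point_id], key)
        else:
            first_seen[point_id] = key

    # Number the merged clusters 0, 1, 2, ... in order of first appearance.
    global_ids: Dict[ClusterKey, int] = {}
    result: Dict[Hashable, int] = {}
    for point_id, key in first_seen.items():
        root = find(key)
        result[point_id] = global_ids.setdefault(root, len(global_ids))
    return result
```

For example, if point `"b"` is labelled in both partition 0 and partition 1, the two local clusters that contain it receive the same global ID, while an unrelated local cluster keeps its own.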

[0063] The purpose of the present invention is to realize a DBSCAN algorithm based on parallel computing and to solve the problem that the traditional density clustering algorithm cannot analyze massive data. The invention proposes efficient overlapping-partition and cluster-merging strategies that split data and merge clusters quickly. The method fully considers load balancing and can operate efficiently in a distributed framework, thereby supporting the clustering of massive data...

Example Embodiment

[0064] Example 2

[0065] This embodiment builds on Embodiment 1. The GeoHash coding based on overlapping partitions is named the OverLap-GeoHash algorithm. During the execution of the entire algorithm, the DBSCAN step has the highest time complexity and space complexity. According to the barrel principle (the slowest partition determines the overall running time), the data needs to be divided into parts that are as equal in size as possible in order to ensure the efficiency of parallel clustering.
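The load-balancing concern can be made concrete with a small sketch. It computes the ratio between the largest partition and the mean partition size; a value close to 1.0 means the data is split almost evenly. The function name `imbalance` and the choice of this particular ratio are illustrative assumptions, not part of the patent text.

```python
from collections import Counter
from typing import Hashable, Iterable


def imbalance(partition_ids: Iterable[Hashable]) -> float:
    """Return max partition size divided by mean partition size.

    ASSUMPTION: this ratio is only one possible skew measure; the
    source does not define how balance is evaluated.
    """
    sizes = Counter(partition_ids)
    if not sizes:
        raise ValueError("no partition IDs given")
    mean_size = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / mean_size
```

Because the parallel clustering step finishes only when its largest partition finishes, a ratio of 1.5 suggests that step takes roughly 1.5 times longer than it would under a perfectly even split, if cost grew linearly with partition size.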

[0066] The GeoHash algorithm is a spatial encoding algorithm, often used for two-dimensional latitude and longitude data, which maps latitude and longitude to a one-dimensional value or string. The present invention extends it to multi-dimensional data and combines it with the overlap partition strategy. Each data point is mapped to a one-dimensional value, which is the ID code of its partition. If the point to be coded is an overlap point, it is mapped to multiple values, each value corr...
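The idea that an overlap point maps to several partition IDs can be sketched as follows. For brevity the sketch uses a uniform grid as a stand-in for GeoHash cells, and it treats a point as an overlap point when it lies within a margin `eps` of a cell boundary. Both the grid and the parameters `cell_size` and `eps` are assumptions made for illustration; the excerpt is truncated before it defines the actual overlap rule.

```python
from itertools import product
from math import floor
from typing import List, Sequence, Tuple


def overlap_partitions(
    point: Sequence[float], cell_size: float, eps: float
) -> List[Tuple[int, ...]]:
    """Return every grid cell that the box [x - eps, x + eps] touches.

    An interior point yields exactly one cell; a point near a cell
    boundary yields several cells, i.e. several partition IDs.
    ASSUMPTION: uniform grid cells stand in for GeoHash cells.
    """
    per_dimension = []
    for x in point:
        first = floor((x - eps) / cell_size)
        last = floor((x + eps) / cell_size)
        per_dimension.append(range(first, last + 1))
    return [tuple(cell) for cell in product(*per_dimension)]
```

With `cell_size=1.0` and `eps=0.1`, the point `(0.5, 0.5)` belongs only to cell `(0, 0)`, whereas `(0.95, 0.5)` belongs to both `(0, 0)` and `(1, 0)` and would therefore be clustered in both partitions.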

Example Embodiment

[0075] Example 3

[0076] This embodiment builds on Embodiment 2. In step S1, the method by which the GeoHash encoding algorithm processes the original data and determines the partition corresponding to each data point includes the following steps:

[0077] S101. Initialize the Hash value to binary 0 and the iteration count to 0; take as given the total number of iterations N and the upper and lower bounds of each dimension;

[0078] S102. For any data point D, select the dimension whose index is the iteration count modulo the number of dimensions. When the value of D in that dimension is not greater than the midpoint of the dimension's upper and lower bounds, shift the Hash value left by one bit, update the upper bound of the dimension to that midpoint, and increment the iteration count by 1; when the value of the data D in this dimension is greater than the midpoint of the upper and lower bounds of the dimens...
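Steps S101 and S102 can be sketched in a few lines. The "not greater than the midpoint" branch follows the text directly. The source is truncated before it describes the other branch, so the code assumes the standard GeoHash behaviour there: shift left, set the lowest bit to 1, and raise the lower bound to the midpoint. The function name `geohash_encode` and its parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def geohash_encode(
    point: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
    n_iter: int,
) -> int:
    """Encode a multi-dimensional point as an integer partition ID."""
    lower = [b[0] for b in bounds]
    upper = [b[1] for b in bounds]
    code = 0  # S101: the Hash value starts as binary 0

    for iteration in range(n_iter):  # S101: N iterations in total
        dim = iteration % len(point)  # S102: iteration mod dimension count
        mid = (lower[dim] + upper[dim]) / 2.0
        if point[dim] <= mid:
            # S102: shift left and shrink the upper bound to the midpoint.
            code = code << 1
            upper[dim] = mid
        else:
            # ASSUMPTION: the source is cut off here; this mirrors
            # standard GeoHash (append a 1 bit, raise the lower bound).
            code = (code << 1) | 1
            lower[dim] = mid
    return code
```

With bounds `[(0.0, 1.0), (0.0, 1.0)]` and four iterations, the point `(0.1, 0.9)` yields the bits `0101`, that is, the integer 5. Points that share a code fall in the same cell and hence the same partition.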


Abstract

The invention discloses a mass data clustering analysis method and device, which aim to realize a DBSCAN algorithm based on parallel computing and to solve the problem that a traditional density clustering algorithm cannot perform mass data analysis. The invention provides an efficient overlapping-partition and cluster-merging strategy with which data splitting and cluster merging can be carried out quickly. The method fully considers load balancing and can operate efficiently under a distributed framework. Clustering of mass data is therefore supported, and the problem that mass data analysis cannot be conducted with a traditional DBSCAN is efficiently solved, so the method has high performance and practical value.

Description

Technical field

[0001] The invention relates to the field of data analysis, in particular to a massive data clustering analysis method and device.

Background technique

[0002] With the development of the social economy and the popularization of telephones and the Internet, the rate of telecommunication fraud continues to rise, and because telecommunication fraud relies on the means of communication at the border, the scope of the social harm it causes is wider. Unlike general criminal cases, telecom fraud has a certain threshold and is usually committed by gangs. Therefore, identifying criminal gangs through suspects’ phone calls and network behavior data has become an effective way for public security organs to curb telecom fraud.

[0003] With the advent of the era of big data, data mining has become a powerful tool in the field of public security. Through data mining to mine the data distribution rules of criminal suspects, t...


Application Information

IPC(8): G06F16/906, G06F16/901
CPC: G06F16/9014, G06F16/9024, G06F16/906
Inventor: 查文宇, 曾理, 徐浩, 王纯斌, 赵神州, 张艳清
Owner: CHENGDU SEFON SOFTWARE CO LTD