
OPTICS point-ordering clustering method based on the Spark in-memory computing big data platform

A big data platform and data technology, applied in computing, electrical digital data processing, and special data processing applications. It addresses problems such as memory overflow, long running times, and failures that stop the job from running, and has the effect of improving computing efficiency.

Active Publication Date: 2017-05-17
CHONGQING UNIV OF POSTS & TELECOMM
Cites: 5 | Cited by: 25

AI Technical Summary

Problems solved by technology

[0007] The technical problem addressed by the present invention is that prior-art clustering, when processing large batches of data, is prone to memory overflow, takes too long, and may crash and fail to run. The invention proposes an OPTICS (Ordering Points To Identify the Clustering Structure) algorithm built on the Spark in-memory computing big data platform, which can handle large data sets and obtain the cluster ordering in a very short time.



Examples


Embodiment Construction

[0037] An RDD (Resilient Distributed Dataset) is a fault-tolerant collection that can be distributed across the nodes of a cluster and operated on in parallel through functional-style operations on the collection. It provides a read-only shared-memory abstraction: an RDD can only be produced by transforming existing RDDs, and its data is then loaded into memory so that it can be reused many times. An RDD is distributed, so it can be spread over multiple machines for computation, and it is elastic: when memory is insufficient during a computation, data is exchanged with the disk. Because Spark is based on in-memory computation, most Spark-based clustering methods first divide the data set, aggregate the sample points into small classes within each small data set, and then repeatedly merge the small data sets, aggregating the small classes to obtain the final large classes. I...
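The divide, cluster locally, then merge pattern described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the patented method: the data is one-dimensional, the local clustering is simple gap-based chaining, and the function names (`split_into_partitions`, `cluster_partition`, `merge_partitions`, `divide_cluster_merge`) are hypothetical. On Spark, the per-partition step would run as a parallel operation over the partitions of an RDD.

```python
from typing import List


def split_into_partitions(points: List[float], n_parts: int) -> List[List[float]]:
    """Sort the points and cut them into at most n_parts contiguous chunks."""
    ordered = sorted(points)
    if not ordered:
        return []
    size = -(-len(ordered) // n_parts)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def cluster_partition(part: List[float], eps: float) -> List[List[float]]:
    """Local step: chain consecutive points whose gap is at most eps."""
    clusters: List[List[float]] = []
    for p in part:
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def merge_partitions(local: List[List[List[float]]], eps: float) -> List[List[float]]:
    """Merge step: join clusters that touch across a partition boundary."""
    merged: List[List[float]] = []
    for clusters in local:
        for c in clusters:
            if merged and c[0] - merged[-1][-1] <= eps:
                merged[-1] = merged[-1] + c
            else:
                merged.append(list(c))
    return merged


def divide_cluster_merge(points: List[float], n_parts: int, eps: float) -> List[List[float]]:
    parts = split_into_partitions(points, n_parts)
    # Each partition is clustered independently; this is the step that
    # would be distributed across worker nodes.
    local = [cluster_partition(p, eps) for p in parts]
    return merge_partitions(local, eps)
```

Because the merge step only inspects cluster boundaries, the result in this toy setting does not depend on how many partitions are used, which is the property a partitioned clustering scheme needs.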



Abstract

The invention provides an OPTICS clustering algorithm based on the Spark big data platform and relates to computer information acquisition and processing technology. The data is structurally partitioned in parallel to obtain the optimal partitioning of the data set, and a corresponding RDD is generated. The number of neighbor samples and the core distance of each sample are calculated in parallel, the OPTICS algorithm is executed on the partitions in parallel to obtain the cluster ordering of each partition, and the cluster ordering is persisted. Cluster numbers are assigned within each partition according to the cluster ordering, and the partitions are then merged so that every sample obtains a global cluster number. Spark's distributed parallel technology is used to find the optimal partition structure and to compute the cluster ordering of the partitions in parallel. From the OPTICS cluster ordering, a user can observe the inherent clustering structure of a data set at different structural levels. The method can process large data sets that a serial algorithm cannot handle, and it greatly shortens the time needed to obtain the clustering result.
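The core distance and cluster ordering mentioned in the abstract can be illustrated with the textbook serial OPTICS procedure run on a single partition. The sketch below is an assumption-laden illustration, not the patent's parallel scheme: it uses brute-force neighbor search, Euclidean distance, and hypothetical names (`neighbours`, `core_distance`, `optics_order`), and it omits the partitioning and merging steps, whose details are not given in this excerpt.

```python
import heapq
import math
from typing import List, Optional, Sequence, Tuple

Point = Sequence[float]


def neighbours(data: List[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of point i (the point itself included)."""
    return [j for j in range(len(data)) if math.dist(data[i], data[j]) <= eps]


def core_distance(data: List[Point], i: int, eps: float, min_pts: int) -> Optional[float]:
    """Distance to the min_pts-th nearest neighbour, or None if i is not a core point."""
    dists = sorted(math.dist(data[i], data[j]) for j in neighbours(data, i, eps))
    return dists[min_pts - 1] if len(dists) >= min_pts else None


def optics_order(data: List[Point], eps: float, min_pts: int) -> List[Tuple[int, float]]:
    """Return (index, reachability) pairs in OPTICS order; inf means undefined."""
    n = len(data)
    reach = [math.inf] * n
    processed = [False] * n
    order: List[Tuple[int, float]] = []

    def update(p: int, core_d: float, heap: list) -> None:
        # Lower the reachability of unprocessed neighbours of core point p.
        for o in neighbours(data, p, eps):
            if processed[o]:
                continue
            new_r = max(core_d, math.dist(data[p], data[o]))
            if new_r < reach[o]:
                reach[o] = new_r
                heapq.heappush(heap, (new_r, o))

    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True
        order.append((start, reach[start]))
        core_d = core_distance(data, start, eps, min_pts)
        if core_d is None:
            continue
        heap: list = []
        update(start, core_d, heap)
        while heap:
            _, q = heapq.heappop(heap)
            if processed[q]:
                continue  # stale heap entry left behind by an earlier update
            processed[q] = True
            order.append((q, reach[q]))
            core_q = core_distance(data, q, eps, min_pts)
            if core_q is not None:
                update(q, core_q, heap)
    return order
```

In the resulting ordering, runs of small reachability values correspond to dense regions, and a jump to a large or undefined value marks the start of a new cluster. That is how the ordering exposes cluster structure at different density levels.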

Description

Technical field

[0001] The invention relates to the technical fields of computer data mining and computer information processing.

Background

[0002] With the rapid development of computer information technology, large amounts of data are collected from all aspects of life, and the scale of information on the Internet is growing geometrically. Rapidly analyzing massive data to extract the information hidden in it is becoming increasingly important.

[0003] Cluster analysis is one of the main methods of data analysis. Clustering is the process of grouping data objects so that objects in the same cluster are highly similar, while objects in different clusters are highly dissimilar. Unlike classification, clustering does not rely on predefined classes or class labels; the grouping criteria and the number of groups are unknown during the clustering process. Cluster analysis metho...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/2465
Inventors: 胡峰, 瞿原, 邓维斌, 于洪, 张清华
Owner: CHONGQING UNIV OF POSTS & TELECOMM