Non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data

A clustering method for large-scale data, applied in database models, relational databases, electronic digital data processing, etc.

Inactive Publication Date: 2015-08-19
BEIJING UNIV OF TECH

AI Technical Summary

Problems solved by technology

One data partition is completed each time the data to be processed is scanned once.


Examples


Embodiment Construction

[0022] The present invention will be described in detail below in conjunction with the accompanying drawings and specific embodiments.

[0023] In the following example, the data sequence is D={d1, d2, d3, d4, d5, d6, d7, d8, d9} and there are four known clusters, namely C={C_1={d1, d3, d5}, C_2={d2, d6}, C_3={d4, d9}, C_4={d7, d8}}. The similarity between data within the same cluster is greater than or equal to 0.8, and the similarity between data in different clusters is less than 0.8. To obtain the correct clustering result, the similarity threshold supplied at run time is set to 0.8. The steps for applying the quicksort-based non-recursive clustering method to this data sequence are as follows:

[0024] Step 1: Input the user similarity threshold K=0.8 and the initial data sequence D to be processed containing 9 data samples;

[0025] Step 2: Define the indicator pointers of the head and tail of the data sequence to be processed as start and end respectively, and ass...
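As a rough illustration of the setup in Steps 1 and 2, the sketch below runs a single scan over the example sequence. The `CLUSTER_OF` lookup, the `sim` function with its values 0.9 and 0.1, the initial pointer values, the fixed pivot choice, and the final swap that places the pivot next to its cluster are all assumptions made for illustration; the source does not specify the similarity measure, and the rest of Step 2 is truncated above.

```python
# Hypothetical similarity for the worked example. The source only states that
# same-cluster pairs score >= 0.8 and cross-cluster pairs score < 0.8; the
# concrete values 0.9 and 0.1 are placeholders.
CLUSTER_OF = {
    "d1": 1, "d3": 1, "d5": 1,
    "d2": 2, "d6": 2,
    "d4": 3, "d9": 3,
    "d7": 4, "d8": 4,
}


def sim(a, b):
    return 0.9 if CLUSTER_OF[a] == CLUSTER_OF[b] else 0.1


def one_pass(seq, start, end, pivot_index, k):
    """Partition seq[start..end] in place around seq[pivot_index].

    Returns the index where the pivot ends up: seq[start..index] is the
    cluster, seq[index+1..end] is the data still to be processed.
    """
    # Move the benchmark (pivot) item to the rightmost position.
    seq[pivot_index], seq[end] = seq[end], seq[pivot_index]
    pivot = seq[end]
    boundary = start  # next free slot on the "similar" (left) side
    for scan in range(start, end):
        # ">=" follows the worked example, where same-cluster similarity
        # is greater than or equal to the threshold.
        if sim(seq[scan], pivot) >= k:
            seq[scan], seq[boundary] = seq[boundary], seq[scan]
            boundary += 1
    # Assumption: place the pivot directly after the items similar to it.
    seq[boundary], seq[end] = seq[end], seq[boundary]
    return boundary


D = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9"]
K = 0.8
start, end = 0, len(D) - 1  # assumed initial pointer values
split = one_pass(D, start, end, pivot_index=0, k=K)  # pivot d1, fixed for repeatability
first_cluster = D[start:split + 1]
remaining = D[split + 1:end + 1]
print(first_cluster, remaining)
```

With d1 as the pivot, this pass gathers d1, d3, and d5 into the first cluster and leaves the other six items for later passes.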


Abstract

A non-recursive clustering algorithm based on quicksort (NR-CAQS) suitable for large data belongs to the technical field of data mining. The algorithm uses a two-level loop to cluster the data. Two positioning pointers are defined in advance. In each pass, one benchmark datum is selected at random from the data sequence, treated as the representative of a cluster, and exchanged to the rightmost position of the data to be processed; a scanning pointer is defined and initialized at the same time. The data to be processed are then scanned, the similarity between each remaining datum and the benchmark datum is calculated and compared with the user threshold, and the position of each datum in the sequence is adjusted according to the result: data whose similarity is greater than the user threshold are exchanged to the left side of the sequence, and data whose similarity is less than the user threshold are exchanged to the right side, which completes one data partition. Finally, the positioning pointers are reset to locate the new data to be processed, and control returns to the outer loop, which repeats until the whole data sequence has been clustered. The algorithm is suited to clustering spherical data and large data sets with strict time requirements.
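The procedure in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `nr_caqs`, its parameters, the 1-D similarity used in the usage lines, the use of `>=` for the threshold test, and the final swap that places the benchmark item next to its cluster are all assumptions.

```python
import random


def nr_caqs(data, similarity, threshold, rng=None):
    """Cluster `data` with a two-level loop and no recursion.

    `similarity(a, b)` is a caller-supplied function; the source does not
    fix a particular measure. Returns a list of clusters (lists of items).
    """
    rng = rng or random.Random()
    seq = list(data)              # work on a copy
    clusters = []
    start, end = 0, len(seq) - 1  # the two positioning pointers
    while start <= end:           # outer loop: one cluster per iteration
        p = rng.randint(start, end)          # random benchmark (pivot) item
        seq[p], seq[end] = seq[end], seq[p]  # exchange it to the rightmost slot
        pivot = seq[end]
        boundary = start                     # end of the "similar" region
        for scan in range(start, end):       # inner loop: one scan of the data
            if similarity(seq[scan], pivot) >= threshold:
                seq[scan], seq[boundary] = seq[boundary], seq[scan]
                boundary += 1
        # Assumption: put the pivot right after the items similar to it.
        seq[boundary], seq[end] = seq[end], seq[boundary]
        clusters.append(seq[start:boundary + 1])
        start = boundary + 1                 # reset pointer to the remaining data
    return clusters


# Usage with a hypothetical 1-D similarity in (0, 1].
points = [1.0, 5.0, 1.1, 9.0, 5.2]
clusters = nr_caqs(points, lambda a, b: 1.0 / (1.0 + abs(a - b)), 0.8)
print(clusters)
```

Because every item is compared only with the benchmark item, the result of this sketch can depend on which benchmark is drawn when clusters are not well separated. In the usage lines the three groups are far apart, so any draw yields the same grouping, although the order of the clusters may vary.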

Description

Technical field

[0001] A fast clustering method suitable for large-scale data belongs to the research field of clustering in data mining. In particular, it relates to a clustering method for applications with strict time requirements.

Background technique

[0002] With the popularization of mobile computing technology and the rise of the Internet of Things, massive amounts of data are generated, especially multimedia data such as text, images, and videos. As stated in "IDC Predictions 2014", in 2014 the size of the "digital universe" (all digital information created, copied, and consumed in a year) will continue to expand, growing by more than 50% to about 6 ZB (roughly 6 trillion gigabytes). Analyzing and mining such big data within a reasonable and acceptable time has become the biggest challenge in the IT field. Clustering, or cluster analysis, in the field of data mining is often used for data preprocessing, which is a common form of exploratory data analysis and ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/285
Inventor: 冀俊忠, 高明霞, 宋辰, 刘金铎
Owner: BEIJING UNIV OF TECH