Clustering method and system based on big data parallel computation

A parallel computing and clustering technology, applied in computing, relational databases, and database models, that addresses problems such as unstable initial point selection, heavy computational load, and a tendency to fall into local optima.

Inactive Publication Date: 2017-12-08
GUANGZHOU TEDAO INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0002] Existing article clustering techniques such as k-means, hierarchical clustering, SOM, and FCM all classify and merge articles based on word frequency and probability, and they introduce errors that cannot be controlled. In k-means, the initial points are selected at random, so the selection is unstable and the clustering results are unstable as well. Hierarchical clustering does not require the number of categories to be fixed in advance, but once a split or merge has been performed it cannot be undone, which limits clustering quality. The FCM clustering algorithm is sensitive to the initial cluster centers, requires the number of clusters to be set manually, and easily falls into a local optimum. The SOM clustering algorithm has a strong theoretical connection to how the brain actually processes information, but its processing time is long, and further research is needed to adapt it to large databases. Moreover, existing article clustering techniques have no parallel computing version: they obtain weights from a general probability model, which gives a relatively high error, and they use global feature vectors for association, which requires a huge amount of computation.
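The sensitivity of k-means to its initial points can be illustrated with a small sketch. This is not code from the patent: the function names `kmeans_1d` and `sse`, the one-dimensional toy data, and the two hand-picked starting configurations are all assumptions chosen so the effect is easy to follow.

```python
def kmeans_1d(points, centers, max_iters=100):
    """Plain Lloyd's k-means on 1-D data, starting from the given centers."""
    centers = [float(c) for c in centers]
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return centers


def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


# Toy data with three obvious groups (illustrative only).
points = [1, 2, 3, 11, 12, 13, 21, 22, 23]

poor_start = kmeans_1d(points, [1, 2, 3])     # all starts inside one group
good_start = kmeans_1d(points, [2, 12, 22])   # one start per group

print(poor_start, sse(points, poor_start))
print(good_start, sse(points, good_start))
```

On this toy data the first start converges to a much worse partition than the second, even though the algorithm and the data are identical. This is the kind of instability the paragraph above attributes to random initial point selection.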


Embodiment Construction

[0054] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0055] Referring to Figure 1, the present invention provides a clustering method based on big data parallel computing, comprising the following steps:

[0056] S100: receive the data to be aggregated, which is collected in parallel by multiple threads of a large cluster.

[0057] In this embodiment of the present invention, referring to Figure 2, the first data collection end, the second data collection end and the third data collection end carry out t...
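Step S100 can be sketched roughly as follows. The source text is truncated here, so this is only an illustration under assumptions: the `collect` function, the source names `end1` to `end3`, and the generated records are hypothetical stand-ins for the three data collection ends, and a single-machine thread pool stands in for the threads of a large cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def collect(source_name):
    """Hypothetical collector: returns the records gathered from one source."""
    # Stand-in data; a real collection end would fetch articles from its source.
    return [f"{source_name}-record-{i}" for i in range(3)]


def receive_parallel(source_names):
    """Run one collector per source on its own thread and merge the results."""
    with ThreadPoolExecutor(max_workers=len(source_names)) as pool:
        # map() yields the batches in the same order as source_names.
        batches = pool.map(collect, source_names)
        return [record for batch in batches for record in batch]


to_aggregate = receive_parallel(["end1", "end2", "end3"])
print(len(to_aggregate))
```

Each collection end runs independently, and the receiving side only merges the batches, so adding another source does not change the receiving logic.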


Abstract

The invention discloses a clustering method based on big data parallel computation. The method comprises the following steps: receiving data to be aggregated that is collected in parallel by multiple threads of a large cluster; saving the data to be aggregated in a first database; extracting the data features of the data to be aggregated, having multiple threads call cluster models in parallel at the same time, independently computing and analyzing the cluster class of the data in a distributed manner, and aggregating data of the same class; saving the aggregated data in a second database; and storing the aggregated data in memory and building a cluster data index. The invention also discloses a clustering system based on big data parallel computation. Text fingerprints are located accurately, dimension reduction is simple, and the accuracy of cluster topics is improved.
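The sequence of steps in the abstract can be sketched end to end as below. This is a minimal illustration, not the patented implementation: the abstract does not specify the feature extraction or the cluster model, so `extract_feature` (most frequent word) and `cluster_model` are hypothetical stand-ins, plain dictionaries stand in for the first and second databases, and the index layout (document ID to class) is an assumption.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor


def extract_feature(text):
    """Stand-in feature: the most frequent lowercase word in the text."""
    return Counter(text.lower().split()).most_common(1)[0][0]


def cluster_model(item):
    """Stand-in cluster model: the extracted feature is used as the class."""
    doc_id, text = item
    return doc_id, extract_feature(text)


def run_pipeline(docs):
    # Save the data to be aggregated in the "first database" (a dict here).
    first_db = dict(docs)
    # Multiple threads call the cluster model in parallel, one call per document.
    with ThreadPoolExecutor(max_workers=4) as pool:
        labels = list(pool.map(cluster_model, first_db.items()))
    # Same-class aggregation, saved in the "second database" (another dict).
    second_db = defaultdict(list)
    for doc_id, label in labels:
        second_db[label].append(doc_id)
    # Keep an in-memory index from document ID to its class for fast lookup.
    index = {doc_id: label for doc_id, label in labels}
    return first_db, dict(second_db), index


docs = {
    "a1": "rain rain sun",
    "a2": "stock market stock",
    "a3": "rain cloud rain",
}
first_db, second_db, index = run_pipeline(docs)
print(second_db)
print(index)
```

Because each document is classified independently, the per-document calls can be spread across threads without coordination. Only the aggregation step needs to see all results.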

Description

Technical field

[0001] The invention relates to the field of text mining and automatic clustering, and in particular to a clustering method and system based on big data parallel computing.

Background

[0002] Existing article clustering techniques such as k-means, hierarchical clustering, SOM, and FCM all classify and merge articles based on word frequency and probability, and they introduce errors that cannot be controlled. In k-means, the initial points are selected at random, so the selection is unstable and the clustering results are unstable as well. Hierarchical clustering does not require the number of categories to be fixed in advance, but once a split or merge has been performed it cannot be undone, which limits clustering quality. The FCM clustering algorithm is sensitive to the initial cluster centers, requires the number of clusters to be set manually, and easily falls into a local optimum; the SOM clustering al...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/27, G06F16/2272, G06F16/285
Inventor: 晋彤, 李永康
Owner: GUANGZHOU TEDAO INFORMATION TECH CO LTD