Distributed clustering method for Internet micro-content

A distributed clustering method in the field of Internet technology, applicable to special data processing applications, instruments, electrical digital data processing, etc. It addresses the high maintenance cost and unsatisfactory clustering quality of existing methods, and achieves a wide application range, simple operation, and low maintenance cost.

Inactive Publication Date: 2008-05-14
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

[0003] At present, there are many clustering methods for Internet micro-content, such as the relatively mature Bayesian, KNN, and SVM methods. The Bayesian method needs the support of a specific corpus, its maintenance cost is relatively high, and its clustering quality depends heavily on the scale and quality of that corpus, so it is not an ideal clustering method. The other two methods, KNN and SVM, must first calculate the similarity between all pairs of micro-content items. When the amount of micro-content is massive, for example at the ten-million level, O(10^8), the time required to calculate the similarity between all micro-content items is on the order of O(10^16), which is clearly unacceptable to users, so these are not ideal clustering methods either.
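The scale argument above can be checked with a little arithmetic. The sketch below counts similarity computations for an all-pairs approach and, for comparison, for the case where the same items are first split into small files and only compared within each file. The function names and the file size of 10^4 records are illustrative assumptions, not figures from the patent, and the second count ignores the cost of the later merging steps.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def split_comparisons(n: int, records_per_file: int) -> int:
    """Pairs compared when n items are cut into files of a fixed size
    and similarities are only computed inside each file (assumption)."""
    full_files, remainder = divmod(n, records_per_file)
    return (full_files * pairwise_comparisons(records_per_file)
            + pairwise_comparisons(remainder))


if __name__ == "__main__":
    n = 10**8                      # scale quoted in paragraph [0003]
    per_file = 10**4               # hypothetical file size, for illustration
    print(f"all pairs:    {pairwise_comparisons(n):.3e}")
    print(f"within files: {split_comparisons(n, per_file):.3e}")
```

For n = 10^8 the all-pairs count is about 5 x 10^15, which matches the O(10^16) order of magnitude given above; restricting comparisons to files of 10^4 records brings the count down to about 5 x 10^11.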



Embodiment Construction

[0033] In a clustering application system for Internet micro-content, the distributed clustering method provided by the present invention can be used to cluster massive amounts of micro-content quickly and accurately. Taking a blog comment spam clustering system as an example, the specific implementation steps are as follows:

[0034] 1) The main control machine first splits the blog comment source file into multiple small source data files. The specific process is as follows:

[0035] The large input blog comment source file is written into multiple small files with a fixed number of records per file, each record being one blog comment. The fixed number of comments per file is determined by the configuration of the clustering machines that actually perform the meta-clustering. Figure 2 shows the structural diagram of the segmentation module, where Split_1, Split_k, and Split_n in Figur...
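A minimal sketch of the splitting step described in [0035], assuming the source file holds one blog comment per line. The function names, the `.txt` extension, and the on-disk layout are hypothetical; only the `Split_k` naming echoes the labels mentioned for Figure 2.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def split_records(records: Iterable[str],
                  records_per_file: int) -> Iterator[List[str]]:
    """Yield consecutive chunks holding a fixed number of records;
    the last chunk may be shorter."""
    if records_per_file <= 0:
        raise ValueError("records_per_file must be positive")
    it = iter(records)
    while True:
        chunk = list(islice(it, records_per_file))
        if not chunk:
            return
        yield chunk


def write_splits(source: Path, out_dir: Path,
                 records_per_file: int) -> List[Path]:
    """Write each chunk of the source file to its own small file.
    Assumes one comment per line (not stated in the patent)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    with source.open(encoding="utf-8") as src:
        records = (line.rstrip("\n") for line in src)
        chunks = split_records(records, records_per_file)
        for k, chunk in enumerate(chunks, start=1):
            path = out_dir / f"Split_{k}.txt"  # hypothetical file name
            path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
            paths.append(path)
    return paths
```

Reading the source lazily keeps memory use proportional to one small file rather than to the whole comment file, which matters at the scale the patent targets.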


Abstract

The invention discloses a distributed clustering method for Internet micro-content. The method is distributed across multiple machines. The main control machine divides the micro-content to be processed into multiple small files and distributes these small files to multiple clustering machines for clustering. Each clustering machine performs meta-clustering on every small file assigned to it, merges the resulting meta-clustering result files into a stand-alone clustering merged file, and sends that file to the main control machine. After receiving the stand-alone clustering merged files from all clustering machines, the main control machine extracts micro-content representative points from each of them, performs meta-clustering again on these representative points to generate new cluster categories, and merges the corresponding categories to obtain the final clustering result. The invention can cluster massive Internet micro-content accurately and rapidly, and is an efficient and practical distributed clustering method.
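The flow in the abstract can be sketched in a single process as follows. The patent text available here does not specify the meta-clustering algorithm, the similarity measure, or how representative points are chosen, so this sketch substitutes assumptions throughout: word-set Jaccard similarity, greedy leader clustering, the first member of a cluster as its representative, round-robin file assignment, and plain concatenation as the per-machine merge. All function names are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity (a stand-in similarity measure)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def meta_cluster(texts: List[str], threshold: float) -> List[List[int]]:
    """Greedy leader clustering (assumed algorithm). Returns clusters
    as lists of indices into texts; a cluster's first index is its leader."""
    groups: List[List[int]] = []
    for i, text in enumerate(texts):
        for group in groups:
            if jaccard(text, texts[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups


def cluster_on_machine(files: List[List[str]],
                       threshold: float) -> List[List[str]]:
    """One clustering machine: meta-cluster each small file, then merge
    the per-file results (here simply concatenated) into one list."""
    merged: List[List[str]] = []
    for records in files:
        for group in meta_cluster(records, threshold):
            merged.append([records[i] for i in group])
    return merged


def distributed_cluster(comments: List[str], records_per_file: int,
                        n_machines: int,
                        threshold: float = 0.5) -> List[List[str]]:
    """Main control machine: split, distribute, collect, then re-cluster
    the representative points and merge the matching categories."""
    files = [comments[i:i + records_per_file]
             for i in range(0, len(comments), records_per_file)]
    # Round-robin assignment of small files to machines (assumption).
    assignments = [files[m::n_machines] for m in range(n_machines)]
    clusters = [c for files_m in assignments
                for c in cluster_on_machine(files_m, threshold)]
    # Representative point = first member of each cluster (assumption).
    representatives = [c[0] for c in clusters]
    final: List[List[str]] = []
    for group in meta_cluster(representatives, threshold):
        final.append([item for i in group for item in clusters[i]])
    return final


if __name__ == "__main__":
    sample = ["buy cheap pills now", "nice post thanks",
              "buy cheap pills today", "thanks nice post",
              "buy cheap pills now", "great article"]
    for cluster in distributed_cluster(sample, records_per_file=2,
                                       n_machines=2):
        print(cluster)
```

In this sketch the second meta-clustering pass only compares one representative per cluster, so its cost depends on the number of clusters rather than on the number of comments. A real deployment would run `cluster_on_machine` on separate hosts and exchange files, as the abstract describes.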

Description

Technical field

[0001] The invention relates to technologies for clustering massive amounts of Internet micro-content, in particular to a distributed clustering method for Internet micro-content.

Background

[0002] In recent years, as the number of broadband users has continued to grow, new Internet applications have kept emerging, and the Internet has quickly entered the Web 2.0 era, with Web 2.0 applications such as blogs, podcasts, and Witkeys developing rapidly. Taking blogs as an example, authoritative research institutions predict that the number of bloggers worldwide will exceed 100 million this year and will continue to grow. As the number of blog users grows, micro-content such as user comments and messages is also increasing explosively. Much of this micro-content is spam, such as advertising and large numbers of repeated recommendations. Its existence seriously affects the ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30; H04L29/06
Inventor: 陈珂, 陈刚, 汪源, 胡天磊, 寿黎但
Owner: ZHEJIANG UNIV