Looking for breakthrough ideas for innovation challenges? Try Patsnap Eureka!

Improved Kmeans clustering method based on SimHash

A kmeans clustering and clustering technology, which is applied in the field of computer networks, can solve problems such as difficulty in guaranteeing clustering quality, high computational overhead, and long time consumption

Active Publication Date: 2017-04-05
CHINA INTERNET NETWORK INFORMATION CENTER
View PDF2 Cites 11 Cited by
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

The clustering problem is a typical partition-based problem. The simplest and most commonly used algorithm in the partition-based clustering algorithm is the Kmeans clustering algorithm, but the pure Kmeans algorithm has two obvious disadvantages: one is that it needs to Specify the number of clusters beforehand; if the number of clusters is not known in advance, the clustering quality is difficult to guarantee; second, it is necessary to iteratively calculate the centroid. When the sample data is large, not only the calculation cost is large but also time-consuming , cannot be applied to large amounts of data

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
View more

Image

Smart Image Click on the blue labels to locate them in the text.
Viewing Examples
Smart Image
  • Improved Kmeans clustering method based on SimHash
  • Improved Kmeans clustering method based on SimHash
  • Improved Kmeans clustering method based on SimHash

Examples

Experimental program
Comparison scheme
Effect test

specific Embodiment

[0053] The text of the key information of the news web pages to be clustered exists in the local folder in the text file format (.txt), such as figure 1 As shown in , the text content format of key news information is: title, time, source and text, and the four are separated by newlines.

[0054] Specify the directory of the files to be clustered, run the improved Kmeans clustering algorithm, and generate "xxx_cluster_result.txt", the file format is: cluster label\tpage number in the cluster (page numbers are separated by spaces), and the clustering result file is as follows figure 2 shown.

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to View More

PUM

No PUM Login to View More

Abstract

The invention discloses an improved Kmeans clustering method based on SimHash. The method includes the following steps: 1. using the SimHash algorithm to compute the fingerprint of each to-be-clustered file, generating one SimHash table; 2. selecting one to-be-clustered file, based on the fingerprint of the file, checking the SimHash table to obtain one similar file set S0; 3. putting the files from the S0 that have an Overlap value with the to-be-clustered file greater than a threshold [alpha] to a similar file S1; 4. back to a cluster C0 belonged to the S1, computing the distances between the to-be-clustered file and the centers of mass of all clusters in the C0, classifying the to-be-clustered file to a cluster I which is closest to the to-be-clustered file and smaller than the threshold and updating the center of mass of the cluster I, or creating a new cluster k, and taking the to-be-clustered file as a first element of the cluster k; if C0 is empty, creating a cluster j, taking the to-be-filed cluster as a first element of the cluster j. According to the invention, the method increases the effects of clustering.

Description

technical field [0001] The invention relates to an improved Kmeans clustering method based on SimHash, belonging to the technical field of computer networks. Background technique [0002] In today's era of explosive growth of Internet information, deduplication and clustering of information, as an important technical means, has been researched and favored by more and more scholars. The deduplication and clustering of text information is an important part of it. Therefore, the present invention only involves deduplication and clustering of text information. The clustering of text information mainly involves two aspects: the first is the text representation mode, that is, what to use to represent the text; the second is to choose the appropriate clustering algorithm. At present, there are many kinds of text representation modes, and the popular ones are Vector Space Model (VSM), probability model, language model, etc.; similarly, in the selection of clustering algorithm, the...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to View More

Application Information

Patent Timeline
no application Login to View More
IPC IPC(8): G06K9/62
CPCG06F18/23211
Inventor 李晓东向菁菁耿光刚
Owner CHINA INTERNET NETWORK INFORMATION CENTER
Who we serve
  • R&D Engineer
  • R&D Manager
  • IP Professional
Why Patsnap Eureka
  • Industry Leading Data Capabilities
  • Powerful AI technology
  • Patent DNA Extraction
Social media
Patsnap Eureka Blog
Learn More
PatSnap group products