File clustering method based on information bottleneck theory

A document clustering technique based on the information bottleneck, applied in electronic digital data processing, special data processing applications, instruments, etc. It addresses the difficulty of guaranteeing clustering accuracy while keeping time complexity low, and offers high accuracy, a simple principle, and fast execution.

Status: Inactive; Publication Date: 2009-11-04
BEIHANG UNIV

AI Technical Summary

Problems solved by technology

Incremental clustering has low time complexity, but its result often depends on the order in which the documents arrive. Different orders can produce different clustering results, so clustering accuracy is difficult to guarantee.


Embodiment Construction

[0020] The present invention uses information bottleneck theory to calculate the "similarity" between documents and clusters the documents with an incremental clustering algorithm. This keeps the time complexity relatively low and makes the method suitable for applications with demanding time-performance requirements. A sequential clustering algorithm then adjusts the incremental clustering result so that the clustering process achieves high accuracy. Extensive experiments show that this method performs better than classical clustering algorithms such as the K-Means algorithm and the AIB algorithm.
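The visible excerpt does not give the formula for the similarity d. As an assumption, the following uses the standard merge cost from the agglomerative information bottleneck (AIB) literature, where the "distance" between two clusters c_i and c_j is the mutual information lost by merging them. The symbols p(c), p(y|c), and the Jensen-Shannon weights are notation introduced here for illustration, not taken from the patent text.

```latex
d(c_i, c_j) = \bigl(p(c_i) + p(c_j)\bigr)\,
              \mathrm{JS}_{\Pi}\bigl[\,p(y \mid c_i),\; p(y \mid c_j)\,\bigr],
\qquad
\Pi = \{\pi_i, \pi_j\},\quad
\pi_i = \frac{p(c_i)}{p(c_i) + p(c_j)},\quad
\pi_j = \frac{p(c_j)}{p(c_i) + p(c_j)}

\mathrm{JS}_{\Pi}[p, q] = \pi_i\, D_{\mathrm{KL}}(p \,\|\, m) + \pi_j\, D_{\mathrm{KL}}(q \,\|\, m),
\qquad
m = \pi_i\, p + \pi_j\, q
```

Here y ranges over the words (features) of the documents. A small d means the two clusters have similar word distributions, so merging them loses little information about the words.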

[0021] The present invention is a document clustering method based on information bottleneck theory. On the one hand, the method uses information bottleneck theory to calculate the similarity d between documents; on the other hand, it clusters the documents with the incremental clustering algorithm, adjusted by the sequential clustering algorithm, to obtain the clustering result C. The processing steps in the clustering process are:

[0022] Step 1, us...


Abstract

The invention discloses a file clustering method based on information bottleneck theory. The method first uses information bottleneck theory to calculate the similarity between files. An incremental clustering algorithm then clusters the files: for each file, the minimum mutual information loss with respect to the existing clusters is calculated; if this loss satisfies a set threshold, the file is merged into the nearest cluster, otherwise a new cluster is created to hold the file. A sequential clustering method then adjusts the clustering result to improve clustering accuracy: each file is sampled in turn during the adjustment, and the sampling frequency is set to control the adjustment intensity. Because the adjustment strategy covers all sampled files, it helps improve clustering accuracy.
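The two phases described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the merge cost is the standard AIB information loss, every file gets a uniform prior 1/N, "satisfies a set threshold" is read as "loss is at most the threshold", and the sampling frequency is modeled as a number of full passes (`rounds`). All function and parameter names are hypothetical.

```python
import math

def normalize(counts):
    """Turn a word-count dict into a conditional distribution p(y|x)."""
    n = sum(counts.values())
    return {w: c / n for w, c in counts.items()}

def merge(p_i, dist_i, p_j, dist_j):
    """Merge two clusters: priors add, word distributions are prior-weighted."""
    total = p_i + p_j
    words = set(dist_i) | set(dist_j)
    dist = {w: (p_i * dist_i.get(w, 0.0) + p_j * dist_j.get(w, 0.0)) / total
            for w in words}
    return total, dist

def merge_cost(p_i, dist_i, p_j, dist_j):
    """Information loss of a merge: (p_i + p_j) * weighted JS divergence, in bits.
    Assumed standard AIB cost; the excerpt does not state the exact formula."""
    total = p_i + p_j
    w_i, w_j = p_i / total, p_j / total
    js = 0.0
    for w in set(dist_i) | set(dist_j):
        a, b = dist_i.get(w, 0.0), dist_j.get(w, 0.0)
        m = w_i * a + w_j * b
        if a > 0:
            js += w_i * a * math.log2(a / m)
        if b > 0:
            js += w_j * b * math.log2(b / m)
    return total * js

def incremental_cluster(docs, threshold):
    """Single pass: join the nearest cluster if the loss is within the
    threshold, otherwise open a new cluster. Returns one label per doc."""
    prior = 1.0 / len(docs)
    clusters, labels = [], []          # clusters: list of (prior, distribution)
    for counts in docs:
        dist = normalize(counts)
        best, best_cost = None, float("inf")
        for k, (p, d) in enumerate(clusters):
            cost = merge_cost(prior, dist, p, d)
            if cost < best_cost:
                best, best_cost = k, cost
        if best is not None and best_cost <= threshold:
            clusters[best] = merge(prior, dist, *clusters[best])
            labels.append(best)
        else:
            clusters.append((prior, dist))
            labels.append(len(clusters) - 1)
    return labels

def sequential_adjust(docs, labels, rounds=1):
    """Adjustment pass: take each doc out of its cluster in turn and put it
    into the cluster with the smallest merge cost. Docs that are alone in
    their cluster are left in place (an assumption of this sketch)."""
    prior = 1.0 / len(docs)
    dists = [normalize(c) for c in docs]
    labels = list(labels)
    for _ in range(rounds):
        for i in range(len(docs)):
            if labels.count(labels[i]) == 1:
                continue
            best, best_cost = labels[i], float("inf")
            for k in sorted(set(labels)):
                members = [j for j, l in enumerate(labels) if l == k and j != i]
                p, d = prior, dists[members[0]]
                for j in members[1:]:
                    p, d = merge(p, d, prior, dists[j])
                cost = merge_cost(prior, dists[i], p, d)
                if cost < best_cost:
                    best, best_cost = k, cost
            labels[i] = best
    return labels
```

Rebuilding each candidate cluster from its members keeps the sketch short but is quadratic per pass; an implementation concerned with speed would presumably maintain cluster distributions incrementally. The threshold controls how many clusters the first phase opens, and `rounds` trades adjustment quality against running time.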

Description

Technical field

[0001] The invention relates to a clustering method for electronic documents, and more specifically to a document clustering method based on information bottleneck theory.

Background art

[0002] The explosive growth of information on the Internet has made information harder to manage and use. To reveal the potentially valuable information or structure hidden in Web data, Web mining technology has developed rapidly and been widely applied in recent years. Document clustering is one of the most important tools in Web mining. Its purpose is to divide a set of documents into several clusters such that the text content within a cluster is highly similar, while the similarity between different clusters is as small as possible. Each clustering process consists mainly of two parts: the calculation of text content similarity and the text clustering method.

[0003] Most clustering procedures ...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 刘永利, 熊璋, 任捷, 欧阳元新
Owner: BEIHANG UNIV