Hierarchy clustering method of successive dichotomy for document in large scale

A hierarchical clustering technique for large-scale text, applied in the field of text information, that addresses the slow speed of existing methods and achieves fast clustering with good results.

Inactive Publication Date: 2007-07-25
FUDAN UNIV
0 Cites, 20 Cited by
AI Technical Summary

Problems solved by technology

The existing algorithm produces good clustering results, but it is very slow.


Examples

Embodiment Construction

[0015] The basic process is to represent each text as a vector in a vector space, compute the similarity between every pair of texts to obtain a graph, and cluster that graph with the "successive bisection" hierarchical clustering algorithm.
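The similarity step can be sketched as follows. This is a minimal illustration, not the patent's stated procedure: the visible text does not name the similarity measure, so cosine similarity is an assumption here, and the names `cosine`, `similarity_graph`, and `threshold` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_graph(vectors, threshold=0.0):
    """Weighted adjacency matrix: W[i][j] is the similarity if it exceeds
    `threshold` (a hypothetical pruning parameter), otherwise 0."""
    n = len(vectors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(vectors[i], vectors[j])
            if s > threshold:
                W[i][j] = W[j][i] = s  # the graph is undirected
    return W

# Three toy document vectors; the third shares no words with the others.
vectors = [[1.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W = similarity_graph(vectors)
print(W)
```

Each text becomes a node and each nonzero entry of `W` a weighted edge, which is the graph that the clustering step then splits.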

[0016] 1. Vector space representation of text.

[0017] Suppose there are n articles in which a total of m distinct words appear. Each article is then represented by an m-dimensional vector, and the n articles form an m×n matrix, denoted M. The entry $M_{ij}$ is the tf-idf value of the i-th word in the j-th article: $M_{ij} = \mathrm{tf}_{ij} \times \log \frac{n}{\mathrm{df}_i}$, where $\mathrm{tf}_{ij}$ is the frequency of the i-th word in the j-th article and $\mathrm{df}_i$ is the number of articles containing the i-th word. In order to eliminate the difference in the length of the text, after the text is expressed as a vector, it is then normaliz...
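The tf-idf weighting above can be illustrated with a short sketch. Two points are assumptions rather than statements from the source: the term frequency is taken as a raw count, and the normalization (whose details are cut off in the text) is taken to be scaling to unit Euclidean length. The function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors): one tf-idf vector per tokenised document,
    scaled to unit Euclidean length (assumed normalisation)."""
    n = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    # df[i]: number of documents that contain word i
    df = [0] * len(vocab)
    for doc in docs:
        for w in set(doc):
            df[index[w]] += 1
    vectors = []
    for doc in docs:
        vec = [0.0] * len(vocab)
        for w, count in Counter(doc).items():
            i = index[w]
            vec[i] = count * math.log(n / df[i])  # tf * log(n / df)
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]
        vectors.append(vec)
    return vocab, vectors

docs = [
    ["text", "cluster", "graph"],
    ["text", "cluster", "kmeans"],
    ["text", "forum", "blog"],
]
vocab, vectors = tfidf_vectors(docs)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

A word that occurs in every document, such as "text" here, gets weight log(n/n) = 0, so it does not contribute to the similarity between documents.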

Abstract

A method for clustering large volumes of text comprises: representing each text in a vector space; computing the similarity between every pair of texts to obtain a graph; embedding the graph into a vector space and using the K-means algorithm to split the texts into two classes; and repeating the bisection successively until the requirement is satisfied and the graph is no longer divided.
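The successive bisection loop can be sketched as below. This is an illustration under stated assumptions, not the patented procedure: the abstract does not say how the graph is embedded, so the sketch uses a one-dimensional spectral embedding (the second eigenvector of the normalized affinity matrix, found by power iteration); the stopping requirement is assumed to be a maximum cluster size; and the 2-means step starts from the minimum and maximum values rather than random centers so that the result is reproducible. All function and parameter names are hypothetical.

```python
import math

def embed_1d(W, idx, iters=200):
    """Embed the subgraph on nodes `idx` into one dimension (assumed spectral embedding)."""
    m = len(idx)
    # Degree of each node inside the subgraph, guarded against zero.
    deg = [max(sum(W[i][j] for j in idx), 1e-12) for i in idx]
    s = [math.sqrt(d) for d in deg]
    # A = (I + D^-1/2 W D^-1/2) / 2 is symmetric with eigenvalues in [0, 1];
    # its top eigenvector is proportional to s, so it is deflated on each pass.
    A = [[0.5 * ((a == b) + W[idx[a]][idx[b]] / (s[a] * s[b])) for b in range(m)]
         for a in range(m)]
    s_len = math.sqrt(sum(x * x for x in s))
    top = [x / s_len for x in s]
    v = [float(k + 1) for k in range(m)]  # deterministic start vector
    for _ in range(iters):
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [sum(A[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return [0.0] * m
        v = [x / norm for x in v]
    return [v[a] / s[a] for a in range(m)]

def two_means_1d(x, iters=50):
    """K-means with k = 2 on scalars, started from the min and max values."""
    c0, c1 = min(x), max(x)
    labels = [0] * len(x)
    for _ in range(iters):
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        g0 = [v for v, l in zip(x, labels) if l == 0]
        g1 = [v for v, l in zip(x, labels) if l == 1]
        if not g1:
            break
        n0, n1 = sum(g0) / len(g0), sum(g1) / len(g1)
        if (n0, n1) == (c0, c1):
            break
        c0, c1 = n0, n1
    return labels

def successive_bisection(W, max_size):
    """Split the graph in two repeatedly until every part has at most `max_size` nodes."""
    clusters, stack = [], [list(range(len(W)))]
    while stack:
        idx = stack.pop()
        if len(idx) <= max_size:
            clusters.append(idx)
            continue
        labels = two_means_1d(embed_1d(W, idx))
        left = [i for i, l in zip(idx, labels) if l == 0]
        right = [i for i, l in zip(idx, labels) if l == 1]
        if not left or not right:  # cannot be divided any further
            clusters.append(idx)
            continue
        stack.extend([left, right])
    return sorted(clusters)

# Toy similarity matrix: two groups of three mutually similar documents.
n = 6
W = [[1.0 if i == j else (0.9 if i // 3 == j // 3 else 0.1) for j in range(n)]
     for i in range(n)]
clusters = successive_bisection(W, max_size=3)
print(clusters)
```

Because each split only needs a two-way decision on a one-dimensional embedding, the expensive step is the embedding itself; a production version would likely use a sparse eigensolver instead of the dense power iteration shown here.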

Description

Technical field

[0001] The invention belongs to the technical field of text information, and in particular relates to a large-scale text clustering method.

Background

[0002] With the popularity of the Internet, more and more people use it as a medium for expressing their opinions. Forums, blogs, and chat rooms provide a wealth of public opinion information, and how to analyze this information automatically by computer has become an important problem. Text clustering is a technique for grouping text information automatically by computer. After clustering, articles on the same topic fall into the same category, which makes them convenient for users to search and read. At present, the main text clustering methods are as follows:

[0003] 1. K-means is a fast clustering algorithm based on an optimization criterion. The algorithm begins by randomly choosing k initial class centers. Then assign each t...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/28
Inventor: 黄萱菁, 赵林, 钱线
Owner: FUDAN UNIV