
Document clustering method

A text clustering and text mining technique, applied in the field of text clustering, that addresses problems such as the limited adoption and applicability of existing algorithms.

Active Publication Date: 2014-04-09
南方电网互联网服务有限公司
2 Cites, 26 Cited by

AI Technical Summary

Problems solved by technology

To date, no algorithm automatically groups the features of a data set, which severely limits how widely the algorithm can be applied.



Examples


Embodiment Construction

[0026] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that persons of ordinary skill in the art obtain from these embodiments without creative effort fall within the protection scope of the present invention.

[0027] Referring to Figure 1, an embodiment of the present invention provides a text clustering method that includes at least the following steps.

[0028] S101: On the first document set D1, train the Latent Dirichlet Allocation (LDA) algorithm with a preset number of topics K to obtain the parameters β and φ.

[0029] In an embodiment of the present invention, the first document set D1 includes N distinct features, denoted V1 … VN, the first d...
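Step S101 can be illustrated with a small sketch. The visible text does not say how the LDA model is trained, nor does it define β and φ precisely, so the following uses a collapsed Gibbs sampler, a common LDA training technique that is swapped in here purely for illustration. The function name `train_lda_gibbs`, the priors `doc_prior` and `word_prior`, and the two returned matrices are assumptions, not the patent's notation.

```python
import random

def train_lda_gibbs(docs, num_topics, vocab_size,
                    doc_prior=0.1, word_prior=0.01, iters=200, seed=0):
    """Illustrative collapsed Gibbs sampler for LDA (an assumed training method).

    docs: list of documents, each a list of feature ids in [0, vocab_size).
    Returns (topic_word, doc_topic), both as row-normalized probability lists.
    """
    rng = random.Random(seed)
    doc_topic_n = [[0] * num_topics for _ in docs]                  # counts per doc/topic
    topic_word_n = [[0] * vocab_size for _ in range(num_topics)]    # counts per topic/feature
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic_n[d][k] += 1
            topic_word_n[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    # Resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                doc_topic_n[d][k] -= 1
                topic_word_n[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic_n[d][t] + doc_prior)
                    * (topic_word_n[t][w] + word_prior)
                    / (topic_total[t] + vocab_size * word_prior)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic_n[d][k] += 1
                topic_word_n[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates of the topic-feature and document-topic distributions.
    topic_word = [
        [(topic_word_n[k][w] + word_prior) / (topic_total[k] + vocab_size * word_prior)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    doc_topic = [
        [(doc_topic_n[d][k] + doc_prior) / (len(doc) + num_topics * doc_prior)
         for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    return topic_word, doc_topic
```

With K = 2 and four features, `train_lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3]], 2, 4)` returns a 2 x 4 topic-feature matrix and a 2 x 2 document-topic matrix. How these relate to the patent's β and φ is left open here, because the source text is truncated.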


Abstract

A document clustering method for text mining on a document set using a Latent Dirichlet Allocation (LDA) model. The method comprises at least the following steps: training the LDA algorithm, with a preset number of topics K, on a first document set D1 to obtain parameters β and φ; filtering the first document set D1 according to the parameter φ, using information entropy theory, to obtain a second document set D2; grouping the second document set D2 according to the parameter β to generate a third document set D3 that contains grouping information; and running the FG-Kmeans algorithm on the third document set D3 to obtain the final set of cluster centers C and the label matrix U. The method groups documents using the LDA algorithm and processes the grouped documents with the FG-Kmeans algorithm. This addresses the problem of high-dimensional, sparse data in text mining, and introducing feature grouping into the feature space enriches the information that the feature space carries.
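The filtering and grouping steps of the abstract can be pictured with a minimal sketch. The abstract does not give the exact filtering rule or grouping rule, so everything specific below is an assumption: a single topic-feature matrix `topic_word` stands in for the trained parameters, a feature is dropped when the Shannon entropy of its normalized weights across topics exceeds a hypothetical threshold `max_entropy` (a near-uniform feature says little about any topic), and each surviving feature is placed in the group of its highest-weight topic.

```python
import math

def feature_entropy(topic_word, j):
    """Shannon entropy (in nats) of feature j's normalized weights across topics."""
    column = [row[j] for row in topic_word]
    total = sum(column)
    if total == 0:
        # No evidence for any topic: treat as maximally uninformative.
        return math.log(len(column))
    probs = [c / total for c in column]
    return -sum(p * math.log(p) for p in probs if p > 0)

def filter_and_group(topic_word, max_entropy):
    """Keep low-entropy features and group each by its dominant topic.

    Returns a dict mapping topic index -> list of kept feature indices.
    Both the threshold rule and the argmax grouping are illustrative assumptions.
    """
    groups = {}
    num_features = len(topic_word[0])
    for j in range(num_features):
        if feature_entropy(topic_word, j) <= max_entropy:
            column = [row[j] for row in topic_word]
            dominant = max(range(len(column)), key=column.__getitem__)
            groups.setdefault(dominant, []).append(j)
    return groups

# Two topics, four features; features 1 and 3 are spread evenly across topics.
example = [
    [0.6, 0.2, 0.1, 0.1],
    [0.0, 0.2, 0.7, 0.1],
]
kept_groups = filter_and_group(example, max_entropy=0.5)
```

In this example the evenly spread features are filtered out, and the two remaining features land in separate groups, one per topic. The resulting group lists are the kind of grouping information that a feature-group clustering algorithm could consume.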

Description

Technical field

[0001] The invention relates to the field of data mining, and in particular to a text clustering method.

Background

[0002] With the advent of the big-data era, clustering high-dimensional data has become a serious challenge. Very high dimensionality leads directly to data sparsity, which is especially pronounced in text mining. Clustering algorithms are an effective way to handle high-dimensional sparse data. One such algorithm, FG-Kmeans (Chen, X., Ye, Y., Xu, X., Huang, J.Z.: A feature group weighting method for subspace clustering of high-dimensional data. Pattern Recognition 45(1) (2012)), successfully introduced the concept of groups into soft clustering. In the FG-Kmeans algorithm, features are divided into several groups according to similarity. The algorithm weights the features and the groups at the same time, and finds the more important feature groups in each cluster and the more important featu...
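To make the idea of weighting features and groups at the same time concrete, here is a minimal sketch of a distance that applies both kinds of weight for a single cluster. It is not the update procedure of the cited paper. The function name `fg_weighted_distance`, the use of squared Euclidean distance, and the example weights are assumptions made for illustration.

```python
def fg_weighted_distance(x, center, groups, group_weights, feature_weights):
    """Distance from point x to one cluster center, weighted at two levels.

    groups: list of feature-index lists, one per feature group.
    group_weights: one weight per group for this cluster.
    feature_weights: one weight per feature for this cluster.
    Squared Euclidean distance per feature is an assumption of this sketch.
    """
    total = 0.0
    for t, members in enumerate(groups):
        # Feature-level weighting inside the group.
        within_group = sum(
            feature_weights[j] * (x[j] - center[j]) ** 2 for j in members
        )
        # Group-level weighting on top.
        total += group_weights[t] * within_group
    return total

# Four features in two groups; the first group matters more for this cluster.
distance = fg_weighted_distance(
    x=[1.0, 2.0, 3.0, 4.0],
    center=[0.0, 0.0, 3.0, 2.0],
    groups=[[0, 1], [2, 3]],
    group_weights=[0.75, 0.25],
    feature_weights=[0.5, 0.5, 0.5, 0.5],
)
```

Because each cluster carries its own group and feature weights, a group that is important for one cluster can be nearly ignored by another, which is what lets the algorithm find cluster-specific subspaces.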


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 蔡业首, 陈小军, 管婷婷, 黄哲学
Owner 南方电网互联网服务有限公司