Iterative text clustering method based on adaptive subspace learning

A technology of subspace learning and text clustering, applied in the field of iterative text clustering, which can solve problems such as overfitting and limited application scope.

Active Publication Date: 2013-09-04
广东南方报业传媒集团新媒体有限公司

AI Technical Summary

Problems solved by technology

However, the limitation of NAML is that its optimization process must depend on multiple key parameters, which can easily lead to overfitting when the data is insufficient.
[0008] Although the idea of adaptive dimensionality reductio

Examples

Embodiment 1

[0065] As shown in Figure 1, the iterative text clustering method based on adaptive subspace learning includes the following steps:

[0066] (1) Clustering initialization of the text vector space: from the word-segmented representations of all documents in the text corpus, a set of representative terms is selected using the mutual information method to form a term index. Each document is then represented as a text vector according to the term index; the dimension of the text vector equals the size of the selected term index, and each element of the vector is a tf-idf weight. All documents in the text corpus together constitute the original text vector space. In the original text vector space, the affinity propagation clustering algorithm is used to generate the specified K initial clusters (K-AP), so that each document obtains its initial category, and the category information of all document clusters is summarized to form an initial category ind...
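The vector representation in step (1) can be illustrated with a minimal sketch. It assumes the term index has already been chosen (the mutual-information selection is not shown), and it assumes one common tf-idf variant, tf × log(N / df), because the excerpt does not say which weighting formula the patent uses. The function name `tfidf_vectors` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs, term_index):
    """Represent each tokenized document as a tf-idf vector over term_index.

    docs: list of token lists (the output of word segmentation).
    term_index: list of selected terms, assumed to be chosen beforehand.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each indexed term.
    df = {t: sum(1 for d in docs if t in d) for t in term_index}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vec = []
        for t in term_index:
            tf = counts[t] / total
            # Assumed idf variant: log(N / df); zero if the term never occurs.
            idf = math.log(n_docs / df[t]) if df[t] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Toy corpus (hypothetical): three segmented documents, three indexed terms.
docs = [["text", "cluster", "text"], ["cluster", "subspace"], ["subspace", "learning"]]
term_index = ["text", "cluster", "subspace"]
X = tfidf_vectors(docs, term_index)
```

Each row of `X` has one entry per indexed term, so the vector dimension matches the size of the term index, as the step describes. The rows of `X` would then be handed to the initial clustering stage.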

Abstract

The invention discloses an iterative text clustering method based on adaptive subspace learning. The method includes the following steps: (1) initialization: the text corpus is expressed as a text vector space, K initial clusters are generated with the affinity propagation clustering method, and the cluster categories of all texts are expressed as an initial category membership indicator matrix; and (2) iteration between subspace projection and clustering: the initial category membership indicator matrix is used as prior knowledge, a subspace projection matrix is solved with maximization of the average neighborhood margin as the objective, the text vector space is projected onto the subspace, K clusters are generated in the subspace with the affinity propagation clustering method, and the category membership indicator matrix is updated. A convergence function is computed from the subspace projection matrix and the category membership indicator matrix; when the function converges, the iteration exits and text clustering is finished. The method places no restriction on the size or distribution of the text data, fuses subspace solving and clustering in a unified framework, and obtains a globally optimal clustering result through the iterative strategy.
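The alternation described in the abstract can be outlined as a loop skeleton. This is only a sketch under stated assumptions: `solve_projection` and `cluster` are hypothetical callables standing in for the margin-maximizing projection solver and the K-cluster affinity propagation step, neither of which is implemented here. The stopping rule (the partition of documents no longer changes) is a simple stand-in for the patent's convergence function, whose form is not given in this excerpt.

```python
def indicator_matrix(labels, k):
    """One-hot N x K matrix: row i has a 1 in the column of document i's cluster."""
    return [[1 if lab == c else 0 for c in range(k)] for lab in labels]

def partition(labels):
    """Grouping of document indices by label, independent of label numbering."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, set()).add(i)
    return {frozenset(g) for g in groups.values()}

def project(X, W):
    """Multiply N x d data by a d x m projection matrix (plain nested lists)."""
    return [[sum(x[j] * W[j][c] for j in range(len(W))) for c in range(len(W[0]))]
            for x in X]

def iterate_clustering(X, labels, k, solve_projection, cluster, max_iter=50):
    """Alternate projection and clustering until the partition stops changing."""
    H = indicator_matrix(labels, k)
    step = 0
    for step in range(1, max_iter + 1):
        W = solve_projection(X, H)      # uses current memberships as prior knowledge
        Y = project(X, W)               # map documents into the subspace
        new_labels = cluster(Y, k)      # re-cluster in the subspace
        converged = partition(new_labels) == partition(labels)
        labels, H = new_labels, indicator_matrix(new_labels, k)
        if converged:                   # assumed stopping rule, see lead-in
            break
    return labels, H, step

# Toy stand-ins (hypothetical, for illustration only).
def keep_first_axis(X, H):
    return [[1.0]] + [[0.0] for _ in X[0][1:]]  # d x 1 matrix selecting coordinate 0

def threshold_cluster(Y, k):
    return [0 if y[0] < 0.5 else 1 for y in Y]

X = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.5], [0.9, 0.3]]
labels, H, steps = iterate_clustering(X, [0, 1, 0, 1], 2, keep_first_axis, threshold_cluster)
```

Passing the solver and the clustering routine as arguments keeps the skeleton independent of any particular implementation of either step, so a real projection solver and a real affinity propagation routine could be substituted for the toy stand-ins.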

Description

Technical field

[0001] The present invention relates to the field of machine learning and pattern recognition, and in particular to an iterative text clustering method based on adaptive subspace learning. It is an adaptive subspace learning method based on maximization of the average neighborhood margin, and it uses an iterative strategy to solve text clustering problems.

Background

[0002] With the popularization and development of Internet technology and database technology, people can easily acquire and store large amounts of data. In practice, most of this data exists in the form of text. Text clustering can organize, summarize, and navigate text information, and it helps users accurately obtain the information they need from vast text resources. It has therefore gained widespread attention in recent years.

[0003] In text clustering, text is often represented by the Vector Space Model (VSM), but this representation is characterized ...

Application Information

IPC(8): G06F17/30, G06K9/66
Inventor 吴娴, 杨兴锋, 张东明, 何崑
Owner 广东南方报业传媒集团新媒体有限公司