Text clustering method based on random neighbor embedding

A technique that applies random neighbor embedding to text clustering. It is used in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc., and addresses the low clustering accuracy and slow running speed of existing approaches.

Active Publication Date: 2016-11-09
YANCHENG INST OF TECH
Cites: 1; Cited by: 18

AI Technical Summary

Problems solved by technology

Existing clustering algorithms struggle to meet the following two requirements when processing text data: (1) high clustering acc ...



Examples


Example Embodiment

[0038] Example:

[0039] As shown in Figure 1, a text clustering method based on random neighbor embedding includes the following steps:

[0040] S01: Preprocess the text set and express it as a standardized word-text co-occurrence matrix;

[0041] S02: Embed the high-dimensional text data into a low-dimensional space through t-distributed stochastic neighbor embedding (t-SNE), so that the low-dimensional embedding points corresponding to texts with low similarity in the high-dimensional space are relatively far apart, and the low-dimensional embedding points corresponding to texts with high similarity are relatively close;

[0042] S03: Use multiple low-dimensional embedding points as the initial centroids of the K-means algorithm, and cluster with the K-means algorithm on the coordinates of the low-dimensional mapping points.
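
Step S03 can be illustrated with a minimal sketch in plain Python. It assumes the low-dimensional coordinates are already available; the `embedded` points below are made-up stand-ins for t-SNE output, and the way the seed points are picked (`seeds`) is a hypothetical choice, since the text shown here does not say how the initial centroids are selected. The function name `kmeans` and its parameters are likewise illustrative.

```python
import math

def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's K-means on low-dimensional points, started from given centroids."""
    centroids = [tuple(c) for c in init_centroids]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical 2-D embedding coordinates (stand-ins for t-SNE output).
embedded = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
# Assumption: two of the embedded points serve as initial centroids.
seeds = [embedded[0], embedded[3]]
labels, centroids = kmeans(embedded, seeds)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Seeding from actual embedding points, rather than from random coordinates, starts every centroid inside the data, which is presumably why the method takes its initial centroids from the embedding.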

[0043] The construction of the standardized word-text co-occurrence matrix is ...
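
The source text is cut off at this point, so the patent's exact construction is not shown. As a rough illustration only, the sketch below assumes whitespace tokenization, raw term counts, and scaling of each text's column to unit L2 norm as one possible reading of "standardized". The function name `cooccurrence_matrix` and the sample texts are hypothetical.

```python
import math
from collections import Counter

def cooccurrence_matrix(texts):
    """Return (vocab, matrix); matrix[i][j] is the weight of word i in text j."""
    # Assumption: lowercase + whitespace split stands in for real preprocessing.
    counts = [Counter(t.lower().split()) for t in texts]
    vocab = sorted(set().union(*counts))
    columns = []
    for c in counts:
        col = [float(c[w]) for w in vocab]
        norm = math.sqrt(sum(v * v for v in col))
        # Assumption: "standardized" is read here as unit L2 norm per text.
        columns.append([v / norm if norm else 0.0 for v in col])
    # Transpose so that rows are words and columns are texts.
    matrix = [list(row) for row in zip(*columns)]
    return vocab, matrix

texts = ["apple banana apple", "banana cherry"]
vocab, matrix = cooccurrence_matrix(texts)
print(vocab)  # ['apple', 'banana', 'cherry']
```

Each column of `matrix` is then a unit-length vector describing one text, which keeps long and short texts comparable before the embedding step.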



Abstract

The invention discloses a text clustering method based on random neighbor embedding. The method includes the following steps: the text set is preprocessed and expressed as a normalized word-text co-occurrence matrix; the high-dimensional text data is embedded into a low-dimensional space by t-distributed stochastic neighbor embedding (t-SNE), so that the low-dimensional embedding points corresponding to texts with low similarity in the high-dimensional space are far apart and the low-dimensional embedding points corresponding to texts with high similarity are close together; and a plurality of low-dimensional embedding points are taken as the initial centroids of the K-means algorithm, which then performs clustering on the coordinates of the low-dimensional mapping points. The invention addresses the curse of dimensionality caused by the high-dimensional, sparse nature of text, reduces the dimension of the text data, shortens the running time of the clustering algorithm, and improves its accuracy.
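
For reference, the "far apart / close together" behaviour follows from the objective that t-SNE minimizes. The block below gives the standard published formulation of t-SNE, not text from the patent; the symbols $x_i$ (high-dimensional text vectors), $y_i$ (low-dimensional embedding points), $\sigma_i$ (per-point bandwidth) and $n$ (number of texts) are notation assumed here.

```latex
% High-dimensional similarities (Gaussian kernel, then symmetrized)
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student t kernel with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective: Kullback-Leibler divergence between the two distributions
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Minimizing $C$ penalizes pairs with large $p_{ij}$ but small $q_{ij}$, so similar texts are pulled close together in the embedding, while the heavy-tailed kernel lets dissimilar texts sit far apart.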

Description

Technical field

[0001] The invention relates to a text clustering integration method, in particular to a text clustering method based on random neighbor embedding.

Background art

[0002] With the rapid growth of online information and the maturing of technologies such as search engines, the main problem facing society is no longer information scarcity but how to improve the efficiency of acquiring and accessing information. At present, most information on the Internet is presented as text, so organizing large-scale text collections effectively has become a very challenging problem.

[0003] Text/document clustering is based on the well-known clustering assumption: texts of the same type are more similar to one another, and texts of different types are less similar. As one of the most important unsupervised machine learning methods, clustering does not require training, nor does it require ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 徐森, 徐静, 花小朋, 李先锋, 徐秀芳, 安晶, 皋军, 曹瑞
Owner: YANCHENG INST OF TECH