Text clustering method based on random neighbor embedding

A technique combining random (stochastic) neighbor embedding with text clustering, used in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc. It targets the trade-off in existing algorithms, where accurate clustering runs slowly and fast clustering loses accuracy.

Active Publication Date: 2016-11-09

YANCHENG INST OF TECH

Cites: 1. Cited by: 18.

Problems solved by technology

Existing clustering algorithms struggle to meet the following two requirements at the same time when processing text data: (1) high clustering accuracy; (2) fast running speed.

Overall, fast clustering algorithms sacrifice accuracy, while high-precision clustering algorithms run slowly.

Embodiment

[0039] As shown in Figure 1, a text clustering method based on random neighbor embedding includes the following steps:

[0040] S01: Preprocess the text set and express it as a standardized word-text co-occurrence matrix;

[0041] S02: Embed the high-dimensional text data into a low-dimensional space through t-distributed stochastic neighbor embedding (t-SNE), so that the low-dimensional embedding points corresponding to texts with low similarity in the high-dimensional space are relatively far apart, and the low-dimensional embedding points corresponding to texts with high similarity are relatively close;

[0042] S03: Use multiple low-dimensional embedding points as the initial centroids of the K-means algorithm, and cluster with K-means according to the coordinates of the mapped points in the low-dimensional space.
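Steps S01 to S03 can be sketched in Python with scikit-learn. This is only an illustration under stated assumptions, not the patented implementation: the function name `cluster_texts` is hypothetical, TF-IDF with L2-normalized rows stands in for the standardized word-text co-occurrence matrix, and the initial centroids are drawn at random from the embedded points because the excerpt does not say how they are selected.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE


def cluster_texts(texts, n_clusters, n_components=2, perplexity=5.0, seed=0):
    """Hypothetical sketch of S01-S03; returns (labels, embedding)."""
    # S01: represent the text set as a normalized term-document matrix.
    # Assumption: TF-IDF weighting with L2-normalized rows.
    X = TfidfVectorizer().fit_transform(texts).toarray()

    # S02: embed the high-dimensional rows into a low-dimensional space.
    # scikit-learn requires perplexity to be smaller than the number of texts.
    tsne = TSNE(
        n_components=n_components,
        perplexity=perplexity,
        init="random",
        random_state=seed,
    )
    Y = tsne.fit_transform(X)

    # S03: take n_clusters embedded points as the initial centroids
    # (assumption: chosen uniformly at random), then run K-means on the
    # low-dimensional coordinates.
    rng = np.random.default_rng(seed)
    init_idx = rng.choice(len(Y), size=n_clusters, replace=False)
    km = KMeans(n_clusters=n_clusters, init=Y[init_idx], n_init=1)
    labels = km.fit_predict(Y)
    return labels, Y
```

Because K-means runs on the low-dimensional coordinates rather than on the full term space, each iteration handles far fewer dimensions, which is the source of the running-time benefit the method claims.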

[0043] The construction of standardized word-text co-occurrence matrix is as follows: figur...

Abstract

The invention discloses a text clustering method based on random neighbor embedding. The method includes the following steps: the text set is preprocessed and expressed as a normalized word-text co-occurrence matrix; the high-dimensional text data is embedded into a low-dimensional space by t-distributed stochastic neighbor embedding (t-SNE), so that the low-dimensional embedding points corresponding to texts with low similarity in the high-dimensional space are far apart and the low-dimensional embedding points corresponding to texts with high similarity are close together; and a plurality of low-dimensional embedding points are taken as the initial centroids of the K-means algorithm, which then performs clustering on the coordinates of the mapped points in the low-dimensional space. The invention addresses the curse of dimensionality caused by the high-dimensional, sparse nature of text, reduces the dimension of the text data, shortens the running time of the clustering algorithm, and improves its accuracy.

Description

Technical field

[0001] The invention relates to a text clustering integration method, and in particular to a text clustering method based on random neighbor embedding.

Background

[0002] With the rapid growth of network information and the maturing of technologies such as search engines, the main problem facing society is no longer information scarcity but how to improve the efficiency of information acquisition and access. Most information on the Internet is currently presented as text, so organizing large-scale text collections effectively has become a very challenging problem.

[0003] Text (document) clustering is based on the well-known clustering assumption: texts of the same type are more similar to one another, while texts of different types are less similar. As one of the most important unsupervised machine learning methods, clustering does not require training, nor does it require ...

Application Information

IPC(8): G06F17/30

CPC: G06F16/35

Inventor: 徐森, 徐静, 花小朋, 李先锋, 徐秀芳, 安晶, 皋军, 曹瑞

Owner: YANCHENG INST OF TECH
