Text data stream clustering algorithm based on affinity propagation

A data stream clustering technique based on affinity propagation, applied in electrical digital data processing, special data processing applications, computing, and related fields. It addresses problems such as convergence to local solutions and a priori parameters, for example the average clustering dimension, that are difficult to determine.

Active Publication Date: 2015-07-15
HEFEI UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0003] This algorithm also has the following disadvantages: the number of clusters must be fixed in advance for each clustering run and cannot change as the categories change.
The algorithm achieves good results on spherical clusters but has difficulty finding clusters of arbitrary shape.
There are also studies that propose an HPStream algorithm, which uses high-dimen



Examples


Embodiment Construction

[0044] In this embodiment, the text data stream clustering algorithm based on affinity propagation, called the OWAP-s algorithm, is carried out according to the following steps:

[0045] Step 1. Perform dimensionality reduction processing on the text data set to obtain the corresponding text vector set;

[0046] In order to cope with the high-dimensional and sparse characteristics of text data, the following dimensionality reduction method is adopted:

[0047] First, build a word index over the entire document collection, and then convert each document into a vector of index and value entries. Here, index is the serial number of the word and value is the value associated with it. Since the indexes of every document are sorted in ascending order, the similarity of two documents is calculated by scanning the indexes in their two vectors in order. Whenever the two documents have an equal index, the corresponding values are multiplied together and accumulated until the similarity betw...
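
The index-ordered scan described in [0047] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the helper names (`build_word_index`, `to_sparse_vector`, `sparse_similarity`) are hypothetical, documents are split on whitespace, and each value is a raw term count because the source does not say how the value is computed. The source text is cut off before stating whether the accumulated sum is normalized, so the sketch returns the plain sum of products.

```python
from typing import Dict, List, Tuple

# A sparse text vector: (index, value) pairs sorted by ascending index.
SparseVector = List[Tuple[int, float]]


def build_word_index(documents: List[str]) -> Dict[str, int]:
    """Give every distinct word a serial number, in order of first appearance."""
    index: Dict[str, int] = {}
    for doc in documents:
        for word in doc.split():
            index.setdefault(word, len(index))
    return index


def to_sparse_vector(doc: str, word_index: Dict[str, int]) -> SparseVector:
    """Convert one document to sorted (index, value) pairs.

    Assumption: value is the raw term count of the word in the document.
    """
    counts: Dict[int, float] = {}
    for word in doc.split():
        i = word_index[word]
        counts[i] = counts.get(i, 0.0) + 1.0
    return sorted(counts.items())


def sparse_similarity(a: SparseVector, b: SparseVector) -> float:
    """Scan both sorted vectors once, accumulating products of matching indexes."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        index_a, value_a = a[i]
        index_b, value_b = b[j]
        if index_a == index_b:
            total += value_a * value_b
            i += 1
            j += 1
        elif index_a < index_b:
            i += 1  # word only in document a, skip it
        else:
            j += 1  # word only in document b, skip it
    return total


docs = ["stream clustering text", "text stream stream data"]
word_index = build_word_index(docs)
vectors = [to_sparse_vector(d, word_index) for d in docs]
print(sparse_similarity(vectors[0], vectors[1]))
```

Because both vectors are sorted by index, one pass over each is enough, and words that occur in only one document never contribute to the sum.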


Abstract

The invention discloses a text data stream clustering algorithm based on affinity propagation. The text data stream clustering algorithm is characterized by including the following steps: 1, carrying out dimension reduction processing on a text data set to obtain a corresponding text vector set; 2, obtaining clustering centers of all moments, and completing the clustering algorithm. By means of the text data stream clustering algorithm, the accuracy and the robustness of the algorithm can be improved without assigning the number of clusters in advance, and therefore the requirements for solving practical problems are met.
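
To illustrate why an affinity propagation approach does not need the number of clusters in advance, here is a minimal sketch of the standard batch affinity propagation message-passing scheme in plain Python. It is not the OWAP-s algorithm of this patent, whose streaming details are not given in the text available here. The function names, the damping factor, the iteration count, and the preference value in the example are all assumptions chosen for illustration.

```python
def affinity_propagation(S, damping=0.7, iterations=200):
    """Standard batch affinity propagation on an n x n similarity matrix (n >= 2).

    S[k][k] is the "preference" of point k; lower values yield fewer clusters.
    Returns (exemplars, labels); labels[i] is the exemplar index chosen for
    point i, or -1 if no exemplar emerged.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # r(i, k) = s(i, k) - max over k' != k of (a(i, k') + s(i, k'))
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            runner_up = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                rival = runner_up if k == best else scores[best]
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # a(i, k) = min(0, r(k, k) + sum of positive r(i', k), i' not in {i, k})
        # a(k, k) = sum of positive r(i', k), i' != k
        for k in range(n):
            support = [max(0.0, R[i][k]) for i in range(n)]
            support[k] = R[k][k]  # the self-responsibility is not clipped
            total = sum(support)
            for i in range(n):
                if i == k:
                    new = total - R[k][k]
                else:
                    new = min(0.0, total - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    labels = []
    for i in range(n):
        if i in exemplars:
            labels.append(i)
        elif exemplars:
            labels.append(max(exemplars, key=lambda k: S[i][k]))
        else:
            labels.append(-1)
    return exemplars, labels


def negative_squared_distance_matrix(points, preference):
    """Similarity = negative squared distance; the diagonal holds the preference."""
    return [[preference if i == j else -(p - q) ** 2
             for j, q in enumerate(points)]
            for i, p in enumerate(points)]


points = [0.0, 0.2, 0.5, 5.0, 5.3, 5.4]
S = negative_squared_distance_matrix(points, preference=-10.0)
exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

The number of clusters is never passed in. It emerges from the messages exchanged between points and is influenced only indirectly through the preference on the diagonal of the similarity matrix.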

Description

Technical field

[0001] The invention relates to a text data stream clustering algorithm based on affinity propagation.

Background technique

[0002] With the advent of the big data era, a large amount of unstructured data is being generated on the network. These data are generated in real time, are huge in volume, and have complex structures, so there is an urgent need to extract valuable information and knowledge from them. Text data stream clustering is a common method for analyzing such unstructured data. It has achieved good results in news filtering, topic detection and tracking (TDT), user feature recommendation, and similar applications, and has quickly become a research hotspot. Because text data is high-dimensional and sparse, improving the efficiency and accuracy of clustering algorithms is very important. In 2005, Shi Zhong proposed the OSKM algorithm, which is an extension of the k-means algorithm. It divid...


Application Information

IPC(8): G06F17/30
Inventor: 倪丽萍, 李一鸣, 倪志伟, 伍章俊
Owner: HEFEI UNIV OF TECH