A Spark-based clustering method for high-dimensional sparse text data

A text data clustering technology, applied in text database clustering/classification, unstructured text data retrieval, and special data processing applications, intended to satisfy computing requirements.

Active Publication Date: 2020-09-29
芽米科技(广州)有限公司

AI Technical Summary

Problems solved by technology

For example, when the number of samples n is very large, computing or storing the n × n similarity matrix M between samples may be infeasible.
In addition, the traditional spectral clustering algorithm needs a large amount of storage space and computing time to calculate the K eigenvectors of the Laplacian matrix.
These problems leave spectral clustering increasingly unable to meet the computing requirements imposed by rapidly growing data volumes.
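
To make the scale of the storage problem concrete, the following rough estimate is a sketch under stated assumptions: 8-byte floating-point entries, a hypothetical sample count of one million, and a hypothetical sparsified matrix that keeps 100 entries per row in coordinate format with 4-byte indices. None of these numbers come from the patent text.

```python
def dense_similarity_bytes(n, bytes_per_entry=8):
    """Bytes needed to hold a dense n x n similarity matrix."""
    return n * n * bytes_per_entry


def sparse_similarity_bytes(nnz, bytes_per_value=8, bytes_per_index=4):
    """Bytes for a coordinate-format sparse matrix.

    Each stored entry carries one value plus a row index and a column index.
    """
    return nnz * (bytes_per_value + 2 * bytes_per_index)


# Hypothetical workload: one million samples.
n = 1_000_000
dense = dense_similarity_bytes(n)

# Hypothetical sparsification: keep 100 similarities per sample.
sparse = sparse_similarity_bytes(n * 100)

print(dense / 1e12, "TB for the dense matrix")
print(sparse / 1e9, "GB for the sparse matrix")
```

Under these assumptions the dense matrix is far beyond the memory of a single ordinary machine, which is the motivation for distributing both the storage and the computation.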

Embodiment Construction

[0043] The technical solutions in the embodiments of the present invention are described clearly and in detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the invention.

[0044] The technical scheme of the present invention is as follows:

[0045] Figure 1 is a flow chart of the present invention, which comprises the following steps:

[0046] 1. The data loading stage, as shown in Figure 2;

[0047] At this stage, the data source to be processed (from the UCI data platform) is read into a Resilient Distributed Dataset (RDD), then loaded into the high-dimensional distributed vector set P and divided into a training set A1 and a test set A2.

[0048] Download the RCV1 data set from the UCI experimental data platform (URL: http://archive.ics.uci.edu/ml/). The data set has the form {decision label, condition attribute 1, condition a...
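
The loading stage can be illustrated with a minimal single-machine sketch in plain Python. It is not the patented Spark implementation. The comma-separated record layout, the helper names `parse_line` and `split_train_test`, the sample rows, and the 80/20 split are all assumptions made for illustration, since the source text describing the record format is truncated.

```python
import random


def parse_line(line):
    """Parse 'label,v1,v2,...' into (label, sparse_vector).

    The sparse vector is a dict {attribute_index: value} that stores
    only the non-zero condition attributes.
    """
    fields = line.strip().split(",")
    label = fields[0]
    features = {}
    for index, raw in enumerate(fields[1:]):
        value = float(raw)
        if value != 0.0:
            features[index] = value
    return label, features


def split_train_test(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: a decision label followed by condition attributes.
lines = [
    "1,0,0.5,0,0",
    "0,0.2,0,0,0.9",
    "1,0,0,0,0.1",
    "0,0.7,0,0,0",
    "1,0,0,0.3,0",
]

P = [parse_line(line) for line in lines]   # vector set P
A1, A2 = split_train_test(P)               # training set A1, test set A2
print(len(A1), "training samples,", len(A2), "test samples")
```

In a Spark job the same per-record function could plausibly be applied with `RDD.map` after reading the file with `SparkContext.textFile`, and the split done with `RDD.randomSplit`; whether the patent does exactly this is not stated in the visible text.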

Abstract

The invention claims a Spark-based clustering method for high-dimensional sparse text data. The method comprises the following steps: 1) an RDD (Resilient Distributed Dataset) is used to read the dataset; 2) the RDD interface is used to design a distributed sparse vector set; 3) the similarity between the distributed sparse vector set and the complete dataset on the node where that vector set resides is calculated, the results are assembled by index number into a similarity matrix, the stored similarity matrix is symmetrized, and its normalized form and Laplacian matrix form are obtained; 4) SVD (Singular Value Decomposition) is used to decompose the normalized Laplacian matrix from step 3); 5) the new matrix constructed in step 4) is fed into a K-means model as training samples; and 6) the trained model is used to cluster the test set. The method improves the performance of a traditional clustering algorithm on large datasets.
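
Step 3) can be illustrated on a single machine with a small sketch. It assumes cosine similarity between sparse vectors, a zero diagonal, symmetrization by averaging the matrix with its transpose, and the standard normalized Laplacian L = I - D^(-1/2) W D^(-1/2); the abstract does not specify any of these choices, so they are assumptions rather than the patented method. The SVD and K-means steps are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {index: value}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(value * v.get(index, 0.0) for index, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Pairwise cosine similarities with a zero diagonal (an assumption)."""
    n = len(vectors)
    return [[0.0 if i == j else cosine(vectors[i], vectors[j])
             for j in range(n)] for i in range(n)]


def symmetrize(S):
    """Average S with its transpose so the result is exactly symmetric."""
    n = len(S)
    return [[(S[i][j] + S[j][i]) / 2.0 for j in range(n)] for i in range(n)]


def normalized_laplacian(W):
    """Return I - D^(-1/2) W D^(-1/2), where D holds the row sums of W."""
    n = len(W)
    degrees = [sum(row) for row in W]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * W[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]


# Hypothetical sparse vectors forming two obvious groups.
vectors = [
    {0: 1.0, 1: 1.0},
    {0: 1.0, 1: 0.9},
    {5: 1.0, 6: 1.0},
    {5: 0.8, 6: 1.0},
]

W = symmetrize(similarity_matrix(vectors))
L = normalized_laplacian(W)
for row in L:
    print([round(x, 3) for x in row])
```

Vectors with no shared attribute index have zero similarity, so the matrix for this toy input is block-structured, which is the property spectral clustering exploits.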

Description

Technical field

[0001] The invention relates to the fields of text data clustering, machine learning, and distributed computing, and in particular to a Spark-based clustering method for high-dimensional sparse text data.

Background art

[0002] With the advent of the big data era, the Internet has accumulated more and more network data, and this accumulated data has reached the limit of what ordinary computers can handle. To deal with increasingly difficult data processing problems, many industries have turned their attention to the Spark-based distributed processing platform and to parallel storage technology for sparse data sets.

[0003] Spark is a distributed programming framework for big data similar to Hadoop, but some differences between the two make Spark better suited to certain workloads. In particular, Spark supports in-memory distributed datasets, which can optimize iterative workloads in addition to providing interactive queries. T...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35
CPC: G06F16/35
Inventor: 王进, 黄超, 莫倩雯, 陈乔松, 邓欣, 欧阳卫华, 胡峰, 李智星, 雷大江
Owner: 芽米科技(广州)有限公司