Spark framework-based parallelization method of text clustering model PW-LDA

A text clustering model technology, applied to text database clustering/classification and unstructured text data retrieval. It addresses problems such as poor performance and time-consuming hard disk I/O, with the aim of speeding up program execution for large-scale data and algorithms of high complexity.

Active Publication Date: 2019-04-02
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

MapReduce performs well when parallelizing many programs, but it performs poorly on some large matrix operations. In addition, the Hadoop framework stores intermediate data on the hard disk, so repeatedly reading and writing that data spends a great deal of time on hard disk I/O (input/output).

Embodiment

[0024] As shown in Figures 1 to 3, the Spark framework-based parallelization method of the text clustering model PW-LDA of the present invention mainly comprises the following steps:

[0025] S1: Load the scientific literature corpus and initialize it as a Spark distributed data type object.

[0026] S2: Segment the text of the imported corpus into words through the Map method, and remove stop words to obtain the training samples.

[0027] S3: Use Spark's Word2Vec interface to perform word vector training on the training samples.

[0028] S4: Based on the Word2Vec result, use the Partition algorithm to extract the target segments from the text of the training samples, and parallelize the algorithm through the Map method.

[0029] S5: Use Spark's GraphX-based LDA interface to train on the target segments extracted by the Partition algorithm and obtain a topic-word matrix.

[0030] S6: Calculate the topic vector according to the topic-word matrix obta...
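
The preprocessing in step S2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: Python's built-in map stands in for the per-record Map method that Spark applies across partitions, and the regex tokenizer, the STOP_WORDS list, and the function names are all assumptions introduced here. A real corpus may need a dedicated word segmenter and a fuller stop-word list.

```python
import re

# Hypothetical stop-word list; the source does not specify which one is used.
STOP_WORDS = frozenset({"a", "an", "the", "of", "and", "is", "in", "to", "for"})


def tokenize(text):
    """Lowercase a document and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(document):
    """Per-document function: the unit of work handed to the map step."""
    return remove_stop_words(tokenize(document))


corpus = [
    "The topic model is a probabilistic model.",
    "Word embedding maps a word to a vector.",
]

# Each document is processed independently, which is what makes the step
# straightforward to distribute: no record depends on any other record.
training_samples = list(map(preprocess, corpus))

for sample in training_samples:
    print(sample)
```

Because preprocess takes one document and returns one token list without shared state, the same function could in principle be passed unchanged to a distributed map over the corpus.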

Abstract

The invention relates to the field of text topic clustering, in particular to a Spark framework-based parallelization method of the text clustering model PW-LDA. The method mainly comprises the steps of data loading, text data preprocessing, word vector training, Partition target segment extraction, LDA training, topic vector calculation, and text clustering. The method uses the Spark framework and designs and implements each module of the model in parallel through MapReduce and GraphX, which greatly speeds up program execution and thereby makes real-time operation of the modules feasible.

Description

Technical field

[0001] The present invention relates to the field of text topic clustering, and more specifically to a parallelization method of the text clustering model PW-LDA based on the Spark framework.

Background

[0002] The PW-LDA model is a new text clustering model that combines the topic model LDA (Latent Dirichlet Allocation) with the word embedding model Word2Vec. The topic model is a probabilistic model. Unlike the traditional vector space model, it no longer analyzes a document only in the word frequency space; it introduces a topic space, reducing the dimensionality of document analysis from the word frequency space to the topic space. The word embedding model is also a probabilistic model: it computes a vector for each word so that the word sequence probabilities corresponding to the vectors conform to the actual text. PW-LDA also proposes the Partition algorithm. According to the vector results...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/35
Inventor: 陆遥, 夏中舟, 吴峻峰, 张勇瑞
Owner: SUN YAT SEN UNIV