Spark framework-based parallelization method of text clustering model PW-LDA
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Publication Date
- 2019-04-02
Abstract
Description
Technical field
[0001] The present invention relates to the field of text topic clustering, and more specifically to a method for parallelizing the text clustering model PW-LDA on the Spark framework.
Background art
[0002] PW-LDA is a new text clustering model that combines the topic model LDA (Latent Dirichlet Allocation) with the word embedding model Word2Vec. A topic model is a probabilistic model: unlike the traditional vector space model, it no longer analyzes a document only in the word-frequency space but introduces a topic space, reducing the dimensionality of document analysis from the word-frequency space to the topic space. The word embedding model is also a probabilistic model: it computes a vector for each word so that the word-sequence probabilities corresponding to those vectors conform to the actual text. PW-LDA also proposes a partition algorithm. According to the vector results...
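The reduction from word-frequency space to topic space mentioned in this paragraph can be illustrated with a minimal sketch. It is not the PW-LDA procedure of the invention: it assumes two hypothetical, already-trained topic-word distributions (`PHI`) over a five-word toy vocabulary, and estimates one document's topic proportions with a plain EM fold-in that keeps the topics fixed. The names `VOCAB`, `PHI`, and `infer_topic_mixture`, and all numbers, are illustrative assumptions.

```python
# Toy illustration: a document described by 5 word frequencies is mapped
# to a 2-dimensional topic mixture. All values are made up.

VOCAB = ["spark", "cluster", "node", "topic", "word"]

# Hypothetical topic-word distributions; each row sums to 1.
PHI = [
    [0.40, 0.30, 0.25, 0.03, 0.02],  # topic 0: distributed computing
    [0.02, 0.03, 0.05, 0.45, 0.45],  # topic 1: text modelling
]


def infer_topic_mixture(counts, phi, iterations=50):
    """Estimate one document's topic proportions by EM with fixed topics."""
    num_topics = len(phi)
    total = sum(counts)
    theta = [1.0 / num_topics] * num_topics  # start from a uniform mixture
    for _ in range(iterations):
        expected = [0.0] * num_topics
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # E-step: share of word w attributed to each topic.
            weights = [theta[k] * phi[k][w] for k in range(num_topics)]
            norm = sum(weights)
            for k in range(num_topics):
                expected[k] += n * weights[k] / norm
        # M-step: normalise expected topic counts into proportions.
        theta = [e / total for e in expected]
    return theta


doc_counts = [3, 2, 1, 0, 0]  # word frequencies aligned with VOCAB
theta = infer_topic_mixture(doc_counts, PHI)
print(len(doc_counts), "dimensions ->", len(theta), "dimensions:", theta)
```

In this sketch the document is summarised by two topic proportions instead of five word counts. In a real corpus the vocabulary has tens of thousands of entries while the number of topics is typically far smaller, which is where the dimensionality reduction comes from.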