Spark framework-based parallelization method of text clustering model PW-LDA

A text clustering model technology, applied in text database clustering/classification and unstructured text data retrieval. It addresses problems such as poor performance and time-consuming hard disk I/O, with the effects of supporting large-scale data and algorithms of high complexity while speeding up program execution.
CN109558482A | Active | Publication Date: 2019-04-02 | SUN YAT SEN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SUN YAT SEN UNIV
Publication Date
2019-04-02


Abstract

The invention relates to the field of text topic clustering, in particular to a Spark framework-based parallelization method for the text clustering model PW-LDA. The method mainly comprises the steps of data loading, text data preprocessing, word vector training, Partition target segment extraction, LDA training, topic vector calculation, and text clustering. Using the Spark framework, the modules of the model are designed and implemented in parallel with the MapReduce and GraphX technologies, which greatly accelerates program execution and makes real-time operation of the modules feasible.
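
As an illustration of the MapReduce pattern that the abstract names, the sketch below counts word occurrences during text preprocessing. It is a minimal, single-process stand-in written in plain Python, not the patented implementation and not Spark code: the corpus and the names `tokenize`, `map_partition`, `merge_counts`, and `word_counts` are hypothetical. In Spark, the map calls would run on separate partitions of a distributed dataset.

```python
from collections import Counter
from functools import reduce
import re

# Toy corpus standing in for the loaded text data (hypothetical example data).
DOCS = [
    "Spark speeds up text clustering",
    "Text clustering groups similar text",
]

def tokenize(doc):
    """Preprocessing step: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", doc.lower())

def map_partition(docs):
    """Map phase: count the words within one partition of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts

def merge_counts(left, right):
    """Reduce phase: merge two partial counts (associative and commutative)."""
    return left + right

def word_counts(docs, num_partitions=2):
    """Split docs into partitions, map each one, then reduce the partial results."""
    partitions = [docs[i::num_partitions] for i in range(num_partitions)]
    partials = map(map_partition, partitions)  # each call is independent of the others
    return reduce(merge_counts, partials, Counter())

counts = word_counts(DOCS)
print(counts.most_common(3))
```

Because the merge step is associative and commutative, the partial counts can be combined in any order, which is the property that lets a framework such as Spark distribute the map phase across workers.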

Description

Technical field

[0001] The present invention relates to the field of text topic clustering, and more specifically to a parallelization method for the text clustering model PW-LDA based on the Spark framework.

Background art

[0002] The PW-LDA model is a new text clustering model that combines the topic model LDA (Latent Dirichlet Allocation) with the word embedding model Word2Vec. The topic model is a probabilistic model. Unlike the traditional vector space model, it no longer analyzes a document only in the word-frequency space; it introduces a topic space, reducing the dimensionality of document analysis from the word-frequency space to the topic space. The word embedding model is also a probabilistic model: it learns a vector for each word such that the word-sequence probabilities implied by the vectors match those observed in actual text. PW-LDA also proposes the Partition algorithm. According to the vector results...
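
The reduction from word-frequency space to topic space mentioned above can be written with the standard LDA mixture decomposition. This is the textbook formulation, not a formula taken from the patent text. The symbols are notation assumed here: $K$ is the number of topics, $V$ the vocabulary size, $\phi_{k,w}$ the probability of word $w$ under topic $k$, and $\theta_{d,k}$ the weight of topic $k$ in document $d$.

```latex
p(w \mid d) \;=\; \sum_{k=1}^{K} p(w \mid z = k)\, p(z = k \mid d)
          \;=\; \sum_{k=1}^{K} \phi_{k,w}\, \theta_{d,k}
```

Under this view a document is described by the $K$ entries of $\theta_d$ rather than by $V$ word counts, and $K$ is typically far smaller than $V$.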

Claims
