Spark framework-based parallelization method of text clustering model PW-LDA

A text clustering model technology, applied in text database clustering/classification and unstructured text data retrieval, that addresses problems such as poor performance and time-consuming hard disk I/O, with the effect of speeding up program execution and suiting large-scale data and algorithms of high complexity.

Active Publication Date: 2019-04-02

SUN YAT SEN UNIV

5 Cites, 6 Cited by


AI Technical Summary

Problems solved by technology

MapReduce performs well when parallelizing many programs, but it performs poorly on some large matrix operations. In addition, the Hadoop framework saves intermediate data to the hard disk, so the hard disk I/O (input/output) incurred by repeatedly reading and writing that data takes a lot of time.

Embodiment

[0024] As shown in Figures 1 to 3, the parallelization method of the text clustering model PW-LDA based on the Spark framework of the present invention mainly comprises the following steps:

[0025] S1: Load the corpus data of scientific literature and initialize it as a distributed data type object of Spark.

[0026] S2: Segment the text in the imported corpus through the Map method, and remove stop words to obtain training samples.

[0027] S3: Use Spark's Word2Vec interface to perform word vector training on the training samples.

[0028] S4: According to the result of Word2Vec, use the Partition algorithm to extract the target segment from the text of each training sample, and parallelize the algorithm through the Map method.

[0029] S5: Use Spark's GraphX-based LDA interface to train the target segment extracted by the Partition algorithm to obtain a topic-word matrix.

[0030] S6: Calculate the topic vector according to the topic-word matrix obta...
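The Map-style preprocessing in S2 can be sketched as follows. This is a minimal plain-Python illustration, not the patent's implementation: the built-in `map` stands in for Spark's distributed `map`, and the stop-word list, the regex tokenizer (which assumes English text rather than a word-segmentation tool), and the names `STOP_WORDS`, `tokenize`, `remove_stop_words`, and `preprocess` are all hypothetical.

```python
import re

# Hypothetical stop-word list; the patent does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "to", "for"}


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens.

    Assumption: English text. A real corpus may need a dedicated
    word-segmentation tool instead of a regex.
    """
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(corpus):
    """Tokenize and filter each document independently.

    No document depends on any other, which is the property that
    lets a framework distribute the same function over partitions.
    """
    return list(map(lambda doc: remove_stop_words(tokenize(doc)), corpus))


corpus = [
    "The clustering of scientific text is a topic model task.",
    "Word vectors and topic models for text clustering.",
]
samples = preprocess(corpus)
```

Because `preprocess` applies a pure function to each document, the same function could in principle be passed to a distributed map over the corpus object created in S1.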

Abstract

The invention relates to the field of text topic clustering, in particular to a Spark framework-based parallelization method of a text clustering model PW-LDA. The method mainly comprises the steps of data loading, text data preprocessing, word vector training, Partition target segment extraction, LDA training, topic vector calculation, text clustering and the like. The method uses the Spark framework and designs and implements the modules of the model in parallel through the MapReduce and GraphX technologies. This greatly accelerates program operation and therefore makes real-time operation feasible for the modules.

Description

Technical field

[0001] The present invention relates to the field of text topic clustering, and more specifically to a parallelization method of a text clustering model PW-LDA based on the Spark framework.

Background

[0002] The PW-LDA model is a new text clustering model that combines the topic model LDA (Latent Dirichlet Allocation) with the word embedding model Word2Vec. The topic model is a probabilistic model. Compared with the traditional vector space model, the document is no longer analyzed only in the word frequency space; instead, a topic space is introduced, which reduces the dimensionality of document analysis from the word frequency space to the topic space. The word embedding model is also a probabilistic model: by computing a vector for each word, it makes the word sequence probabilities corresponding to the vectors conform to actual text. PW-LDA also proposes the Partition algorithm. According to the vector results...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/35

Inventor: 陆遥, 夏中舟, 吴峻峰, 张勇瑞

Owner: SUN YAT SEN UNIV
