Parallel PLSA (Probabilistic Latent Semantic Analysis) method based on Hadoop

A Hadoop-based parallel implementation of PLSA that addresses the slow running time of PLSA on large data sets, reducing overall running time and improving computational efficiency.

Inactive Publication Date: 2012-11-14

NANJING UNIV +1


Problems solved by technology

[0005] Purpose of the invention: in view of the problems and deficiencies of the prior art described above, the invention provides a Hadoop-based parallel PLSA method that addresses the slow running time of PLSA when processing massive amounts of data.

Embodiment Construction

[0018] The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. These embodiments serve only to illustrate the invention and are not intended to limit its scope; after reading this disclosure, modifications in equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.

[0019] The present invention parallelizes the algorithm mainly through MapReduce; the process is shown in Figure 1. As shown in Figure 2, each probability result obtained in one iteration is used as the input to the next iteration, so that the results are updated continuously until convergence.
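The feedback loop in [0019] can be sketched as a small driver, shown here in plain Python rather than Hadoop. This is a minimal illustration, not the patent's implementation: `iterate_until_converged`, the `step` callback, the flat `{key: probability}` model, and the max-absolute-change stopping rule with `tol` and `max_iters` are all assumptions, since the visible text does not state a convergence criterion. On a cluster, `step` would correspond to one MapReduce job that reads the previous job's output.

```python
def iterate_until_converged(step, model, tol=1e-6, max_iters=100):
    """Repeatedly apply `step`, feeding each result back in as the next input.

    `step` is a hypothetical stand-in for one full MapReduce pass.
    `model` is assumed to be a flat dict mapping a parameter key to a probability.
    Stops when the largest absolute parameter change drops below `tol`
    (an assumed criterion) or after `max_iters` passes.
    """
    iterations = 0
    while iterations < max_iters:
        new_model = step(model)
        # Largest change in any single parameter between two passes.
        change = max(abs(new_model[key] - model[key]) for key in model)
        model = new_model  # output of this pass becomes input of the next
        iterations += 1
        if change < tol:
            break
    return model, iterations


# Toy step with a known fixed point at 0.5, used only to exercise the driver.
def toy_step(model):
    return {key: 0.5 * value + 0.25 for key, value in model.items()}


final_model, n_iters = iterate_until_converged(toy_step, {"a": 0.0, "b": 1.0})
```

The driver only needs to compare two consecutive models, which is why each pass can write its result to storage and the next pass can read it back.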

[0020] Method 1 of the present invention is described in detail below:

[0021] The main idea of t...

Abstract

The invention discloses a Hadoop-based parallel PLSA (Probabilistic Latent Semantic Analysis) method. The method comprises the following steps: the data is stored in a distributed data storage environment; the probabilistic model file to be updated is split, and each split serves as the input of one mapper; each iterative update in the EM (expectation-maximization) process is computed with MapReduce, through a map function on the mapper side, a reduce function on the reducer side, and the key-value pairs sent between them; the result of each iteration is used as the input of the next iteration; and iteration continues until all results converge. Parallelizing PLSA with MapReduce allows it to be applied to large amounts of data, reduces the overall running time, and improves computational efficiency.
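One EM update of this kind can be sketched as a map function, a reduce function, and the key-value pairs between them, simulated in-process in Python. This is a hedged illustration under stated assumptions, not the patented key design: the names `map_fn`, `reduce_fn`, `normalise`, and `run_iteration`, the `("w|z", z, word)` and `("z|d", doc, z)` keys, and the choice of the P(w|z), P(z|d) parametrization are all hypothetical. As a simplification, the sketch maps over (document, word, count) records rather than over splits of the model file.

```python
from collections import defaultdict


def map_fn(record, p_w_z, p_z_d, topics):
    """E-step for one (doc, word, count) record, emitting keyed expected counts."""
    doc, word, count = record
    # Unnormalised posterior P(z | d, w), proportional to P(w | z) * P(z | d).
    scores = [p_w_z[z][word] * p_z_d[doc][z] for z in topics]
    total = sum(scores)
    if total == 0.0:
        return  # no topic explains this pair; emit nothing
    for z, score in zip(topics, scores):
        expected = count * score / total
        yield ("w|z", z, word), expected  # contributes to P(w | z)
        yield ("z|d", doc, z), expected   # contributes to P(z | d)


def reduce_fn(key, values):
    """Sum all expected counts that share a key."""
    return key, sum(values)


def normalise(sums, kind):
    """M-step: turn summed counts of one kind into conditional distributions."""
    table = defaultdict(dict)
    for (k, given, outcome), value in sums.items():
        if k == kind:
            table[given][outcome] = value
    result = {}
    for given, row in table.items():
        total = sum(row.values())
        if total > 0:
            result[given] = {o: v / total for o, v in row.items()}
    return result


def run_iteration(records, p_w_z, p_z_d, topics):
    """One EM pass: map every record, group by key, reduce, then normalise."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for record in records:
        for key, value in map_fn(record, p_w_z, p_z_d, topics):
            grouped[key].append(value)
    sums = dict(reduce_fn(key, values) for key, values in grouped.items())
    return normalise(sums, "w|z"), normalise(sums, "z|d")
```

Because the reducer only sums values per key, the map calls are independent of one another and can run on separate mappers; the returned pair of tables has the same shape as the input model, so it can be passed straight into the next pass.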

Description

Technical field

[0001] The invention relates to a method for implementing PLSA in parallel, in particular to a Hadoop-based parallel PLSA method.

Background art

[0002] PLSA (Probabilistic Latent Semantic Analysis) is a statistical method for analyzing two-mode or co-occurrence data. It builds on the original LSA model by adding a suitable probability model: by introducing a latent class model and performing a mixture decomposition, it recovers latent information about the data.

[0003] Hadoop is a distributed system infrastructure developed by the Apache Foundation that adopts the MapReduce programming model. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the cluster for high-speed computation and storage.

[0004] PLSA may require a large number of iterations, making the overall running time very long, so it is generally only suitable for smaller data sets. If it is u...
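For reference, the latent class model and mixture decomposition mentioned in [0002] are commonly written as below. This is the standard textbook formulation of PLSA, given here as an assumption; the visible text of the patent does not say which parametrization it uses. The symbols are illustrative: $d$ is a document, $w$ a word, $z$ a latent class, and $n(d, w)$ the number of times $w$ occurs in $d$.

```latex
% Mixture decomposition of the co-occurrence probability
P(d, w) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d)

% E-step: posterior of the latent class for each (d, w) pair
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z'} P(w \mid z')\, P(z' \mid d)}

% M-step: re-estimate the parameters from expected counts
P(w \mid z) = \frac{\sum_{d} n(d, w)\, P(z \mid d, w)}{\sum_{w'} \sum_{d} n(d, w')\, P(z \mid d, w')}
\qquad
P(z \mid d) = \frac{\sum_{w} n(d, w)\, P(z \mid d, w)}{\sum_{z'} \sum_{w} n(d, w)\, P(z' \mid d, w)}
```

Each E-step term depends on a single $(d, w)$ pair, and each M-step numerator is a sum over such terms, which is what makes the many iterations noted in [0004] amenable to a map-then-sum decomposition.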

Application Information

IPC(8): G06F9/38, G06F17/30

Inventor: 高阳, 金龑, 杨育彬, 商琳

Owner: NANJING UNIV
