Parallel PLSA (Probabilistic Latent Semantic Analysis) method based on Hadoop

A technique for the parallel implementation of PLSA that addresses the slow running of PLSA on large data, reducing the overall running time and improving computational efficiency.

Inactive Publication Date: 2012-11-14
NANJING UNIV +1

AI Technical Summary

Problems solved by technology

[0005] Purpose of the invention: in view of the problems and deficiencies of the above-mentioned prior art, the purpose of the invention is


Embodiment Construction

[0018] The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these embodiments only illustrate the present invention and are not intended to limit its scope. After reading the present invention, modifications in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of this application.

[0019] The present invention realizes the parallelization of the algorithm mainly through MapReduce; the process is shown in Figure 1. As shown in Figure 2, for each probability result, the result obtained in the previous iteration is used as the input to the next iteration, so that the result is updated continuously until convergence.
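The feedback loop described in [0019] can be sketched as a small driver. This is only an illustration under stated assumptions: the names `iterate_until_converged`, `max_abs_change`, and `toy_job`, the tolerance, and the convergence test (largest absolute parameter change) are hypothetical and are not taken from the patent, which does not state its stopping criterion in the text available here. `toy_job` merely stands in for one MapReduce pass.

```python
from typing import Callable, Dict, Tuple

Params = Dict[str, float]


def max_abs_change(old: Params, new: Params) -> float:
    """Largest absolute difference between two parameter tables."""
    keys = set(old) | set(new)
    return max(abs(old.get(k, 0.0) - new.get(k, 0.0)) for k in keys)


def iterate_until_converged(
    run_job: Callable[[Params], Params],
    initial: Params,
    tol: float = 1e-6,
    max_iters: int = 100,
) -> Tuple[Params, int]:
    """Feed each job output back in as the next input until it stops changing."""
    params = initial
    for i in range(1, max_iters + 1):
        updated = run_job(params)  # one full pass (hypothetical job wrapper)
        if max_abs_change(params, updated) < tol:  # assumed stopping rule
            return updated, i
        params = updated
    return params, max_iters


def toy_job(params: Params) -> Params:
    """Stand-in for a real job: every value moves halfway toward 0.5."""
    return {k: 0.5 * v + 0.25 for k, v in params.items()}


final, iters = iterate_until_converged(toy_job, {"a": 0.0, "b": 1.0}, tol=1e-9)
```

In a real deployment `run_job` would submit a job whose input is the previous iteration's output file; the loop structure itself would stay the same.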

[0020] Method 1 of the present invention is described in detail below:

[0021] The main idea of t...

Abstract

The invention discloses a parallel PLSA (Probabilistic Latent Semantic Analysis) method based on Hadoop. The method comprises the following steps: the data is stored in a distributed data storage environment; the probabilistic model file to be updated is split, and each split serves as the input of one mapper; each iterative update of the EM (expectation-maximization) process is computed with MapReduce, through a map function at the mapper end, a reduce function at the reducer end, and the key-value pairs sent between them; the result of each iteration is used as the input of the next iteration; and the iteration continues until all results converge. Through parallelization with MapReduce, PLSA can be applied to large amounts of data, the overall running time is reduced, and the calculation efficiency is improved.
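As a rough illustration of how one EM update can be expressed as a map function, a reduce function, and key-value pairs, the sketch below simulates a single pass in plain Python. It is not the patented implementation: it assumes the common asymmetric PLSA parameterization with tables P(w|z) and P(z|d), and the names `map_record`, `reduce_sum`, and `em_iteration`, the key layout, and the toy corpus are all hypothetical. The shuffle phase and the final normalization are done in-process here, whereas a Hadoop job would handle them across nodes.

```python
from collections import defaultdict


def map_record(record, p_w_z, p_z_d, topics):
    """E-step for one (doc, word, count) record; emits weighted counts."""
    d, w, n = record
    # Assumes strictly positive parameters so that the total is non-zero.
    joint = [p_w_z[(w, z)] * p_z_d[(z, d)] for z in topics]
    total = sum(joint)
    for z, j in zip(topics, joint):
        post = j / total  # posterior P(z | d, w)
        yield ("w|z", z, w), n * post
        yield ("z|d", d, z), n * post


def reduce_sum(key, values):
    """Reducer: add up all weighted counts that share a key."""
    return key, sum(values)


def em_iteration(records, p_w_z, p_z_d, topics):
    """One simulated map -> shuffle -> reduce pass plus normalization."""
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for rec in records:
        for key, value in map_record(rec, p_w_z, p_z_d, topics):
            grouped[key].append(value)
    sums = dict(reduce_sum(k, vs) for k, vs in grouped.items())

    # M-step: normalize per topic (for P(w|z)) and per document (for P(z|d)).
    norm = defaultdict(float)
    for (kind, a, _b), v in sums.items():
        norm[(kind, a)] += v
    new_p_w_z, new_p_z_d = {}, {}
    for (kind, a, b), v in sums.items():
        if kind == "w|z":
            new_p_w_z[(b, a)] = v / norm[(kind, a)]  # a = topic, b = word
        else:
            new_p_z_d[(b, a)] = v / norm[(kind, a)]  # a = doc, b = topic
    return new_p_w_z, new_p_z_d


# Toy corpus of (document, word, count) records; values are illustrative only.
records = [("d1", "apple", 2), ("d1", "pear", 1), ("d2", "car", 3)]
topics = [0, 1]
words = ["apple", "pear", "car"]
docs = ["d1", "d2"]
p_w_z = {(w, z): 1.0 / len(words) for w in words for z in topics}
p_z_d = {(0, "d1"): 0.6, (1, "d1"): 0.4, (0, "d2"): 0.3, (1, "d2"): 0.7}
p_w_z, p_z_d = em_iteration(records, p_w_z, p_z_d, topics)
```

Because every emitted value depends only on one record and the current parameter tables, the records can be split across mappers freely, which is what makes the update suitable for MapReduce.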

Description

Technical field

[0001] The invention relates to a method for realizing PLSA in parallel, in particular to a Hadoop-based parallel PLSA method.

Background technique

[0002] PLSA, that is, probabilistic latent semantic analysis, is a statistical method used to analyze two-mode or co-occurrence data. It builds on the original LSA model and adds a suitable probability model. By introducing a latent class model and performing a mixture decomposition, the latent information about the data is obtained.

[0003] Hadoop, a distributed system infrastructure developed by the Apache Software Foundation, adopts the MapReduce programming model. Users can develop distributed programs without knowing the underlying details of the distribution, making full use of the power of the cluster for high-speed computing and storage.

[0004] PLSA may require a large number of iterations, making the overall running time very long, so it is generally only suitable for smaller data sets. If it is u...

Application Information

IPC(8): G06F9/38, G06F17/30
Inventor: 高阳, 金龑, 杨育彬, 商琳
Owner: NANJING UNIV