Parallel data processing method based on latent dirichlet allocation model

A technique based on the latent Dirichlet allocation model, applied in the fields of electric digital data processing, special data processing applications, instruments, etc. It addresses problems such as high data sparsity and large information loss, which hinder text information processing.

Status: Inactive
Publication Date: 2009-02-04
INST OF SOFTWARE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0005] Common text data representation methods fall into two main types. The first type represents text using only basic word information; its representative is the Vector Space Model (reference: Salton, G. The SMART Retrieval System. Englewood Cliffs, Prentice-Hall, 1971.). Methods of this type suffer from large information loss and high data sparsity, both of which hinder text information processing.


Examples

Embodiment Construction

[0130] The three parallel data processing schemes designed by the present invention target the two most widely used high-performance computing environments: a multi-core (including multi-processor) parallel architecture on a single machine, and a cluster parallel architecture across multiple machines. Multi-core designs are now widely adopted in all types of computers, and the present invention can be used directly on such machines. When the present invention is used on a multi-machine cluster, the network topology is as shown in Figure 6. It consists of two basic components: a master control node and several computing nodes. Only one master control node is needed; it is mainly responsible for functions such as interface interaction, data distribution, and result summarization. There are multiple computing nodes (in principle, with no limit on their number), and different types of computers can be selected. The com...


Abstract

The invention discloses a parallel data processing method based on the latent Dirichlet allocation (LDA) model, which belongs to the field of data mining. The method comprises three schemes: multi-process parallel processing, multi-thread parallel processing, and composite multi-process, multi-thread processing. In all three schemes, the data to be processed, DM, is divided into data segments of equal or unequal length, and each data segment is given an index. Each process or thread uses the index to process its corresponding data segment, obtaining the topic information of each data item and generating a local sufficient statistic. Once the whole of DM has been processed, the local sufficient statistics are merged into a global sufficient statistic, from which the current model Mi is estimated; this is repeated until the model converges. The method can exploit both the multi-core parallel architecture of a single computer and the large-scale parallel capability of a multi-computer cluster to process large text collections at high speed, and it effectively reduces memory usage during parallel processing.
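The segment, index, local statistic, and merge steps described in the abstract can be illustrated with a minimal Python sketch of the multi-thread scheme. It is a sketch under stated assumptions, not the patented procedure: all function names (`make_segments`, `local_statistic`, `merge`, `estimate`, `iterate`) are hypothetical, the toy data is invented for illustration, and the per-item topic inference is a simplified placeholder (a soft assignment of each word to topics) rather than LDA's actual inference step.

```python
from concurrent.futures import ThreadPoolExecutor

def make_segments(num_items, num_workers):
    """Return (start, end) index pairs that split the data into near-equal segments."""
    base, extra = divmod(num_items, num_workers)
    segments, start = [], 0
    for i in range(num_workers):
        end = start + base + (1 if i < extra else 0)
        segments.append((start, end))
        start = end
    return segments

def local_statistic(docs, segment, topic_word):
    """Process one segment through its index and return a local sufficient statistic.

    Placeholder inference (an assumption, not the patent's procedure): each
    word token is softly assigned to topics in proportion to the current
    topic-word probabilities.
    """
    start, end = segment
    num_topics, vocab_size = len(topic_word), len(topic_word[0])
    stat = [[0.0] * vocab_size for _ in range(num_topics)]
    for i in range(start, end):          # index into shared data, no copy
        for w in docs[i]:
            total = sum(topic_word[k][w] for k in range(num_topics))
            for k in range(num_topics):
                stat[k][w] += topic_word[k][w] / total
    return stat

def merge(local_stats):
    """Sum local statistics element-wise into the global sufficient statistic."""
    num_topics, vocab_size = len(local_stats[0]), len(local_stats[0][0])
    return [[sum(s[k][w] for s in local_stats) for w in range(vocab_size)]
            for k in range(num_topics)]

def estimate(global_stat, smoothing=0.01):
    """Re-estimate topic-word probabilities by normalizing each topic's row."""
    model = []
    for row in global_stat:
        total = sum(row) + smoothing * len(row)
        model.append([(c + smoothing) / total for c in row])
    return model

def iterate(docs, topic_word, num_workers=2):
    """One pass: segment the data, process segments in threads, merge, estimate."""
    segments = make_segments(len(docs), num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        local_stats = list(pool.map(
            lambda seg: local_statistic(docs, seg, topic_word), segments))
    return estimate(merge(local_stats))

# Toy data (hypothetical): each document is a list of word ids.
docs = [[0, 1, 1], [2, 3], [0, 0, 1], [3, 3, 2], [1, 0]]
model = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]  # initial topic-word table
for _ in range(20):
    new_model = iterate(docs, model)
    converged = max(abs(a - b) for r, s in zip(new_model, model)
                    for a, b in zip(r, s)) < 1e-6
    model = new_model
    if converged:
        break
```

Because every worker reads the shared data through index ranges instead of receiving its own copy, only the small per-worker statistic tables are duplicated, which is one plausible reading of the memory saving the abstract mentions. In CPython, threads mainly illustrate the structure; a multi-process variant would follow the same segment and merge pattern.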

Description

Technical field

[0001] The invention relates to a text data mining method, in particular to an efficient data processing method based on latent topic text representation, and belongs to the field of computer data mining.

Background technique

[0002] Computer data mining

[0003] Computer data mining refers to the intelligent information processing that uses computers to obtain effective, useful, and understandable information or knowledge from large amounts of data. Early computer data mining focused mainly on mining regular numerical data in database systems. With the continuous expansion of the Internet and the great enrichment of its applications, computer data mining has gradually turned to Internet information processing. The data carried on the Internet differs greatly from the data in database systems: first, data on the Internet is mainly text written in natural language, while data in database systems is mainly numerical; second...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 李文波, 孙乐
Owner: INST OF SOFTWARE - CHINESE ACAD OF SCI