Parallel data processing method based on the latent Dirichlet allocation model

A latent Dirichlet allocation technique, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses the high data sparsity and large information loss that hinder text information processing.

Inactive Publication Date: 2010-06-16
INST OF SOFTWARE - CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0005] Common text data representation methods fall into two main types. The first represents text using only basic word information; its representative is the vector space model (Vector Space Model; reference: Salton, G. The SMART Retrieval System. Englewood Cliffs, Prentice-Hall, 1971.). Methods of this kind suffer from large information loss and high data sparsity, which hinder text information processing.
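The sparsity and information loss mentioned above can be illustrated with a small sketch. It is not taken from the patent: the helper names `build_vsm` and `sparsity` and the sample sentences are hypothetical, and the code only shows a plain term-count vector space representation.

```python
def build_vsm(docs):
    """Map each document to a term-count vector over the shared vocabulary."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vector = [0] * len(vocab)
        for word in doc.split():
            vector[index[word]] += 1
        vectors.append(vector)
    return vocab, vectors


def sparsity(vectors):
    """Fraction of vector entries that are zero."""
    cells = sum(len(v) for v in vectors)
    zeros = sum(1 for v in vectors for x in v if x == 0)
    return zeros / cells if cells else 0.0


# Hypothetical sample data: three short documents with no shared words.
docs = [
    "topic models find topics",
    "parallel threads share memory",
    "clusters run many processes",
]
vocab, vectors = build_vsm(docs)
print(len(vocab), sparsity(vectors))
```

Even in this tiny collection two thirds of the entries are zero, and the proportion grows with vocabulary size. Word order is also discarded: "a b" and "b a" map to the same vector, which is one form of the information loss described.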

Examples

Embodiment Construction

[0129] The three parallel data processing schemes designed by the present invention target the two most widely used high-performance computing environments: a multi-core (including multi-processor) parallel architecture on a single machine, and a cluster parallel architecture across multiple machines. Multi-core designs are now common in computers of all types, and the present invention can be used directly on such machines. The cluster architecture, as shown in Figure 6, consists of two basic components: a master control node and several computing nodes. There is only one master control node, which is mainly responsible for interface interaction, data distribution, result summarization, and similar functions. There can be multiple computing nodes (in principle, with no limit on their number), and they may be computers of different types. The computing nodes carry the main computational workload of the solving task. The master cont...
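As a rough illustration of the master node's data distribution and result summarization roles, the sketch below plans contiguous data segments for a set of computing nodes. The proportional allocation rule, the `node_weights` values, and the function names are assumptions made for this example; the text above does not say how segment lengths are chosen for nodes of different types.

```python
def plan_segments(n_items, node_weights):
    """Split range(n_items) into one contiguous (start, end) segment per node.

    Segment lengths are roughly proportional to node_weights, which stand in
    for hypothetical relative speeds of the computing nodes.
    """
    total = sum(node_weights)
    segments, start, accumulated = [], 0, 0.0
    last = len(node_weights) - 1
    for i, weight in enumerate(node_weights):
        accumulated += weight
        # The last node always takes whatever remains, so no item is dropped.
        end = n_items if i == last else round(n_items * accumulated / total)
        segments.append((start, end))
        start = end
    return segments


def summarize(partial_results):
    """Master-side result summary: add up the per-node partial results."""
    return sum(partial_results)


# Hypothetical data: token counts of 12 documents, three nodes of unequal speed.
doc_lengths = [5, 3, 8, 2, 7, 1, 4, 6, 9, 2, 3, 5]
segments = plan_segments(len(doc_lengths), [1, 2, 3])
partials = [sum(doc_lengths[a:b]) for a, b in segments]  # work done per node
print(segments, summarize(partials))
```

Equal weights give equal-length segments, and unequal weights give unequal ones. In a real cluster the per-node work would run on separate machines rather than in a list comprehension.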

Abstract

The invention discloses a parallel data processing method based on the latent Dirichlet allocation (LDA) model, belonging to the field of data mining. The method comprises three schemes: multi-process parallel processing, multi-thread parallel processing, and composite multi-process multi-thread processing. In all three schemes, the data to be processed, DM, is divided into data segments of equal or unequal length, and each data segment is given an index. Each process or thread uses the index to process its corresponding data segment, obtaining the topic information of each data item and generating a local sufficient statistic. Once the whole of DM has been processed, the local sufficient statistics are merged into a global sufficient statistic, from which the current model Mi is estimated; this repeats until the model converges. The method can exploit both the multi-core parallel architecture of a single computer and the large-scale parallel capability of a multi-computer cluster to process large text collections at high speed, and it effectively reduces memory usage during parallel processing.
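The segment, local statistic, merge, and re-estimate loop described in the abstract can be sketched for the multi-thread case as follows. This is a simplified illustration under stated assumptions, not the patented procedure: the topic of each word is chosen by a hard assignment to the most probable topic under the current model, which stands in for whatever inference the patent actually uses, and all function names, the smoothing constant, and the sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_segments(n_docs, n_segments):
    """Index the data as (start, end) pairs of near-equal length."""
    base, extra = divmod(n_docs, n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        end = start + base + (1 if i < extra else 0)
        if end > start:
            segments.append((start, end))
        start = end
    return segments


def local_stats(docs, segment, model):
    """Process one segment and return its local topic-word counts.

    Assumption: each word is assigned to its most probable topic under the
    current model (a hard assignment used only to keep the sketch short).
    """
    n_topics, vocab_size = len(model), len(model[0])
    counts = [[0.0] * vocab_size for _ in range(n_topics)]
    start, end = segment
    for doc in docs[start:end]:
        for word in doc:
            topic = max(range(n_topics), key=lambda t: model[t][word])
            counts[topic][word] += 1.0
    return counts


def merge_stats(parts, n_topics, vocab_size):
    """Sum local statistics into the global sufficient statistic."""
    total = [[0.0] * vocab_size for _ in range(n_topics)]
    for part in parts:
        for k in range(n_topics):
            for w in range(vocab_size):
                total[k][w] += part[k][w]
    return total


def estimate_model(stats, smoothing=0.1):
    """Re-estimate topic-word probabilities from smoothed global counts."""
    model = []
    for row in stats:
        norm = sum(row) + smoothing * len(row)
        model.append([(count + smoothing) / norm for count in row])
    return model


def fit(docs, model, n_workers=2, max_iters=20, tol=1e-6):
    """Iterate until the model stops changing or max_iters is reached."""
    n_topics, vocab_size = len(model), len(model[0])
    segments = split_segments(len(docs), n_workers)
    for _ in range(max_iters):
        current = model
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            parts = list(pool.map(lambda seg: local_stats(docs, seg, current), segments))
        new_model = estimate_model(merge_stats(parts, n_topics, vocab_size))
        change = max(abs(a - b) for ra, rb in zip(new_model, model) for a, b in zip(ra, rb))
        model = new_model
        if change < tol:
            break
    return model


# Hypothetical data: documents as lists of word ids, two topics, four words.
docs = [[0, 1], [2, 3], [0, 0], [3, 3], [1, 2]]
initial = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
fitted = fit(docs, initial)
```

Because each worker only reads the shared model and writes to its own local count table, no locking is needed until the merge step. The multi-process and composite schemes would follow the same pattern with segments handed to processes instead of, or in addition to, threads.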

Description

Technical field

[0001] The invention relates to a text data mining method, in particular to a high-efficiency data processing method based on latent topic text representation, and belongs to the field of computer data mining.

Background

[0002] Computer data mining

[0003] Computer data mining refers to the intelligent information processing in which computers are used to obtain effective, useful, and understandable information or knowledge from large amounts of data. Early computer data mining focused mainly on mining regular numerical data in database systems. As the Internet has continued to grow and its applications have multiplied, computer data mining has gradually turned to Internet information processing. The data carried by the Internet differs greatly from the data in database systems: first, data on the Internet is mainly text written in natural language, while data in database systems is mainly numerical; It mainly s...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventors: 李文波, 孙乐
Owner: INST OF SOFTWARE - CHINESE ACAD OF SCI