Generation method and device of topic model and acquisition method and device of topic distribution

A topic model and topic distribution technology, applied in the computer field, that addresses the low accuracy and stability of topic distributions obtained with traditional topic models.

Active Publication Date: 2015-04-22
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

[0004] In view of this, the embodiments of the present invention provide a method and device for generating a topic model, and a method and device for obtaining a topic distribution, so as to solve the problem in the prior art that the topic distribution of a text obtained with a traditional topic model has low accuracy and stability.


Examples


Embodiment 1

[0077] An embodiment of the present invention provides a method for generating a topic model. Refer to Figure 1, which is a schematic flowchart of Embodiment 1 of the method for generating a topic model provided by the embodiment of the present invention. As shown in the figure, the method includes the following steps:

[0078] S101. Obtain a first posterior probability parameter of a word pair in a training sample.

[0079] Specifically, the prior probability parameter of the Dirichlet distribution of the word pairs in the training sample is obtained. The sum of a random number and this prior probability parameter is then taken as the first posterior probability parameter of the Dirichlet distribution of the word pairs, and this value is used as the first posterior probability parameter of the word pairs in the training sample.
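The initialization in [0079] can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the parameter is a vector with one component per topic, that the random number is drawn uniformly from [0, 1), and that the candidate expectation is the standard mean of a Dirichlet distribution (each parameter divided by the sum of all parameters). The names `init_posterior` and `dirichlet_mean` and the values `prior=0.1` and `num_topics=4` are hypothetical.

```python
import random


def init_posterior(prior, num_topics, rng):
    """First posterior parameter of one word pair (assumed per-topic vector):
    the prior parameter plus a random number for each topic."""
    return [prior + rng.random() for _ in range(num_topics)]


def dirichlet_mean(params):
    """Mean of a Dirichlet distribution: each parameter over the sum of all."""
    total = sum(params)
    return [p / total for p in params]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
posterior = init_posterior(prior=0.1, num_topics=4, rng=rng)
expectation = dirichlet_mean(posterior)  # candidate topic expectation
```

Adding a random number to the prior breaks the symmetry between topics, so the first candidate expectation is not uniform across topics.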

[0080] Or, according to the number of occurrences of the word pairs in the t...

Embodiment 2

[0093] Based on Embodiment 1 above, this embodiment of the present invention describes steps S101 to S104 of Embodiment 1 in detail. Refer to Figure 2, which is a schematic flowchart of Embodiment 2 of the method for generating a topic model provided by an embodiment of the present invention. As shown in the figure, the method includes the following steps:

[0094] S201. Obtain word pairs according to the text set.

[0095] Preferably, the short texts in the training sample can be traversed and word segmentation performed on each traversed short text, so as to obtain the set of entries corresponding to each short text. A word pair is formed from any two different entries in the entry set of a short text, so a word pair is a combination of any two entries that occur in the same short text.

[0096] If a word pair contains punctuation marks, numbers, or stop words, the word pair is removed.
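The word-pair construction in [0095] and [0096] can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for real word segmentation, the stop-word list is a tiny hypothetical sample, and the helper names `is_noise` and `word_pairs` are not from the source.

```python
from itertools import combinations
import string

# Hypothetical stop-word list; a real system would use a full lexicon.
STOP_WORDS = {"the", "a", "of", "and"}


def is_noise(entry):
    """True if the entry is a stop word, a number, or only punctuation."""
    return (entry in STOP_WORDS
            or entry.isdigit()
            or all(ch in string.punctuation for ch in entry))


def word_pairs(entries):
    """All unordered pairs of two different entries from one short text,
    dropping any pair that contains a noise entry."""
    unique = sorted(set(entries))  # distinct entries, deterministic order
    return [(a, b) for a, b in combinations(unique, 2)
            if not is_noise(a) and not is_noise(b)]


# Whitespace split is an assumption standing in for word segmentation.
pairs = word_pairs("the price of apple phone 2015 !".split())
# pairs == [("apple", "phone"), ("apple", "price"), ("phone", "price")]
```

Sorting the distinct entries before pairing means the same two entries always yield the same tuple, regardless of their order in the text.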

[0097] Prefera...

Embodiment 3

[0152] Based on Embodiments 1 and 2 above, this embodiment of the present invention provides a method for obtaining a topic distribution. Refer to Figure 3, which is a schematic flowchart of the method for obtaining a topic distribution provided by an embodiment of the present invention. As shown in the figure, the method includes the following steps:

[0153] S301. Acquire text to be processed.

[0154] S302. Obtain at least one word pair according to the text to be processed.

[0155] S303. Using a pre-generated topic model, obtain an expectation of a topic distribution of each word pair; wherein, the topic model is generated by using the above-mentioned method for generating a topic model.

[0156] S304. Obtain the topic distribution of the text to be processed according to the expectation of the topic distribution of each word pair.
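Steps S303 and S304 can be sketched as follows. The source does not state how the per-pair expectations are combined in S304, so this sketch assumes a plain average over the word pairs that the model knows about; the function name `text_topic_distribution`, the dictionary-based model, and the two-topic values are all hypothetical.

```python
def text_topic_distribution(pairs, pair_topic_expectation):
    """Average the topic expectations of the word pairs found in the model.

    Averaging is an assumption; the source only says the text's topic
    distribution is obtained from the per-pair expectations.
    """
    known = [pair_topic_expectation[p] for p in pairs
             if p in pair_topic_expectation]
    if not known:
        return []  # no word pair of this text appears in the model
    num_topics = len(known[0])
    return [sum(e[k] for e in known) / len(known) for k in range(num_topics)]


# Hypothetical pre-generated model: word pair -> expectation over 2 topics.
model = {
    ("apple", "phone"): [0.8, 0.2],
    ("apple", "price"): [0.6, 0.4],
}
text_pairs = [("apple", "phone"), ("apple", "price"), ("phone", "price")]
dist = text_topic_distribution(text_pairs, model)  # roughly [0.7, 0.3]
```

Word pairs absent from the model are skipped here rather than given a default value, which is another design choice of this sketch and not something the source specifies.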

[0157] Preferably, the text to be processed may include, but is not limited to, query text, comment information, microblogs, e...


Abstract

The invention provides a method and device for generating a topic model, and a method and device for obtaining a topic distribution. First, a first posterior probability parameter of the word pairs in a training sample is obtained, and a candidate expectation of the topic distribution of those word pairs is obtained from that parameter. Next, a convergence degree of the topic model is obtained from the candidate expectation. If the convergence degree meets a termination condition, the candidate expectation of the topic distribution of the word pairs in the training sample is used as the target expectation of the topic distribution. Each word pair comprises two different entries in the training sample. The invention thereby addresses the problem in the prior art that obtaining the topic distribution of a text with a traditional topic model yields low accuracy and stability.

Description

【Technical field】

[0001] The present invention relates to the field of computer technology, in particular to a method and device for generating a topic model, and a method and device for obtaining a topic distribution.

【Background technique】

[0002] In machine learning and natural language processing, it is often necessary to mine the latent semantic relationships between words, that is, topics, from a large amount of text. By training a topic model and using it for prediction, the topic distribution of a text can be obtained, which can be used for text clustering and applied to subsequent tasks such as classification, retrieval, expansion, and recommendation.

[0003] In the prior art, traditional topic models, such as the Probabilistic Latent Semantic Analysis (PLSA) algorithm, the Non-negative Matrix Factorization (NMF) algorithm, and the Latent Dirichlet Allocation (LDA) algorithm, adopt the concept of bag o...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/24532, G06F16/27
Inventors: 石磊, 蒋佳军
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD