Method and device for generating LDA topic model based on bilingual parallel corpora

Topic-model and parallel-corpus technology, applied in natural language data processing, special data processing applications, instruments, etc., that addresses problems such as inaccurate word probability values in topic models.

Active Publication Date: 2020-05-08
传神联合(北京)信息技术有限公司

AI Technical Summary

Problems solved by technology

[0005] To solve the problem of inaccurate word probability values in topic models obtained by unsupervised training, the embodiments of the present invention provide a method and device for generating an LDA topic model based on a bilingual parallel corpus.



Examples

Embodiment Construction

[0048] To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in these embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on these embodiments, without creative effort, fall within the protection scope of the present invention.

[0049] Figure 1 is a schematic flow chart of the method for generating an LDA topic model based on a bilingual parallel corpus provided by an embodiment of the present invention. The method includes:

[0050] Step 100: Perform LDA topic modeling on the first language document library and the second language document library parallel to th...


Abstract

The embodiments of the invention provide a method and device for generating an LDA topic model based on bilingual parallel corpora. The method comprises: separately performing LDA topic modeling on a first-language document library and a second-language document library that is parallel to it, obtaining a first-language topic model and a second-language topic model; performing word alignment on the first-language topic model and the second-language topic model to obtain a word alignment relationship; based on the word alignment relationship, performing topic alignment on the two topic models to obtain all aligned first-language and second-language topics; and, for the groups of aligned words under each pair of aligned topics, adjusting the probability value with which each aligned word belongs to the topic in its own language, then normalizing the probability values to obtain a new LDA topic model. According to the embodiments of the invention, the precision of the topic model is improved.
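The last two steps of the abstract (topic alignment, then probability adjustment and normalization) can be illustrated with a minimal Python sketch. The text available here does not state the scoring rule used for topic alignment or the formula used to adjust probabilities, so both are assumptions made only for illustration: topics are paired by the summed product of aligned-word probabilities, and each aligned word pair is set to the mean of its two probabilities. All function names are hypothetical.

```python
def normalize(topic):
    """Rescale a word -> probability dict so its values sum to 1."""
    total = sum(topic.values())
    if total == 0:
        return dict(topic)
    return {word: prob / total for word, prob in topic.items()}


def alignment_score(topic_a, topic_b, aligned_pairs):
    """Assumed score: summed product of probabilities over aligned word pairs."""
    return sum(topic_a.get(wa, 0.0) * topic_b.get(wb, 0.0)
               for wa, wb in aligned_pairs)


def align_topics(topics_a, topics_b, aligned_pairs):
    """Map each first-language topic index to its best-scoring second-language topic.

    This is a greedy choice per topic; it does not enforce a one-to-one mapping.
    """
    mapping = {}
    for i, topic_a in enumerate(topics_a):
        scores = [alignment_score(topic_a, topic_b, aligned_pairs)
                  for topic_b in topics_b]
        mapping[i] = max(range(len(topics_b)), key=scores.__getitem__)
    return mapping


def adjust_aligned_topics(topic_a, topic_b, aligned_pairs):
    """Set each aligned pair to the mean of its two probabilities (assumption),
    then renormalize both topics."""
    new_a, new_b = dict(topic_a), dict(topic_b)
    for wa, wb in aligned_pairs:
        if wa in topic_a and wb in topic_b:
            shared = (topic_a[wa] + topic_b[wb]) / 2.0
            new_a[wa] = shared
            new_b[wb] = shared
    return normalize(new_a), normalize(new_b)


# Toy data: two topics per language and a hand-written word alignment.
topics_en = [{"bank": 0.6, "money": 0.4}, {"river": 0.7, "water": 0.3}]
topics_zh = [{"河": 0.5, "水": 0.5}, {"银行": 0.3, "钱": 0.7}]
pairs = [("bank", "银行"), ("money", "钱"), ("river", "河"), ("water", "水")]

mapping = align_topics(topics_en, topics_zh, pairs)
new_en, new_zh = adjust_aligned_topics(topics_en[0], topics_zh[mapping[0]], pairs)
```

In this toy case the English finance topic is paired with the Chinese finance topic, and after adjustment the aligned words carry the same probability in both languages. A real implementation would take the topic-word distributions from trained LDA models and the word pairs from a word aligner.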

Description

Technical field

[0001] The present invention relates to the technical field of natural language processing, and more specifically to a method and device for generating an LDA topic model based on a bilingual parallel corpus.

Background

[0002] The LDA (Latent Dirichlet Allocation) topic model is a statistical model for discovering abstract topics in a document library; it gives the topics of each document in the library in the form of a probability distribution. The basic idea is that a document can contain multiple topics, each of which belongs to the document with a certain probability; each word in the document is generated by one of the topics, and each word belongs to a topic with a certain probability.

[0003] The topic distribution generated by the LDA topic model can be regarded as the semantic representation of the document. The vector representation of the document is generated by using the topic vector and the document topic distributi...
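The generative idea described in the background (a document mixes several topics, and each word is produced by one of them) can be shown with a short Python sketch. It covers only the generative side, not the inference that LDA training performs, and the topic names, vocabularies, and probabilities are made-up illustration values rather than anything taken from the patent.

```python
import random


def generate_document(doc_topic_probs, topic_word_probs, length, seed=0):
    """Sample (topic, word) pairs following the LDA generative story.

    doc_topic_probs: dict topic -> probability of that topic in the document.
    topic_word_probs: dict topic -> dict word -> probability within the topic.
    """
    rng = random.Random(seed)
    topics = list(doc_topic_probs)
    topic_weights = [doc_topic_probs[t] for t in topics]
    sample = []
    for _ in range(length):
        # First pick a topic according to the document's topic distribution.
        topic = rng.choices(topics, weights=topic_weights)[0]
        # Then pick a word according to that topic's word distribution.
        vocab = list(topic_word_probs[topic])
        word_weights = [topic_word_probs[topic][w] for w in vocab]
        word = rng.choices(vocab, weights=word_weights)[0]
        sample.append((topic, word))
    return sample


# Hypothetical toy distributions.
doc_topic_probs = {"finance": 0.7, "nature": 0.3}
topic_word_probs = {
    "finance": {"bank": 0.5, "money": 0.5},
    "nature": {"river": 0.6, "water": 0.4},
}

sample = generate_document(doc_topic_probs, topic_word_probs, length=8)
```

Each element of `sample` records which topic generated which word; the per-topic word probabilities used here are the quantities whose accuracy the invention aims to improve.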

Application Information

IPC(8): G06F16/34, G06F40/289
CPC: G06F16/345
Inventor 毛红保
Owner 传神联合(北京)信息技术有限公司