Method for determining optimal topic number of LDA topic model based on vocabulary similarity

A topic model determination method, applied in the fields of digital data processing, character and pattern recognition, and special data processing applications. Its stated effects are an improved model clustering effect and a solution to the topic-number selection problem.

Active Publication Date: 2019-10-18
WUHAN UNIV
3 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

At present, it is generally believed that the biggest problem of the LDA topic model with Gibbs sampling is that the optimal number of topics cannot be determined. In most cases, the number of topics is set manually based on experience. The number of topics is very important to the iterative process and its results: a poorly chosen value (for example, too few topics) strongly affects the model and leads to accuracy errors in the final document distribution.

Embodiment Construction

[0023] To help those of ordinary skill in the art understand and implement the present invention, it is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.

[0024] Referring to Figure 1, the method for determining the optimal topic number of an LDA topic model based on lexical similarity provided by the present invention comprises the following steps:

[0025] Step 1: Select the initial k value as the initial topic number of the LDA topic model;

[0026] Step 2: Perform document-topic separation, sampling topics until convergence;

[0027] In this embodiment, the text data to be analyzed is first preprocessed: it is segmented into words and stop words are removed. The LDA model is then applied, according to the Gibbs sampling...
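Steps 1 and 2 can be illustrated with a minimal sketch of preprocessing followed by collapsed Gibbs sampling for LDA. This is not the patent's implementation. The function names (`preprocess`, `lda_gibbs`, `top_words`), the stop-word list, the hyperparameters `alpha` and `beta`, and the fixed iteration count used in place of a convergence test are all assumptions made for illustration. The regex tokenizer only handles English; microblog text in Chinese would need a proper word segmenter.

```python
import random
import re

# Hypothetical stop-word list; a real pipeline would use a fuller, language-specific one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "for"}


def preprocess(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Runs a fixed number of sweeps (an assumption standing in for a
    convergence check) and returns (topic_word_counts, vocab).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    doc_topic = [[0] * k for _ in docs]      # tokens per (document, topic)
    topic_word = [[0] * v for _ in range(k)]  # tokens per (topic, word)
    topic_total = [0] * k                     # tokens per topic
    assign = []                               # current topic of every token

    # Step 1: start from a random topic assignment for the chosen k.
    for d, doc in enumerate(docs):
        row = []
        for w in doc:
            t = rng.randrange(k)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][index[w]] += 1
            topic_total[t] += 1
        assign.append(row)

    # Step 2: resample each token's topic from its conditional distribution.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wi, t = index[w], assign[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][wi] -= 1
                topic_total[t] -= 1
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][wi] + beta)
                    / (topic_total[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][wi] += 1
                topic_total[t] += 1
    return topic_word, vocab


def top_words(topic_word, vocab, n=5):
    """Return the n highest-count words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]
```

Calling `lda_gibbs(docs, k)` for a candidate `k` and passing the result to `top_words` yields the per-topic word lists that a later similarity-based evaluation would consume.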

Abstract

The invention discloses a method for determining the optimal topic number of an LDA topic model based on vocabulary similarity. The method extracts topic words with an LDA model and searches for the optimal topic number based on the similarity between word vectors. It comprises the following steps: first, word segmentation and other preprocessing are carried out on the text data, and topic modeling is applied to the text with an LDA topic model to obtain the word distribution under each topic; the word distribution is then converted into a word vector distribution, and topic quality is analyzed and the optimal topic number determined by using the similarity among vectors, based on LDA semantic association. With this method, the optimal topic number can be determined automatically, the limitation of manual setting is avoided, and the clustering analysis of microblog text data is better served.
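The idea of scoring topics by word-vector similarity can be sketched as follows. The excerpt does not state the exact quality measure or selection rule, so this example assumes one plausible choice: each topic is scored by the mean pairwise cosine similarity of its top words, a model is scored by the mean over its topics, and the candidate topic number with the highest score is picked. The names `cosine`, `topic_coherence`, `model_score`, and `best_k`, and the toy two-dimensional vectors, are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_coherence(words, vectors):
    """Mean pairwise cosine similarity among a topic's words that have vectors."""
    vecs = [vectors[w] for w in words if w in vectors]
    pairs = list(combinations(vecs, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def model_score(topics, vectors):
    """Average coherence over all topics of one fitted model."""
    return sum(topic_coherence(t, vectors) for t in topics) / len(topics)


def best_k(candidates, vectors):
    """candidates maps each k to its list of topics (lists of top words).

    Assumed rule: return the k whose topics are most internally similar.
    """
    return max(candidates, key=lambda k: model_score(candidates[k], vectors))


# Toy word vectors; real ones would come from a trained embedding model.
toy_vectors = {
    "cat": [1.0, 0.0],
    "dog": [0.9, 0.1],
    "stock": [0.0, 1.0],
    "bond": [0.1, 0.9],
}
toy_candidates = {
    1: [["cat", "dog", "stock", "bond"]],
    2: [["cat", "dog"], ["stock", "bond"]],
}
chosen = best_k(toy_candidates, toy_vectors)  # selects 2 for this toy data
```

In the toy data, splitting into two topics groups similar words together, so the two-topic candidate scores higher than the single mixed topic.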

Description

Technical field

[0001] The invention belongs to the technical field of natural language processing and relates to a natural language processing model, in particular to a method for determining the optimal topic number of an LDA topic model based on lexical similarity.

Background art

[0002] With the rapid development of the Internet, Weibo, as an open platform for user communication and information dissemination, is becoming more and more popular. Mining user interests and analyzing the behavioral characteristics of user preferences play a very important role in public opinion monitoring, network security management, and commercial value promotion. However, each user browses thousands of microblogs every day, and the massive amount of microblog information makes it harder for users to obtain the information they need, which affects user experience. Accurately obtaining user preferences is the key to proactively pushing content of interest to users on ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F17/27, G06K9/62
CPC: G06F16/355, G06F40/284, G06F40/289, G06F18/22
Inventor: 王中元, 许强, 胡瑞敏, 朱荣
Owner: WUHAN UNIV