Subtopic Discovery Method in Semi-structured Short Text Collection Based on Mutually Constrained Topic Model

A topic model and subtopic discovery method, applicable to unstructured text data retrieval, text database clustering/classification, and character and pattern recognition. It addresses problems such as high noise and high sparsity, and achieves high topic consistency, good accuracy, and clearly separated topics.

Active Publication Date: 2020-05-19
TIANJIN UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a method for discovering subtopics in semi-structured short texts based on a mutually constrained topic model. The method uses semantic modeling to obtain an effective topic structure in short text collections, which solves the problems of high sparsity and high noise faced by existing topic semantic modeling techniques for semi-structured short texts.


Detailed Description of Embodiments

[0034] Embodiments of the present invention will be described in further detail below in conjunction with the accompanying drawings.

[0035] The design idea of the present invention is as follows. When learning the latent semantic representations of short texts and hashtags, the method exploits the topic constraint between a single short text and its hashtags: it introduces into the traditional topic model a generative process in which hashtags and short texts constrain each other. In this way it learns a latent semantic representation that is consistent with both, and the resulting semantic space guarantees the semantic consistency of short texts and hashtags. After the semantic representations of the hashtags and the texts are obtained, the vocabulary of the texts in which a hashtag appears is used to jointly describe the semantics of that hashtag. The subtopics under a given topic are then obtained by clustering the hashtags, and each subtopic is represented by a hashtag cluster.
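The step of describing a hashtag through the vocabulary of the texts in which it appears can be illustrated with a minimal sketch. It is not the patented procedure: the function name `hashtag_lexical_vectors`, the `#word` hashtag pattern, the whitespace-free tokenization, and the sample texts are all assumptions made for illustration, and raw word counts stand in for whatever lexical weighting the invention actually uses.

```python
import re
from collections import Counter, defaultdict

# Assumed hashtag syntax: '#' followed by word characters (hypothetical;
# real microblog platforms may use other conventions such as '#topic#').
HASHTAG = re.compile(r"#(\w+)")


def hashtag_lexical_vectors(texts):
    """Build one word-count vector per hashtag from the texts containing it.

    Returns (vocab, vectors): vocab is a sorted word list, and vectors maps
    each hashtag to a list of counts aligned with vocab.
    """
    counts = defaultdict(Counter)
    for text in texts:
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        # Strip the hashtags so that only ordinary words remain.
        plain = HASHTAG.sub(" ", text)
        words = re.findall(r"\w+", plain.lower())
        for tag in tags:
            counts[tag].update(words)
    vocab = sorted({word for counter in counts.values() for word in counter})
    vectors = {tag: [counter[word] for word in vocab]
               for tag, counter in counts.items()}
    return vocab, vectors


# Hypothetical sample collection of short texts.
sample_texts = [
    "Great goal tonight #football #worldcup",
    "#football transfer news",
    "New phone launch #tech",
]
vocab, vectors = hashtag_lexical_vectors(sample_texts)
for tag, vector in sorted(vectors.items()):
    print(tag, vector)
```

Hashtags that co-occur with similar words end up with similar count vectors, which is the property the later clustering step relies on.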

[0036] A method for disc...


Abstract

The invention relates to a method for discovering subtopics in a semi-structured short text collection based on a mutually constrained topic model. Its main technical features are: cleaning the data of a short text collection containing topic tags (hashtags); selecting the short texts that contain a specified seed topic tag; generating an input file from the cleaned data; feeding the input file into the mutually constrained topic model for training; obtaining, for each topic tag in the collection, its semantic vector representation, the average semantic vector representation of the texts in which it appears, and the lexical vector representation of those texts; concatenating the three vectors in order as the complete semantic representation of the topic tag; and clustering these representations with the K-means method, outputting the centroid of each resulting cluster as a subtopic. The invention is reasonably designed, adopts mutually constrained latent topic modeling, and solves the problems of high sparsity and high noise faced by existing topic semantic modeling techniques for semi-structured short texts.
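The final two steps of the abstract, concatenating the three vectors and clustering with K-means, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the helper names `concat_representation` and `kmeans`, the toy vectors, and the choice of k = 2 are hypothetical, and the plain Lloyd's-algorithm loop stands in for whatever K-means implementation the inventors used.

```python
import math
import random


def concat_representation(tag_vec, avg_text_vec, lexical_vec):
    """Concatenate the three vectors, in order, into one representation."""
    return list(tag_vec) + list(avg_text_vec) + list(lexical_vec)


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical toy vectors: (tag semantics, average text semantics, lexical counts).
features = {
    "football": concat_representation([0.9, 0.1], [0.8, 0.2], [3, 0]),
    "worldcup": concat_representation([0.8, 0.2], [0.9, 0.1], [2, 0]),
    "tech": concat_representation([0.1, 0.9], [0.2, 0.8], [0, 3]),
    "gadgets": concat_representation([0.2, 0.8], [0.1, 0.9], [0, 2]),
}
tags = list(features)
centroids, assignments = kmeans([features[t] for t in tags], k=2)
for j, centroid in enumerate(centroids):
    members = [t for t, a in zip(tags, assignments) if a == j]
    print(f"subtopic {j}: members={members}")
```

Each centroid plays the role of a subtopic, and the tags assigned to it form the cluster that describes the subtopic. In practice the three component vectors may need to be scaled to comparable ranges before concatenation, since raw word counts can otherwise dominate the distance.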

Description

Technical field

[0001] The invention belongs to the technical field of data mining, and in particular relates to a method for discovering subtopics in a semi-structured short text collection based on a mutually constrained topic model.

Background

[0002] Exploring and automatically modeling the topic structure of microblog short texts has become an active research topic, and the technology is important for automated knowledge acquisition. However, microblog short texts are brief, lexically sparse, and irregularly written, so the data is highly sparse and noisy. This makes it difficult for traditional topic models (such as LDA and PLSA) to directly model the topic semantics of such texts. In response, researchers have used data augmentation methods to convert short texts into long texts for modeling. The typical technical solutions are as follows: gather s...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F40/30, G06K9/62
CPC: G06F16/35, G06F40/30, G06F18/23213
Inventor: 王嫄星辰杨巨成
Owner: TIANJIN UNIV OF SCI & TECH