Subtopic discovery method for semi-structure short text set based on mutual-constraint topic model

A topic model and subtopic discovery method, applied in unstructured text data retrieval, text database clustering/classification, and character and pattern recognition. It addresses the high sparsity and high noise of short-text data and yields subtopics that are clear, accurate, and highly consistent.

Active Publication Date: 2017-12-08
TIANJIN UNIV OF SCI & TECH
5 Cites, 20 Cited by

AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to overcome the deficiencies of the prior art by proposing a method for discovering subtopics in semi-structured short texts based on a mutual constraint topic model. The method uses mutually constrained latent topic modeling to obtain an effective topic structure in short text collections, which solves the problems of high sparsity and high noise faced by existing topic semantic modeling techniques for semi-structured short texts.

Examples

Embodiment Construction

[0034] The embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.

[0035] The design idea of the present invention is to exploit the topic constraint relationship between each short text and its topic tags when learning their latent semantic representations. The generative process of topic tags and the constraints between tags and short texts are introduced into a traditional topic model, so that both are represented in a shared latent semantic space. This space ensures the semantic consistency of short texts and topic tags. Once the semantic representations of the topic tags and the texts are obtained, the vocabulary of the texts in which a topic tag appears is used together with them to describe the semantics of that tag. Subtopics under a topic are obtained by clustering the topic tags; each subtopic is represented by a cluster of topic tags.
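The final clustering step described in this paragraph can be illustrated with a minimal sketch. It assumes the three per-tag vectors (tag semantics, average text semantics, text vocabulary) have already been produced by a trained model; the toy tags, the vector values, and the helper names `concat_representation`, `farthest_first_centroids`, and `kmeans` are all hypothetical. The deterministic farthest-first initialization is a choice made here so the example is reproducible, not something the source specifies.

```python
def concat_representation(tag_vec, mean_text_vec, lexical_vec):
    """Join the three per-tag vectors, in a fixed order, into one feature vector."""
    return list(tag_vec) + list(mean_text_vec) + list(lexical_vec)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centroids(points, k):
    """Deterministic init: start at the first point, then repeatedly add the
    point farthest from all centroids chosen so far (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iterations=50):
    """Plain Lloyd's k-means; returns (centroids, cluster label per point)."""
    centroids = farthest_first_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical toy input: (tag vector, average text vector, lexical vector) per tag.
tags = {
    "#worldcup": ([0.9, 0.1], [0.8, 0.2], [1.0, 0.0, 0.0]),
    "#football": ([0.8, 0.2], [0.9, 0.1], [1.0, 0.0, 0.0]),
    "#election": ([0.1, 0.9], [0.2, 0.8], [0.0, 1.0, 1.0]),
    "#vote": ([0.2, 0.8], [0.1, 0.9], [0.0, 1.0, 1.0]),
}
names = list(tags)
features = [concat_representation(*tags[name]) for name in names]
centroids, labels = kmeans(features, k=2)

# Group tag names by cluster; each group stands for one subtopic.
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

Each centroid lives in the concatenated space, so it can be reported as the subtopic while the member tags of its cluster serve as a readable label for it.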

[0036] A method of discovering sub-topics in a semi-structure...

Abstract

The invention relates to a subtopic discovery method for a semi-structured short text set based on a mutual constraint topic model. The main technical features are as follows: cleaning a short text set that contains topic tags; for a given topic, extracting the short texts that contain the specified seed topic tags; generating an input file from the cleaned data; feeding the input file to the mutual constraint topic model for training; obtaining, for each topic tag in the set, a semantic vector representation of the tag, an average semantic vector representation of the texts in which the tag appears, and a lexical vector representation of those texts; concatenating the three vectors in order as the complete semantic representation of the topic tag; and clustering with the K-means method, outputting the centroid of each resulting cluster as a subtopic. The design is sound: mutually constrained latent topic modeling is adopted to solve the problems of high sparsity and high noise in traditional topic semantic modeling of semi-structured short texts.
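The first three steps of the abstract (cleaning, seed-tag extraction, and input generation) might look roughly like the sketch below. The source does not give the cleaning rules, the tag syntax, or the input file format, so everything here is an assumption: tags are taken to be written as `#tag`, cleaning is limited to removing URLs and user mentions, and the "input file" is represented as in-memory `(tags, words)` records. The function names `clean` and `build_input` are hypothetical.

```python
import re

HASHTAG = re.compile(r"#(\w+)")       # assumed tag syntax: "#tag"
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
WORD = re.compile(r"\w+")


def clean(text):
    """Strip URLs and user mentions, lowercase, and collapse whitespace."""
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    return " ".join(text.lower().split())


def build_input(posts, seed_tags):
    """Keep posts carrying at least one seed tag; return (tags, words) records."""
    seeds = {tag.lower() for tag in seed_tags}
    records = []
    for post in posts:
        cleaned = clean(post)
        tags = HASHTAG.findall(cleaned)
        if not seeds.intersection(tags):
            continue  # post does not belong to the topic of interest
        # Words are the remaining tokens once the tags themselves are removed.
        words = WORD.findall(HASHTAG.sub(" ", cleaned))
        records.append((tags, words))
    return records


# Hypothetical sample posts.
posts = [
    "Great match tonight! #WorldCup #Football http://t.co/abc",
    "@bob did you vote yet? #Election",
    "Penalty drama again #worldcup #VAR",
]
records = build_input(posts, seed_tags=["worldcup"])
for tags, words in records:
    print(tags, words)
```

Keeping the tags and the words of each post as separate fields mirrors the pairing of topic tags with the text they appear in, which is the relationship the model is described as constraining.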

Description

Technical field

[0001] The invention belongs to the field of data mining technology, and in particular relates to a method for discovering subtopics in a semi-structured short text set based on a mutual constraint topic model.

Background technique

[0002] The exploration and automatic modeling of the topic structure of short microblog texts has increasingly become a hot research topic. This technology is very important for automatic information and knowledge acquisition. However, because Weibo short texts are short, sparse in vocabulary, and irregularly written, the data suffers from serious sparsity and noise, making it difficult for traditional topic models (such as LDA and PLSA) to model Weibo short texts directly and obtain the topic semantic information they contain. In response to these problems, researchers adopted data expansion methods that transform short texts into long texts for modeling. The typical technical solutions...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27, G06K9/62
CPC: G06F16/35, G06F40/30, G06F18/23213
Inventor: 王嫄, 星辰, 杨巨成
Owner: TIANJIN UNIV OF SCI & TECH