A short text topic extraction method based on word vector enhancement

A short text and word vector technology, applied in the fields of instruments, computing, electrical digital data processing, etc., which achieves the effect of improved universality.

Active Publication Date: 2018-12-25
WUHAN UNIV
6 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

Design a new topic model that uses word vectors to enhance topic modeling while distinguishing differences in word meaning, so as to overcome the noise introduced by polysemous words.



Embodiment Construction

[0036] 1. The efficiency of the proposed method can be verified by experimental comparison against benchmark topic models. The dataset used in the experiments consists of the news descriptions of 31,150 English news articles extracted from the RSS feeds of three popular newspaper websites (New York Times, nyt.com; USA Today, usatoday.com; Reuters, reuters.com), chosen because they are typical short texts. The news categories are: Sports, Business, USA, Health, Technology, World and Entertainment. To ensure the accuracy of the experiments, the following preprocessing was performed:

[0037] 1. Average minimum distance based on word vectors: the present invention uses word vectors to measure the distance between short texts and proposes an average minimum distance based on word vectors, which can serve as a general short text distance measure without being affected by shor...
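As a rough illustration of the idea in [0037], the sketch below computes one plausible form of an average minimum distance between two short texts: for each word in one text, take the smallest distance to any word in the other text, average these minima, and symmetrize the result. The exact definition is truncated in the text above, so the use of cosine distance, the symmetrization, and the names `cosine_distance`, `average_minimum_distance` and `embeddings` are all assumptions made for illustration, not the patented formula.

```python
import math

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_minimum_distance(text_a, text_b, embeddings):
    """Assumed form: mean of per-word minimum distances, averaged over both directions.

    text_a, text_b: lists of tokens; embeddings: dict mapping token -> vector.
    Tokens without a vector are skipped (an assumption, not from the source).
    """
    vecs_a = [embeddings[w] for w in text_a if w in embeddings]
    vecs_b = [embeddings[w] for w in text_b if w in embeddings]
    if not vecs_a or not vecs_b:
        return float("inf")  # no comparable words
    a_to_b = sum(min(cosine_distance(u, v) for v in vecs_b) for u in vecs_a) / len(vecs_a)
    b_to_a = sum(min(cosine_distance(u, v) for u in vecs_a) for v in vecs_b) / len(vecs_b)
    return (a_to_b + b_to_a) / 2.0

# Toy 2-dimensional vectors, invented purely for the example.
embeddings = {
    "football": [1.0, 0.0],
    "soccer": [0.9, 0.1],
    "stocks": [0.0, 1.0],
    "market": [0.1, 0.9],
}

near = average_minimum_distance(["football"], ["soccer"], embeddings)
far = average_minimum_distance(["football"], ["stocks"], embeddings)
print(near < far)
```

Because each word is matched only to its closest counterpart, a measure of this kind does not require the two texts to share any literal words, which is the property that makes word vectors attractive for sparse short texts.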


Abstract

The invention relates to a short text topic extraction method based on word vector enhancement, and in particular to a new short text topic extraction model called CRFTM (Conditional Random Field Regularized Topic Model). First, the invention designs a general short text distance measure, an average minimum distance based on word vectors, and uses it to alleviate the sparsity problem by aggregating short texts into pseudo-documents. Second, CRFTM uses a conditional random field (CRF) regularization model to strengthen the link between semantically related words, so that they are assigned to the same topic with higher probability. Experimental results on the news dataset show that the short text topic extraction method of the invention outperforms five benchmark topic models in terms of topic coherence.
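To make the second idea more concrete, the following sketch shows how a pairwise bonus can raise the probability that a word receives the same topic as its semantically related words. It is only an illustration under stated assumptions: the function `regularized_topic_probs`, the weight `lam`, and the exponential form of the bonus are hypothetical and are not taken from the patent, whose formula is not given in this summary.

```python
import math

def regularized_topic_probs(base_probs, related_topics, lam=1.0):
    """Reweight a word's topic distribution using the topics of related words.

    base_probs: per-topic probabilities from some base topic model (assumed given).
    related_topics: topic ids currently assigned to semantically related words.
    lam: hypothetical regularization strength; lam = 0 leaves base_probs unchanged.
    """
    counts = [0] * len(base_probs)
    for k in related_topics:
        counts[k] += 1
    scale = max(len(related_topics), 1)  # normalize the bonus by the number of related words
    weights = [p * math.exp(lam * c / scale) for p, c in zip(base_probs, counts)]
    total = sum(weights)
    return [w / total for w in weights]

# Two topics with equal base probability; two of three related words sit in topic 0.
probs = regularized_topic_probs([0.5, 0.5], [0, 0, 1], lam=2.0)
print(probs[0] > probs[1])
```

With `lam` set to zero the bonus vanishes and the base distribution is returned unchanged, which makes the effect of the regularization easy to isolate in experiments.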

Description

Technical Field

[0001] The invention belongs to the technical field of short text topic extraction algorithms. It is a new short text topic extraction method based on word vector enhancement, which combines the advantages of distributed word representations and semantic enhancement based on conditional random fields.

Background

[0002] With the rise of social networks, short texts have become the main carrier of information on the Internet. For example, web page titles and the main content of Weibo, Zhihu, Facebook and other websites are all presented as short text. A topic model is a probabilistic statistical model used to discover abstract "topics" in document collections; it can help ordinary users mine valuable information from massive short text data through simple topics or keywords. This has important practical significance for reducing the reading burden of users and improving the reading ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F40/216
Inventor: 彭敏, 高望, 胡刚, 谢倩倩, 李冬
Owner: WUHAN UNIV