
Microblog topic clustering method based on word vector and single-pass fusion

The method applies clustering and word vector technology to microblog topic clustering. It addresses the large dimensionality and high computing overhead of earlier approaches, improving the clustering effect while reducing the computing cost.

Pending Publication Date: 2020-09-22
深兰人工智能应用研究院(山东)有限公司

AI Technical Summary

Problems solved by technology

Sometimes the number of clusters is not known in advance, and the algorithm is expected to arrive at a reasonable number by itself; it is often difficult to estimate and supply the k value at the outset.
[0011] 2. The randomly chosen k center points affect the result.
The disadvantage of the above technology is that the traditional single-pass incremental clustering algorithm usually relies on term frequency (TF) statistics and introduces the inverse document frequency (IDF) to compute feature weights (TF-IDF) for a vector space representation. That representation has a large dimension and a high computing cost. It cannot resolve the semantic ambiguity of natural language and ignores the semantic information carried by word order. It also ignores the influence of context, which degrades the recall and precision of information retrieval.
[0016] The traditional single-pass clustering algorithm computes semantic similarity from the space vectors of feature words, which easily leads to problems such as excessive data dimensionality and a lack of contextual semantics. The invention proposes combining Wikipedia-based Word2vec word embeddings with the single-pass algorithm for clustering.
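
The dimensionality problem of the TF-IDF representation can be illustrated with a minimal sketch. It assumes one common TF-IDF variant (relative term frequency multiplied by log(N / df)); the patent text does not state which variant it has in mind, and the function name `tfidf_vectors` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF weight per vocabulary word.

    Assumes every document is a non-empty list of tokens.
    """
    vocab = sorted({w for doc in docs for w in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times log-scaled inverse document frequency.
        vec = [(counts[w] / len(doc)) * math.log(n_docs / df[w]) for w in vocab]
        vectors.append(vec)
    return vocab, vectors

# Toy tokenized posts (illustrative only).
docs = [
    ["flood", "rescue", "team"],
    ["flood", "warning", "issued"],
    ["concert", "tickets", "sold"],
]
vocab, vectors = tfidf_vectors(docs)
nonzero = sum(1 for x in vectors[0] if x != 0)
print(len(vocab), len(vectors[0]), nonzero)  # 8 8 3
```

Each vector has one entry per vocabulary word, so its length grows with the corpus vocabulary while a short post fills only a few entries. A dense word vector, by contrast, has a fixed dimension that does not depend on vocabulary size.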

Examples

Embodiment 1

[0049] The LDA topic model is sensitive to topics in longer texts such as news reports and discovers them well. Microblog texts, however, are short, contain more noise and other irrelevant information, and have few feature words. Therefore, this paper improves the single-pass algorithm, uses the improved algorithm to cluster the microblog texts into topic clusters, and finally uses the LDA topic model to discover the topics of the texts in each cluster. The implementation process is as follows: preprocess the acquired microblog data and construct a vocabulary database; perform Word2vec word vector mapping on the feature words; cluster the microblog texts using single-pass fused with Word2vec word vectors; use the LDA topic model to perform topic discovery on the resulting clusters.
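
The clustering step can be sketched as follows. This is only an illustration under stated assumptions, not the patented procedure: the text shown here does not say how word vectors are combined into a text vector or how cluster centers are updated, so the sketch assumes a plain average of word vectors, cosine similarity, a running-mean centroid, and a hypothetical threshold of 0.8. The two-dimensional `TOY_VECTORS` stand in for trained Word2vec vectors.

```python
import math

# Hypothetical 2-dimensional word vectors; a real system would load trained
# Word2vec vectors instead.
TOY_VECTORS = {
    "flood": [0.9, 0.1],
    "rain": [0.8, 0.2],
    "rescue": [0.85, 0.15],
    "concert": [0.1, 0.9],
    "singer": [0.2, 0.8],
    "tickets": [0.15, 0.85],
}

def text_vector(words, vectors):
    """Average the vectors of known words; return None if no word is known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass(texts, vectors, threshold=0.8):
    """Assign each text, in arrival order, to the most similar cluster,
    or open a new cluster when no similarity reaches the threshold."""
    clusters = []  # each cluster: {"centroid": [...], "members": [text indices]}
    for idx, words in enumerate(texts):
        vec = text_vector(words, vectors)
        if vec is None:
            continue  # no known words, so the text cannot be placed
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            # Running mean keeps the centroid equal to the average member vector.
            best["centroid"] = [(c * n + v) / (n + 1)
                                for c, v in zip(best["centroid"], vec)]
            best["members"].append(idx)
        else:
            clusters.append({"centroid": vec, "members": [idx]})
    return clusters

texts = [
    ["flood", "rain"],
    ["concert", "singer"],
    ["rescue", "flood"],
    ["tickets", "concert"],
]
clusters = single_pass(texts, TOY_VECTORS)
print([c["members"] for c in clusters])  # [[0, 2], [1, 3]]
```

Because each text is seen once and the number of clusters is never fixed in advance, the sketch needs no k value; the threshold alone decides when a new cluster opens. The texts gathered in each cluster would then be passed to an LDA topic model for topic discovery.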

[0050] 1. Filter noise data

[0051] Data noise mainly includes advertisements, emoticons, special characters, pi...
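
A minimal sketch of such noise filtering follows. The patterns are assumptions for illustration, not the rules used by the method: links are stripped with a URL pattern, emoticons are assumed to appear as short bracketed codes such as [sad], and every remaining character that is neither a word character nor whitespace counts as a special character. Advertisement detection is not covered.

```python
import re

# Hypothetical patterns; a production filter would be tuned on real data.
URL_RE = re.compile(r"https?://\S+")
EMOTICON_RE = re.compile(r"\[[^\[\]]{1,10}\]")  # bracketed emoticon codes
SPECIAL_RE = re.compile(r"[^\w\s]")             # punctuation and symbols

def clean_post(text):
    """Remove links, bracketed emoticon codes and special characters."""
    text = URL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())

print(clean_post("Heavy rain tonight!! [sad] details: http://example.com/abc #weather"))
# Heavy rain tonight details weather
```

In Python 3, `\w` matches Unicode word characters, so Chinese text passes through the special-character filter unchanged.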


Abstract

The invention discloses a microblog topic clustering method based on word vector and single-pass fusion. The method comprises the steps of: preprocessing the obtained microblog data and constructing a word list library; carrying out Word2vec word vector mapping on the feature words; clustering the microblog texts with a single-pass algorithm fused with Word2vec word vectors; and performing topic discovery on the resulting clusters with an LDA topic model. Because the single-pass incremental clustering algorithm is fused with Word2vec word vectors, the method can effectively mine the deep semantic information of the text and avoids an excessively high VSM dimension that would slow down computer processing. It also effectively solves the problem that the traditional TF-IDF statistical method ignores the distribution of feature words among classes and their distribution across the documents within a class.

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and in particular relates to a microblog topic clustering method based on word vector and single-pass fusion.

Background technique

[0002] With the rapid development of network technology and the widespread adoption of the mobile Internet, traditional news media such as newspapers, TV and magazines can no longer meet audiences' need for information. New electronic media are attracting more and more netizens. As an emerging platform among these media, Weibo is favored by a growing number of users because of its flexibility and convenience. As the number of users keeps growing, the amount of microblog data also increases day by day. At the same time, because microblogs and other emerging electronic media lack supervision, sensitive information such as false news, violence, reactionary content and terrorism spreads unchecked on the Internet, which is harmful to the h...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, G06F40/284, G06K9/62
CPC: G06F16/35, G06F40/284, G06F18/2132, G06F18/25
Inventor: 陈海波
Owner: 深兰人工智能应用研究院(山东)有限公司