Time window based LDA microblog topic trend detection method and apparatus

A time-window-based detection method, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the problems of scattered information, the lack of a strong trend indicator for hot events, and results that hinder hot-topic modeling and analysis. It improves topic accuracy and is of practical value.

Inactive Publication Date: 2016-02-17
TIANJIN UNIV

AI Technical Summary

Problems solved by technology

However, the number of topics obtained at this point is still large and the information is relatively scattered, which hinders the modeling and analysis of hot topics. There is also no strong indicator of how a hot event develops, so researchers cannot analyze hot events according to how they evolve over time.



Examples


Embodiment 1

[0052] A time-window-based LDA microblog topic trend detection method (see Figure 1) includes the following steps:

[0053] 101: Obtain a microblog data set through a web crawler;

[0054] For example: build a crawler for Sina Weibo, crawl the Weibo content posted over a certain period, and retain information such as publication time, author, title, and text content. This step is well known to those skilled in the art and is not described in detail in this embodiment of the present invention.

[0055] 102: Preprocess the microblog data set, including word segmentation and stop-word removal, to obtain a word set;

[0056] Specifically: segment the obtained microblog data set with an existing Chinese lexical analysis system; then filter out stop words using the "HIT stop word list", keeping only the nouns and verbs in the segmentation results. The embodiment of the present i...
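The filtering part of step 102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the segmenter has already produced (word, part-of-speech) pairs, that noun tags start with "n" and verb tags start with "v", and that the stop word list has been loaded into a set. The function name `filter_tokens` is hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def filter_tokens(tagged: Iterable[Tuple[str, str]], stop_words: Set[str]) -> List[str]:
    """Keep only nouns and verbs that are not stop words.

    Assumption: POS tags beginning with "n" mark nouns and tags
    beginning with "v" mark verbs; adjust to the tag set actually used.
    """
    kept: List[str] = []
    for word, pos in tagged:
        if word in stop_words:
            continue  # drop stop words regardless of their tag
        if pos.startswith(("n", "v")):
            kept.append(word)  # retain nouns and verbs only
    return kept
```

The resulting word list for each microblog is what later steps would feed into the LDA topic model.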

Embodiment 2

[0067] The scheme of Embodiment 1 is described in detail below with reference to specific calculation formulas, examples, and Figure 1:

[0068] 201: Build a crawler for Sina Weibo, crawl the Weibo content posted over a certain period, and retain information such as publication time, author, title, and text content;

[0069] 202: Segment the text with ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), the Chinese lexical analysis system developed by the Institute of Computing Technology, Chinese Academy of Sciences, through the API provided with ICTCLAS5.0. Add special words, such as emotional words and internet slang, to the tokenizer as a user dictionary to obtain better segmentation results.

[0070] 203: Filter stop words from the word segmentation results;

[0071] That is, remove words with no real meaning and high...

Embodiment 3

[0097] The feasibility of the schemes in Embodiments 1 and 2 is verified below with a concrete example and with reference to Figures 2 and 3:

[0098] A web crawler collected 25,495 Weibo posts published on Sina Weibo from September to October 2011, retaining information such as publication time, author, title, and text content. The posts were preprocessed with Chinese word segmentation and stop-word removal. The global time span was then divided into 4 time windows, as shown in Table 1, and a total of 150 topics are extracted using the LDA topic model in each time window. After similarity calculation on the topic results, K-means clustering was performed with the number of clusters set to 2, so the clustering result is 2 hot topics. Checking back against the document data identified topic 1 as the "child trafficking" incident and topic 2 as the "Tiangong-1" incident. For example figure 2...
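Two of the steps in this experiment, dividing the global time span into windows and comparing topics before clustering, can be sketched as follows. The sketch rests on assumptions that the visible text does not confirm: the windows are taken to be of equal width (Table 1 is not reproduced here), each topic is represented as a word-to-weight mapping, and cosine similarity stands in for the unspecified similarity measure. The names `split_into_windows` and `cosine_similarity` are hypothetical.

```python
import math
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def split_into_windows(
    posts: Iterable[Tuple[datetime, str]],
    start: datetime,
    end: datetime,
    n_windows: int,
) -> List[List[str]]:
    """Assign each (timestamp, text) post to one of n equal-width windows."""
    width = (end - start) / n_windows  # a timedelta; equal width is an assumption
    windows: List[List[str]] = [[] for _ in range(n_windows)]
    for ts, text in posts:
        if not (start <= ts < end):
            continue  # skip posts outside the global time span
        idx = min(int((ts - start) / width), n_windows - 1)
        windows[idx].append(text)
    return windows


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two topics given as word -> weight maps."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty topic is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

With the September to October 2011 data, `split_into_windows(posts, datetime(2011, 9, 1), datetime(2011, 11, 1), 4)` would give four lists of posts, each of which could be passed to an LDA implementation. The pairwise similarities between the extracted topics could then be handed to a K-means implementation with the number of clusters set to 2.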


Abstract

The invention discloses a time-window-based LDA microblog topic trend detection method and apparatus. The method comprises: extracting topic words from a word set with an LDA model in each time window to obtain global topics; performing similarity computation on the global topics and then K-means clustering to obtain hot topics suitable for public opinion analysis; for each hot topic, extracting its feature words in each time window in sequence through the LDA topic model; and, based on the feature words, computing a popularity value of the hot topic in each time window and drawing a trend graph of the hot topic. The apparatus comprises a first acquisition module, a second acquisition module, an extraction module, and a drawing module. The detection method and apparatus improve the precision of microblog topic detection, make the trend index more expressive, and provide a more accurate basis for analyzing hot topic trends.

Description

Technical field

[0001] The invention belongs to the fields of data mining, natural language processing, and information retrieval. It relates specifically to short text processing, topic detection and tracking, and network public opinion analysis, and in particular to a time-window-based LDA microblog topic trend detection method and apparatus.

Background art

[0002] Topic Detection and Tracking (TDT) technology was initiated by the U.S. Defense Advanced Research Projects Agency (DARPA) and the National Institute of Standards and Technology (NIST) to develop a series of time-based information organization technologies that help people deal with the problem of information overload. Research on TDT started relatively early abroad, and leading universities such as CMU and Cambridge, as well as IBM, have achieved good results in TDT evaluations. TDT topic detection technology has since been put into practice; an event detecti...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
Inventor: 侯德俊, 尚鸿运, 喻梅, 缑小路, 胡悦, 高玥
Owner: TIANJIN UNIV