Chinese microblog topic detection method and system based on semanteme, time and social relation

A microblog topic detection technology based on semantics, time, and social relations, applied in the field of natural language processing and information retrieval. It addresses the problems that short text, polysemy, and the curse of dimensionality cause for topic detection results, with the effect of shortening microblog search time and improving user experience.

Pending Publication Date: 2019-11-22
BEIJING UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0006] The technical problem to be solved by the present invention is how to overcome the curse of dimensionality, sparse features, polysemy, and similar problems, and the resulting inaccurate topic detection, caused by the messy and short text content of Weibo.

Examples

Embodiment 1

[0034] This embodiment provides a Chinese microblog topic detection method based on semantics, time and social relations, used to identify and acquire Chinese microblog topics. As shown in Figure 1, the method includes the following steps:

[0035] S1: Preprocessing of Weibo data: remove invalid information, useless characters, and stop words from the text of the existing Weibo data set, and construct the input of the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers); that is, preprocess the Weibo data into a text word set.

[0036] The microblog data is stored in a MySQL database. The text of each microblog is used as the subsequent input and is processed as a string. A tool separates the words in each microblog with spaces, and each microblog is stored in a list. Stop words are then removed from the space-separated microblogs: after the stop word list is read, each microblog is judged in turn ...
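The preprocessing in step S1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes the posts have already been segmented into space-separated words by some tool, and the stop-word set, the regular expressions, and the function name `preprocess` are all hypothetical choices made for the example.

```python
import re

# Hypothetical stop-word set; the source reads a stop word list that it does not show.
STOP_WORDS = {"的", "了", "是", "在"}

# Assumed definition of "useless characters": URLs, @mentions, and anything that
# is not a CJK character, ASCII letter, digit, or whitespace.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\S+")
NOISE_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]")


def preprocess(segmented_posts, stop_words=STOP_WORDS):
    """Turn space-separated microblog strings into lists of kept words."""
    cleaned = []
    for post in segmented_posts:
        text = URL_RE.sub(" ", post)       # strip links first, before punctuation
        text = MENTION_RE.sub(" ", text)   # strip @mentions
        text = NOISE_RE.sub(" ", text)     # strip punctuation and other symbols
        words = [w for w in text.split() if w not in stop_words]
        if words:                          # drop posts that end up empty
            cleaned.append(words)
    return cleaned


# Each post is one string whose words are already separated by spaces.
posts = ["今天 的 天气 真 好 ！ http://t.cn/abc", "@某人 了"]
word_sets = preprocess(posts)
```

Each inner list in `word_sets` is the word set for one microblog. Posts that contain nothing but noise and stop words are dropped here, which is a design choice of this sketch rather than something the source states.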

Embodiment 2

[0060] The present invention proposes a Chinese microblog topic detection system based on semantics, time and social relations. As shown in Figure 2, it includes three modules:

[0061] Data preprocessing module: removes invalid information, useless characters, and stop words from the text of the existing microblog data set, and constructs the input of the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers); that is, it preprocesses the microblog data into a text word set.

[0062] Text representation learning module: the invention uses a pre-trained model to learn semantic representations of short Chinese microblog texts. The preprocessed microblog text word set is used to pre-train the BERT model; through BERT's MLM (Masked Language Model) training mechanism, a vector representation of the microblog text with rich semantic information is obtained.
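The MLM training mechanism mentioned above can be illustrated with a small sketch of the masking step. It follows the scheme from the original BERT paper (select about 15% of tokens; of those, replace 80% with `[MASK]`, 10% with a random token, and leave 10% unchanged). Whether the patent uses exactly these ratios is not stated in the source, and the function name `mask_tokens` and the toy vocabulary are assumptions for illustration only.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (inputs, labels) for masked language model training.

    labels[i] holds the original token where a prediction is required,
    and None where the position is not part of the training loss.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                     # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)                # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))   # 10%: replace with a random token
            else:
                inputs.append(tok)                 # 10%: keep the original token
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


# Toy word set and vocabulary (hypothetical data).
tokens = ["今天", "天气", "真", "好"]
vocab = ["今天", "天气", "真", "好", "微博", "话题"]
masked_inputs, masked_labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

During pre-training the model sees `masked_inputs` and is trained to predict the original token at every position where `masked_labels` is not `None`.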

[0063] Topic detection module: use the proposed tex...

Abstract

The invention provides a Chinese microblog topic detection method and system based on semantics, time and social relations. It addresses the poor topic detection results that microblog data produces because of short text, colloquial language, polysemy, and similar defects. The method comprises: collecting microblog data on related topics over a certain time interval; pre-training the language model BERT (Bidirectional Encoder Representations from Transformers) on the collected microblog data; vectorizing the microblog text with the pre-trained BERT model to obtain microblog semantic representations based on context semantics; and applying a proposed text clustering algorithm that jointly considers the time factor and the forwarding relationship between microblogs, which overcomes the limitation that traditional microblog topic detection considers only text semantic similarity. The invention is mainly used for microblog search tasks, where the topic detection results for related microblogs are used to improve the microblog search hit rate.
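The idea of clustering on semantics, time, and forwarding together can be sketched as below. The source does not give the actual algorithm or formula, so everything here is an assumption made for illustration: the weighted-sum score, the weights `alpha`, `beta`, `gamma`, the time scale `tau`, the threshold, and the use of single-pass clustering. The vectors stand in for BERT text representations.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    post_id: str
    vector: List[float]                   # semantic vector, e.g. from BERT
    timestamp: float                      # posting time in hours
    forwarded_from: Optional[str] = None  # id of the post this one forwards


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity(a, b, alpha=0.6, beta=0.2, gamma=0.2, tau=24.0):
    """Hypothetical combined score: semantics + time closeness + forwarding."""
    time_score = math.exp(-abs(a.timestamp - b.timestamp) / tau)
    forwards = a.forwarded_from == b.post_id or b.forwarded_from == a.post_id
    return alpha * cosine(a.vector, b.vector) + beta * time_score + gamma * float(forwards)


def single_pass_cluster(posts, threshold=0.7):
    """Assign each post to the best-matching cluster, or start a new one."""
    clusters = []
    for post in posts:
        best, best_score = None, threshold
        for cluster in clusters:
            score = max(similarity(post, member) for member in cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([post])
        else:
            best.append(post)
    return clusters


posts = [
    Post("a", [1.0, 0.0], 0.0),
    Post("b", [0.9, 0.1], 1.0),
    Post("c", [0.6, 0.8], 2.0, forwarded_from="a"),  # joins via forwarding and time
    Post("d", [0.6, 0.8], 200.0),                    # same text as c, but much later
]
clusters = single_pass_cluster(posts)
topic_ids = [[p.post_id for p in cluster] for cluster in clusters]
```

In this toy data, posts `c` and `d` have identical vectors, yet `c` is grouped with `a` and `b` because it forwards `a` and was posted close in time, while `d` starts a new topic. A purely semantic clustering would have treated `c` and `d` the same way.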

Description

Technical field

[0001] The invention belongs to the field of natural language processing and information retrieval, relates to topic detection and tracking technology, and is mainly aimed at topic detection for Chinese microblog data.

[0002] Technical background

[0003] In recent years, with the widespread adoption and rapid development of network technology, information spreads across the network at unprecedented speed and in unprecedented volume. As a new form of social media, Weibo has gradually become an important source of information. Because microblog content is very short and can be published from a wide range of terminals, the platform generates a large amount of microblog data in a short period of time. If this huge and disorganized body of Weibo content is processed only by hand, it will not only greatly increase the workload, but also it is difficult...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/35
CPC: G06F16/353, G06F16/355
Inventor: 杜军平, 薛哲, 程鹏超, 寇菲菲
Owner: BEIJING UNIV OF POSTS & TELECOMM