Chinese microblog topic detection method and system based on semanteme, time and social relation

A microblog topic detection technology based on social relations, applied in the field of natural language processing and information retrieval. It addresses problems such as short text, polysemy, and the curse of dimensionality that degrade topic detection results, with the effects of speeding up microblog search, shortening search time, and improving user experience.

Pending Publication Date: 2019-11-22

BEIJING UNIV OF POSTS & TELECOMM

1 Cites, 6 Cited by

Problems solved by technology

[0006] The technical problem to be solved by the present invention is how to overcome the curse of dimensionality, sparse features, polysemy and similar problems, and the resulting inaccurate topic detection, that arise because Weibo text is messy and short.

Examples

Embodiment 1

[0034] This embodiment provides a Chinese microblog topic detection method based on semantics, time and social relations, used to identify and acquire Chinese microblog topics. As shown in Figure 1, the method includes the following steps:

[0035] S1: Weibo data preprocessing: remove invalid information, useless characters and stop words from the text of the existing Weibo data set, and construct the input of the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers); that is, preprocess the Weibo data into a text word set

[0036] The microblog data is stored in a MySQL database. The text of each microblog is used as subsequent input and is processed as a string. A word segmentation tool separates the words in each microblog with spaces, and each microblog is stored in a list. Stop words are then removed from the space-separated microblogs: after the stop-word list is read, each microblog is checked in turn ...
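The preprocessing described in S1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes the posts have already been segmented into space-separated words (the segmentation tool is not named in the text), and the stop-word set, the regular expressions, and the names `clean_text` and `preprocess` are all hypothetical.

```python
import re

# Hypothetical stop-word list; the patent reads one from a file that is not shown.
STOP_WORDS = {"的", "了", "是", "在"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\S+")
# Keep CJK characters, ASCII letters, digits and whitespace; replace everything else.
NOISE_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9\s]")


def clean_text(text):
    """Strip URLs, @mentions and punctuation, then collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NOISE_RE.sub(" ", text)
    return " ".join(text.split())


def preprocess(segmented_posts, stop_words=STOP_WORDS):
    """Return one token list per post, with noise and stop words removed."""
    result = []
    for post in segmented_posts:
        tokens = clean_text(post).split()
        result.append([t for t in tokens if t not in stop_words])
    return result


# Assumed input: posts whose words are already separated by spaces.
posts = ["今天 的 天气 真 好 ！ http://t.cn/abc", "@小明 我 在 北京 看 了 电影 #电影#"]
word_sets = preprocess(posts)
```

Here `word_sets` holds one list of retained words per microblog, which corresponds to the "text word set" passed on to the language model.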

Embodiment 2

[0060] The present invention proposes a Chinese microblog topic detection system based on semantics, time and social relations, which includes three modules, as shown in Figure 2:

[0061] Data preprocessing module: remove invalid information, useless characters and stop words from the text of the existing microblog data set, and construct the input of the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers); that is, preprocess the microblog data into a text word set.

[0062] Text representation learning module: the invention uses a pre-trained model to learn the semantic representation of short Chinese microblog texts. The preprocessed microblog text word set is used to pre-train the BERT model; through BERT's MLM (Masked Language Model) training mechanism, vector representations of microblog text that carry rich semantic information are obtained.

[0063] Topic detection module: use the proposed tex...
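The MLM training mechanism named in the text representation learning module can be illustrated with a small masking routine. This is a sketch under stated assumptions: the 15% selection rate and the 80/10/10 split between `[MASK]`, a random token and the unchanged token come from the original BERT paper, not from this patent, and the function name `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Build MLM inputs and labels for one token sequence.

    Returns (inputs, labels). labels[i] is the original token at the
    positions the model must predict, and None everywhere else.
    """
    rng = rng or random.Random()
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


# Toy example; the vocabulary and sentence are made up for illustration.
vocab = ["今天", "天气", "北京", "电影", "好"]
tokens = ["今天", "天气", "真", "好"]
inputs, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

During pre-training the model sees `inputs` and is trained to predict the non-`None` entries of `labels`, which is how it learns context-dependent representations of each word.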

Abstract

The invention provides a Chinese microblog topic detection method and system based on semantics, time and social relations, which addresses the poor topic detection performance on microblog data caused by short text, colloquial language, polysemy and similar defects. The method comprises: collecting microblog data on related topics over a given time interval; pre-training the language model BERT (Bidirectional Encoder Representations from Transformers) on the collected microblog data; producing vector representations of the microblog text with the pre-trained BERT model, thereby obtaining microblog semantic representations based on context semantics; and applying a text clustering algorithm that jointly considers a time factor and the forwarding relationship between microblogs, which overcomes the limitation that traditional microblog topic detection considers only text semantic similarity. The invention is mainly used for microblog search tasks, where the topic detection results of related microblogs are used to improve the search hit rate.
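The idea of clustering on semantics, time and forwarding relationships together can be sketched as below. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the exponential time decay, the rule that a direct forward counts as maximal similarity, the single-pass clustering strategy, the threshold, and all names (`Post`, `similarity`, `single_pass_cluster`, `tau`).

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Post:
    post_id: str
    vector: List[float]              # semantic vector, e.g. produced by BERT
    timestamp: float                 # hours since an arbitrary reference point
    repost_of: Optional[str] = None  # id of the forwarded post, if any


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def similarity(p, q, tau=24.0):
    """Hypothetical combined score: semantics damped by the time gap.

    A direct forwarding link is treated as maximal similarity (an assumption).
    """
    if p.repost_of == q.post_id or q.repost_of == p.post_id:
        return 1.0
    decay = math.exp(-abs(p.timestamp - q.timestamp) / tau)
    return cosine(p.vector, q.vector) * decay


def single_pass_cluster(posts, threshold=0.6):
    """Assign each post to its best-matching cluster, or start a new one."""
    clusters = []
    for post in posts:
        best, best_score = None, threshold
        for cluster in clusters:
            score = max(similarity(post, member) for member in cluster)
            if score >= best_score:
                best, best_score = cluster, score
        if best is None:
            clusters.append([post])
        else:
            best.append(post)
    return clusters


posts = [
    Post("a", [1.0, 0.0], 0.0),
    Post("b", [0.9, 0.1], 1.0),
    Post("c", [0.0, 1.0], 2.0, repost_of="a"),  # different wording, but forwards "a"
    Post("d", [1.0, 0.0], 200.0),               # same wording, far apart in time
]
clusters = single_pass_cluster(posts)
```

In this toy run, post `c` joins the topic of `a` through the forwarding link despite having a dissimilar vector, while `d` starts a new topic because the time gap suppresses its otherwise identical semantics. That is the behavior text-only similarity cannot produce.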

Description

Technical field

[0001] The invention belongs to the field of natural language processing and information retrieval, relates to topic detection and tracking technology, and is mainly aimed at topic detection for Chinese microblog data.

[0002] Technical background

[0003] In recent years, with the widespread adoption and rapid development of network technology, the speed of information dissemination and the volume of information on the network have become unprecedentedly large. As a new form of social media, Weibo has gradually become an important source of information. Because microblog posts are very short and can be published from various terminals, a large amount of microblog data is generated on the platform in a short period of time. Processing this huge, disorganized content manually would not only greatly increase the workload, but it is also difficult...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/35

CPC: G06F16/353, G06F16/355

Inventors: 杜军平, 薛哲, 程鹏超, 寇菲菲

Owner: BEIJING UNIV OF POSTS & TELECOMM
