Short-text data stream classification method based on short-text expansion and concept drift detection

A classification method with concept drift detection, applied to the classification of short-text data streams. It addresses problems such as concept drift in short-text data streams and the resulting difficulty of obtaining classification results.

Active Publication Date: 2018-02-09
HEFEI UNIV OF TECH

AI Technical Summary

Problems solved by technology

Due to the phenomenon of concept drift often occurring in short text data streams, existing data stream



Embodiment Construction

[0069] In this example, as shown in Figure 1, a short-text data stream classification method based on topic models and concept drift detection is carried out as follows:

[0070] Step 1: Extract keywords according to the class label distribution of the short-text data stream, retrieve an external corpus C' from the Wikipedia knowledge base, and then construct the LDA topic model M from the external corpus C':

[0071] Step 1.1: Given a short-text data stream $D=\{d_1, d_2, \ldots, d_m, \ldots, d_{|D|}\}$, $m=1, 2, \ldots, |D|$, where $|D|$ is the total number of short texts in $D$ and $d_m$ denotes the $m$-th short text, with $d_m=\{W_m, y_m\}$. Here $W_m$ and $y_m$ are, respectively, the word set and the class label of the $m$-th short text $d_m$, and $y_m \in Y$, where $Y=\{y_1, y_2, \ldots, y_x, \ldots, y_X\}$, $x=1, 2, \ldots, X$, is the set of class labels and $y_x$ denotes the $x$-th class label in $Y$, ...
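The notation in Step 1.1 can be mirrored in a small data structure. The following is a minimal sketch, not the patent's implementation: the names `ShortText`, `label_set`, and `top_words_per_label` are hypothetical, and ranking words by raw frequency within each class is only an assumed stand-in for the keyword extraction of Step 1, whose exact procedure is not given in this excerpt.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ShortText:
    """One short text d_m = {W_m, y_m} (hypothetical container)."""
    words: Tuple[str, ...]  # W_m: the words of the m-th short text
    label: str              # y_m: its class label, an element of Y


def label_set(stream: Sequence[ShortText]) -> List[str]:
    """Return the class label set Y observed in the stream D, sorted."""
    return sorted({d.label for d in stream})


def top_words_per_label(stream: Sequence[ShortText], k: int) -> Dict[str, List[str]]:
    """Assumed keyword heuristic: the k most frequent words per class label."""
    counts: Dict[str, Counter] = {}
    for d in stream:
        counts.setdefault(d.label, Counter()).update(d.words)
    return {label: [w for w, _ in c.most_common(k)] for label, c in counts.items()}


# A toy stream D with |D| = 3 short texts and two class labels.
stream = [
    ShortText(("goal", "match", "goal"), "sports"),
    ShortText(("stock", "market"), "finance"),
    ShortText(("goal", "team"), "sports"),
]
```

Under this assumed heuristic, the per-class keywords would then serve as queries for retrieving the external corpus C' from the knowledge base.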


Abstract

The invention discloses a short-text data stream classification method based on topic models and concept drift detection. The method comprises: (1) acquiring an external corpus from a knowledge base to construct an LDA topic model; (2) dividing the short-text data stream into data blocks using a sliding-window mechanism, and using the LDA topic model to expand the short texts in each data block, yielding an expanded data stream; (3) constructing an online BTM topic model for each data block in the expanded short-text data stream, and obtaining a topic representation of each short text; (4) selecting Q topic-represented data blocks to build a classifier, which is used to predict the class labels of a newly arrived data block; (5) dividing the Q topic-represented data blocks into category clusters according to the class label distribution, and calculating the semantic distances between the category clusters and the newly arrived data block to judge whether concept drift has occurred; and (6) updating the classifier according to the concept drift outcome. The method can be applied to short-text data stream classification problems in which the class label distribution changes continually.
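Steps 2 and 5 of the abstract can be illustrated with a short sketch. It rests on several assumptions that the abstract does not confirm: blocks are taken as fixed-size, non-overlapping windows; each category cluster is summarized by the centroid of its topic vectors; the "semantic distance" is taken to be cosine distance; and drift is flagged when the nearest cluster is farther than a hypothetical `threshold`. All function names are illustrative.

```python
import math
from collections import defaultdict


def split_into_blocks(stream, block_size):
    """Cut the stream into consecutive blocks (assumed non-overlapping windows)."""
    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length topic vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def class_clusters(labelled_vectors):
    """Group (topic_vector, label) pairs by label; return one centroid per label."""
    groups = defaultdict(list)
    for vector, label in labelled_vectors:
        groups[label].append(vector)
    return {label: centroid(vs) for label, vs in groups.items()}


def drift_detected(clusters, new_block_vectors, threshold):
    """Assumed rule: drift if even the nearest cluster centroid is too far away."""
    new_centroid = centroid(new_block_vectors)
    nearest = min(cosine_distance(c, new_centroid) for c in clusters.values())
    return nearest > threshold


# Toy topic representations over three topics for two known classes.
history = [([1.0, 0.0, 0.0], "a"), ([0.0, 1.0, 0.0], "b")]
clusters = class_clusters(history)
```

With these toy clusters, a new block concentrated on the third topic is far from both centroids and would be flagged, while a block close to class "a" would not. In a full pipeline the flag would decide whether the classifier is rebuilt or left unchanged, as step 6 describes.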

Description

Technical field

[0001] The invention belongs to the field of text data stream mining in practical applications, and in particular relates to the classification of continually changing short-text data streams.

Background

[0002] With the rapid development of instant messaging and Internet technology, network users and network servers generate large volumes of short-text data streams, including Sina Weibo posts, online comments, and instant messages. These short-text data are of considerable value to research institutes, government departments, and Internet service providers. A short-text data stream has three characteristics: 1. Each short text is brief and carries little information, which leads to severe data sparsity; 2. The huge amount of data generated in a short period of time easily leads to a severe curse of dimensionality; 3. The topics of the texts may drift over time. Based on these three characteristics, traditional sho...


Application Information

IPC(8): G06F17/30
Inventors: 胡学钢, 王海燕, 李培培
Owner: HEFEI UNIV OF TECH