Online short text data stream classification method based on feature extension

A classification method for short text data streams, applicable to text database clustering/classification, text database querying, and unstructured text data retrieval. It addresses three problems: traditional text classification techniques are ineffective on short texts, existing models handle continuous data poorly, and such models do not achieve good performance.

Active Publication Date: 2020-04-17
HEFEI UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0003] Challenge 1: Because short texts are high-dimensional and sparse, traditional text classification techniques are difficult to apply effectively. Current solutions take one of two approaches. The first expands short texts with external corpora and then applies a traditional classifier such as Naive Bayes, support vector machine (SVM), or decision tree. The second expands short texts with their own latent statistical information before classification, using methods such as LDA and KNN.
However, these models depend heavily on the completeness of the external corpus, which makes them poorly portable and unstable.
[0004] Challenge 2: Because continuous data is massive and unbounded, traditional deep learning frameworks that iterate multiple times over static data sets (such as Text-CNN and RNN) cannot handle continuous data well, and the resulting models do not achieve good performance.
[0005] Challenge 3: Short text streams change dynamically. Because their network layers form a static framework, current mainstream deep learning frameworks cannot adapt quickly to changing data streams, so traditional neural network models cannot handle the dynamics of short text streams well.


Examples


Embodiment Construction

[0075] In this embodiment, as shown in Figure 2, an online short text data stream classification method based on feature extension is carried out as follows:

[0076] Step 1: Build the Word2vec model based on the external corpus, and obtain the word vector set Vec:

[0077] Step 1.1: Using a sliding window mechanism, divide the given short text data stream Stream = {d_1, d_2, ..., d_e, ..., d_E} by time into T data blocks, denoted D = {D_1, D_2, ..., D_t, ..., D_T}, where d_e denotes the e-th short text in Stream and D_t denotes the data block at time t in Stream; e = 1, 2, ..., E and t = 1, 2, ..., T;
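
The block division in Step 1.1 can be sketched as follows. This is a minimal illustration, not the patented procedure: it assumes that blocks are formed from a fixed number of consecutive texts in arrival order, and the names `split_into_blocks` and `block_size` are hypothetical. The source states only that the stream is divided by time.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_blocks(stream: Iterable[str], block_size: int) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping blocks D_1, D_2, ... from a text stream.

    Assumption: a block is block_size texts in arrival order; the final
    block may be shorter if the stream ends early.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    iterator = iter(stream)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


# Toy stream of E = 7 short texts, named after the d_e notation above.
stream = [f"d_{e}" for e in range(1, 8)]
blocks = list(split_into_blocks(stream, block_size=3))
```

Because the function is a generator, it consumes the stream lazily and never needs the whole stream in memory, which matters when the stream is unbounded.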

[0078] Step 1.2: Obtain from the knowledge base the external text corpus for the short text data stream Stream, denoted C' = {d'_1, d'_2, ..., d'_m, ..., d'_M}, m = 1, 2, ..., M, where M is the total number of texts in the external corpus C', d'_m is the m-th text, and has q=1,2,....
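
To make Step 1 more concrete, the sketch below shows the kind of training data a Word2vec model derives from an external corpus. It assumes the skip-gram variant and whitespace tokenization, neither of which the source specifies; the function `skipgram_pairs`, the `window` parameter, and the two toy corpus texts are hypothetical. It produces only the (center, context) pairs and does not train the word vector set Vec.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical stand-in for the external corpus C' = {d'_1, ..., d'_M}.
external_corpus = [
    "stock markets fall sharply",
    "team wins the final match",
]

training_pairs = [
    pair
    for text in external_corpus
    for pair in skipgram_pairs(text.split(), window=1)
]
```

A real pipeline would pass pairs like these (or the tokenized texts themselves) to a Word2vec trainer to learn one vector per vocabulary word.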


Abstract

The invention discloses an online short text data stream classification method based on feature extension, which comprises the following steps: 1, constructing a Word2vec model from an external corpus and obtaining a word vector set Vec; 2, vectorizing the short text data stream using Vec and performing text vectorization extension based on a CNN model; 3, constructing an online deep learning network for the extended text vectors; 4, introducing a concept drift semaphore into the neurons of the LSTM network and detecting distribution changes in the short text stream; and 5, completing the model update of the online deep learning network and the prediction of the short text data stream. The method can effectively improve the classification accuracy of short text data streams, correctly detect concept drift, and adjust the model accordingly, so that it adapts rapidly to the short text data stream environment.
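
The idea of detecting a distribution change between data blocks (step 4) can be illustrated with a deliberately simple stand-in. The sketch below is not the concept drift semaphore inside LSTM neurons described by the invention: it swaps in a plain cosine distance between the term-frequency vectors of consecutive blocks, and the function names and the `threshold` value of 0.5 are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def term_distribution(block: List[str]) -> Counter:
    """Count word occurrences across all short texts in one data block."""
    counts: Counter = Counter()
    for text in block:
        counts.update(text.lower().split())
    return counts


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two term-frequency vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty block as maximally different
    return 1.0 - dot / (norm_a * norm_b)


def detect_drift(blocks: List[List[str]], threshold: float = 0.5) -> List[bool]:
    """Flag each block whose word distribution differs from the previous block."""
    flags = []
    previous = None
    for block in blocks:
        current = term_distribution(block)
        drifted = previous is not None and cosine_distance(previous, current) > threshold
        flags.append(drifted)
        previous = current
    return flags


blocks = [
    ["football match tonight", "great football goal"],
    ["football goal replay", "match tonight football"],
    ["stock prices drop", "bank raises rates"],
]
flags = detect_drift(blocks)
```

In this toy stream the first two blocks share a sports vocabulary and the third switches to finance, so only the third block is flagged. In an online setting, a flag like this would be the point at which the model is adjusted rather than merely updated.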

Description

Technical field

[0001] The invention belongs to the field of short text data stream mining and online deep learning in practical applications, and in particular relates to the classification of constantly changing, fast, and unbounded short text data streams.

Background art

[0002] With the rise of information technologies such as mobile development and microservice frameworks, a massive, high-speed, and dynamic kind of data, the data stream, has emerged in practical application areas such as social networking, online shopping, and sensor networks. In the social field, owing to the popularity of social network media and forums, very short texts flood into our lives, such as Weibo posts, tweets, Facebook posts, and other user comments and interactions on forums. Short texts carry a great deal of information in fields such as sports, education, and science. Compared with ordinary texts, short texts are sparse, real-time, massive, irregular, and dynamic, which leads to topic evolution. ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/33, G06F16/35, G06F40/30, G06K9/62, G06N3/04, G06N3/08
CPC: G06F16/3344, G06F16/35, G06N3/084, G06N3/044, G06N3/045, G06F18/241
Inventor: Li Peipei (李培培), Hu Yang (胡阳), Hu Xuegang (胡学钢)
Owner: HEFEI UNIV OF TECH