Barrage text clustering method based on feature extension and T-oBTM

A text clustering technology, applied in text database clustering/classification, unstructured text data retrieval, special data processing applications, etc. It addresses low algorithm efficiency, long model processing time, and overly complex topic-word pair and topic distributions.

Pending Publication Date: 2020-04-24
HEBEI UNIV OF ENG

AI Technical Summary

Problems solved by technology

[0012] Because the corpus is huge and word pairs are extracted directly, many noisy word pairs are retained. This makes the topic-word pair distribution and the topic distribution complex, which lengthens model processing time and lowers algorithm efficiency.


Examples


Embodiment 1

[0047] The present invention proposes a barrage text clustering method based on feature expansion and T-oBTM. It comprises three stages: network neologism processing, topic modeling, and text clustering. The specific method is as follows:

[0048] The first stage is network neologism processing, which includes text preprocessing. In this stage, a new word recognition algorithm based on weight-optimized mutual information and left and right information entropy is used to find network neologisms in the barrage text, and the neologisms are added to the word segmentation lexicon. An external knowledge base is then used to obtain content related to each neologism, feature words related to the neologism are obtained by analyzing that content, and the corpus is obtained by using these feature words to expand the text features. The specific method of the network neologism processing stage is: Us...

Embodiment 2

[0052] The following document is analyzed as a case (only part of the text is shown):

[0053]

[0054] 1. Obtain one or more bullet chat texts from the video data, then display the bullet chat data set;

[0055] 2. Use the new word recognition algorithm based on weight-optimized mutual information and left and right information entropy to find the top 8 new words in the barrage text set, and update the word segmentation lexicon;

[0056] 1. Mutual information scores of the word strings:

[0057] Format: 'second-order co-occurrence words': (mutual information calculation results, word frequency)

[0058]

[0059]
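
The mutual information step can be illustrated with a minimal sketch. It assumes the barrage texts are already tokenized and uses plain pointwise mutual information; the weight optimization that the method applies is not shown in this excerpt and is not reproduced here. The function name `bigram_mutual_information` is hypothetical, and keys are `(left, right)` tuples rather than concatenated strings so that distinct pairs cannot collide.

```python
import math
from collections import Counter


def bigram_mutual_information(sentences):
    """Return {(left, right): (pmi, frequency)} for adjacent token pairs.

    Assumption: plain pointwise mutual information in bits; the
    weight-optimized variant from the source is not reproduced.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (left, right), freq in bigrams.items():
        p_pair = freq / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        scores[(left, right)] = (math.log2(p_pair / (p_left * p_right)), freq)
    return scores


if __name__ == "__main__":
    demo = [["a", "b"], ["a", "b"], ["a", "c"]]
    for pair, (pmi, freq) in bigram_mutual_information(demo).items():
        print(pair, round(pmi, 3), freq)
```

Each value has the same shape as the format described above: a mutual information result paired with the word frequency.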

[0060] 2. Left and right information entropy scores of the word strings:

[0061] Format: 'second-order co-occurrence words': left (right) information entropy

[0062]
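
Left and right information entropy measure how varied the tokens next to a candidate string are; a string with many different neighbors on both sides is more likely to be a free-standing word. The sketch below is one possible reading of this step, assuming tokenized input and Shannon entropy in bits. The function name `neighbor_entropy` is hypothetical.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(sentences):
    """Return {(left, right): (left_entropy, right_entropy)} for adjacent pairs."""
    left_neighbors = defaultdict(Counter)
    right_neighbors = defaultdict(Counter)
    pairs = set()

    for tokens in sentences:
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            pairs.add(pair)
            if i > 0:  # token immediately before the pair
                left_neighbors[pair][tokens[i - 1]] += 1
            if i + 2 < len(tokens):  # token immediately after the pair
                right_neighbors[pair][tokens[i + 2]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0  # no observed neighbors on this side
        return sum((c / total) * math.log2(total / c) for c in counter.values())

    return {
        pair: (entropy(left_neighbors[pair]), entropy(right_neighbors[pair]))
        for pair in pairs
    }


if __name__ == "__main__":
    demo = [["x", "a", "b", "y"], ["z", "a", "b", "y"]]
    print(neighbor_entropy(demo)[("a", "b")])
```

In the demo, the pair `("a", "b")` has two different left neighbors and a single repeated right neighbor, so its left entropy is higher than its right entropy.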

[0063] 3. Word string scores: the top 8 word strings are displayed. The results show that the higher the score, the greater the probability that the word string is a more c...
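
Ranking the candidates can be sketched as follows. The scoring formula is not given in this excerpt, so the weighted sum of mutual information and the smaller of the two entropies is purely an illustrative assumption, as are the function name `top_k_candidates` and the parameter `alpha`. The inputs are assumed to be dictionaries of the form `{pair: (mutual_information, frequency)}` and `{pair: (left_entropy, right_entropy)}`.

```python
import heapq


def top_k_candidates(mi_scores, entropy_scores, k=8, alpha=0.5):
    """Rank candidate word strings by a combined score and keep the top k.

    Assumption: score = alpha * MI + (1 - alpha) * min(left_H, right_H).
    This formula is a stand-in, not the one used in the source.
    """
    combined = {}
    for pair, (mi, _freq) in mi_scores.items():
        left_h, right_h = entropy_scores.get(pair, (0.0, 0.0))
        combined[pair] = alpha * mi + (1 - alpha) * min(left_h, right_h)
    return heapq.nlargest(k, combined.items(), key=lambda item: item[1])


if __name__ == "__main__":
    mi = {("a", "b"): (2.0, 5), ("c", "d"): (1.0, 3), ("e", "f"): (3.0, 1)}
    ent = {("a", "b"): (1.0, 2.0), ("c", "d"): (0.0, 1.0), ("e", "f"): (0.5, 0.5)}
    for pair, score in top_k_candidates(mi, ent, k=2):
        print(pair, score)
```

Taking the minimum of the two entropies penalizes strings that are free on only one side; other combinations are equally plausible.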



Abstract

The invention provides a barrage text clustering method based on feature extension and T-oBTM. The method comprises three stages: network new word processing, topic modeling, and text clustering. It proposes T-oBTM, an oBTM streaming short text clustering method that applies a threshold constraint to word pairs according to the characteristics of bullet screens, which shortens algorithm execution time. Network new words are recognized and processed in order to expand the text features, which in turn improves algorithm precision. Recognizing and processing network new words also enriches the word segmentation lexicon and improves word segmentation precision. When network new words are processed, recognized entity nouns are handled differently from sentiment, viewpoint, and opinion words; this expands the short text features and improves clustering precision.
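
The idea of constraining word pairs with a threshold can be illustrated with a minimal sketch. It assumes the threshold is a minimum corpus frequency for each word pair (biterm); the abstract does not say what quantity is thresholded, so this choice, along with the names `extract_biterms`, `filter_biterms`, and `min_count`, is hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """All unordered pairs of distinct words in one short text."""
    return [
        tuple(sorted(pair))
        for pair in combinations(tokens, 2)
        if pair[0] != pair[1]
    ]


def filter_biterms(documents, min_count=2):
    """Keep only word pairs that occur at least min_count times in the corpus.

    Assumption: a frequency threshold stands in for the threshold
    constraint described in the abstract.
    """
    per_doc = [extract_biterms(doc) for doc in documents]
    counts = Counter(b for biterms in per_doc for b in biterms)
    return [[b for b in biterms if counts[b] >= min_count] for biterms in per_doc]


if __name__ == "__main__":
    docs = [["funny", "cat", "video"], ["funny", "cat"], ["sad", "ending"]]
    print(filter_biterms(docs, min_count=2))
```

Dropping rare pairs before topic modeling reduces the number of word pairs the model has to process, which is the efficiency gain the abstract attributes to the threshold constraint.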

Description

Technical field

[0001] The invention relates to the technical field of multimedia processing, in particular to a barrage text clustering method based on feature expansion and T-oBTM.

Background technique

[0002] Bullet chat refers to comments that can be sent to the screen while a video is playing and that instantly express the user's views and emotions. Research on the information hidden in bullet chat is therefore of great value; it helps with discovering video user topics and related work. Compared with other types of comments, bullet chat text is very short, contains many Internet neologisms, is highly immediate, and changes rapidly; it is a form of streaming short text. Because of these characteristics, research on barrage text faces the difficulties of scarce semantic information and high-dimensional sparsity.

[0003] Bullet screens are sent by users in real time, and the content is mostly subjective emotion, so the research on bu...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/35, H04N21/235, H04N21/435
CPC: G06F16/355, H04N21/235, H04N21/435
Inventor: 吴迪, 黄竹韵, 生龙, 张梦甜, 杨瑞欣, 孙雷
Owner: HEBEI UNIV OF ENG