A short text box clustering method, system, device and storage medium

A clustering method for short text, applied in text database clustering/classification, unstructured text data retrieval, instruments, etc. It addresses problems such as poor practicability, with the effects of reducing complexity, improving accuracy, and avoiding high vector dimensionality.

Active Publication Date: 2022-05-27
HARBIN INST OF TECH AT WEIHAI

AI Technical Summary

Problems solved by technology

In the existing method, the accuracy of the clustering results depends heavily on the parameter settings, so its practicability is limited.



Examples


Embodiment 1

[0092] A short text box clustering method, as shown in Figure 2, comprising the steps:

[0093] (1) Data preprocessing is performed on the extracted original short text to obtain the word segmentation of the short text;

[0094] (2) Extract the feature words of each short text;

[0095] (3) Convert the feature words of the short text into feature word vectors;

[0096] (4) Initialize the cluster center first, and then use the locality-sensitive hash algorithm to map the cluster center to the LSH table;

[0097] (5) According to the text similarity between the short text and the cluster center, select several candidate classes; the number of candidate classes is set manually, generally 3 to 5, and may vary with the specific situation;

[0098] (6) Calculate the hash value of each short text feature vector in each candidate class, and find the nearest neighbor of the short text feature vector from the LSH table, and select the cluste...
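Steps (4) to (6) and the surrounding loop can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the random-hyperplane hash family, cosine similarity (standing in for the WMD-IP distance the patent names, which this excerpt does not define), the rule of preferring candidates whose center shares a bucket with the text, and the function names `lsh_key`, `assign` and `lsh_cluster`. The source text for step (6) is truncated, so this is not the patented procedure itself.

```python
import math
import random

def cosine(a, b):
    # Cosine similarity; a stand-in for the patent's WMD-IP similarity.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def lsh_key(vec, planes):
    # One bit per random hyperplane: which side of the plane the vector is on.
    return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes)

def build_table(centers, planes):
    # Step (4): map each cluster center into the LSH table.
    table = {}
    for idx, center in enumerate(centers):
        table.setdefault(lsh_key(center, planes), []).append(idx)
    return table

def assign(vec, centers, table, planes, n_candidates):
    # Step (5): keep the n_candidates most similar centers as candidate classes.
    ranked = sorted(range(len(centers)),
                    key=lambda i: cosine(vec, centers[i]), reverse=True)
    candidates = ranked[:n_candidates]
    # Step (6), assumed rule: prefer a candidate in the same hash bucket,
    # otherwise fall back to the most similar candidate.
    bucket = set(table.get(lsh_key(vec, planes), []))
    same_bucket = [i for i in candidates if i in bucket]
    return (same_bucket or candidates)[0]

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def lsh_cluster(vectors, k, n_candidates=3, n_planes=4, max_iter=50, seed=0):
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = []
    for _ in range(max_iter):
        table = build_table(centers, planes)
        labels = [assign(v, centers, table, planes, n_candidates) for v in vectors]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            new_centers.append(mean(members) if members else centers[c])
        if new_centers == centers:  # stop once the centers no longer change
            break
        centers = new_centers
    return labels, centers

if __name__ == "__main__":
    data = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
    print(lsh_cluster(data, k=2, n_candidates=2))
```

The `max_iter` bound is an added safeguard for the case where assignments oscillate and the centers never settle.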

Embodiment 2

[0102] This embodiment follows the short text box clustering method provided in Embodiment 1, with the following differences:

[0103] In step (1), data preprocessing is performed on the extracted original short text, as shown in Figure 1, specifically:

[0104] 1) Data cleaning: remove spelling mistakes, acronyms, colloquial expressions, irregular grammatical expressions, emoticons, garbled characters, links and useless symbols from the original short text; useless symbols include "@", "#" and "[]";

[0105] Data cleaning is performed on the data set to reduce data noise, achieve format standardization and remove duplicate data.

[0106] 2) Perform text segmentation on the cleaned short text: English text is split directly on spaces; Chinese text is segmented with the jieba tokenizer;

[0107] 3) Carry out stop word processing: By establishing a stop word dictionary, the text segmentation result is matched with the words in the stop word dictionary. If the ...
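The three preprocessing stages above can be sketched for English text as follows. The regular expressions, the tiny `STOP_WORDS` set and the function names are illustrative assumptions, and the sketch assumes that words matching the stop word dictionary are dropped, since the source sentence describing that step is truncated. It covers only links and the listed symbols, not spelling correction or emoticon removal. For Chinese text, `segment` would instead call the jieba tokenizer (for example `jieba.lcut`), which is left out here to keep the example dependency-free.

```python
import re

# Hypothetical stop word dictionary; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and"}

URL_RE = re.compile(r"https?://\S+")  # links
SYMBOL_RE = re.compile(r"[@#\[\]]")   # the "useless symbols" listed in the text

def clean(text):
    # 1) Data cleaning: strip links and useless symbols, normalize whitespace.
    text = URL_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

def segment(text):
    # 2) Segmentation: English text is split on whitespace.
    return text.split()

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # 3) Stop word processing: drop tokens found in the dictionary (assumed).
    return [t for t in tokens if t.lower() not in stop_words]

def preprocess(text):
    return remove_stop_words(segment(clean(text)))

if __name__ == "__main__":
    print(preprocess("Check this link http://example.com @user #topic is great"))
```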

Embodiment 3

[0150] A system implementing the short text box clustering method provided in Embodiment 1 or 2, as shown in Figure 3, comprising:

[0151] The data collection module is used to collect short text data from the social networking website platform, and then store the collected short text data into the database;

[0152] The data preprocessing module is used to preprocess the short text data collected by the data collection module to obtain the short text word segmentation result;

[0153] The feature word extraction module is used to extract the feature words of each short text;

[0154] The word vector conversion module is used to convert short text feature words into short text feature vectors;

[0155] The text clustering module is used to perform text clustering on short text feature vectors, store the text clustering results in the database, and display the short text data clustering results on the front-end interface.
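One way to picture how the five modules fit together is a small pipeline object. This is a sketch under stated assumptions: the class name `ClusteringSystem`, the use of an in-memory list in place of the database, and the callables passed in are all hypothetical, and the front-end display is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ClusteringSystem:
    # Each stage is injected as a callable so modules can be swapped freely.
    preprocess: Callable[[str], List[str]]
    extract_features: Callable[[List[str]], List[str]]
    to_vector: Callable[[List[str]], List[float]]
    cluster: Callable[[List[List[float]]], List[int]]
    database: List[dict] = field(default_factory=list)  # stand-in for a real DB

    def collect(self, texts):
        # Data collection module: store the raw short texts.
        for text in texts:
            self.database.append({"text": text, "cluster": None})

    def run(self):
        # Preprocessing -> feature words -> feature vectors -> clustering.
        vectors = [
            self.to_vector(self.extract_features(self.preprocess(row["text"])))
            for row in self.database
        ]
        labels = self.cluster(vectors)
        # Store the clustering result alongside each text.
        for row, label in zip(self.database, labels):
            row["cluster"] = label
        return self.database

if __name__ == "__main__":
    system = ClusteringSystem(
        preprocess=str.split,
        extract_features=lambda tokens: tokens,
        to_vector=lambda tokens: [float(len(tokens))],
        cluster=lambda vecs: [0 if v[0] < 3 else 1 for v in vecs],
    )
    system.collect(["a b", "a b c d"])
    print(system.run())
```

The toy callables in the usage block only demonstrate the data flow; they are not meaningful feature extractors or clustering routines.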


Abstract

The present invention relates to a short text box clustering method, system, device and storage medium. The method comprises: preprocessing the original short text; extracting short text feature words; converting the short text feature words into short text feature vectors; initializing the cluster centers and mapping them to the LSH table; selecting several candidate classes according to the text similarity between each short text and the cluster centers; selecting the cluster set of each short text feature vector according to its hash value; recalculating the new cluster center of each cluster set; and repeating the loop until the cluster centers no longer change, then outputting the text clustering result. The invention uses the WMD-IP distance as the text similarity and takes the position of the word vector into account, so that the semantic information of words is used more fully, the complexity of the intermediate calculation is reduced, and the accuracy of the short text clustering result is improved.

Description

Technical field

[0001] The invention relates to a short text box clustering method, system, device and storage medium, belonging to the field of machine learning and pattern recognition.

Background

[0002] With the growing popularity of mobile Internet devices and the rapid development of online social media platforms, social media software such as Sina Weibo, Zhihu, WeChat, Douyin, Twitter, Tieba and forums is increasingly used in people's daily lives and attracts hundreds of millions of Internet users. Every day these users generate and spread massive amounts of text data through such software. Each text has a small number of characters, its characteristics change over time, and it carries a large amount of information. Processing, clustering and analyzing this short text data therefore has important research significance and application value.

[0003] At present, the commonly used ...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35; G06F40/216; G06F40/289; G06F40/30
CPC: G06F16/35; G06F40/289; G06F40/216; G06F40/30
Inventor: 王超俊, 何清刚, 魏玉良, 王凯, 王佰玲
Owner: HARBIN INST OF TECH AT WEIHAI