A Short Text Clustering Method Based on Deep Semantic Path Search

A path search and clustering technology applied to text database clustering/classification, unstructured text data retrieval, and special data processing applications. It addresses problems such as semantic interference and achieves high clustering accuracy.

Active Publication Date: 2019-07-16
SICHUAN XW BANK CO LTD

AI Technical Summary

Problems solved by technology

[0008] To address the above technical problems, the present invention provides a short text clustering method based on deep semantic path search. It aims to solve the problem that individual noise words seriously interfere with the semantic analysis of an entire short text.



Examples


Embodiment Construction

[0049] All the features disclosed in this specification can be combined in any way, except for mutually exclusive features and/or steps.

[0050] The present invention will be described in detail below in conjunction with the accompanying drawings.

[0051] A short text clustering method based on deep semantic path search, comprising the following steps:

[0052] Step 1: Preprocess the general corpus to obtain the vocabulary corresponding to the corpus.

[0053] The preprocessing method is as follows: apply case conversion and word segmentation to the sentences in the corpus; select the words that appear more than N times in the corpus; use these words as the vocabulary corresponding to the corpus. Here N denotes the word frequency threshold.
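The preprocessing in Step 1 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `build_vocabulary`, the parameter `min_count` (standing in for N), and the use of whitespace splitting as the word segmentation step are all assumptions. A real Chinese corpus would need a proper word segmenter.

```python
from collections import Counter


def build_vocabulary(sentences, min_count):
    """Return the set of words occurring more than `min_count` times.

    `min_count` plays the role of the frequency threshold N.
    """
    counts = Counter()
    for sentence in sentences:
        # Case conversion, then a stand-in segmentation by whitespace
        # (assumption: the source does not name a specific segmenter).
        counts.update(sentence.lower().split())
    # Keep only words whose frequency is strictly greater than the threshold.
    return {word for word, freq in counts.items() if freq > min_count}


corpus = ["I like apples", "I like pears", "You like apples"]
vocab = build_vocabulary(corpus, min_count=1)
print(sorted(vocab))
```

With the toy corpus above, only words that occur at least twice survive the threshold, so rare words such as "pears" and "you" are dropped from the vocabulary.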

[0054] Step 2: Establish the real number vector (embedding) of each word using word2vec with its hyperparameters, as follows:

[0055] Step S301: mapping the word into a K-dimensional real number vector, and using t...


Abstract

The invention belongs to the field of vectorized representation of text features, and discloses a short text clustering method based on deep semantic path search. The method comprises the following steps: preprocessing a general corpus to obtain the vocabulary corresponding to the corpus; establishing a real number vector for each word in the vocabulary; preprocessing the short texts and using the processed short texts to train an LSTM (Long Short-Term Memory) sequence model to obtain an optimized LSTM model; searching the ordered subsequence combinations of the word sequence in each short text, using the optimized LSTM model to calculate the probability of each subsequence combination, and using that probability to select the optimal semantic path of the short text; calculating the cosine similarity between the optimal semantic paths of the short texts to obtain the similarity between short texts; and using the similarity as the clustering parameter to cluster the short texts and obtain the final clustering result. The method effectively solves the problem that an individual noise word affects the semantics of a whole short text.
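The path search and similarity steps described in the abstract might look roughly like the sketch below. Several parts are assumptions rather than the patent's method: `toy_score` is a hypothetical stand-in for the probability that the trained LSTM would assign to a subsequence, the exhaustive enumeration of subsequences is a brute-force choice that is only practical for short texts, and averaging word vectors in `path_vector` is one assumed way to turn a path into a single vector for the cosine similarity. The toy weights and embeddings are invented for illustration.

```python
from itertools import combinations
import math


def ordered_subsequences(words):
    """Yield every non-empty subsequence of `words`, preserving word order."""
    for length in range(1, len(words) + 1):
        for idx in combinations(range(len(words)), length):
            yield [words[i] for i in idx]


def best_path(words, score):
    """Return the ordered subsequence with the highest score."""
    return max(ordered_subsequences(words), key=score)


def path_vector(path, embeddings):
    """Average the word vectors along a path (assumed aggregation)."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in path:
        for k, value in enumerate(embeddings[word]):
            total[k] += value
    return [t / len(path) for t in total]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical stand-in for the LSTM probability: noise words get negative weight.
TOY_WEIGHTS = {"i": 0.5, "like": 1.0, "apples": 1.0, "pears": 1.0, "umm": -1.0}


def toy_score(path):
    return sum(TOY_WEIGHTS.get(word, 0.0) for word in path)


# Invented 2-dimensional embeddings, for illustration only.
EMBEDDINGS = {
    "i": [1.0, 0.0], "like": [0.0, 1.0], "apples": [1.0, 1.0],
    "pears": [1.0, 0.5], "umm": [-1.0, -1.0],
}

path_a = best_path(["i", "umm", "like", "apples"], toy_score)
path_b = best_path(["i", "like", "pears"], toy_score)
similarity = cosine(path_vector(path_a, EMBEDDINGS), path_vector(path_b, EMBEDDINGS))
print(path_a, path_b, round(similarity, 3))
```

In this toy run the noise word "umm" is excluded from the selected path, so it no longer influences the vector used for similarity. The resulting pairwise similarities would then be passed to a clustering algorithm; the visible text does not name which one.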

Description

Technical field

[0001] The invention relates to the field of vectorized representation of text features, and in particular to a short text clustering method based on deep semantic path search.

Background technique

[0002] With the widespread popularity of social media and the rise of chatbots, short texts have become a very important channel for finding valuable information, and short text clustering is an important task. Its main challenge is the sparsity of the text representation. To overcome this difficulty, some researchers try to enrich and expand short text data through Wikipedia or an ontology database. However, this expansion is a semantic expansion in the "word" dimension, whereas clustering is in fact a calculation at the "sentence" level. As a result, two sentences with opposite meanings are very likely to appear in the same cluster, for example "I like to eat apples" and "I don't like to eat apples".

[0003] This is the inconsistency of sentence expression brough...


Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/35
CPC: G06F16/35
Inventor: 李开宇, 李秀生
Owner: SICHUAN XW BANK CO LTD