Dialogue short text clustering method based on form and semantic similarity

A semantic-similarity and short-text technology, applied to text database clustering/classification, unstructured text data retrieval, instrumentation, etc. It addresses problems such as short texts not being handled well, single topics, and prominent noise (synonyms, mixed case, input errors).

Status: Inactive. Publication Date: 2014-08-27
EAST CHINA NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0005] (2) The topic is single: a short dialogue text usually discusses only one thing.
[0007] (4) Synonyms, mixed use of upper- and lower-case letters, and input errors are prominent.
For example, Sahami et al. submit short texts to search engines, collect the most relevant texts returned, and use them as auxiliary information for the corresponding short texts. This method alleviates the information sparsity of short texts to a certain extent, but A la...



Examples


Embodiment Construction

[0031] The present invention can effectively cluster short dialogue texts. It is further described below with reference to figure 2, taking the dialogue text provided by Xiaoi robot as an example.

[0032] The implementation process has two stages. In the first stage, the original text data is filtered and preprocessed (text length filtering, Chinese word segmentation, and unification of English strings), and a keyword extraction tool is then used to obtain keywords and their weights. In the second stage, the short text collection is clustered using the form (string) similarity and the semantic similarity of words; this is the FS-STC clustering method.
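The filtering part of the first stage can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the length bounds `MIN_LEN` and `MAX_LEN` are hypothetical values, "unification of English strings" is assumed here to mean lowercasing runs of ASCII letters, and word segmentation and keyword extraction are left out because they depend on external tools.

```python
import re

# Hypothetical length bounds; the source does not state the actual thresholds.
MIN_LEN = 2
MAX_LEN = 50


def normalize_english(text):
    """Lowercase every run of ASCII letters so 'WiFi' and 'wifi' compare equal.

    Assumption: this is one plausible reading of "unification of English strings".
    """
    return re.sub(r"[A-Za-z]+", lambda m: m.group(0).lower(), text)


def preprocess(texts, min_len=MIN_LEN, max_len=MAX_LEN):
    """Trim, normalize, and length-filter raw dialogue texts."""
    kept = []
    for raw in texts:
        text = normalize_english(raw.strip())
        # Drop texts that are too short or too long to cluster usefully.
        if min_len <= len(text) <= max_len:
            kept.append(text)
    return kept
```

The surviving texts would then be passed to a word segmenter and a keyword extractor to obtain the keyword weights used in the second stage.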

[0033] 1). Preprocessing stage

[0034] If the text set that needs to be clustered is a short Chinese text, it is first necessary to use the word segmentation tool to segment the short text, and use the Chinese Academy of Sciences 2014 word segmentation tool to segment the t...


Abstract

The invention discloses a dialogue short text clustering method based on form and semantic similarity. Form similarity is computed from string edit distance, and semantic similarity is based on the HowNet and WordNet knowledge bases; the weights of short texts and words are introduced when calculating short text similarity. To a certain extent, the method overcomes the irregular and mistyped noise, synonyms, and semantic gaps found in dialogue short texts, and therefore achieves a considerable improvement over clustering methods based on bag-of-words vectors.
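As a rough illustration of the similarity computation described in the abstract, the sketch below normalizes Levenshtein edit distance into a form similarity and combines it with a pluggable semantic score. Several choices are assumptions made only for this example: the normalization by the longer string's length, taking the maximum of form and semantic similarity per word pair, the symmetric weighted best-match average, and the `semantic_sim` callable standing in for a HowNet/WordNet lookup. The patent's actual formulas may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def form_similarity(a, b):
    """Edit distance scaled into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def word_similarity(w1, w2, semantic_sim=None):
    """Assumed combination: the larger of form and semantic similarity."""
    form = form_similarity(w1, w2)
    sem = semantic_sim(w1, w2) if semantic_sim else 0.0
    return max(form, sem)


def text_similarity(kw_a, kw_b, semantic_sim=None):
    """Symmetric, weight-averaged best-match similarity.

    kw_a and kw_b map keywords to weights (hypothetical input format).
    """
    def directed(src, dst):
        total = sum(src.values())
        if total == 0 or not dst:
            return 0.0
        return sum(
            weight * max(word_similarity(word, other, semantic_sim) for other in dst)
            for word, weight in src.items()
        ) / total

    return 0.5 * (directed(kw_a, kw_b) + directed(kw_b, kw_a))
```

Because the form similarity tolerates small spelling differences, a mistyped keyword such as "wifl" still scores close to "wifi" even when no knowledge base entry exists for it.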

Description

Technical field

[0001] The invention belongs to the technical field of short text clustering, and relates to a method for clustering dialogue short texts based on string edit distance similarity and the semantic similarity of words.

Background technique

[0002] With the rapid development of mobile communication and the mobile Internet, various human-machine intelligent dialogue systems have emerged, such as Siri, Google Now, and Xiaoi robot. Taking Xiaoi robot as an example, its user base has exceeded 100 million, and it handles 10 billion dialogue visits every year, generating a large amount of valuable dialogue text data. These data are important sources for mining user interests and improving the knowledge base of intelligent dialogue systems. Cluster analysis of these dialogue text data can gather similar dialogue texts and form several important cluster centers, which can improve the efficiency of mining user interests and extracting knowledge t...


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventor: 胡琴敏, 陈国梁, 杨河彬, 罗念, 钟哲凡, 裴逸钧
Owner: EAST CHINA NORMAL UNIV