
Method for fast bi-clustering of short texts

A short-text bi-clustering technology, applied in special data processing applications, instruments, electronic digital data processing, etc., that addresses problems such as unmet application requirements, poor results, and low clustering accuracy.

Active Publication Date: 2013-06-26
中科国力(镇江)智能技术有限公司
5 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

[0006] (2) Accurate calculation of short text similarity
Although many similarity measures exist (such as Euclidean distance, cosine distance, the Pearson coefficient, and VDM), according to our research each has shortcomings and performs poorly in practical applications.
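As a rough illustration of the measures named in [0006], the sketch below computes Euclidean distance, cosine similarity, and the Pearson coefficient on term-count vectors built from two short texts. The bag-of-words representation, the function names, and the sample sentences are assumptions made for this example; the source does not say how the texts are vectorized, and VDM is omitted because it needs class-labeled categorical data.

```python
import math
from collections import Counter

def to_vectors(a, b):
    """Turn two tokenized texts into aligned term-count vectors (assumed representation)."""
    ca, cb = Counter(a), Counter(b)
    vocab = sorted(set(ca) | set(cb))
    return [ca[t] for t in vocab], [cb[t] for t in vocab]

def euclidean(x, y):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def cosine(x, y):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(p * q for p, q in zip(x, y))
    nx = math.sqrt(sum(p * p for p in x))
    ny = math.sqrt(sum(q * q for q in y))
    return dot / (nx * ny) if nx and ny else 0.0

def pearson(x, y):
    """Pearson correlation coefficient between the two count vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sx = math.sqrt(sum((p - mx) ** 2 for p in x))
    sy = math.sqrt(sum((q - my) ** 2 for q in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical sample texts sharing two of six distinct words.
x, y = to_vectors("reset my account password".split(),
                  "how to reset password".split())
print(euclidean(x, y), cosine(x, y), pearson(x, y))
```

On this pair the cosine similarity is positive while the Pearson coefficient is negative, which hints at why measures designed for long, dense vectors can behave inconsistently when the texts contain only a handful of words.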
[0007] (3) Fast and accurate clustering of short texts
Traditional single-level clustering (such as the K-nearest-neighbor method and hierarchical clustering) struggles to cluster accurately. On an open corpus, clustering accuracy is generally very low and cannot meet the needs of practical applications.
Moreover, when the short texts are slightly longer, clustering accuracy is even lower.



Examples


Detailed Description of the Embodiments

[0036] The present invention is further described below with reference to the accompanying drawings and specific embodiments.

[0037] As shown in Figure 1, a fast short-text bi-clustering method includes the following steps:

[0038] Step 1) Preprocess short-text interference items: with the support of an irrelevant-word dictionary and a part-of-speech dictionary, quickly identify and process the irrelevant words and parts of speech in each short text.

[0039] Step 2) Compute the similarity between each pair of preprocessed short texts to form a sparse short-text similarity matrix.

[0040] Step 3) Perform first-level clustering on the sparse short-text similarity matrix, dividing similar short texts into clusters one by one according to the computed short-text similarities.

[0041] Step 4) Perform second-level clustering of the short texts on the basis of the first-level clustering results.
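The four steps can be sketched end to end as follows. This is only a minimal sketch under stated assumptions, not the patented algorithm: the excerpt does not disclose the similarity formula or the clustering rules, so the example substitutes Jaccard similarity over token sets, connected components for first-level clustering, and a greedy merge of clusters with overlapping vocabulary for second-level clustering. The dictionaries, thresholds, function names, and sample texts are all hypothetical.

```python
from collections import defaultdict

# Hypothetical dictionaries; the patent's real dictionaries are not shown in this excerpt.
IRRELEVANT_WORDS = {"please", "hi", "thanks", "the", "a", "to", "my", "how", "do", "i"}
POS_DICT = {"reset": "verb", "change": "verb", "cancel": "verb", "password": "noun",
            "account": "noun", "refund": "noun", "order": "noun"}
KEPT_POS = {"noun", "verb"}  # assumption: only nouns and verbs carry meaning

def preprocess(text):
    """Step 1: drop irrelevant words and words whose part of speech is not kept."""
    return frozenset(t for t in text.lower().split()
                     if t not in IRRELEVANT_WORDS and POS_DICT.get(t) in KEPT_POS)

def jaccard(a, b):
    """Assumed similarity measure: shared tokens divided by all tokens."""
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity_matrix(docs, threshold):
    """Step 2: sparse matrix stored as {(i, j): sim}, keeping only sim >= threshold."""
    sparse = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = jaccard(docs[i], docs[j])
            if s >= threshold:
                sparse[(i, j)] = s
    return sparse

def first_level(n, sparse):
    """Step 3 (assumed rule): texts linked by a stored similarity share a cluster."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for i, j in sparse:
        parent[find(i)] = find(j)
    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    return sorted(groups.values())

def second_level(clusters, docs, merge_threshold):
    """Step 4 (assumed rule): greedily merge clusters whose pooled tokens overlap."""
    merged = []
    for cluster in clusters:
        tokens = frozenset().union(*(docs[i] for i in cluster))
        for entry in merged:
            if jaccard(tokens, entry["tokens"]) >= merge_threshold:
                entry["members"].extend(cluster)
                entry["tokens"] = entry["tokens"] | tokens
                break
        else:
            merged.append({"members": list(cluster), "tokens": tokens})
    return [sorted(e["members"]) for e in merged]

texts = ["Please reset my password", "How do I reset the password",
         "change account password", "cancel my order", "cancel order refund"]
docs = [preprocess(t) for t in texts]
sparse = similarity_matrix(docs, threshold=0.6)
first = first_level(len(docs), sparse)
final = second_level(first, docs, merge_threshold=0.2)
print(first)
print(final)
```

In this toy run the strict first-level threshold leaves "change account password" in a cluster of its own, and the looser second-level pass merges it with the other password texts. That is one plausible reading of why a second pass helps, but the thresholds here are illustrative and the patent's own criteria may differ.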

[0042] The above steps will b...



Abstract

The invention relates to a method for fast bi-clustering of short texts. The method comprises the following steps: (1) preprocessing short-text interference items by quickly identifying and processing irrelevant words and parts of speech in the short texts, with the support of an irrelevant-word dictionary and a part-of-speech dictionary; (2) calculating the similarity of each pair of preprocessed short texts to form a sparse short-text similarity matrix; (3) performing first-level clustering on the sparse short-text similarity matrix, dividing similar short texts into clusters one by one according to the computed short-text similarities; and (4) performing second-level clustering on the basis of the first-level clustering results.

Description

Technical field

[0001] The invention relates to natural language processing in the field of computer artificial intelligence, and in particular to a fast short-text bi-clustering method and its implementation using natural language processing and data clustering.

Background technique

[0002] Many natural language applications share a basic, common problem: given a corpus composed of short texts (hereinafter referred to as a short-text corpus, or simply a corpus), how to cluster the short texts into different classes according to some notion of similarity.

[0003] Generally speaking, the basic idea of text clustering is to cluster "similar" texts into one class, within which the "differences" between texts are small. Texts that are not "similar" are clustered into other classes, and the "gap" between different classes is large. Here, "similarity" / "gap" is a measure between texts that depends on the requirements of the application. There are many traditio...
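The idea in [0003], small differences within a class and a large gap between classes, can be checked numerically. The sketch below is a hypothetical illustration that uses Jaccard similarity over token sets as the measure; the source leaves the measure open, so both the measure and the function names are assumptions.

```python
from itertools import combinations

def jaccard(a, b):
    """Assumed similarity measure: shared tokens divided by all tokens."""
    return len(a & b) / len(a | b) if a | b else 0.0

def mean(values):
    return sum(values) / len(values) if values else 0.0

def within_and_between(clusters):
    """Return (mean similarity inside clusters, mean similarity across clusters)."""
    within = [jaccard(a, b) for c in clusters for a, b in combinations(c, 2)]
    between = [jaccard(a, b)
               for c1, c2 in combinations(clusters, 2)
               for a in c1 for b in c2]
    return mean(within), mean(between)

# Hypothetical clustering of four preprocessed short texts into two classes.
clusters = [
    [{"reset", "password"}, {"reset", "password", "account"}],
    [{"cancel", "order"}, {"cancel", "order", "refund"}],
]
within, between = within_and_between(clusters)
print(within, between)
```

A clustering that matches the description in [0003] should show a within-class mean well above the between-class mean, as it does for this toy data.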


Application Information

IPC(8): G06F17/30, G06F17/27
Inventor: 符建辉, 刘亮亮, 王石, 王卫民
Owner: 中科国力(镇江)智能技术有限公司