Method for clustering network-based short texts

A clustering method and text clustering technology, applied to text database clustering/classification, unstructured text data retrieval, special data processing applications, etc. It addresses the scarcity of clustering studies on network short texts, unsatisfactory clustering results, and results that are highly sensitive to values, and achieves high clustering accuracy, a good clustering effect, and strong practicability.

Active Publication Date: 2015-08-26
QILU UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0009] Grid-based clustering methods (the CLIQUE clustering method, etc.): because the processing time of grid clustering depends on the number of cells into which each dimension of the space is divided, these methods are sensitive to isolated points and cannot handle large data sets, which to some extent reduces the quality and accuracy of the clustering.
[0010] The classic partition-based clustering method is the traditional K-means clustering method. Because its initial cluster centers are selected at random, the accuracy of the clustering results is reduced, and the algorithm is very sensitive to outliers. Moreover, current improvements to the K-means clustering method target ordinary texts, and there are few studies on clustering network short texts. Since the characteristics of ordinary texts differ from those of network short texts, clustering network short texts with an existing K-means method improved for ordinary texts gives unsatisfactory results.
Therefore, the existing technology cannot perform clustering according to the characteristics of network short texts themselves.
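
The dependence of traditional K-means on its randomly chosen initial centers, noted in [0010], can be illustrated with a minimal sketch. This is plain textbook K-means, not the method of the invention; the function names `kmeans` and `sse`, the 2-D sample points, and the seeds are all assumptions made for illustration.

```python
import random


def kmeans(points, k, seed, iters=50):
    """Plain K-means; the k initial centers are drawn at random from the data."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
        for p, lab in zip(points, labels)
    )


# Hypothetical data: three tight groups plus one outlier at (20, 20).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2),
          (9.0, 0.0), (9.1, 0.1), (20.0, 20.0)]

for seed in range(5):
    labels, centers = kmeans(points, 3, seed)
    print(seed, labels, round(sse(points, labels, centers), 2))
```

Running the loop with different seeds lets a reader compare the resulting partitions and their squared error; any differences come only from the random initial centers, and the single outlier competes for a center of its own.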

Examples

Embodiment

[0063] 1. Experiment with the TFIDF formula for weight calculation in preprocessing.

[0064] User comments collected from Zhongguancun Online serve as the experimental data set. The traditional TFIDF formula is used first for the calculation, and the data set is segmented into words with ICTCLAS, the word segmentation software of the Chinese Academy of Sciences. Table 1 below shows part of the experimental texts after stop-word removal.

[0065]

[0066] We now take the first text in Table 1, after stop-word removal, and use the original TFIDF formula to calculate the weights of its feature items. The results are shown in Table 2 below.

[0067]

[0068] The number of texts containing each feature item of text 1 shows that the item with the highest count is not necessarily the most important: although some words appear in many texts, they are not important keywords for distinguishing texts. It can be seen that the original TFIDF ...
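
The weight calculation discussed above can be sketched as follows. The source does not reproduce the formula it calls the original TFIDF, so this sketch assumes the common textbook form, term frequency multiplied by log(N / document frequency). The function name `tfidf` and the toy English documents are hypothetical stand-ins for the segmented comments of Table 1.

```python
import math
from collections import Counter


def tfidf(docs):
    """Weight each term by (count / doc length) * log(N / document frequency).

    This is the common textbook form, assumed here for illustration only.
    """
    n = len(docs)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical segmented comments with stop words already removed.
docs = [
    ["screen", "clear", "battery", "weak"],
    ["screen", "bright", "price", "high"],
    ["screen", "battery", "lasting"],
]

for term, weight in sorted(tfidf(docs)[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In this toy set, "screen" occurs in every text, so its inverse document frequency is log(1) = 0 and its weight vanishes, while "clear", which occurs in only one text, receives the largest weight in the first document.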

Abstract

The present invention discloses a method for clustering network-based short texts. The implementation process comprises: first, acquiring network-based comments; second, pre-processing the acquired comments, which comprises performing word segmentation on them, removing stop words, extracting keywords, and calculating a weight for each keyword; and third, clustering the pre-processed texts. Compared with the prior art, the method collects and analyzes massive data over the network so that users can conveniently find valuable information. The method clusters network-based short texts with high precision and therefore meets users' practical needs. It is highly practical and easy to popularize.
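
The pre-processing steps named in the abstract can be sketched roughly as below. This is an illustration under stated assumptions, not the patented procedure: a regular expression on English text stands in for a Chinese word segmenter, `STOP_WORDS` is a hypothetical list, raw keyword counts stand in for the weights, and cosine similarity is assumed as the measure a later clustering step might compare texts with.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "it", "very", "in", "and", "but"}


def preprocess(comment):
    """Segment a comment, drop stop words, and count the remaining keywords."""
    tokens = re.findall(r"[a-z0-9]+", comment.lower())  # stand-in segmenter
    keywords = [t for t in tokens if t not in STOP_WORDS]
    return Counter(keywords)  # raw counts stand in for keyword weights


def cosine(u, v):
    """Cosine similarity between two sparse keyword-weight vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


# Illustrative comments in the style of product reviews.
comments = [
    "The power adapter is very hot in summer",
    "The adapter is hot and the battery is weak",
    "It is a pity that it is not a 4G network",
]
vectors = [preprocess(c) for c in comments]

print(sorted(vectors[0]))
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The first two comments share the keywords "adapter" and "hot" and so score a higher similarity than the first and third, which share none. A clustering step would group texts on the basis of such pairwise comparisons.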

Description

Technical field

[0001] The invention relates to the technical field of Web text clustering, in particular to a highly practical network short text clustering method.

Background technique

[0002] Nowadays, the Internet has become the primary platform for people to obtain information and interact with each other, through sites such as Zhongguancun Online, Autohome, and Pacific Computer. Through these interactive portals, people can look up product information and express their own opinions, including the advantages, disadvantages, and views they raise about related products. This material contains a large amount of valuable information waiting to be discovered.

[0003] For example, before we buy a certain mobile phone, we often go to websites like Zhongguancun Online to find out other users' comments on this mobile phone, such as "It's a pity that it is not a 4G network. Disappointed, the power adapter is very hot in summer!", "The main screen The material is made of flexibl...

Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventors: 耿玉水, 张立说, 孙涛
Owner QILU UNIV OF TECH