Method for clustering network-based short texts

A text clustering method applicable to text database clustering/classification, unstructured text data retrieval, and special data processing applications. It addresses the scarcity of clustering studies for such texts, unsatisfactory clustering results, and high sensitivity to certain values, and it achieves high clustering accuracy, a good clustering effect, and strong practicability.

Active Publication Date: 2015-08-26
QILU UNIV OF TECH

AI Technical Summary

Problems solved by technology

[0009] Grid-based clustering methods (the CLIQUE clustering method, etc.): the processing time of grid clustering depends on the number of cells into which each dimension is divided, and these methods are sensitive to outliers and cannot handle large data sets, which reduces the quality and accuracy of the clustering to some extent;
[0010] The classic partition-based clustering method is traditional K-means clustering. Because its initial cluster centers are selected at random, the accuracy of the clustering results is reduced, and the alg...
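The dependence of K-means on its starting centers can be illustrated with a small sketch. This is a plain Lloyd-style K-means written for illustration only, not the method claimed in the patent; the function name `kmeans`, the four sample points, and both sets of initial centers are hypothetical. The two runs differ only in their initial centers, yet they converge to different partitions with different sums of squared errors.

```python
from math import dist


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd-style K-means on tuples; `centers` are the initial centres."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)  # an empty cluster keeps its centre
        if new_centers == centers:  # converged: no centre moved
            break
        centers = new_centers
    # Sum of squared distances from each point to its assigned centre.
    sse = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centers, sse


# Hypothetical data: the corners of a wide rectangle.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Initial centres near the left and right edges: left/right split.
good_labels, _, good_sse = kmeans(points, [(0.0, 0.5), (4.0, 0.5)])

# Initial centres on the bottom and top edges: stuck in a top/bottom split.
bad_labels, _, bad_sse = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])
```

With randomly chosen initial centers, either outcome can occur, which is one way the accuracy of the result comes to depend on initialization.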



Examples


Example Embodiment

[0062] Example:

[0063] 1. Experiment using the TF-IDF formula for weight calculation during preprocessing.

[0064] User comments obtained from Zhongguancun Online serve as the experimental data set. The traditional TF-IDF formula is used first for the calculation. The experimental data set is segmented with ICTCLAS, the Chinese Academy of Sciences word segmentation software. Table 1 below shows part of the experimental texts after stop-word removal.

[0065] [Table 1: not reproduced in the source]

[0066] We now take the first text in Table 1, after stop-word removal, and use the original TF-IDF formula to calculate the weights of its feature items. The results are shown in Table 2 below.

[0067] [Table 2: not reproduced in the source]

[0068] The number of texts containing each feature item of text one shows that the item found in the most texts is not necessarily the most important. Although some words appear in a large number of texts, they are not important keywords for distinguishing texts. It can be seen th...
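The weighting described in this example can be sketched in a few lines. The patent's exact formula is not shown in this extract, so the code assumes a common textbook form, weight = (term count / text length) × log(N / df), where N is the number of texts and df is the number of texts containing the term. The function name `tfidf` and the sample comments are hypothetical, and the comments are assumed to be already segmented with stop words removed.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return ({term: weight} per text, document frequencies).

    Assumed textbook form: weight = (count / text length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many texts each term appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights, df


# Hypothetical, already-segmented comments with stop words removed.
docs = [
    ["phone", "screen", "clear", "battery", "weak"],
    ["phone", "camera", "sharp"],
    ["phone", "battery", "hot"],
]
weights, df = tfidf(docs)
```

In this toy data "phone" appears in every text, so its weight in the first text is zero, while "screen", which appears in only one text, receives the largest weight there. This matches the observation that a word contained in many texts is not necessarily a useful keyword for distinguishing them.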



Abstract

The present invention discloses a method for clustering network-based short texts. The implementation process comprises: first acquiring network comments; pre-processing the acquired comments, where the pre-processing comprises performing word segmentation on the comments, removing stop words, extracting keywords, and calculating weights for the keywords; and clustering the pre-processed texts. Compared with the prior art, the method collects and analyzes massive data from the network so that users can conveniently search for valuable information. The method clusters network short texts with high precision and thereby meets users' practical needs. It is therefore highly practical and easy to popularize.
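The order of the steps in the abstract (segmentation, stop-word removal, term weighting, clustering) can be sketched end to end. Everything concrete here is an assumption made for illustration: whitespace splitting stands in for a real Chinese word segmenter, the stop-word list is hypothetical, raw term counts stand in for the keyword weights, and a simple single-pass cosine-similarity grouping stands in for the patent's clustering step, which is not described in this extract.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "very", "it"}  # hypothetical stop-word list


def preprocess(comment):
    """Segment (whitespace stand-in), lowercase, drop stop words, count terms."""
    tokens = [t for t in comment.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def single_pass_cluster(vectors, threshold=0.3):
    """Assign each vector to the first cluster whose seed is similar enough."""
    seeds, labels = [], []
    for vec in vectors:
        for idx, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                labels.append(idx)
                break
        else:
            # No existing cluster is close enough: start a new one.
            seeds.append(vec)
            labels.append(len(seeds) - 1)
    return labels


# Hypothetical short comments.
comments = [
    "the screen is very clear",
    "screen clear and bright",
    "the battery is weak",
    "battery drains fast",
]
labels = single_pass_cluster([preprocess(c) for c in comments])
```

On these four comments the screen-related texts fall into one group and the battery-related texts into another. The threshold of 0.3 is arbitrary, and a real pipeline would substitute a proper segmenter and the weighting scheme of its choice.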

Description

Technical field

[0001] The invention relates to the technical field of Web text clustering, and in particular to a highly practical method for clustering network short texts.

Background technique

[0002] Today the Internet has become the primary platform on which people obtain information and interact with each other, for example Zhongguancun Online, Autohome, and Pacific Computer. Through these interactive portals people can look up product information and express their own opinions. The advantages, disadvantages, and opinions put forward about related products contain a large amount of valuable information waiting to be discovered.

[0003] For example, before buying a certain mobile phone, we often go to websites such as Zhongguancun Online to read other users' comments on it, such as "It's a pity that it is not a 4G network. Disappointed, the power adapter is very hot in summer!", "The main screen material is made of flexibl...


Application Information

IPC(8): G06F17/30
CPC: G06F16/35
Inventors: Geng Yushui (耿玉水), Zhang Lishuo (张立说), Sun Tao (孙涛)
Owner: QILU UNIV OF TECH