An internet data clustering method and system

A clustering method applied in the field of Internet text data clustering methods and systems. It addresses problems such as unstable clustering results and a limited scope of application, and achieves the effect of reducing instability.

Status: Inactive; Publication Date: 2017-06-27
SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI

AI Technical Summary

Problems solved by technology

[0007] The existing FG-k-means algorithm outperforms other algorithms in clustering performance, but the following problems remain. First, FG-k-means needs feature-group information to carry out its two-level clustering optimization, yet text data generally does not provide such information, which limits its scope of application. Second, the clustering results of FG-k-means are unstable because they depend on the choice of initial center points.
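The sensitivity to initial center points can be illustrated with a small sketch. It uses plain one-dimensional k-means (Lloyd's iteration), not FG-k-means itself; the function name `kmeans_1d`, the data points, and both sets of starting centers are invented for illustration.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Plain k-means on 1-D data, started from the given initial centers."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the run has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels, centers


# Hypothetical toy data: six points on a line, clustered into k = 2 groups.
points = [0.0, 1.0, 5.0, 6.0, 10.0, 11.0]

labels_a, centers_a = kmeans_1d(points, centers=[0.0, 11.0])
labels_b, centers_b = kmeans_1d(points, centers=[0.0, 1.0])

print(labels_a)  # partition reached from the first initialization
print(labels_b)  # partition reached from the second initialization
```

Both runs converge, but to different partitions of the same data, which is the kind of instability the paragraph above describes.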

Embodiment Construction

[0029] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0030] Figure 1 is a flowchart of a method for clustering Internet text data according to an embodiment of the present invention. The Internet text data clustering method of the embodiment comprises the following steps:

[0031] Step 100: train a topic model on the text data to obtain the probability distribution matrix of all keywords under each topic, and group the keywords in the text collection;

[0032] In step 100, the number of topics, the data volume of the clustering integration model, and the number of clusters can be set in the topic model; when keywords are grouped...
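A minimal sketch of keyword grouping from a topic-keyword probability matrix follows. The grouping rule is truncated in the text above, so the rule used here (assign each keyword to the topic under which its probability is highest) is an assumption, as are the function name `group_keywords`, the toy matrix, and the vocabulary.

```python
from collections import defaultdict


def group_keywords(topic_word_probs, vocabulary):
    """Group keywords by topic.

    topic_word_probs[t][j] is the probability of keyword j under topic t.
    ASSUMPTION: a keyword joins the group of its highest-probability topic.
    """
    groups = defaultdict(list)
    for j, word in enumerate(vocabulary):
        column = [row[j] for row in topic_word_probs]
        best_topic = max(range(len(column)), key=column.__getitem__)
        groups[best_topic].append(word)
    return dict(groups)


# Hypothetical 2-topic x 4-keyword matrix; each row sums to 1.
topic_word_probs = [
    [0.50, 0.30, 0.15, 0.05],  # topic 0
    [0.05, 0.10, 0.40, 0.45],  # topic 1
]
vocabulary = ["goal", "match", "stock", "market"]

groups = group_keywords(topic_word_probs, vocabulary)
print(groups)
```

In practice the matrix would come from a trained topic model rather than being written by hand; the resulting groups play the role of the feature-group information that a two-level clustering step needs.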

Abstract

The invention relates to an Internet text data clustering method and system. The method comprises the steps of: (a) training a topic model on the text data to obtain a probability distribution matrix of all keywords under each topic, and grouping the keywords in the text set; (b) rearranging the feature set of the text data according to the keyword grouping to obtain new document data containing keyword-group feature information; (c) running a two-level soft subspace clustering algorithm on the new document data to generate a cluster center matrix and a sample membership matrix; (d) repeating steps a to c n times to obtain multiple clustering results; and (e) running a clustering integration algorithm on the resulting model set to integrate the multiple clustering results into a final clustering result. The method and system can effectively reduce the instability of the FG-k-means algorithm.
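The integration step (e) can be sketched as follows. The abstract does not name the integration algorithm, so this example swaps in a common technique, co-association (evidence accumulation) consensus: two documents end up in the same final cluster if they were co-clustered in more than a threshold fraction of runs. The function names, the threshold of 0.5, and the three toy labelings are all assumptions.

```python
def co_association(labelings):
    """Fraction of runs in which each pair of samples shares a cluster."""
    n = len(labelings[0])
    runs = len(labelings)
    co = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / runs
    return co


def consensus_clusters(labelings, threshold=0.5):
    """Merge samples whose co-association exceeds the threshold (union-find)."""
    co = co_association(labelings)
    n = len(co)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(j)] = find(i)

    # Relabel the union-find roots as 0, 1, 2, ... in order of appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical results of n = 3 clustering runs over 5 documents.
# Cluster ids are arbitrary per run; only co-membership matters.
labelings = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]

final_labels = consensus_clusters(labelings)
print(final_labels)
```

Working on pairwise co-membership rather than raw cluster ids avoids having to align labels between runs, which is one reason this style of integration is often used when the individual runs start from different initial centers.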

Description

Technical field

[0001] The invention belongs to the technical field of data mining, and in particular relates to an Internet text data clustering method and system.

Background technique

[0002] With the advent of the era of big data, the data faced in the field of data mining has become more and more complex. Internet text data in particular is not only huge in volume; text data constructed with the Vector Space Model also has ultra-high dimensionality and sparsity. Existing data mining clustering algorithms, such as k-means and hierarchical clustering, generally show deficiencies and limitations when applied to text clustering.

[0003] To address the problem of subspace clustering of high-dimensional sparse data, academia has proposed many subspace clustering algorithms, and the soft subspace clustering algorithm is one of them. According to different weighted layers, soft subspace c...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/62
CPC: G06F16/35, G06F18/23213
Inventor: 赵鹤李栋一黄哲学姜青山陈会高琴朱敏蔡业首
Owner: SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI