Clustering method and system

A clustering method and system, applied in the field of data processing, that addresses the increased computing time and reduced performance of clustering operations. It reduces the number of similarity comparisons, lightens the load on system resources, and improves clustering performance.

Active Publication Date: 2011-05-11
ALIBABA GRP HLDG LTD

AI Technical Summary

Problems solved by technology

[0006] The technical problem to be solved by this application is to provide a clustering method that addresses a shortcoming of the prior art: to cluster readable files, the vector of each file must be compared for similarity with the vectors of the other files, which increases computation time and degrades performance.
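To make the cost described above concrete, the following is a minimal sketch of the all-pairs comparison baseline. The similarity measure (cosine) and the function names `cosine`, `all_pairs_similarities`, and `pair_count` are assumptions made for illustration; the source does not name a specific similarity measure here.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def all_pairs_similarities(vectors):
    """Baseline: compare every file vector with every other file vector."""
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


def pair_count(n):
    """Number of comparisons the baseline performs for n files."""
    return n * (n - 1) // 2
```

For n files the baseline performs n(n-1)/2 comparisons, so the work grows quadratically with the number of files.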



Examples


Embodiment 1

[0111] Corresponding to the method provided in Embodiment 1 of the clustering method of the present application, and referring to Figure 4, the present application also provides Embodiment 1 of a clustering system. In this embodiment, the system may include:

[0112] The vectorization unit 401 is configured to vectorize multiple readable files to obtain multiple file vectors corresponding to the multiple readable files.

[0113] In this embodiment, the readable files can be files in any of various formats that can be converted into vectors, for example, Word documents, Excel tables, etc. The multiple readable files are converted into corresponding multiple file vectors. Vectorization converts a readable file into a vector composed of a series of numbers, where each number is the value of a different feature. The vectors corresponding to different readable files are different. The file vector in this application is simply a vector; it is called a file vector to distinguish it from subseq...
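As an illustration of the vectorization step, here is a minimal sketch that assumes each feature is a word and each feature value is that word's count in the file. This bag-of-words choice, and the names `build_vocabulary` and `vectorize`, are assumptions for the example; the source only states that each number represents the value of a feature. Extracting text from Word or Excel files is out of scope here.

```python
from collections import Counter


def build_vocabulary(texts):
    """Shared, ordered feature list so every file vector has the same layout."""
    return sorted({word for text in texts for word in text.lower().split()})


def vectorize(text, vocabulary):
    """Map one readable file's text to a vector of word counts (assumed features)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


texts = ["red apple red", "green apple", "red car"]
vocabulary = build_vocabulary(texts)  # ['apple', 'car', 'green', 'red']
file_vectors = [vectorize(text, vocabulary) for text in texts]
```

Using one shared vocabulary keeps the same feature at the same position in every file vector, which is what makes feature-by-feature operations on the vectors meaningful.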

Embodiment 2

[0119] Corresponding to the method provided in Embodiment 2 of the clustering method of the present application, and referring to Figure 5, the present application also provides a preferred Embodiment 2 of a clustering system. In this embodiment, the system may specifically include:

[0120] The vectorization unit 401 is configured to vectorize multiple readable files to obtain multiple file vectors corresponding to the multiple readable files.

[0121] The extraction unit 402 is specifically configured to sum, feature by feature, the feature values of the common features of the multiple file vectors to obtain the corresponding feature values of the total feature vector.

[0122] The first calculation unit 501 is configured to respectively calculate the first similarity between the plurality of file vectors and the total feature vector.

[0123] The first sorting unit 502 is configured to sort the multiple file vectors for the first time according to the first similarity.

[0124] The second ...
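The extraction, first calculation, and first sorting units described above can be sketched as follows. This is a minimal illustration under stated assumptions: the first similarity is taken to be cosine similarity and the sort is descending, neither of which is specified in the text shown here. The names `total_feature_vector`, `cosine`, and `first_sort` are hypothetical.

```python
import math


def total_feature_vector(file_vectors):
    """Sum the values of each common feature across all file vectors."""
    return [sum(column) for column in zip(*file_vectors)]


def cosine(a, b):
    """Cosine similarity (assumed choice for the 'first similarity')."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_sort(file_vectors):
    """Score each file vector against the total vector, then sort by score."""
    total = total_feature_vector(file_vectors)
    scored = [(cosine(vector, total), index)
              for index, vector in enumerate(file_vectors)]
    # Descending order is an assumption; the source only says "sort".
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return total, scored


total, scored = first_sort([[1, 0, 0, 2], [1, 0, 1, 0], [0, 1, 0, 1]])
order = [index for _, index in scored]
```

Each file vector is compared once with the total feature vector, so this stage needs only as many similarity computations as there are files.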

Embodiment 3

[0131] Corresponding to the method provided in Embodiment 3 of the clustering method of the present application, and referring to Figure 5, the present application also provides a preferred Embodiment 3 of a clustering system. In this embodiment, the system may specifically include:

[0132] A vectorization unit 401, configured to vectorize multiple readable files to obtain multiple file vectors corresponding to multiple readable files;

[0133] The extraction unit 402 is specifically configured to sum, feature by feature, the feature values of the common features of the multiple file vectors to obtain the corresponding feature values of the total feature vector.

[0134] The first calculation unit 501 is configured to respectively calculate the first similarity between the plurality of file vectors and the total feature vector.

[0135] The first sorting unit 502 is configured to sort the multiple file vectors for the first time according to the first similarity.

[0136] The second calcula...


Abstract

The invention discloses a clustering method and system. The method comprises: vectorizing a plurality of readable documents to obtain a plurality of document vectors corresponding to the readable documents; extracting a general characteristic vector common to the readable documents from the document vectors; and clustering the readable documents according to the general characteristic vector and the similarity among the document vectors. The invention further provides a method and system for clustering Internet web pages. Clustering with the method or system provided by the embodiments of the invention reduces the number of similarity comparisons among the document vectors. This in turn reduces the load on system resources such as CPU and memory usage, shortens the running time of clustering, and improves clustering performance.
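One way a common characteristic vector could cut down the number of comparisons is sketched below. This is an assumed illustration, not the claimed procedure: the documents are sorted by cosine similarity to the summed vector, and each document is then compared only with its predecessor in that order, starting a new cluster when the similarity falls below a threshold. The function name `cluster_by_sorted_neighbors`, the cosine measure, and the threshold rule are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_by_sorted_neighbors(vectors, threshold):
    """Cluster by comparing only neighbors in the similarity-sorted order."""
    if not vectors:
        return []
    # Common vector: feature-wise sum of all document vectors.
    total = [sum(column) for column in zip(*vectors)]
    # n comparisons against the common vector, then one sort.
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(vectors[i], total),
                   reverse=True)
    clusters = [[order[0]]]
    # n - 1 neighbor comparisons instead of n * (n - 1) / 2 pairwise ones.
    for previous, current in zip(order, order[1:]):
        if cosine(vectors[previous], vectors[current]) >= threshold:
            clusters[-1].append(current)
        else:
            clusters.append([current])
    return clusters


docs = [[1, 0, 0, 2], [2, 0, 0, 4], [0, 3, 3, 0], [0, 1, 1, 0]]
clusters = cluster_by_sorted_neighbors(docs, threshold=0.9)
```

In this sketch the total number of similarity computations is about 2n for n documents. A neighbor-only rule can miss similar documents that end up far apart in the sorted order, so it trades some accuracy for speed.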

Description

Technical field

[0001] This application relates to the field of data processing, in particular to a clustering method and system.

Background

[0002] In data processing, the process of dividing a collection of physical or abstract objects into multiple classes of similar objects is called clustering. A cluster generated by clustering is a collection of data objects that are similar to the other objects in the same cluster and different from objects in other clusters. When identifying readable files with a large amount of data, it is often necessary to perform clustering calculations, that is, to divide the readable files into categories according to different thresholds, so as to determine which readable files belong to the same category and finally cluster similar documents together.

[0003] In the prior art, the process of clustering a large number of readable files is generally as follows: firstly, the readable files are vectorized based...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
CPC: G06F16/355, G06F16/951, G06F18/23211
Inventor: 张涛, 郭家清
Owner: ALIBABA GRP HLDG LTD