Clustering method and system

A clustering method and system, applied in the field of data processing, that addresses the reduced clustering performance and increased computing time of existing approaches. It reduces the number of similarity comparisons, lowers the load on system resources, and improves clustering performance.

Active Publication Date: 2011-05-11

ALIBABA GRP HLDG LTD

3 Cites, 22 Cited by


Problems solved by technology

[0006] The technical problem addressed by this application is to provide a clustering method that avoids the increased calculation time and degraded performance of the prior art, in which clustering requires calculating, for each readable file, its vector similarity with the other files.


Examples


Embodiment 1

[0111] Corresponding to the method provided in Embodiment 1 of the clustering method of the present application, and referring to Figure 4, the present application also provides Embodiment 1 of a clustering system. In this embodiment, the system may include:

[0112] The vectorization unit 401 is configured to vectorize multiple readable files to obtain multiple file vectors corresponding to the multiple readable files.

[0113] In this embodiment, the readable files can be files in various formats that can be converted into vectors, for example Word documents, Excel tables, etc. The multiple readable files are converted into the corresponding multiple file vectors. Vectorization converts a readable file into a vector composed of a series of numbers, where each number is the value of a different feature. The vectors corresponding to different readable files are different. A file vector in this application is simply a vector; it is called a file vector to distinguish it from subseq...
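The vectorization step in [0113] can be illustrated with a minimal sketch. The text does not say which features are used, so this example assumes, purely for illustration, that each feature is a lowercase word and its value is the word count. The function name `vectorize` is hypothetical.

```python
from collections import Counter


def vectorize(texts):
    """Turn each text into a sparse file vector {feature: value}.

    Assumption (not from the patent text): a feature is a lowercase
    whitespace-separated word, and its value is how often it occurs.
    """
    return [dict(Counter(text.lower().split())) for text in texts]


docs = ["red apple red", "green apple", "red car"]
file_vectors = vectorize(docs)
```

A sparse dictionary is used here so that features missing from a file simply have no entry; a dense list over a shared vocabulary would work equally well.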

Embodiment 2

[0119] Corresponding to the method provided in Embodiment 2 of the clustering method of the present application, and referring to Figure 5, the present application also provides a preferred Embodiment 2 of a clustering system. In this embodiment, the system may specifically include:

[0120] The vectorization unit 401 is configured to vectorize multiple readable files to obtain multiple file vectors corresponding to the multiple readable files.

[0121] The extraction unit 402 is specifically configured to sum, feature by feature, the feature values of the common features of the multiple file vectors to obtain the corresponding feature values of the total feature vector.

[0122] The first calculation unit 501 is configured to respectively calculate the first similarity between the plurality of file vectors and the total feature vector.

[0123] The first sorting unit 502 is configured to sort the multiple file vectors for the first time according to the first similarity.

[0124] The second ...
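The roles of units 402, 501 and 502 can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the similarity measure, so cosine similarity is assumed, and the descending sort order is also an assumption. The function names are hypothetical, and file vectors are represented as `{feature: value}` dictionaries.

```python
import math
from collections import Counter


def total_feature_vector(file_vectors):
    """Sum the values of each feature across all file vectors (unit 402)."""
    total = Counter()
    for vec in file_vectors:
        total.update(vec)  # adds values for features already present
    return dict(total)


def cosine(a, b):
    """Cosine similarity of two sparse vectors (assumed measure)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def first_sort(file_vectors):
    """Score each file against the total vector (unit 501), then sort (unit 502).

    Returns (similarity, index) pairs, most similar first (assumed order).
    """
    total = total_feature_vector(file_vectors)
    scored = [(cosine(vec, total), i) for i, vec in enumerate(file_vectors)]
    return sorted(scored, reverse=True)
```

Each file is compared once against the total feature vector, so this stage needs one similarity calculation per file rather than one per pair of files.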

Embodiment 3

[0131] Corresponding to the method provided in Embodiment 3 of the clustering method of the present application, and referring to Figure 5, the present application also provides a preferred Embodiment 3 of a clustering system. In this embodiment, the system may specifically include:

[0132] A vectorization unit 401, configured to vectorize multiple readable files to obtain multiple file vectors corresponding to multiple readable files;

[0133] The extraction unit 402 is specifically configured to sum, feature by feature, the feature values of the common features of the multiple file vectors to obtain the corresponding feature values of the total feature vector.

[0134] The first calculation unit 501 is configured to respectively calculate the first similarity between the plurality of file vectors and the total feature vector.

[0135] The first sorting unit 502 is configured to sort the multiple file vectors for the first time according to the first similarity.

[0136] The second calcula...


Abstract

The invention discloses a clustering method and system. The method comprises: vectorizing a plurality of readable documents to obtain a plurality of document vectors corresponding to the readable documents; extracting a common total feature vector of the readable documents from the document vectors; and clustering the readable documents according to the total feature vector and the similarity among the document vectors. The invention further provides a method and system for clustering Internet web pages. Clustering with the method or system provided by the embodiments reduces the number of similarity comparisons among document vectors, which lowers the load on system resources such as CPU and memory usage, shortens the running time of clustering, and improves clustering performance.
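To make the claimed reduction in comparisons concrete, here is a rough sketch of one way a sorted order could be exploited. The grouping rule is an assumption and is not taken from the patent text, whose later steps are truncated here: after sorting by similarity to the total vector, each document is compared only with its neighbor in the sorted order, and a new cluster starts when that similarity falls below a hypothetical `threshold`. Cosine similarity is likewise assumed.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two sparse {feature: value} vectors (assumed measure)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_sorted(vectors, threshold):
    """Cluster by comparing only neighbors in the sorted order (hypothetical rule).

    Returns (clusters, comparisons), where clusters hold document indices.
    """
    if not vectors:
        return [], 0
    total = Counter()
    for vec in vectors:
        total.update(vec)
    # One similarity calculation per document against the total vector.
    scores = [cosine(vec, total) for vec in vectors]
    comparisons = len(vectors)
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        comparisons += 1  # one neighbor comparison per adjacent pair
        if cosine(vectors[prev], vectors[cur]) >= threshold:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters, comparisons
```

Under these assumptions, n documents need 2n - 1 similarity calculations, compared with n(n - 1)/2 for comparing every pair, so the saving only appears for larger collections. Neighbor-only comparison is an approximation and may group documents differently from a full pairwise comparison.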

Description

Technical field

[0001] This application relates to the field of data processing, in particular to a clustering method and system.

Background

[0002] In data processing, the process of dividing a collection of physical or abstract objects into multiple classes of similar objects is called clustering. A cluster generated by clustering is a collection of data objects that are similar to the other objects in the same cluster and different from objects in other clusters. When identifying readable files with a large amount of data, it is often necessary to perform clustering calculations, that is, to divide the readable files into different categories according to different thresholds, so as to determine which readable files belong to the same category and finally cluster similar documents together.

[0003] In the prior art, the process of clustering a large number of readable files is generally as follows: firstly, the readable files are vectorized based...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/355, G06F16/951, G06F18/23211

Inventors: 张涛, 郭家清

Owner: ALIBABA GRP HLDG LTD
