Method for improving data clustering quality by improving k-means based on deviation maximization method

A technology of maximizing deviation and data clustering, applied in the field of data processing, can solve problems such as the selection of K value that cannot be solved, and achieve fast speed and good effect

Inactive Publication Date: 2020-02-07
山东正云信息科技有限公司
View PDF0 Cites 0 Cited by
  • Summary
  • Abstract
  • Description
  • Claims
  • Application Information

AI Technical Summary

Problems solved by technology

[0006] Furthermore, the traditional k-means algorithm cannot solve the problem of selecting the K value, because the data set does not have a class label, so it is not known how many classes it should be divided into. It can only determine which class the elements should be classified into according to the data attributes.

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
View more

Image

Smart Image Click on the blue labels to locate them in the text.
Viewing Examples
Smart Image
  • Method for improving data clustering quality by improving k-means based on deviation maximization method
  • Method for improving data clustering quality by improving k-means based on deviation maximization method
  • Method for improving data clustering quality by improving k-means based on deviation maximization method

Examples

Experimental program
Comparison scheme
Effect test

Embodiment 1

[0039] A method for improving the quality of data clustering based on the method of maximizing deviation to improve k-means, including:

[0040] 1) Calculate the weight of the maximum deviation of the read data:

[0041] Objective weight ω j The formula for determining the method is as follows:

[0042]

[0043] In formula (1), i and j represent rows and columns. The objective weight can make full use of the information of the decision object, and widen the gap between data in order to make better decisions;

[0044] 2) Use the maximum dispersion method to calculate the weight w of each attribute of the sample k , And then construct a weighted matrix, and construct a matrix according to the sample data arrangement;

[0045] 3) Weight the attributes of the data set:

[0046] Set data attribute X = {x i1 ,x i2 ,...,x im }, where m represents the number of attributes, divide these data into k classes, and the attribute weight value is w 1 ,w 2 ,...,w m , And w j > = 0, j = 1, 2,..., m, the...

Embodiment 2

[0051] As described in Example 1, a method for improving the quality of data clustering based on the method of maximizing dispersion to improve k-means, the difference is that the k-means++ algorithm specifically includes:

[0052] Assuming that the data sample set is divided into k categories, the steps of the k-means++ algorithm are as follows:

[0053] 1) Randomly select the first center point and preset a value of k, where k is the number of categories;

[0054] 3) Calculate the shortest distance between each element and each center point, denoted by D(X); among them, the larger the point of D(X), the greater the probability that the center point is, until k center points are found;

[0055] 3) After the cth iteration, for any sample data, find the distance to the k centers, and then classify the sample into the cluster where the center with the shortest distance is located;

[0056] 4) Use the mean value calculation method to update the center value of these clusters;

[0057] 5) Fo...

experiment example

[0059] Attached figure 1 , To verify the effect of the present invention.

[0060] This experimental example is to use the method of the present invention to improve the k-means based on the dispersion maximization method to improve the quality of data clustering, respectively, to perform data on the two representative data sets Iris and Wine in the UCI machine learning database. Cluster processing, and then visually check the data clustering quality.

[0061] Among them, the UCI database is a database specially used for testing machine learning and data mining algorithms. The data in the database has a certain classification, so the quality of the clustering can be seen intuitively. The data sets Iris and Wine, as experimental data, are shown in Table 3.1 Experimental Data Sets. The Iris data set contains 150 labeled 4-dimensional sample points, and the number of cluster categories is k=3; the Wine data set has 178 items, and 13 parameters are used to distinguish 3 wine varieti...

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to view more

PUM

No PUM Login to view more

Abstract

A method for improving data clustering quality by improving k-means based on a deviation maximization method is characterized by comprising the following steps: 1) performing deviation maximization weight calculation on read-in data: 2) calculating a weight wk of each attribute of a sample by using the deviation maximization method, and then constructing a weighting matrix; 3) weighting the attributes of the data set: 4) calling a k-means algorithm or a k-means + + algorithm, judging whether iteration is terminated or not by judging whether the result is converged to a specified threshold value or not, and finally obtaining a clustering result. Aiming at the defect that an algorithm in the prior art is used for carrying out indifference processing on all data sample attributes, the invention discloses a method for improving data clustering quality by improving k-means based on a deviation maximization method, and the method specifically comprises the steps: carrying out the weighting of attributes, obtaining the objective weight value of each attribute according to the specific information in a data set, increasing the difference of all data, and achieving a better data clusteringeffect.

Description

Technical field [0001] The invention discloses a method for improving data clustering quality based on a deviation maximization method to improve k-means, and belongs to the technical field of data processing. Background technique [0002] With the rapid development of society, various information technologies such as artificial intelligence and data mining have been applied in many aspects, and many industries will generate extremely large data sets. Cluster analysis is one of the commonly used data mining methods. It is not a specific algorithm in itself, but a general task to be solved. The algorithm is used to achieve these tasks. Clustering does not have a priori data, it is completely dependent on the similarity between the data to divide it into different clusters, and the similarity is generally calculated based on the attributes between the data elements, and the expected result of the final calculation is in different clusters The attribute values ​​of data objects are...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine
Login to view more

Application Information

Patent Timeline
no application Login to view more
Patent Type & Authority Applications(China)
IPC IPC(8): G06K9/62
CPCG06F18/23213
Inventor 张凯李雪梅王祥凯
Owner 山东正云信息科技有限公司
Who we serve
  • R&D Engineer
  • R&D Manager
  • IP Professional
Why Eureka
  • Industry Leading Data Capabilities
  • Powerful AI technology
  • Patent DNA Extraction
Social media
Try Eureka
PatSnap group products