Binning method based on k-means clustering

A binning technique based on a clustering algorithm, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses binning errors on data sets and offers fast running speed, good generalization ability, and improved accuracy, reliability, and interpretability.

Status: Inactive. Publication Date: 2015-04-22
GUANGDONG POWER GRID CO LTD INFORMATION CENT
3 Cites, 22 Cited by

AI Technical Summary

Problems solved by technology

[0015] In order to solve the technical problem that the existing binning method is likely to cause errors for data sets with obvious data density distribution bias, the present invention provides a binning method based on the k-means clustering algori

Examples

Embodiment

[0030] For a Logistic model used for probability prediction, the continuous variables are discretized with both the conventional equal-depth binning method and the k-means-based binning method of the present invention. The resulting WOE (Weight of Evidence) values are then used as model input, so that the probability prediction models obtained under the two binning methods can be compared.
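The WOE encoding mentioned above can be sketched as follows. This is only an illustration: the source does not give its WOE formula, so the sketch assumes the common definition WOE_i = ln((pos_i / total_pos) / (neg_i / total_neg)); some references use the inverse ratio. The function name `woe_per_bin` and the smoothing parameter `eps` are hypothetical.

```python
import math


def woe_per_bin(bin_ids, labels, n_bins, eps=0.5):
    """Return one WOE value per bin (assumed definition, see lead-in).

    bin_ids: bin index of each instance (0 .. n_bins - 1)
    labels:  class of each instance, 1 (positive) or 0 (negative)
    eps:     additive smoothing (an assumption) so empty cells do not hit log(0)
    """
    pos = [0] * n_bins
    neg = [0] * n_bins
    for b, y in zip(bin_ids, labels):
        if y == 1:
            pos[b] += 1
        else:
            neg[b] += 1
    total_pos = sum(pos) + eps * n_bins
    total_neg = sum(neg) + eps * n_bins
    # Ratio of the bin's share of positives to its share of negatives.
    return [
        math.log(((p + eps) / total_pos) / ((n + eps) / total_neg))
        for p, n in zip(pos, neg)
    ]
```

Each continuous value would then be replaced by the WOE of the bin it falls into before fitting the Logistic model, which is what makes the two binning methods comparable on the same model.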

[0031] The input data set has 2 decision classes, 1.72 million instances, and 30 condition attributes.

[0032] The method specifically includes the following steps; the process of the present invention is shown in Figure 1:

[0033] S1. Preprocess the continuous variables, which includes removing missing values and outliers from the data set. Missing values are removed directly; outliers are identified with GESR, a common outlier discrimination method in statistics. In the field of data analysis, the GESR discriminant method is recognized as an effective method for discriminatin...

Abstract

The invention discloses a binning method based on k-means clustering. The method comprises the following steps: continuous variables are preprocessed; the preprocessed data are normalized; and a k-means clustering algorithm is applied to the normalized data to divide them into a plurality of intervals. The initial centers of the k-means clustering algorithm are set with the equal-interval method. Once cluster centers are available, the midpoint between each pair of adjacent cluster centers is used as a division point, and each object is assigned to the closest class, so that the data are divided into multiple intervals. Each cluster center is then recalculated and the data are divided again, until no cluster center changes any more, which gives the final clustering result. The method addresses the technical problem that existing binning methods are likely to cause errors on data sets with an obviously biased data density distribution; the k-means clustering algorithm no longer selects its initial centers randomly, and the binning result is accurate.
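The procedure in the abstract can be sketched for a single variable as follows. This is a minimal sketch under stated assumptions, not the patented implementation: min-max scaling stands in for the unspecified "normalization", the equal-interval initial centers are taken as the centers of k equal-width intervals, and the names `normalize` and `kmeans_bins` are hypothetical.

```python
def normalize(values):
    """Min-max scale to [0, 1] (assumed; the source only says 'normalization')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def kmeans_bins(values, k, max_iter=100):
    """Bin 1-D data with k-means; return (centers, cut_points)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Equal-interval initialization instead of random starting centers
    # (assumption: the center of each of the k equal-width intervals).
    centers = [lo + (i + 0.5) * width for i in range(k)]
    for _ in range(max_iter):
        # Midpoints of adjacent centers act as division points; in one
        # dimension this is the same as assigning each value to its
        # nearest center.
        cuts = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        groups = [[] for _ in centers]
        for v in values:
            groups[sum(v > c for c in cuts)].append(v)
        # Recompute each center as the mean of its group; an empty group
        # keeps its previous center (a choice made for this sketch).
        new_centers = [
            sum(g) / len(g) if g else c for g, c in zip(groups, centers)
        ]
        if new_centers == centers:  # no center changed: converged
            break
        centers = new_centers
    cuts = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    return centers, cuts
```

Because the starting centers are deterministic, repeated runs on the same data give the same bins, which is the property the abstract contrasts with random initialization.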

Description

Technical field

[0001] The invention belongs to the field of data preprocessing for data analysis and mining, and in particular relates to a binning method based on k-means clustering.

Background

[0002] In data analysis and mining, one means of data preprocessing is to discretize continuous variables, and the most commonly used discretization method is to bin them. A good binning method can effectively remove noise from continuous variables, smooth the data, increase data granularity, reduce computational complexity, and provide a better basis for qualitative and quantitative analysis in subsequent analysis and mining.

[0003] At present, the commonly used binning methods are the equal-depth method, the equal-width (isometric) method, and the expert-definition method. The equal-depth method sorts the data so that each bin holds the same amount of data; the equal-width method sorts the data set and distributes the bins evenly over the data value interval, that is...
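The two conventional methods named in [0003] can be compared on a small skewed sample. The following is an illustrative sketch only; the function names and the sample data are hypothetical, and bins are taken as half-open intervals [cut, next cut).

```python
def equal_width_cuts(values, k):
    """Cut points that split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]


def equal_depth_cuts(values, k):
    """Cut points chosen so each bin receives about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def bin_counts(values, cuts):
    """Count the values in each bin; a value equal to a cut goes to the upper bin."""
    counts = [0] * (len(cuts) + 1)
    for v in values:
        counts[sum(v >= c for c in cuts)] += 1
    return counts


# Hypothetical sample with one far-away value to mimic a skewed density.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_counts = bin_counts(sample, equal_width_cuts(sample, 3))
depth_counts = bin_counts(sample, equal_depth_cuts(sample, 3))
```

On this sample the equal-width bins are very unevenly filled, while the equal-depth bins are balanced but place 7, 8, and 100 in the same bin despite their very different values. This is the kind of density-bias problem the invention is aimed at.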

Application Information

IPC(8): G06F17/30
CPC: G06F16/285
Inventor: 吴广财, 莫玉纯, 严宇平, 杨秋勇, 桂媛, 江疆
Owner: GUANGDONG POWER GRID CO LTD INFORMATION CENT