Improved algorithm for missing value interpolation

A missing value interpolation technique, applied in the field of data analysis and preprocessing, that addresses the difficulty of judging whether a missing value interpolation is reasonable and correct.

Status: Inactive; Publication Date: 2018-06-22
GUANGDONG KINGPOINT DATA SCI & TECH CO LTD
3 cites, 6 cited by

AI Technical Summary

Problems solved by technology

In practice, the imputation method for missing values is often chosen subjectively, based on personal judgment and past experience. It is d...



Examples


Embodiment 1

[0022] As shown in Figure 1, the present invention is an improved interpolation scheme for missing data values. It combines existing interpolation methods with the principle of the bagging algorithm: interpolation algorithms are selected and retrained according to the characteristics of the data, improving on the original algorithms so as to restore a complete data set that approximates the real one. The imputation scheme for missing values comprises the following steps:

[0023] Step S1: perform hierarchical clustering on all data.

[0024] This ensures, to a certain extent, that complete and incomplete records of the same type are grouped together for analysis, which makes it more reasonable and convenient to build a model suited to imputing the missing values.

[0025] Step S2: for each class that contains missing values, divide the records, according to whether they contain missing values, into a complete data group m1 and...
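
Steps S1 and S2 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the patent's implementation: the source does not say how records with missing entries are compared during clustering, so the rescaled partial distance, the average-linkage rule, the fixed cluster count, and every function name below are hypothetical choices.

```python
import math

def partial_distance(a, b):
    """Euclidean distance over the features observed (not None) in both records.
    Assumption: rescale by the share of usable features so that records with
    fewer shared features remain comparable."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # nothing in common, treat as maximally distant
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) * len(a) / len(pairs))

def agglomerative(records, n_clusters):
    """Naive average-linkage hierarchical clustering; returns lists of row indices."""
    clusters = [[i] for i in range(len(records))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(partial_distance(records[p], records[q])
                        for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

def split_complete_missing(records, cluster):
    """Step S2: complete group m1 and missing group m2 within one class."""
    m1 = [i for i in cluster if None not in records[i]]
    m2 = [i for i in cluster if None in records[i]]
    return m1, m2

# Toy data: None marks a missing value.
records = [
    [1.0, 1.1], [0.9, None], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [None, 7.8],
]
clusters = agglomerative(records, n_clusters=2)
splits = [split_complete_missing(records, c) for c in clusters]
```

On the toy data, each class ends up with its own m1 and m2, so a separate imputation model can be chosen per class.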

Embodiment 2

[0034] Figure 2 shows how the bagging algorithm handles different types of missing values.

[0035] The basic idea of the bagging algorithm:

[0036] Given a weak learning algorithm and a training set, the accuracy of a single run of the weak learner is not high. If the learning algorithm is applied multiple times to obtain a sequence of prediction functions, and these functions then vote, the accuracy of the final result improves.

[0037] Step S1: Independently and randomly draw a certain amount of data from the data set to form a bootstrap data set, and repeat this process independently many times until many independent bootstrap data sets have been generated.

[0038] Step S2: Each bootstrap data set is independently used to train a "component classifier".

[0039] Step S3: The final classification decision is made by voting on the individual decisions of these "component classifiers".
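
The three bagging steps above can be illustrated with a minimal sketch. It assumes sampling with replacement (the usual meaning of "bootstrap"), a one-feature threshold classifier ("decision stump") as the weak learner, and a simple majority vote; the source fixes none of these details, and the names `train_stump` and `bagging` are hypothetical.

```python
import random
from collections import Counter

def train_stump(data):
    """Weak learner: a one-feature threshold classifier picked by training accuracy."""
    best = None
    for f in range(len(data[0][0])):
        for x, _ in data:
            t = x[f]
            for left, right in ((0, 1), (1, 0)):
                correct = sum((left if xi[f] <= t else right) == yi
                              for xi, yi in data)
                if best is None or correct > best[0]:
                    best = (correct, f, t, left, right)
    _, f, t, left, right = best
    return lambda x: left if x[f] <= t else right

def bagging(data, n_models, rng):
    models = []
    for _ in range(n_models):
        # Step S1: draw a bootstrap data set (same size, with replacement).
        sample = [rng.choice(data) for _ in data]
        # Step S2: train one component classifier per bootstrap data set.
        models.append(train_stump(sample))

    def predict(x):
        # Step S3: majority vote over the component classifiers.
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict

# Toy data: (features, label); the label depends only on the first feature.
data = [([float(v), 0.5], 0) for v in (1, 2, 3, 4)] + \
       [([float(v), 0.5], 1) for v in (7, 8, 9, 10)]
predict = bagging(data, n_models=25, rng=random.Random(0))
```

An odd number of component classifiers avoids tied votes in the two-class case.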

[0040] Improve the bagging algorithm for...

Embodiment 3

[0042] Building on the implementation above, take as an example a randomly generated, normally distributed data set of a certain size containing multiple variables. Data analysis software is used to simulate the complete data set and randomly missing data at different missing rates, and the method above is used to impute and repair the missing values.

[0043] First, the data are hierarchically clustered, the classes containing missing values are analyzed, and the complete data in those classes are randomly divided into training and test sets. The missing data are then handled with mean replacement, hot-deck imputation, regression replacement, multiple imputation, and listwise deletion.
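
Comparing candidate methods on held-out complete data can be sketched as follows. The sketch makes several assumptions that are not in the source: each record is an (x, y) pair where only y is ever missing, accuracy is measured by RMSE on test values that are hidden and then re-imputed, and only three of the listed methods are implemented (multiple imputation and listwise deletion are omitted for brevity). All function names are hypothetical.

```python
import math
import random

def mean_impute(train, x):
    # Mean replacement: ignore x and use the mean of the observed y values.
    return sum(y for _, y in train) / len(train)

def hot_deck_impute(train, x):
    # Hot-deck: copy y from the most similar complete record (the donor).
    return min(train, key=lambda r: abs(r[0] - x))[1]

def regression_impute(train, x):
    # Regression replacement: ordinary least squares fit of y on x.
    n = len(train)
    mx = sum(a for a, _ in train) / n
    my = sum(b for _, b in train) / n
    sxx = sum((a - mx) ** 2 for a, _ in train)
    sxy = sum((a - mx) * (b - my) for a, b in train)
    return my + (sxy / sxx) * (x - mx)

def rmse(method, train, test):
    # Hide each test y, impute it from the training set, compare with the truth.
    return math.sqrt(sum((method(train, x) - y) ** 2 for x, y in test) / len(test))

def select_method(m1, methods, rng, test_fraction=0.3):
    """Randomly split the complete group m1 and return the lowest-error method."""
    rows = m1[:]
    rng.shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    test, train = rows[:n_test], rows[n_test:]
    scores = {name: rmse(fn, train, test) for name, fn in methods.items()}
    return min(scores, key=scores.get), scores

# Toy complete group with an exactly linear relation y = 2x + 1.
m1 = [(float(x), 2.0 * x + 1.0) for x in range(10)]
methods = {"mean": mean_impute, "hot_deck": hot_deck_impute,
           "regression": regression_impute}
best, scores = select_method(m1, methods, random.Random(0))
```

Because the toy relation is exactly linear, regression replacement recovers the hidden values almost perfectly and is selected; on real data the ranking depends on the class being imputed.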

[0044] The imputation effect was examined in terms of both accuracy and distribution: the K-S normality test was used to evaluate whether the data sets processed by the different methods conformed to a normal distribution, and the Wilcoxon signed-rank test was used to compare the process...
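
For readers unfamiliar with the two tests named above, the sketch below computes only their test statistics in plain Python; it does not compute p-values and is not the procedure used in the patent. The K-S statistic is taken against a normal distribution whose mean and standard deviation are estimated from the same sample, which is an assumption here; in that setting the standard K-S critical values are conservative and a Lilliefors-type correction is normally applied. Function names are hypothetical.

```python
import math
import statistics

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    mu, sigma = statistics.mean(xs), statistics.stdev(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = normal_cdf(x, mu, sigma)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def wilcoxon_statistic(a, b):
    """Wilcoxon signed-rank statistic min(W+, W-) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1  # average rank for tied magnitudes
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# Hypothetical example: true values versus values produced by some imputation.
true_values = [1.0, 2.0, 3.0, 4.0, 5.0]
imputed = [1.5, 1.0, 3.25, 2.0, 5.125]
w = wilcoxon_statistic(true_values, imputed)
d_even = ks_statistic([float(v) for v in range(1, 11)])
d_skewed = ks_statistic([1.0] * 9 + [100.0])
```

A larger K-S statistic indicates a worse fit to the normal distribution, so the strongly skewed sample scores higher than the evenly spread one.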


Abstract

The invention discloses an improved algorithm for missing value interpolation. The algorithm comprises the following steps: hierarchical clustering is performed on all data; for each class containing missing values, the records are divided, according to whether they contain missing values, into a complete data set m1 and a missing data set m2; the data in m1 are randomly divided into a training set and a test set, k interpolation methods are used to predict the test set, and the method with the highest accuracy is selected; the selected method is then judged to be weak or not: if it is weak, it is combined with the bagging algorithm to obtain the missing value interpolation function for the class, and if not, it is used directly as the final algorithm; the final interpolation function is used to interpolate the missing values of the class; and the process repeats while any other class containing missing values remains. With this improved algorithm, a method suited to the data set can be selected from the many available missing value interpolation schemes according to the specific properties of the data, and the interpolation effects of the various methods are compared using the principle of the bagging algorithm, yielding a practical method for interpolating the missing values of the data.

Description

Technical field

[0001] The invention relates to the field of data analysis and preprocessing, in particular to an improved algorithm for interpolating missing values.

Background

[0002] We live in an era of big data. In research on many problems, data are often missing or unavailable, which makes it impossible to use the data comprehensively and effectively for analysis. To limit the impact on the research, researchers should take measures from the design stage onward to prevent missing values from arising, and should remedy the missing values that have already occurred as far as possible. Interpolating reasonable estimates for missing data reduces the estimator bias caused by the missing data and yields a complete data set on which statistical analysis methods can be used. In the simulation study of imputation of missing values, the imputation results are measured by comparing the approx...


Application Information

IPC(8): G06F17/18
CPC: G16Z99/00
Inventor: 吴剑文, 许飞月, 陈乐焱, 陶波
Owner: GUANGDONG KINGPOINT DATA SCI & TECH CO LTD