Data missing processing method and system based on set division information amount maximization

A technology for set division and missing-data handling, applied in the fields of electrical digital data processing, special data processing applications, patient-specific data, etc. It addresses the problems of conventional approaches, which conceal the rules of the original data and require a large amount of calculation, so as to improve efficiency, reduce the amount of calculation, and avoid data errors.

Active Publication Date: 2022-04-15
SICHUAN ACADEMY OF MEDICAL SCI SICHUAN PROVINCIAL PEOPLES HOSPITAL

AI Technical Summary

Problems solved by technology

[0007] The purpose of the present invention is to provide a data missing processing method and system based on the maximization of set div


Examples


Embodiment 1

[0046] A data missing processing method based on maximizing the amount of set partition information: obtain patient data containing N patient samples, each with F features, where the obtained data contain missing values, and save the F feature values of the N patients in the form of a matrix S.

[0047] Transform the matrix S to obtain the matrix T using the following mapping: if S_{i,j} contains acquired data, define T_{i,j} = C, where C is a constant; if S_{i,j} is missing, define T_{i,j} = (a_i / F) × C, where a_i is the number of non-missing values in the i-th sample. Then calculate the sum of each column of the matrix T to obtain Sum_1, Sum_2, …, Sum_F,

[0048] where i=1,...,N,

[0049] j=1,...,F,

[0050] And i, j, N and F are all positive integers,

[0051] According to the sum of each column of the matrix T from small to large, the feature data under the colum...
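
The mapping from S to T and the column sums described in this embodiment can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `transform`, the use of `None` to mark a missing value, the choice C = 100, and the small matrix `S` are all assumptions made for the example. The step that follows the ascending ordering of the column sums is truncated in the source and is therefore not implemented here.

```python
from typing import List, Optional, Sequence, Tuple


def transform(
    S: Sequence[Sequence[Optional[float]]], C: float = 100.0
) -> Tuple[List[List[float]], List[float]]:
    """Map S to T and return (T, column sums).

    None marks a missing value (an assumption of this sketch).
    """
    F = len(S[0])  # number of features
    T: List[List[float]] = []
    for row in S:
        a_i = sum(v is not None for v in row)  # observed values in sample i
        fill = a_i * C / F                     # T[i][j] for a missing S[i][j]
        T.append([C if v is not None else fill for v in row])
    sums = [sum(T[i][j] for i in range(len(T))) for j in range(F)]
    return T, sums


# Hypothetical 3-sample, 3-feature matrix; the values themselves are irrelevant,
# only whether each cell is observed or missing matters.
S = [
    [1.0, None, 3.0],
    [None, None, 2.0],
    [4.0, 5.0, 6.0],
]
T, sums = transform(S)

# Feature indices ordered by column sum, from small to large.
order = sorted(range(len(sums)), key=lambda j: sums[j])
print(order)  # [1, 0, 2]
```

In this toy matrix the second feature has the smallest column sum because it is missing in two of the three samples, one of which has few observed values overall.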

Embodiment 2

[0060] For a data set with missing values, where the number of samples is N and the number of features is F, different ways of deleting missing data yield complete-data subsets containing different amounts of data, and the subset containing the largest amount of data can be selected as the optimal subset for subsequent data analysis. Taking Table 1 as an example, V represents an observed value and a blank represents a missing value.

[0061]

[0062] Table 1 Raw data

[0063] For example, deleting the 4th column and then deleting the rows that still contain missing data yields a subset containing samples 2, 5, 6, 8 and features 1, 2, 3; similarly, deleting the 3rd column yields a subset containing samples 2, 8 and features 1, 2, 4; deleting the 3rd and 4th columns yields a subset containing samples 2, 3, 5, 6, 8 and features 1, 2... However, as the number of features increases, the number of deletion methods also increases. In this examp...
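
To make the cost of the exhaustive approach concrete, the sketch below enumerates every non-empty set of retained features, keeps only the rows that are complete on those features, and picks the subset with the most cells. It is an illustration under stated assumptions: Table 1 is not reproduced in this text, so the `observed` mask is hypothetical, and measuring the "amount of data" as rows × retained features is an assumption of this sketch. For F features there are 2^F − 1 candidate feature sets, which is why enumeration becomes impractical as F grows.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def best_complete_subset(
    observed: Sequence[Sequence[bool]],
) -> Tuple[int, Tuple[int, ...], List[int]]:
    """Exhaustively search for the largest complete subset.

    Returns (cell count, retained feature indices, retained sample indices).
    Cell count = rows x columns is an assumed measure of "amount of data".
    """
    n_features = len(observed[0])
    best: Tuple[int, Tuple[int, ...], List[int]] = (0, (), [])
    for k in range(1, n_features + 1):
        for cols in combinations(range(n_features), k):
            # Keep only samples with no missing value in the retained columns.
            rows = [i for i, r in enumerate(observed) if all(r[j] for j in cols)]
            size = len(rows) * len(cols)
            if size > best[0]:
                best = (size, cols, rows)
    return best


# Hypothetical mask (True = observed); this is NOT Table 1 of the patent.
observed = [
    [True, True, False],
    [True, True, True],
    [True, False, True],
    [True, True, False],
]
size, cols, rows = best_complete_subset(observed)
n_candidates = 2 ** len(observed[0]) - 1  # feature sets examined

print(size, cols, rows, n_candidates)  # 6 (0, 1) [0, 1, 3] 7
```

Indices are zero-based, so the result keeps the first two features and drops the third sample. With 20 features the same loop would already examine more than a million candidate feature sets.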

Embodiment 3

[0076] The method of the present invention is analyzed below using the extreme data of Table 6 as an example, in which the data of the 6th feature are missing.

[0077]

[0078] Table 6 Raw data of Embodiment 3

[0079] The data set has 10 samples and 6 features, and the missing-data pattern differs from column to column. Let a_n denote the number of features observed in sample n. Replace every V in the data set with 100, and replace the missing data of each sample with m_n = (a_n / F) × 100; for sample 1, m_n = 100 / 3, and so on. After the observed values and the missing data have been replaced accordingly, the values of each column are summed to obtain Table 7.

[0080]

[0081] Table 7 Intermediate data of Embodiment 3 after conversion

[0082] Here, for convenience of calculation, m_n is rounded. The obtained sum reflects the data retention of each sample on the feature. The lar...
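
The fill value m_n = (a_n / F) × 100 and its rounding can be checked with a few lines of arithmetic. The helper name `fill_value` is hypothetical, and the rounding convention is an assumption: the source only says that m_n is rounded, while Python's built-in `round` rounds exact halves to the nearest even integer (no such tie occurs for F = 6 and C = 100). With F = 6, the stated value m_n = 100 / 3 for sample 1 is consistent with a_1 = 2 observed features.

```python
def fill_value(a_n: int, F: int, C: int = 100) -> int:
    """Rounded replacement value for the missing cells of one sample.

    a_n: number of observed features in the sample
    F:   total number of features
    C:   constant assigned to observed cells (100 in Embodiment 3)
    """
    if not 0 <= a_n <= F:
        raise ValueError("a_n must lie between 0 and F")
    return round(a_n * C / F)


# Every possible fill value for a data set with F = 6 features.
for a_n in range(7):
    print(a_n, fill_value(a_n, 6))
```

Samples with more observed features contribute larger values in their missing cells, so a column's sum is pulled down most by gaps in samples that are themselves sparse.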

Abstract

The invention relates to the technical field of medical data processing, in particular to a data missing processing method and system based on set division information amount maximization. The system comprises a data acquisition unit, a data processing unit, a feature deletion unit and an optimal subset output unit. The method quickly finds the optimal subset of a data set with missing values by evaluating the information amount, which greatly reduces the amount of calculation and improves data processing efficiency in medical data analysis. The method provides a new approach to handling missing data in the medical field and solves the problems of traditional deletion and filling methods, namely the large amount of calculation and the masking of the real data rules.

Description

Technical field

[0001] The present invention relates to the technical field of medical data processing, in particular to a data missing processing method and system for maximizing the amount of information based on set division.

Background technique

[0002] Missing data is usually unavoidable in real-world research: not only outcome variables but also covariates may be missing. Data can be missing for many reasons, such as: 1. the patient refuses to answer specific questions, for example by not reporting sensitive information such as income; 2. the patient is lost to follow-up, for example through migration, death, or withdrawal from the study; 3. certain examinations are not arranged for some patients, for example cholesterol examinations; 4. investigator or mechanical failure, for example an investigator forgetting to record data for subjective reasons, or a sphygmomanometer failure.

[0003] The la...

Application Information

IPC(8): G06F16/215, G16H10/60
CPC: G06F16/215, G16H10/60
Inventor: 吴行伟, 童荣生, 常欢, 吴竞鲜, 温亚林
Owner: SICHUAN ACADEMY OF MEDICAL SCI SICHUAN PROVINCIAL PEOPLES HOSPITAL