According to an aspect, there is provided a computer-implemented method for processing a data set, the data set comprising respective data subsets for a plurality of subjects, each data subset comprising a plurality of data entries, each data entry comprising respective parameter values for each of a plurality of parameters at a respective time point, wherein for a first data subset relating to a first subject in the plurality of subjects, one or more parameter values for at least a first parameter in the plurality of parameters are missing from the first data subset, the method comprising, for a first missing parameter value in a first data entry in the first data subset: (a) determining completeness scores for the first parameter, wherein each completeness
score indicates a level of completeness of the data entries in the first data subset for the first parameter and a respective one of the other parameters in the plurality of parameters; (b) determining correlation scores for the first parameter, wherein each correlation
score indicates a level of correlation between the parameter values in the
data set for the first parameter and the parameter values in the data set for a respective one of the other parameters in the plurality of parameters; (c) determining a subset of the plurality of parameters to use to form regression trees based on the determined completeness scores and the determined correlation scores; (d) forming a plurality of regression trees, wherein each regression tree relates to a respective parameter combination of the first parameter and one or more of the other parameters in the determined subset, and each regression tree is trained to predict a parameter value for the first parameter based on input parameter values for the one or more other parameters in the parameter combination, wherein each regression tree is trained using training data comprising parameter values for the parameters in the respective parameter combination, wherein the training data includes the parameter values in any
data entry in the first data subset for which a parameter value is present for all of the parameters in the respective parameter combination; (e) using each regression tree to predict a parameter value for the first parameter based on parameter values in the first
data entry for the one or more other parameters in the parameter combination; and (f) combining the predicted parameter values to estimate the first missing parameter value. A corresponding apparatus and computer program product are also provided.
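
Steps (a) to (f) can be illustrated with a minimal Python sketch. It is not the claimed implementation, and everything not named in the text above is an assumption made for illustration: the function names, the thresholds `min_comp` and `min_corr`, the cap `max_size` on combination size, the use of absolute Pearson correlation as the correlation score, the fraction of jointly present entries as the completeness score, a tiny depth-limited tree, training on the first subject's entries only, and the mean as the combining rule. Each data entry is modelled as a dict of parameter values with `None` marking a missing value; time points are implied by list order.

```python
from itertools import combinations
from statistics import mean

def present(row, names):
    return all(row[n] is not None for n in names)

def completeness(rows, target, other):
    # (a) assumed score: share of the subject's entries holding both parameters
    return sum(present(r, (target, other)) for r in rows) / len(rows)

def correlation(rows, a, b):
    # (b) assumed score: absolute Pearson correlation over jointly present values
    pairs = [(r[a], r[b]) for r in rows if present(r, (a, b))]
    if len(pairs) < 2:
        return 0.0
    mx, my = mean(x for x, _ in pairs), mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0

def sse(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def fit_tree(X, y, depth=2):
    # Small regression tree: a leaf is a mean, a node is (feature, threshold, low, high)
    best = None
    if depth > 0:
        for j in range(len(X[0])):
            for t in sorted({x[j] for x in X})[:-1]:
                lo = [i for i, x in enumerate(X) if x[j] <= t]
                hi = [i for i, x in enumerate(X) if x[j] > t]
                cost = sse([y[i] for i in lo]) + sse([y[i] for i in hi])
                if best is None or cost < best[0]:
                    best = (cost, j, t, lo, hi)
    if best is None:
        return mean(y)
    _, j, t, lo, hi = best
    grow = lambda idx: fit_tree([X[i] for i in idx], [y[i] for i in idx], depth - 1)
    return (j, t, grow(lo), grow(hi))

def predict_tree(tree, x):
    while isinstance(tree, tuple):
        j, t, low, high = tree
        tree = low if x[j] <= t else high
    return tree

def impute(data_set, subject, index, target, min_comp=0.5, min_corr=0.3, max_size=2):
    rows = data_set[subject]            # first data subset
    entry = rows[index]                 # first data entry (target value missing)
    everyone = [r for subset in data_set.values() for r in subset]
    # (c) keep parameters that pass both (assumed) thresholds
    chosen = [p for p in entry if p != target
              and completeness(rows, target, p) >= min_comp
              and correlation(everyone, target, p) >= min_corr]
    predictions = []
    for size in range(1, max_size + 1):
        for combo in combinations(chosen, size):
            if not present(entry, combo):
                continue                # assumption: skip trees whose inputs are missing
            # (d) train on the subject's entries that are complete for this combination
            train = [r for r in rows if present(r, (target, *combo))]
            if len(train) < 2:
                continue
            tree = fit_tree([[r[p] for p in combo] for r in train],
                            [r[target] for r in train])
            # (e) predict the missing value from the entry's other parameters
            predictions.append(predict_tree(tree, [entry[p] for p in combo]))
    # (f) assumed combining rule: plain average of the per-tree predictions
    return mean(predictions) if predictions else None

# Hypothetical data: two subjects, three parameters, one missing "hr" value
data_set = {
    "A": [{"hr": 60, "rr": 12, "temp": 36.5}, {"hr": 70, "rr": 14, "temp": 36.6},
          {"hr": 80, "rr": 16, "temp": 36.4}, {"hr": 90, "rr": 18, "temp": 36.7},
          {"hr": None, "rr": 17, "temp": 36.5}],
    "B": [{"hr": 65, "rr": 13, "temp": 36.8}, {"hr": 85, "rr": 17, "temp": 36.9}],
}
estimate = impute(data_set, "A", 4, "hr")
```

Because every leaf of such a tree is a mean of observed training values, the combined estimate always lies within the range of the first parameter's values seen for that subject. A weighted combination, for example one favouring trees built from better-correlated parameters, would be an equally plausible choice for step (f).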