Privacy is defined in the context of a guessing
game based on the so-called guessing inequality. The privacy of a sanitized
record, i.e., its guessing
anonymity, is defined as the number of guesses an attacker needs to correctly identify the original
record used to generate the sanitized
record. Using this definition, optimization problems are formulated that optimize one anonymization parameter (privacy or data
distortion) subject to a constraint on the other (data
distortion or privacy, respectively). Optimization is performed across a spectrum of possible values for at least one
noise parameter within a
noise model.
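Such a sweep over noise parameter values might be sketched as follows; the Gaussian noise model, the distance-ranking attacker, and all function and parameter names here are illustrative assumptions, not taken from the original.

```python
import random

def distortion(original, sanitized):
    # Mean squared difference between a record and its sanitized version.
    return sum((o - s) ** 2 for o, s in zip(original, sanitized)) / len(original)

def guessing_anonymity(sanitized, candidates, true_index):
    # Number of guesses an attacker needs: the rank (1-based) of the true
    # original when candidates are ordered by closeness to the sanitized record.
    dists = [distortion(c, sanitized) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: dists[i])
    return order.index(true_index) + 1

def sweep_noise_scale(records, scales, distortion_budget, trials=20, seed=0):
    # Pick the noise scale that maximizes average guessing anonymity
    # subject to the average distortion staying within the budget.
    rng = random.Random(seed)
    best_scale, best_privacy = None, -1.0
    for scale in scales:
        privacies, distortions = [], []
        for _ in range(trials):
            i = rng.randrange(len(records))
            sanitized = [x + rng.gauss(0.0, scale) for x in records[i]]
            privacies.append(guessing_anonymity(sanitized, records, i))
            distortions.append(distortion(records[i], sanitized))
        avg_p = sum(privacies) / len(privacies)
        avg_d = sum(distortions) / len(distortions)
        if avg_d <= distortion_budget and avg_p > best_privacy:
            best_scale, best_privacy = scale, avg_p
    return best_scale, best_privacy
```

The dual problem (minimize distortion subject to a privacy floor) follows the same pattern with the roles of the objective and the constraint swapped.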
Noise is then generated based on the
noise parameter value(s) and applied to the data, which may comprise real-valued and/or categorical data. Prior to anonymization, the data may have identifiers suppressed, and
outlier values in the noise-perturbed data may likewise be modified to further ensure privacy.
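A minimal sketch of that pipeline (suppress identifiers, perturb numeric and categorical fields, then clamp outliers in the perturbed result) could look like this; the field names, the Gaussian and random-replacement noise choices, and all parameters are illustrative assumptions.

```python
import random

def sanitize(record, numeric_fields, categorical_fields, domains,
             identifier_fields, scale=1.0, flip_prob=0.1,
             bounds=None, rng=None):
    # 1. suppress direct identifiers, 2. add noise to numeric fields,
    # 3. randomly replace categorical values, 4. clamp outliers in the
    # noise-perturbed result.
    rng = rng or random.Random(0)
    out = dict(record)
    for f in identifier_fields:          # step 1: drop direct identifiers
        out.pop(f, None)
    for f in numeric_fields:             # step 2: additive Gaussian noise
        out[f] = out[f] + rng.gauss(0.0, scale)
    for f in categorical_fields:         # step 3: random replacement
        if rng.random() < flip_prob:
            out[f] = rng.choice(domains[f])
    if bounds:                           # step 4: clamp outlier values
        for f, (lo, hi) in bounds.items():
            if f in out:
                out[f] = min(max(out[f], lo), hi)
    return out
```

For example, a record such as `{"name": "Alice", "age": 34, "diagnosis": "flu"}` would have `name` removed, `age` perturbed and clamped to its bounds, and `diagnosis` occasionally replaced by another value from its domain.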