Attribute-dependent differential privacy anonymization method

By employing a differential privacy anonymization method based on attribute correlation, this approach dynamically selects privacy-preserving attributes and rationally allocates budgets, thus solving the privacy leakage problem of non-sensitive information in existing technologies and achieving secure and accurate dataset distribution.

CN116305259B · Active · Publication Date: 2026-06-23 · DALIAN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
DALIAN UNIV OF TECH
Filing Date
2023-03-01
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing differential privacy models protect sensitive information well in specific settings, but they still risk privacy leakage when non-sensitive information is attacked, and they do not account for the data distortion caused by an unreasonable allocation of the privacy budget.

Method used

By defining the correlation between attribute values, selecting highly correlated attribute combinations, and allocating an appropriate privacy budget, differential privacy protection is achieved using Laplace and exponential mechanisms. Privacy-protected attributes are dynamically selected to reduce privacy leaks caused by dataset distribution.

Benefits of technology

It effectively prevents attribute linking attacks, reduces data distortion, improves the privacy protection capabilities of datasets, and ensures the security and accuracy of data publication.

✦ Generated by Eureka AI based on patent content.


Abstract

The application belongs to the technical field of information security and discloses a differential privacy anonymization method based on attribute correlation. First, the correlation between attributes is analyzed from the data distribution of the dataset: the size of every equivalence group of every attribute combination is counted by enumerating the combinations, and all minimum attribute groups are found. An attribute combination is a minimum attribute group if every one of its equivalence groups contains at least 30 records and adding any further attribute produces at least one equivalence group with fewer than 30 records. Next, all minimum attribute groups are intersected to determine which attributes they share, and the number of intersections of each attribute is counted. This intersection count serves as the attribute's privacy protection priority, and the privacy budget is allocated according to these priorities. Finally, the attributes with intersections are protected with differential privacy.

Description

Technical Field

[0001] This invention relates to an attribute-related differential privacy anonymization method and belongs to the field of information security technology.

Background Technology

[0002] With the rapid development of computer and network technologies, information resource sharing has become a necessity. However, along with information sharing, the leakage of personal privacy information has become increasingly serious. Although individual data publishing entities take measures to hide personally identifiable information or certain private data in their published data, it is worth noting that linking between multiple public data sources often leads to unexpected privacy leaks, adding many security risks to people's daily lives. Privacy-preserving data publishing is a promising method of information sharing that can protect personal privacy. Currently, privacy protection methods mainly utilize anonymization techniques such as generalization, suppression, deconstruction, permutation, and perturbation.

[0003] With the development of technologies such as data mining, data protection is no longer limited to the four major categories of attacks: record linking, attribute linking, table linking, and probability attacks. Increasingly, attacks target multidimensional data with correlations, exploiting the relationships between attributes to infer potential private data and thus causing information disclosure. Furthermore, attributes not considered for privacy protection during data publication can still cause serious privacy breaches. For example, in medical data publication, sensitive disease-related attributes and quasi-identifiers are well protected. However, when the same published data is used in a medical insurance scenario, insurance companies can use the postal codes in the published data to determine the prevalence of diseases in a region and set insurance amounts for that region accordingly, resulting in a privacy breach. Without privacy protection for these attributes, published data still carries the risk of privacy breaches.

[0004] Existing differential privacy models offer good privacy protection, but they only consider protecting one or more sensitive information items under specific circumstances. When attackers target non-sensitive information, they can still exploit unprotected attributes to cause privacy breaches. Furthermore, the privacy budget must be allocated appropriately to prevent data distortion caused by excessive noise after privacy protection.

Summary of the Invention

[0005] To effectively address the leakage of attribute information in published data and the impact of dataset distribution on privacy protection, this invention proposes a differential privacy anonymization method based on attribute correlation. The scheme first proposes a method for handling the correlation between attribute values and selects highly correlated attribute combinations, ensuring that even attributes not otherwise treated as sensitive are protected against link attacks. Then, this invention allocates appropriate privacy budgets to attributes with different correlations to maximize data privacy protection and applies differential privacy protection to the privacy protection attributes.

[0006] The technical solution of this invention: A differential privacy anonymization method based on attribute correlation, comprising the following steps:

[0007] Define variables:

[0008] Table 1 Commonly Used Variables and Explanations

[0009]

[0010] The specific steps are as follows:

[0011] (1) Test the correlation between attributes in the dataset and select the attribute combination with high correlation;

[0012] The specific process for handling the correlation between attribute values is as follows:

[0013] (1.1) First, all attributes are randomly combined, each attribute combination is retrieved, and all minimum attribute groups are determined;

[0014] Minimal attribute group: This is a division of related attributes, defined by the following formula:

[0015] $N(f_i) \geq 30$

[0016] $N(f_k) < 30$

[0017] Here, $N(f_i)$ is the number of occurrences in the dataset of the i-th value combination of the minimum attribute group D; $N(f_k)$ is the number of occurrences in the dataset of the k-th value combination of attribute group D'; D' is the attribute group obtained by adding any one attribute to the minimum attribute group D; and $s(v_j)$ is the number of distinct values of the j-th attribute in the attribute group. The first condition must hold for every value combination of D, and the second must hold for at least one value combination of each D';

[0018] (1.2) After obtaining all the minimum attribute groups above, perform set processing on all minimum attribute groups. Attributes with intersection are the attributes with high attribute correlation. For attribute groups without intersection, randomly select one attribute as the attribute with high attribute correlation.
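Step (1.1) can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the record layout (a list of dictionaries), and the choice to count only value combinations that actually occur in the data are assumptions. The toy example uses a threshold of 2 instead of 30 so that the dataset stays small.

```python
from collections import Counter
from itertools import combinations


def group_counts(records, attrs):
    """Count how often each observed value combination of `attrs` occurs."""
    return Counter(tuple(r[a] for a in attrs) for r in records)


def all_groups_large(records, attrs, threshold):
    """True if every observed equivalence group has at least `threshold` records."""
    return all(n >= threshold for n in group_counts(records, attrs).values())


def minimal_attribute_groups(records, attributes, threshold=30):
    """Return the attribute groups D for which every N(f_i) >= threshold and
    adding any further attribute yields some N(f_k) < threshold."""
    result = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            if not all_groups_large(records, combo, threshold):
                continue
            others = [a for a in attributes if a not in combo]
            # Every extension D' must contain at least one small group.
            if all(not all_groups_large(records, combo + (a,), threshold)
                   for a in others):
                result.append(frozenset(combo))
    return result


# Hypothetical toy data; threshold lowered to 2 for illustration only.
records = [
    {"sex": "M", "zip": 1, "age": 30},
    {"sex": "M", "zip": 1, "age": 31},
    {"sex": "F", "zip": 2, "age": 30},
    {"sex": "F", "zip": 2, "age": 31},
]
groups = minimal_attribute_groups(records, ["sex", "zip", "age"], threshold=2)
```

In this toy data, "sex" alone is not a minimum attribute group because adding "zip" still leaves every group with two records, whereas the pair ("sex", "zip") is one because adding "age" splits it into groups of size one. Enumerating all combinations is exponential in the number of attributes, so the sketch is only practical for small attribute sets.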

[0019] (2) Allocate appropriate privacy budgets to attributes with different attribute correlations in order to maximize data privacy protection, and apply differential privacy protection to the privacy protection attributes;

[0020] The privacy budget is allocated to attributes with different correlations, and the specific process is as follows:

[0021] (2.1) First, all the obtained minimum attribute groups are processed by set processing, and the number of intersections of each attribute is calculated as the privacy protection priority. The higher the number of intersections, the higher the privacy protection priority. For attribute groups without intersections, any attribute is selected as the privacy protection attribute, and its priority is the same as the priority of the attribute with one intersection. Then all attributes are divided into privacy protection attribute group A and non-privacy protection attribute group B. Then, privacy budget is allocated to the attributes with different privacy protection priorities.
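The priority calculation in step (2.1) can be sketched as below. The text does not spell out how an "intersection" is counted, so this sketch assumes one reading: an attribute shared by m minimum attribute groups has m − 1 intersections, which makes "one intersection" mean "shared by two groups". The function name is hypothetical, and the sketch picks the alphabetically first attribute of an isolated group for reproducibility, whereas the text selects one at random.

```python
from collections import Counter


def protection_priorities(minimal_groups):
    """Map each privacy protection attribute to its priority (assumed reading)."""
    # How many minimum attribute groups contain each attribute.
    membership = Counter(a for group in minimal_groups for a in group)
    # Assumption: shared by m groups -> m - 1 intersections.
    priorities = {a: m - 1 for a, m in membership.items() if m >= 2}
    for group in minimal_groups:
        # A group with no intersection shares no attribute with any other group.
        if all(membership[a] == 1 for a in group):
            chosen = sorted(group)[0]  # deterministic stand-in for a random pick
            priorities[chosen] = 1     # same priority as one intersection
    return priorities


# Hypothetical minimum attribute groups.
groups = [{"age", "zip"}, {"zip", "sex"}, {"zip", "job"}, {"income", "city"}]
priorities = protection_priorities(groups)
group_a = set(priorities)                   # privacy protection attribute group A
group_b = set().union(*groups) - group_a    # non-privacy protection attribute group B
```

Here "zip" is shared by three groups and receives priority 2, while the isolated group contributes "city" with priority 1; the remaining attributes fall into group B.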

[0022] Privacy Budget Allocation: Used to allocate privacy budgets for privacy protection attributes. The privacy budget formulas for each privacy protection attribute are defined as follows:

[0023]

[0024] Here, $w(v_l)$ is the privacy protection priority of the l-th privacy protection attribute, and ε is the total privacy budget;
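The allocation formula itself is not reproduced in this text, so the following is only a hypothetical illustration of a priority-weighted split, not the patented formula. It assumes each attribute receives a share of ε proportional to its priority $w(v_l)$, so that the per-attribute budgets sum to ε. Whether a higher priority should receive a larger or a smaller share is also an assumption here; because a smaller budget means more noise, an inverse weighting would be the alternative.

```python
import math


def allocate_budget(priorities, total_epsilon):
    """Split total_epsilon across attributes in proportion to their priority.

    Assumption for illustration only: epsilon_l = epsilon * w(v_l) / sum of w.
    """
    total_weight = sum(priorities.values())
    return {a: total_epsilon * w / total_weight for a, w in priorities.items()}


# Hypothetical priorities and total budget.
budgets = allocate_budget({"zip": 2, "city": 1}, total_epsilon=1.5)
```

Whatever weighting is used, keeping the sum of the per-attribute budgets equal to ε is what lets the protected dataset satisfy ε-differential privacy as a whole under sequential composition.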

[0025] (2.2) Differential privacy protection is performed on the attributes after privacy budget allocation in the previous step. For each privacy protection attribute that is numerical, a Laplace noise is added to the statistical results using the Laplace mechanism. For each privacy protection attribute that is non-numerical, an exponential mechanism is used to make it satisfy the ε-differential privacy with the corresponding privacy budget, so as to obtain a dataset that satisfies ε-differential privacy as a whole, thereby improving the protection of the privacy attributes of the dataset.

[0026] ε-differential privacy: This provides a method to maximize the accuracy of data queries by using random noise when querying statistical databases, while minimizing the chance of identifying records. It removes individual characteristics while preserving statistical features to protect user privacy. Its formula is defined as follows:

[0027] $\Pr(A(D1) \in S) \leq e^{\varepsilon} \times \Pr(A(D2) \in S)$

[0028] Where D1 and D2 are adjacent datasets, A() is a randomization algorithm, and S is the privacy-preserving dataset;

[0029] Laplace mechanism: The Laplace distribution is a concept in statistics, representing a continuous probability distribution. The Laplace mechanism adds Laplace noise to the statistically accurate result, where the noise x follows a Laplace distribution, as shown in the following formula:

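[0030] $p(x) = \frac{\varepsilon_l}{2\Delta f} \exp\left(-\frac{\varepsilon_l |x|}{\Delta f}\right)$

[0031] Where x is the added noise, $\varepsilon_l$ is the privacy budget allocated to the l-th privacy-preserving attribute, and $\Delta f$ is the global sensitivity;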

[0032] The global sensitivity mentioned above can be calculated using the following formula:

[0033] $\Delta f = \max_{D1, D2} \| f(D1) - f(D2) \|_1$

[0034] Where D1 and D2 are neighboring datasets, and ||f(D1)-f(D2)|| is the Manhattan distance between f(D1) and f(D2);
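As a minimal sketch of the Laplace mechanism for a numerical attribute, the code below perturbs a count query. The function names are hypothetical, and the sensitivity of 1 is an assumption that holds for a count when neighboring datasets differ in one record. The noise is drawn with scale Δf / ε_l as the difference of two exponential samples, which is a standard way to sample a Laplace distribution.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_count(true_count, epsilon_l, sensitivity=1.0, rng=random):
    """Return the count perturbed with Laplace noise of scale sensitivity / epsilon_l."""
    return true_count + laplace_noise(sensitivity / epsilon_l, rng)


# Hypothetical usage: a true count of 100 released with budget 0.5.
rng = random.Random(0)
noisy = [laplace_count(100, epsilon_l=0.5, rng=rng) for _ in range(20000)]
mean_noisy = sum(noisy) / len(noisy)
mean_abs_noise = sum(abs(v - 100) for v in noisy) / len(noisy)
```

The noise is centered on zero, so repeated releases average out near the true count, while the expected absolute error equals the scale Δf / ε_l (2 in this example). A smaller budget therefore means proportionally larger distortion.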

[0035] Exponential mechanism: Used to numerically map non-numerical data to satisfy ε-differential privacy. The probability distribution of the exponential mechanism is as follows:

[0036] $\Pr(o) \propto \exp\left(\frac{\varepsilon_l \, q(D1, o)}{2\Delta q}\right)$

[0037] Where D1 and D2 are adjacent datasets, q(D1,o) is the number of times o appears in dataset D1, $\Delta q$ is the global sensitivity, and $\varepsilon_l$ is the privacy budget;

[0038] The global sensitivity mentioned above can be calculated using the following formula:

[0039] $\Delta q = \max_{o, D1, D2} | q(D1, o) - q(D2, o) |$

[0040] Where D1 and D2 are adjacent datasets, q(D1,o) is the number of times o appears in dataset D1, and q(D2,o) is the number of times o appears in dataset D2;
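The exponential mechanism for a non-numerical attribute can be sketched as follows. This uses the standard form in which an output o is chosen with probability proportional to exp(ε_l · q(D, o) / (2Δq)), with q(D, o) taken as the frequency of o. The function name, the restriction of candidate outputs to the values observed in the data, and the sensitivity of 1 are assumptions for illustration.

```python
import math
import random
from collections import Counter


def exponential_mechanism(values, epsilon_l, sensitivity=1.0, rng=random):
    """Pick one category, favoring frequent ones, via the exponential mechanism."""
    counts = Counter(values)      # q(D, o): number of times o appears
    candidates = sorted(counts)   # assumption: outputs are the observed values
    top = max(counts.values())
    # Subtracting the maximum score keeps exp() from overflowing; it rescales
    # all weights by the same factor and leaves the probabilities unchanged.
    weights = [math.exp(epsilon_l * (counts[o] - top) / (2.0 * sensitivity))
               for o in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical usage on a diagnosis column.
rng = random.Random(1)
column = ["flu"] * 50 + ["cold"] * 5
draws = [exponential_mechanism(column, epsilon_l=1.0, rng=rng) for _ in range(200)]
```

With a budget of 1 the 45-record gap between the two categories makes "flu" overwhelmingly likely; shrinking ε_l flattens the weights and makes the rarer category appear more often in the output.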

[0041] The beneficial effects of this invention are:

[0042] While data publishing provides a wealth of user information for convenient data analysis and mining, it also introduces a series of security issues. Improper publishing can threaten user privacy, especially because attributes left unprocessed can still lead to privacy leaks when the data is applied in other fields. This invention proposes an attribute-related differential privacy anonymization protection method.

[0043] When selecting privacy protection attributes, the impact of the dataset on privacy protection is taken into account. The minimum attribute groups are determined dynamically from how the records in the dataset are distributed over each attribute, and the privacy protection attributes are selected by intersecting these groups, which reduces the possibility of privacy leakage caused by the data distribution of the dataset.

[0044] When allocating the privacy budget, the privacy protection priority of each attribute in the privacy protection attribute group is calculated dynamically from the minimum attribute groups, and the privacy budget is allocated reasonably to reduce the possibility of data distortion caused by excessive noise and to maximize privacy protection.

Attached Figure Description

[0045] Figure 1 is a structural diagram of the attribute-related differential privacy anonymization protection strategy described in this invention.

[0046] Figure 2 is a flowchart of the process of obtaining the combination of privacy protection attributes described in this invention.

[0047] Figure 3 is a flowchart of the differential privacy processing of data described in this invention.

Detailed Implementation

[0048] To make the objectives, technical solutions, and advantages of the present invention clearer and more understandable, the present invention will be further described in detail below through embodiments and accompanying drawings.

[0049] A differential privacy anonymization method based on attribute correlation is proposed. This method includes how to dynamically select privacy-preserving attributes based on the correlation between attributes, how to calculate the priority of privacy-preserving attributes, and how to reasonably allocate the privacy budget for differential privacy protection.

[0050] Referring to Figure 2, the specific process of dynamically selecting privacy protection attributes based on the correlation between attributes is as follows:

[0051] Step 1. Scan the data table to obtain all attribute combinations.

[0052] Step 2. Determine whether each attribute group is a minimum attribute group, using the following formula:

[0053] $N(f_i) \geq 30$

[0054] $N(f_k) < 30$

[0055] Here, $N(f_i)$ is the number of occurrences in the dataset of the i-th value combination of the minimum attribute group D, and $N(f_k)$ is the number of occurrences in the dataset of the k-th value combination of attribute group D';

[0056] Step 3. Divide the attribute groups: retain the minimum attribute groups and leave the non-minimum attribute groups unprocessed.

[0057] Step 4. Perform set processing on the minimum attribute groups to obtain all intersecting attributes. For attribute groups without intersection, randomly select one attribute as the attribute with high correlation. Together these form the privacy protection attribute group.

[0058] Referring to Figure 3, the specific process of calculating the priority of privacy protection attributes and reasonably allocating the privacy budget for differential privacy protection is as follows:

[0059] Step 5. Perform set processing on all minimum attribute groups and calculate the number of intersections of each attribute. For attribute groups with no intersection, randomly select one attribute and assign it an intersection count of 1.

[0060] Step 6. Allocate the privacy budget for each privacy protection attribute. The formula is as follows:

[0061]

[0062] Here, $w(v_i)$ is the privacy protection priority of the i-th privacy protection attribute, and ε is the total privacy budget;

[0063] Step 7. Add noise to the numerical attributes that require privacy protection. Use the Laplace mechanism to add noise x that follows a Laplace distribution, as shown in the following formula:

[0064] $p(x) = \frac{\varepsilon_l}{2\Delta f} \exp\left(-\frac{\varepsilon_l |x|}{\Delta f}\right)$

[0065] Where x is the added noise, $\varepsilon_l$ is the privacy budget allocated to the l-th privacy-preserving attribute, and $\Delta f$ is the global sensitivity;

[0066] The global sensitivity mentioned above can be calculated using the following formula:

[0067] $\Delta f = \max_{D1, D2} \| f(D1) - f(D2) \|_1$

[0068] Where D1 and D2 are neighboring datasets, and ||f(D1)-f(D2)|| is the Manhattan distance between f(D1) and f(D2);

[0069] Step 8. Perform numerical mapping on non-numerical attributes requiring privacy protection to satisfy ε-differential privacy. The exponential mechanism satisfies the following probability distribution:

[0070] $\Pr(o) \propto \exp\left(\frac{\varepsilon_l \, q(D1, o)}{2\Delta q}\right)$

[0071] Where D1 and D2 are adjacent datasets, q(D1,o) is the number of times o appears in dataset D1, $\Delta q$ is the global sensitivity, and $\varepsilon_l$ is the privacy budget;

[0072] The global sensitivity mentioned above can be calculated using the following formula:

[0073] $\Delta q = \max_{o, D1, D2} | q(D1, o) - q(D2, o) |$

[0074] Where D1 and D2 are adjacent datasets, q(D1,o) is the number of times o appears in dataset D1, and q(D2,o) is the number of times o appears in dataset D2.

Claims

1. A differential privacy anonymization method based on attribute correlation, characterized in that the steps are as follows:

(1) Check the correlation between attributes in the dataset and select the attribute combinations with high correlation. The correlation between attribute values is processed as follows:

(1.1) First, randomly combine all attributes, retrieve each attribute combination, and determine all minimum attribute groups. Minimum attribute group: a division of attributes with correlation, defined by the following formula:

$N(f_i) \geq 30$

$N(f_k) < 30$

where $N(f_i)$ is the number of occurrences in the dataset of the i-th value combination of the minimum attribute group D, and $N(f_k)$ is the number of occurrences in the dataset of the k-th value combination of the attribute group D'; D' is the attribute group D with an arbitrary attribute added; $s(v_j)$ is the number of distinct values of the j-th attribute of the attribute group;

(1.2) After obtaining all the minimum attribute groups above, perform set processing on all the minimum attribute groups. The attributes with intersection are the attributes with high correlation. For attribute groups without intersection, randomly select one attribute as the attribute with high correlation;

(2) Allocate appropriate privacy budgets to attributes with different attribute correlations to maximize privacy protection of the data, and perform differential privacy protection on the privacy protection attributes. The privacy budget is allocated as follows:

(2.1) First, perform set processing on all the minimum attribute groups obtained and calculate the number of intersections of each attribute as its privacy protection priority; the higher the number of intersections, the higher the privacy protection priority. For attribute groups without intersection, select any one attribute as the privacy protection attribute, with the same priority as an attribute with one intersection. Then allocate the privacy budget to attributes with different privacy protection priorities. Privacy budget allocation: used to allocate the privacy budget of the privacy protection attributes; the privacy budget of each privacy protection attribute is defined by a formula in which $w(v_l)$ is the privacy protection priority of the l-th privacy protection attribute and ε is the total privacy budget;

(2.2) Perform differential privacy protection on the attributes after the privacy budget allocation in the previous step. For numerical attributes, use the Laplace mechanism to add Laplace noise to the statistical result. For non-numerical attributes, use the exponential mechanism so that they satisfy ε-differential privacy with the corresponding privacy budget. This yields a dataset that satisfies ε-differential privacy as a whole and improves the protection of the privacy attributes of the dataset;

ε-differential privacy: adds random noise when querying a statistical database so as to maximize the accuracy of data queries while minimizing the chance of identifying individual records, retaining statistical characteristics to protect user privacy. Its formula is defined as follows:

$\Pr(A(D1) \in S) \leq e^{\varepsilon} \times \Pr(A(D2) \in S)$

where D1 and D2 are adjacent datasets, A() is a randomization algorithm, and S is the dataset after privacy protection;

Laplace mechanism: adds Laplace noise to the statistical result, where the noise x satisfies a Laplace distribution determined by the global sensitivity $\Delta f$. The global sensitivity is calculated from adjacent datasets D1 and D2, where ||f(D1)-f(D2)|| is the Manhattan distance between f(D1) and f(D2);

Exponential mechanism: used for numerical mapping of non-numerical data so that it satisfies ε-differential privacy. The exponential mechanism satisfies a probability distribution in which D1 and D2 are adjacent datasets, q(D1, o) is the number of occurrences of o in dataset D1, and $\Delta q$ is the global sensitivity. The global sensitivity is calculated from adjacent datasets D1 and D2, where q(D2, o) is the number of times o appears in dataset D2.