[0019] The present invention will be described in detail below in conjunction with the examples.
[0020] The present invention provides a decision tree method based on differential privacy protection, which applies differential privacy protection to the classic greedy C4.5 decision tree: a protection mechanism perturbs the query answers in a way that preserves the privacy of every individual in the dataset. The present invention comprises the following steps:
[0021] 1) Use the Bernoulli random sampling principle to sample the original data set with sampling probability p to obtain a data set sample; the obtained data set satisfies ln(1+p(e^ε−1))-differential privacy:
[0022] Perform Bernoulli random sampling on the original data set with the assumed sampling probability p, put the selected records into the sample set, discard the rest, and calculate the privacy budget ε_p required to build the entire decision tree under the sampling probability p. The privacy budget ε_p is pre-specified by the data owner or data publisher according to the user's privacy requirements: the higher the privacy requirement, the smaller the value of ε_p, which is usually set to 0.01, 0.1, 1, etc. Further, ε_p = ε_1 + ε_2, where ε_1 denotes the first-stage privacy budget and ε_2 denotes the second-stage privacy budget.
[0023] To guarantee that the privacy-preserving decision tree algorithm satisfies ε-differential privacy under the calculated privacy budget ε_p, the Bernoulli sampling method needs to satisfy ln(1+p(e^ε−1))-differential privacy:
[0024] Given a data set D and an algorithm A that satisfies ε-differential privacy on D, let algorithm A_p operate as follows: draw a sample from the data set D with probability p to obtain the data set D_p, and then apply algorithm A to D_p. Then A_p satisfies ln(1+p(e^ε−1))-differential privacy, where ε is the privacy budget.
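The sampling step and the amplification bound above can be illustrated with a short sketch. The function names below (bernoulli_sample, amplified_budget, epsilon_for_target) are illustrative only and not part of the invention; the last function is simply the algebraic rearrangement of ln(1+p(e^ε−1)) used to derive the per-run budget from a target budget.

```python
import math
import random

def bernoulli_sample(records, p, rng=random):
    # Bernoulli sampling: keep each record independently with probability p.
    return [r for r in records if rng.random() < p]

def amplified_budget(epsilon, p):
    # Privacy budget actually spent when an eps-DP algorithm is run on a
    # Bernoulli p-sample of the data: ln(1 + p*(e^eps - 1)).
    return math.log(1.0 + p * (math.exp(epsilon) - 1.0))

def epsilon_for_target(eps_target, p):
    # Inverse of the bound: the per-run budget eps whose sampled version
    # yields eps_target (straightforward rearrangement, shown for illustration).
    return math.log((math.exp(eps_target) - 1.0) / p + 1.0)
```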
[0025] ε-differential privacy: a random algorithm A satisfies ε-differential privacy if, for any pair of adjacent data sets D and D' and any S ∈ Range(A), we have:
[0026] Pr[A(D) = S] ≤ e^ε · Pr[A(D') = S];
[0027] In the formula, Pr represents the probability, and S represents the subdivision plan set.
[0028] The ln(1+p(e^ε−1))-differential privacy guarantee makes it possible to construct the corresponding decision tree on the new data set obtained by Bernoulli random sampling while ensuring that the sampled data set also meets a specific privacy cost. The subsequent privacy-preserving decision tree construction can then be carried out on the selected records, which represent the characteristics of the overall data to a certain extent.
[0029] 2) Perform preliminary processing on the sampled data set, and let continuous attributes and discrete attributes jointly participate in decision-making under privacy protection, so as to reduce the number of calls to the exponential mechanism;
[0030] 2.1) Let s represent a scheme in the set S of candidate subdivision schemes for the values of a continuous attribute, and let u(D,s) represent the availability of the current scheme s. In order to let the continuous attribute and the discrete attributes participate in the selection together, the weight of the scheme s in the set S is chosen with the following probability by the exponential mechanism:
[0031] Pr[s] = exp(ε·u(D,s)/(2Δu)) / Σ_{s'∈S} exp(ε·u(D,s')/(2Δu))
[0032] In the formula, Δu represents the sensitivity.
[0033] 2.2) After the weight is determined, the subdivision scheme s of the continuous attribute participates, with this probability, directly in the availability-based attribute selection together with the discrete attributes.
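A minimal sketch of the exponential-mechanism weighting described in steps 2.1 and 2.2 is given below, assuming each candidate scheme s has already been scored by an availability function u(D,s); the helper name and the use of Python's random.choices are illustrative only.

```python
import math
import random

def exponential_mechanism(schemes, utilities, epsilon, sensitivity, rng=random):
    # Weight of scheme s: exp(eps * u(D, s) / (2 * delta_u)), normalised over S.
    weights = [math.exp(epsilon * u / (2.0 * sensitivity)) for u in utilities]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one scheme according to these probabilities.
    return rng.choices(schemes, weights=probs, k=1)[0]
```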
[0034] In the above steps, determine the availability function used to measure the availability of an attribute subdivision scheme: information gain or the maximum class frequency sum.
[0035] Let x represent an attribute in the record. A subdivision scheme for x can be expressed as s: x→{x_1, x_2, ..., x_q}, where x_1, x_2, ..., x_q denote the subdivision values of x. D_x represents the data set associated with attribute x, and |D_x| denotes the number of records in D_x. D_xj denotes the data set formed by the records whose attribute value is x_j (j = 1, 2, ..., q). The subdivision scheme s: x→{x_1, x_2, ..., x_q} divides the data set D_x into sub-data sets D_x1, D_x2, ..., D_xq. Let the classification attribute of the data set D_x take m different values, that is, it defines m different classes C_i (i = 1, 2, ..., m), and the number of records in each class C_i is c_i.
[0036] The availability function of information gain is u(D,s) = InfoGain(D,s); first calculate the entropy of the data set D_x:
[0037] I(D_x) = −Σ_{i=1}^{m} p_i·log_2(p_i)
[0038] where p_i = c_i/|D_x|. The information gain produced by the scheme s: x→{x_1, x_2, ..., x_q} is InfoGain(D,s) = I(D_x) − E(D_x), where E(D_x) = Σ_{j=1}^{q} (|D_xj|/|D_x|)·I(D_xj) is the weighted sum of the entropies of all sub-data sets and I(D_xj) is the entropy of the data set D_xj. Since the maximum value of I(D_x) is log_2 m and the minimum value of E(D_x) is 0, the sensitivity of the information gain function is Δu = log_2 m.
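The information-gain availability function can be computed as in the sketch below; it assumes each record is a tuple whose last field is the class label, a simplifying assumption made only for illustration.

```python
import math
from collections import Counter

def entropy(records):
    # I(D) = -sum_i p_i * log2(p_i), with p_i = c_i / |D|.
    n = len(records)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(r[-1] for r in records).values())

def info_gain(parent, partitions):
    # InfoGain(D, s) = I(D_x) - E(D_x), where E(D_x) is the weighted sum of
    # the entropies of the sub-data sets produced by scheme s.
    n = len(parent)
    e = sum(len(part) / n * entropy(part) for part in partitions if part)
    return entropy(parent) - e
```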
[0039] The availability function of the maximum class frequency sum, namely u(D,s)=max(D,s); where,
[0040] max(D,s) = Σ_{j=1}^{q} max_i |C_i(D_xj)|
[0041] For any subset D_xj of D_x, |C_i(D_xj)| denotes the number of records in D_xj belonging to class C_i, so max_i |C_i(D_xj)| refers to the number of records of the class with the most tuples. It can be seen from the above formula that the sensitivity of max(D,s) is 1. Therefore, the present invention uses the availability function of the maximum class frequency sum.
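A corresponding sketch for the maximum class frequency sum (sensitivity 1), under the same illustrative record layout as above, could look like this:

```python
from collections import Counter

def max_class_frequency_sum(partitions):
    # max(D, s): for each sub-data set D_xj, take the count of its most
    # frequent class and sum these counts over all sub-data sets.
    # Adding or removing one record changes at most one term by 1,
    # so the sensitivity of this availability function is 1.
    return sum(max(Counter(r[-1] for r in part).values())
               for part in partitions if part)
```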
[0042] 3) Initialize the C4.5 decision tree according to the extracted data set samples, and use the SVT (sparse vector technique) method to judge whether the nodes in the decision tree continue to split, so that the allocation of the privacy budget no longer depends on the height of the tree, which solves the problem of rapidly exhausting the privacy budget when constructing the decision tree recursively.
[0043] The allocation of the privacy budget is closely related to the height of the decision tree: if the tree is too high, the privacy budget is exhausted quickly, the privacy budget ε available for each query and each split-attribute selection becomes very small, the noise increases, and the decision accuracy drops rapidly; if the height is too low, the usability and accuracy of the decision tree are directly affected. In earlier privacy protection methods, the decision tree was therefore set to a fixed height according to the needs of users.
[0044] The SVT method is used to find query counts greater than a certain threshold. Using the SVT (sparse vector) method to judge whether the nodes in the decision tree continue to split is as follows:
[0045] 3.1) Determine the threshold θ, compare the count query result count() with the threshold θ, if count() > θ, the query result is found, otherwise continue.
[0046] The method of determining the threshold θ is: count the leaf nodes of the decision tree constructed without adding noise to obtain the count queries {count(v_1), count(v_2), ..., count(v_n)}, and then take the average of these values as the final threshold θ. Here, v_i denotes a leaf node, i = 1, 2, ..., n.
[0047] 3.2) Add Laplace noise to the threshold θ to obtain the threshold noi(θ) after adding Laplace noise;
[0048] 3.3) Add Laplace noise to the query result count(v) of each node to obtain noicount(v), and compare the noisy query result noicount(v) with the noisy threshold noi(θ). If noicount(v) ≥ noi(θ), the node does not yet meet the privacy requirements and needs to be split; if noicount(v) < noi(θ), the node is no longer split and is treated as a leaf node.
[0049] In step 3.3), Laplace noise is added to protect the privacy of the response to the count query:
[0050] noicount(v) = count(v) + Lap(2/ε_1)
[0051] In the formula, Lap(2/ε_1) is Laplace noise.
[0052] When the SVT method is used to judge whether a node is split, privacy is not protected by iteratively dividing the privacy budget as in previous approaches; each judgment requires only the privacy budget ε_1, so the budget is not consumed quickly by repeated iterations and no large amount of noise is introduced.
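The threshold determination and the noisy comparison of steps 3.1-3.3 are sketched below. The Laplace sampler is a generic inverse-CDF implementation, and the noise scale 2/ε_1 for the threshold is an assumption made for illustration (the text fixes Lap(2/ε_1) only for the count query).

```python
import math
import random

def laplace_noise(scale, rng=random):
    # Inverse-CDF sampling of Laplace(0, scale).
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def svt_threshold(leaf_counts, eps1, rng=random):
    # theta = average of the noise-free leaf counts; noi(theta) adds Laplace
    # noise to it (scale 2/eps1 assumed here).
    theta = sum(leaf_counts) / len(leaf_counts)
    return theta + laplace_noise(2.0 / eps1, rng)

def should_split(count_v, noisy_theta, eps1, rng=random):
    # noicount(v) = count(v) + Lap(2/eps1); split the node iff noicount(v) >= noi(theta).
    return count_v + laplace_noise(2.0 / eps1, rng) >= noisy_theta
```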
[0053] 4) Build a decision tree recursively:
[0054] 4.1) Record the root node as layer l_1;
[0055] 4.2) While l_i < h, traverse all nodes v_j in layer l_{i+1}, v_j ∈ l_{i+1}, where l_i is the current layer and h is the tree height;
[0056] 4.3) If v_j is a leaf node, then noicount(p(v_j)) = noicount(p(v_j)) + noicount(v_j), where p(v_j) denotes the parent node of v_j; otherwise, S = S ∪ {v_j};
[0057] 4.4) Add 1 to the variable i, and record layer h−1 as the current layer;
[0058] 4.5) While l_i > 1, traverse the nodes v_j in layer l_i with v_j ∈ S, and update:
[0059] noicount(p(v_j)) = noicount(p(v_j)) + noicount(v_j);
[0060] 4.6) Update the parent node of v_j to complete the construction of the decision tree.
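Steps 4.1-4.6 amount to a bottom-up pass that pushes noisy node counts into their parents. The simplified sketch below assumes each node knows its parent and its layer, with layers listed from the root (layer 1) down to the deepest layer h; the data layout is hypothetical and chosen only to illustrate the aggregation.

```python
def aggregate_noisy_counts(layers, parent, noicount):
    # Walk from the deepest layer up to layer 2; each node adds its noisy
    # count to its parent, so internal nodes end up carrying the noisy
    # totals of their subtrees.
    for layer in reversed(layers[1:]):
        for v in layer:
            noicount[parent[v]] += noicount[v]
    return noicount
```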
[0061] The above-mentioned embodiments are only used to illustrate the present invention, and the structure, size, location and shape of each component can be changed. On the basis of the technical solution of the present invention, all improvements to individual components according to the principles of the present invention and equivalent transformations shall not be excluded from the protection scope of the present invention.