Method for constructing decision tree based on differential privacy protection

A differential privacy and decision tree technology, applied in digital data protection, instrumentation, electrical digital data processing, etc., which can solve the problems of inefficient selection methods and rapid depletion of the privacy budget, thereby protecting privacy, improving accuracy, and slowing the consumption of the privacy budget.

Inactive Publication Date: 2017-12-29
0 Cites 16 Cited by

AI-Extracted Technical Summary

Problems solved by technology

However, they mainly have two shortcomings: 1) decision classification is only performed on small spatial data; when the data points reach the millions, a large number of classification trees will be generated, res...

Method used

2) preliminary processing is carried out on the dataset obtained by sampling, and continuous attributes and discrete attributes jointly participate in decision selection under privacy protection, to reduce the number of calls to the exponential mechanism;
3) according to the dataset samples, initialize the C4...


The invention relates to a method for constructing a decision tree based on differential privacy protection. The method comprises the following steps: sampling is performed on an original dataset according to a sampling probability p to obtain dataset samples, and the obtained dataset satisfies ln(1+p(e^ε−1))-differential privacy; primary processing is performed on the dataset obtained through sampling, so that continuous attributes and discrete attributes jointly participate in decision selection under privacy protection; a C4.5 decision tree is initialized according to the extracted dataset samples, and the sparse vector technique is utilized to judge whether nodes in the decision tree continue to split; and the decision tree is constructed recursively. With the method, classification accuracy is high, and the decision tree can be constructed efficiently and accurately while privacy is protected.

Application Domain

Character and pattern recognition; Digital data protection

Technology Topic

Privacy protection; Data set +5




  • Experimental program(1)

Example Embodiment

[0019] The present invention will be described in detail below in conjunction with the examples.
[0020] The present invention provides a method for constructing a decision tree based on differential privacy protection, which applies differential privacy to the classic greedy C4.5 decision tree. A protection mechanism alters the query answer in a way that preserves the privacy of everyone in the dataset. The present invention comprises the following steps:
[0021] 1) Use the Bernoulli random sampling principle to sample the original data set with sampling probability p to obtain a data set sample; the obtained data set satisfies ln(1+p(e^ε−1))-differential privacy:
[0022] Perform Bernoulli random sampling on the original data set with sampling probability p: put each selected sample into the sample set and discard the rest, and calculate the privacy budget ε_p required to build the entire decision tree under sampling probability p. The privacy budget ε_p is pre-specified by the data owner or data publisher according to the user's privacy requirements; the higher the privacy requirement, the smaller ε_p, usually set to 0.01, 0.1 or 1. ε_p = ε_1 + ε_2, where ε_1 denotes the first-stage privacy budget and ε_2 denotes the second-stage privacy budget.
[0023] To guarantee, with the calculated value of ε_p, that the privacy-preserving decision tree algorithm satisfies ε-differential privacy, the Bernoulli sampling method needs to satisfy ln(1+p(e^ε−1))-differential privacy:
[0024] Given a data set D and an algorithm A that satisfies ε-differential privacy on D, let method A_p operate as follows: draw a sample from D with probability p to obtain the data set D_p, and then apply algorithm A to D_p. Then A_p satisfies ln(1+p(e^ε−1))-differential privacy, where ε is the privacy budget.
[0025] ε-differential privacy: a random function B satisfies ε-differential privacy if, for any pair of adjacent data sets D and D' and any S ∈ Range(B), we have:
[0026] Pr[B(D)=S] ≤ e^ε · Pr[B(D')=S];
[0027] In the formula, Pr represents the probability, and S represents a set in the output range of B.
[0028] ln(1+p(e^ε−1))-differential privacy enables the construction of a corresponding decision tree on a new data set obtained by Bernoulli random sampling, while ensuring that the sampled data set also meets a specific privacy cost. The subsequent privacy-preserving decision tree construction can then be carried out on the selected records, which represent the characteristics of the overall data to a certain extent.
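The sampling step and the amplification bound quoted above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function names and the fixed seed are assumptions of this example.

```python
import math
import random

def bernoulli_sample(records, p, rng=None):
    """Bernoulli random sampling: keep each record independently with probability p."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility of the sketch
    return [r for r in records if rng.random() < p]

def amplified_epsilon(eps, p):
    """Privacy level obtained when an eps-differentially-private algorithm is run
    on a p-Bernoulli sample: ln(1 + p*(e^eps - 1)).
    For p < 1 this is strictly smaller (i.e. stronger) than eps."""
    return math.log(1.0 + p * (math.exp(eps) - 1.0))
```

For p = 1 the bound reduces to ε itself, and it decreases toward 0 as p → 0, matching the amplification-by-sampling lemma stated in [0024].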
[0029] 2) Perform preliminary processing on the sampled data set, and let continuous attributes and discrete attributes jointly participate in decision-making under privacy protection, to reduce the number of calls to the exponential mechanism;
[0030] 2.1) Let s denote a scheme in the set S of split schemes for a continuous attribute value, and let u(D,s) denote the availability (utility) of the current scheme s. In order to let the continuous attributes and the discrete attributes participate in the selection together, the exponential mechanism selects the weight of scheme s in the set S with the following probability:
[0031] Pr[s] ∝ exp(ε·u(D,s)/(2Δu))
[0032] In the formula, Δu represents the sensitivity.
[0033] 2.2) After the weight is determined, the split scheme s of the continuous attribute participates, according to its probability, directly in attribute availability selection together with the discrete attributes.
[0034] In the above steps, the availability function used to measure the usability of an attribute split scheme is chosen between two candidates: information gain and maximum class frequency sum.
[0035] Let x denote an attribute in a record; a split scheme for x can be expressed as s: x → {x_1, x_2, …, x_q}, where x_1, x_2, …, x_q are the split values of x. D_x denotes the data set with attribute x, and |D_x| denotes the number of records in D_x. D_xj denotes the data set formed by records whose attribute value is x_j (j = 1, 2, …, q). The split scheme s: x → {x_1, x_2, …, x_q} divides the data set D_x into sub-datasets D_x1, D_x2, …, D_xq. Suppose the classification attribute of D_x has m different values, defining m different classes C_i (i = 1, 2, …, m), and each class C_i contains c_i records.
[0036] The availability function based on information gain is u(D,s) = InfoGain(D,s). First calculate the entropy of the data set D_x:
[0037] I(D_x) = −∑_{i=1}^{m} p_i·log2(p_i)
[0038] where p_i = c_i/|D_x|. The information gain produced by scheme s: x → {x_1, x_2, …, x_q} is InfoGain(D,s) = I(D_x) − E(D_x), where E(D_x) = ∑_{j=1}^{q} (|D_xj|/|D_x|)·I(D_xj) is the weighted sum of the entropies of all subsets, and I(D_xj) is the entropy of the data set D_xj. Since the maximum value of I(D_x) is log2 m and the minimum value of E(D_x) is 0, the sensitivity of the information gain function is Δu = log2 m.
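The entropy and information-gain computation just described can be sketched as follows; the function names are ours and the code is an illustrative rendering of the formulas, not the patented implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """I(D_x) = -sum_i p_i * log2(p_i), with p_i = c_i / |D_x|."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, subsets):
    """InfoGain(D, s) = I(D_x) - E(D_x), where E(D_x) is the weighted sum of
    the entropies of the sub-datasets produced by split scheme s."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in subsets)
    return entropy(labels) - weighted
```

Because the sensitivity of this utility grows as log2 m with the number of classes, a utility with constant sensitivity (the maximum class frequency sum below) adds less noise under the exponential mechanism.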
[0039] The availability function based on the maximum class frequency sum is u(D,s) = max(D,s), where
[0040] max(D,s) = ∑_{j=1}^{q} max_i |D_xj ∩ C_i|
[0041] i.e., for each subset D_xj of D_x, max_i |D_xj ∩ C_i| is the number of records of the most frequent class in that node. It can be seen from the above formula that the sensitivity of max(D,s) is 1. Therefore, the present invention uses the availability function of the maximum class frequency sum.
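A sketch of the exponential mechanism using the max-class-frequency utility (Δu = 1) is given below. The names, default seed, and two-scheme usage example are assumptions of this illustration, not part of the patent.

```python
import math
import random
from collections import Counter

def max_class_freq(subsets):
    """u(D, s) = sum over sub-datasets of the count of the most frequent class.
    Adding or removing one record changes this sum by at most 1, so Δu = 1."""
    return sum(max(Counter(g).values()) for g in subsets)

def exponential_mechanism(schemes, utility, eps, sensitivity=1.0, rng=None):
    """Select a split scheme with probability proportional to
    exp(eps * u(D, s) / (2 * Δu)), as in paragraph [0031]."""
    rng = rng or random.Random(0)  # fixed seed only for a reproducible sketch
    weights = [math.exp(eps * utility(s) / (2.0 * sensitivity)) for s in schemes]
    r = rng.random() * sum(weights)
    for scheme, w in zip(schemes, weights):
        r -= w
        if r <= 0:
            return scheme
    return schemes[-1]
```

With a large ε the mechanism almost always returns the highest-utility scheme; as ε shrinks the choice becomes more uniform, which is the privacy/accuracy trade-off the budget controls.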
[0042] 3) Initialize the C4.5 decision tree according to the extracted data set samples, and use the SVT (sparse vector technique) method to judge whether the nodes in the decision tree continue to split, so that the allocation of the privacy budget no longer depends on the height of the tree; this solves the problem of rapidly exhausting the privacy budget when building the decision tree recursively.
[0043] The privacy budget allocation is closely related to the height of the decision tree. If the tree is too high, the privacy budget is exhausted quickly and the budget ε available for each query and split-attribute selection is very small, so the noise increases and decision accuracy decreases rapidly; if the tree is too low, the usability and accuracy of the decision tree are directly affected. In previous experiments with privacy protection methods, the decision tree was set to a fixed height according to the needs of users.
[0044] The SVT method is used to find query counts greater than a certain threshold. Using the SVT method to judge whether the nodes in the decision tree continue to split works as follows:
[0045] 3.1) Determine the threshold θ, and compare the count query result count() with θ; if count() > θ, the query result is found, otherwise continue.
[0046] The threshold θ is determined as follows: count the leaf nodes of a decision tree constructed without adding noise to obtain the count queries {count(v_1), count(v_2), …, count(v_n)}, and then take the average of these values as the final threshold θ, where v_i denotes a leaf node, i = 1, 2, …, n.
[0047] 3.2) Add Laplace noise to the threshold θ to obtain the threshold noi(θ) after adding Laplace noise;
[0048] 3.3) Add Laplace noise to the query result count(v) of each node to obtain noicount(v), and compare noicount(v) with the noisy threshold noi(θ). If noicount(v) ≥ noi(θ), the node does not meet the privacy requirement and needs to be split; if noicount(v) < noi(θ), the node stops splitting.
[0049] In step 3.3), Laplace noise is added to protect the privacy of the responses to count queries:
[0050] noicount(v) = count(v) + Lap(2/ε_1)
[0051] In the formula, Lap(2/ε_1) is Laplace noise with scale 2/ε_1.
[0052] In the process of using the SVT method to judge whether a node splits, privacy is not protected by iteratively dividing the privacy budget as in previous work; each judgment requires only the privacy budget ε_1, so the budget is not consumed quickly by multiple iterations, which would otherwise introduce a large amount of noise.
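The SVT split check of steps 3.1–3.3 can be sketched as follows. The names are ours; Laplace noise is sampled by a standard inverse-CDF transform, an implementation detail the patent does not specify.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Lap(0, scale) by the inverse-CDF transform."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_threshold(theta, eps1, rng):
    """noi(theta) = theta + Lap(2/eps1); drawn once for the whole SVT run."""
    return theta + laplace_noise(2.0 / eps1, rng)

def should_split(count_v, noi_theta, eps1, rng):
    """A node splits iff noicount(v) = count(v) + Lap(2/eps1) >= noi(theta).
    Each check spends only eps1, independent of the tree height."""
    return count_v + laplace_noise(2.0 / eps1, rng) >= noi_theta
```

The single noisy threshold plus fresh per-node noise is what lets every split decision cost only ε_1 rather than a share of a height-dependent budget.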
[0053] 4) Build a decision tree recursively:
[0054] 4.1) Record the root node as layer l_1;
[0055] 4.2) While l_i < h, traverse all nodes v_j ∈ l_{i+1}, where l_i is the current layer and h is the tree height;
[0056] 4.3) If v_j is a leaf node, then noicount(p(v_j)) = noicount(p(v_j)) + noicount(v_j), where p(v_j) denotes the parent node of v_j; otherwise, S = S ∪ {v_j};
[0057] 4.4) Add 1 to the variable i, and record layer h−1 as the current layer;
[0058] 4.5) While l_i > 1, traverse the nodes v_j in l_i with v_j ∈ S, and update:
[0059] noicount(p(v_j)) = noicount(p(v_j)) + noicount(v_j);
[0060] 4.6) Update the parent node of v_j to complete the construction of the decision tree.
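The bottom-up accumulation of noisy counts in steps 4.1–4.6 can be condensed into a recursive sketch. The layer-by-layer bookkeeping of the patent is replaced here by plain recursion, and the `Node` class is a stand-in we introduce for illustration.

```python
class Node:
    """Minimal stand-in for a decision-tree node carrying a noisy count."""
    def __init__(self, noicount=0.0, children=None):
        self.noicount = noicount
        self.children = children or []

def aggregate_counts(node):
    """Bottom-up pass over the tree: each node adds its already-aggregated
    noisy count into its parent, i.e. noicount(p(v)) += noicount(v), so every
    internal node ends up holding the noisy total of its subtree."""
    for child in node.children:
        aggregate_counts(child)
        node.noicount += child.noicount
```

Aggregating leaf counts upward keeps the internal-node counts consistent with the noisy leaves without spending any additional privacy budget, since only already-noised values are combined.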
[0061] The above-mentioned embodiments are only used to illustrate the present invention; the structure, size, position and shape of each component can be changed. On the basis of the technical solution of the present invention, improvements and equivalent transformations of individual components according to the principles of the present invention shall not be excluded from the protection scope of the present invention.


