Privacy-Preserving Data Mining
Presenter: Li Cao
October 15, 2003

Slide 1: Presentation organization
- Associating data mining with privacy
- Privacy-preserving data mining using random perturbation
- Privacy-preserving data mining using randomized response techniques
- Comparing the two schemes

Slide 2: Privacy protection history
- Privacy concerns today
- Citizens' attitudes
- Scholars' attitudes

Slide 3: Internet users' attitudes [pie chart]
- Privacy fundamentalists: 17%
- Pragmatic majority: 56%
- Marginal concern: 27%

Slide 4: Privacy value
- Filtering to weed out unwanted information
- Better search results with less effort
- Useful recommendations
- Market trends
- ...
Example: from the analysis of a large number of purchase transaction records containing the customers' age and income, we can learn what kind of customers like a given style or brand.

Slide 5: Motivation (introducing data mining)
- Data mining's goal: discover knowledge, trends, and patterns from large amounts of data.
- Data mining's primary task: develop accurate models about aggregated data without access to precise information in individual data records (not only discovering knowledge but also preserving privacy).

Slide 6: Presentation organization (section divider)
- Next: privacy-preserving data mining using random perturbation

Slide 7: Privacy-preserving data mining using random perturbation
- Basic idea
- Reconstruction procedure
- Decision-tree classification
- Three different algorithms

Slide 8: Attribute list of an example [figure]

Slide 9: Records of an example [figure]

Slide 10: Privacy-preserving methods
- Value-class membership: the values of an attribute are partitioned into a set of disjoint, mutually exclusive classes.
- Value distortion: report x_i + r instead of x_i, where r is a random value drawn from
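The value-distortion method can be sketched in a few lines of Python. This is only an illustration: the age values and the Gaussian noise scale `sigma` are made-up, not values from the talk.

```python
import random

def perturb(values, sigma=20.0):
    # Value distortion: report x_i + r instead of x_i, where r is random
    # noise; here r is drawn from a Gaussian with mean 0.
    # (The uniform variant would use random.uniform(-c, c) instead.)
    return [x + random.gauss(0.0, sigma) for x in values]

random.seed(1)                 # reproducible demo
ages = [23, 35, 47, 52, 61]
disguised = perturb(ages)
print(disguised)               # individual ages are hidden behind random offsets
```

The miner only ever sees `disguised`; recovering useful aggregate information from it is what the reconstruction procedure below is for.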
  a) a uniform distribution, or b) a Gaussian distribution.

Slide 11: Basic idea
- Original data -> perturbed data (users provide a modified value for sensitive attributes)
- Estimate the distribution of the original data from the perturbed data
- Build classifiers (decision trees) using these reconstructed distributions

Slide 12: Basic steps [figure]

Slide 13: Reconstruction problem (1)
- View the n original data values x_1, x_2, ..., x_n of a one-dimensional distribution as realizations of n independent, identically distributed (iid) random variables X_1, X_2, ..., X_n, each with the same distribution as a random variable X.
- To hide these data values, n independent random variables Y_1, Y_2, ..., Y_n are added, each with the same distribution as another random variable Y.

Slide 14: Reconstruction problem (2)
- Given x_1 + y_1, x_2 + y_2, ..., x_n + y_n (where y_i is the realization of Y_i) and the cumulative distribution function F_Y of Y, we would like to estimate the cumulative distribution function F_X of X.
- In short: given F_Y and the realizations of n iid samples X_1 + Y_1, X_2 + Y_2, ..., X_n + Y_n, estimate F_X.

Slide 15: Reconstruction process
- Let w_i be the value taken by X_i + Y_i (w_i = x_i + y_i).
- Use Bayes' rule to estimate the posterior distribution function F'_{X_1} for X_1 (given that X_1 + Y_1 = w_1), assuming we know the density functions f_X and f_Y of X and Y respectively.

Slide 16: To estimate the posterior distribution function F'_X given x_1 + y_1, ..., x_n + y_n, average the posterior distribution functions for each of the X_i:
  F'_X(a) = (1/n) * sum_{i=1..n} F'_{X_i}(a)

Slide 17: The corresponding posterior density function f'_X is obtained by differentiating F'_X:
  f'_X(a) = (1/n) * sum_{i=1..n} f_Y(w_i - a) f_X(a) / Integral f_Y(w_i - z) f_X(z) dz
Given a sufficiently large number of samples, f'_X will be very close to the real density function f_X.

Slide 18: Reconstruction algorithm
1. f_X^0 := uniform distribution
2. j := 0   // iteration number
3. Repeat
     f_X^{j+1}(a) := (1/n) * sum_{i=1..n} f_Y(w_i - a) f_X^j(a) / Integral f_Y(w_i - z) f_X^j(z) dz
     j := j + 1
   until (stopping criterion met)

Slide 19: Stopping criterion
- The observed randomized distribution matches the result of randomizing the current estimate of the original distribution, or
- The difference between successive estimates of the original distribution is very small.
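The iterative reconstruction algorithm above can be sketched as a discretized NumPy loop. Everything in this sketch is an illustrative assumption: the grid, the iteration count, and the test distributions (hidden values near 50, Gaussian noise). Note the noise density need not be normalized, since each per-sample posterior is divided by its own sum.

```python
import numpy as np

def reconstruct(w, noise_pdf, grid, iters=30):
    # Discretized iterative Bayesian reconstruction:
    #   f^{j+1}(a) = (1/n) * sum_i f_Y(w_i - a) f^j(a) / sum_z f_Y(w_i - z) f^j(z)
    fx = np.full(len(grid), 1.0 / len(grid))    # f_X^0 := uniform
    for _ in range(iters):
        nxt = np.zeros_like(fx)
        for wi in w:
            lik = noise_pdf(wi - grid) * fx     # f_Y(w_i - a) * f_X^j(a)
            if lik.sum() > 0:
                nxt += lik / lik.sum()          # Bayes posterior for sample i
        fx = nxt / len(w)                       # average over all samples
    return fx                                   # probability mass on the grid

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=500)                 # hidden original values
y = rng.normal(0, 20, size=500)                 # perturbing noise
w = x + y                                       # what the miner actually sees
gauss20 = lambda t: np.exp(-t ** 2 / (2 * 20 ** 2))   # shape of f_Y (scale cancels)
grid = np.linspace(0, 100, 101)
est = reconstruct(w, gauss20, grid)
print(grid[np.argmax(est)])                     # mass concentrates near the true mean
```

Even though each individual w_i is badly distorted (noise sd 20 vs. signal sd 5), the estimated distribution recovers where the original values are concentrated, which is exactly the aggregate information a classifier needs.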
Slide 20: Reconstruction effect [figure]

Slide 21: Reconstruction effect, continued [figure]

Slide 22: Decision-tree classification
- Two stages: (1) growth, (2) prune
- [Example figure]

Slide 23: Tree-growth phase algorithm
  Partition(Data S)
  begin
    if (most points in S are of the same class) then return;
    for each attribute A do
      evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
  end

Slide 24: Choosing the best split
- Information gain (categorical attributes)
- Gini index (continuous attributes)

Slide 25: Gini index calculation
  gini(S) = 1 - sum_j p_j^2   (p_j is the relative frequency of class j in S)
If a split divides S into two subsets S1 and S2:
  gini_split(S) = (|S1|/|S|) * gini(S1) + (|S2|/|S|) * gini(S2)
Note: calculating this index requires only the distribution of the class values.

Slide 26: [figure]

Slide 27: When and how the original distributions are reconstructed
- Global: reconstruct the distribution for each attribute once, then run decision-tree classification.
- ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately; then run decision-tree classification.
- Local: the same as ByClass, except that instead of doing reconstruction only once, reconstruction is done at each node.

Slide 28: Example (ByClass and Local) [figure]

Slide 29: Comparing the three algorithms

            Execution time    Accuracy
  Global    cheapest          worst
  ByClass   middle            middle
  Local     most expensive    best

Slide 30: Presentation organization (section divider)
- Next: privacy-preserving data mining using randomized response techniques
- Are there any other classification methods available?

Slide 31: Privacy-preserving data mining using randomized response techniques
- Randomized response
- Building the decision tree (key: information-gain calculation)
- Experimental results

Slide 32: Randomized response
- A survey contains a sensitive attribute A. Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other ("has A" vs. "does not have A").
- Each respondent uses a randomizing device to decide which question to answer.
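The Gini formulas on slide 25 are small enough to run directly. A minimal sketch, with made-up class labels:

```python
from collections import Counter

def gini(labels):
    # gini(S) = 1 - sum_j p_j^2, where p_j is the relative frequency of
    # class j in S. Note only the distribution of class values is needed,
    # which is why reconstructed distributions suffice for tree building.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(s1, s2):
    # gini_split(S) = |S1|/|S| * gini(S1) + |S2|/|S| * gini(S2)
    n = len(s1) + len(s2)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

mixed = ["yes", "yes", "yes", "no", "no", "no"]
print(gini(mixed))                           # 0.5: an even two-class mix
print(gini_split(["yes"] * 3, ["no"] * 3))   # 0.0: a perfectly separating split
```

The tree-growth phase evaluates `gini_split` for every candidate split of every attribute and keeps the split with the lowest value.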
- The device is designed in such a way that the probability of choosing the first question is theta.

Slide 33: Estimation
To estimate the percentage of people who have attribute A, use:
  P'(A=yes) = P(A=yes) * theta + P(A=no) * (1 - theta)
  P'(A=no)  = P(A=no)  * theta + P(A=yes) * (1 - theta)
- P'(A=yes): the proportion of "yes" responses obtained from the survey data
- P(A=yes): the estimated true proportion of "yes" answers
- Our goal: P(A=yes) and P(A=no)

Slide 34: Example
- Sensitive attribute: "Married?"
- Two questions: 1) A? Yes / No  2) B? No / Yes

Slide 35: Decision tree (key: information gain)
- m: the number of classes
- Q_j: the relative frequency of class j in S
- v: any possible value of attribute A
- S_v: the subset of S for which attribute A has value v
- |S_v|: the number of elements in S_v
- |S|: the number of elements in S

Slide 36: Estimating from disguised data
- P(E): the proportion of records in the undisguised data set that satisfy E = true
- P*(E): the proportion of records in the disguised data set that satisfy E = true
- Assume the class label is binary, so Entropy(S) can be calculated. Similarly, calculate |S_v|. Finally, we get Gain(S, A).

Slide 37: Experimental results [figures]

Slide 38: Comparing the two schemes

                             Random perturbation           Randomized response
  Attribute type             continuous                    categorical
  Privacy-preserving method  value distortion              randomized response
  Split criterion            Gini index                    information gain
  Inverse procedure          reconstruct the distribution  estimate P(E) from P*(E)

Slide 39: Future work
- Solve categorical problems with the first scheme
- Solve continuous problems with the second scheme
- Combine the two schemes to solve some problems
- Find other suitable classification methods

Slide 40: [end]
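Inverting the two equations on slide 33 gives a closed form for P(A=yes). The sketch below checks it by simulating a survey; the 30% true rate, theta = 0.7, and sample size are made-up values for the demo, not figures from the talk.

```python
import random

def estimate_true_yes(p_observed_yes, theta):
    # Solving  P'(yes) = P(yes)*theta + (1 - P(yes))*(1 - theta)
    # for P(yes) gives (P'(yes) - (1 - theta)) / (2*theta - 1), theta != 0.5.
    if abs(2 * theta - 1) < 1e-12:
        raise ValueError("theta = 0.5 reveals nothing about the true proportion")
    return (p_observed_yes - (1 - theta)) / (2 * theta - 1)

random.seed(42)
theta, true_rate, n = 0.7, 0.30, 20000
responses = []
for _ in range(n):
    married = random.random() < true_rate
    # With probability theta the device selects the direct question "A?";
    # otherwise it selects the opposite question, flipping the answer.
    asked_direct = random.random() < theta
    responses.append(married if asked_direct else not married)

p_obs = sum(responses) / n                   # P'(A=yes), all the surveyor sees
print(estimate_true_yes(p_obs, theta))       # close to the true rate of 0.30
```

No individual answer reveals whether that respondent is married, yet the aggregate proportion is recovered accurately; this same inversion of P*(E) back to P(E) is what makes the information-gain calculation on the disguised data possible.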