Privacy-Preserving Data Mining


Presenter: Li Cao, October 15, 2003

1

Presentation organization

- Associate Data Mining with Privacy
- Privacy Preserving Data Mining scheme using random perturbation
- Privacy Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
2

Privacy protection history; privacy concerns nowadays

- Citizens' attitude
- Scholars' attitude
3

Internet users' attitudes

- Privacy fundamentalists: 17%
- Pragmatic majority: 56%
- Marginal concern: 27%

4

Privacy value

- Filtering to weed out unwanted information
- Better search results with less effort
- Useful recommendations
- Market trends
- ...

Example: from the analysis of a large number of purchase transaction records containing the customers' age and income, we can learn what kinds of customers like a certain style or brand.


5

Motivation (Introducing Data Mining)

- Data mining's goal: discover knowledge, trends, and patterns from large amounts of data.
- Data mining's primary task here: develop accurate models about aggregated data without access to precise information in individual data records (not only discovering knowledge but also preserving privacy).
6

Presentation organization

- Associate Data Mining with Privacy
- Privacy Preserving Data Mining scheme using random perturbation
- Privacy Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
7

Privacy Preserving Data Mining scheme using random perturbation

- Basic idea
- Reconstruction procedure
- Decision-tree classification
- Three different algorithms
8

Attribute list of an example

9

Records of an example

10

Privacy preserving methods

- Value-class membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
- Value distortion: return xi + r instead of xi, where r is a random value drawn from
  a) a uniform distribution, or
  b) a Gaussian distribution
  (see the sketch after this list).
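As a minimal illustration of value distortion (not from the original slides; the function name and the parameters alpha and sigma are invented for this sketch), each value is perturbed with zero-centered uniform or Gaussian noise:

import numpy as np

def distort(values, method="uniform", alpha=0.5, sigma=0.25):
    # Return x_i + r for each x_i, with r drawn from a zero-centered
    # uniform [-alpha, alpha] or Gaussian N(0, sigma^2) distribution.
    values = np.asarray(values, dtype=float)
    if method == "uniform":
        noise = np.random.uniform(-alpha, alpha, size=values.shape)
    elif method == "gaussian":
        noise = np.random.normal(0.0, sigma, size=values.shape)
    else:
        raise ValueError("method must be 'uniform' or 'gaussian'")
    return values + noise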
11

Basic idea

- Original data -> perturbed data (let users provide a modified value for sensitive attributes).
- Estimate the distribution of the original data from the perturbed data.
- Build classifiers using these reconstructed distributions (decision tree).
12

Basic Steps

13

Reconstruction problem

1) View the n original data values x1, x2, ..., xn of a one-dimensional distribution as realizations of n independent, identically distributed (iid) random variables X1, X2, ..., Xn, each with the same distribution as the random variable X.
2) To hide these data values, n independent random variables Y1, Y2, ..., Yn have been used, each with the same distribution as a different random variable Y.
14

3) Given x1+y1, x2+y2, ..., xn+yn (where yi is the realization of Yi) and the cumulative distribution function Fy for Y, we would like to estimate the cumulative distribution function Fx for X.

In short: given a cumulative distribution Fy and the realizations of n iid random samples X1+Y1, X2+Y2, ..., Xn+Yn, estimate Fx.

15

Reconstruction process

- Let wi denote the observed value of Xi + Yi (that is, wi = xi + yi).
- Use Bayes' rule to estimate the posterior distribution function Fx1' for X1 (given that X1 + Y1 = w1), assuming we know the density functions fx and fy for X and Y respectively.
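Written out, the Bayes-rule estimate described above (the slide's equation did not survive extraction) is:

F'_{X_1}(a) = \frac{\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}{\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}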

16



To estimate the posterior distribution function Fx' given x1+y1, x2+y2, ..., xn+yn, we average the posterior distribution functions estimated for each of the Xi:
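Spelling out the averaging step:

F'_X(a) = \frac{1}{n} \sum_{i=1}^{n} F'_{X_i}(a)
        = \frac{1}{n} \sum_{i=1}^{n} \frac{\int_{-\infty}^{a} f_Y(w_i - z)\, f_X(z)\, dz}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}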

17



The corresponding posterior density function fx' is obtained by differentiating Fx':
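Differentiating term by term gives the density estimate (reconstructed here from the surrounding definitions):

f'_X(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}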



Given a sufficiently large number of samples, fx’ will be very close to the real density function fx.
18

Reconstruction algorithm

1. fx^0 := uniform distribution
2. j := 0   // iteration number
3. repeat
       fx^{j+1}(a) := (1/n) * sum_{i=1..n} [ fy(wi - a) fx^j(a) / ∫ fy(wi - z) fx^j(z) dz ]
       j := j + 1
4. until (stopping criterion met)

(The update step applies the density formula above, with the current estimate fx^j in place of fx.)
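A minimal sketch of this iteration over a discretized domain (the function name, the grid representation, and the fixed iteration count standing in for the stopping criterion are assumptions of this sketch, not part of the original scheme):

import numpy as np

def reconstruct_density(w, f_y, grid, n_iter=50):
    # w: observed randomized values w_i = x_i + y_i
    # f_y: vectorized noise density, callable on an array of points
    # grid: equally spaced points approximating the support of X
    step = grid[1] - grid[0]
    f_x = np.full(len(grid), 1.0 / (len(grid) * step))   # step 1: uniform start
    for _ in range(n_iter):                               # simplified stopping rule
        update = np.zeros_like(f_x)
        for wi in w:
            numer = f_y(wi - grid) * f_x                  # f_Y(w_i - a) * f_X^j(a)
            denom = numer.sum() * step                    # approximates the integral
            if denom > 0:
                update += numer / denom
        f_x = update / len(w)
        f_x /= f_x.sum() * step                           # keep it a proper density
    return f_x

# Example usage with Gaussian noise of standard deviation 0.25:
# f_y = lambda t: np.exp(-t**2 / (2 * 0.25**2)) / (0.25 * np.sqrt(2 * np.pi))
# estimate = reconstruct_density(w, f_y, np.linspace(-2, 2, 200))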
19

Stopping Criterion

- The observed randomized distribution is (statistically) close to the result of randomizing the current estimate of the original distribution, or
- the difference between successive estimates of the original distribution is very small.
20



Reconstruction effect

21

22

Decision-Tree Classification

Two phases:
(1) Growth
(2) Prune

Example: [decision-tree figure]

23

Tree-growth phase algorithm

Partition(Data S)
begin
    if (most points in S are of the same class) then return;
    for each attribute A do
        evaluate splits on attribute A;
    use the best split to partition S into S1 and S2;
    Partition(S1);
    Partition(S2);
end
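A runnable sketch of the same recursion, assuming numeric attributes, binary splits, and the Gini index from the following slides (all names and the purity threshold are illustrative, not from the original):

def gini(S):
    # S is a list of (features, label) pairs.
    labels = [y for _, y in S]
    return 1.0 - sum((labels.count(c) / len(S)) ** 2 for c in set(labels))

def partition(S, attributes, purity=0.95):
    labels = [y for _, y in S]
    majority = max(set(labels), key=labels.count)
    if labels.count(majority) / len(S) >= purity:         # most points share a class
        return {"leaf": majority}
    best = None
    for a in attributes:                                   # evaluate splits on A
        for t in sorted({x[a] for x, _ in S}):             # candidate thresholds
            s1 = [(x, y) for x, y in S if x[a] <= t]
            s2 = [(x, y) for x, y in S if x[a] > t]
            if not s1 or not s2:
                continue
            score = (len(s1) * gini(s1) + len(s2) * gini(s2)) / len(S)
            if best is None or score < best[0]:
                best = (score, a, t, s1, s2)
    if best is None:                                       # no useful split found
        return {"leaf": majority}
    _, a, t, s1, s2 = best
    return {"split": (a, t),
            "left": partition(s1, attributes, purity),
            "right": partition(s2, attributes, purity)}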
24

Choose the best split

- Information gain (categorical attributes)
- Gini index (continuous attributes)

25

Gini index calculation

- gini(S), where pj is the relative frequency of class j in S
- If a split divides S into two subsets S1 and S2, the index of the split is the size-weighted sum of the indices of the two subsets (formulas below).
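The formulas referred to here are the standard Gini index definitions (restated because the slide's equations did not survive extraction):

gini(S) = 1 - \sum_{j} p_j^2

gini_{split}(S) = \frac{|S_1|}{|S|}\, gini(S_1) + \frac{|S_2|}{|S|}\, gini(S_2)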

Note: Calculating this index requires only the distribution of the class values.
26

27

When and how the original distributions are reconstructed

- Global: reconstruct the distribution for each attribute once, then run decision-tree classification.
- ByClass: for each attribute, first split the training data by class, then reconstruct the distributions separately; then run decision-tree classification.
- Local: the same as ByClass, except that instead of doing reconstruction only once, reconstruction is done at each node.
28

Example (ByClass and Local)

29

Comparing the three algorithms

Algorithm    Execution time    Accuracy
Global       Cheapest          Worst
ByClass      Middle            Middle
Local        Most expensive    Best

30

Presentation organization

- Associate Data Mining with Privacy
- Privacy Preserving Data Mining scheme using random perturbation
- Privacy Preserving Data Mining using Randomized Response Techniques
- Comparing these two cases
- Are there any other classification methods available?
31

Privacy Preserving Data Mining using Randomized Response Techniques

- Randomized Response
- Building the Decision-Tree
- Key: Information Gain calculation
- Experimental results

32

Randomized Response

- A survey contains a sensitive attribute A.
- Instead of asking whether the respondent has attribute A, ask two related questions whose answers are opposite to each other ("have A" vs. "do not have A").
- The respondent uses a randomizing device to decide which question to answer.
- The device is designed in such a way that the probability of choosing the first question is θ.
33



To estimate the percentage of people who have attribute A, we can use:
P'(A=yes) = P(A=yes) * θ + P(A=no) * (1 - θ)
P'(A=no) = P(A=no) * θ + P(A=yes) * (1 - θ)
P'(A=yes): the proportion of "yes" responses obtained from the survey data.
P(A=yes): the estimated true proportion of people for whom A = yes.
Our goal: P(A=yes) and P(A=no).
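Since P(A=no) = 1 - P(A=yes), the first equation can be inverted directly (a one-line derivation, not shown on the slide):

P'(A=yes) = P(A=yes)\,(2\theta - 1) + (1 - \theta)
\;\Rightarrow\;
P(A=yes) = \frac{P'(A=yes) - (1 - \theta)}{2\theta - 1}, \qquad \theta \neq 0.5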
34



Example:
- Sensitive attribute: Married?
- Two questions:
  1) A? Yes / No
  2) B (the negation of A)? A Yes here means No for A, and vice versa.
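A minimal sketch of this survey mechanism and of the inversion above (the function names and the value of theta are illustrative):

import random

def randomize(true_answers, theta=0.7):
    # Each respondent answers question 1 ("A?") with probability theta
    # and question 2 (the negation of A) with probability 1 - theta.
    responses = []
    for has_a in true_answers:            # has_a is True or False
        if random.random() < theta:
            responses.append(has_a)
        else:
            responses.append(not has_a)
    return responses

def estimate_p_yes(responses, theta):
    # Invert P'(yes) = P(yes)*theta + (1 - P(yes))*(1 - theta); requires theta != 0.5.
    p_prime = sum(responses) / len(responses)
    return (p_prime - (1 - theta)) / (2 * theta - 1)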
35

Decision-Tree (Key: Info Gain)

- m: the number of classes assumed
- Qj: the relative frequency of class j in S
- v: any possible value of attribute A
- Sv: the subset of S for which attribute A has value v
- |Sv|: the number of elements in Sv
- |S|: the number of elements in S
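These symbols plug into the usual information-gain formulas (standard definitions, restated here because the slide's equations were lost):

Entropy(S) = -\sum_{j=1}^{m} Q_j \log_2 Q_j

Gain(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|}\, Entropy(S_v)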
36

- P(E): the proportion of the records in the undisguised data set that satisfy E = true
- P*(E): the proportion of the records in the disguised data set that satisfy E = true

Assume the class label is binary, so Entropy(S) can be calculated from these estimated proportions; |Sv| is obtained similarly, and finally we get Gain(S, A).
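The undisguised proportions are recovered with the same inversion as before (a sketch, assuming the disguised proportion relates to the true one exactly as P'(A=yes) does above):

P(E) = \frac{P^{*}(E) - (1 - \theta)}{2\theta - 1}

These estimated proportions then feed into Entropy(S) and |Sv|.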
37

Experimental results

38

Comparing these two cases

                             Perturbation scheme          Randomized Response scheme
Attribute                    Continuous                   Categorical
Privacy preserving method    Value distortion             Randomized response
Choose attribute to split    Gini index                   Information Gain
Inverse procedure            Reconstruct distribution     Estimate P(E) from P*(E)
39

Future work

- Solve categorical problems with the first scheme
- Solve continuous problems with the second scheme
- Combine these two schemes to solve some problems
- Other suitable classification methods
40


				