Docstoc

Privacy Preserving Data Mining

Document Sample
Privacy Preserving Data Mining Powered By Docstoc
					Secure Multiparty Computation and Privacy

Yehuda Lindell Bar-Ilan University

Motivation
 Huge databases exist in society today  Medical data  Consumer purchase data  Census data  Communication and media-related data  Data gathered by government agencies  Can this data be utilized?  For medical research  For improving customer service  For homeland security

Motivation
 Data sharing is necessary for full utilization:  Pooling medical data can improve the quality of medical research  Pooling of information from different government agencies can provide a wider picture




What is the health status of citizens that are supported by social welfare? Are there citizens that receive simultaneous support from different agencies?

 Data gathered by the government (e.g.,

census data) should be publicly available

The Problem
 The huge amount of data available means

that it is possible to learn a lot of information about individuals from public data
   

Purchasing patterns Family history Medical data And much much more

Privacy Breaches
 Latanya Sweeney – testimony before the Privacy and Integrity

Advisory Committee of the Department of Homeland Security:


One problem is that people don’t understand what makes data unique or identifiable. For example, in 1997 I was able to show how medical information that had all explicit identifiers, such as name, address and Social Security number removed could be re-identified using publicly available population registers (e.g., a voter list). In this particular example, I was able to show how the medical record of William Weld, the governor of Massachusetts of the time could be re-identified using only his date of birth, gender and ZIP. In fact, 87% of the population of the United States is uniquely identified by date of birth (e.g., month, day and year), gender, and their 5-digit ZIP codes. The point is that data that may look anonymous is not necessarily anonymous.

Motivation – Homeland Security
 Many different security agencies coexist

 These agencies are hesitant to share

information




Sometimes this is legitimate (if all agencies share all information, a single mole can compromise all agencies) Typically, they only share conclusions

 The problem: more patterns could be found if

if data and not just conclusions are shared

Motivation – Concrete Example
 Investigation at Stillwater State Correctional

Facility, Minnesota
  



Data mining software was applied to phone records from the prison A pattern linking calls between prisoners and a recent parolee was discovered The calling data was then mined again together with records of prisoners’ financial accounts The result: a large drug smuggling ring was discovered

Data Utilization or Privacy
 In many cases, due to public outcry, important

data mining projects have been halted
 

The Canadian big-brother database The total awareness project

 It may be good that these projects were

scrapped in their current form, but it would be better if we could have our cake and eat it too…

Privacy
 The human definition:  Privacy and autonomy: information about us that we feel is personal, confidential or private should not be unnecessarily distributed or publicly known  Privacy and control: Our personal or private information should not be misused (whatever that means)

 How can we mathematically formulate this?  The same information is classified differently by different people  Legitimate use is interpreted differently by different people

Turing’s Concern
 No attempt has yet been made to show that the “computable” numbers

include all numbers which would naturally be regarded as computable. All arguments which can be given are bound to be, fundamentally, appeals to intuition, and for this reason rather unsatisfactory mathematically. The real question at issue is “What are the possible processes which can be carried out in computing a number?”  The arguments which I shall use are of three kinds.  A direct appeal to intuition.  A proof of the equivalence of two definitions (in case the new definition has a greater intuitive appeal).  Giving examples of large classes of numbers which are computable.  Once it is granted that computable numbers are all “computable” several other propositions of the same character follow. In particular, it follows that, if there is a general process for determining whether a formula of the Hilbert function calculus is provable, then the determination can be carried out by a machine.

Secure Computation and Privacy
 Secure computation:  Assume that there is a function that all parties wish to compute  Secure computation shows how to compute that function in the safest way possible  In particular, it guarantees minimal information leakage (the output only)  Privacy:  Does the function output itself reveal “sensitive information”, or  Should the parties agree to compute this function?

This Talk
 We focus on the problem of secure multiparty

computation
  

Definitional paradigms (the focus) Feasibility results Efficiency concerns

 We will also discuss the limitations of the

approach for solving problems of privacy

Secure Multiparty Computation
 A set of parties with private inputs

 Parties wish to jointly compute a function of

their inputs so that certain security properties (like privacy and correctness) are preserved


E.g., secure elections, auctions…

 Properties must be ensured even if some of

the parties maliciously attack the protocol

Secure Computation Tasks
 Examples:
     

Authentication protocols Online payments Auctions Elections Privacy preserving data mining Essentially any task…

Security Requirements
 Consider a secure auction (with secret bids):
 



An adversary may wish to learn the bids of all parties – to prevent this, require privacy An adversary may wish to win with a lower bid than the highest – to prevent this, require correctness But, the adversary may also wish to ensure that it always gives the highest bid – to prevent this, require independence of inputs

Defining Security
 Option 1: analyze security concerns for each specific

problem
 

Auctions: as in previous slide Elections: privacy and correctness only  But then, maybe can condition vote for a party based on whether it will get the minimum number of votes

 Problems:  How do we know that all concerns are covered?  Definitions are application dependent and need to be redefined from scratch for each task

Defining Security – Option 2
 The real/ideal model paradigm for defining security [GMW,GL,Be,MR,Ca]:
 

Ideal model: parties send inputs to a trusted party, who computes the function for them Real model: parties run a real protocol with no trusted help

 A protocol is secure if any attack on a real

protocol can be carried out in the ideal model  Since no attacks can be carried out in the ideal model, security is implied

The Real Model
x y

Protocol output

Protocol output

The Ideal Model
x
y
x y

f1(x,y)

f2(x,y)

f1(x,y)

f2(x,y)

The Security Definition:

For every real adversary A



there exists an adversary S

Protocol interaction Trusted party

REAL

IDEAL

Properties of the Definition
 Privacy:  The ideal-model adversary cannot learn more about the honest party’s input than what is revealed by the function output  Thus, the same is true of the real-model adversary  Correctness:  In the ideal model, the function is always computed correctly  Thus, the same is true in the real-model  Others:  For example, independence of inputs

Why This Approach?
 General – it captures all applications

 The specifics of an application are defined by

its functionality, security is defined as above  The security guarantees achieved are easily understood (because the ideal model is easily understood)  We can be confident that we did not “miss” any security requirements

More Details on the Definition
 The Adversary  Computational power:  Probabilistic polynomial-time versus all-powerful  Adversarial deviation  Semi-honest: follows protocol instructions  Malicious: arbitrary actions  Corruption behaviour  Static: set of corrupted parties fixed at onset  Adaptive: can choose to corrupt parties at any time during computation  Number of corruptions  Honest majority versus unlimited corruptions

More Details on the Definition
 The Network  Scheduling  Synchronous: messages sent in rounds  Asynchronous: messages can be delayed arbitrarily (adversary controls message delivery)  Semi-synchronous: adversary controls scheduling, but only to a limit (can be achieved using local clocks)


Communication channels  Authenticated: adversary can see, but cannot modify messages sent between the parties

The Ideal Model – More Details
 The Trusted Party:  Defined by any probabilistic polynomial-time Turing machine – this machine defines the functionality  Trusted party linked to all participants via perfectly private and authenticated channels  Upon receiving an input from a party, the trusted party runs the machine  If there is an output, it sends it to the designated party  Continue as above  This is more general than secure function evaluation

Examples
 Elections:  Trusted party receives votes. After all have been received (or after a certain time), it computes the election results and sends them out to all  Poker:  The trusted party deals the cards randomly (sending 5 cards to each party), and remembers which cards were dealed  Parties decide which cards to throw and send these cards to the trusted party  The trusted party checks that the cards are correct and re-deals the correct number of new cards  The trusted party publicizes the cards of a party who opens his hand

The Ideal Model – More Details
 The definition we gave suffices in the case of

an honest majority  When there is no honest majority
 

Guaranteed output delivery cannot be achieved Fairness cannot be achieved Corrupted parties receive output first Adversary decides if honest parties receive their outputs as well This is called security with abort

 Changes to ideal model:
  

Adding Defects to the Ideal Model
 In the case of no honest majority, fairness and

guaranteed output delivery cannot be achieved


This “defect” is included into the ideal model

 This approach can be used to also model partial

information leakage:
  

The parties wish to compute a function f, but more information is leaked by the protocol This can be modeled by having the trusted party explicitly leak this information Helps for efficiency considerations

 Advantage: explicit defect!

More on Definitions
 There are numerous ways to define the real

model, regarding both the adversary and network  The main thing is to realistically (and conservatively) model the real world scenario and adversarial threats


An overly conservative approach may be detrimental in that it precludes efficient protocols

Feasibility – A Fundamental Theorem
 Any multiparty functionality can be securely

computed




For any number of corrupted parties: security with abort is achieved, assuming enhanced trapdoor permutations [Yao,GMW] With an honest majority: full security is achieved, assume private channels only [BGW,CCD]

 These results hold in the stand-alone model  When composition is considered, things are much more difficult

Application to Private Data Mining
 The setting:  Data is distributed at different sites  These sites may be third parties (e.g., hospitals, government bodies) or may be the individual him or herself  The aim:  Compute the data mining algorithm on the data so that nothing but the output is learned  That is, carry out a secure computation  Conclusion: private data mining is solved in

principle

Privacy  Security
 As we have mentioned, secure computation

only deals with the process of computing the function


It does not ask whether or not the function should be computed

Privacy and Secure Computation
 Secure computation can be used to solve any

distributed data-mining problem  A two-stage process:
 

Decide that the function/algorithm should be computed – an issue of privacy Apply secure computation techniques to compute it securely – security

 But, not every privacy problem can be cast as

a distributed computation

Census Bureau and Privacy
 Case study – the census bureau:  The census bureau releases its results so that they can be studied  Question:  How does it make sure that the results released do not compromise privacy?  Answer:  The tables are manipulated to hopefully protect the privacy of individuals  Methods are highly involved (statistics), but are not “cryptographic”

Census Bureau and Privacy
 Suggestion:




If someone wants to study the results (ask a statistical query), let them run a secure computation with the census bureau Secure computation is necessary so that parties don’t have to reveal to the census bureau what they are studying

Census Bureau and Privacy
 Naïve objection:


This would be far too expensive and is not realistic in practice
This contradicts government transparency The census bureau can fudge results based on political interests, without possibly being detected (regular census tables can be compared against other sources to check for accuracy)

 Better objection:
 

Census Bureau and Privacy
 An even better objection:
 

Who says that privacy is preserved in this way? If the census bureau doesn’t know the query (because it’s a secure computation), then it is easy to ask queries that overlap and reveal confidential information about an individual

Conclusion – Census Bureau
 Secure multiparty computation doesn’t solve

this problem




Because sometimes in reality, actual data must be revealed, and not just the results of a function evaluation In other words, this problem cannot be cast as a distributed computation


Casting it as a distributed computation introduces other problems and can even harm privacy

Secure Computation and Privacy
 The census bureau case is different – it is not

a distributed computation problem
 What about secure computation for

distributed problems?

Privacy Difficulties
 Crucial point:


In order to analyze the privacy of a solution, it is crucial to understand why we want privacy to start with

 We demonstrate this with two examples:
 

Personalized newspapers Personalized shopping catalogs

Personalized Newspapers
 The aim:


Present a newspaper with a layout that is personalized to a reader’s interest
We don’t want the newspaper to necessarily know what we are interested in
  

 The problem:


Our political opinions may be private Our interest in certain stocks can be confidential Or just because our interests are our own

Personalized Newspapers
 The non-private solution:


Input:




User input: answers to an “interest questionnaire” and possible ratings for articles read Automated input: the newspaper gathers information about which articles were read by the user, for how long and so on (appropriate for online newspapers)



The computation: data mining algorithms are run to determine what is of interest to the user and in which layout to present it

Personalized Newspapers
 The solution – secure computation
 



The reader inputs its personal data The newspaper inputs the articles and the “rules” to define the layout based on the reader’s data The reader receives the personalized newspaper, the newspaper learns nothing Successful data mining here would consider all readers together, which is not practical

 Caveat:


Personalized Newspapers
 Privacy is clearly preserved


The newspaper learns nothing about the reader’s data (interests, preferences, reading history etc.)

 There is no dilemma here even regarding

computing the function


The newspaper learns nothing so the function can clearly be computed without compromising privacy

Personalized Newspaper Danger
 Why do we want privacy regarding our interests

and why is this an important question?  Typical answer to why we want privacy


A personal feeling of discomfort A danger to our autonomy: if the newspaper knows our interests, political leanings etc., it could feed us slanted information Our interests could be used against us

 A more concrete answer




Personalized Newspaper Danger
 The solution based on secure computation

does not solve the problem at all


The newspaper can set rules that say: if the reader has political view X, then present this article…

 The secure-computation solution provides full

privacy of information


But, the danger to autonomy comes from being able to act upon private information, and this was not prevented

Personalized Newspaper Danger
 The reason that we want privacy is crucial
 

The secure computation solution is trivially private, but doesn’t solve the problem at all It can be fixed – but we need to have a good understanding of what the problem is before we can design a solution

Personalized Newspaper Solution
 The rules used for defining the layout etc.

should be either:
 

Defined by the newspaper but be public Defined by the users themselves This is obvious – but only once you understand why you actually want privacy Good, rigorous definitions can be formed, but they take time and care

 Conclusions:
 

Personalized Shopping Catalogs
 The same aim:
 

Present customers with catalogs containing products that they are interested in Give certain customers “deals”
We don’t want the store to know all of our shopping history

 The same problem:


 The same solution: secure computation

Personalized Shopping Catalogs
 A new problem: price discrimination




The store can charge much higher prices to customers who are known to not “shop around” The secure computation solution once again does not solve the problem

Privacy Difficulties
 Even if no information is learned, it may still

be possible to “breach privacy”  Coming up with good definitions of privacy for different tasks is very non-trivial


The advantage of the cryptographic idealmodel definition (regarding its clarity) is somewhat diminished here

Challenges
 Secure computation can be used in many

cases to improve privacy
 

If the function itself preserves sufficient privacy, then this provides a “full solution” If the function does not preserve privacy, but there is no choice but to compute it, using secure computation minimizes damage

 The problem:


Current protocols are typically very inefficient

Efficiency and Secure Computation
 Two types of adversaries


Semi-honest: follow the protocol


Have efficient protocols for specific tasks Most known protocols are highly inefficient



Malicious: arbitrary actions


 Semi-honest adversaries make sense for:
 

Preventing inadvertent information leakage Where the participating parties really trust each other (e.g., hospitals)

Alternative Approaches
 Weaken the security guarantee
 

Consider malicious adversaries as before Define security in an alternative way (not using simulation)

 Example:


Provide a definition of security based on indistinguishability and not simulation

An Example: PIR
 PIR = Private Information Retrieval [CGKS]  Aim: allow a client to efficiently query a database

without the server learning what the query is  Definition:
 

Correctness: when the server is semi-honest, the client always obtains the correct result Privacy: the view of any (even malicious) server when the client’s query is i is indistinguishable from its view when the client’s query is j

 Sometimes, data privacy is also required (i.e., client

should learn its query and nothing else)

Alternative Approaches
 Consider weaker adversarial models
 

Are malicious adversaries always realistic? Maybe we can use game-theoretic notions of utility – adversaries only cheat in order to gain something…

 Of course, weaker adversaries can also be

combined with non-simulation definitions…

Where Are We Now?
 The problem of multi-disciplinarian fields:  Data miners typically don’t have a good understanding of security and cryptography  Cryptographers typically don’t really understand what data mining is all about


E.g., data-mining is an iterative process…



Our understanding of privacy is still very preliminary

Future Challenges
 Understand what privacy means and what we really

want
 



A very non-trivial task and one that requires interdisciplinary cooperation Computer scientists should help to formalize the notion, but lawyers, policy-makers, social scientists should be involved in understanding the concerns Some challenges here:  Reconciling cultural and legal differences relating to privacy in different countries  Understanding when privacy is “allowed” to be breached (should searching data require a warrant, cause and so on)

Future Challenges
 Appropriate modeling for secure computation  Efficient protocols…

A Final Word
 Privacy-preserving data mining is truly

needed





Data mining is being used: by security agencies, governmental bodies and corporations Privacy advocates and citizen outcry often prevents positive use of data mining Good solutions (which are still currently out of reach) may be able to resolve the conflict


				
DOCUMENT INFO
Shared By:
Categories:
Stats:
views:101
posted:11/25/2009
language:English
pages:60
Lingjuan Ma Lingjuan Ma MS
About work for China Compulsory Certification. Some of the documents come from Internet, if you hold the copyright please contact me by huangcaijin@sohu.com