					Privacy Preserving Data Mining


      Haiqin Yang

       Extracted from the ppt "Secure
      Multiparty Computation and Privacy",
       with "Privacy Preserving SVM" added




                                            1
Outline

   Motivation
   Privacy
   Secure computation and privacy
   Privacy preserving SVM
   Related to 3D-LBS
   Challenges

                                     2
    Motivation
   Huge databases exist in various applications
       Medical data
       Consumer purchase data
       Census data
       Communication and media-related data
       Data gathered by government agencies
   Can these data be utilized?
       For medical research
       For improving customer service
       For homeland security

                                                   3
Motivation
   Data sharing is necessary for full utilization
       Pooling medical data can improve the quality of
        medical research
       Pooling of information from different government
        agencies can provide a wider picture
           What is the health status of citizens that are
            supported by social welfare?
           Are there citizens that receive simultaneous
            support from different agencies?
   Data gathered by the government (e.g.,
    census data) should be publicly available

                                                             4
Motivation

   The huge amount of data available
    means that it is possible to learn a lot
    of information about individuals from
    public data
       Purchasing patterns
       Family history
       Medical data
       …
                                               5
    Privacy
    Human definition:
        Privacy and autonomy: information that is
         personal, confidential or private should not be
         unnecessarily distributed or publicly known
        Privacy and control: Personal or private
         information should not be misused (whatever that
         means)
    Difficulties in formulating privacy mathematically
        The same information is classified differently by
         different people
        Legitimate use is interpreted differently by different
         people                                               6
Secure computation and privacy
   Secure computation
       Assume that there is a function that all parties
        wish to compute
       Secure computation shows how to compute that
        function in the safest way possible
       In particular, it guarantees minimal information
        leakage (the output only)
   Privacy
       Does the function output itself reveal “sensitive
        information”, or
       Should the parties agree to compute this
        function?
                                                            7
This talk

    Secure multiparty computation
      Traces back to the "two millionaires
       problem" (A. Yao, 1982), or earlier
    Privacy preserving data mining
        Privacy preserving SVM (Vaidya et al.
         KIS 2007)


                                                 8
Secure multiparty computation

   A set of parties with private inputs
   Objective: jointly compute a function of the
    inputs so that certain security properties
    (like privacy and correctness) are preserved
       Applications: secure elections, auctions…
   Properties must be ensured even if
    some of the parties maliciously attack
    the protocol
                                                    9
Secure computation tasks

   Examples:
       Authentication protocols
       Online payments
       Auctions
       Elections
       Privacy preserving data mining
       Essentially any task…

                                         10
The real model

[Diagram: two parties with private inputs x and y interact directly
 through the protocol; each party obtains its protocol output.]

                                     11
The ideal model

[Diagram: the parties send their private inputs x and y to a trusted
 party, which computes f1(x,y) and f2(x,y) and returns f1(x,y) to the
 first party and f2(x,y) to the second.]

                                          12
  The security definition

[Diagram: for every real adversary A attacking the protocol
 interaction (REAL), there exists an ideal adversary S interacting
 only with the trusted party (IDEAL), such that the two executions
 are indistinguishable.]

                                                       13
Why this approach?

   General – it captures all applications
   The specifics of an application are defined
    by its functionality, security is defined as
    above
   The security guarantees achieved are easily
    understood (because the ideal model is
    easily understood)
   We can be confident that we did not “miss”
    any security requirements
                                               14
The ideal model – More details
   The definition we gave suffices in the
    case of an honest majority
   When there is no honest majority
       Guaranteed output delivery cannot be achieved
       Fairness cannot be achieved
   Changes to ideal model:
       Corrupted parties receive output first
       Adversary decides if honest parties receive their
        outputs as well
       This is called security with abort               15
Defects in the ideal model
   When there is no honest majority, fairness and
    guaranteed output delivery cannot be achieved
       This "defect" is incorporated into the ideal model
   The same approach can be used to model partial
    information leakage:
       The parties wish to compute a function f, but
        the protocol leaks more information
       This can be modeled by having the trusted party
        explicitly leak this information
       Helps for efficiency considerations
   Advantage: the defect is explicit!                    16
Privacy preserving data mining
   Setting
       Data is distributed at different sites
       These sites may be third parties (e.g.,
        hospitals, government bodies) or individuals
   Aim
       Compute the data mining algorithm on the
        data so that nothing but the output is learned
       That is, carry out a secure computation


                                                       17
Privacy ≠ Security

   Secure computation only deals with
    the process of computing the function
       It does not ask whether or not the
        function should be computed




                                             18
Privacy and secure computation

   Secure computation can be used to solve any
    distributed data-mining problem
   A two-stage process
       Decide that the function/algorithm should be
        computed – an issue of privacy
       Apply secure computation techniques to
        compute it securely – security
   But, not every privacy problem can be cast as a
    distributed computation

                                                       19
Privacy preserving SVM classification
(Vaidya et al. KIS 2007)




                                        20
SVM introduction

   Data
   Objective

[Figure: SVM illustration — two classes of points in the (x1, x2)
 plane separated by a margin-maximizing boundary.]

         Problem: How to pass the kernel
         matrix among different parties?
                                                             21
     Case 1: On vertically partitioned data
   Bank, health insurance company and auto
    insurance company collect different
    information about the same people
   Kernels
       Linear kernel: K(xi, xj) = xi · xj
       Polynomial kernel: K(xi, xj) = (xi · xj + 1)^p
       RBF kernel: K(xi, xj) = exp(−‖xi − xj‖² / (2σ²))

[Figure: an N × d data matrix split columnwise into blocks D1 and D2,
 one block per party.]

                                                          22
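The key observation behind this case is that, with vertically partitioned data, each party holds a disjoint subset of the features, so a linear-kernel entry decomposes into per-party partial dot products. A minimal sketch (the party names and values are invented for illustration, and the partial sums are added in the clear here, whereas the real protocol combines them securely):

```python
# Sketch: a linear-kernel entry over vertically partitioned data is the
# sum of each party's partial dot product over the features it holds.
# (In the actual protocol these partial sums would be combined via a
# secure-sum step; here they are simply added in the clear.)

# Two records, split between a bank (features 0-1) and an insurer (2-3).
bank    = {"xi": [1.0, 2.0], "xj": [0.5, 1.0]}   # party 1's columns
insurer = {"xi": [3.0, 4.0], "xj": [2.0, 0.0]}   # party 2's columns

def partial_dot(p):
    """One party's local share of the dot product xi . xj."""
    return sum(a * b for a, b in zip(p["xi"], p["xj"]))

# K(xi, xj) = xi . xj = (bank's share) + (insurer's share)
k_full = partial_dot(bank) + partial_dot(insurer)
print(k_full)   # 0.5 + 2.0 + 6.0 + 0.0 = 8.5
```

Since the polynomial and RBF kernels are functions of xi · xj and of ‖xi − xj‖² (which also decomposes additively), the same trick extends beyond the linear kernel.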
Secure merge
   Assumption: K parties, Pi holds vi
   Procedure
       P0 chooses a random number R from a
        uniform distribution over F
       P0 sends (R + v0) mod |F| to P1
       Pi receives (R + v0 + … + v(i−1)) mod |F|
       Pi sends (R + v0 + … + vi) mod |F| to Pi+1
       Finally the masked total returns to P0,
        which subtracts R to recover the sum

                                               23
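The ring protocol above can be sketched in a few lines of Python. This is an illustrative honest-but-curious simulation, not the paper's implementation; the field size `F` and the function name are ours:

```python
# Sketch of the secure-sum ("secure merge") protocol from the slide.
# K parties are arranged in a ring; P0 masks its value with a uniformly
# random R, so no single party sees another party's bare input.
import random

F = 2**32          # size of the group the masked values live in

def secure_sum(values):
    """Each entry of `values` is one party's private input v_i."""
    R = random.randrange(F)            # P0's random mask
    running = (R + values[0]) % F      # P0 sends (R + v0) mod |F| to P1
    for v in values[1:]:               # each Pi adds v_i and forwards
        running = (running + v) % F
    # The masked total returns to P0, which removes the mask R.
    return (running - R) % F

print(secure_sum([10, 20, 30, 40]))    # 100
```

Note that a lone intermediate party learns nothing (every message is uniformly distributed), but two colluding neighbours of Pi can recover vi by differencing — the collusion issue raised later in the talk.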
Case 2: On horizontally partitioned data

   Different banks collect data for their
    customers

[Figure: an N × d data matrix split rowwise into blocks D1 and D2,
 one block per party.]

                                              24
Case 3: On arbitrarily partitioned data




                                          25
How is this done?
   Homomorphic encryption

        E(a) ⊙ E(b) = E(a * b)

    where * is either addition or multiplication (in some
    abelian group) and ⊙ is the corresponding operation
    on ciphertexts
   Existing cryptosystems
       Goldwasser–Micali (Goldwasser S, Micali S 1982)
       Benaloh cryptosystem (Benaloh JC 1986)
       Naccache–Stern cryptosystem (Naccache D, Stern J 1998)
       Paillier cryptosystem (Paillier P 1999)
       Okamoto–Uchiyama cryptosystem (Okamoto T, Uchiyama
        S 1998)
                                                             26
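As a concrete (and deliberately insecure) illustration of the additive homomorphism these cryptosystems provide, here is a toy Paillier-style implementation with tiny primes. The parameters are far too small for real use; only the algebraic property E(a) · E(b) mod n² = E(a + b) is the point:

```python
# Toy Paillier encryption with tiny primes -- illustrative only, not
# secure. Multiplying ciphertexts adds the underlying plaintexts.
import math
import random

p, q = 17, 19
n = p * q                                    # 323
n2 = n * n
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1) = 144
g = n + 1                                    # standard generator choice

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)          # precomputed decryption constant

def enc(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:               # r must be coprime to n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def dec(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 7, 15
# Homomorphic property: E(a) * E(b) mod n^2 decrypts to a + b
print(dec((enc(a) * enc(b)) % n2))           # 22
```

This is exactly what lets the parties in the protocol add kernel shares without revealing them: each party encrypts its share, the ciphertexts are multiplied, and only the holder of the private key can decrypt the sum.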
Algorithm




            27
Other security issues

   Collusion with the third party Q




                                       28
Related to 3D-LBS
   Issues
       Some data may need to be encrypted before
        research use
       Data miners may need to work on
        encrypted data

   Problems
       Lack of knowledge of security and
        cryptography
       Limited understanding of privacy     29
    Future challenges
   Understand the real problem and what the real
    requirement is
       A very non-trivial task and one that requires
        interdisciplinary cooperation
       Computer scientists should help formalize the
        notion; lawyers, policy-makers and social scientists
        should be involved in understanding the concerns
       Some challenges
           Reconciling cultural and legal differences relating to
            privacy in different countries
           Understanding when privacy is “allowed” to be
            breached
                                                                 30
Future challenges

   Appropriate modeling for secure
    computation

   Efficient protocols…




                                      31
Conclusion
   Privacy-preserving data mining is truly
    needed
       Data mining is being used: by security
        agencies, governmental bodies and
        corporations
       Privacy advocates and citizen outcry
        often prevent positive uses of data mining
       Good solutions (which are still currently
        out of reach) may be able to resolve the
        conflict                                 32

				