Privacy and Data Mining in the Electronic Society -- Overview.ppt

Document Sample
Privacy and Data Mining in the Electronic Society -- Overview.ppt Powered By Docstoc
					Privacy and Data Mining in the Electronic 
          Society -- Overview


                                   Xintao Wu
                      University of North Carolina at Charlotte
                                  August 20, 2012


                           
                       
                          
Privacy Case
• Nydia Velázquez    (1982)

    Three weeks after Nydia Velázquez won the New
  York Democratic Party's nomination to serve in the
  U.S. House of Representatives, somebody at St.
  Claire Hospital in New York faxed Velázquez's
  medical records to the New York Post. The records
  detailed the care that Velázquez had received at
  the hospital after a suicide attempt--an attempt
  that had happened several years before the
  election.




  Database Nation: The Death of Privacy in the 21st Century, Simson 
    Garfinkel, Jan 2000, 1-56592-653-6 

                                    2
Privacy Case
•     AOL's publication of the search histories of more than 650,000
      of its users has yielded more than just one of the year's bigger
      privacy scandals.  (Aug 6, 2006)

      That database does not include names or user identities. Instead, it 
      lists only a unique ID number for each user.  AOL user 710794 

          n    an overweight golfer, owner of a 1986 Porsche 944 and 1998 Cadillac 
               SLS, and a fan of the University of Tennessee Volunteers Men's 
               Basketball team. 
       
          n    interested in the Cherokee County School District in Canton, Ga., and has 
               looked up the Suwanee Sports Academy in Suwanee, Ga., which caters to 
               local youth, and the Youth Basketball of America's Georgia affiliate.
 
          n    regularly searches for "lolitas," a term commonly used to describe 
               photographs and videos of minors who are nude or engaged in sexual 
               acts. 




              AOL's disturbing glimpse into users' lives By Declan McCullagh , 
              CNET News.com, August 7, 2006, 8:05 PM PDT


                                                           3
NetFlix Prize
• An open competition for the best collaborative filtering 
  algorithm to predict user ratings for films.
   n   On Sept 21 2009, the grand prize $1M was given to BellKor’s 
       Pragmatic Chaos team which bested Netflix’s own algorithm for 
       predicting ratings by 10.06%.
   n   Wiki http://en.wikipedia.org/wiki/Netflix_Prize
• NetFlix cancels contest after privacy lawsuit on March 12, 
  2010.
   n   http://www.wired.com/threatlevel/2010/03/netflix-cancels-contest/




                                      4
Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg


                                   5
6
National Laws
• USA 
   n   HIPAA for health care
        u   Passed August 21, 96
        u   lowest bar and the States are welcome to enact more stringent rules
              s   California State Bill 1386
   n   Grann-Leach-Bliley Act of 1999 for financial institutions
   n   COPPA for childern’s online privacy
   n   etc.
• Canada
   n   PIPEDA 2000
        u   Personal Information Protection and Electronic Documents Act
        u   Effective from Jan 2004


• European Union (Directive 94/46/EC)
   n   Passed by European Parliament Oct 95 and Effective from Oct 98. 
   n   Provides guidelines for member state legislation
   n   Forbids sharing data with states that do not protect privacy 


                                               7
Privacy & Breaches of Privacy
• Various definitions of privacy
   n   http://www.privacy.org
   n   http://en.wikipedia.org/wiki/Privacy
   n   Context dependent definitions:  physical privacy, internet privacy, 
       medical privacy, genetics, political privacy, surveillance.
• An individual right
   n   The claim of individuals, groups,  or institutions to determine for 
       themselves when, how, and to what extent information about them 
       is communicated to others. – Alan Westin
• Expansion of government and company databases & 
  growing use of web and mobile devices lead to increase of 
  collection, analysis and disclosure of sensitive information.
   n   Location based services need user’s position and preference



                                        8
Privacy vs. Confidentiality
• Privacy is the right to keep one’s personal information out 
  of the public view
• Confidentiality is the dissemination without public 
  identification
• Disclosure
   n   Identity disclosure = when a specific person’s record can be found 
       in a released file.
   n   Attribute disclosure = when sensitive information about a specific 
       person is revealed through the released file, sometimes with 
       additional knowledge.
   n    Inferential disclosure =  if from the released data one can 
       determine the value of some characteristic of an individual more 
       accurately than otherwise would have been possible



                                       9
Mining vs. Privacy
• Data mining
   n   The goal of data mining is summary results (e.g., classification, 
       cluster, association rules etc.) from the data (distribution)

• Individual Privacy
   n   Individual values in database must not be disclosed, or at least no 
       close estimation can be got by attackers
   n   Contractual limitations:  privacy policies, corporate agreements


• Privacy Preserving Data Mining (PPDM)
   n   How to transform data such that
        u   we can build a good data mining model (data utility)
        u   while preserving privacy at the record level (privacy)?




                                           10
Two Approaches 
• Distributed
   n   Suitable for multi-party 
       platforms
   n   Secure multi-party computation
   n   Tolerated disclosure: 
       computationally private
• Generalization/randomization
  /transformation
   n   Perturb data to protect privacy 
       of individual records.
   n   Preserve intrinsic distributions 
       necessary for modeling.
   n   Tolerated disclosure: 
       statistically  private


                                           11
Data miner vs. attacker




                    12
   Scope 
             ssn    name zip           race    …   age       Sex   Bal      income   …    IntP

      1                     28223      Asian   …   20        M     10k      85k      …    2k
      2                     28223      Asian   …   30        F     15k      70k      …    18k
      3                     28262      Black   …   20        M     50k      120k     …    35k

      4                     28261      White   …   26        M     45k      23k      …    134k
      .                     .          .       …   .         .     .        .        …    .
      N                     28223      Asian   …   20        M     80k      110k     …    15k




69% unique on zip and birth date                                         Generalization/randomization
87% with zip, birth date and gender.
k-anonymity, L-diversity
SDC etc. 




                                                        13
Additive Noise Randomization Example
                                 Bal       income             …       IntP
                  1              10k       85k                …       2k
                  2              15k       70k                …       18k
                  3              50k       120k               …       35k
                  4              45k       23k                …       134k
                  .              .         .                  …       .
                  N              80k       110k               …       15k

     17.334   88.759    2.099                  10   85    2                  7.334   3.759   0.099


     19.199   77.537    25.939                 15   70    18                 4.199   7.537   7.939
                                       =                          +
                                               50   120   35                 9.199   8.447   3.678
     59.199   128.447   38.678

                                               45   23    134                6.208   7.313   1.939
     51.208   30.313    135.939

                                               80   110   15                 9.048   5.692   6.318
     89.048   115.692   21.318




              Y                        =            X             +                  E
                                                     14
Additive Randomization (Z=X+Y) 
• R.Agrawal and R.Srikant SIGMOD 00

     Alice’s 
      age       30 | 70K | ...         50 | 40K | ... 
                                                          ...
Add random 
                Randomizer              Randomizer
 number to 
   Age
                65 | 20K | ...          25 | 60K | ...    ...
      30 
  becomes 
      65 
   (30+35)       Reconstruct             Reconstruct
                 Distribution            Distribution      ...
                    of Age                of Salary


                             Classification
                                                           Model
                               Algorithm


                                              15
Identity Theft
• SSN
                               ### - ## - ####


     Determined by zip code Group no             Sequential no



      https://secure.ssa.gov/apps10/poms.nsf/lnx/0100201030

   Facebook study
      http://www.heinz.cmu.edu/~acquisti/papers/
      privacy-facebook-gross-acquisti.pdf




                                         16
Randomized Response ([ Stanley Warner; JASA 1965])
                                                                                                           Cheated in
                                                                                                           exam
  : Cheated in the exam           : Didn’t cheat in the exam
                                                                           Purpose



Purpose:  Get the proportion(    ) of population 
members that cheated in the exam.
  §  Procedure:                                                                  “Yes” answer
                                                                                                                  Didn’t  cheat

                                Randomization device

                                Do you belong to A? (p) 
                                Do you belong to    ?(1-p)
          …

                      …




                                                                                                    “No” answer



 As:                                                       Unbiased estimate of            is:   




                                                                    17
Linked data 
     ssn   name zip     race    …   age       Sex   Bal   income   …    IntP

 1              28223   Asian   …   20        M     10k   85k      …    2k
 2              28223   Asian   …   30        F     15k   70k      …    18k
 3              28262   Black   …   20        M     50k   120k     …    35k

 4              28261   White   …   26        M     45k   23k      …    134k
 .              .       .       …   .         .     .     .        …    .
 N              28223   Asian   …   20        M     80k   110k     …    15k




                                                                 sensitive links




                                         18
Privacy issues in Social Network

• Social network contains much private relation information;

• Anonymization is not enough for protecting the privacy. 
  Subgraph attacks [Backstrom et al., WWW07, Hay et al., 07].




                                                   sensitive link
     attacker




                                  19
Other issues
• Statistical disclosure limitation methods for 
  tabular/microdata
• Secure multi-party computation protocols and tools
• Privacy issues in various application areas such as e-
  commerce, healthcare, finance, and RFID




                               20
Tutorials on PPDM 
• Privacy in data system, Rakesh Agrawal, PODS03

• Privacy preserving data mining, Chris Clifton, PKDD02, KDD03

• Preserving privacy in database systems, Johann-Chrostoph Freytag, 
  WAIM06

• Models and methods for privacy preserving data publishing and
  analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06

• Cryptographic techniques in privacy preserving data mining, Helger 
  Lipmaa, PKDD06

• Randomization based privacy preserving data mining, Xintao Wu, 
  PKDD06 & WAIM06

                                    21

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:16
posted:2/13/2014
language:English
pages:21