Algorithm-Safe Privacy-Preserving Data Publishing

Xin Jin, George Washington University
Nan Zhang, George Washington University
Gautam Das, University of Texas at Arlington

Outline

• Introduction
• Algorithm-safe Data Publishing Model
• Amendment Toolset: Look-ahead
  Partitioning and Stratified Pick-up
• Experimental Results
• Conclusion
   Privacy-Preserving Data Publishing

• Share individual records to enable analytical tasks (e.g.,
  aggregate query answering, data mining) while
  protecting each individual's private information.
          Quasi-identifier (QI)   Sensitive Attribute (SA)
           Country    Gender              Disease
 Allen      U.K.          M           prostate cancer
  Bob       Spain         M              diabetes
 Calvin    Hungary        M            heart disease
 David     Poland         M              diabetes
  Eve       U.S.          F                 HIV
 Grace     Canada         F                 HIV
  What is Algorithm-based Disclosure?
• Algorithm-based disclosure in existing methods (e.g.,
[WFW+07] [LLV07] [MGK+07]).
• An example using ℓ-diversity:

                           QI                      SA
                Country          Gender         Disease
      Allen    W. Europe            M        prostate cancer
      Bob      W. Europe            M           diabetes

     Calvin      Any                *        heart disease
      David      Any                *           diabetes
      Eve        Any                *             HIV
     Grace       Any                *             HIV
                       2-diversity Table
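
The claim that the table above is 2-diverse can be checked mechanically. The sketch below assumes the "distinct" reading of ℓ-diversity (every QI group holds at least ℓ different SA values); the slides do not say which variant is meant, and the function name `is_distinct_l_diverse` is made up for this illustration.

```python
from collections import defaultdict

# The published table from the slide: (Country, Gender, Disease).
published = [
    ("W. Europe", "M", "prostate cancer"),
    ("W. Europe", "M", "diabetes"),
    ("Any", "*", "heart disease"),
    ("Any", "*", "diabetes"),
    ("Any", "*", "HIV"),
    ("Any", "*", "HIV"),
]


def is_distinct_l_diverse(rows, l):
    """Return True if every QI group contains at least l distinct SA values.

    Assumption: 'distinct' l-diversity; other variants (entropy, recursive)
    would need a different test.
    """
    groups = defaultdict(set)
    for *qi, sa in rows:
        groups[tuple(qi)].add(sa)
    return all(len(sa_values) >= l for sa_values in groups.values())


print(is_distinct_l_diverse(published, 2))
print(is_distinct_l_diverse(published, 3))
```

The first group holds two distinct diseases and the second holds three, so the table passes for ℓ = 2 but not for ℓ = 3.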
What If an Adversary Knows the Algorithm?
                               Country      Gender            Disease
                    Allen     W. Europe         M      prostate cancer
                     Bob      W. Europe         M             diabetes

                    Calvin      Any              *      heart disease
                    David       Any              *            diabetes
                     Eve        Any              *              HIV
                    Grace       Any              *              HIV

                                      Published Table
          Country    Gender    Disease
Allen     U.K.       M         prostate cancer
Bob       Spain      M         diabetes
Calvin    Hungary    M         HIV
David     Poland     M         diabetes
Eve       U.S.       F         HIV
Grace     Canada     F         heart disease

             1st Conjectured Original Data

          Country       Gender    Disease
Allen     W. Europe     M         prostate cancer
Bob       W. Europe     M         diabetes
Calvin    C. Europe     M         HIV
David     C. Europe     M         diabetes
Eve       N. America    F         HIV
Grace     N. America    F         heart disease

             Better Output Table
What If an Adversary Knows the Algorithm?
                             Country      Gender             Disease
                    Allen   W. Europe          M      prostate cancer
                    Bob     W. Europe          M             diabetes

                   Calvin     Any               *      heart disease
                    David     Any               *            diabetes
                    Eve       Any               *              HIV
                   Grace      Any               *              HIV

                                 Published Table
          Country    Gender    Disease
Allen     U.K.       M         prostate cancer
Bob       Spain      M         diabetes
Calvin    Hungary    M         HIV
David     Poland     M         HIV
Eve       U.S.       F         heart disease
Grace     Canada     F         diabetes

             2nd Conjectured Original Data

          Country       Gender    Disease
Allen     Europe        M         prostate cancer
Bob       Europe        M         diabetes
Calvin    Europe        M         HIV
David     Europe        M         HIV
Eve       N. America    F         heart disease
Grace     N. America    F         diabetes

             Better Output Table
What If an Adversary Knows the Algorithm?
                   Country    Gender        Disease
         Allen    W. Europe     M       prostate cancer
          Bob     W. Europe     M           diabetes

         Calvin     Any         *         heart disease
         David      Any         *           diabetes
          Eve       Any         *             HIV
         Grace      Any         *             HIV

                        Published Table
          Country    Gender    Disease
Allen     U.K.       M         prostate cancer
Bob       Spain      M         diabetes
Calvin    Hungary    M         heart disease
David     Poland     M         diabetes
Eve       U.S.       F         HIV
Grace     Canada     F         HIV

             3rd Conjectured Original Data
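
The elimination argument on the three slides above can be replayed in code. This is a minimal sketch under stated assumptions, not the authors' procedure: it assumes the publisher would always prefer either of the two finer groupings shown as "Better Output Table" whenever that grouping is 2-diverse (distinct ℓ-diversity), and it fixes Allen's and Bob's diseases for simplicity, which does not affect the result. All names in the code are illustrative.

```python
from itertools import permutations

L = 2  # required number of distinct SA values per group

# Simplification: Allen/Bob are fixed; only the generalized group is unknown.
fixed = {"Allen": "prostate cancer", "Bob": "diabetes"}
unknown_people = ["Calvin", "David", "Eve", "Grace"]
published_sa = ["heart disease", "diabetes", "HIV", "HIV"]

# Assumption: these are the finer groupings the algorithm would have
# preferred over the published one, had they satisfied 2-diversity.
finer_partitions = [
    [["Allen", "Bob"], ["Calvin", "David"], ["Eve", "Grace"]],
    [["Allen", "Bob", "Calvin", "David"], ["Eve", "Grace"]],
]


def is_l_diverse(partition, table, l=L):
    """Every group of the partition has at least l distinct SA values."""
    return all(len({table[p] for p in group}) >= l for group in partition)


def candidate_tables():
    """All original tables consistent with the published table alone."""
    for perm in set(permutations(published_sa)):
        yield {**fixed, **dict(zip(unknown_people, perm))}


def prob_eve_hiv(tables):
    tables = list(tables)
    return sum(t["Eve"] == "HIV" for t in tables) / len(tables)


# Naive user: every consistent table is equally likely.
naive = prob_eve_hiv(candidate_tables())

# Smart user: discard tables for which the algorithm would have
# published a finer grouping instead.
consistent = [
    t for t in candidate_tables()
    if not any(is_l_diverse(p, t) for p in finer_partitions)
]
smart = prob_eve_hiv(consistent)

print(naive, smart)
```

Under these assumptions the naive estimate that Eve has HIV is 0.5, while the algorithm-aware estimate is 1.0: only conjectures in which Eve and Grace both have HIV survive, which is the 3rd conjecture.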
Algorithm-safe Data Publishing (ASP)
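[Figure: A smart user, who knows the algorithm, the background knowledge and the published table, and a naïve user, who knows only the background knowledge and the published table, are both asked "How likely does Eve have HIV?". Under ASP, the two answers are equal.]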
 Algorithm-safe Data Publishing (ASP)

• Problem Definition: For each tuple (i.e.,
  row) ti = <q, s> in the original data T, the
  following holds:
  Pr{ti[SA] = s' | ti[QI] = q, K} = Pr{ti[SA] = s' | ti[QI] = q, K, A}
  for each s' in the domain of SA, where K is
  the background knowledge and A is the data
  publishing algorithm.
               Necessary Condition #1
                 QI*-Independence

[Figure: Original Data (QI, SA) with its QI-SA correlation; an Oracle that answers the query "Query SA by QI" with the safe QI-SA correlation; and the Data Publisher, which uses the safe QI-SA correlation to produce the ASP Published Table (QI*, SA*).]

QI*-Independence: The generated QI* is conditionally independent of
the original SA, given the combination of QI
and the published SA*.
               Necessary Condition #2
                 SA*-Independence

[Figure: Original Data (QI, SA) with its QI-SA correlation; an Oracle that answers the query "Query SA by QI" with the safe QI-SA correlation; and the Data Publisher, which holds the perturbed safe QI-SA correlation and the impossible QI-SA correlation and produces the ASP Published Table (QI*, SA*).]

SA*-Independence: The generated SA* is conditionally independent of
the original SA, given the combination of QI,
QI* and the impossible QI-SA correlation.
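
Read as standard conditional-independence statements, the two necessary conditions can be written as below. This is a restatement under an assumption: the symbol C_imp for the impossible QI-SA correlation is notation introduced here for illustration and does not appear in the slides.

```latex
% QI*-Independence: QI* carries no extra information about SA
% once QI and the published SA* are known.
\Pr\{QI^{*} \mid QI,\, SA^{*},\, SA\} \;=\; \Pr\{QI^{*} \mid QI,\, SA^{*}\}

% SA*-Independence: SA* carries no extra information about SA
% once QI, QI* and the impossible QI-SA correlation C_imp are known.
\Pr\{SA^{*} \mid QI,\, QI^{*},\, C_{\mathrm{imp}},\, SA\} \;=\; \Pr\{SA^{*} \mid QI,\, QI^{*},\, C_{\mathrm{imp}}\}
```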
     How to Achieve ASP Model?
• Play the Role of Oracle
• Satisfy QI*-Independence
• Never perturb SA




• Worst-case Eligibility Test
• Look-ahead partitioning
A Mondrian Method [LDR06] to Achieve ℓ-diversity (ℓ = 2)

[Figure sequence, 4 slides: eight tuples t1 to t8 plotted on a 10 × 10 grid of two QI attributes x and y, with a legend S1 to S5 for the SA values. The grid is partitioned step by step; the labeled cuts are x = 5 and y = 5.]
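
A Mondrian-style partitioning for ℓ-diversity can be sketched as follows. This is a simplified illustration, not the implementation from [LDR06]: it tries the widest dimension first, cuts at the median, and accepts a cut only if both halves hold at least ℓ distinct SA values. The coordinates in `points` are hypothetical, since the exact positions in the figure are not recoverable.

```python
def is_l_diverse(points, l):
    """At least l distinct SA values among the (x, y, sa) points."""
    return len({sa for _, _, sa in points}) >= l


def mondrian(points, l):
    """Recursively split a non-empty list of (x, y, sa) points.

    Simplified sketch: median cut on the widest dimension, accepted only
    if both halves satisfy distinct l-diversity.
    """
    def spread(d):
        return max(p[d] for p in points) - min(p[d] for p in points)

    for dim in sorted((0, 1), key=lambda d: -spread(d)):
        values = sorted(p[dim] for p in points)
        median = values[len(values) // 2]
        left = [p for p in points if p[dim] < median]
        right = [p for p in points if p[dim] >= median]
        if left and right and is_l_diverse(left, l) and is_l_diverse(right, l):
            return mondrian(left, l) + mondrian(right, l)
    return [points]  # no allowable cut: publish this group as is


# Hypothetical data: eight tuples with SA values S1..S5.
points = [
    (1, 5, "S1"), (2, 7, "S2"), (3, 3, "S3"), (4, 9, "S1"),
    (6, 6, "S2"), (8, 8, "S3"), (8, 1, "S4"), (9, 10, "S5"),
]
groups = mondrian(points, 2)
for g in groups:
    print(g)
```

Note that the decision to cut depends on which tuple actually holds which SA value. That dependence is what an adversary who knows the algorithm can exploit.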
     Look-Ahead Partitioning

[Figure sequence, 3 slides: the same eight tuples t1 to t8 and SA legend S1 to S5 on the 10 × 10 grid, partitioned step by step; the only labeled cut is x = 5.]
                Amendment Toolset
• Look-Ahead Partitioning: Execute a partitioning step if the
  worst (i.e., most skewed) scenario of QI-SA correlation
  would still achieve the given privacy guarantee (e.g.,
  ℓ-diversity).
  - Can be extended to other algorithms such as Hilb [GKKM07],
    Incognito [LDR05], MASK [WFW+07], etc.
  - Limitation: May harm utility because it produces large groups.



• Stratified Pick-up: Take as input the anonymous groups
  and attempt to further partition each of these groups
  iteratively based solely on the distinctness of SA values.
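
The worst-case eligibility test behind look-ahead partitioning (first bullet above) might look like the sketch below. This is one possible reading, stated as an assumption: for distinct ℓ-diversity, the most skewed scenario packs the most frequent SA values onto one side of the cut, so a cut is accepted only if even that assignment leaves at least ℓ distinct values on each side. The authors' actual test may differ, and the function names are hypothetical.

```python
from collections import Counter


def min_distinct(sa_values, size):
    """Fewest distinct SA values that any `size` tuples from the group can hold.

    Greedy: fill the side with the most frequent values first.
    """
    covered, distinct = 0, 0
    for _, count in Counter(sa_values).most_common():
        if covered >= size:
            break
        covered += count
        distinct += 1
    return distinct


def split_is_eligible(sa_values, left_size, l):
    """Accept a cut only if the worst-case SA assignment is still l-diverse.

    Uses only the group's SA multiset and the side sizes (which depend on
    QI alone), never which tuple holds which SA value.
    """
    right_size = len(sa_values) - left_size
    if left_size <= 0 or right_size <= 0:
        return False
    return (min_distinct(sa_values, left_size) >= l
            and min_distinct(sa_values, right_size) >= l)


print(split_is_eligible(["heart disease", "diabetes", "HIV", "HIV"], 2, 2))
print(split_is_eligible(["S1", "S2", "S3", "S4"], 2, 2))
```

In the first call both HIV tuples could land on the same side, so the cut is rejected whatever the real assignment is. Because the outcome no longer depends on the true QI-SA correlation, observing it tells the adversary nothing new, at the price of larger groups.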
                Stratified Pick-Up

[Figure: the same eight tuples t1 to t8 and SA legend S1 to S5 on the 10 × 10 grid.]
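
One way to read stratified pick-up is sketched below, as an assumption rather than the authors' exact procedure: bucket the tuples of an anonymous group by SA value, then repeatedly pick one tuple from each of the ℓ largest buckets to form a smaller subgroup, and attach any leftovers to the last subgroup. The split looks only at SA distinctness, never at QI values. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def stratified_pickup(group, l):
    """Split a group of (tuple_id, sa) pairs into subgroups with l distinct SA values.

    Hypothetical sketch: greedy pick-up from the l largest SA buckets.
    """
    buckets = defaultdict(list)
    for tid, sa in group:
        buckets[sa].append((tid, sa))

    subgroups = []
    while True:
        nonempty = sorted((b for b in buckets.values() if b), key=len, reverse=True)
        if len(nonempty) < l:
            break
        # One tuple from each of the l largest buckets: l distinct SA values.
        subgroups.append([b.pop() for b in nonempty[:l]])

    leftovers = [t for b in buckets.values() for t in b]
    if not subgroups:
        return [leftovers]  # group cannot be split further
    # Adding tuples never lowers the number of distinct SA values.
    subgroups[-1].extend(leftovers)
    return subgroups


group = [("t1", "S1"), ("t2", "S2"), ("t3", "S1"),
         ("t4", "S3"), ("t5", "S2"), ("t6", "S1")]
for sub in stratified_pickup(group, 2):
    print(sub)
```

This partly offsets the utility loss from the large groups that look-ahead partitioning produces, since each subgroup can then be generalized over a smaller QI region.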
               Experiment Setup

• Adult Dataset (http://archive.ics.uci.edu/ml/)
• 45,222 tuples
• SA: Education.


• Census Dataset (http://ipums.org)
• 300K tuples
• SA: Occupation
Effect of Amendment Toolset
Time Performance
                    Conclusion
• We show that algorithm-based disclosure is much more
  significant than previously studied.

• We rigorously define the Algorithm-Safe data Publishing (ASP)
  model.

• We propose a screening tool for algorithm-based
  disclosure based on two necessary conditions.

• We explore amendments to problematic methods (those
  “diagnosed” with algorithm-based disclosure).
                        References
[WFW+07] Wong, R. C. and Fu, A. W. and Wang, K. and Pei, J.
   Minimality Attack in Privacy-Preserving Data Publishing.
[LLV07] Li, N. and Li, T. and Venkatasubramanian, S. t-Closeness:
   Privacy Beyond k-anonymity and ℓ-diversity.
[MGK+07] Machanavajjhala, A. and Gehrke, J. and Kifer, D. and
   Venkitasubramaniam, M. ℓ-diversity: Privacy Beyond k-anonymity.
[ZJB07] Zhang, L. and Jajodia, S. and Brodsky, A. Information
   Disclosure under Realistic Assumptions: Privacy versus Optimality.
[GKKM07] Ghinita, G. and Karras, P. and Kalnis, P. and Mamoulis, N.
   Fast Data Anonymization with Low Information Loss.
[LDR06] LeFevre, K. and DeWitt, D. J. and Ramakrishnan, R. Mondrian
   Multidimensional k-anonymity.
[LDR05] LeFevre, K. and DeWitt, D. J. and Ramakrishnan, R. Incognito:
   Efficient Full-domain k-anonymity.
[XT06] Xiao, X. and Tao, Y. Anatomy: Simple and Effective Privacy
   Preservation.
Thank You

				