      Versatile Publishing for Privacy Preservation

Xin Jin, Mingyang Zhang, Nan Zhang (George Washington University)
Gautam Das (University of Texas at Arlington)
                   Outline
•   Introduction
•   Inference For Multiple Privacy Rules
•   Guardian Normal Form
•   GD and UAD Algorithms
•   Experimental Results
•   Conclusion
 Privacy Preserving Data Publishing
• QI → SA, i.e., an adversary knowing QI cannot infer
the SA of a tuple (beyond a privacy guarantee).
• A privacy guarantee example: l-diversity
             Quasi-identifier (QI)     Sensitive Attribute (SA)
Name         Age        Gender         Disease
Allen        [30-80]    *              HIV
Bob          [30-80]    *              diabetes
Calvin       [35-55]    F              diabetes
David        [35-55]    F              flu
Eve          [20-40]    M              drug
Grace        [20-40]    M              HIV
                 2-diversity Published Table
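
As a minimal sketch, the 2-diversity of the published table above can be checked by grouping tuples on their published QI values and counting distinct SA values per group. This assumes the simplest reading of l-diversity (at least l distinct sensitive values per group); the function and field names are illustrative and not taken from the talk.

```python
from collections import defaultdict

def distinct_sa_per_group(rows, qi, sa):
    """Map each published QI combination to the set of SA values it carries."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in qi)].add(row[sa])
    return dict(groups)

def is_l_diverse(rows, qi, sa, l):
    """Distinct l-diversity: every QI group has at least l different SA values."""
    return all(len(v) >= l for v in distinct_sa_per_group(rows, qi, sa).values())

# The published table from the slide (names are not part of the QI).
published = [
    {"age": "[30-80]", "gender": "*", "disease": "HIV"},
    {"age": "[30-80]", "gender": "*", "disease": "diabetes"},
    {"age": "[35-55]", "gender": "F", "disease": "diabetes"},
    {"age": "[35-55]", "gender": "F", "disease": "flu"},
    {"age": "[20-40]", "gender": "M", "disease": "drug"},
    {"age": "[20-40]", "gender": "M", "disease": "HIV"},
]

print(is_l_diverse(published, ["age", "gender"], "disease", 2))  # True
print(is_l_diverse(published, ["age", "gender"], "disease", 3))  # False
```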
    A Sneak Peek at Real Application
• The Texas Department of State Health Services
publishes every year a table of all patients discharged
from more than 450 state-licensed hospitals.
www.dshs.state.tx.us/thcic/Hospitals/HospitalsData.shtm


• Defines 9 privacy requirements. Examples:
  If a hospital has fewer than five discharges of a particular gender, then
  suppress the zipcode of its patients of that gender.
  Race is changed to ‘Other’ and ethnicity is suppressed if a hospital has
  fewer than ten discharges of a race.
  The entire zipcode and gender code are suppressed if the ICD code
  indicates alcohol or drug use or an HIV diagnosis.
…
Texas Inpatient Discharge Data

Example: If a hospital has fewer than five discharges of a
particular gender, then suppress the zipcode of its patients of
that gender.

    hospital, gender → zipcode
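
The suppression requirement quoted above could be applied roughly as follows. This is only a sketch: the record field names, the `*` suppression marker, and the function name are assumptions for illustration, not the format of the actual Texas data file.

```python
from collections import Counter

def suppress_zipcode(records, threshold=5):
    """Suppress zipcode for patients whose (hospital, gender) group is too small.

    Returns new records; the input list is left unchanged.
    """
    counts = Counter((r["hospital"], r["gender"]) for r in records)
    result = []
    for r in records:
        out = dict(r)
        if counts[(r["hospital"], r["gender"])] < threshold:
            out["zipcode"] = "*"  # assumed suppression marker
        result.append(out)
    return result

# Hypothetical discharges: hospital A has five female patients and one male.
records = [{"hospital": "A", "gender": "F", "zipcode": "76010"} for _ in range(5)]
records.append({"hospital": "A", "gender": "M", "zipcode": "76019"})

cleaned = suppress_zipcode(records)
print([r["zipcode"] for r in cleaned])
```

Only the single male discharge falls below the threshold, so only that record loses its zipcode.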
              Multiple SA Publishing
• [MKGV06] handles multiple SA attributes: each Si in turn is treated as the
sole SA attribute, and {Q1, Q2, …, Qm, S1, …, Si-1, Si+1, …, Sn} is treated as QI.

   SA: race and state
       age, ICD, state, gender → race
       age, ICD, hospital, race → state

• Lack of flexibility: this enforces a stronger privacy definition
than necessary.
A Novel Problem: Versatile Publishing
• Allows the privacy requirement of publishing a table to
be defined as an arbitrary set of privacy rules.
• Each rule: {Q1, Q2, …, Qp} → {S1, S2, …, Sr}
             (LHS attributes)   (RHS attributes)

• Assures that an adversary learning the LHS attributes
cannot learn the RHS attributes beyond a pre-defined
privacy guarantee such as l-diversity or t-closeness.
        A Running Example
hospital  age  gender  ICD       state  race
A         37   F       HIV       TX     asian
A         71   M       diabetes  MN     white
B         55   F       diabetes  CA     black
B         37   F       flu       VA     white
C         23   M       drug      TX     black
C         37   M       HIV       MN     white

       Rule #1: age, ICD → race
       Rule #2: gender, ICD → state
       Rule #3: hospital, race → state

       Privacy guarantee: 2-diversity
                      Simple Solution #1:
                    Straight Decomposition
age, ICD → race         gender, ICD → state       hospital, race → state

age  ICD       race     gender  ICD       state   hospital  race   state
37   HIV       asian    F       HIV       TX      A         asian  TX
37   HIV       white    M       HIV       MN      B         black  CA
23   drug      white    M       drug      TX      B         white  MN
37   flu       black    F       flu       VA      A         white  VA
55   diabetes  white    F       diabetes  MN      C         white  TX
71   diabetes  black    M       diabetes  CA      C         black  MN

Join of the first two sub-tables: asian is linked with TX or MN.
Third sub-table: asian is linked with TX or CA.

Intersection Attack [GKS08]: asian must be linked with TX, violating hospital, race → state.
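
The intersection attack above can be replayed in a few lines. This sketch assumes each sub-table was anonymized in groups of two consecutive rows, so that an adversary only knows which set of RHS values belongs to which group of LHS rows; that grouping and all names below are assumptions for illustration.

```python
# Each sub-table is a list of groups: (LHS rows, set of RHS values in the group).
T1 = [  # age, ICD -> race
    ([("37", "HIV"), ("37", "HIV")], {"asian", "white"}),
    ([("23", "drug"), ("37", "flu")], {"white", "black"}),
    ([("55", "diabetes"), ("71", "diabetes")], {"white", "black"}),
]
T2 = [  # gender, ICD -> state
    ([("F", "HIV"), ("M", "HIV")], {"TX", "MN"}),
    ([("M", "drug"), ("F", "flu")], {"TX", "VA"}),
    ([("F", "diabetes"), ("M", "diabetes")], {"MN", "CA"}),
]
T3 = [  # hospital, race -> state
    ([("A", "asian"), ("B", "black")], {"TX", "CA"}),
    ([("B", "white"), ("A", "white")], {"MN", "VA"}),
    ([("C", "white"), ("C", "black")], {"TX", "MN"}),
]

def linked_values(table, lhs_matches):
    """Union of RHS values over all groups containing a matching LHS row."""
    out = set()
    for lhs_rows, rhs_values in table:
        if any(lhs_matches(row) for row in lhs_rows):
            out |= rhs_values
    return out

# Path 1: asian -> possible ICD codes (via T1) -> possible states (via T2).
icds = {icd for lhs_rows, races in T1 if "asian" in races for _, icd in lhs_rows}
via_join = linked_values(T2, lambda row: row[1] in icds)

# Path 2: asian -> possible states directly from T3.
via_t3 = linked_values(T3, lambda row: row[1] == "asian")

print(via_join, via_t3, via_join & via_t3)
```

Each path alone leaves two candidate states, but their intersection leaves only TX.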
        Multiple SA Publishing Method
• Defines as SA all attributes that appear on the RHS of at least
  one privacy rule, and QI as the set of all other attributes.
Rule #1: age, ICD → race
Rule #2: gender, ICD → state
Rule #3: hospital, race → state
    Rules #1-#3 give 2 SA: race, state
Rule #4: hospital, age → ICD
Rule #5: gender, race → hospital
    Rules #1-#5 give 4 SA: ICD, state, race, hospital (curse of dimensionality)
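
The SA/QI split described on this slide can be computed directly from the rule set. The sketch below encodes the five rules as (LHS, RHS) pairs; the helper name is illustrative.

```python
ATTRIBUTES = ["hospital", "age", "gender", "ICD", "state", "race"]

# Privacy rules from the slide as (LHS attributes, RHS attributes).
RULES = [
    ({"age", "ICD"}, {"race"}),         # Rule #1
    ({"gender", "ICD"}, {"state"}),     # Rule #2
    ({"hospital", "race"}, {"state"}),  # Rule #3
    ({"hospital", "age"}, {"ICD"}),     # Rule #4
    ({"gender", "race"}, {"hospital"}), # Rule #5
]

def split_sa_qi(rules, attributes):
    """SA = every attribute on some RHS; QI = all remaining attributes."""
    sa = set().union(*(rhs for _, rhs in rules))
    qi = [a for a in attributes if a not in sa]
    return sa, qi

sa3, qi3 = split_sa_qi(RULES[:3], ATTRIBUTES)
sa5, qi5 = split_sa_qi(RULES, ATTRIBUTES)
print(len(sa3), qi3)
print(len(sa5), qi5)
```

With all five rules only age and gender remain as QI, while four attributes must be protected at once, which is the dimensionality problem the slide points to.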
  Traditional Data Normalization
• Step 1: Obtain irreducible functional dependencies (FD).

• Step 2: Test whether any FD violates the normal form over
  the large table.

• Step 3: Decompose the table to remove the violation if there
  is any.
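
For comparison, the three steps can be sketched for BCNF. This is a simplified textbook version that checks only the FDs given (it does not project FDs onto sub-schemas), and the schema and FDs below are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under the functional dependencies fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violation(schema, fds):
    """Return an FD that is non-trivial on schema and whose LHS is not a superkey."""
    for lhs, rhs in fds:
        non_trivial = (rhs & schema) - lhs
        if lhs <= schema and non_trivial and not schema <= closure(lhs, fds):
            return lhs, rhs
    return None

def decompose(schema, fds):
    """Split on a violating FD until no listed FD violates BCNF."""
    violation = bcnf_violation(schema, fds)
    if violation is None:
        return [schema]
    lhs, _ = violation
    left = closure(lhs, fds) & schema   # LHS plus everything it determines
    right = lhs | (schema - left)       # LHS plus the remaining attributes
    return decompose(left, fds) + decompose(right, fds)

# Hypothetical schema and FDs: patient -> hospital, hospital -> state.
schema = {"patient", "hospital", "state"}
fds = [({"patient"}, {"hospital"}), ({"hospital"}, {"state"})]
tables = decompose(schema, fds)
print(tables)
```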
   Inference For Multiple Rules
• Inference on multiple privacy rules.
  Example: AB → C implies that A → C and B → C

• Completeness of Inference Rules
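
The decomposition example (AB → C implies A → C and B → C) can be illustrated for one specific guarantee. The sketch below assumes distinct l-diversity measured by grouping on published values, and reuses the 2-diversity table from the earlier slide: grouping on fewer LHS attributes only merges groups, so the minimum number of distinct SA values per group cannot drop. It is an illustration for that guarantee only, not a proof of the general inference rules.

```python
from collections import defaultdict
from itertools import combinations

def min_distinct(rows, lhs, rhs):
    """Smallest number of distinct rhs values over all groups formed by lhs."""
    groups = defaultdict(set)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].add(row[rhs])
    return min(len(values) for values in groups.values())

published = [
    {"age": "[30-80]", "gender": "*", "disease": "HIV"},
    {"age": "[30-80]", "gender": "*", "disease": "diabetes"},
    {"age": "[35-55]", "gender": "F", "disease": "diabetes"},
    {"age": "[35-55]", "gender": "F", "disease": "flu"},
    {"age": "[20-40]", "gender": "M", "disease": "drug"},
    {"age": "[20-40]", "gender": "M", "disease": "HIV"},
]

full_lhs = ("age", "gender")
full = min_distinct(published, full_lhs, "disease")

# Diversity when the adversary knows only a proper subset of the LHS.
weaker = {
    subset: min_distinct(published, subset, "disease")
    for size in range(1, len(full_lhs))
    for subset in combinations(full_lhs, size)
}
print(full, weaker)
```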
      Guardian Normal Form (GNF)
• Non-triviality: a privacy rule satisfied by each of two
anonymized tables might be broken by the combination of
the two, due to an intersection attack.

• Guardian Normal Form (GNF): a normal form for the
schema of published tables which ensures that all
privacy rules hold over the collection of
published tables.

• GNF is defined at the schema level of published tables
rather than at the tuple level.
                                  An Example
    ICD, gender → hospital
    hospital → state
    age, hospital, gender → race
    no privacy rule enforced

[Figure: graph over the attributes hospital, state, age, race, gender, and ICD]

      Rule #1: age, ICD → race
                                An Example
      Rule #1: age, ICD → race
      race is unreachable from age or ICD
                              An Example
      Rule #2: gender, ICD → state
      state is reachable from either gender or ICD
  Guardian Decomposition Algorithm
• Similar in spirit to the database normalization algorithm [EN03]
  (decomposition into BCNF)
• Find a privacy rule that violates GNF, decompose the existing
  sub-tables to address that rule, and continue until no
  offending privacy rule remains.

[Figure: decomposition walkthrough. Annotations: "Greedily add attributes
if GNF remains"; "End: no further decomposition, publish T11 and T12".]
Utility Aware Decomposition Algorithm
• Leverages the link between utility optimization and the
  MIN-VERTEX-COLORING problem.
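
For readers unfamiliar with MIN-VERTEX-COLORING, the sketch below shows a standard greedy heuristic that colors high-degree vertices first. It is not the UAD algorithm: how UAD maps attributes or rules to vertices is not stated on this slide, and the conflict graph here is hypothetical.

```python
def greedy_coloring(graph):
    """Assign each vertex the smallest color not used by an already colored neighbor."""
    colors = {}
    by_degree = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
    for v in by_degree:
        used = {colors[u] for u in graph[v] if u in colors}
        colors[v] = next(c for c in range(len(graph)) if c not in used)
    return colors

# Hypothetical conflict graph: a triangle a-b-c with a pendant vertex d.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
colors = greedy_coloring(graph)
print(colors, "colors used:", len(set(colors.values())))
```

The greedy heuristic gives a valid coloring but not always a minimum one; here the triangle forces three colors, so the result happens to be optimal.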
Experimental Results
                        Conclusion

• Defined the novel problem of versatile publishing, which captures the
  real-world requirement of multiple privacy rules.

• Derived a sound and complete set of inference axioms for privacy
  rules.

• Defined guardian normal form (GNF).

• Developed two decomposition algorithms, GD and UAD, and
  conducted comprehensive experiments.
                         References

[1] Texas Department of State Health Services. User Manual of Texas
    Hospital Inpatient Discharge Public Use Data File, 2008.
[2] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam.
    l-Diversity: Privacy beyond k-anonymity. In ICDE, 2006.
[3] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition
    attacks and auxiliary information in data privacy. In KDD, 2008.
[4] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems
    (4th Edition). Addison Wesley, 2003.
Thank You

				