Template-Based Privacy Preservation in Classification Problems


     Ke Wang              Benjamin C. M. Fung         Philip S. Yu
Simon Fraser University    Simon Fraser University   IBM T.J. Watson
     BC, Canada                 BC, Canada           Research Center
   wangk@cs.sfu.ca            bfung@cs.sfu.ca        psyu@us.ibm.com


                             IEEE ICDM 2005
                    Outline
• Privacy threats caused by data mining abilities

• Our method: Progressive Disclosure Algorithm

• Experimental evaluation

• Related works

• Conclusions
            Privacy Concern
• Most previous work concerns the input to data
  mining tools, where private information is revealed
  directly by inspection of the data without
  sophisticated analysis.

• Our privacy concern is with the output of data
  mining methods.
  – The aggregate patterns can be used to infer sensitive
    information about individuals.


• Motivating Example: A data owner wants to
  release a table to a data mining firm for
  classification analysis on Rating, but does not
  want the firm to infer the bankruptcy state
  Discharged using the attributes Job and Country.




• This work aims at releasing data with dual goals:
  – Preserve information for wanted classification analysis.
  – Limit usefulness of unwanted sensitive inferences.
• Inference: {Trader,UK} → Discharged
• Confidence = 4/5 = 80%
• An inference is sensitive if its confidence > threshold.
 Eliminate Low Support Inferences?
• In data mining, association or classification rules
  are used to capture general patterns of large
  populations.
   – A low support means the lack of statistical significance.


• In privacy protection, inference rules are used to
  infer sensitive information about individuals.
   – Eliminate sensitive inferences of any support.
   – In fact, a sensitive inference in a small group could
     pose an even greater threat because individuals in a
     small group are more identifiable.
                The Problem
• Consider a table T(M1,…,Mm, Π1,…,Πn, Θ)
• Classification goal: Modeling class attribute Θ
• Privacy goal: Limit sensitive inferences on Πi
  – Specified by one or more templates <IC → π, h>
  – IC is a set of attributes containing some masking
    attributes Mj, e.g., IC = {Job, Country}
  – π is a value from some Πi, e.g., π = Discharged
  – h is a threshold on confidence
  – ic → π is an inference, where ic contains values from IC
  – T satisfies <IC → π, h> if every matching inference
    ic → π has a confidence conf(ic → π) ≤ h.
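
A minimal Python sketch of checking a single template <IC → π, h> against a table. The rows, the attribute name `Status`, and the helper names `confidences` and `satisfies` are hypothetical; the rows are invented only so that {Trader,UK} → Discharged has the 4/5 confidence of the motivating example.

```python
from collections import Counter

# Hypothetical rows (Job, Country, Status); the table on the slide is not
# available here, so these rows are invented to reproduce the 4/5 example.
ATTRS = ("Job", "Country", "Status")
ROWS = (
    [("Trader", "UK", "Discharged")] * 4
    + [("Trader", "UK", "Not")]
    + [("Clerk", "Canada", "Discharged")]
    + [("Clerk", "Canada", "Not")] * 4
)


def confidences(rows, ic, sensitive_attr, pi):
    """Return {ic_values: conf(ic -> pi)} for every value combination on IC."""
    ic_idx = [ATTRS.index(a) for a in ic]
    s_idx = ATTRS.index(sensitive_attr)
    total, hits = Counter(), Counter()
    for row in rows:
        key = tuple(row[i] for i in ic_idx)
        total[key] += 1
        if row[s_idx] == pi:
            hits[key] += 1
    return {key: hits[key] / total[key] for key in total}


def satisfies(rows, ic, sensitive_attr, pi, h):
    """T satisfies <IC -> pi, h> if every matching inference has conf <= h."""
    return all(c <= h for c in confidences(rows, ic, sensitive_attr, pi).values())


IC = ("Job", "Country")
conf = confidences(ROWS, IC, "Status", "Discharged")
print(conf[("Trader", "UK")])                               # 0.8
print(satisfies(ROWS, IC, "Status", "Discharged", h=0.5))   # False
```

With h = 0.5 the table violates the template because one matching inference has confidence 0.8; with h = 0.8 the same table would satisfy it.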
      Flexibility of Templates
• Selectively protecting certain values while not
  protecting other values.
• Specifying a different threshold h for a different
  template IC → π.
• Specifying multiple inference channels ICs (even
  for the same π).
• Specifying templates for multiple sensitive
  attributes.
• These flexibilities minimize unnecessary
  masking, i.e., minimize unnecessary information
  loss.
• Achieve goals by suppressing some values on
  masking attributes M1,…,Mm
• To eliminate {Trader,UK} → Discharged
   – Suppress Trader and Clerk to ⊥Job
   – Suppress UK and Canada to ⊥Country
   – Reduced confidence = 5/10 = 50%
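
The effect of this suppression can be sketched in Python as follows. The rows are hypothetical, invented so that the counts match the slide (4/5 before suppression, 5/10 after); `suppress` and `conf` are illustrative helper names, not part of the original method.

```python
# Hypothetical rows (Job, Country, Status), invented to match the slide's counts.
ROWS = (
    [("Trader", "UK", "Discharged")] * 4
    + [("Trader", "UK", "Not")]
    + [("Clerk", "Canada", "Discharged")]
    + [("Clerk", "Canada", "Not")] * 4
)


def suppress(rows, job_values, country_values):
    """Replace the listed Job and Country values by the special values ⊥Job and ⊥Country."""
    return [
        (
            "⊥Job" if job in job_values else job,
            "⊥Country" if country in country_values else country,
            status,
        )
        for job, country, status in rows
    ]


def conf(rows, job, country, pi="Discharged"):
    """Confidence of {job, country} -> pi: matching rows with pi / matching rows."""
    group = [s for j, c, s in rows if (j, c) == (job, country)]
    return sum(s == pi for s in group) / len(group)


before = conf(ROWS, "Trader", "UK")
masked = suppress(ROWS, {"Trader", "Clerk"}, {"UK", "Canada"})
after = conf(masked, "⊥Job", "⊥Country")
print(before, after)  # 0.8 0.5
```

Suppression merges the Trader/UK group with the Clerk/Canada group, so the sensitive value is diluted across ten records instead of five.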




                 Challenges
• Incorrect suppression may eliminate some
  desired classification structures for modeling Θ.

• Finding an optimal suppression is hard.
  – For a table with a total of q distinct values on masking
    attributes, there are 2^q possible suppressed tables.
  – We present an approximate solution based on a
    search that iteratively improves the solution and
    prunes the search whenever no better solution is
    possible.

              The Algorithm
• Progressive Disclosure Algorithm (PDA) iteratively
  discloses domain values starting from the most
  suppressed T in which each masking attribute Mj in
  ∪IC contains only ⊥j.
• Supj contains all currently suppressed values in Mj.
• In each iteration, disclose one suppressed value w.
• To disclose a value w from Supj, we replace ⊥j with
  w in all suppressed records that currently contain
  ⊥j and originally contain w before suppression.
• This process repeats until no disclosure is possible
  without violating the set of templates.
         Progressive Disclosure
            Algorithm (PDA)
suppress every value of Mj to ⊥j where Mj ∈ ∪IC;
every Supj contains all domain values of Mj ∈ ∪IC;
while there is a candidate in ∪Supj do
    find winner w of highest Score(w) from ∪Supj;
    disclose w on T and remove w from ∪Supj;
    update Score(x) and status for x in ∪Supj;
end while
output the suppressed T and ∪Supj;
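
The loop above can be sketched in Python under simplifying assumptions. This is not the authors' implementation: it revalidates every candidate from scratch instead of updating scores incrementally, it takes the scoring function as a parameter, and the usage example plugs in a stand-in score (the number of records holding the value) rather than the Score defined on the later slides. The rows, the attribute name `Status`, and all helper names are hypothetical.

```python
from collections import Counter


def apply_suppression(rows, sup):
    """Replace every value listed in sup[attr] by the special value ⊥attr."""
    return [
        {a: ("⊥" + a if v in sup.get(a, ()) else v) for a, v in r.items()}
        for r in rows
    ]


def max_conf(rows, ic, s_attr, pi):
    """Highest confidence conf(ic -> pi) over all value combinations on IC."""
    total, hits = Counter(), Counter()
    for r in rows:
        key = tuple(r[a] for a in ic)
        total[key] += 1
        hits[key] += r[s_attr] == pi
    return max(hits[k] / total[k] for k in total)


def satisfies_all(rows, templates):
    return all(max_conf(rows, ic, s, pi) <= h for ic, s, pi, h in templates)


def pda(rows, templates, score):
    """Start from the most suppressed table and disclose one value per iteration."""
    # Every Sup_j initially holds all domain values of a masking attribute in some IC.
    sup = {a: {r[a] for r in rows} for ic, _, _, _ in templates for a in ic}
    while True:
        candidates = []
        for a, values in sup.items():
            for w in values:
                trial = dict(sup)
                trial[a] = values - {w}
                # A value is a candidate only if disclosing it violates no template.
                if satisfies_all(apply_suppression(rows, trial), templates):
                    candidates.append((score(rows, a, w), a, w))
        if not candidates:
            return apply_suppression(rows, sup), sup
        _, a, w = max(candidates)      # winner of highest score
        sup[a] = sup[a] - {w}          # disclose w


def make(job, country, n_discharged, n_other):
    """Build hypothetical rows for illustration."""
    base = {"Job": job, "Country": country}
    return ([dict(base, Status="Discharged")] * n_discharged
            + [dict(base, Status="Not")] * n_other)


ROWS = (make("Trader", "UK", 4, 1) + make("Clerk", "UK", 0, 5)
        + make("Trader", "Canada", 1, 4) + make("Clerk", "Canada", 1, 4))
TEMPLATES = [(("Job", "Country"), "Status", "Discharged", 0.6)]


def count_score(rows, a, w):
    """Stand-in score: how many records hold value w on attribute a."""
    return sum(r[a] == w for r in rows)


table, sup = pda(ROWS, TEMPLATES, count_score)
print(sup)
```

On these hypothetical rows the sketch discloses both Job values and then stops: disclosing either Country value would isolate the Trader/UK group, whose confidence of 0.8 exceeds h = 0.6.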

[Figure omitted. Confidence values shown: Conf = 5/24 = 21%; Conf = 1/4 = 25%; Conf = 1/4 = 25%]
        Search Criteria: Score
• Disclosing a value v gains information and loses
  privacy
• Score(v) measures the information gain per unit
  of privacy loss.




• InfoGain measures the information gain of
  disclosing v.
        Search Criteria: Score
• PrivLoss measures the privacy loss of disclosing
  v, defined as the average increase of Conf(IC → π)
  over all affected IC → π:

      PrivLoss(v) = avg{ Confv(IC → π) − Conf(IC → π) }

  where Conf and Confv represent the confidence
  before and after disclosing v.

• The key to the scalability of our algorithm is
  incrementally updating Score(v) in each iteration
  for candidates v in ∪Supj. (see paper for details)
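
The formula images on these slides did not survive, so the following Python sketch rests on stated assumptions rather than on the authors' exact definitions. It assumes that InfoGain(v) is the entropy reduction on the class attribute when the records currently holding ⊥j are split into those that regain v and those that stay suppressed, and that Score(v) = InfoGain(v) / (PrivLoss(v) + 1), where the +1 only guards against division by zero. The function names and the example numbers are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(regained, still_suppressed):
    """Assumed InfoGain: entropy of the suppressed records minus the weighted
    entropy of the two parts produced by disclosing v."""
    whole = regained + still_suppressed
    n = len(whole)
    parts = (regained, still_suppressed)
    return entropy(whole) - sum(len(p) / n * entropy(p) for p in parts)


def priv_loss(conf_before, conf_after):
    """Average increase of Conf(IC -> pi) over the affected templates."""
    increases = [after - before for before, after in zip(conf_before, conf_after)]
    return sum(increases) / len(increases)


def score(gain, loss):
    """Assumed form: information gain per unit of privacy loss, with +1 as a guard."""
    return gain / (loss + 1)


# Hypothetical example: disclosing v separates two pure class groups and raises
# the confidence of one affected template from 0.25 to 0.5.
gain = info_gain(["good"] * 4, ["bad"] * 4)
loss = priv_loss([0.25], [0.5])
print(gain, loss, score(gain, loss))
```

A candidate with high gain and little confidence increase ranks first, which is the trade-off the Score criterion is meant to capture.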

                Cost Analysis
• At each iteration, the cost can be summarized
  as two operations.

  1. Scan the partitions on Link[w] for disclosing the
     winner w and maintaining some count statistics.

  2. Make use of the count statistics to update the score
     and status of every affected candidate without
     accessing data records. Thus, each iteration
     accesses only the records suppressed to w.

• The number of iterations is bounded by the
  number of distinct values in the masking
  attributes.
        Experimental Evaluation
• Data quality
   – Experiment with a broad range of templates.
   – Use C4.5 classifier.
   – Measure classification error before and after
     suppression.


• Efficiency and Scalability




                       Data sets
• Japanese Credit Screening (CRX)
  – Credit card applications
  – 8 categorical attributes and 2 classes
  – 465 recs. for training and 188 recs. for testing


• Adult
  –   Census data
  –   8 categorical attributes and 2 classes
  –   30162 recs. for training and 15060 recs. for testing
  –   Previously used in Bayardo et al. (2005), Fung et al.
      (2005), Iyengar (2002), and Wang et al. (2004).
                  Results on CRX
• TopN sensitive attributes Π1,…,ΠN; an IC includes the
  remaining masking attributes M1,…,Mm.

• Base Error (BE) for original data = 15.4%
• Suppression Error (SE) for suppressed data
• Removal Error (RE) for removed Π1,…,ΠN
• RE − SE measures benefits of suppression

• Took at most 2 seconds for each experiment
        Results on Adult
• Base Error (BE) for original data = 17.6%
• Took at most 14 seconds for each experiment.




                        Scalability
• Replicate the Adult data set and substitute some random data.

• A time consuming setting:
   – 1 sensitive attribute
   – Remaining 7 as masking
     attributes
   – h=90%
              Related Works
• Iyengar (2002) proposed a genetic algorithm to
  address the problem of k-anonymity for
  classification.

• Bayardo et al. (2005) employed generalization
  and suppression to address a similar problem.

• Our work is concerned with the output of data mining
  methods, where the threats are caused by what
  data mining tools can discover.
                Related Works
• Clifton (2000) suggested eliminating sensitive
  inferences by limiting the data size.
• Verykios et al. (2004) proposed several
  algorithms for hiding association rules in a
  transaction database with minimal modification to
  the data.
  – Hide one rule at a time by decreasing either its support
    or its confidence.
  – Achieved by removing items from transactions.
  – Our work considers the use of the data for classification
    analysis and eliminates all sensitive inferences,
    including those with a low support.
                Related Works
• Cox (1980) proposed the k%-dominance rule
  which suppresses a sensitive cell if the attribute
  values of two or three entities in the cell contribute
  more than k% of the corresponding SUM statistic.

   – Such “cell suppression” suppresses the count or other
     statistics stored in a cell of a statistical table.

   – Very different from the “value suppression” considered
     in our work.


               Conclusions
• Formulate a template-based privacy preservation
  problem.

• Show that suppression is an effective way to
  eliminate sensitive inferences.

• Present an effective algorithm based on a search
  that iteratively improves the solution.

• Evaluate this method on real-life data sets.
                      References
1.   R. Agrawal, T. Imielinski, and A. N. Swami. Mining association
     rules between sets of items in large databases. In Proc. of the 1993
     ACM SIGMOD, pages 207-216, 1993.
2.   R. J. Bayardo and R. Agrawal. Data privacy through optimal k-
     anonymization. In Proc. of the 21st IEEE ICDE, pages 217-228,
     2005.
3.   C. Clifton. Using sample size to limit exposure to data mining.
     Journal of Computer Security, 8(4):281-307, 2000.
4.   C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools
     for privacy preserving data mining. SIGKDD Explorations, 4(2),
     2002.
5.   L. H. Cox. Suppression methodology and statistical disclosure
     control. Journal of the American Statistical Association, Theory
     and Method Section, 75:377-385, 1980.
6.  A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy
    preserving mining of association rules. In Proc. of the 8th ACM
    SIGKDD, pages 217-228, 2002.
7.  C. Farkas and S. Jajodia. The inference problem: A survey.
    SIGKDD Explorations, 4(2):6-11, 2003.
8.  B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization
    for information and privacy preservation. In Proc. of the 21st IEEE
    ICDE, pages 205-216, Tokyo, Japan, 2005.
9.  S. Hettich and S. D. Bay. The UCI KDD Archive, 1999.
    http://kdd.ics.uci.edu.
10. V. S. Iyengar. Transforming data to satisfy privacy constraints. In
    Proc. of the 8th ACM SIGKDD, 2002.
11. M. Kantarcioglu, J. Jin, and C. Clifton. When do data mining
    results violate privacy? In Proc. of the 2004 ACM SIGKDD, pages
    599-604, 2004.
12. J. Kim and W. Winkler. Masking microdata files. In ASA Proc. of
    the Section on Survey Research Methods, 1995.
13. W. Kloesgen. Knowledge discovery in databases and data
    privacy. In IEEE Expert Symposium: Knowledge Discovery in
    Databases, 1995.
14. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan
    Kaufmann, 1993.
15. L. Sweeney. Datafly: A system for providing anonymity in medical
    data. In Proc. of the 11th International Conference on Database
    Security, pages 356-381, 1998.
16. L. Sweeney. Achieving k-anonymity privacy protection using
    generalization and suppression. International Journal on
    Uncertainty, Fuzziness, and Knowledge-based Systems,
    10(5):571-588, 2002.
17. V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E.
    Dasseni. Association rule hiding. IEEE TKDE, 16(4):434-447,
    2004.
18. K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization:
    a data mining solution to privacy protection. In Proc. of the 4th
    IEEE ICDM, 2004.
19. R. W. Yip and K. N. Levitt. The design and implementation of a
    data level database inference detection system. In Proc. of the
    12th International Working Conference on Database Security XII,
    pages 253-266, 1999.




                         FAQ
Q: Inference rules with low support are insignificant
   anyway, so why bother eliminating them?
A: Keeping low-support inferences would relax our current
   privacy requirement, so the suppression error would be
   even lower (better) than under our current model. If the
   user prefers, she may introduce another threshold
   (minimum support), which would further improve the
   classification quality.
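
The relaxation mentioned in this answer could look like the following Python sketch. It is an illustration only: the function name `sensitive_inferences`, the `min_support` parameter, and the records are hypothetical, and support is taken here as the number of records matching both ic and π, which the slides do not define.

```python
from collections import Counter


def sensitive_inferences(records, pi, h, min_support=0):
    """Return the ic value combinations whose inference ic -> pi is sensitive.

    records: iterable of (ic_values, sensitive_value) pairs.
    Support is assumed to be the number of records matching both ic and pi.
    """
    total, hits = Counter(), Counter()
    for ic, value in records:
        total[ic] += 1
        hits[ic] += value == pi
    return sorted(
        ic for ic in total
        if hits[ic] >= min_support and hits[ic] / total[ic] > h
    )


# Hypothetical records, invented for illustration.
RECORDS = (
    [(("Trader", "UK"), "Discharged")] * 4
    + [(("Trader", "UK"), "Not")]
    + [(("Clerk", "Japan"), "Discharged")]
)

strict = sensitive_inferences(RECORDS, "Discharged", h=0.5)
relaxed = sensitive_inferences(RECORDS, "Discharged", h=0.5, min_support=2)
print(strict)   # [('Clerk', 'Japan'), ('Trader', 'UK')]
print(relaxed)  # [('Trader', 'UK')]
```

With the minimum support in place, the single-record inference no longer has to be eliminated, so less suppression is needed; the price is that the individual behind that record is left unprotected.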



