Privacy-Preserving Datamining
How to compile searching software so that it is impossible to reverse-engineer
(Private Keyword Search on Streaming Data)

Rafail Ostrovsky and William Skeith
UCLA
http://www.cs.ucla.edu/~rafail/ (patent pending)
MOTIVATION: Problem 1

[Diagram: mobile code (with state) is sent to Airports 1, 2, and 3, each holding a passenger list.]

- Each hour, we wish to find out if any of hundreds of passenger lists has a name from a "Possible Terrorists" list, and if so, that passenger's itinerary.
- The "Possible Terrorists" list is classified and should not be revealed to the airports.
- Tantalizing question: can the airports help (and do all the search work) if they are not allowed to get the "possible terrorists" list?

PROBLEM 1: Is it possible to design mobile software that can be transmitted to all airports (including potentially revealing this software to the adversary due to leaks) so that this software collects ONLY the information needed, without revealing what it is collecting at each node?

Non-triviality requirement: it must send back only the needed information, not everything!
MOTIVATION: Problem 2

Looking for malicious insiders and/or terrorist communication on public networks:

(I) First, we must identify some "signature" criteria (rules) for suspicious behavior; typically, this is done by analysts.

(II) Second, we must detect which nodes/stations transmit these signatures.

Here, we want to tackle part (II).

PROBLEM 2: Is it possible to design software that can capture all messages (and network locations) that include a secret/classified set of "rules"? Key challenge: the software must not reveal the secret "rules".

Non-triviality requirement: the software must send back only the locations and messages that match the given "rules", not everything it sees.
What we want

[Diagram: various data streams, consisting of flows of documents/packets, feed into search software that has a set of "rules" to choose which documents and/or packets to keep and which to toss; the selected documents and/or packets accumulate in a small storage.]

Punch line: we can send the executable code publicly (it won't reveal its secrets!).

Our "compiler" outputs straight-line executable code (with program state) and a decryption key D:

[Diagram: the data streams flow through STRAIGHT-LINE EXECUTABLE CODE THAT DOES NOT REVEAL THE SEARCH "RULES", which updates a small fixed-size program state (encrypted in a special way that our code modifies for each document processed); decrypting the state with D yields the documents/packets that match the secret "rules".]
Current Practice

- Continuously transfer all data to a secure environment.
- After the data is transferred, filter it in the classified environment and keep only a small fraction of the documents.

[Diagram: streams of documents D(1,1), D(1,2), ..., D(3,3) from several sources all flow into the classified environment, where a Filter passes selected documents to Storage. Filter rules are written by an analyst and are classified!]

The amount of data that must be transferred to the classified environment is enormous!
Current Practice

Drawbacks:
- Communication
- Processing
- Cost and timeliness

How to improve performance?
- Distribute the work to many locations on a network, where you decide "on the fly" which data is useful.
- A seemingly ideal solution, but...
- Major problem: it is not clear how to maintain security, which is the focus of this technology.
[Diagram: each stream D(i,1), D(i,2), ... passes through its own Filter on the open network, which keeps encrypted matches, e.g. E(D(1,2)), E(D(1,3)), E(D(2,2)), in small local Storage; these are later sent to the classified environment and decrypted, recovering D(1,2), D(1,3), D(2,2).]
Example Filters:
- Look for all documents that contain special classified keywords (or strings or data items, and/or do not contain some other data), selected by an analyst.

Privacy:
- Must hide what rules are used to create the filter.
- Output must be encrypted.
More generally:
- We define the notion of Public-Key Program Obfuscation.
- An encrypted version of a program:
  - performs the same functionality as the un-obfuscated program, but:
  - produces encrypted output;
  - is impossible to reverse-engineer.
A little more formally:
Public-Key Program Obfuscation

- We can compile any code into "obfuscated code with small storage".
- Think of the compiler as a mapping:
  source code → "smart public-key encryption" with initial encrypted storage + a decryption key.
- Non-triviality: the sizes of the compiled program, the encrypted storage, and the encrypted output are not much bigger than for the uncompiled code.
- Privacy: nothing about the program is revealed, given the compiled code + storage.
- Yet, someone who has the decryption key can recover the "original" output.
Related Notions
- PIR (Private Information Retrieval) [CGKS], [KO], [CMS], ...
- Keyword PIR [KO], [CGN], [FIPR]
- Cryptographic counters [KMO]
- Program Obfuscation [BGIRSVY], ...
  - There the output is identical to that of the un-obfuscated program; in our case it is encrypted.
- Public-Key Program Obfuscation: a more general notion than PIR, with lots of applications.
What do we want?

[Diagram: a stream D(1,1), D(1,2), D(1,3), ... passes through the Filter, which keeps E(D(1,2)), E(D(1,3)) in a small storage.]

Two requirements:
- Correctness: only matching documents are saved, nothing else.
- Efficiency: decoding time is proportional to the length of the buffer, not the size of the entire stream.

Conundrum: the compiled filter code is not allowed to have ANY branches (i.e., no "if-then-else" executables). Only straight-line code is allowed!
REMARK: Comparison of our work to [Bethencourt, Song, Waters 06]

[OS-05]:
- Buffer size to store m items: O(m log m)
- Efficiency: decoding time is proportional to the buffer size.

[BSW-06]:
- Buffer size to store m items: O(m)
- Efficiency: decoding time is proportional to the length of the entire stream.

NEXT: OUR CONSTRUCTION...
Simplifying Assumptions for this Talk
- All keywords come from some poly-size dictionary.
- Truncate documents beyond a certain length.
Sneak peek: the compiled code

- Suppose we are looking for all documents that contain some secret word from the Webster dictionary.
- Here is how it looks to the adversary: for each document, execute the same code, as follows.

[Diagram: a dictionary of words w1, ..., wn, each stored next to a ciphertext E(*); a document D selects the entries of the words it contains. Look up the encryptions of all words appearing in the document and multiply them together; take this value and apply a fixed formula to it to get a value g, which is folded into a small output buffer of (*,*,*) entries.]
How should a solution look?

[Diagram: a stream interleaving matching and non-matching documents (matching documents #1, #2, #3 among several non-matching ones); only the matching documents should end up in the buffer.]

How do we accomplish this?
Reminder: PKE
- Key-generation(1^k) → (PK, SK)
- E(PK, m, r) → c
- D(c, SK) → m

We will use PKE with additional properties.
Several Solutions based on Homomorphic Public-Key Encryption

For this talk: Paillier encryption.

Properties:
- E(x) is probabilistic; in particular, a single bit can be encrypted in many different ways, such that instances of E(0) and instances of E(1) cannot be distinguished.
- Homomorphic: E(x)*E(y) = E(x+y).
Using Paillier Encryption
- E(x)E(y) = E(x+y)
- Important to note:
  E(0)^c = E(0)*...*E(0) = E(0+0+...+0) = E(0)
  E(1)^c = E(1)*...*E(1) = E(1+1+...+1) = E(c)
- Assume we can somehow compute an encrypted value v, where we don't know what v stands for, but v = E(0) for "un-interesting" documents and v = E(1) for "interesting" documents.
- What's v^c? It is either E(0) or E(c), where we don't know which one it is.
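The two facts above can be checked with a toy Paillier implementation. This is a minimal sketch with tiny, insecure parameters (real deployments use moduli of 2048 bits or more); the prime choices and helper names below are illustrative, not from the talk.

```python
import random
from math import gcd

p, q = 293, 433                    # tiny, INSECURE demo primes
n = p * q
n2 = n * n
g = n + 1                          # standard Paillier generator choice
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lambda = lcm(p-1, q-1)
mu = pow(lam, -1, n)               # decryption factor (valid for g = n + 1)

def E(m):
    """Probabilistic encryption: a fresh random r on every call."""
    while True:
        r = random.randrange(1, n)
        if gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def D(ct):
    """Decryption: L(ct^lam mod n^2) * mu mod n, with L(x) = (x-1)/n."""
    return ((pow(ct, lam, n2) - 1) // n * mu) % n

# Homomorphism: multiplying ciphertexts adds plaintexts.
assert D(E(3) * E(4) % n2) == 7
# Exponentiation: E(1)^c = E(c), while E(0)^c remains an encryption of 0.
assert D(pow(E(1), 5, n2)) == 5
assert D(pow(E(0), 5, n2)) == 0
```

Note that every call to `E` uses fresh randomness, so the adversary cannot tell the table entry `E(0)` from `E(1)` — which is exactly what hides the "rules" in the compiled filter.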
[Diagram: the dictionary w1, ..., wn, where each keyword is stored with E(1) and each non-keyword with E(0); a document D selects the entries of its words.]

g = E(0) * E(1) * ... * E(0)

g = E(0) if there are no matching words
g = E(c) if there are c matching words

g^D = E(0) if there are no matching words
g^D = E(c*D) if there are c matching words

Thus: if we keep g = E(c) and g^D = E(c*D), we can calculate D exactly.

[Diagram: the pair (g, g^D) is written into an output buffer whose slots all hold E(0).]
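The per-document step can be sketched end to end. This sketch assumes a toy Paillier scheme (tiny, insecure parameters), a hypothetical four-word dictionary with the made-up secret keyword "bomb", and a document encoded as a small integer d; `process` is straight-line in the sense that it performs the same multiplications for every document, never branching on the secret.

```python
import random
from math import gcd

p, q = 293, 433                                  # tiny, INSECURE demo primes
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)     # lcm(p-1, q-1)
mu = pow(lam, -1, n)                             # valid for generator n + 1

def E(m):
    """Probabilistic Paillier encryption."""
    while True:
        r = random.randrange(1, n)
        if gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def D(ct):
    """Paillier decryption."""
    return ((pow(ct, lam, n2) - 1) // n * mu) % n

# Compile time (done by the analyst, who knows the rules): each dictionary
# word is stored with E(1) if it is a secret keyword, E(0) otherwise.
# To anyone without the key, all entries look like random ciphertexts.
secret_keywords = {"bomb"}                       # hypothetical classified rule
dictionary = {w: E(1 if w in secret_keywords else 0)
              for w in ["alpha", "bomb", "cat", "dog"]}

def process(document_words, doc_encoding):
    """Straight-line per-document step: identical operations for every document."""
    acc = 1
    for w in document_words:
        acc = acc * dictionary[w] % n2           # g = E(c), c = #matching words
    return acc, pow(acc, doc_encoding, n2)       # g^D = E(c * D)

g, gd = process(["alpha", "bomb", "dog"], doc_encoding=1234)
c, cd = D(g), D(gd)
assert c == 1 and cd == 1234                     # one match: D = (c*D)/c = 1234
```

For a non-matching document, both values decrypt to 0, so nothing useful is stored — yet the code that ran was byte-for-byte the same.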
Collisions cause two problems:
1. Good documents are destroyed.
2. Non-existent documents could be fabricated.

[Diagram: matching documents #1, #2, and #3 already occupy buffer slots; here another matching document lands on an occupied slot.]

We'll make use of two combinatorial lemmas...
Combinatorial Lemma 1
Claim: the "color survival" game succeeds with probability > 1 - neg(g).
How to detect collisions?
- Idea: append a highly structured (yet random) short combinatorial object to the message, with the property that if two or more of them "collide", the combinatorial property is destroyed.
- ⇒ collisions can (almost) always be detected!
  100|001|100|010|010|100|001|010|010
⊕ 010|001|010|001|100|001|100|001|010
⊕ 010|100|100|100|010|001|010|001|010
= 100|100|010|111|100|100|111|010|010

(Blocks such as 111 are no longer one-hot, exposing the collision.)
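The detection idea can be simulated directly. The model below is a simplified stand-in, not the paper's exact encoding: each tag is k one-hot blocks of length 3, colliding tags are added entrywise, and a slot is accepted as a single document only if every block still has exactly one non-zero position. Under this test, a 2-collision slips through only when both documents picked the same position in all k blocks, i.e. with probability (1/3)^k.

```python
import random

K = 9  # number of blocks per tag, as in the slide's 9-block example

def random_tag(k=K):
    """A tag is k blocks, each a one-hot vector of length 3; we store
    only the index of the hot position."""
    return [random.randrange(3) for _ in range(k)]

def add_tags(t1, t2):
    """Entrywise sum of the two one-hot encodings (a 2-collision)."""
    blocks = []
    for a, b in zip(t1, t2):
        block = [0, 0, 0]
        block[a] += 1
        block[b] += 1
        blocks.append(block)
    return blocks

def looks_like_single_document(blocks):
    """Accept only if every block has exactly one non-zero position."""
    return all(sum(1 for x in block if x > 0) == 1 for block in blocks)

trials = 20000
undetected = sum(
    looks_like_single_document(add_tags(random_tag(), random_tag()))
    for _ in range(trials))
# Expected undetected fraction is (1/3)^9 ~ 5e-5, far below the
# 1 - exp(-k/3) detection bound of Lemma 2.
print(undetected / trials)
```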
     Combinatorial Lemma 2
Claim: collisions are detected with
     probability > 1 - exp(-k/3)
We do the same for all documents!

For every document in the stream, do the same: look up the encryptions of all words appearing in the document and multiply them together (= g).

Compute g^D and f(g).

Multiply (g, g^D, f(g)) into randomly chosen locations of the small output buffer.

[Diagram: the dictionary w1, ..., wn with ciphertexts E(*); the triple (g, g^D, f(g)) is multiplied into random (*,*,*) slots of the output buffer.]
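This buffering step can be simulated at the plaintext level: adding plaintexts here mirrors multiplying Paillier ciphertexts in the real scheme, and is why non-matching documents (c = 0) leave the buffer untouched without the code ever branching. The buffer size and the number of copies written per document are illustrative parameters, not the paper's.

```python
import random

M = 20        # buffer slots (illustrative)
GAMMA = 3     # copies written per document (illustrative)
buffer = [[0, 0] for _ in range(M)]   # each slot accumulates (c, c*d)

def write(doc_encoding, match_count):
    """Add the pair (c, c*d) into GAMMA random slots; with c = 0 this
    adds zeros, i.e. it has no effect, yet the same code always runs."""
    for slot in random.sample(range(M), GAMMA):
        buffer[slot][0] += match_count                 # adds c
        buffer[slot][1] += match_count * doc_encoding  # adds c*d

write(doc_encoding=111, match_count=0)   # non-matching: buffer unchanged
write(doc_encoding=222, match_count=1)   # matching document

# Decoding: any slot hit by exactly one matching document yields d = (c*d)/c.
recovered = {cd // c for c, cd in buffer if c > 0}
print(recovered)                         # → {222}
```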
Detecting Overflow > m
- Idea: double the buffer size from m to 2m.
- If m < #documents < 2m, output "overflow".
- If #documents > 2m, then the expected number of collisions is large, so we output "overflow" in this case as well.
Overflow: how to always collect at least m items
(with arbitrary overflow of matching documents)

- Idea: create a logarithmic (in the stream size) number of copies of the original buffer.
  - The first buffer is processed for every stream item.
  - The second buffer takes every item in the stream with probability 1/2.
  - The third buffer takes every item with (independent) probability 1/4.
  - The i-th buffer processes items with independent probability 1/2^i.
- Key point: if the number of matching documents is > M, at least one buffer will get O(M) matching documents!
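The buffer-cascade idea is easy to sanity-check with a plain probability simulation; the stream size, match count, and target m below are illustrative.

```python
import random

stream_size = 1 << 16
matches = 5000                           # number of matching docs, unknown in advance
num_buffers = stream_size.bit_length()   # logarithmic in the stream size

# counts[i] = how many matching documents buffer i sees, where buffer i
# samples each stream item independently with probability 1/2^i.
counts = [0] * num_buffers
for _ in range(matches):
    for i in range(num_buffers):
        if random.random() < 1 / 2 ** i:
            counts[i] += 1

# Whatever the true number of matches, some buffer in the cascade sees a
# load on the order of a fixed target capacity m:
m = 100
assert any(m / 2 <= c <= 4 * m for c in counts)
print(counts)
```

Since consecutive buffers differ by a factor of 2 in sampling rate, the expected loads `matches / 2^i` form a geometric ladder, so one rung always lands within a constant factor of any target capacity.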
More from the paper that we don't have time to discuss...
- Reducing the program size below the dictionary size (using Φ-Hiding from [CMS])
- Queries containing AND (using [BGN] machinery)
- Eliminating negligible error (using perfect hashing)
- A scheme based on an arbitrary homomorphic encryption
- Extending to words not from the dictionary (with small error probability)
Conclusions
- We introduced private searching on streaming data.
- More generally: public-key program obfuscation, a notion more general than PIR or cryptographic counters.
- Practical, efficient protocols.
- Eat your cake and have it too: ensure that only "useful" documents are collected.
- Many possible extensions and lots of open problems.

THANK YOU!

				
Posted: 6/27/2011 · Language: English · Pages: 39