How to compile searching software so that it is impossible to reverse-engineer
(Private Keyword Search on Streaming Data)
Rafail Ostrovsky, William Skeith
UCLA
http://www.cs.ucla.edu/~rafail/
(patent pending)

MOTIVATION: Problem 1
Each hour, we wish to find out whether any of hundreds of passenger lists contains a name from the “Possible Terrorists” list and, if so, that person's itinerary.
The “Possible Terrorists” list is classified and should not be revealed to airports.
Tantalizing question: can the airports help (and do all the search work) if they are not allowed to get the “Possible Terrorists” list?
[Figure: Airport 1, Airport 2 and Airport 3, each with its passenger list; mobile code (with state).]
PROBLEM 1: Is it possible to design mobile software that can be transmitted to all airports (including potentially revealing this software to the adversary due to leaks) so that this software collects ONLY the information needed, without revealing what it is collecting at each node?
Non-triviality requirement: must send back only needed information, not everything!

MOTIVATION: Problem 2
Looking for communication by malicious insiders and/or terrorists:
(I) First, we must identify some “signature” criteria (rules) for suspicious behavior. Typically, this is done by analysts.
(II) Second, we must detect which network nodes/stations transmit these signatures.
Here, we want to tackle part (II).
[Figure: public networks.]
PROBLEM 2: Is it possible to design software that can capture all messages (and network locations) that match a secret/classified set of “rules”?
Key challenge: the software must not reveal the secret “rules”.
Non-triviality requirement: the software must send back only locations and messages that match the given “rules”, not everything it sees.

What we want
[Figure: various data streams, consisting of flows of documents/packets, enter the search software, which has a set of “rules” to choose which documents and/or packets to keep and which to toss. The search software feeds a small storage that collects the selected documents and/or packets, i.e. those that match the secret “rules”.]
Punch line: we can send the executable code publicly (it won't reveal its secrets!).
Our “compiler” outputs straight-line executable code (with program state) and a decryption key “D”.
[Figure: various data streams, consisting of flows of documents/packets, enter STRAIGHT-LINE EXECUTABLE CODE THAT DOES NOT REVEAL SEARCH “RULES”. The code maintains a small fixed-size program state (encrypted in a special way that our code modifies for each document processed), which is later decrypted using D.]

Current Practice
Continuously transfer all data to a secure environment.
After the data is transferred, filter in the classified environment and keep only a small fraction of the documents.
[Figure: current practice. All documents D(1,1), D(1,2), D(1,3), D(2,1), D(2,2), D(2,3), D(3,1), D(3,2), D(3,3) are moved into the classified environment, where they pass through Filter into Storage.]
Filter rules are written by an analyst and are classified!
The amount of data that must be transferred to a classified environment is enormous!

Current Practice Drawbacks
Communication
Processing
Cost and timeliness

How to improve performance?
Distribute work to many locations on a network, where you decide “on the fly” which data is useful.
Seemingly ideal solution, but…
Major problem: it is not clear how to maintain security, which is the focus of this technology.
[Figure: on the open network, each stream … D(i,3) D(i,2) D(i,1) passes through its own Filter into local Storage, which holds encrypted matches such as E(D(1,2)), E(D(1,3)) and E(D(2,2)). The classified environment decrypts them and stores D(1,2), D(1,3), D(2,2).]

Example Filters
Look for all documents that contain special classified keywords (or a string or data item, and/or do not contain some other data), selected by an analyst.
Privacy
Must hide what rules are used to create the filter.
Output must be encrypted.

More generally: we define the notion of Public Key Program Obfuscation
Encrypted version of a program.
Performs the same functionality as the un-obfuscated program, but:
  Produces encrypted output
  Impossible to reverse engineer

A little more formally: Public Key Program Obfuscation
Can compile any code into “obfuscated code with small storage”.
Think of the Compiler as a mapping:
  Source code -> “Smart Public-Key Encryption” with initial Encrypted Storage + Decryption Key.
Non-triviality: the sizes of the compiled program, the encrypted storage and the encrypted output are not much bigger than those of the uncompiled code.
Nothing about the program is revealed, given compiled code + storage.
Yet someone who has the decryption key can recover the “original” output.

Privacy Related Notions
PIR (Private Information Retrieval) [CGKS], [KO], [CMS]…
Keyword PIR [KO], [CGN], [FIPR]
Cryptographic counters [KMO]
Program Obfuscation [BGIRSVY]… Here the output is identical to that of the un-obfuscated program, but in our case it is encrypted.
Public Key Program Obfuscation: a more general notion than PIR, with lots of applications.

What do we want?
[Figure: the stream … D(1,3) D(1,2) D(1,1) passes through Filter into Storage, which holds E(D(1,2)) and E(D(1,3)).]
2 requirements:
  correctness: only matching documents are saved, nothing else.
  efficiency: the decoding time is proportional to the length of the buffer, not the size of the entire stream.
Conundrum: Compiled Filter Code is not allowed to have ANY branches (i.e. any “if then else” executables). Only straight-line code is allowed!

REMARK: Comparison of our work to [Bethencourt, Song, Waters 06]
                                 [OS-05]            [BSW-06]
Buffer size to store m items:    O(m log m)         O(m)
Decoding time proportional to:   the buffer size    the length of the entire stream
NEXT: OUR CONSTRUCTION…

Simplifying Assumptions for this Talk
All keywords come from some poly-size dictionary.
Truncate documents beyond a certain length.

Sneak peek: the compiled code
Suppose we are looking for all documents that contain some secret word from the Webster dictionary. Here is how it looks to the adversary:
For each document, execute the same code as follows: look up the encryptions of all words appearing in the document and multiply them together. Take this value and apply a fixed formula to it to get the value g.
[Figure: a Dictionary table mapping each word w1, w2, w3, w4, w5, …, wn-2, wn-1, wn to a ciphertext E(*); a document D; the value g; and a Small Output Buffer whose slots each hold a triple (*,*,*).]

How should a solution look?
[Figure: a stream that interleaves non-matching documents with matching documents #1, #2 and #3.]

How do we accomplish this?
Reminder: PKE
  Key-generation(1^k) -> (PK, SK)
  E(PK, m, r) -> c
  D(c, SK) -> m
We will use PKE with additional properties.

Several Solutions based on Homomorphic Public-Key Encryptions
For this talk: Paillier Encryption
Properties:
  E(x) is probabilistic; in particular, a single bit can be encrypted in many different ways, such that no instance of E(0) can be distinguished from any instance of E(1).
  Homomorphic: i.e., E(x)*E(y) = E(x+y)

Using Paillier Encryption
E(x)*E(y) = E(x+y)
Important to note:
  E(0)^c = E(0)*…*E(0) = E(0+0+…+0) = E(0)
  E(1)^c = E(1)*…*E(1) = E(1+1+…+1) = E(c)
Assume we can somehow compute an encrypted value v, where we don't know what v stands for, but v = E(0) for “un-interesting” documents and v = E(1) for “interesting” documents.
What's v^c? It is either E(0) or E(c), where we don't know which one it is.

[Figure: the Dictionary table now shows its hidden contents: w1 -> E(0), w2 -> E(1), w3 -> E(0), w4 -> E(0), w5 -> E(1), … For a document D, the looked-up ciphertexts are multiplied: g = E(0) * E(1) * E(0).]
g = E(0) if there are no matching words
g = E(c) if there are c matching words
g^D = E(0) if there are no matching words
g^D = E(c*D) if there are c matching words
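The homomorphic identities and the (g, g^D) pair above can be checked with a toy Paillier implementation. This is only a sketch under stated assumptions: the primes `P` and `Q`, the four-word dictionary, the keyword "falcon", the document number 4242 and the helper names `compile_table` and `filter_document` are all hypothetical, and the key size is far too small for real use.

```python
import math
import random

# Toy key material (hypothetical primes, far too small for real security).
P, Q = 1009, 1013
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # lcm(P-1, Q-1)
MU = pow(LAM, -1, N)  # inverse of LAM mod N (needs Python 3.8+)


def encrypt(m):
    """E(m) = (1 + N)^m * r^N mod N^2, with fresh randomness r on each call."""
    r = random.randrange(2, N)
    while math.gcd(r, N) != 1:
        r = random.randrange(2, N)
    return pow(1 + N, m, N2) * pow(r, N, N2) % N2


def decrypt(c):
    """Invert encrypt() using the secret values LAM and MU."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def compile_table(dictionary, secret_keywords):
    """Compiler side: E(1) for secret keywords, E(0) for every other word."""
    return {w: encrypt(1 if w in secret_keywords else 0) for w in dictionary}


def filter_document(table, words, doc_number):
    """Public side: multiply looked-up ciphertexts; no branch on any secret."""
    g = 1  # 1 is a trivial encryption of 0
    for w in words:
        g = g * table[w] % N2
    return g, pow(g, doc_number, N2)  # the pair (g, g^D)


# Hypothetical dictionary and document; D is the document read as a number.
table = compile_table(["alpha", "bravo", "falcon", "delta"], {"falcon"})
g, g_D = filter_document(table, ["alpha", "falcon", "delta"], 4242)

# Holder of the decryption key: g decrypts to c, g^D decrypts to c*D.
c, c_times_D = decrypt(g), decrypt(g_D)
recovered = c_times_D * pow(c, -1, N) % N if c else None
print(c, c_times_D, recovered)
```

The table is built with the secret keyword set, but `filter_document` only sees ciphertexts, which is why it can be published. Recovery by modular division assumes c is invertible modulo N and that D is smaller than N.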
Thus: if we keep g = E(c) and g^D = E(c*D), we can calculate D exactly (decrypt both values and divide c*D by c).
[Figure: the end of the Dictionary table, wn-2 -> E(1), wn-1 -> E(0), wn -> E(0); the pair (g, g^D) is written into an Output Buffer whose slots all initially hold E(0).]

Here's another matching document
[Figure: matching documents #1, #2 and #3 arriving at the buffer.]
Collisions cause two problems:
1. Good documents are destroyed
2. Non-existent documents could be fabricated
We'll make use of two combinatorial lemmas…

Combinatorial Lemma 1
Claim: the color survival game succeeds with probability > 1 - neg(γ)

How to detect collisions?
Idea: append a highly structured (yet random) short combinatorial object to the message, with the property that if 2 or more of them “collide” the combinatorial property is destroyed.
Can always detect collisions!
    100|001|100|010|010|100|001|010|010
    010|001|010|001|100|001|100|001|010
    010|100|100|100|010|001|010|001|010
  = 100|100|010|111|100|100|111|010|010

Combinatorial Lemma 2
Claim: collisions are detected with probability > 1 - exp(-k/3)

We do the same for all documents!
For every document in the stream do the same:
Look up the encryptions of all words appearing in the document and multiply them together (= g).
Compute g^D and f(g).
Multiply (g, g^D, f(g)) into γ randomly chosen locations.
[Figure: the Dictionary table w1 … wn with entries E(*), a document D, the triple (g, g^D, f(g)), and a Small Output Buffer whose slots each hold (*,*,*).]

Detecting Overflow > m
Idea: double the buffer size from m to 2m.
If m < #documents < 2m, output “overflow”.
If #documents > 2m, then the expected number of collisions is large, thus output “overflow” in this case as well.

Overflow: how to always collect at least m items (with arbitrary overflow of matching documents)
Idea: create a logarithmic (in stream size) number of original buffers.
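The collision check from the “How to detect collisions?” slide above can be sketched on plaintext tags, as the key holder would see them after decryption. Two things here are assumptions: the reading that colliding tags combine by bitwise XOR (it matches the slide's worked example column by column) and the helper names `random_tag`, `combine` and `looks_valid`, which are hypothetical.

```python
import random
from functools import reduce

ONE_HOT = ("100", "010", "001")  # each triple has exactly one bit set


def random_tag(k, rng=random):
    """A tag of k random one-hot triples, appended to one document."""
    return [rng.choice(ONE_HOT) for _ in range(k)]


def xor_triple(a, b):
    """Bitwise XOR of two 3-bit strings."""
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


def combine(tags):
    """What a buffer slot holds after several tags land in it (assumed XOR)."""
    return reduce(lambda s, t: [xor_triple(a, b) for a, b in zip(s, t)], tags)


def looks_valid(tag):
    """True if every triple is still one-hot, i.e. no collision is visible."""
    return all(triple.count("1") == 1 for triple in tag)


# The three tags from the slide and their combination.
slide_tags = [
    "100|001|100|010|010|100|001|010|010".split("|"),
    "010|001|010|001|100|001|100|001|010".split("|"),
    "010|100|100|100|010|001|010|001|010".split("|"),
]
combined = combine(slide_tags)
print("|".join(combined), looks_valid(combined))
```

Under the XOR reading, two colliding tags always break the one-hot structure (a triple becomes 000 or gains a second bit). With three or more tags, an individual triple can stay one-hot by chance, which is why many triples are used.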
The buffers sample the stream at geometrically decreasing rates:
First buffer is processed for every stream item.
Second buffer takes every item in the stream with probability 1/2.
Third buffer takes every item with (independent) probability 1/4.
The i-th buffer processes items with independent probability 1/2^(i-1).
Key point: if the number of matching documents is > M, at least one buffer will get O(M) matching documents!

More from the paper that we don't have time to discuss…
Reducing program size below dictionary size (using Φ-Hiding from [CMS])
Queries containing AND (using [BGN] machinery)
Eliminating negligible error (using perfect hashing)
Scheme based on arbitrary homomorphic encryption
Extending to words not from the dictionary (with small error probability)

Conclusions
We introduced private searching on streaming data.
More generally: public key program obfuscation, which is more general than PIR or cryptographic counters.
Practical, efficient protocols.
Eat your cake and have it too: ensure that only “useful” documents are collected.
Many possible extensions and lots of open problems.

THANK YOU!
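Backup sketch for the overflow buffers described above: a small simulation of how many matching documents each buffer receives when the first buffer sees every item and each later buffer halves the sampling rate. The number of buffers (ceil(log2(stream length)) + 1), the function names and the parameters are assumptions made for illustration. The `item in matching` test is simulation bookkeeping only, since the real filter never learns which items match.

```python
import math
import random


def buffers_for_item(num_buffers, rng):
    """0-based indices of the buffers that process one stream item.

    Buffer i is chosen independently with probability 2**-i, so buffer 0
    (the slide's first buffer) processes every item.
    """
    return [i for i in range(num_buffers) if rng.random() < 2.0 ** -i]


def simulate(stream_len, num_matching, seed=0):
    """Count the matching documents that reach each buffer."""
    rng = random.Random(seed)
    num_buffers = math.ceil(math.log2(stream_len)) + 1  # assumed count
    matching = set(rng.sample(range(stream_len), num_matching))
    counts = [0] * num_buffers
    for item in range(stream_len):
        chosen = buffers_for_item(num_buffers, rng)  # independent of content
        if item in matching:  # bookkeeping for the simulation only
            for i in chosen:
                counts[i] += 1
    return counts


counts = simulate(stream_len=4096, num_matching=600)
print(counts)
```

Since each buffer receives about half as many matching documents as the one before it, some buffer ends up with a count within a factor of two of any target capacity below the total, which is the intuition behind the key point on the slide.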