Privacy-Preserving Outlier Detection

Jaideep Vaidya
Rutgers University
180 University Avenue
Newark, NJ 07102-1803
jsvaidya@rbs.rutgers.edu

Chris Clifton
Purdue University
250 N. University St.
W. Lafayette, IN 47907-2066
clifton@cs.purdue.edu

Abstract

Outlier detection can lead to the discovery of truly unexpected knowledge in many areas such as electronic commerce, credit card fraud, and especially national security. We look at the problem of finding outliers in large distributed databases where privacy/security concerns restrict the sharing of data. Both homogeneous and heterogeneous distribution of data is considered. We propose techniques to detect outliers in such scenarios while giving formal guarantees on the amount of information disclosed.

1. Introduction

Advances in information technology and the ubiquity of networked computers have made personal information much more available. This has led to a privacy backlash. Unfortunately, "data mining" has been the whipping-boy for much of this backlash; witness a United States Senate proposal to forbid all "data-mining activity" by the U.S. Department of Defense[7]. Much of this is based on a mistaken view of data mining; the above-cited act specifically discusses search for individuals. Most data mining is not about individuals, but about generalizing information.

Data mining does raise legitimate privacy concerns; the process of data mining often results in greater integration of data, increasing the potential for misuse. If the data mining results do not pose an inherent privacy threat, privacy-preserving data mining techniques enable knowledge discovery without requiring disclosure of private data. Privacy-preserving methods have been developed for numerous data mining tasks; many are described in [13].

To our knowledge, privacy-preserving outlier detection has not yet been addressed. Outlier detection has wide application; one that has received considerable attention is the search for terrorism. Detecting previously unknown suspicious behavior is a clear outlier detection problem. The search for terrorism has also been the flash point for attacks on data mining by privacy advocates; the U.S. Terrorism Information Awareness program was killed for this reason[18].

Outlier detection has numerous other applications that also raise privacy concerns. Mining for anomalies has been used for network intrusion detection[1, 17]; privacy advocates have responded with research to enhance anonymity[20, 10]. Fraud discovery in the mobile phone industry has also made use of outlier detection[6]; organizations must be careful to avoid overstepping the bounds of privacy legislation[5]. Privacy-preserving outlier detection will ensure these concerns are balanced, allowing us to get the benefits of outlier detection without being thwarted by legal or technical counter-measures.

This paper assumes data is distributed; the stewards of the data are allowed to use it, but disclosing it to others is a privacy violation. The problem is to find distance-based outliers without any party gaining knowledge beyond learning which items are outliers. Ensuring that data is not disclosed maintains privacy, i.e., no privacy is lost beyond that inherently revealed in knowing the outliers. Even knowing which items are outliers need not be revealed to all parties, further preventing privacy breaches.

The approach duplicates the results of the outlier detection algorithm of [14]. The idea is that an object o is an outlier if more than a percentage p of the objects in the data set are farther than distance d from o. The basic idea is that parties compute the portion of the answer they know, then engage in a secure sum to compute the total distance. The key is that this total is (randomly) split between sites, so nobody knows the actual distance.
A secure protocol is used to determine if the actual distance between any two points exceeds the threshold; again the comparison results are randomly split such that summing the splits (over a closed field) results in a 1 if the distance exceeds the threshold, or a 0 otherwise.

For a given object o, each site can now sum all of its shares of comparison results (again over the closed field). When added to the sum of shares from other sites, the result is the correct count; all that remains is to compare it with the percentage threshold p. This addition/comparison is also done with a secure protocol, revealing only the result: whether o is an outlier.

We first discuss the problem we are facing: the different problems posed by vertically and horizontally partitioned datasets, and the formal definition of outlier detection. Section 3 gives privacy-preserving algorithms for both horizontally and vertically partitioned data. We prove the security of the algorithms in Section 4, and discuss the computational and communication complexity of the algorithms in Section 5. We conclude with a discussion of areas for further work on this problem.

2. Data Partitioning Models

The problem as we define it is that the data is inherently distributed; it is sharing (or disclosure to other parties) that violates privacy. The way the data is distributed/partitioned results in very different solutions. We consider two different data partitions: horizontal and vertical. In either case, assume k different parties, P_0, ..., P_{k-1}; m attributes; and n total objects. We now describe the specifics of the different data models considered.

2.1. Horizontally Partitioned Data

With horizontally partitioned (viz. distributed homogeneous) data, different parties collect the same information (features) for different objects. Each party collects information about m attributes, A_1, ..., A_m. Party P_i collects information about n_i objects, such that \sum_{i=0}^{k-1} n_i = n (different parties collect information about different entities). Consider the case of several banks that collect similar data about credit card transactions but for different clients. Clearly, the data is horizontally partitioned. Outlier detection is particularly useful in this case to determine potentially fraudulent transactions.

2.2. Vertically Partitioned Data

With vertically partitioned (viz. distributed heterogeneous) data, different parties collect different features for the same set of objects. Party P_i collects information about m_i attributes, A_{i,1}, ..., A_{i,m_i}; the total number of attributes is \sum_{p=0}^{k-1} m_p = m. All of the parties hold information about the same n objects. Thus, there are a total of n transactions (with the data for each transaction really being split between the parties). Consider the case of an airline, a banking institution, and federal databases. By cross-correlating information and locating outliers we may hope to spot potential terrorist activities.

2.3. Outlier Detection

Our goal is to find distance-based outliers. Knorr and Ng [14] define the notion of a distance-based outlier as follows: An object O in a dataset T is a DB(p,dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O. Other distance-based outlier techniques also exist[15, 19]. The advantages of distance-based outliers are that no explicit distribution needs to be defined to determine unusualness, and that they can be applied to any feature space for which we can define a distance measure. We assume Euclidean distances, although the algorithms are easily extended to general Minkowski distances. There are other non-distance-based techniques for finding outliers, as well as significant work in statistics [2], but we do not consider those in this paper and leave them for future work.
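To make the definition concrete, the following Python sketch (ours, not part of the paper) implements the plain, non-private DB(p,dt) test that the protocols below emulate; the function and variable names are illustrative.

```python
import math

def naive_db_outliers(points, p, dt):
    """Return indices of the DB(p,dt)-outliers of `points` (Knorr and Ng [14]):
    an object is an outlier if at least fraction p of the objects lie at
    Euclidean distance greater than dt from it."""
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        far = sum(1 for j, q in enumerate(points)
                  if j != i and math.dist(o, q) > dt)
        if far >= p * n:
            outliers.append(i)
    return outliers

# The isolated point (50, 50) is far from at least 75% of the dataset.
data = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50)]
print(naive_db_outliers(data, p=0.75, dt=5))  # -> [4]
```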
3. Privacy Preserving Outlier Detection

We now present two algorithms for distance-based outliers meeting the definition given in Section 2.3. The first is for horizontally partitioned data, the second for vertically partitioned data. Each is based on the obvious algorithm: compare points pairwise and count the number exceeding the distance threshold. The key is that all intermediate computations (such as distance comparisons) leave the results randomly split between the parties involved; only the final result (whether the count exceeds p%) is disclosed.

The pairwise comparison of all points may seem excessive. However, it is necessary to achieve a completely (or even reasonably) secure solution; this will be discussed further in Section 5.3. The asymptotic complexity still equals that of [14].

Note that to obtain a secure solution, all operations are carried out modulo some field. We will use the field D for distances, and F for counts of the number of entities. The field F must be over twice the number of objects. Limits on D are based on maximum distances; details on the size are given with each algorithm.

3.1. Horizontally Partitioned Data

The key idea behind the algorithm for horizontally partitioned data is as follows. For each object i, the protocol iterates over every other object j. If the party holding i also holds j, it can easily find the distance and compare against the threshold. If two different parties hold the two objects, the parties engage in a distance calculation protocol (Section 3.1.1) to get random shares of the distance. A second protocol allows comparing the shares with the threshold, returning 1 if the distance exceeds the threshold, or 0 if it does not. The key to this second protocol is that the 1 or 0 is actually two shares r'_q and r'_s, such that r'_q + r'_s = 1 (or 0) (mod F). From one share, the party learns nothing.

Once all points have been compared, the parties sum their shares. Since the shares add to 1 for distances exceeding the distance threshold, and 0 otherwise, the total sum (mod F) is the number of points for which the distance exceeds the threshold. The parties do not actually compute this sum; instead, all parties pass their (random) shares to a designate to add, and the designated party and the party holding the point engage in a secure protocol that reveals only whether the sum of the shares exceeds p%. This ensures that no party learns anything except whether the point is an outlier.

Algorithm 1 gives the complete details. Steps 5-23 are the pairwise comparison of two points, giving each party random shares of a 1 (if the points are far apart) or 0 (if the points are within the distance threshold dt). The random split of shares ensures that nothing is learned by either party. In steps 25-28, the parties (except the party P_q holding the object being evaluated) sum their shares. Again, since each share is a random split (and P_q holds the other part of the split), no party learns anything. Finally, P_{q-1} and P_q add and compare their shares, revealing only whether the object o_i is an outlier. Note that the shares of this comparison are split, and could be sent to any party (P_q in Algorithm 1, but it need not even be one of the P_r). Only that party (e.g., a fraud prevention unit) learns whether o_i is an outlier; the others learn nothing.

Algorithm 1 Finding DB(p,dt)-outliers
Require: k parties, P_0, ..., P_{k-1}, each holding a subset of the objects O.
Require: Fields D larger than the maximum distance squared, F larger than |O|.
 1: for all objects o_i ∈ O {let P_q be the party holding o_i} do
 2:   for all parties P_r do
 3:     num_r ← 0 (mod F)  {initialize counters}
 4:   end for
 5:   for all objects o_j ∈ O, o_j ≠ o_i do
 6:     if P_q holds o_j then
 7:       if Distance(o_i, o_j) > dt {computed locally at P_q} then
 8:         At P_q: num_q ← num_q + 1 (mod F)
 9:       end if
10:     else
11:       {Let P_s hold o_j}
12:       {Using the distance computation protocol (Section 3.1.1):}
13:       P_q ← r_q and P_s ← r_s such that r_q + r_s (mod D) = Distance²(o_i, o_j)
14:       {Using the secure comparison protocol (Section 3.3):}
15:       P_q ← r'_q and P_s ← r'_s such that:
16:       if r_q + r_s (mod D) > dt² then
17:         r'_q + r'_s = 1 (mod F)
18:       else
19:         r'_q + r'_s = 0 (mod F)
20:       end if
21:     end if
22:     At P_q: num_q ← num_q + r'_q
23:     At P_s: num_s ← num_s + r'_s
24:   end for
25:   for all P_r except P_q and P_{q-1 (mod k)} do
26:     P_r sends num_r to P_{q-1}
27:   end for
28:   At P_{q-1}: num_{q-1} ← \sum_{i≠q} num_i
29:   {Using the secure comparison of Section 3.3:}
30:   P_q ← temp_q and P_{q-1} ← temp_{q-1} such that:
31:   if num_q + num_{q-1} (mod F) > |O| * p% then
32:     temp_q + temp_{q-1} = 1  {o_i is an outlier}
33:   else
34:     temp_q + temp_{q-1} = 0
35:   end if
36:   P_{q-1} sends temp_{q-1} to P_q, revealing to P_q whether o_i is an outlier.
37: end for
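Every intermediate result in Algorithm 1 is protected by the same primitive: a value is split into two additive shares modulo a field, each of which is uniformly distributed on its own. A minimal sketch of this splitting (our illustration; in the protocol the shares are produced inside the secure sub-protocols of steps 13-20 rather than by any single party):

```python
import random

F = 2_000_003  # stand-in for the field F: a prime comfortably larger than 2|O|

def split(value, modulus=F):
    """Split `value` into two additive shares mod `modulus`; either share
    alone is uniformly distributed and so reveals nothing about `value`."""
    a = random.randrange(modulus)
    return a, (value - a) % modulus

def recombine(a, b, modulus=F):
    return (a + b) % modulus

# The 1/0 comparison outcomes of steps 15-20 are held this way ...
r_q, r_s = split(1)                      # "distance exceeds the threshold"
assert recombine(r_q, r_s) == 1
# ... and the per-party counters of steps 22-23 are sums of such shares:
# recombining the summed counters yields the count of far-away points
# without either sum alone revealing it.
num_q, num_s = 0, 0
for bit in (1, 0, 1, 1):                 # comparison results for four pairs
    a, b = split(bit)
    num_q, num_s = (num_q + a) % F, (num_s + b) % F
assert recombine(num_q, num_s) == 3
```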
3.1.1. Computing the distance between two points

Step 13 of Algorithm 1 requires computing a distance, but leaving random shares of that distance with the two parties rather than revealing the result. For convenience, we actually compute shares of the square of the distance, and compare with the square of the threshold. (This does not change the result, since squaring is a monotonically increasing function.) We now give an algorithm based on a secure scalar product for computing shares of the square of the Euclidean distance.

Formally, let there be two parties, P_1 and P_2. All computations are over a field D larger than the square of the maximum distance. P_1's input is the point X, P_2's input is the point Y. The outputs are r_1 and r_2 respectively (independently uniformly distributed over D), such that r_1 + r_2 = Distance²(X, Y) (mod D), where Distance(X, Y) is the Euclidean distance between the points X and Y.

Let there be m attributes, and let a point X be represented by its m-dimensional tuple (x_1, ..., x_m), where each coordinate represents the value of the point for that attribute. The square of the Euclidean distance between X and Y is given by

\[
\mathrm{Distance}^2(X, Y) = \sum_{r=1}^{m} (x_r - y_r)^2
= x_1^2 - 2x_1 y_1 + y_1^2 + \cdots + x_m^2 - 2x_m y_m + y_m^2
= \sum_{r=1}^{m} x_r^2 + \sum_{r=1}^{m} y_r^2 - \sum_{r=1}^{m} 2 x_r y_r .
\]

P_1 can independently calculate \sum_r x_r^2. Similarly, P_2 can calculate \sum_r y_r^2. As long as there is more than one attribute (i.e., m > 1), the remaining sum \sum_r (2x_r)(-y_r) is simply the scalar product of two m-dimensional vectors. P_1 and P_2 engage in a secure scalar product protocol to get random shares of this dot product. This, added to their previously calculated values, gives each party a random share of the square of the distance. There are many scalar product protocols proposed in the literature [4, 21, 11]; any of these can be used.

Assuming that the scalar product protocol is secure, applying the composition theorem of [8] shows that the entire protocol is secure.
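As a sketch of this decomposition (ours), the following replaces the secure scalar product with a stand-in that simply splits the true dot product into random shares; in the actual protocol those shares would come from one of the cited scalar product protocols [4, 21, 11], without either party seeing the other's vector:

```python
import random

D = 10_000_019  # field larger than the maximum squared distance (assumed bound)

def mock_secure_scalar_product(x, y, modulus=D):
    """Stand-in for a secure scalar product protocol: returns random additive
    shares of sum(2*x_r * -y_r) mod `modulus`. A real protocol produces these
    shares without either party learning the other's vector."""
    dot = sum(2 * xr * -yr for xr, yr in zip(x, y)) % modulus
    s1 = random.randrange(modulus)
    return s1, (dot - s1) % modulus

def distance_squared_shares(x, y, modulus=D):
    s1, s2 = mock_secure_scalar_product(x, y, modulus)
    r1 = (sum(xr * xr for xr in x) + s1) % modulus  # P1's local term + share
    r2 = (sum(yr * yr for yr in y) + s2) % modulus  # P2's local term + share
    return r1, r2

x, y = (1, 2, 3), (4, 6, 8)
r1, r2 = distance_squared_shares(x, y)
assert (r1 + r2) % D == sum((a - b) ** 2 for a, b in zip(x, y))  # = 50
```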
3.2. Vertically Partitioned Data

Vertically partitioned data introduces a different challenge. Each party can compute a share of the pairwise distance locally; the sum of these shares is the total distance. However, the distance must not be revealed, so a secure protocol is used to get shares of the pairwise comparison of distance and threshold. From this point, it is similar to horizontal partitioning: add the shares and determine if they exceed p%.

An interesting side effect of this algorithm is that the parties need not reveal any information about the attributes they hold, or even the number of attributes. Each party locally determines the distance threshold for its attributes (or more precisely, its share of the overall threshold). Instead of computing the local pairwise distance, each party computes the difference between the local pairwise distance and the local threshold. If the sum of these differences is greater than 0, the pairwise distance exceeds the threshold.

Algorithm 2 gives the full details. In steps 5-9, the sites sum their local distances. The random x added by P_0 masks the distances from each party. In steps 11-18, parties P_0 and P_{k-1} get shares of the pairwise comparison result, as in Algorithm 1. The comparison is a test of whether the sum is greater than 0 (since the threshold has already been subtracted). These two parties keep a running sum of their shares. At the end, these shares are added and compared with the percentage threshold, again as in Algorithm 1.

Algorithm 2 Finding DB(p,dt)-outliers
Require: k parties, P_0, ..., P_{k-1}, each holding a subset of the attributes for all objects O.
Require: dt_r: local distance threshold for P_r.
Require: Fields D larger than twice the maximum distance, F larger than |O|.
 1: for all objects o_i ∈ O do
 2:   m'_0 ← m'_{k-1} ← 0 (mod F)
 3:   for all objects o_j ∈ O, o_j ≠ o_i do
 4:     P_0: randomly choose a number x from a uniform distribution over the field D; x' ← x
 5:     for r ← 0, ..., k-2 do
 6:       At P_r: x' ← x' + Distance_r(o_i, o_j) - dt_r (mod D)  {Distance_r is the local distance at P_r}
 7:       P_r sends x' to P_{r+1}
 8:     end for
 9:     At P_{k-1}: x' ← x' + Distance_{k-1}(o_i, o_j) - dt_{k-1} (mod D)
10:     {Using the secure comparison protocol (Section 3.3):}
11:     P_0 ← m_0 and P_{k-1} ← m_{k-1} such that:
12:     if 0 < x' + (-x) (mod D) < |D|/2 then
13:       m_0 + m_{k-1} = 1 (mod F)
14:     else
15:       m_0 + m_{k-1} = 0 (mod F)
16:     end if
17:     At P_0: m'_0 ← m'_0 + m_0 (mod F)
18:     At P_{k-1}: m'_{k-1} ← m'_{k-1} + m_{k-1} (mod F)
19:   end for
20:   {Using the secure comparison of Section 3.3:}
21:   P_0 ← temp_0 and P_{k-1} ← temp_{k-1} such that:
22:   if m'_0 + m'_{k-1} (mod F) > |O| * p% then
23:     temp_0 + temp_{k-1} = 1  {o_i is an outlier}
24:   else
25:     temp_0 + temp_{k-1} = 0
26:   end if
27:   P_0 and P_{k-1} send temp_0 and temp_{k-1} to the party authorized to learn the result; if temp_0 + temp_{k-1} = 1 then o_i is an outlier.
28: end for
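The heart of Algorithm 2 is the masked chained sum of steps 4-9 and the "is the unmasked sum a small positive value" test of step 12. A sketch (ours), with the secure comparison collapsed into a local check; Distance_r and dt_r are modeled as plain lists of local contributions:

```python
import random

D = 1_000_003  # stand-in field, larger than twice the maximum distance

def masked_threshold_test(local_dist, local_dt):
    """Steps 4-9 of Algorithm 2: P_0 adds a random mask x, then each party
    adds its (local distance - local threshold) mod D and passes x' down the
    chain. Step 12 tests whether x' - x lands in (0, |D|/2), i.e. the
    unmasked sum is a small positive value: distance exceeds threshold.
    (Here the test is local; the protocol does it inside a secure comparison
    so that no one ever sees x' and x together.)"""
    x = random.randrange(D)                      # P_0's mask
    x_prime = x
    for dist_r, dt_r in zip(local_dist, local_dt):
        x_prime = (x_prime + dist_r - dt_r) % D  # at P_r, then sent onward
    unmasked = (x_prime - x) % D
    return 0 < unmasked < D // 2

# Two parties: local contributions 9 and 30 against local thresholds 10 and 15.
# 39 > 25, so the pairwise distance exceeds the overall threshold.
print(masked_threshold_test([9, 30], [10, 15]))  # True
```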
Theorem 1 (Correctness) Algorithm 2 correctly returns as output the complete set of points that are global outliers.

PROOF. To prove the correctness of Algorithm 2, it is sufficient to prove that a point is reported as an outlier if and only if it is truly an outlier. Consider point q. If q is an outlier, then in steps 12-16, for at least p% * |O| + 1 of the other points, m_0 + m_{k-1} = 1 (mod F). Since |F| > |O|, it follows that m'_0 + m'_{k-1} > |O| * p%. Therefore, point q will be correctly reported as an outlier. If q is not an outlier, the same argument applies in reverse: in steps 12-16, for at most p% * |O| - 1 points, m_0 + m_{k-1} = 1 (mod F). Again, since |F| > |O|, it follows that m'_0 + m'_{k-1} ≤ |O| * p%. Therefore, point q will not be reported as an outlier.

3.3. Modified Secure Comparison Protocol

At several stages in the protocol, we need to securely compare the sum of two numbers, with the output split between the parties holding those numbers. This can be accomplished using the generic circuit evaluation technique first proposed by Yao[23]. Formally, we need a modified secure comparison protocol for two parties, A and B. The local inputs are x_a and x_b and the local outputs are y_a and y_b. All operations on the inputs are in a field F_1 and on the outputs in a field F_2. y_a + y_b = 1 (mod F_2) if x_a + x_b (mod F_1) > 0; otherwise y_a + y_b = 0 (mod F_2). A final requirement is that y_a and y_b should be independently uniformly distributed over F_2 (clearly the joint distribution is not uniform).

This builds on the standard secure multiparty computation circuit-based approach for solving this problem[8]. Effectively, A chooses y_a with a uniform distribution over F_2 and provides it as an additional input to the circuit, which appropriately computes y_b. The circuit is then securely evaluated, with B receiving the output y_b. The complexity of this is equivalent to the complexity of Yao's Millionaires' problem (simple secure comparison).
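The ideal functionality of this comparison can be mocked in a few lines: a sketch of ours in which one function plays the role of the securely evaluated circuit; the real protocol computes y_b inside the circuit so that neither x_a nor x_b is ever reconstructed in the clear. We interpret "x_a + x_b > 0" as the sum lying in the lower half of F_1, matching step 12 of Algorithm 2:

```python
import random

F1 = 1_000_003  # input field (illustrative size)
F2 = 2_000_003  # output field (illustrative size)

def modified_secure_comparison(x_a, x_b):
    """Ideal functionality of Section 3.3: output shares (y_a, y_b) with
    y_a + y_b = 1 (mod F2) iff x_a + x_b (mod F1) is positive (encoded as
    lying in the lower half of F1), else y_a + y_b = 0 (mod F2). y_a is
    chosen uniformly by A, so either output share alone looks random."""
    s = (x_a + x_b) % F1
    bit = 1 if 0 < s < F1 // 2 else 0
    y_a = random.randrange(F2)   # A's uniformly chosen share
    y_b = (bit - y_a) % F2       # in the real protocol, computed by the circuit
    return y_a, y_b

y_a, y_b = modified_secure_comparison(40, F1 - 15)  # masked sum is 25 > 0
assert (y_a + y_b) % F2 == 1
```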
4. Security Analysis

The security argument for these algorithms uses proof techniques from Secure Multiparty Computation. The idea is that since what a party sees during the protocol (its shares) is randomly chosen from a uniform distribution over a field, it learns nothing in isolation. (Of course, collusion with other parties could reveal information, since the joint distribution of the shares is not random.) The proof is based on a simulation argument: if we can define a simulator that uses the algorithm output and a party's own data to simulate the messages seen by that party during a real execution of the protocol, then the real execution isn't giving away any new information.

To formalize this, we first give some definitions from [8]. We then give the proofs of security for Algorithms 1 and 2.

4.1. Secure Multi-Party Computation

Yao first postulated the two-party comparison problem (Yao's Millionaires' problem) and developed a provably secure solution[23]. This was extended to multiparty computations by Goldreich et al.[9]. They developed a framework for secure multiparty computation, and in [8] proved that computing a function privately is equivalent to computing it securely.

We start with the definitions for security in the semi-honest model. A semi-honest party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security. A formal definition of private two-party computation in the semi-honest model is given below.

Definition 1 (privacy w.r.t. semi-honest behavior)[8]: Let f : \{0,1\}^* \times \{0,1\}^* \to \{0,1\}^* \times \{0,1\}^* be a probabilistic, polynomial-time functionality, where f_1(x, y) (respectively, f_2(x, y)) denotes the first (resp., second) element of f(x, y); and let \Pi be a two-party protocol for computing f.

Let the view of the first (resp., second) party during an execution of protocol \Pi on (x, y), denoted view_1^\Pi(x, y) (resp., view_2^\Pi(x, y)), be (x, r_1, m_1, ..., m_t) (resp., (y, r_2, m_1, ..., m_t)), where r_1 represents the outcome of the first (resp., r_2 the second) party's internal coin tosses, and m_i represents the i-th message it has received. The output of the first (resp., second) party during an execution of \Pi on (x, y) is denoted output_1^\Pi(x, y) (resp., output_2^\Pi(x, y)) and is implicit in the party's view of the execution.

\Pi privately computes f if there exist probabilistic polynomial-time algorithms, denoted S_1 and S_2, such that

\[
\{(S_1(x, f_1(x, y)), f_2(x, y))\}_{x,y \in \{0,1\}^*} \equiv^C \{(\mathrm{view}_1^\Pi(x, y), \mathrm{output}_2^\Pi(x, y))\}_{x,y \in \{0,1\}^*}
\]
\[
\{(f_1(x, y), S_2(y, f_2(x, y)))\}_{x,y \in \{0,1\}^*} \equiv^C \{(\mathrm{output}_1^\Pi(x, y), \mathrm{view}_2^\Pi(x, y))\}_{x,y \in \{0,1\}^*}
\]

where \equiv^C denotes computational indistinguishability.

As we shall see, our protocol is actually somewhat stronger than the semi-honest model, although it does not meet the full malicious-model definition of [8].

Privacy by Simulation. The above definition says that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated given only the input and the output of that party. Thus, in all of our proofs of security, we only need to show the existence of a simulator for each party that satisfies the above equations.

This does not quite guarantee that private information is protected. Whatever information can be deduced from the final result obviously cannot be kept private. For example, if a party learns that point A is an outlier, but point B, which is close to A, is not an outlier, it learns an estimate of the number of points that lie in the space between the hypersphere around A and the hypersphere around B. Here, the result reveals information to the site holding A and B. The key to the definition of privacy is that nothing is learned beyond what is inherent in the result.
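The simulation arguments that follow rest on a single fact: an additive share of any value, when the other share is uniform, is itself uniform over the field, so a simulator that emits fresh random field elements is distributed identically to the real messages. A quick empirical sketch (ours):

```python
import random
from collections import Counter

F = 11  # a tiny field so the two distributions are easy to inspect

def real_share(secret):
    """One party's additive share of `secret` when the other share is uniform."""
    other = random.randrange(F)
    return (secret - other) % F

def simulated_share():
    """What the simulator emits: a fresh uniform field element."""
    return random.randrange(F)

trials = 100_000
real = Counter(real_share(7) for _ in range(trials))   # shares of a fixed secret
sim = Counter(simulated_share() for _ in range(trials))
# Both histograms are statistically flat over {0, ..., 10}: a share of a
# secret is indistinguishable from random noise, which is exactly what the
# simulators in Theorems 3 and 4 exploit.
print(sorted(real.items()))
print(sorted(sim.items()))
```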
A key result we use is the composition theorem. We state it for the semi-honest model. A detailed discussion of this theorem, as well as its proof, can be found in [8].

Theorem 2 (Composition Theorem for the semi-honest model) Suppose that g is privately reducible to f and that there exists a protocol for privately computing f. Then there exists a protocol for privately computing g.

PROOF. Refer to [8].

4.2. Horizontally Partitioned Data

Theorem 3 Algorithm 1 returns as output the set of points that are global outliers, and reveals no other information to any party, provided the parties do not collude.

PROOF. Presuming that the number of objects |O| is known globally, each party can locally set up and run its own components of Algorithm 1 (e.g., a party only needs to worry about its local objects in the "for all objects" statements at lines 1 and 5). In the absence of some type of secure anonymous send[20, 10] (e.g., anonymous transmission with public-key cryptography to ensure reception only by the correct party), the number of objects at each site is revealed. Since at least an upper bound on the number of items is inherently revealed by the running time of the algorithm, we assume these values are known.

The next problem is to simulate the messages seen by each party during the algorithm. Communication occurs only at steps 13, 15, 26, 30, and 36. We now describe each simulation independently.

Step 13: P_q and P_s each receive a share of the square of the distance. As can be seen in Section 3.1.1, all parts of the shares are computed locally except for the shares of the scalar product. Assume that the scalar product protocol chooses shares by selecting the share for P_q (call it s_q) randomly from a uniform distribution over D. Then ∀x ∈ D, Pr(s_q = x) = 1/|D|. Thus, s_q is easily simulated by simply choosing a random value from D. Let the result r = \sum_r (2x_r)(-y_r) be fixed. Then ∀y ∈ D, Pr(s_s = y) = Pr(r - s_q = y) = Pr(s_q = r - y) = 1/|D|. Therefore, the simulator for P_s can simulate this message by simply choosing a random number from a uniform distribution over D. Assuming that the scalar product protocol is secure, applying the composition theorem shows that step 13 is secure.

Steps 15 and 30: The simulator for party P_q (respectively P_s) again chooses a number randomly from a uniform distribution, this time over the field F. By the same argument as above, the actual values are uniformly distributed, so the probabilities of the simulator and the real protocol choosing any particular value are the same. Since a circuit for secure comparison is used, by the composition theorem no additional information is leaked, and step 15 is secure.

Step 26: P_{q-1} receives several shares num_r. However, note that each num_r is a sum whose components are all random shares from step 15. Since P_{q-1} receives only shares from the P_s in step 15, and receives none from P_q, all of the shares in the sum are independent. The sum num_r can thus be simulated by choosing a random value from a uniform distribution over F.

Step 36: Since P_q knows the result (1 if o_i is an outlier, 0 otherwise), and temp_q was simulated in step 30, it can simulate temp_{q-1} as the result (1 or 0) - temp_q mod F.

The simulator clearly runs in polynomial time (the same as the algorithm). Since each party is able to simulate the view of its execution (i.e., the probability of any particular value is the same as in a real execution with the same inputs/results) in polynomial time, the algorithm is secure with respect to Definition 1.

While the proof is formally only for the semi-honest model, it can be seen that a malicious party in isolation cannot learn private values (regardless of what it does, it is still possible to simulate what it sees without knowing the inputs of the other parties). This assumes that the underlying scalar product and secure comparison protocols are secure against malicious behavior. A malicious party can cause incorrect results, but it cannot learn private data values.

4.3. Vertically Partitioned Data

Theorem 4 Algorithm 2 returns as output the set of points that are global outliers, while revealing no other information to any party, provided the parties do not collude.

PROOF. All parties know the number (and identity) of the objects in O. Thus they can set up the loops; the simulator just runs the algorithm to generate most of the simulation. The only communication is at lines 7, 11, 21, and 27.
Step 7: Each party P_s sees x' = x + \sum_{r=0}^{s-1} Distance_r(o_i, o_j), where x is the random value chosen by P_0. Pr(x' = y) = Pr(x + \sum_{r=0}^{s-1} Distance_r(o_i, o_j) = y) = Pr(x = y - \sum_{r=0}^{s-1} Distance_r(o_i, o_j)) = 1/|D|. Thus we can simulate the value received by choosing a random value from a uniform distribution over D.

Steps 11 and 21: Each step is again a secure comparison, so the messages are simulated as in steps 15 and 30 of Theorem 3.

Step 27: This is again the final result, simulated as in step 36 of Theorem 3: temp_0 is simulated by choosing a random value, and temp_{k-1} = result - temp_0. By the same argument on random shares used above, the distribution of simulated values is indistinguishable from the distribution of the shares.

Again, the simulator clearly runs in polynomial time (the same as the algorithm). Since each party is able to simulate the view of its execution (i.e., the probability of any particular value is the same as in a real execution with the same inputs/results) in polynomial time, the algorithm is secure with respect to Definition 1.

Absent collusion, and assuming a malicious-model secure comparison, a malicious party is unable to learn anything it could not learn from altering its input. Step 7 is particularly sensitive to collusion, but can be improved (at a cost) by splitting the sum into shares and performing several such sums (see [12] for more discussion of collusion-resistant secure sum).

5. Computation and Communication Analysis

Both Algorithms 1 and 2 suffer the drawback of quadratic computation complexity due to the nested iteration over all objects. This is unavoidable in a completely secure algorithm, as will be discussed in Section 5.3.
Due to the quadratic complexity, Algorithm 1 requires O(n²) distance computations and secure comparisons (steps 12-20), where n is the total number of objects. Similarly, Algorithm 2 also requires O(n²) secure comparisons (steps 10-16). While operation parallelism can be used to reduce the round complexity of communication, the key practical issue is the computational cost of the encryption required by the secure comparison and scalar product protocols.

When there are three or more parties, assuming no collusion, we can develop much more efficient solutions that reveal some information. While not completely secure, the privacy versus cost tradeoff may be acceptable in some situations.

5.1. Horizontally Partitioned Data

With horizontally partitioned data, we can use a semi-trusted third party to perform comparisons and return random shares. The two comparing parties simply give the values to be compared to the third party to add and compare. As long as the third party does not collude with either of the comparing parties, the comparing parties learn nothing. The real question is: what is disclosed to the third party? Since the data is horizontally partitioned, the third party has no idea about the respective locations of the two objects; all it can find out is the distance between them. While this is information that is not part of the result, by itself it is not very significant, and it allows a tremendous increase in efficiency. The cost of secure comparison now reduces to a total of 4 messages (which can be combined for all comparisons performed by the pair, for a constant number of rounds of communication) and insignificant computation cost.
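A sketch (ours) of this third-party variant: the comparing parties send their distance shares to the third party, which recombines them (learning the distance, but, with horizontal partitioning, not the points themselves), compares against the threshold, and returns random shares of the outcome. Field sizes and names are illustrative:

```python
import random

D = 10_000_019  # distance field (illustrative)
F = 2_000_003   # count field (illustrative)

def third_party_compare(share_q, share_s, dt_squared):
    """Run by the semi-trusted third party: it recombines the two distance
    shares (learning the squared distance, but not the points, which stay
    with their horizontally partitioned owners), compares against the
    threshold, and hands back random additive shares of the outcome."""
    dist_sq = (share_q + share_s) % D
    bit = 1 if dist_sq > dt_squared else 0
    r_q = random.randrange(F)
    return r_q, (bit - r_q) % F

# Distance shares recombine to 169; the squared threshold is 100.
r_q, r_s = third_party_compare(100, 69, 100)
assert (r_q + r_s) % F == 1  # the pair of points is "far apart"
```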
5.2. Vertically Partitioned Data

The simple approach used in horizontal partitioning is not suitable for vertically partitioned data. Since all of the parties share all of the points, partial knowledge about a point does reveal useful information to a party. Instead, one of the remaining parties is chosen to play the part of a completely untrusted, non-colluding third party. With this assumption, a much more efficient secure comparison algorithm proposed by Cachin [3] can be used that reveals nothing to the third party. The algorithm is otherwise equivalent, but the cost of the comparisons is reduced substantially.

5.3. Why is quadratic complexity necessary for privacy?

Current outlier detection algorithms have focused on reducing the naïve quadratic complexity of the problem. However, for any point-comparison-based algorithm, each point must be compared with every other point. Any algorithm that excludes some points from consideration inherently compromises security: the exclusion of a point from consideration reveals information about the relative position of that point. Consider the model of vertically partitioned data. If, for point A, the algorithm decides to exclude point B from consideration, then, since all parties know the transaction identifiers for points A and B, if the two points are far apart locally for one party, it knows that the two points have to be very close for the other parties. On the other hand, if a point is selected to be an outlier only after computing distances from x other points, then all parties now know that the point is farther away from at least p% of those points. Thus, extra information has been revealed.

Even with horizontal partitioning of data, probabilistic estimation of point locations (or clusters) is possible. If one of the points under consideration is owned by you, you gain considerable information about other points if the search process ends early; over time, this allows the determination of clusters at other sites based on knowing they have a high number of outliers relative to one's own data. While not as obvious as the problems that may occur with vertically partitioned data, it poses an unknown hazard to privacy.

This forces the run-time complexity of any secure algorithm to be quadratic (always; it is no longer just the worst case). If one is willing to compromise on security, it is possible to come up with more efficient algorithms. However, the quantification of (in)security in this case is so amorphous that it might not be justifiable to regard the algorithm as secure.

6. Conclusion

In this paper, we have presented privacy-preserving solutions for finding distance-based outliers in distributed data sets, and proven their security. One contribution of the paper is to point out that quadratic complexity is a necessity for secure solutions to the problem; at most constant-factor improvements are possible over the algorithms given.

We are currently implementing these schemes and integrating them into software packages (e.g., Weka [22]) to enable a practical evaluation of the computational cost. Another important problem is to develop privacy-preserving methods of space transformation[16], allowing additional distance-based operations to be done in a secure manner.

References

[1] D. Barbará, N. Wu, and S. Jajodia. Detecting novel network intrusions using Bayes estimators. In First SIAM International Conference on Data Mining, Chicago, Illinois, Apr. 5-7, 2001.
[2] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, 3rd edition, 1994.
[3] C. Cachin. Efficient private bidding and auctions with an oblivious third party. In Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 120-127. ACM Press, 1999.
[4] W. Du and M. J. Atallah. Privacy-preserving statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, New Orleans, Louisiana, USA, Dec. 10-14, 2001.
[5] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Communities, No I.(281):31-50, Oct. 24, 1995.
[6] K. J. Ezawa and S. W. Norton. Constructing Bayesian networks to predict uncollectible telecommunications accounts. IEEE Expert, 11(5):45-51, Oct. 1996.
[7] M. Feingold, M. Corzine, M. Wyden, and M. Nelson. Data-Mining Moratorium Act of 2003. U.S. Senate Bill (proposed), Jan. 16, 2003.
[8] O. Goldreich. The Foundations of Cryptography, volume 2, chapter General Cryptographic Protocols. Cambridge University Press, 2004.
[9] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game - a completeness theorem for protocols with honest majority. In 19th ACM Symposium on the Theory of Computing, pages 218-229, 1987.
[10] D. Goldschlag, M. Reed, and P. Syverson. Onion routing. Communications of the ACM, 42(2):39-41, Feb. 1999.
[11] I. Ioannidis, A. Grama, and M. Atallah. A secure protocol for computing dot-products in clustered and distributed environments. In The 2002 International Conference on Parallel Processing, Vancouver, British Columbia, Aug. 18-21, 2002.
[12] M. Kantarcıoğlu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16(9):1026-1037, Sept. 2004.
[13] Special section on privacy and security. SIGKDD Explorations, 4(2):i-48, Jan. 2003.
[14] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998), pages 392-403, New York City, NY, USA, Aug. 24-27, 1998.
[15] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3-4):237-253, 2000.
[16] E. M. Knorr, R. T. Ng, and R. H. Zamar. Robust space transformations for distance-based operations. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126-135, San Francisco, California, 2001. ACM Press.
[17] A. Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar. A comparative study of anomaly detection schemes in network intrusion detection. In SIAM International Conference on Data Mining (2003), San Francisco, California, May 1-3, 2003.
[18] M. Lewis. Department of Defense Appropriations Act, 2004, July 17, 2003. Title VIII, Section 8120. Enacted as Public Law 108-87.
[19] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.
[20] M. K. Reiter and A. D. Rubin. Crowds: Anonymity for web transactions. ACM Transactions on Information and System Security, 1(1):66-92, Nov. 1998.
[21] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July 23-26, 2002.
[22] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, Oct. 1999.
[23] A. C. Yao. How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167. IEEE, 1986.