                     Privacy-Preserving Outlier Detection

                              Jaideep Vaidya                              Chris Clifton
                             Rutgers University                        Purdue University
                           180 University Avenue                     250 N. University St.
                          Newark, NJ 07102-1803                  W. Lafayette, IN 47907-2066

                         Abstract

Outlier detection can lead to the discovery of truly unexpected knowledge in many areas such as electronic commerce, credit card fraud and especially national security. We look at the problem of finding outliers in large distributed databases where privacy/security concerns restrict the sharing of data. Both homogeneous and heterogeneous distributions of data are considered. We propose techniques to detect outliers in such scenarios while giving formal guarantees on the amount of information disclosed.

1. Introduction

Advances in information technology and the ubiquity of networked computers have made personal information much more available. This has led to a privacy backlash. Unfortunately, "data mining" has been the whipping-boy for much of this backlash; witness a United States Senate proposal to forbid all "data-mining activity" by the U.S. Department of Defense[7]. Much of this is based on a mistaken view of data mining; the above-cited act specifically discusses search for individuals. Most data mining is not about individuals, but about generalizing information.

Data mining does raise legitimate privacy concerns: the process of data mining often results in greater integration of data, increasing the potential for misuse. Where the data mining results do not pose an inherent privacy threat, privacy-preserving data mining techniques enable knowledge discovery without requiring disclosure of private data. Privacy-preserving methods have been developed for numerous data mining tasks; many are described in [13].

To our knowledge, privacy-preserving outlier detection has not yet been addressed. Outlier detection has wide application; one that has received considerable attention is the search for terrorism. Detecting previously unknown suspicious behavior is a clear outlier detection problem. The search for terrorism has also been the flash point for attacks on data mining by privacy advocates; the U.S. Terrorism Information Awareness program was killed for this reason[18].

Outlier detection has numerous other applications that also raise privacy concerns. Mining for anomalies has been used for network intrusion detection[1, 17]; privacy advocates have responded with research to enhance anonymity[20, 10]. Fraud discovery in the mobile phone industry has also made use of outlier detection[6]; organizations must be careful to avoid overstepping the bounds of privacy legislation[5]. Privacy-preserving outlier detection will ensure these concerns are balanced, allowing us to get the benefits of outlier detection without being thwarted by legal or technical counter-measures.

This paper assumes data is distributed; the stewards of the data are allowed to use it, but disclosing it to others is a privacy violation. The problem is to find distance-based outliers without any party gaining knowledge beyond learning which items are outliers. Ensuring that data is not disclosed maintains privacy, i.e., no privacy is lost beyond that inherently revealed in knowing the outliers. Even knowing which items are outliers need not be revealed to all parties, further preventing privacy breaches.

The approach duplicates the results of the outlier detection algorithm of [14]. The idea is that an object o is an outlier if more than a percentage p of the objects in the data set are farther than distance d from o. The basic idea is that parties compute the portion of the answer they know, then engage in a secure sum to compute the total distance. The key is that this total is (randomly) split between sites, so nobody knows the actual distance. A secure protocol is used to determine if the actual distance between any two points exceeds the threshold; again the comparison results are randomly split such that summing the splits (over a closed field) results in a 1 if the distance exceeds the threshold, or a 0 otherwise.

For a given object o, each site can now sum all of its shares of comparison results (again over the closed field). When added to the sum of shares from other sites, the result
is the correct count; all that remains is to compare it with the percentage threshold p. This addition/comparison is also done with a secure protocol, revealing only the result: whether o is an outlier.

We first discuss the problem we are facing: the different problems posed by vertically and horizontally partitioned datasets, and the formal definition of outlier detection. Section 3 gives privacy-preserving algorithms for both horizontally and vertically partitioned data. We prove the security of the algorithms in Section 4, and discuss the computational and communication complexity of the algorithms in Section 5. We conclude with a discussion of areas for further work on this problem.

2. Data Partitioning Models

The problem as we define it is that the data is inherently distributed; it is sharing (or disclosure to other parties) that violates privacy. The way the data is distributed / partitioned results in very different solutions. We consider two different data partitions: horizontal and vertical. In either case, assume k different parties, P0, ..., Pk−1; m attributes; and n total objects. We now describe the specifics of the different data models considered.

2.1. Horizontally Partitioned Data

With horizontally partitioned (viz. distributed homogeneous) data, different parties collect the same information (features) for different objects. Each party collects information about m attributes, A1, ..., Am. Party Pi collects information about ni objects, such that Σ_{i=0}^{k−1} ni = n (different parties collect information about different entities). Consider the case of several banks that collect similar data about credit card transactions but for different clients. Clearly, the data is horizontally partitioned. Outlier detection is particularly useful in this case to determine potentially fraudulent transactions.

2.2. Vertically Partitioned Data

With vertically partitioned (viz. distributed heterogeneous) data, different parties collect different features for the same set of objects. Party Pi collects information about mi attributes, Ai,1, ..., Ai,mi; the total number of attributes is Σ_{p=0}^{k−1} mp = m. All of the parties hold information about the same n objects. Thus, there are a total of n transactions (with the data for each transaction really being split between the parties). Consider the case of an airline, a banking institution and federal databases. By cross-correlating information and locating outliers we may hope to spot potential terrorist activities.

2.3. Outlier Detection

Our goal is to find Distance Based Outliers. Knorr and Ng [14] define the notion of a Distance Based outlier as follows: An object O in a dataset T is a DB(p,dt)-outlier if at least fraction p of the objects in T lie at distance greater than dt from O. Other distance based outlier techniques also exist[15, 19]. The advantages of distance based outliers are that no explicit distribution needs to be defined to determine unusualness, and that they can be applied to any feature space for which we can define a distance measure. We assume Euclidean distances, although the algorithms are easily extended to general Minkowski distances. There are other non-distance-based techniques for finding outliers, as well as significant work in statistics [2], but we do not consider those in this paper and leave them for future work.

3. Privacy Preserving Outlier Detection

We now present two algorithms for Distance Based Outliers meeting the definition given in Section 2.3. The first is for horizontally partitioned data, the second for vertically partitioned data. Both are based on the obvious algorithm: compare points pairwise and count the number exceeding the distance threshold. The key is that all intermediate computations (such as distance comparisons) leave the results randomly split between the parties involved; only the final result (whether the count exceeds p%) is disclosed.

The pairwise comparison of all points may seem excessive. However, it is necessary to achieve a completely (or even reasonably) secure solution; this will be discussed further in Section 5.3. The asymptotic complexity still equals that of [14].

Note that to obtain a secure solution, all operations are carried out modulo some field. We will use the field D for distances, and F for counts of the number of entities. The field F must be over twice the number of objects. Limits on D are based on maximum distances; details on the size are given with each algorithm.

3.1. Horizontally Partitioned Data

The key idea behind the algorithm for horizontally partitioned data is as follows. For each object i, the protocol iterates over every other object j. If the party holding i also holds j, it can easily find the distance and compare against the threshold. If two different parties hold the two objects, the parties engage in a distance calculation protocol (Section 3.1.1) to get random shares of the distance. A second protocol allows comparing the shares with the threshold, returning 1 if the distance exceeds the threshold, or 0 if it does not. The key to this second protocol is that the 1 or 0 is actually two shares r′q and r′s, such that r′q + r′s = 1 (or 0) (mod F). From one share, the party learns nothing.
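This random split can be illustrated with a short sketch (the function name and field size below are ours for illustration; in the protocol the split is produced inside the secure comparison circuit of Section 3.3, never by a single trusted party):

```python
import secrets

F = 2**31 - 1  # prime field for counts; must exceed twice the number of objects


def split_comparison_bit(bit: int) -> tuple[int, int]:
    """Additively split a 0/1 comparison result into shares r'_q and r'_s.

    Each share alone is uniformly distributed over [0, F), so neither party
    can tell from its own share whether the distance exceeded the threshold.
    """
    r_q = secrets.randbelow(F)   # P_q's share: uniform over F
    r_s = (bit - r_q) % F        # P_s's share completes the sum mod F
    return r_q, r_s


r_q, r_s = split_comparison_bit(1)  # distance exceeded the threshold
assert (r_q + r_s) % F == 1         # shares recombine to the comparison bit
```

Summing many such share pairs (mod F) then yields shares of the count of threshold-exceeding points, which is exactly what the algorithms below exploit.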
Once all points have been compared, the parties sum their shares. Since the shares add to 1 for distances exceeding the distance threshold, and 0 otherwise, the total sum (mod F) is the number of points for which the distance exceeds the threshold. The parties do not actually compute this sum; instead all parties pass their (random) shares to a designate to add, and the designated party and the party holding the point engage in a secure protocol that reveals only whether the sum of the shares exceeds p%. This ensures that no party learns anything except whether the point is an outlier.

Algorithm 1 gives the complete details. Steps 5-23 are the pairwise comparison of two points, giving each party random shares of a 1 (if the points are far apart) or 0 (if the points are within the distance threshold dt). The random split of shares ensures that nothing is learned by either party. In steps 25-28, the parties (except the party Pq holding the object being evaluated) sum their shares. Again, since each share is a random split (and Pq holds the other part of the split), no party learns anything. Finally, Pq−1 and Pq add and compare their shares, revealing only whether the object oi is an outlier. Note that the shares of this comparison are split, and could be sent to any party (Pq in Algorithm 1, but it need not even be one of the Pr). Only that party (e.g., a fraud prevention unit) learns if oi is an outlier; the others learn nothing.

Algorithm 1 Finding DB(p,D)-outliers
Require: k parties, P0, ..., Pk−1; each holding a subset of the objects O.
Require: Fields D larger than the maximum distance squared, F larger than |O|.
 1: for all objects oi ∈ O do {Let Pq be the party holding oi}
 2:   for all parties Pr do
 3:     numr ← 0 (mod F) {Initialize counters}
 4:   end for
 5:   for all objects oj ∈ O, oj ≠ oi do
 6:     if Pq holds oj then
 7:       if Distance(oi, oj) > dt {Computed locally at Pq} then
 8:         At Pq: numq ← numq + 1 (mod F)
 9:       end if
10:     else
11:       {Let Ps hold oj}
12:       {Using the distance computation protocol (Section 3.1.1)}
13:       Pq ← rq and Ps ← rs such that rq + rs (mod D) = Distance^2(oi, oj)
14:       {Using the secure comparison protocol (Section 3.3)}
15:       Pq ← r′q and Ps ← r′s such that:
16:       if rq + rs (mod D) > dt^2 then
17:         r′q + r′s = 1 (mod F)
18:       else
19:         r′q + r′s = 0 (mod F)
20:       end if
21:     end if
22:     At Pq: numq ← numq + r′q
23:     At Ps: nums ← nums + r′s
24:   end for
25:   for all Pr except Pq and Pq−1 (mod k) do
26:     Pr sends numr to Pq−1
27:   end for
28:   At Pq−1: numq−1 ← Σ_{i≠q} numi
29:   {Using the secure comparison of Section 3.3}
30:   Pq ← tempq and Pq−1 ← tempq−1 such that:
31:   if numq + numq−1 (mod F) > |O| ∗ p% then
32:     tempq + tempq−1 ← 1 {oi is an outlier}
33:   else
34:     tempq + tempq−1 ← 0
35:   end if
36:   Pq−1 sends tempq−1 to Pq, revealing to Pq whether oi is an outlier.
37: end for

3.1.1 Computing distance between two points

Step 13 of Algorithm 1 requires computing a distance, but leaving random shares of that distance with two parties rather than revealing the result. For convenience, we actually compute shares of the square of the distance, and compare with the square of the threshold. (This does not change the result, since squaring is a monotonically increasing function.) We now give an algorithm based on secure scalar product for computing shares of the square of the Euclidean distance.

Formally, let there be two parties, P1 and P2. All computations are over a field D larger than the square of the maximum distance. P1's input is the point X, P2's input is the point Y. The outputs are r1 and r2 respectively (independently uniformly distributed over D), such that r1 + r2 = Distance^2(X, Y) (mod D), where Distance(X, Y) is the Euclidean distance between the points X and Y.

Let there be m attributes, and let a point X be represented by its m-dimensional tuple (x1, ..., xm). Each co-ordinate represents the value of the point for that attribute.

The square of the Euclidean distance between X and Y is given by

  Distance^2(X, Y) = Σ_{r=1}^{m} (xr − yr)^2
                   = x1^2 − 2·x1·y1 + y1^2 + ... + xm^2 − 2·xm·ym + ym^2
                   = Σ_{r=1}^{m} xr^2 + Σ_{r=1}^{m} yr^2 − Σ_{r=1}^{m} 2·xr·yr

P1 can independently calculate Σr xr^2. Similarly, P2 can calculate Σr yr^2. As long as there is more than one attribute (i.e., m > 1), the remaining sum Σr (2xr)(−yr) is simply the scalar product of two m-dimensional vectors. P1 and P2 engage in a secure scalar product protocol to get random shares of the dot product. This, added to their previously calculated values, gives each party a random share of the square of the distance. There are many scalar product protocols proposed in the literature [4, 21, 11]; any of these can be used.

Assuming that the scalar product protocol is secure, applying the composition theorem of [8] shows that the entire protocol is secure.

3.2. Vertically Partitioned Data

Vertically partitioned data introduces a different challenge. Each party can compute a share of the pairwise distance locally; the sum of these shares is the total distance. However, the distance must not be revealed, so a secure protocol is used to get shares of the pairwise comparison of distance and threshold. From this point, it is similar to horizontal partitioning: add the shares and determine whether they exceed p%.

An interesting side effect of this algorithm is that the parties need not reveal any information about the attributes they hold, or even the number of attributes. Each party locally determines the distance threshold for its attributes (or more precisely, its share of the overall threshold). Instead of computing the local pairwise distance, each party computes the difference between the local pairwise distance and the local threshold. If the sum of these differences is greater than 0, the pairwise distance exceeds the threshold.

Algorithm 2 gives the full details. In steps 5-9, the sites sum their local distances. The random x added by P0 masks the distances from each party. In steps 11-18, parties P0 and Pk−1 get shares of the pairwise comparison result, as in Algorithm 1. The comparison is a test of whether the sum is greater than 0 (since the threshold has already been subtracted). These two parties keep a running sum of their shares. At the end, these shares are added and compared with the percentage threshold, again as in Algorithm 1.

Algorithm 2 Finding DB(p,D)-outliers
Require: k parties, P0, ..., Pk−1; each holding a subset of the attributes for all objects O.
Require: dtr: local distance threshold for Pr.
Require: Fields D larger than twice the maximum distance, F larger than |O|.
 1: for all objects oi ∈ O do
 2:   m′0 ← m′k−1 ← 0 (mod F)
 3:   for all objects oj ∈ O, oj ≠ oi do
 4:     P0: Randomly choose a number x from a uniform distribution over the field D; x′ ← x
 5:     for r ← 0, ..., k − 2 do
 6:       At Pr: x′ ← x′ + Distancer(oi, oj) − dtr (mod D) {Distancer is the local distance at Pr}
 7:       Pr sends x′ to Pr+1
 8:     end for
 9:     At Pk−1: x′ ← x′ + Distancek−1(oi, oj) − dtk−1 (mod D)
10:     {Using the secure comparison protocol (Section 3.3)}
11:     P0 ← m0 and Pk−1 ← mk−1 such that:
12:     if 0 < x′ + (−x) (mod D) < |D|/2 then
13:       m0 + mk−1 = 1 (mod F)
14:     else
15:       m0 + mk−1 = 0 (mod F)
16:     end if
17:     At P0: m′0 ← m′0 + m0 (mod F)
18:     At Pk−1: m′k−1 ← m′k−1 + mk−1 (mod F)
19:   end for
20:   {Using the secure comparison of Section 3.3}
21:   P0 ← temp0 and Pk−1 ← tempk−1 such that:
22:   if m′0 + m′k−1 (mod F) > |O| ∗ p% then
23:     temp0 + tempk−1 ← 1 {oi is an outlier}
24:   else
25:     temp0 + tempk−1 ← 0
26:   end if
27:   P0 and Pk−1 send temp0 and tempk−1 to the party authorized to learn the result; if temp0 + tempk−1 = 1 then oi is an outlier.
28: end for
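Steps 4-9 of Algorithm 2 amount to a masked secure sum: P0's random x hides every partial total as it travels along the chain of parties. A minimal sketch under our own naming (for clarity, step 12's sign test is shown in the clear here; in the protocol it is performed with the secure comparison of Section 3.3 so that only shares of the result exist):

```python
import secrets

D = 2**40  # field for distances; larger than twice the maximum distance


def masked_chain_sum(local_diffs):
    """Each party adds (local distance share - local threshold share) mod D.

    P0 seeds the chain with a random mask x, so no party along the chain
    ever sees a partial sum of the real differences.
    """
    x = secrets.randbelow(D)
    x_prime = x
    for diff in local_diffs:        # party r adds Distance_r - dt_r
        x_prime = (x_prime + diff) % D
    return x, x_prime


# Three parties; their differences sum to 105 > 0, so the pairwise
# distance exceeds the (distributed) threshold.
x, x_prime = masked_chain_sum([120, -45, 30])
total = (x_prime - x) % D           # x' + (-x) (mod D)
assert 0 < total < D // 2           # step 12's test: positive difference
```

Because each local difference may be negative, the residue lands in (0, |D|/2) exactly when the true sum is positive, which is why D must exceed twice the maximum distance.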
Theorem 1 (Correctness) Algorithm 2 correctly returns as output the complete set of points that are global outliers.

PROOF. To prove the correctness of Algorithm 2, it is sufficient to prove that a point is reported as an outlier if and only if it is truly an outlier. Consider point q. If q is an outlier, then in steps 12-16, for at least p% ∗ |O| + 1 of the other points, m0 + mk−1 = 1 (mod F). Since |F| > |O|, it follows that m′0 + m′k−1 > |O| ∗ p%. Therefore, point q will be correctly reported as an outlier. If q is not an outlier, the same argument applies in reverse: in steps 12-16, for at most p% ∗ |O| − 1 points, m0 + mk−1 = 1 (mod F). Again, since |F| > |O|, it follows that m′0 + m′k−1 ≤ |O| ∗ p%. Therefore, point q will not be reported as an outlier.

3.3. Modified Secure Comparison Protocol

At several stages in the protocol, we need to securely compare the sum of two numbers, with the output split between the parties holding those numbers. This can be accomplished using the generic circuit evaluation technique first proposed by Yao[23]. Formally, we need a modified secure comparison protocol for two parties, A and B. The local inputs are xa and xb and the local outputs are ya and yb. All operations on the inputs are in a field F1 and the outputs are in a field F2: ya + yb = 1 (mod F2) if xa + xb (mod F1) > 0, otherwise ya + yb = 0 (mod F2). A final requirement is that ya and yb should be independently uniformly distributed over F2 (clearly the joint distribution is not uniform).

This builds on the standard secure multiparty computation circuit-based approach for solving this problem[8]. Effectively, A chooses ya with a uniform distribution over F2, and provides it as an additional input to the circuit, which appropriately computes yb. The circuit is then securely evaluated, with B receiving the output yb. The complexity is equivalent to that of Yao's Millionaires' problem (simple secure comparison).

4. Security Analysis

The security argument for these algorithms uses proof techniques from Secure Multiparty Computation. The idea is that since what a party sees during the protocol (its shares) is randomly chosen from a uniform distribution over a field, it learns nothing in isolation. (Of course, collusion with other parties could reveal information, since the joint distribution of the shares is not random.) The proof is based on a simulation argument: if we can define a simulator that uses the algorithm output and a party's own data to simulate the messages seen by that party during a real execution of the protocol, then the real execution does not give away any new information.

To formalize this, we first give some definitions from [8]. We then give the proofs of security for Algorithms 1 and 2.

4.1 Secure Multi-Party Computation

Yao first postulated the two-party comparison problem (Yao's Millionaires' Problem) and developed a provably secure solution[23]. This was extended to multiparty computations by Goldreich et al.[9]. They developed a framework for secure multiparty computation, and in [8] proved that computing a function privately is equivalent to computing it securely.

We start with the definitions for security in the semi-honest model. A semi-honest party follows the rules of the protocol using its correct input, but is free to later use what it sees during execution of the protocol to compromise security. A formal definition of private two-party computation in the semi-honest model is given below.

Definition 1 (privacy w.r.t. semi-honest behavior)[8]:
Let f : {0,1}* × {0,1}* −→ {0,1}* × {0,1}* be a probabilistic, polynomial-time functionality, where f1(x, y) (respectively, f2(x, y)) denotes the first (resp., second) element of f(x, y), and let Π be a two-party protocol for computing f.

Let the view of the first (resp., second) party during an execution of protocol Π on (x, y), denoted view1^Π(x, y) (resp., view2^Π(x, y)), be (x, r1, m1, ..., mt) (resp., (y, r2, m1, ..., mt)), where r1 (resp., r2) represents the outcome of the first (resp., second) party's internal coin tosses, and mi represents the i-th message it has received.

The output of the first (resp., second) party during an execution of Π on (x, y) is denoted output1^Π(x, y) (resp., output2^Π(x, y)) and is implicit in the party's view of the execution.

Π privately computes f if there exist probabilistic polynomial-time algorithms, denoted S1 and S2, such that

  {(S1(x, f1(x, y)), f2(x, y))}_{x,y ∈ {0,1}*} ≡_C {(view1^Π(x, y), output2^Π(x, y))}_{x,y ∈ {0,1}*}

  {(f1(x, y), S2(y, f2(x, y)))}_{x,y ∈ {0,1}*} ≡_C {(output1^Π(x, y), view2^Π(x, y))}_{x,y ∈ {0,1}*}

where ≡_C denotes computational indistinguishability.

As we shall see, our protocol is actually somewhat stronger than the semi-honest model, although it does not meet the full malicious-model definition of [8].
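Viewed as an ideal functionality in the sense of Definition 1, the modified comparison of Section 3.3 maps share inputs to share outputs as follows (a sketch with illustrative field sizes of our choosing; the sign test is the simplified "> 0" form of Section 3.3, not the interval test of Algorithm 2):

```python
import secrets

F1 = 2**40      # input field (sums of distance differences)
F2 = 2**31 - 1  # output field (counts)


def modified_comparison(x_a: int, x_b: int) -> tuple[int, int]:
    """Ideal functionality: shares y_a, y_b of [x_a + x_b (mod F1) > 0].

    y_a is chosen uniformly over F2, so each output share in isolation is
    independent of the comparison result, as the protocol requires.
    """
    bit = 1 if (x_a + x_b) % F1 > 0 else 0
    y_a = secrets.randbelow(F2)
    y_b = (bit - y_a) % F2
    return y_a, y_b


y_a, y_b = modified_comparison(5, 3)   # 5 + 3 > 0
assert (y_a + y_b) % F2 == 1
y_a, y_b = modified_comparison(0, 0)   # sum is 0, not greater
assert (y_a + y_b) % F2 == 0
```

A simulator for either party simply draws a uniform value from F2, which is exactly the distribution of that party's output share here; this is the pattern the proofs below follow.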
Privacy by Simulation. The above definition says that a computation is secure if the view of each party during the execution of the protocol can be effectively simulated given only the input and the output of that party. Thus, in all of our proofs of security, we only need to show the existence of a simulator for each party that satisfies the above equations.

This does not quite guarantee that private information is protected. Whatever information can be deduced from the final result obviously cannot be kept private. For example, if a party learns that point A is an outlier, but a point B that is close to A is not an outlier, it learns an estimate of the number of points that lie in the space between the hypersphere around A and the hypersphere around B. Here, the result reveals information to the site holding A and B. The key to the definition of privacy is that nothing is learned beyond what is inherent in the result.

A key result we use is the composition theorem. We state it for the semi-honest model. A detailed discussion of this theorem, as well as its proof, can be found in [8].

Theorem 2 (Composition Theorem for the semi-honest model) Suppose that g is privately reducible to f and that there exists a protocol for privately computing f. Then there exists a protocol for privately computing g.

PROOF. Refer to [8].

4.2. Horizontally Partitioned Data

Theorem 3 Algorithm 1 returns as output the set of points that are global outliers, and reveals no other information to any party, provided the parties do not collude.

PROOF. Presuming that the number of objects |O| is

sq) randomly from a uniform distribution over D. Then ∀x ∈ D, Pr(sq = x) = 1/|D|. Thus, sq is easily simulated by simply choosing a random value from D. Let the result r = Σr (2xr)(−yr) be fixed. Then ∀y ∈ D, Pr(ss = y) = Pr(r − sq = y) = Pr(sq = r − y) = 1/|D|. Therefore, the simulator for Ps can simulate this message by simply choosing a random number from a uniform distribution over D. Assuming that the scalar product protocol is secure, applying the composition theorem shows that step 13 is secure.

Steps 15 and 30: The simulator for party Pq (respectively Ps) again chooses a number randomly from a uniform distribution, this time over the field F. By the same argument as above, the actual values are uniformly distributed, so the probabilities of the simulator and the real protocol choosing any particular value are the same. Since a circuit for secure comparison is used, by the composition theorem no additional information is leaked and step 15 is secure.

Step 26: Pq−1 receives several shares numr. However, note that each numr is a sum whose components are all random shares from step 15. Since Pq−1 receives only shares from the Ps in step 15, and receives none from Pq, all of the shares in the sum are independent. The sum numr can thus be simulated by choosing a random value from a uniform distribution over F.

Step 36: Since Pq knows the result (1 if oi is an outlier, 0 otherwise), and tempq was simulated in step 30, it can simulate tempq−1 as the result (1 or 0) − tempq (mod F).

The simulator clearly runs in polynomial time (the same
known globally, each party can locally set up and run its         as the algorithm). Since each party is able to simulate the
own components of Algorithm 1 (e.g., a party only needs to        view of its execution (i.e., the probability of any particu-
worry about its local objects in the “For all objects” state-     lar value is the same as in a real execution with the same
ments at lines 1 and 5.) In the absence of some type of           inputs/results) in polynomial time, the algorithm is secure
secure anonymous send[20, 10] (e.g., anonymous transmis-          with respect to Definition 1.
sion with public key cryptography to ensure reception only
by the correct party), the number of objects at each site is          While the proof is formally only for the semi-honest
revealed. Since at least an upper bound on the number of          model, it can be seen that a malicious party in isolation can-
items is inherently revealed by the running time of the algo-     not learn private values (regardless of what it does, it is still
rithm, we assume these values are known.                          possible to simulate what it sees without knowing the input
    The next problem is to simulate the messages seen by          of the other parties.) This assumes that the underlying scalar
each party during the algorithm. Communication occurs             product and secure comparison protocols are secure against
only at steps 13, 15, 24, 30, and 36. We now describe the         malicious behavior. A malicious party can cause incorrect
simulation independently.                                         results, but it cannot learn private data values.

Step 13: Pq and Ps each receive a share of the square             4.3. Vertically Partitioned Data
of the distance. As can be seen in Section 3.1.1, all parts
of the shares are computed locally except for shares of           Theorem 4 Algorithm 2 returns as output the set of points
the scalar product. Assume that the scalar product pro-           that are global outliers while revealing no other information
tocol chooses shares by selecting the share for Pq (call it       to any party, provided parties do not collude.
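The random-share argument used repeatedly above can be checked concretely. The following sketch is illustrative only: the modulus P stands in for the field or domain the protocol actually uses, and none of these names come from the paper. It enumerates every possible choice of the uniform share sq and confirms that the other share ss = r − sq is itself uniformly distributed, independent of the fixed result r; this is exactly the property that lets a simulator replace the real message with a uniformly random value.

```python
from collections import Counter

P = 101  # hypothetical field size; the paper's F and D are protocol parameters

def shares(result, p=P):
    """Additively share a fixed result: sq is uniform, ss = result - sq (mod p).

    We enumerate every random choice of sq instead of sampling, so that the
    resulting distributions can be compared exactly.
    """
    return [(sq, (result - sq) % p) for sq in range(p)]

def dist_of_ss(result):
    """Exact distribution of the share ss seen by Ps for a fixed result."""
    return Counter(ss for _, ss in shares(result))

# The share ss is exactly uniform, and identical for any two fixed results:
d1, d2 = dist_of_ss(7), dist_of_ss(42)
assert d1 == d2                    # distribution does not depend on the result
assert set(d1.values()) == {1}     # every field element equally likely: uniform
```

Because the distributions coincide for every fixed result, the simulator's uniformly random output and the real protocol message are identically distributed, which is the condition required by Definition 1.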
4.3. Vertically Partitioned Data

Theorem 4 Algorithm 2 returns as output the set of points that are global outliers, while revealing no other information to any party, provided parties do not collude.

PROOF. All parties know the number (and identity) of the objects in O. Thus they can set up the loops; the simulator just runs the algorithm to generate most of the simulation. The only communication is at lines 7, 11, 21, and 27.

Step 7: Each party Ps sees x′ = x + Σ_{r=0}^{s−1} Distancer(oi, oj), where x is the random value chosen by P0. Pr(x′ = y) = Pr(x + Σ_{r=0}^{s−1} Distancer(oi, oj) = y) = Pr(x = y − Σ_{r=0}^{s−1} Distancer(oi, oj)) = 1/|D|. Thus we can simulate the value received by choosing a random value from a uniform distribution over D.

Steps 11 and 21: Each step is again a secure comparison, so messages are simulated as in steps 15 and 30 of Theorem 3.

Step 27: This is again the final result, simulated as in step 36 of Theorem 3. temp0 is simulated by choosing a random value, and temp1 = result − temp0. By the same argument on random shares used above, the distribution of simulated values is indistinguishable from the distribution of the shares.

Again, the simulator clearly runs in polynomial time (the same as the algorithm). Since each party is able to simulate the view of its execution (i.e., the probability of any particular value is the same as in a real execution with the same inputs/results) in polynomial time, the algorithm is secure with respect to Definition 1.

Absent collusion, and assuming a malicious-model secure comparison, a malicious party is unable to learn anything it could not learn by altering its input. Step 7 is particularly sensitive to collusion, but can be improved (at a cost) by splitting the sum into shares and performing several such sums (see [12] for more discussion of collusion-resistant secure sum).

5. Computation and Communication Analysis

Both Algorithm 1 and Algorithm 2 suffer the drawback of quadratic computation complexity due to the nested iteration over all objects. This is unavoidable in a completely secure algorithm, as will be discussed in Section 5.3.

Due to the quadratic complexity, Algorithm 1 requires O(n²) distance computations and secure comparisons (steps 12-20), where n is the total number of objects. Similarly, Algorithm 2 requires O(n²) secure comparisons (steps 10-16). While operation parallelism can be used to reduce the round complexity of communication, the key practical issue is the computational cost of the encryption required for the secure comparison and scalar product protocols.

When there are three or more parties, assuming no collusion, we can develop much more efficient solutions that reveal some information. While not completely secure, the privacy versus cost tradeoff may be acceptable in some situations.

5.1. Horizontally Partitioned Data

With horizontally partitioned data, we can use a semi-trusted third party to perform comparisons and return random shares. The two comparing parties simply give the values to be compared to the third party, which adds and compares them. As long as the third party does not collude with either of the comparing parties, the comparing parties learn nothing.

The real question is, what is disclosed to the third party? Since the data is horizontally partitioned, the third party has no idea of the respective locations of the two objects; all it can learn is the distance between them. While this distance is information that is not part of the result, by itself it is not very significant, and revealing it allows a tremendous increase in efficiency: the cost of secure comparison is reduced to a total of 4 messages (which can be combined across all comparisons performed by the pair, for a constant number of rounds of communication) and insignificant computation.

5.2. Vertically Partitioned Data

The simple approach used for horizontal partitioning is not suitable for vertically partitioned data. Since all of the parties share all of the points, partial knowledge about a point does reveal useful information to a party. Instead, one of the remaining parties is chosen to play the part of a completely untrusted, non-colluding party. Under this assumption, Cachin [3] gives a much more efficient secure comparison algorithm that reveals nothing to the third party. The algorithm is otherwise equivalent, but the cost of the comparisons is reduced substantially.

5.3. Why is quadratic complexity necessary for privacy?

Current outlier detection algorithms have focused on reducing the naïve quadratic complexity of the problem. However, for any point-comparison based algorithm, each point must be compared with every other point. Any algorithm that excludes some points from consideration inherently compromises security: the exclusion of a point from consideration reveals information about the relative position of that point.
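To make the all-pairs requirement concrete, here is a minimal centralized (and deliberately non-private) sketch of distance-based outlier detection in the spirit of Knorr and Ng [14]; the function and parameter names are illustrative and are not Algorithm 1 or 2 from this paper. Every point is compared against every other point, which is the O(n²) cost discussed above; pruning any of these comparisons is precisely what leaks relative-position information.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def distance_based_outliers(points, d, p):
    """Return the points having fewer than a fraction p of the other points
    within distance d.

    Naive all-pairs version: every point is compared with every other point,
    so no skipped comparison can leak relative positions (there are none),
    at the price of O(n^2) distance computations.
    """
    n = len(points)
    outliers = []
    for i, o in enumerate(points):
        neighbors = sum(1 for j, x in enumerate(points)
                        if j != i and dist(o, x) <= d)
        if neighbors < p * (n - 1):
            outliers.append(o)
    return outliers

# Tight cluster plus one far-away point: only the far point is an outlier.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
print(distance_based_outliers(pts, d=1.0, p=0.5))  # -> [(5.0, 5.0)]
```

A secure version replaces the distance computation and the threshold comparison with the secure distance and secure comparison protocols of the earlier sections, but it must keep exactly this all-pairs loop structure.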
Consider the model of vertically partitioned data. Suppose that, for point A, the algorithm decides to exclude point B from consideration. Since all parties know the transaction identifiers for points A and B, a party for which the two points are far apart locally knows that the two points must be very close for the other parties. On the other hand, if a point is selected as an outlier only after computing distances to x other points, then all parties now know that the point is far away from at least p% of those points. Thus, extra information has been revealed.

Even with horizontal partitioning of the data, probabilistic estimates of point locations (or clusters) are possible. If one of the points under consideration is owned by you, you gain considerable information about other points if the search process ends early; over time, this allows the determination of clusters at other sites, based on knowing that they have a high number of outliers relative to one's own data. While not as obvious as the problems that may occur with vertically partitioned data, this poses an unknown hazard to privacy.

This forces the run-time complexity of any secure algorithm to be quadratic in every case, not merely in the worst case. If one is willing to compromise on security, it is possible to come up with more efficient algorithms. However, the quantification of (in)security in that case is so amorphous that it may not be justifiable to regard such an algorithm as secure.

6. Conclusion

In this paper, we have presented privacy-preserving solutions for finding distance-based outliers in distributed data sets, and proven their security. One contribution of the paper is to point out that quadratic complexity is a necessity for secure solutions to the problem: at most constant-time improvements are possible over the algorithms given.

We are currently implementing these schemes and integrating them into software packages (e.g., Weka [22]) to enable a practical evaluation of the computational cost. Another important problem is to develop privacy-preserving methods of space transformation [16], allowing additional distance-based operations to be performed in a secure manner.

References

[1] D. Barbará, N. Wu, and S. Jajodia. Detecting novel network intrusions using Bayes estimators. In First SIAM International Conference on Data Mining, Chicago, Illinois, Apr. 5-7 2001.
[2] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley and Sons, 3rd edition, 1994.
[3] C. Cachin. Efficient private bidding and auctions with an oblivious third party. In Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 120-127. ACM Press, 1999.
[4] W. Du and M. J. Atallah. Privacy-preserving statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, New Orleans, Louisiana, USA, December 10-14 2001.
[5] Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data. Official Journal of the European Communities, No. L 281:31-50, Oct. 24 1995.
[6] K. J. Ezawa and S. W. Norton. Constructing Bayesian networks to predict uncollectible telecommunications accounts. IEEE Expert, 11(5):45-51, Oct. 1996.
[7] M. Feingold, M. Corzine, M. Wyden, and M. Nelson. Data-mining moratorium act of 2003. U.S. Senate Bill (proposed), Jan. 16 2003.
[8] O. Goldreich. The Foundations of Cryptography, volume 2, chapter General Cryptographic Protocols. Cambridge University Press, 2004.
[9] O. Goldreich, S. Micali, and A. Wigderson. How to play any mental game - a completeness theorem for protocols with honest majority. In 19th ACM Symposium on the Theory of Computing, pages 218-229, 1987.
[10] D. Goldschlag, M. Reed, and P. Syverson. Onion routing. Commun. ACM, 42(2):39-41, Feb. 1999.
[11] I. Ioannidis, A. Grama, and M. Atallah. A secure protocol for computing dot-products in clustered and distributed environments. In The 2002 International Conference on Parallel Processing, Vancouver, British Columbia, Aug. 18-21 2002.
[12] M. Kantarcıoğlu and C. Clifton. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 16(9):1026-1037, Sept. 2004.
[13] Special section on privacy and security. SIGKDD Explorations, 4(2):i-48, Jan. 2003.
[14] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998), pages 392-403, New York City, NY, USA, Aug. 24-27 1998.
[15] E. M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: algorithms and applications. The VLDB Journal, 8(3-4):237-253, 2000.
[16] E. M. Knorr, R. T. Ng, and R. H. Zamar. Robust space transformations for distance-based operations. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 126-135, San Francisco, California, 2001. ACM Press.
[17] A. Lazarevic, A. Ozgur, L. Ertoz, J. Srivastava, and V. Kumar. A comparative study of anomaly detection schemes in network intrusion detection. In SIAM International Conference on Data Mining (2003), San Francisco, California, May 1-3 2003.
[18] M. Lewis. Department of Defense Appropriations Act, 2004, July 17 2003. Title VIII section 8120. Enacted as Public Law 108-87.
[19] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.
[20] M. K. Reiter and A. D. Rubin. Crowds: Anonymity for Web transactions. ACM Transactions on Information and System Security, 1(1):66-92, Nov. 1998.
[21] J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 639-644, Edmonton, Alberta, Canada, July 23-26 2002.
[22] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, Oct. 1999.
[23] A. C. Yao. How to generate and exchange secrets. In Proceedings of the 27th IEEE Symposium on Foundations of Computer Science, pages 162-167. IEEE, 1986.
