           Algorithms for Nearest Neighbor Search
           Piotr Indyk
     Nearest Neighbor Search
• Given: a set P of n points in R^d
• Goal: a data structure, which given a query
  point q, finds the nearest neighbor p of q
  in P
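
As a baseline for everything that follows, the problem can be solved by a linear scan. This is a minimal sketch, assuming Euclidean distance and points stored as tuples; the function name is illustrative and not from the talk.

```python
import math

def nearest_neighbor_linear_scan(points, q):
    """Return (point, distance) for the point of `points` closest to q."""
    best, best_dist = None, math.inf
    for p in points:
        dist = math.dist(p, q)      # O(d) work per point, O(dn) in total
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist

P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
nearest, dist = nearest_neighbor_linear_scan(P, (0.9, 1.2))
```

Every data structure in the talk is an attempt to beat this O(dn) scan.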

          Outline of this talk
• Variants
• Motivation
• Main memory algorithms:
  – quadtrees
  – kd-trees
  – Locality Sensitive Hashing
• Secondary storage algorithms:
  – R-tree (and its variants)
  – VA-file
    Variants of nearest neighbor
• Near neighbor (range search): find one/all
  points in P within distance r from q
• Spatial join: given two sets P,Q, find all
  pairs p in P, q in Q, such that p is within
  distance r from q
• Approximate near neighbor: find one/all
  points p’ in P, whose distance to q is at
  most (1+ε) times the distance from q to its
  nearest neighbor
               Motivation
Depends on the value of d:
• low d: graphics, vision, GIS, etc
• high d:
  – similarity search in databases (text, images etc)
  – finding pairs of similar objects (e.g., copyright
    violation detection)
  – useful subroutine for clustering
• Main memory (Computational Geometry)
  – linear scan
  – tree-based:
     • quadtree
     • kd-tree
  – hashing-based: Locality-Sensitive Hashing
• Secondary storage (Databases)
  – R-tree (and numerous variants)
  – Vector Approximation File (VA-file)
                 Quadtree
• Simplest spatial structure on Earth !
               Quadtree ctd.
• Split the space into 2^d equal subsquares
• Repeat until done:
  – only one pixel left
  – only one point left
  – only a few points left
• Variants:
  – split only one dimension at a time
  – k-d-trees (in a moment)
                 Range search
• Near neighbor (range search):
  – put the root on the stack
  – repeat
     • pop the next node T from the stack
     • for each child C of T:
        – if C is a leaf, examine point(s) in C
        – if C intersects with the ball of radius r around q, add C to
          the stack
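
The construction and the stack-based range search can be sketched together for d = 2. This is an illustrative sketch, not the talk's implementation: the class and function names are made up, points are assumed to lie in the half-open root box, and `max_depth` stands in for the "only one pixel left" stopping rule.

```python
import math

class QuadNode:
    def __init__(self, box, points, leaf_size=1, depth=0, max_depth=16):
        self.box = box              # (x0, y0, x1, y1), half-open on the upper sides
        self.points = points        # non-empty only at leaves
        self.children = []
        if len(points) <= leaf_size or depth >= max_depth:
            return                  # stop: few points left, or the cell is tiny
        x0, y0, x1, y1 = box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        # 2^d = 4 equal subsquares
        for cx0, cy0, cx1, cy1 in [(x0, y0, mx, my), (mx, y0, x1, my),
                                   (x0, my, mx, y1), (mx, my, x1, y1)]:
            inside = [p for p in points
                      if cx0 <= p[0] < cx1 and cy0 <= p[1] < cy1]
            self.children.append(QuadNode((cx0, cy0, cx1, cy1), inside,
                                          leaf_size, depth + 1, max_depth))
        self.points = []

def box_intersects_ball(box, q, r):
    """True if the square `box` intersects the ball of radius r around q."""
    x0, y0, x1, y1 = box
    dx = max(x0 - q[0], 0.0, q[0] - x1)
    dy = max(y0 - q[1], 0.0, q[1] - y1)
    return dx * dx + dy * dy <= r * r

def range_search(root, q, r):
    """Return all points within distance r of q."""
    found, stack = [], [root]       # put the root on the stack
    while stack:
        node = stack.pop()
        if not node.children:       # leaf: examine its point(s)
            found.extend(p for p in node.points if math.dist(p, q) <= r)
        else:                       # inner node: keep only intersecting children
            stack.extend(c for c in node.children
                         if box_intersects_ball(c.box, q, r))
    return found

pts = [(1, 1), (2, 6), (5, 5), (6, 2), (7, 7)]
tree = QuadNode((0, 0, 8, 8), pts)
result = range_search(tree, (5, 4), 2.5)
```

Note that empty subsquares are still created here, which is exactly the "empty spaces" cost discussed a few slides later.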
Near neighbor ctd
           Nearest neighbor
• Start range search with r = ∞
• Whenever a point is found, update r
• Only investigate nodes with respect to
  current r
              Quadtree ctd.
• Simple data structure
• Versatile, easy to implement
• So why doesn’t this talk end here ?
  – Empty spaces: if the points form sparse clouds,
    it takes a while to reach them
  – Space exponential in dimension
  – Time exponential in dimension, e.g., points on
    the hypercube
Space issues: example
       K-d-trees [Bentley’75]
• Main ideas:
  – only one-dimensional splits
  – instead of splitting in the middle, choose the
    split “carefully” (many variations)
  – near(est) neighbor queries: as for quadtrees
• Advantages:
  – no (or less) empty spaces
  – only linear space
• Exponential query time still possible
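
A minimal k-d tree sketch, assuming the median point as the split (only one of the many possible "careful" choices) and a dict-based node layout chosen for brevity. The query is a range search that starts with an infinite radius and shrinks it whenever a closer point is found; all names are illustrative.

```python
import math

def build_kdtree(points, depth=0):
    """One-dimensional splits, cycling through the coordinates."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2          # split at the median instead of the middle of space
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, q, best=None, best_dist=math.inf):
    """Return (point, distance); best_dist plays the role of the current r."""
    if node is None:
        return best, best_dist
    dist = math.dist(node["point"], q)
    if dist < best_dist:            # a point was found: update r
        best, best_dist = node["point"], dist
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best, best_dist = nearest(near, q, best, best_dist)
    if abs(diff) < best_dist:       # the ball of radius r crosses the split
        best, best_dist = nearest(far, q, best, best_dist)
    return best, best_dist

P = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(P)
found, found_dist = nearest(tree, (9, 2))
```

The tree stores each point exactly once, which is where the linear space bound comes from; the pruning test is what can fail to prune in high dimensions.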
        Exponential query time
• What does it mean exactly ?
   – Unless we do something really stupid, query time is at
     most dn
   – Therefore, the actual query time is
                   Min[ dn, exponential(d) ]
• This is still quite bad though, when the dimension
  is around 20-30
• Unfortunately, it seems inevitable (both in theory
  and practice)
  Approximate nearest neighbor
• Can do it using (augmented) k-d trees, by
  interrupting search earlier [Arya et al’94]
• Still exponential time (in the worst case)!
• Try a different approach:
  – for exact queries, we can use binary search
    trees or hashing
  – can we adapt hashing to nearest neighbor
    search ?
    Locality-Sensitive Hashing
• Hash functions are locality-sensitive if, for
  a random hash function h and any
  pair of points p,q, we have:
  – Pr[h(p)=h(q)] is “high” if p is “close” to q
  – Pr[h(p)=h(q)] is “low” if p is “far” from q
      Do such functions exist ?
• Consider the hypercube, i.e.,
  – points from {0,1}^d
  – Hamming distance D(p,q)= # positions on
    which p and q differ
• Define hash function h by choosing a set I
  of k random coordinates, and setting
          h(p) = projection of p on I
• Take
  – d=10, p=0101110010
  – k=2, I={2,5}
• Then h(p)=11
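
The worked example can be reproduced in a few lines. This sketch assumes points are given as bit strings and coordinates are numbered from 1, as on the slide; `make_hash` is an illustrative name.

```python
def make_hash(I):
    """Return h with h(p) = projection of bit string p onto coordinates I (1-based)."""
    def h(p):
        return "".join(p[i - 1] for i in I)
    return h

h = make_hash([2, 5])
p = "0101110010"
q_close = "0111110010"   # differs from p only in position 3
q_other = "0001110010"   # differs from p in position 2, which is in I

print(h(p))              # 11
print(h(q_close))        # 11, collides with p
print(h(q_other))        # 01, does not collide
```

In the actual scheme the set I is chosen at random; it is fixed here only to match the slide.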
      h’s are locality-sensitive
• Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k
• We can vary the probability by changing k

[Figure: two plots of Pr[h(p)=h(q)] against distance, one for k=1 and one for k=2]
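
The collision probability follows in one step, under the assumption (made here for the derivation) that the k coordinates i_1, ..., i_k are drawn independently and uniformly from {1, ..., d}. A single random coordinate agrees on p and q with probability 1 - D(p,q)/d, so:

```latex
\Pr[h(p) = h(q)]
  = \prod_{j=1}^{k} \Pr\left[p_{i_j} = q_{i_j}\right]
  = \left(1 - \frac{D(p,q)}{d}\right)^{k}
```

For example, with d = 10, D(p,q) = 2 and k = 2 this gives 0.8^2 = 0.64.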
       How can we use LSH ?
• Choose several h_1..h_l
• Initialize a hash array for each h_i
• Store each point p in the bucket h_i(p) of the
  i-th hash array, i=1..l
• In order to answer query q
  – for each i=1..l, retrieve points in the bucket h_i(q)
  – return the closest point found
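
The steps above can be sketched for the Hamming cube as follows. This is an illustration under stated assumptions, not the reference implementation: points are bit strings, each hash function samples k coordinates with replacement, Python dictionaries stand in for the hash arrays, and the class name and seed parameter are made up.

```python
import random
from collections import defaultdict

def hamming(p, q):
    """Number of positions on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))

class HammingLSH:
    def __init__(self, d, k, l, seed=0):
        rng = random.Random(seed)
        # l hash functions, each defined by k random coordinates (0-based)
        self.index_sets = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, i, p):
        return "".join(p[j] for j in self.index_sets[i])

    def insert(self, p):
        # store p in bucket h_i(p) of the i-th table, for every i
        for i, table in enumerate(self.tables):
            table[self._key(i, p)].append(p)

    def query(self, q):
        # collect the points in bucket h_i(q) of every table
        candidates = set()
        for i, table in enumerate(self.tables):
            candidates.update(table.get(self._key(i, q), []))
        # return the closest candidate, or None if every bucket was empty
        return min(candidates, key=lambda p: hamming(p, q), default=None)

data = ["0101110010", "0101110011", "1010001101", "1111111111"]
index = HammingLSH(d=10, k=2, l=4)
for p in data:
    index.insert(p)
result = index.query("0101110110")
```

The query may return None or a point that is not the true nearest neighbor; the guarantee is probabilistic and depends on k and l.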
  What does this algorithm do ?
• By proper choice of parameters k and l, we can
  make, for any p, the probability that
                  h_i(p)=h_i(q) for some i
  look like this:

• Can control:
   – Position of the slope
   – How steep it is
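
The shape of this curve can be computed directly. The sketch assumes the l hash functions are chosen independently, each with single-function collision probability (1 - D(p,q)/d)^k; the function name and the parameter values are illustrative.

```python
def collision_probability(dist, d, k, l):
    """Probability that h_i(p) = h_i(q) for at least one i in 1..l."""
    p_single = (1 - dist / d) ** k      # one hash function with k coordinates
    return 1 - (1 - p_single) ** l      # at least one of l independent functions

d = 100
for dist in (0, 5, 10, 20, 40, 80):
    print(dist, round(collision_probability(dist, d, k=10, l=20), 3))
```

Increasing k pushes the drop toward smaller distances, and increasing l pushes it back out, so the pair (k, l) sets both where the slope sits and how steep it is.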
            The LSH algorithm
• Therefore, we can solve (approximately) the near
  neighbor problem with given parameter r
• Worst-case analysis guarantees dn^(1/(1+ε)) query time
• Practical evaluation indicates much better behavior
• Drawbacks:
      • works best for Hamming distance (although can be generalized
        to Euclidean space)
      • requires radius r to be fixed in advance
          Secondary storage
• Seek time same as time needed to transfer
  hundreds of KBs
• Grouping the data is crucial
• Different approach required:
  – in main memory, any reduction in the number
    of inspected points was good
  – on disk, this is not the case !
        Disk-based algorithms
• R-tree [Guttman’84]
  – departing point for many variations
  – over 600 citations ! (according to CiteSeer)
  – “optimistic” approach: try to answer queries in
    logarithmic time
• Vector Approximation File [WSB’98]
  – “pessimistic” approach: if we need to scan the whole
    data set, we better do it fast
• LSH also works on disk
                   R-tree
• “Bottom-up” approach (k-d-tree was “top-down”):
  – Start with a set of points/rectangles
  – Partition the set into groups of small cardinality
  – For each group, find minimum rectangle
    containing objects from this group
  – Repeat
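
The bottom-up construction can be sketched as below. The slide does not say how to partition into groups, so this sketch assumes a simple rule (sort by the x-centre of each box and cut into consecutive runs); that rule, the tuple-based node layout, and the function names are assumptions for illustration, not Guttman's algorithm.

```python
def bounding_box(rects):
    """Minimum rectangle (x0, y0, x1, y1) containing all given rectangles."""
    return (min(r[0] for r in rects), min(r[1] for r in rects),
            max(r[2] for r in rects), max(r[3] for r in rects))

def build_rtree(rects, group_size=4):
    """Return the root as (box, children); input rectangles become (box, None)."""
    if not rects or group_size < 2:
        raise ValueError("need at least one rectangle and group_size >= 2")
    level = [(tuple(r), None) for r in rects]     # start with the objects
    while len(level) > 1:
        # assumed grouping rule: order by x-centre, then cut into small groups
        level.sort(key=lambda node: node[0][0] + node[0][2])
        groups = [level[i:i + group_size]
                  for i in range(0, len(level), group_size)]
        # one parent per group, holding the group's minimum bounding rectangle
        level = [(bounding_box([box for box, _ in g]), g) for g in groups]
    return level[0]

points = [(1, 1), (2, 5), (3, 2), (6, 6), (7, 1),
          (8, 4), (9, 9), (4, 8), (5, 3), (0, 7)]
rects = [(x, y, x, y) for x, y in points]         # points as degenerate rectangles
root_box, children = build_rtree(rects, group_size=3)
```

Because every node stores only the tight bounding rectangle of its contents, regions without data are not represented at all.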
                 R-tree ctd.
• Advantages:
  – Supports near(est) neighbor search (similar as
  – Works for points and rectangles
  – Avoids empty spaces
  – Many variants: X-tree, SS-tree, SR-tree etc
  – Works well for low dimensions
• Not so great for high dimensions
VA-file [Weber, Schek, Blott’98]
• Approach:
  – In high-dimensional spaces, all tree-based
    indexing structures examine large fraction of
  – If we need to visit so many nodes anyway, it is
    better to scan the whole data set and avoid
    performing seeks altogether
  – 1 seek = transfer of a few hundred KB
               VA-file ctd.
• Natural question: how to speed-up linear
  scan ?
• Answer: use approximation
  – Use only i bits per dimension (and speed-up the
    scan by a factor of 32/i)
  – Identify all points which could be returned as
    an answer
  – Verify the points using original data set
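
The three steps can be sketched as a two-phase scan. This is a simplified illustration, assuming coordinates in [0, 1), a uniform grid with `bits` bits per dimension, and in-memory lists in place of files; the function names are made up and the code is not the VA-file authors' implementation.

```python
import math
import random

def quantize(point, bits):
    """Approximate each coordinate in [0, 1) by the index of its grid cell."""
    cells = 1 << bits
    return tuple(min(int(x * cells), cells - 1) for x in point)

def bounds(approx, q, bits):
    """Lower and upper bounds on the distance from q to any point in the cell."""
    width = 1.0 / (1 << bits)
    lo_sq = hi_sq = 0.0
    for c, x in zip(approx, q):
        lo, hi = c * width, (c + 1) * width
        gap = max(lo - x, 0.0, x - hi)            # nearest side of the cell
        far = max(abs(x - lo), abs(x - hi))       # farthest side of the cell
        lo_sq += gap * gap
        hi_sq += far * far
    return math.sqrt(lo_sq), math.sqrt(hi_sq)

def va_nearest(points, approximations, q, bits):
    # Phase 1: scan only the compact approximations
    scored = [(*bounds(a, q, bits), i) for i, a in enumerate(approximations)]
    smallest_upper = min(upper for _, upper, _ in scored)
    # keep every point that could still be the answer
    candidates = sorted((lower, i) for lower, _, i in scored
                        if lower <= smallest_upper)
    # Phase 2: verify the candidates against the original vectors
    best, best_dist = None, math.inf
    for lower, i in candidates:
        if lower > best_dist:
            break                                  # no remaining candidate can win
        dist = math.dist(points[i], q)
        if dist < best_dist:
            best, best_dist = points[i], dist
    return best, best_dist

rng = random.Random(0)                             # synthetic data for the demo
points = [tuple(rng.random() for _ in range(4)) for _ in range(200)]
approximations = [quantize(p, bits=4) for p in points]
q = tuple(rng.random() for _ in range(4))
best, best_dist = va_nearest(points, approximations, q, bits=4)
```

Only the second phase touches the full vectors, so on disk it is the only part that needs random access.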
              Time to sum up
• “Curse of dimensionality” is indeed a curse
• In main memory, we can perform sublinear-time
  search using trees or hashing
• In secondary storage, linear scan is pretty much all
  we can do (for high dim)
• Personal thought: if linear search is all we can do,
  we are not doing too well….
• Maybe it is time to buy a few GB of RAM
• ..but at the end everything depends on your data set
• Surveys:
  – Berchtold & Keim:

  – Theodoridis:

  – Agarwal et al (range searching):
• Source code:

• References: see surveys plus very recent
  – [Buh’00,BT’00]: J. Buhler et al:
  – [HGI’00]: Haveliwala et al:
• If you have any questions, feel free to e-mail
  me at

• Thank you !
