k-Nearest Neighbors Search in High Dimensions

                       Tomer Peled
                       Dan Kushnir



Tell me who your neighbors are, and I'll know who you are
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing
  (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Nearest Neighbor Search
Problem definition
• Given: a set P of n points in R^d, over some distance metric
• Find: the nearest neighbor p of a query q in P

[Figure: query point q among the points of P under the chosen distance metric.]
Applications
• Classification
• Clustering
• Segmentation
• Indexing
• Dimension reduction (e.g. LLE)

[Figure: query q in a 2D feature space with axes "weight" and "color".]
Naïve solution
• No preprocessing
• Given a query point q:
  – go over all n points
  – do the comparison in R^d
• Query time = O(nd)
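For reference, a minimal sketch of this naive scan (assuming numpy and the Euclidean metric; the array and function names are illustrative, not from the slides):

    import numpy as np

    def nearest_neighbor_naive(P, q):
        # n distance computations, each costing O(d): total O(nd).
        dists = np.linalg.norm(P - q, axis=1)
        return int(np.argmin(dists))

    # Example: 1000 points in R^64
    P = np.random.rand(1000, 64)
    q = np.random.rand(64)
    print(nearest_neighbor_naive(P, q))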
Keep in mind
Common solution
• Use a data structure for acceleration
• Scalability with n and with d is important

When to use nearest neighbors
High level algorithms

  Parametric: probability distribution estimation (complex models)
  Non-parametric: density estimation (sparse data), nearest neighbors (high dimensions)

Nearest neighbors assume no prior knowledge about the underlying probability structure.
Nearest Neighbor

[Figure: query q among the data points.]

    min_{p_i ∈ P} dist(q, p_i)
r, ε – Nearest Neighbor

[Figure: query q with an inner ball of radius r and an outer ball of radius (1 + ε)r.]

• If some point p1 satisfies dist(q, p1) ≤ r,
• then the algorithm may return any p2 with dist(q, p2) ≤ (1 + ε)r,  i.e. r2 = (1 + ε)·r1
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing
  (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
The simplest solution
• Lion in the desert
Quadtree
• Split the first dimension into 2
• Repeat iteratively
• Stop when each cell has no more than 1 data point
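In practice this kind of space partitioning is what off-the-shelf structures provide; a minimal sketch using SciPy's kd-tree (a close cousin of the quadtree; data and names are illustrative):

    import numpy as np
    from scipy.spatial import cKDTree

    P = np.random.rand(10000, 2)     # n points in the plane, as in the figure
    tree = cKDTree(P)                # recursive splitting until cells are small

    q = np.array([0.5, 0.5])
    dist, idx = tree.query(q, k=1)   # nearest neighbor of q
    print(idx, dist)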
Quadtree – structure
[Figure: the 2D plane is split at (X1, Y1); each tree node branches on P < X1 / P ≥ X1 and P < Y1 / P ≥ Y1, giving four children per split.]
Query – Quadtree
[Figure: the query descends the tree by comparing q against the split values X1, Y1 at each node.]
In many cases this works.
Pitfall 1 – Quadtree
[Figure: the nearest neighbor lies just across a split boundary, so the cell containing q does not contain it and neighboring cells must also be searched.]
In some cases it doesn't.
Pitfall 1 – Quadtree
[Figure: a point configuration where the query must visit many cells.]
In some cases nothing works.
Pitfall 2 – Quadtree
[Figure: in d dimensions a single split produces 2^d cells around the query.]
O(2^d) cells may need to be inspected, so the query time can be exponential in the number of dimensions.
Space partition based algorithms
Could be improved – see:
Multidimensional Access Methods / Volker Gaede, O. Günther
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing
  (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Curse of dimensionality
• Naive: O(nd) query time; space-partitioning structures: O(min(nd, n^d)) query time or space
• For d > 10..20, worse than a sequential scan for most geometric distributions
• Techniques specific to high dimensions are needed

Proved in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002.
Curse of dimensionality
Some intuition: splitting each dimension in two yields 2 cells in 1D, 2^2 in 2D, 2^3 in 3D, …, 2^d cells in d dimensions.
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing
  (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Preview
• General solution – Locality Sensitive Hashing
• Implementation for the Hamming space
• Generalization to l1 & l2
Hash function
Data item → hash function → key → bin/bucket
Hash function
Example: X = a number in the range 0..n; the hash function "X modulo 3" maps it to a key in 0..2, which serves as the storage address (bucket) in the data structure.

Usually we would like related data items to be stored in the same bin.
Recall: r, ε – Nearest Neighbor

[Figure: query q with an inner ball of radius r and an outer ball of radius (1 + ε)r.]

• dist(q, p1) ≤ r
• dist(q, p2) ≤ (1 + ε)r,  with r2 = (1 + ε)·r1
Locality sensitive hashing

[Figure: query q with radii r and (1 + ε)r.]

A hash family is (r, ε, p1, p2)-sensitive if
• P1 ≡ Pr[I(p) = I(q)] is "high" when p is "close" to q (dist ≤ r = r1)
• P2 ≡ Pr[I(p) = I(q)] is "low" when p is "far" from q (dist ≥ r2 = (1 + ε)·r1)
Preview
• General solution – Locality Sensitive Hashing
• Implementation for the Hamming space
• Generalization to l1 & l2
Hamming Space
• Hamming space = the 2^N binary strings of length N
• Hamming distance = number of differing digits (a.k.a. signal distance, after Richard Hamming)
Hamming Space
• Hamming space: binary strings of length N, e.g. 010100001111
• Hamming distance:
      010100001111
      010010000011    → distance = 4
  i.e. SUM(X1 XOR X2)
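The SUM(X1 XOR X2) definition translates directly to code (a tiny sketch; the strings are given here as Python integers):

    def hamming_distance(x1, x2):
        # Number of differing bits = popcount of the XOR.
        return bin(x1 ^ x2).count("1")

    print(hamming_distance(0b010100001111, 0b010010000011))   # -> 4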
L1 to Hamming Space Embedding
With C = 11 (the maximum coordinate value), a point p = (2, 8) is embedded into d' = C·d bits by writing each coordinate in unary:

  2 → 11000000000,   8 → 11111111000,   so p → 11000000000 11111111000
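A sketch of this unary embedding for C = 11 (illustrative function name):

    def unary_embed(point, C=11):
        # Each coordinate v in 0..C becomes v ones followed by (C - v) zeros,
        # so L1 distance in the original space equals Hamming distance in
        # the embedded d' = C*d bits.
        return "".join("1" * v + "0" * (C - v) for v in point)

    print(unary_embed([2, 8]))   # -> "11000000000" + "11111111000"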
Hash function
p ∈ H^{d'}   (e.g. 11000000000 11111111000)

• G_j(p) = p|I_j : sample k bits of p (here k = 3 digits), for j = 1..L hash functions
• Store p into the bucket p|I_j (one of 2^k buckets), e.g. bucket 101
Construction
Insert every point p into each of the L hash tables (j = 1, 2, …, L).
Query
Hash the query q into each of the L tables and collect the points stored in the matching buckets as candidates.
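Putting the construction and query steps together, a compact sketch of bit-sampling LSH over such binary strings (illustrative names; parameters k and L as above):

    import random
    from collections import defaultdict

    class BitSamplingLSH:
        def __init__(self, d_prime, k=3, L=10, seed=0):
            rng = random.Random(seed)
            # L independent hash functions G_j, each sampling k bit positions I_j.
            self.samples = [rng.sample(range(d_prime), k) for _ in range(L)]
            self.tables = [defaultdict(list) for _ in range(L)]

        def _key(self, bits, j):
            return "".join(bits[i] for i in self.samples[j])

        def insert(self, idx, bits):
            for j in range(len(self.tables)):
                self.tables[j][self._key(bits, j)].append(idx)

        def query(self, bits):
            # Union of the L buckets the query falls into = candidate set,
            # to be verified by exact distance computations.
            cands = set()
            for j, table in enumerate(self.tables):
                cands.update(table.get(self._key(bits, j), []))
            return cands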
Alternative intuition: random projections
[Figure: with C = 11 the point p = (2, 8) is embedded as 11000000000 11111111000 (d' = C·d); sampling a single bit of the unary code asks whether a coordinate exceeds some threshold, i.e. on which side of an axis-parallel cut p falls.]
Alternative intuition: random projections
[Figure: the k = 3 sampled bits (e.g. 1, 0, 1) index one of the 2^3 buckets 000…111; here p falls into bucket 101.]
Repeat the k-bit sampling L times (L independent hash tables).
Secondary hashing
The 2^k buckets (e.g. bucket 011) are mapped by a simple secondary hash into M buckets of size B, with M·B = α·n (e.g. α = 2).
This supports tuning the storage volume against the dataset size.

The above hashing is locality-sensitive:
    Probability(p, q in the same bucket) = ( 1 − Distance(p, q) / #dimensions )^k

[Figure: Pr of collision vs. Distance(q, pi) for k = 1 and k = 2 – larger k makes the probability fall off more sharply with distance.]

Adapted from Piotr Indyk's slides.
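The exponent k comes from the independence of the k sampled bits (sampling positions with replacement; without replacement the equality is approximate). In LaTeX notation, with d' the embedded dimension and I_j the sampled positions:

    \Pr[\,g_j(p) = g_j(q)\,]
      = \prod_{i=1}^{k} \Pr\!\big[\,p_{I_j(i)} = q_{I_j(i)}\,\big]
      = \Big(1 - \frac{d_H(p,q)}{d'}\Big)^{k}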
Preview
• General solution – Locality Sensitive Hashing
• Implementation for the Hamming space
• Generalization to l2
Direct L2 solution
• New hashing function
• Still based on sampling
• Uses a mathematical trick:
  – p-stable distribution for the Lp distance
  – Gaussian distribution for the L2 distance
Central limit theorem

v1·(Gaussian) + v2·(Gaussian) + … + vn·(Gaussian) = a Gaussian

(a weighted sum of Gaussians is again a Gaussian)
Central limit theorem

    v1·X1 + v2·X2 + … + vn·Xn

v1..vn = real numbers
X1..Xn = independent, identically distributed (i.i.d.)
Central limit theorem

    Σ_i v_i · X_i  ~  ( Σ_i |v_i|^2 )^{1/2} · X

i.e. the dot product with an i.i.d. Gaussian vector is distributed like the norm of v times a single Gaussian X.

Norm → Distance

    Σ_i u_i · X_i − Σ_i v_i · X_i  ~  ( Σ_i |u_i − v_i|^2 )^{1/2} · X

So the difference between the two dot products (feature vector 1 and feature vector 2 projected onto the same Gaussian vector) is distributed like their L2 distance times a single Gaussian X.
The full hashing

• a = d random numbers, v = the feature vector (e.g. [34 82 21])
• b = a random offset ("phase") in [0, w], w = the discretization step

    h_{a,b}(v) = floor( (a·v + b) / w )
The full hashing

Worked example: a·v = 7944, random offset b = 34, discretization step w = 100:

    h_{a,b}(v) = floor( (7944 + 34) / 100 ) = 79

so v lands in the bucket covering [7900, 8000).
The full hashing

• a = a d-dimensional vector of i.i.d. samples from a p-stable distribution
• v = the feature vector
• b = a random offset ("phase") in [0, w], w = the discretization step

    h_{a,b}(v) = floor( (a·v + b) / w )
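A minimal numpy sketch of one such hash for L2, using Gaussian (2-stable) projections; w and the names are illustrative:

    import numpy as np

    class PStableHash:
        # h_{a,b}(v) = floor((a . v + b) / w)
        def __init__(self, d, w=4.0, rng=None):
            rng = rng or np.random.default_rng()
            self.a = rng.normal(size=d)    # i.i.d. Gaussian (2-stable) entries
            self.b = rng.uniform(0, w)     # random offset in [0, w)
            self.w = w

        def __call__(self, v):
            return int(np.floor((self.a @ v + self.b) / self.w))

    # As in the Hamming case, k such values are concatenated into one bucket
    # key, and L independent keys (tables) are used.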
Generalization: p-stable distributions
• L2: Central Limit Theorem → Gaussian (normal) distribution
• Lp, p = ε..2: Generalized Central Limit Theorem → p-stable distributions (e.g. Cauchy for L1)
p-stable summary
• Works for the r, ε – Nearest Neighbor problem
• Generalizes to 0 < p ≤ 2
• Improves query time:
    O( d · n^{1/(1+ε)} · log n )  →  O( d · n^{1/(1+ε)^2} · log n )
  (latest results, reported by email by Alexander Andoni)
Parameters selection
• For a 90% success probability, pick the parameters that give the best query time performance.

[Graph: query time vs. L, for Euclidean space.]
Parameters selection …
• A single projection hits an ε – Nearest Neighbor with Pr = p1
• k projections hit an ε – Nearest Neighbor with Pr = p1^k
• All L hashings fail to collide with Pr = (1 − p1^k)^L
• To ensure a collision (e.g. 1 − δ ≥ 90%):

    1 − (1 − p1^k)^L ≥ 1 − δ    ⇒    L ≥ log(δ) / log(1 − p1^k)

  (for Euclidean space)
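A quick sketch of that calculation (the values of p1, k and δ are illustrative only):

    import math

    def tables_needed(p1, k, delta):
        # Smallest L with 1 - (1 - p1**k)**L >= 1 - delta.
        return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

    print(tables_needed(p1=0.9, k=18, delta=0.1))   # -> 15 for these values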
… Parameters selection
[Graph: query time vs. k – candidate-extraction time grows with k while candidate-verification time shrinks, so the total has an optimal k.]
Pros & Cons
Pros:
• Better query time than spatial data structures (sub-linear dependence on n)
• Scales well to higher dimensions and larger data sizes
• Predictable running time
Cons:
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
• Works best for Hamming distance (although it can be generalized to Euclidean space)
• In secondary storage, a linear scan is pretty much all we can do (for high dimensions)
• Requires the radius r to be fixed in advance

From Piotr Indyk's slides.
Conclusion
• …but in the end, everything depends on your data set
• Try it at home
  – Visit: http://web.mit.edu/andoni/www/LSH/index.html
  – Email Alex Andoni – Andoni@mit.edu
  – Test it over your own data (C code, under Red Hat Linux)
LSH – Applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun)
• Searching image databases (see the following)
• Image segmentation (see the following)
• Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani)
• Texture classification (see the following)
• Clustering (see the following)
• Embedding and manifold learning (LLE, and many others)
• Compression – vector quantization
• Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan)
• Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler)
• In short: whenever k-Nearest Neighbors (KNN) are needed.
Motivation
• A variety of procedures in learning require KNN computation.
• KNN search is a computational bottleneck.
• LSH provides a fast approximate solution to the problem.
• LSH requires hash function construction and parameter tuning.
Outline
Fast Pose Estimation with Parameter Sensitive Hashing – G. Shakhnarovich, P. Viola, and T. Darrell
• Finding sensitive hash functions.

Mean Shift Based Clustering in High Dimensions: A Texture Classification Example – B. Georgescu, I. Shimshoni, and P. Meer
• Tuning LSH parameters.
• The LSH data structure is used for algorithm speedups.
Fast Pose Estimation with Parameter Sensitive Hashing
G. Shakhnarovich, P. Viola, and T. Darrell

The problem:
Given an image x, what are the parameters θ in this image, i.e. the angles of the joints, the orientation of the body, etc.?
Ingredients
• Input query image with unknown angles (parameters).
• Database of human poses with known angles.
• Image feature extractor – edge detector.
• Distance metric in feature space, d_X.
• Distance metric in angle space:

    d_θ(θ1, θ2) = Σ_{i=1..m} ( 1 − cos(θ1_i − θ2_i) )
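A direct transcription of this angle-space metric (assumes the angles are numpy arrays in radians; illustrative only):

    import numpy as np

    def angle_distance(theta1, theta2):
        # d_theta(t1, t2) = sum_i (1 - cos(t1_i - t2_i)); zero when all angles agree.
        return float(np.sum(1.0 - np.cos(theta1 - theta2)))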
Example based learning
• Construct a database of example images with their known angles.
• Given a query image, run your favorite feature extractor.
• Compute the KNN from the database.
• Use these KNNs to compute the average angles of the query.

Input: query  →  Find KNN in the database of examples  →  Output: average angles of the KNN
The algorithm flow
[Figure: Input query → feature extraction → processed query → PSH lookup in the database of examples → LWR → output match.]

The image features
Image features are multi-scale edge histograms over the directions 0, π/4, π/2, 3π/4.

[Figure: edge histograms computed over image sub-windows A and B.]

PSH: The basic assumption
There are two metric spaces here: the feature space (d_X) and the parameter space (d_θ).
We want similarity to be measured in the angle space, whereas LSH works on the feature space.
• Assumption: the feature space is closely related to the parameter space.

Insight: Manifolds
• A manifold is a space in which every point has a neighborhood resembling Euclidean space.
• But the global structure may be complicated: curved.
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.

[Figure: a query q on a curved manifold in feature space mapping to the parameter (angle) space – "Is this magic?"]

Parameter Sensitive Hashing (PSH)

The trick:
Estimate the performance of different hash functions on examples, and select those sensitive to d_θ:
the hash functions are applied in feature space, but the KNN are valid in angle space.

PSH as a classification problem
• Label pairs of examples with similar angles.
• Define hash functions h on the feature space.
• Predict the labeling of similar / non-similar example pairs using h.
• Compare the labelings: if the labeling by h is good, accept h, else change h.

A pair of examples (x_i, θ_i), (x_j, θ_j) is labeled:

    y_ij = +1   if d_θ(θ_i, θ_j) < r
    y_ij = −1   if d_θ(θ_i, θ_j) > r(1 + ε)

[Figure: example pairs labeled +1, +1, −1, −1 (with r = 0.25).]

A binary hash function on a single feature φ:

    h_{φ,T}(x) = +1   if φ(x) > T
                 −1   otherwise

Predict the pair labels:

    ŷ_h(x_i, x_j) = +1   if h_{φ,T}(x_i) = h_{φ,T}(x_j)
                    −1   otherwise
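In code, these two definitions are one-liners (phi_x stands for the scalar feature value φ(x); names are illustrative):

    def binary_hash(phi_x, T):
        # h_{phi,T}(x): threshold a single feature value.
        return 1 if phi_x > T else -1

    def predicted_pair_label(phi_xi, phi_xj, T):
        # +1 if the hash puts both examples in the same bin, -1 otherwise.
        return 1 if binary_hash(phi_xi, T) == binary_hash(phi_xj, T) else -1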


h_{φ,T} will place both examples in the same bin or separate them, depending on which side of the threshold T the feature value φ(x) falls.

Find the best T* that predicts the true labeling subject to the probability constraints.


Local Weighted Regression (LWR)
• Given a query image x0, PSH returns its KNNs.
• LWR uses the KNNs to compute a weighted average of the estimated angles of the query:

    θ* = argmin_β  Σ_{x_i ∈ N(x0)}  d_θ( g(x_i, β), θ_i ) · K( d_X(x_i, x0) )
                                                            (distance weight)
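In its simplest (constant-model) form this reduces to a kernel-weighted average of the neighbors' angles; a sketch under that simplification (names illustrative, angle wrap-around ignored for brevity):

    import numpy as np

    def lwr_estimate(neighbor_angles, neighbor_feat_dists, bandwidth=1.0):
        # neighbor_angles: (k, m) known angle vectors of the KNN returned by PSH.
        # neighbor_feat_dists: (k,) feature-space distances d_X(x_i, x0).
        w = np.exp(-0.5 * (np.asarray(neighbor_feat_dists) / bandwidth) ** 2)
        w /= w.sum()                      # closer neighbors weigh more
        return (w[:, None] * np.asarray(neighbor_angles)).sum(axis=0)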
Results
Synthetic data were generated:
• 13 angles: 1 for rotation of the torso, 12 for joints.
• 150,000 images.
• Nuisance parameters added: clothing, illumination, facial expression.
• 1,775,000 example pairs.
• Selected 137 out of 5,123 meaningful features (how??).
• 18-bit hash functions (k), 150 hash tables (l).
  Recall: p1 is the probability of a positive hash, p2 the probability of a bad hash, B the maximum number of points in a bucket.
• Without the selection, 40 bits and 1,000 hash tables would have been needed.

• Test on 1,000 synthetic examples:
• PSH searched only 3.4% of the data per query.
Results – real data
• 800 images.
• Processed by a segmentation algorithm.
• 1.3% of the data were searched.

[Figures: results on real data; some interesting mismatches.]
Fast pose estimation – summary
• A fast way to compute the angles of a human body figure.
• Moving from one representation space to another.
• Training a sensitive hash function.
• Smart averaging over the KNN.
         Food for Thought
• The basic assumption may be problematic
  (distance metric, representations).
• The training set should be dense.
• Texture and clutter.
• General: some features are more important
  than others and should be weighted.
Food for Thought: Point Location in Different Spheres (PLDS)
• Given: n spheres in R^d, centered at P = {p1,…,pn} with radii {r1,…,rn}.

• Goal: given a query q, preprocess the points in P to find a point p_i whose sphere covers the query q.

[Figure: query q inside the sphere of radius r_i around p_i.]

Courtesy of Mohamad Hegaze.
Mean-Shift Based Clustering in High Dimensions: A
         Texture Classification Example
               B. Georgescu, I. Shimshoni, and P. Meer

 Motivation:
 • Clustering high dimensional data by using local
   density measurements (e.g. feature space).
 • Statistical curse of dimensionality:
   sparseness of the data.
 • Computational curse of dimensionality:
   expensive range queries.
 • LSH parameters should be adjusted for optimal
   performance.
                      Outline
•    Mean-shift in a nutshell + examples.

Our scope:
• Mean-shift in high dimensions – using LSH.
• Speedups:
    1. Finding optimal LSH parameters.
    2. Data-driven partitions into buckets.
    3. Additional speedup by using LSH data structure.


Mean-Shift in a Nutshell
[Figure: a window of a given bandwidth placed around a point is repeatedly shifted toward the mean of the points inside it, i.e. toward the local mode.]


KNN in mean-shift
The bandwidth should be inversely proportional to the density in the region:
    high density → small bandwidth
    low density  → large bandwidth

It is based on the kth nearest neighbor of the point: the bandwidth is taken as the distance from the point to its kth nearest neighbor.

Adaptive mean-shift vs. non-adaptive.
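A compact sketch of one adaptive mean-shift run under these definitions (flat kernel, bandwidth = distance to the kth neighbor; illustrative only):

    import numpy as np

    def adaptive_bandwidths(X, k=100):
        # Bandwidth of each point = distance to its k-th nearest neighbor.
        d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        return np.sort(d, axis=1)[:, k]

    def mean_shift_point(x, X, h, n_iter=50):
        # Flat-kernel mean shift: repeatedly move x to the mean of the points
        # within distance h; x converges to a local mode of the density.
        for _ in range(n_iter):
            near = X[np.linalg.norm(X - x, axis=1) <= h]
            new_x = near.mean(axis=0)
            if np.allclose(new_x, x):
                break
            x = new_x
        return x

The range query inside the loop is exactly the expensive step that the LSH data structure below replaces in high dimensions.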


Image segmentation algorithm
1. Input: data in 5D (3 color + 2 x,y) or 3D (1 gray + 2 x,y)
2. Resolution controlled by the bandwidths: h_s (spatial), h_r (color)
3. Apply mean-shift filtering

[Figure: filtering of a 3D (gray + x,y) example.]

Mean Shift: A Robust Approach Toward Feature Space Analysis. D. Comaniciu et al., TPAMI '02
Image segmentation algorithm
Filtering: each pixel is replaced by the value of its nearest mode.

[Figure: mean-shift trajectories; original, filtered, and segmented images.]
Filtering examples
[Figures: original squirrel / filtered; original baboon / filtered.]
Mean Shift: A Robust Approach Toward Feature Space Analysis. D. Comaniciu et al., TPAMI '02
Segmentation examples
[Figures: segmentation results on several images.]
Mean Shift: A Robust Approach Toward Feature Space Analysis. D. Comaniciu et al., TPAMI '02


Mean-shift in high dimensions
• Statistical curse of dimensionality: sparseness of the data → variable bandwidth
• Computational curse of dimensionality: expensive range queries → implemented with LSH


LSH-based data structure
• Choose L random partitions; each partition includes K pairs (d_k, v_k) – a coordinate index and a cut value.
• For each point x_i we check whether x_{i,d_k} ≤ v_k for every pair; the resulting K Boolean values label the point's cell.
• This partitions the data into cells.
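A small sketch of this cell assignment (a Boolean vector of K coordinate tests per partition, L partitions; all names are illustrative):

    import numpy as np
    from collections import defaultdict

    def build_partitions(X, K, L, rng=None):
        # L random partitions; each holds K (coordinate index d_k, cut value v_k)
        # pairs.  Cut values here are data-driven (coordinates of randomly chosen
        # data points), as suggested in the "Data driven partitions" slide below.
        rng = rng or np.random.default_rng()
        n, d = X.shape
        partitions, tables = [], []
        for _ in range(L):
            dims = rng.integers(0, d, size=K)
            cuts = X[rng.integers(0, n, size=K), dims]
            partitions.append((dims, cuts))
            cells = defaultdict(list)
            for i, x in enumerate(X):
                cells[tuple(x[dims] <= cuts)].append(i)   # K-bit cell label
            tables.append(cells)
        return partitions, tables

    def query_candidates(q, partitions, tables):
        # Union of the L cells that contain the query.
        cands = set()
        for (dims, cuts), cells in zip(partitions, tables):
            cands.update(cells.get(tuple(q[dims] <= cuts), []))
        return cands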


Choosing the optimal K and L
• Goal: for a query q, compute the smallest possible number of distances to points in its cells.


    N_{C_l} ≅ n · (K/d + 1)^(−d)      (expected number of points in one cell)
    N_C    ≅ L · N_{C_l}              (expected number of points in the union C of the query's cells)

• C determines the resolution of the data structure.
• Large K → a smaller number of points in a cell.
• As L increases, C includes more points, but fewer true neighbors are missed.
• If L is too small → points might be missed; if L is too big → C might include extra points.

Choosing optimal K and L
• Determine accurately the KNN (and hence the bandwidth distance) for m randomly-selected data points.
• Choose an error threshold ε.
• The optimal K and L should satisfy that the approximate distance returned by the LSH structure stays within the error threshold of the true kNN distance.

Choosing optimal K and L
• For each K, estimate the error for all L's in one run, and find the minimal L(K) satisfying the constraint.
• Minimize the running time t(K, L(K)) over K.

[Figures: approximation error for (K, L); L(K) for ε = 0.05; running time t[K, L(K)] with its minimum.]


Data driven partitions
• In the original LSH, the cut values are chosen uniformly at random in the range of the data.
• Suggestion: randomly select a point from the data and use one of its coordinates as the cut value.

[Figure: uniform vs. data-driven cuts and the resulting points-per-bucket distributions – data-driven cuts give more evenly filled buckets.]

Additional speedup
Assume that all the points in a cell C will converge to the same mode (C acts as a kind of aggregate), so the mode computed for one point of C can be reused for the others.
Speedup results
[Figure: timing comparison on 65,536 points, with 1,638 points sampled, k = 100.]
Food for thought
[Figure: low-dimensional vs. high-dimensional behavior.]
A thought for food…
• Choose K, L by sample learning, or take the traditional values.
• Can one estimate K, L without sampling?
• A thought for food: does it help to know the data dimensionality or the data manifold?
• Intuitively: the dimensionality implies the number of hash functions needed.
• The catch: efficient dimensionality learning itself requires KNN.
Summary
• LSH trades some accuracy for a large reduction in complexity.
• Applications that involve massive data in high dimensions need LSH's fast performance.
• Extension of LSH to different spaces (PSH).
• Learning the LSH parameters and hash functions for different applications.
Conclusion
• …but in the end, everything depends on your data set
• Try it at home
  – Visit: http://web.mit.edu/andoni/www/LSH/index.html
  – Email Alex Andoni – Andoni@mit.edu
  – Test it over your own data (C code, under Red Hat Linux)
                 Thanks
•   Ilan Shimshoni (Haifa).
•   Mohamad Hegaze (Weizmann).
•   Alex Andoni (MIT).
•   Mica and Denis.
