k-Nearest Neighbors Search in High Dimensions

Search k-Nearest Neighbors
in High Dimensions
Tomer Peled
Dan Kushnir

Tell me who your neighbors are, and I'll know who you are
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Nearest Neighbor Search
Problem definition
• Given: a set P of n points in R^d, over some distance metric
• Find: the nearest neighbor p of the query q in P
Applications
• Classification
• Indexing
• Dimension reduction (e.g. LLE)
• Clustering
• Segmentation
(Figure: a query point q in a 2D feature space with axes weight and color.)
Naïve solution
• No preprocessing
• Given a query point q:
  – go over all n points
  – do the comparison in R^d
• Query time = O(nd): keep this baseline in mind
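To make the baseline concrete, here is a minimal sketch of the naive scan (illustrative names only; assumes the points are rows of a NumPy array):

```python
import numpy as np

def nearest_neighbor(P, q):
    """Naive O(n*d) scan: return the point of P closest to q (L2 metric)."""
    dists = np.linalg.norm(P - q, axis=1)   # n distances, O(d) each
    return P[np.argmin(dists)]

P = np.random.rand(1000, 16)   # n = 1000 points in d = 16
q = np.random.rand(16)
p_star = nearest_neighbor(P, q)
```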
Common solution
• Use a data structure for acceleration
• Scalability with n and with d is important
When to use nearest neighbors?
High-level algorithms
• Parametric: probability distribution estimation, for complex models
• Non-parametric: density estimation, for sparse data
• Non-parametric: nearest neighbors, for high dimensions

Assuming no prior knowledge about the underlying probability structure.
Nearest Neighbor
For a query q, return p* = argmin_{p_i ∈ P} dist(q, p_i)
r, ε-Nearest Neighbor
For a query q: if some point p1 has dist(q, p1) ≤ r, it is enough to return any point p2 with dist(q, p2) ≤ (1 + ε)·r, i.e. within the enlarged radius r2 = (1 + ε)·r1.
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
The simplest solution
• "Lion in the desert" search:
  – split the first dimension into 2
  – repeat iteratively on the remaining dimensions and cells
  – stop when each cell has no more than 1 data point
(Figure: the plane is split at (X1, Y1) into the cells P<X1 / P≥X1 and P<Y1 / P≥Y1, giving a tree of cells.)
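A toy sketch of this recursive splitting (a kd-tree-style partition; the structure and names below are illustrative, not the deck's code). A query descends the tree by comparing its coordinate on each node's axis to the cut value.

```python
import numpy as np

def build(points, depth=0):
    """Split one dimension at a time; stop when a cell holds at most 1 point."""
    if len(points) <= 1:
        return {"leaf": points}
    axis = depth % points.shape[1]                 # cycle through the dimensions
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"axis": axis, "cut": points[mid, axis],
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid:], depth + 1)}
```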
In many cases this works
(Figure: the same recursive partition around (X1, Y1).)
In some cases it doesn't
(Figure: a 2D point configuration where the partition does not help.)
In some cases nothing works
• A space partition can force the query to visit O(2^d) cells
• This could result in query time exponential in the number of dimensions
Space-partition-based algorithms could be improved; see the survey:
Multidimensional access methods / Volker Gaede, O. Günther
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Curse of dimensionality
• Naive solution: O(nd) query time; known exact structures need query time or space around O(min(nd, n^d))
• For d > 10..20 they are worse than a sequential scan, for most geometric distributions of the data
• Techniques specific to high dimensions are needed
• Proved in theory and in practice by Barkol & Rabani 2000 and Beame & Vee 2002
Curse of dimensionality
Some intuition: splitting every dimension in two gives 2, 2^2, 2^3, …, 2^d cells, exponential in d.
Outline
• Problem definition and flavors
• Algorithms overview – low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l1 & l2
Hash function
Data_Item → Hash function → Key → Bin/Bucket

Hash function as a data structure
Example: X = a number in the range 0..n; hash = X modulo 3; the keys are 0..2.
Usually we would like related data items to be stored in the same bin.
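As a toy illustration of the bucket structure above (X modulo 3): an ordinary hash function does not keep related items together, which is exactly what locality-sensitive hashing will change.

```python
buckets = {0: [], 1: [], 2: []}

def insert(x):
    buckets[x % 3].append(x)      # key = x mod 3 -> one of three bins

for x in [0, 4, 7, 9, 12, 14]:
    insert(x)
# buckets == {0: [0, 9, 12], 1: [4, 7], 2: [14]}
```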
Recall: r, ε-Nearest Neighbor
dist(q, p1) ≤ r,  dist(q, p2) ≤ (1 + ε)·r,  r2 = (1 + ε)·r1
Locality sensitive hashing
A hash family is (r, ε, P1, P2)-sensitive if:
• P1 ≡ Pr[I(p) = I(q)] is "high" when p is "close" to q (dist(q, p) ≤ r)
• P2 ≡ Pr[I(p) = I(q)] is "low" when p is "far" from q (dist(q, p) ≥ r2 = (1 + ε)·r)
Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l1 & l2
Hamming Space
• Hamming space = the 2^N binary strings of length N
• Hamming distance = the number of positions in which two strings differ, a.k.a. signal distance (Richard Hamming)

Example:
010100001111
010010000011
Distance = 4 = SUM(X1 XOR X2)
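The slide's example, written out as a small check (plain Python, illustrative only):

```python
def hamming(x, y):
    """Number of positions where two equal-length bit strings differ: sum(x_i XOR y_i)."""
    return sum(a != b for a, b in zip(x, y))

assert hamming("010100001111", "010010000011") == 4
```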
L1 to Hamming Space Embedding
Write each coordinate, an integer in 0..C (here C = 11), in unary: the value 2 becomes 11000000000 and the value 8 becomes 11111111000. Concatenating the coordinates of p = (2, 8) gives the string 11000000000 11111111000 of length d' = C·d, and L1 distances between points become Hamming distances between their embeddings.
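A minimal sketch of the unary embedding, checking that the L1 distance equals the Hamming distance of the embedded strings (integer coordinates in 0..C assumed; names illustrative):

```python
def unary(value, C):
    """Write an integer coordinate 0..C as C bits: `value` ones followed by zeros."""
    return "1" * value + "0" * (C - value)

def embed(point, C):
    return "".join(unary(v, C) for v in point)

hamming = lambda x, y: sum(a != b for a, b in zip(x, y))

p, q, C = (2, 8), (4, 5), 11
assert embed(p, C) == "11000000000" + "11111111000"
assert sum(abs(a - b) for a, b in zip(p, q)) == hamming(embed(p, C), embed(q, C))  # both are 5
```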
Hash function
For p ∈ H^{d'} (e.g. p = 11000000000 11111111000):
• use j = 1..L hash functions, each one sampling k digits (here k = 3)
• G_j(p) = p|I_j, the bits of p at the sampled index set I_j (e.g. 101)
• store p in the bucket indexed by p|I_j; each table has 2^k buckets

Construction: insert every point p into its bucket in each of the tables 1, 2, …, L.
Query: compute q's bucket in each of the tables 1, 2, …, L and collect the colliding points as candidates.
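Putting the construction and the query together, a small sketch of bit-sampling LSH over the embedded bit strings (all names illustrative; candidates are verified by exact Hamming distance):

```python
import random
from collections import defaultdict

def build_tables(points, k, L, dim):
    """L tables; table j stores each p under the key G_j(p) = p restricted to I_j (k sampled bits)."""
    tables = []
    for _ in range(L):
        I = random.sample(range(dim), k)                 # the sampled bit positions I_j
        table = defaultdict(list)
        for p in points:
            table["".join(p[i] for i in I)].append(p)
        tables.append((I, table))
    return tables

def query(tables, q):
    """Collect every point colliding with q in some table, then verify exactly."""
    hamming = lambda x, y: sum(a != b for a, b in zip(x, y))
    candidates = {p for I, table in tables for p in table["".join(q[i] for i in I)]}
    return min(candidates, key=lambda p: hamming(p, q)) if candidates else None
```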
Alternative intuition: random projections
Each G_j projects the d' = C·d bit string (e.g. p = 11000000000 11111111000) onto k sampled coordinates; the k sampled bits (e.g. 101) select one of the 2^3 buckets 000, 001, …, 111.
Repeat the sampling L times, giving L hash tables.

Secondary hashing
The 2^k buckets of each table are mapped by a simple hash into M buckets of size B, with M·B = α·n (e.g. α = 2). This supports tuning dataset size vs. storage volume.
The above hashing is locality-sensitive
For one table, Pr(p, q fall in the same bucket) = (1 − Distance(p, q) / #dimensions)^k
(Figure: this probability as a function of Distance(q, p_i), plotted for k = 1 and k = 2.)
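A quick numeric check of this collision probability for one table, using the d' = 22 bits from the earlier embedding example (values are illustrative):

```python
d_prime = 22                       # bits after the L1-to-Hamming embedding (C * d)
for dist in (1, 5, 10):
    for k in (1, 2, 3):
        pr = (1 - dist / d_prime) ** k     # Pr[p and q share a bucket in one table]
        print(f"dist={dist:2d}  k={k}  Pr={pr:.3f}")
```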

Preview
• General solution – Locality Sensitive Hashing
• Implementation for Hamming space
• Generalization to l2
Direct L2 solution
• New hashing function
• Still based on sampling
• Uses a mathematical trick:
  – p-stable distributions for Lp distances
  – the Gaussian distribution for the L2 distance
Central limit theorem
v1·X1 + v2·X2 + … + vn·Xn = a Gaussian
(a weighted sum of Gaussians is again a Gaussian)
• v1..vn are real numbers
• X1..Xn are independent, identically distributed (i.i.d.) Gaussians
Central limit theorem
Σ_i v_i·X_i ∼ (Σ_i |v_i|²)^{1/2} · X
The dot product of a features vector v with the Gaussian vector (X_1, …, X_n) is distributed as the norm of v times a single Gaussian X.

Norm → Distance
Σ_i u_i·X_i − Σ_i v_i·X_i ∼ (Σ_i |u_i − v_i|²)^{1/2} · X
The difference between the dot products of two feature vectors u and v with the same Gaussian vector is distributed as their distance times a single Gaussian X.
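A small empirical check of this fact: the standard deviation of a·u − a·v over random Gaussian vectors a matches ‖u − v‖ (numbers and names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([34.0, 82.0, 21.0])
v = np.array([30.0, 90.0, 25.0])

A = rng.normal(size=(100_000, 3))      # each row: an i.i.d. Gaussian vector X
diff = A @ u - A @ v                   # = A @ (u - v), one sample per row
print(diff.std(), np.linalg.norm(u - v))   # the two values should nearly agree (~9.8)
```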
The full hashing
h_{a,b}(v) = ⌊(a·v + b) / w⌋
• a: a vector of d random numbers, i.i.d. from a p-stable distribution (Gaussian for L2)
• v: the features vector (e.g. [34 82 21])
• b: a random phase drawn from [0, w]
• w: the discretization (quantization) step
Example: a·v = 7944, b = 34, w = 100, so the value 7978 falls in the interval 7900..8000 of the line 7800, 7900, 8000, 8100, 8200, and the bucket index is ⌊7978 / 100⌋ = 79.
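A minimal sketch of one such hash function for the L2 case (Gaussian entries for a, random offset b in [0, w); names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def make_hash(d, w):
    """h_{a,b}(v) = floor((a . v + b) / w), with a ~ N(0, I_d) (2-stable) and b ~ U[0, w)."""
    a = rng.normal(size=d)
    b = rng.uniform(0, w)
    return lambda v: int(np.floor((a @ v + b) / w))

h = make_hash(d=3, w=100)
print(h(np.array([34.0, 82.0, 21.0])))   # the bucket index of the feature vector
```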
Generalization: p-stable distributions
• L2 ↔ the Gaussian (normal) distribution, by the Central Limit Theorem
• Lp, p = ε..2 ↔ p-stable distributions, by the generalized Central Limit Theorem (e.g. the Cauchy distribution for L1)
P-stable summary
• Works for the r, ε-Nearest Neighbor problem
• Generalizes to 0 < p ≤ 2
• Improves the query time

Latest results (reported by email by Alexander Andoni):
query time improved from O(d·n^{1/(1+ε)}·log n) to O(d·n^{1/(1+ε)²}·log n)
Parameters selection
• 90% probability → best query-time performance
(Figure: for Euclidean space, as a function of L.)
Parameters selection …
• A single projection hits an ε-Nearest Neighbor with Pr = p1
• k projections hit it simultaneously with Pr = p1^k
• All L hashings fail to collide with Pr = (1 − p1^k)^L
• To ensure a collision with probability at least 1 − δ (e.g. 90%):
  1 − (1 − p1^k)^L ≥ 1 − δ  ⇒  L ≥ log(δ) / log(1 − p1^k)
(Figure: for Euclidean space, as a function of K.)
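The bound rearranged into a tiny helper (a sketch; p1 must be measured or estimated for the data at hand):

```python
import math

def tables_needed(p1, k, delta):
    """Smallest L with (1 - p1**k)**L <= delta, i.e. collision probability >= 1 - delta."""
    return math.ceil(math.log(delta) / math.log(1 - p1 ** k))

print(tables_needed(p1=0.8, k=10, delta=0.1))   # ~90% success; prints 21 for these values
```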
… Parameters selection
(Figure: query time vs. k, split into its candidate-verification and candidate-extraction components.)
Pros & Cons (from Piotr Indyk's slides)
+ Better query time than spatial data structures
+ Scales well to higher dimensions and larger data sizes (sub-linear dependence)
+ Predictable running time
− Inefficient for data with distances concentrated around the average
− Works best for Hamming distance (although it can be generalized to Euclidean space)
− In secondary storage, a linear scan is pretty much all we can do (for high dimensions)
− Requires the radius r to be fixed in advance
Conclusion
• …but in the end, everything depends on your data set
• Try it at home:
  – Visit: http://web.mit.edu/andoni/www/LSH/index.html
  – Email Alex Andoni: Andoni@mit.edu
  – Test over your own data (C code under Red Hat Linux)
LSH - Applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun).
• Searching image databases (see the following).
• Image segmentation (see the following).
• Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani).
• Texture classification (see the following).
• Clustering (see the following).
• Embedding and manifold learning (LLE, and many others).
• Compression – vector quantization.
• Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan).
• Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler).
• In short: whenever K-Nearest Neighbors (KNN) are needed.
Motivation
• A variety of procedures in learning require KNN computation.
• KNN search is a computational bottleneck.
• LSH provides a fast approximate solution to the problem.
• LSH requires hash function construction and parameter tuning.
Outline
Fast Pose Estimation with Parameter Sensitive Hashing (G. Shakhnarovich, P. Viola, and T. Darrell)
• Finding sensitive hash functions.

Mean Shift Based Clustering in High Dimensions: A Texture Classification Example (B. Georgescu, I. Shimshoni, and P. Meer)
• Tuning LSH parameters.
• The LSH data structure is used for algorithm speedups.
Fast Pose Estimation with Parameter Sensitive Hashing
G. Shakhnarovich, P. Viola, and T. Darrell

The problem:
Given an image x, what are the parameters θ in this image,
i.e. the angles of the joints, the orientation of the body, etc.?
Ingredients
• Input query image with unknown angles (parameters).
• Database of human poses with known angles.
• Image feature extractor – edge detector.
• Distance metric in feature space: d_X.
• Distance metric in angle space:
  d(θ1, θ2) = Σ_{i=1}^{m} (1 − cos(θ1,i − θ2,i))
Example based learning
• Construct a database of example images with their known angles.
• Given a query image, run your favorite feature extractor.
• Compute the KNN from the database.
• Use these KNNs to compute the average angles of the query.

Pipeline: input query → find the KNN in the database of examples → output the average angles of the KNN.
The algorithm flow
Input query → feature extraction → processed query → PSH over the database of examples → LWR → output match.
The image features
Image features are multi-scale edge direction histograms computed over image sub-windows (e.g. A, B), with edge directions binned at 0, π/4, π/2, and 3π/4.
PSH: The basic assumption
There are two metric spaces here: the feature space (d_X) and the parameter space (d_θ). We want similarity to be measured in the angle space, whereas LSH works on the feature space.
• Assumption: the feature space is closely related to the parameter space.
Insight: Manifolds
• A manifold is a space in which every point has a neighborhood resembling a Euclidean space.
• But the global structure may be complicated: curved.
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.
(Figure: a query q mapped between the feature space and the parameter space of angles; "Is this magic?")
Parameter Sensitive Hashing (PSH)
The trick:
Estimate the performance of different hash functions on examples, and select those sensitive to d_θ: the hash functions are applied in feature space, but the KNN are valid in angle space.
PSH as a classification problem
• Label pairs of examples with similar angles.
• Define hash functions h on the feature space.
• Use h to predict the labeling of similar / non-similar examples and compare it to the true labeling.
• If the labeling by h is good, accept h; else change h.
A pair of examples (x_i, θ_i), (x_j, θ_j) is labeled:
y_ij = +1 if d_θ(θ_i, θ_j) ≤ r
y_ij = −1 if d_θ(θ_i, θ_j) ≥ r(1 + ε)
Example labels: +1, +1, −1, −1 (with r = 0.25).
A binary hash function on the features:
h_{φ,T}(x) = +1 if φ(x) > T, −1 otherwise

Predict the labels:
ŷ_h(x_i, x_j) = +1 if h_{φ,T}(x_i) = h_{φ,T}(x_j), −1 otherwise
h_{φ,T} will place both examples in the same bin or separate them, depending on where the threshold T sits along φ(x).
Find the best T* that predicts the true labeling subject to the probability constraints.
Local Weighted Regression (LWR)
• Given a query image, PSH returns its KNNs.
• LWR uses the KNNs to compute a weighted average of the estimated angles of the query:
  θ* = argmin_θ Σ_{x_i ∈ N(x_0)} d_θ(g(x_i, θ), θ_i) · K(d_X(x_i, x_0))
  where K(·) is the distance-based weight.
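A simplified, zeroth-order sketch of this step: a kernel-weighted average of the neighbors' angles rather than the full local regression (NumPy arrays assumed; it ignores angle wrap-around and all names are illustrative):

```python
import numpy as np

def weighted_angle_estimate(neighbor_thetas, neighbor_xs, x0, bandwidth):
    """Weight each neighbor by K(d_X(x_i, x0)) and average its known angles."""
    d = np.linalg.norm(neighbor_xs - x0, axis=1)     # feature-space distances d_X
    w = np.exp(-(d / bandwidth) ** 2)                # kernel weights
    return (w[:, None] * neighbor_thetas).sum(axis=0) / w.sum()
```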
Results
Synthetic data were generated:
• 13 angles: 1 for the rotation of the torso, 12 for the joints.
• 150,000 images (varying illumination, face expression).
• 1,775,000 example pairs.
• 137 out of 5,123 meaningful features were selected (how??): 18-bit hash functions (k), 150 hash tables (l).
  (Recall: P1 is the probability of a positive hash, P2 the probability of a bad hash, B the maximum number of points in a bucket.)
• Without the selection, 40 bits and 1000 hash tables would have been needed.
• Test on 1000 synthetic examples: PSH searched only 3.4% of the data per query.
Results – real data
• 800 images.
• Processed by a segmentation algorithm.
• 1.3% of the data were searched.
• Some interesting mismatches occur.
Fast pose estimation - summary
• A fast way to compute the angles of a human body figure.
• Moving from one representation space to another.
• Training a sensitive hash function.
• Smart averaging of the KNN.
Food for Thought
• The basic assumption may be problematic
(distance metric, representations).
• The training set should be dense.
• Texture and clutter.
• General: some features are more important
than others and should be weighted.
Food for Thought: Point Location in Different Spheres (PLDS)
• Given: n spheres in R^d, centered at P = {p1,…,pn}, with radii r_i.
• Goal: given a query q, preprocess the points in P so as to find a point p_i whose sphere covers the query q.

Mean-Shift Based Clustering in High Dimensions: A
Texture Classification Example
B. Georgescu, I. Shimshoni, and P. Meer

Motivation:
• Clustering high dimensional data by using local
density measurements (e.g. feature space).
• Statistical curse of dimensionality:
sparseness of the data.
• Computational curse of dimensionality:
expensive range queries.
• LSH parameters should be adjusted for optimal
performance.
Outline
• Mean-shift in a nutshell + examples.

Our scope:
• Mean-shift in high dimensions – using LSH.
• Speedups:
  1. Finding optimal LSH parameters.
  2. Data-driven partitions into buckets.
  3. Additional speedup by using the LSH data structure.

Mean-Shift in a Nutshell
(Figure: a window of the given bandwidth centered on the current point.)

KNN in mean-shift
The bandwidth should be inversely proportional to the density in the region:
• high density → small bandwidth
• low density → large bandwidth
It is based on the k-th nearest neighbor of the point: the bandwidth is the distance from the point to its k-th nearest neighbor.
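A direct sketch of that rule (assumes the query point itself is one of the rows of the data, so index k skips it):

```python
import numpy as np

def adaptive_bandwidth(points, x, k):
    """Bandwidth = distance from x to its k-th nearest neighbor: small in dense regions."""
    d = np.sort(np.linalg.norm(points - x, axis=1))
    return d[k]          # d[0] == 0 is x itself when x is a row of `points`
```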

Image segmentation algorithm
1. Input: data in 5D (3 color + 2 x,y) or 3D (1 gray + 2 x,y)
2. Resolution controlled by the bandwidths: hs (spatial), hr (color)
3. Apply filtering
(Figure: the 3D case.)

Mean-shift: A Robust Approach Towards Feature Space Analysis. D. Comaniciu et al., TPAMI '02
Image segmentation algorithm
Filtering: each pixel is replaced by the value of the nearest mode found by its mean-shift trajectory.
(Figures: mean-shift trajectories; original, filtered, and segmented images.)
Filtering examples
(Figures: original and filtered squirrel; original and filtered baboon.)
Segmentation examples
(Figures: segmentation results.)
Mean-shift: A Robust Approach Towards Feature Space Analysis. D. Comaniciu et al., TPAMI '02

Mean-shift in high dimensions
• Statistical curse of dimensionality: sparseness of the data → variable bandwidth.
• Computational curse of dimensionality: expensive range queries → implemented with LSH.

LSH-based data structure
• Choose L random partitions; each partition includes K pairs (d_k, v_k) of a cut dimension and a cut value.
• For each point x_i and each pair we check whether x_{i,d_k} ≤ v_k.
• These tests partition the data into cells.

Choosing the optimal K and L
• For a query q, we want to compute the smallest number of distances to points in its buckets.
• Expected number of points in one partition's cell: N_{C_l} ≈ n·(K/d + 1)^(−d)
• Over the L partitions: N_C ≤ L·N_{C_l}
• As L increases, the union of cells C grows while their intersection shrinks; the intersection determines the resolution of the data structure.
• Large K → a smaller number of points in a cell.
• If L is too small, points might be missed; if L is too big, C might include extra points.

Choosing optimal K and L
• Determine accurately the KNN distance (the bandwidth) for m randomly selected data points.
• Choose an error threshold ε.
• The optimal K and L should keep the approximate (LSH-based) distance within this threshold of the true distance.

Choosing optimal K and L
• For each K, estimate the error for all L's in one run.
• Find the minimal L satisfying the constraint: L(K).
• Minimize the running time t(K, L(K)) over K.
(Figures: approximation error for (K, L); L(K) for ε = 0.05; running time t[K, L(K)] and its minimum.)
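A sketch of the selection loop described above; error(K, L) and time_cost(K, L) stand for measurements on the m sampled queries and are assumptions here, not functions from the paper:

```python
def tune_parameters(Ks, Ls, error, time_cost, eps=0.05):
    """Pick (K, L(K)) minimizing query time among settings meeting the error threshold."""
    best = None
    for K in Ks:
        L_ok = next((L for L in sorted(Ls) if error(K, L) <= eps), None)   # minimal feasible L
        if L_ok is None:
            continue
        t = time_cost(K, L_ok)
        if best is None or t < best[2]:
            best = (K, L_ok, t)
    return best   # (K, L(K), t[K, L(K)])
```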

Data driven partitions
• In the original LSH, cut values are random in the range of the data.
• Suggestion: randomly select a point from the data and use one of its coordinates as the cut value.
(Figure: points per bucket for the uniform distribution vs. the data-driven choice.)
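The two ways of drawing a cut, side by side (a sketch; names illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def uniform_cut(data):
    """Original LSH: random coordinate, cut value uniform over that coordinate's range."""
    d = rng.integers(data.shape[1])
    return d, rng.uniform(data[:, d].min(), data[:, d].max())

def data_driven_cut(data):
    """Suggested variant: a coordinate of a randomly chosen data point becomes the cut value."""
    i = rng.integers(len(data))
    d = rng.integers(data.shape[1])
    return d, data[i, d]
```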

Additional speedup: assume that all points in a cell C will converge to the same mode (C is like a type of aggregate).
Speedup results
(Table: speedups on 65,536 points, 1,638 points sampled, k = 100.)
Food for thought
(Figure: low dimension vs. high dimension.)
A thought for food…
• Choose K, L by sample learning, or take the…
• Can one estimate K, L without sampling?
• A thought for food: does it help to know the data dimensionality or the data manifold?
• Intuitively: the dimensionality implies the number of hash functions needed.
• The catch: efficient dimensionality learning requires KNN.
Summary
• LSH trades some accuracy for a gain in complexity.
• Applications that involve massive data in high dimensions require LSH's fast performance.
• Extension of LSH to different spaces (PSH).
• Learning the LSH parameters and hash functions for different applications.
Conclusion
• …but in the end, everything depends on your data set
• Try it at home
– Visit:
http://web.mit.edu/andoni/www/LSH/index.html
– Email Alex Andoni         Andoni@mit.edu
– Test over your own data
(C code under Red Hat Linux )
Thanks
•   Ilan Shimshoni (Haifa).