k-Nearest Neighbors Search in High Dimensions
Search k-Nearest Neighbors
in High Dimensions
Tomer Peled
Dan Kushnir
Tell me who your neighbors are, and I'll know who you are
Outline
• Problem definition and flavors
• Algorithms overview - low dimensions
• Curse of dimensionality (d > 10..20)
• Enchanting the curse:
  Locality Sensitive Hashing (high-dimension approximate solutions)
• l2 extension
• Applications (Dan)
Nearest Neighbor Search
Problem definition
• Given: a set P of n points in $\mathbb{R}^d$, over some distance metric
• Find the nearest neighbor p of q in P
[Figure: a query point q among the points of P.]
Applications
• Classification
• Clustering
• Segmentation
• Indexing
• Dimension reduction (e.g. LLE)
[Figure: a query point q in a feature space with axes weight and color.]
Naïve solution
• No preprocessing
• Given a query point q:
  – Go over all n points
  – Do a comparison in $\mathbb{R}^d$ for each
• Query time = O(nd)
Keep in mind
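The naive solution can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and points stored as tuples; the function name `nearest_neighbor` is hypothetical and not taken from the slides.

```python
import math

def nearest_neighbor(points, q):
    """Linear scan: compare q with all n points, O(n*d) per query."""
    best, best_dist = None, math.inf
    for p in points:
        dist = math.dist(p, q)  # Euclidean distance in R^d (an assumption), O(d)
        if dist < best_dist:
            best, best_dist = p, dist
    return best, best_dist

P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(P, (0.9, 1.2)))
```

There is no preprocessing at all, so every query pays the full O(nd) cost; this is the baseline the later data structures are compared against.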
Common solution
• Use a data structure for acceleration
• Scalability with n and with d is important
When to use nearest neighbor
High-level algorithms
• Parametric: probability distribution estimation (complex models)
• Non-parametric: density estimation (sparse data), nearest neighbors (high dimensions)
• Nearest neighbors assume no prior knowledge about the underlying probability structure
Nearest Neighbor
Given a query q, find $\min_{p_i \in P} \mathrm{dist}(q, p_i)$
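(r, ε)-Nearest Neighbor
• If some point $p_1$ satisfies $\mathrm{dist}(q, p_1) \le r$,
• return a point $p_2$ with $\mathrm{dist}(q, p_2) \le (1 + \epsilon)\, r$
• Radii: $r_1 = r$, $r_2 = (1 + \epsilon)\, r_1$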
The simplest solution
• Lion in the desert
Quadtree
• Split the first dimension into 2
• Repeat iteratively
• Stop when each cell has no more than 1 data point
Quadtree - structure
[Figure: points in the X-Y plane split at (X1, Y1); the tree branches on P < X1 / P ≥ X1 and then on P < Y1 / P ≥ Y1.]
Query - Quadtree
[Figure: the same tree, split at (X1, Y1) with branches P < X1 / P ≥ X1 and P < Y1 / P ≥ Y1, followed for a query point.]
In many cases this works.
Pitfall1 – Quadtree
[Figure: quadtree split at (X1, Y1), with axes X and Y and branches P < X1 / P ≥ X1 and P < Y1 / P ≥ Y1.]
In some cases it doesn't.
Pitfall1 – Quadtree
In some cases nothing works.
Pitfall 2 – Quadtree
Could result in query time exponential in the number of dimensions: $O(2^d)$
Space partition based algorithms
Could be improved
Multidimensional access methods / Volker Gaede, O. Gunther
Curse of dimensionality
• Query time or space: $O(\min(nd,\ n^d))$; the naive scan takes $O(nd)$
• For d > 10..20: worse than sequential scan
  – For most geometric distributions
• Techniques specific to high dimensions are needed
• Proved in theory and in practice by Barkol & Rabani (2000) and Beame & Vee (2002)
Curse of dimensionality
Some intuition
[Figure: the number of cells grows as $2,\ 2^2,\ 2^3,\ \dots,\ 2^d$ with the dimension d.]
Preview
• General solution: locality sensitive hashing
• Implementation for Hamming space
• Generalization to l1 & l2
Hash function
• A hash function maps a data item to a key; the key selects a bin (bucket) in the data structure
• Example: X = a number in the range 0..n, hash function = X modulo 3, key in 0..2, used as the storage address
• Usually we would like related data items to be stored in the same bin
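Recall (r, ε)-Nearest Neighbor
• If some point $p_1$ satisfies $\mathrm{dist}(q, p_1) \le r$,
• return a point $p_2$ with $\mathrm{dist}(q, p_2) \le (1 + \epsilon)\, r$
• Radii: $r_1 = r$, $r_2 = (1 + \epsilon)\, r_1$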
Locality sensitive hashing
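A hash family is $(r, \epsilon, p_1, p_2)$-sensitive if:
• $P_1 \equiv \Pr[I(p) = I(q)]$ is "high" if p is "close" to q (within r)
• $P_2 \equiv \Pr[I(p) = I(q)]$ is "low" if p is "far" from q (beyond $(1 + \epsilon)\, r$)
• $r_2 = (1 + \epsilon)\, r_1$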
Hamming Space
• Hamming space = the $2^N$ binary strings of length N, e.g. 010100001111
• Hamming distance = number of changed digits, a.k.a. signal distance (after Richard Hamming)
• Example: 010100001111 vs. 010010000011, distance = 4
• Computed as SUM(X1 XOR X2)
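The SUM(X1 XOR X2) rule can be checked directly on the slide's example. This is a small sketch, assuming bit strings of equal length; both function names are hypothetical.

```python
def hamming_distance(x1: str, x2: str) -> int:
    """Count the positions where two equal-length bit strings differ."""
    if len(x1) != len(x2):
        raise ValueError("bit strings must have the same length")
    return sum(a != b for a, b in zip(x1, x2))

def hamming_distance_xor(x1: str, x2: str) -> int:
    """Same quantity via integer XOR followed by a bit count."""
    return bin(int(x1, 2) ^ int(x2, 2)).count("1")

print(hamming_distance("010100001111", "010010000011"))
print(hamming_distance_xor("010100001111", "010010000011"))
```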
L1 to Hamming Space Embedding
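• C = 11 bounds the coordinate values; d' = C*d
• Each coordinate x is written in unary: x ones followed by C - x zeros
• Example: p = (2, 8) → 11000000000 11111111000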
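The unary embedding can be written out as follows. This sketch assumes non-negative integer coordinates bounded by C; `unary_embed` is a hypothetical name. The point of the embedding is that the Hamming distance between two embedded points equals their l1 distance.

```python
def unary_embed(p, C):
    """Embed integer coordinates 0..C into a bit string of length C*len(p)."""
    bits = []
    for x in p:
        if not 0 <= x <= C:
            raise ValueError("coordinates must lie in 0..C")
        bits.append("1" * x + "0" * (C - x))  # x ones, then C - x zeros
    return "".join(bits)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

p, q = (2, 8), (5, 3)
l1 = sum(abs(a - b) for a, b in zip(p, q))
print(unary_embed(p, 11), l1, hamming(unary_embed(p, 11), unary_embed(q, 11)))
```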
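Hash function
• $p \in H^{d'}$, e.g. p = 11000000000 11111111000
• L hash functions $G_j$, j = 1..L, each sampling k = 3 digits
• $G_j(p) = p|_{I_j}$: bits sampled from p at the index set $I_j$ (here the sampled bits are 1 0 1)
• Store p into bucket $p|_{I_j}$ (here 101), one of $2^k$ buckets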
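Construction
• Insert each point p into its bucket in each of the hash tables 1, 2, …, L
Query
• Look up the bucket of q in each of the hash tables 1, 2, …, L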
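Construction and query for bit-sampling LSH can be sketched as below. This is an illustration under assumptions, not the authors' implementation: the class name `BitSamplingLSH`, the fixed seed, and the final exact re-ranking of candidates by Hamming distance are choices made here.

```python
import random
from collections import defaultdict

class BitSamplingLSH:
    """L hash tables; table j is keyed by k bits sampled at index set I_j."""

    def __init__(self, n_bits, k, L, seed=0):
        rng = random.Random(seed)
        self.index_sets = [sorted(rng.sample(range(n_bits), k)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        # G_j(p) = p restricted to the sampled positions I_j
        return "".join(p[i] for i in self.index_sets[j])

    def insert(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(p, j)].append(p)

    def query(self, q):
        # Collect candidates from q's bucket in every table, then verify them.
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates.update(table.get(self._key(q, j), []))
        if not candidates:
            return None
        return min(candidates, key=lambda p: sum(a != b for a, b in zip(p, q)))

points = ["1100000000011111111000",
          "1110000000011111110000",
          "1111111111100000000000"]
index = BitSamplingLSH(n_bits=22, k=3, L=5)
for p in points:
    index.insert(p)
print(index.query("1100000000011111111000"))
```

A query can return `None` when q collides with nothing in any table, which is exactly the failure case that choosing L large enough is meant to make unlikely.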
Alternative intuition: random projections
• Embedding as before: C = 11, d' = C*d, p = (2, 8) → 11000000000 11111111000
• k samplings (here k = 3) read the bits 1 0 1, so p falls into bucket 101, one of the $2^3$ buckets 000, 001, …, 111
• Repeating L times
Secondary hashing
• The $2^k$ buckets (e.g. 011) are mapped by simple hashing into M buckets, each of size B
• M*B = α*n, with α = 2
• Supports volume tuning: dataset size vs. storage volume
The above hashing is locality-sensitive
• $\Pr[p, q \text{ in the same bucket}] = \left(1 - \frac{\mathrm{Distance}(p, q)}{\#\,\mathrm{dimensions}}\right)^{k}$
[Figure: collision probability vs. Distance(q, pi), plotted for k = 1 and k = 2.]
Adapted from Piotr Indyk's slides
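The collision probability is easy to tabulate. The sketch below evaluates it for the 12-bit example used earlier; `collision_probability` is a hypothetical helper name.

```python
def collision_probability(distance, n_dims, k):
    """Probability that k sampled bits all agree for two points
    at the given Hamming distance in an n_dims-bit space."""
    return (1.0 - distance / n_dims) ** k

for k in (1, 2):
    row = [round(collision_probability(dist, 12, k), 3) for dist in (0, 4, 8, 12)]
    print(k, row)
```

Raising k makes the curve drop faster with distance, which is why far points collide less often but also why several tables are needed to keep near points colliding.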
Preview
General Solution – •
Locality sensitive hashing
Implementation for Hamming space •
Generalization to l2 •
Direct L2 solution
• New hashing function
• Still based on sampling
• Using a mathematical trick:
  – p-stable distribution for Lp distance
  – Gaussian distribution for L2 distance
Central limit theorem
• $v_1 X_1 + v_2 X_2 + \dots + v_n X_n$: a weighted sum of Gaussians is again a weighted Gaussian
• $v_1, \dots, v_n$ = real numbers
• $X_1, \dots, X_n$ = independent identically distributed (i.i.d.) Gaussians
• $\sum_i v_i X_i = \left(\sum_i |v_i|^2\right)^{1/2} X$ (equality in distribution)
• The left side is a dot product; the right side is a norm times X
Norm → Distance
• $\sum_i u_i X_i - \sum_i v_i X_i = \left(\sum_i |u_i - v_i|^2\right)^{1/2} X$ (equality in distribution)
• Left side: the difference of the dot products of features vector 1 (u) and features vector 2 (v) with X; right side: their distance times X
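The norm-to-distance identity can be checked empirically. This is a Monte Carlo sketch, assuming standard Gaussian X_i; the vectors, sample size, and the function name `projected_difference_std` are illustrative choices. The standard deviation of the projected difference should come out close to the l2 distance between u and v.

```python
import math
import random

def projected_difference_std(u, v, n_trials=20000, seed=0):
    """Sample std of sum(u_i X_i) - sum(v_i X_i) over Gaussian draws of X."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_trials):
        X = [rng.gauss(0.0, 1.0) for _ in u]
        proj_u = sum(ui * xi for ui, xi in zip(u, X))
        proj_v = sum(vi * xi for vi, xi in zip(v, X))
        samples.append(proj_u - proj_v)
    mean = sum(samples) / n_trials
    return math.sqrt(sum((s - mean) ** 2 for s in samples) / n_trials)

u = [22.0, 77.0, 42.0]
v = [20.0, 70.0, 50.0]
print(projected_difference_std(u, v), math.dist(u, v))
```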
The full hashing
• a = d random numbers, i.i.d. from a p-stable distribution, e.g. a = [34 82 21]
• v = features vector, e.g. v = [22 77 42], so a·v = 7944
• b = phase, Random[0, w], e.g. b = 34
• w = discretization step, e.g. w = 100
• $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{w} \right\rfloor$
• Example: (7944 + 34) / 100 = 79.78, so v falls into bin 79, the interval [7900, 8000) on the axis 7800, 7900, 8000, 8100, 8200
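The hash can be reproduced with the numbers from the slide. The sketch below assumes Gaussian entries for a (the L2 case) when drawing random parameters; `pstable_hash` and `random_hash_params` are hypothetical names.

```python
import math
import random

def pstable_hash(a, v, b, w):
    """h_{a,b}(v) = floor((a . v + b) / w)."""
    dot = sum(ai * vi for ai, vi in zip(a, v))
    return math.floor((dot + b) / w)

def random_hash_params(d, w, rng):
    """Draw a with i.i.d. Gaussian entries (2-stable) and a phase b in [0, w]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    return a, b

# Slide example: a . v = 34*22 + 82*77 + 21*42 = 7944, b = 34, w = 100
print(pstable_hash([34, 82, 21], [22, 77, 42], 34, 100))

a, b = random_hash_params(3, 4.0, random.Random(0))
print(pstable_hash(a, [22, 77, 42], b, 4.0))
```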
Generalization: P-Stable distribution
• L2 → Lp, p = eps..2
• Central limit theorem → generalized central limit theorem
• Gaussian (normal) distribution → p-stable distribution (e.g. Cauchy for L1)
P-Stable summary
• Works for (r, ε)-Nearest Neighbor
• Generalizes to 0 < p ≤ 2
• Improves query time
Latest results
Reported in email by Alexander Andoni:
• Query time: $O(d\, n^{1/(1+\epsilon)} \log n)$ → $O(d\, n^{1/(1+\epsilon)^2} \log n)$
Parameters selection
• 90% probability, best query time performance
[Figure: L, for Euclidean space.]
Parameters selection …
• A single projection hits an ε-Nearest Neighbor with Pr = $p_1$
• k projections hit an ε-Nearest Neighbor with Pr = $p_1^k$
• All L hashings fail to collide with Pr = $(1 - p_1^k)^L$
• To ensure a collision (e.g. 1 − δ ≥ 90%):
  $1 - (1 - p_1^k)^L \ge 1 - \delta \;\Rightarrow\; L \ge \frac{\log \delta}{\log(1 - p_1^k)}$
[Figure: K, for Euclidean space.]
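The bound on L translates directly into a small helper. This is a sketch; the function name `tables_needed` and the example values of p1, k and δ are assumptions for illustration.

```python
import math

def tables_needed(p1, k, delta):
    """Smallest integer L with 1 - (1 - p1**k)**L >= 1 - delta."""
    if not (0.0 < p1 < 1.0 and 0.0 < delta < 1.0 and k >= 1):
        raise ValueError("need 0 < p1 < 1, 0 < delta < 1, k >= 1")
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

L = tables_needed(p1=0.9, k=18, delta=0.1)
success = 1.0 - (1.0 - 0.9 ** 18) ** L
print(L, success)
```

Because log(1 − p1^k) is negative, dividing by it flips the inequality, which is why the result is a lower bound on L.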
… Parameters selection
[Figure: query time vs. k, split into candidates extraction and candidates verification.]
Pros. & Cons.
Pros:
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data size (sub-linear dependence)
• Predictable running time
Cons:
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
• Works best for Hamming distance (although it can be generalized to Euclidean space)
• In secondary storage, linear scan is pretty much all we can do (for high dimensions)
• Requires the radius r to be fixed in advance
From Piotr Indyk's slides
Conclusion
• ..but at the end
everything depends on your data set
• Try it at home
– Visit:
http://web.mit.edu/andoni/www/LSH/index.html
– Email Alex Andoni Andoni@mit.edu
– Test over your own data
(C code under Red Hat Linux )
LSH - Applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun).
• Searching image databases (see the following).
• Image segmentation (see the following).
• Image classification (“Discriminant adaptive Nearest Neighbor Classification”, T. Hastie, R Tibshirani).
• Texture classification (see the following).
• Clustering (see the following).
• Embedding and manifold learning (LLE, and many others)
• Compression – vector quantization.
• Search engines (“LSH Forest: Self-Tuning Indexes for Similarity Search”, M. Bawa, T. Condie, P. Ganesan).
• Genomics (“Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing”, J. Buhler).
• In short: whenever K-Nearest Neighbors (KNN) are
needed.
Motivation
• A variety of procedures in learning
require KNN computation.
• KNN search is a computational
bottleneck.
• LSH provides a fast approximate solution
to the problem.
• LSH requires hash function construction
and parameter tuning.
Outline
Fast Pose Estimation with Parameter Sensitive
Hashing G. Shakhnarovich, P. Viola, and T. Darrell.
• Finding sensitive hash functions.
Mean Shift Based Clustering in High
Dimensions: A Texture Classification Example
B. Georgescu, I. Shimshoni, and P. Meer
• Tuning LSH parameters.
• LSH data structure is used for algorithm
speedups.
Fast Pose Estimation with Parameter Sensitive
Hashing
G. Shakhnarovich, P. Viola, and T. Darrell
The Problem:
Given an image x, what are the
parameters θ in this image?
i.e. angles of joints, orientation of the body,
etc.
Ingredients
• Input query image with unknown angles
(parameters).
• Database of human poses with known angles.
• Image feature extractor – edge detector.
• Distance metric in feature space: $d_x$.
• Distance metric in angles space:
  $d_\theta(\theta_1, \theta_2) = \sum_{i=1}^{m} \left(1 - \cos(\theta_{1i} - \theta_{2i})\right)$
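The angle-space distance is a one-liner in code. This sketch assumes angles in radians and equal-length angle vectors; `angle_distance` is a hypothetical name.

```python
import math

def angle_distance(theta1, theta2):
    """d_theta = sum over the m angles of 1 - cos(theta1_i - theta2_i)."""
    return sum(1.0 - math.cos(a - b) for a, b in zip(theta1, theta2))

print(angle_distance([0.0, 0.5], [0.0, 0.5]))
print(angle_distance([0.0], [math.pi]))
```

Each term lies between 0 (identical angles) and 2 (opposite angles), and the cosine makes the distance insensitive to wrap-around at 2π.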
Example based learning
• Construct a database of example images with their known
angles.
• Given a query image, run your favorite feature extractor.
• Compute KNN from database.
• Use these KNNs to compute the average angles of the
query.
[Flow: Input: query → Find KNN in database of examples → Output: average angles of KNN]
The algorithm flow
[Figure: input query → features extraction → processed query → match against the database of examples → output match.]
Feature Extraction PSH LWR
The image features
Image features are multi-scale edge histograms:
[Figure: image windows A and B with edge directions $0,\ \pi/4,\ \pi/2,\ 3\pi/4$.]
107 ( x) A x / 4
PSH: The basic assumption
There are two metric spaces here: feature space ($d_x$)
and parameter space ($d_\theta$).
We want similarity to be measured in the angles
space, whereas LSH works on the feature space.
• Assumption: The feature space is closely
related to the parameter space.
Insight: Manifolds
• A manifold is a space in which every point has a neighborhood resembling a Euclidean space.
• But the global structure may be complicated: curved.
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.
[Figure: a query q in feature space and the corresponding points in parameters space (angles). Is this magic?]
Parameter Sensitive Hashing (PSH)
The trick:
Estimate performance of different hash functions
on examples, and select those sensitive to $d_\theta$:
The hash functions are applied in feature space
but the KNN are valid in angle space.
PSH as a classification problem
1. Label pairs of examples with similar angles.
2. Define hash functions h on feature space.
3. Predict the labeling of similar / non-similar examples by using h.
4. Compare the predicted labeling with the true labeling.
5. If the labeling by h is good, accept h; else change h.
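A pair of examples $(x_i, \theta_i), (x_j, \theta_j)$ is labeled:
  $y_{ij} = +1$ if $d_\theta(\theta_i, \theta_j) < r$
  $y_{ij} = -1$ if $d_\theta(\theta_i, \theta_j) > r(1 + \epsilon)$
[Figure: example pairs with labels +1, +1, -1, -1 (r = 0.25).]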
A binary hash function on a feature $\phi$:
  $h_{\phi,T}(x) = +1$ if $\phi(x) \ge T$, $-1$ otherwise
Predict the labels:
  $\hat{y}_h(x_i, x_j) = +1$ if $h_{\phi,T}(x_i) = h_{\phi,T}(x_j)$, $-1$ otherwise
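The labeling and prediction rules can be put together in a small sketch. Everything here is illustrative: the function names are hypothetical, each example is represented only by its scalar feature value φ(x), pairs falling between r and r(1 + ε) are left unlabeled, and the threshold is scored by plain accuracy, which is a simplification of the probability constraints used in the paper.

```python
def pair_label(d_theta, r, eps):
    """+1 for similar pairs, -1 for dissimilar ones, None in the gap between."""
    if d_theta < r:
        return 1
    if d_theta > r * (1 + eps):
        return -1
    return None  # assumption: pairs in the gap are not used for training

def binary_hash(phi_x, T):
    return 1 if phi_x >= T else -1

def predicted_label(phi_xi, phi_xj, T):
    return 1 if binary_hash(phi_xi, T) == binary_hash(phi_xj, T) else -1

def accuracy(pairs, T):
    """pairs: list of (phi(x_i), phi(x_j), y_ij) with y_ij in {+1, -1}."""
    hits = sum(predicted_label(a, b, T) == y for a, b, y in pairs)
    return hits / len(pairs)

def best_threshold(pairs):
    # Candidate thresholds: the feature values that occur in the pairs.
    candidates = sorted({value for a, b, _ in pairs for value in (a, b)})
    return max(candidates, key=lambda T: accuracy(pairs, T))

pairs = [(0.1, 0.2, 1), (0.8, 0.9, 1), (0.1, 0.9, -1), (0.2, 0.8, -1)]
T_star = best_threshold(pairs)
print(T_star, accuracy(pairs, T_star))
```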
$h_{\phi,T}$ will place both examples in the same bin or separate them.
[Figure: the threshold T on the $\phi(x)$ axis.]
Find the best T* that predicts the true labeling subject to the probability constraints.
Local Weighted Regression (LWR)
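• Given a query image, PSH returns KNNs.
• LWR uses the KNN to compute a weighted average of the estimated angles of the query:
  $\beta^* = \arg\min_{\beta} \sum_{x_i \in N(x_0)} d_\theta(g(x_i, \beta), \theta_i)\, K(d_X(x_i, x_0))$
  where $d_\theta(\cdot)$ is the distance term and $K(d_X(\cdot))$ is the weight.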
Results
Synthetic data were generated:
• 13 angles: 1 for rotation of the torso, 12 for
joints.
• 150,000 images.
• Nuisance parameters added: clothing,
illumination, face expression.
• 1,775,000 example pairs.
• Selected 137 out of 5,123 meaningful features (how??):
  18-bit hash functions (k), 150 hash tables (l).
  Recall: P1 is the prob. of a positive hash, P2 is the prob. of a bad hash, B is the max number of pts in a bucket.
• Without selection needed 40 bits and 1000 hash tables.
• Test on 1000 synthetic examples:
• PSH searched only 3.4% of the data per query.
Results – real data
• 800 images.
• Processed by a segmentation algorithm.
• 1.3% of the data were searched.
Results – real data
Interesting mismatches
Fast pose estimation - summary
• Fast way to compute the angles of human
body figure.
• Moving from one representation space to
another.
• Training a sensitive hash function.
• KNN smart averaging.
Food for Thought
• The basic assumption may be problematic
(distance metric, representations).
• The training set should be dense.
• Texture and clutter.
• General: some features are more important
than others and should be weighted.
Food for Thought: Point Location in
Different Spheres (PLDS)
• Given: n spheres in $\mathbb{R}^d$, centered at P={p1,…,pn}
with radii {r1,…,rn}.
• Goal: preprocess the points in P so that, given a query q,
we can find a point pi whose sphere 'covers' the query q.
[Figure: a sphere of radius ri centered at pi, containing the query q.]
Courtesy of Mohamad Hegaze
Mean-Shift Based Clustering in High Dimensions: A
Texture Classification Example
B. Georgescu, I. Shimshoni, and P. Meer
Motivation:
• Clustering high dimensional data by using local
density measurements (e.g. feature space).
• Statistical curse of dimensionality:
sparseness of the data.
• Computational curse of dimensionality:
expensive range queries.
• LSH parameters should be adjusted for optimal
performance.
Outline
• Mean-shift in a nutshell + examples.
Our scope:
• Mean-shift in high dimensions – using LSH.
• Speedups:
1. Finding optimal LSH parameters.
2. Data-driven partitions into buckets.
3. Additional speedup by using LSH data structure.
Mean-shift LSH LSH: optimal k,l LSH: data LSH: data struct
partition
Mean-Shift in a Nutshell
[Figure: a point and the bandwidth window around it.]
KNN in mean-shift
Bandwidth should be inversely proportional to the density in the region:
• high density - small bandwidth
• low density - large bandwidth
The bandwidth is based on the kth nearest neighbor of the point.
Adaptive mean-shift vs. non-adaptive.
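A kth-nearest-neighbor bandwidth can be sketched as follows. This is a brute-force illustration that assumes Euclidean distance (the paper may use a different norm); `knn_bandwidth` is a hypothetical name. In the talk's setting this KNN query is the step that LSH is meant to accelerate.

```python
import math

def knn_bandwidth(points, i, k):
    """Bandwidth of point i = distance to its k-th nearest neighbor."""
    dists = sorted(math.dist(points[i], p)
                   for j, p in enumerate(points) if j != i)
    return dists[k - 1]

points = [(0.0,), (1.0,), (1.5,), (10.0,)]
print(knn_bandwidth(points, 1, 1))  # point in a dense region
print(knn_bandwidth(points, 3, 1))  # isolated point
```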
Image segmentation algorithm
1. Input : Data in 5D (3 color + 2 x,y) or 3D (1 gray +2 x,y)
2. Resolution controlled by the bandwidth: hs (spatial), hr (color)
3. Apply filtering
3D:
Mean-shift: A Robust Approach Towards Feature Space Analysis. D. Comaniciu et. al. TPAMI 02’
Image segmentation algorithm
Filtering: pixel value of the nearest mode
[Figure: mean-shift trajectories; panels: original, filtered, segmented.]
Filtering examples
[Figure: squirrel and baboon images, original vs. filtered.]
Mean-shift: A Robust Approach Towards Feature Space Analysis. D. Comaniciu et. al. TPAMI 02’
Segmentation examples
Mean-shift: A Robust Approach Towards Feature Space Analysis. D. Comaniciu et. al. TPAMI 02’
Mean-shift in high dimensions
Statistical curse of dimensionality:
Sparseness of the data → variable bandwidth
Computational curse of dimensionality:
Expensive range queries → implemented with LSH
LSH-based data structure
• Choose L random partitions:
  each partition includes K pairs $(d_k, v_k)$
• For each point $x_i$ we check whether $x_{i, d_k} \le v_k$
• This partitions the data into cells.
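One random partition and the resulting cell key can be sketched as below. The names `random_partition` and `cell_key` are hypothetical, and drawing each cut value uniformly within the data range of the chosen coordinate is an assumption consistent with the description of the original LSH cuts.

```python
import random

def random_partition(data, K, rng):
    """K pairs (d_k, v_k): a random coordinate and a cut value in its range."""
    d = len(data[0])
    cuts = []
    for _ in range(K):
        dk = rng.randrange(d)
        lo = min(x[dk] for x in data)
        hi = max(x[dk] for x in data)
        cuts.append((dk, rng.uniform(lo, hi)))
    return cuts

def cell_key(x, cuts):
    """K booleans, one per test x[d_k] <= v_k; equal keys mean the same cell."""
    return tuple(x[dk] <= vk for dk, vk in cuts)

data = [(0.2, 0.9), (0.3, 0.8), (0.9, 0.1)]
cuts = random_partition(data, K=4, rng=random.Random(0))
for x in data:
    print(x, cell_key(x, cuts))
```

Repeating this L times with independent cuts gives the L partitions; a query's candidate set is the union of its L cells.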
Choosing the optimal K and L
• For a query q compute
smallest number of distances
to points in its buckets.
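• $N_{C_l} \approx n\,(K/d + 1)^{-d}$: number of points in a single cell
• $N_{C_\cup} \le L\, N_{C_l}$: number of points in the union $C_\cup$ of the query's L cells
• As L increases, $C_\cup$ increases but $C_\cap$ decreases.
• $C_\cap$ determines the resolution of the data structure.
• Large K → smaller number of points in a cell.
• If L is too small, points might be missed, but if L is too big, $C_\cup$ might include extra points.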
Choosing optimal K and L
• Determine accurately the KNN distance (bandwidth) for m randomly-selected data points.
• Choose an error threshold ε.
• The optimal K and L should keep the error of the approximate distance within the threshold.
Choosing optimal K and L
• For each K, estimate the error for all L's in one run:
  find the minimal L satisfying the constraint, L(K)
• Minimize the time t(K, L(K))
[Figure: three plots: approximation error for K, L; L(K) for ε = 0.05; running time t[K, L(K)] with its minimum marked.]
Data driven partitions
• In original LSH, cut values are random in the range of the
data.
• Suggestion: Randomly select a point from the data and
use one of its coordinates as the cut value.
[Figure: points/bucket distribution, uniform cuts vs. data-driven cuts.]
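The data-driven suggestion changes only how the cut value is drawn. A minimal sketch, with the hypothetical name `data_driven_cut`:

```python
import random

def data_driven_cut(data, rng):
    """Pick a random coordinate d_k and use a random data point's value there as v_k."""
    d = len(data[0])
    dk = rng.randrange(d)
    point = rng.choice(data)
    return dk, point[dk]

data = [(0.2, 0.9), (0.3, 0.8), (0.25, 0.85), (0.9, 0.1)]
rng = random.Random(0)
print([data_driven_cut(data, rng) for _ in range(3)])
```

Because cut values are sampled from the data itself, dense regions receive more cuts than sparse ones, which evens out the number of points per bucket.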
Additional speedup
Assume that all points in $C_\cap$ will converge to the
same mode. ($C_\cap$ acts as a kind of aggregate.)
Speedup results
65536 points, 1638 points sampled , k=100
Food for thought
Low dimension High dimension
A thought for food…
• Choose K, L by sample learning, or take the
traditional.
• Can one estimate K, L without sampling?
• A thought for food: does it help to know the data
dimensionality or the data manifold?
• Intuitively: dimensionality implies the number of
hash functions needed.
• The catch: efficient dimensionality learning requires
KNN.
15:30 cookies…..
Summary
• LSH trades some accuracy for a gain in
computational complexity.
• Applications that involve massive data in high
dimension require the fast performance of LSH.
• Extension of the LSH to different spaces (PSH).
• Learning the LSH parameters and hash
functions for different applications.
Conclusion
• ..but at the end
everything depends on your data set
• Try it at home
– Visit:
http://web.mit.edu/andoni/www/LSH/index.html
– Email Alex Andoni Andoni@mit.edu
– Test over your own data
(C code under Red Hat Linux )
Thanks
• Ilan Shimshoni (Haifa).
• Mohamad Hegaze (Weizmann).
• Alex Andoni (MIT).
• Mica and Denis.