Search k-Nearest Neighbors in High Dimensions
Tomer Peled, Dan Kushnir
"Tell me who your neighbors are, and I'll know who you are."

Outline
• Problem definition and flavors.
• Algorithms overview - low dimensions.
• Curse of dimensionality (d > 10..20).
• Enchanting the curse: Locality Sensitive Hashing (high-dimension approximate solutions), l2 extension.
• Applications (Dan).

Nearest Neighbor Search - problem definition
• Given: a set P of n points in R^d, over some distance metric.
• Find: the nearest neighbor p of a query q in P.

Applications
• Classification.
• Indexing.
• Dimension reduction (e.g. LLE).
• Clustering.
• Segmentation.

Naive solution
• No preprocessing.
• Given a query point q, go over all n points, doing the comparison in R^d.
• Query time = O(nd).
• Keep it in mind as a baseline.

Common solution
• Use a data structure for acceleration.
• Scalability with n and with d is important.

When to use nearest neighbors
• Parametric approach: probability distribution estimation - complex models.
• Non-parametric approach: density estimation, nearest neighbors - sparse data, high dimensions.
• Nearest neighbors assume no prior knowledge about the underlying probability structure.

Nearest Neighbor
• Return p = argmin_{p_i ∈ P} dist(q, p_i).

r-Near Neighbor
• Return some p with dist(q, p) ≤ r.
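A minimal sketch of the naive O(nd) scan described above, using NumPy as an implementation choice (not from the slides):

```python
import numpy as np

def nearest_neighbor(q, P):
    """Naive search: compute all n distances in R^d and take the minimum."""
    dists = np.linalg.norm(P - q, axis=1)   # O(nd) work, no preprocessing
    return P[np.argmin(dists)]

P = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.5]])
q = np.array([0.9, 1.2])
print(nearest_neighbor(q, P))   # the closest point, here [1. 1.]
```

Everything that follows in the talk is about avoiding this full scan.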
(r, ε)-Near Neighbor (approximate nearest neighbor)
• If there is p1 with dist(q, p1) ≤ r, return some p2 with dist(q, p2) ≤ (1 + ε)·r.
• The two radii: r2 = (1 + ε)·r1.

The simplest solution: the lion in the desert

Quadtree
• Split the first dimension into 2; repeat iteratively.
• Stop when each cell has no more than 1 data point.

Quadtree - structure
• Each node splits at (X1, Y1) into four children: {P < X1, P ≥ X1} × {P < Y1, P ≥ Y1}.

Query - Quadtree
• Descend to the cell containing q and check nearby cells; in many cases this works.

Pitfall 1 - Quadtree
• In some cases it doesn't work: the true neighbor may sit in a distant cell.
• In some cases nothing works.

Pitfall 2 - Quadtree
• The query time can be O(2^d): exponential in the number of dimensions.

Space-partition based algorithms
• Could be improved; see "Multidimensional access methods", Volker Gaede and O. Gunther.

Curse of dimensionality
• Exact solutions: query time or space O(min(n·d, n^d)).
• For d > 10..20, worse than a sequential scan for most geometric distributions.
• Techniques specific to high dimensions are needed.
• Proved in theory and in practice by Barkol & Rabani (2000) and Beame & Vee (2002).

Curse of dimensionality - some intuition
• The number of cells grows as 2, 2^2, 2^3, …, 2^d.

Preview
• General solution - locality sensitive hashing.
• Implementation for Hamming space.
• Generalization to l1 & l2.

Hash function
• Data item → hash function → key → bin/bucket.
• Hash function → data structure.
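Returning to the quadtree from earlier, a minimal 2-D sketch under the slides' stopping rule (split until each cell holds at most one point); the class and method names are made up for illustration:

```python
class QuadTree:
    """Minimal 2-D quadtree: median-split a cell until it holds <= 1 point."""

    def __init__(self, points, depth=0, max_depth=16):
        self.points = points
        self.children = []
        if len(points) <= 1 or depth >= max_depth:
            return                                   # leaf cell
        xs = sorted(p[0] for p in points)
        ys = sorted(p[1] for p in points)
        x_split, y_split = xs[len(xs) // 2], ys[len(ys) // 2]
        quads = [[], [], [], []]
        for p in points:
            # 2-bit child index: one bit per dimension, as in the P<X1 / P>=X1 slides
            quads[(p[0] >= x_split) * 2 + (p[1] >= y_split)].append(p)
        if max(len(qd) for qd in quads) < len(points):   # split made progress
            self.children = [QuadTree(qd, depth + 1) for qd in quads if qd]

    def count_leaves(self):
        if not self.children:
            return 1
        return sum(c.count_leaves() for c in self.children)
```

In d dimensions each split produces up to 2^d children, which is exactly the exponential blow-up the curse-of-dimensionality slides warn about.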
Hash function example
• X = a number in the range 0..n; X modulo 3 maps it to a storage address in 0..2.
• Usually we would like related data items to be stored in the same bin.

Recall: (r, ε)-Near Neighbor
• If dist(q, p1) ≤ r, return p2 with dist(q, p2) ≤ (1 + ε)·r; r2 = (1 + ε)·r1.

Locality sensitive hashing
• A family of hash functions is (r, ε, p1, p2)-sensitive if:
  - Pr[h(p) = h(q)] ≥ p1 ("high") when p is close to q (dist(q, p) ≤ r);
  - Pr[h(p) = h(q)] ≤ p2 ("low") when p is far from q (dist(q, p) ≥ (1 + ε)·r).

Hamming space
• Hamming space = the 2^N binary strings of length N.
• Hamming distance = the number of differing digits, a.k.a. signal distance (Richard Hamming).
• Example: 010100001111 vs. 010010000011 → distance = 4.
• Distance = SUM(X1 XOR X2).

L1 to Hamming space embedding
• Encode each coordinate in unary with C bits: value x becomes x ones followed by C − x zeros.
• Example, C = 11: the point (2, 8) → 11000000000 11111111000; the embedded dimension is d' = C·d.
• The embedding preserves L1 distance as Hamming distance.

Hash function (bit sampling)
• p ∈ H^{d'}; for j = 1..L, G_j(p) = p|I_j - a sample of k bits (e.g. k = 3 digits) of p.
• Store p in bucket p|I_j, one of 2^k buckets (e.g. 101).

Construction and query
• Construction: insert every point p into its bucket in each of the L tables.
• Query: inspect q's bucket in each of the L tables.

Alternative intuition: random projections
• Each sampled bit acts as an axis-parallel cut of the embedded space.
• k samplings split the space into 2^k buckets (000, 001, …, 111); this is repeated L times.

Secondary hashing
• The 2^k buckets are re-hashed into M buckets of size B with M·B = α·n (e.g. α = 2).
• Supports tuning dataset size vs. storage volume.
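A minimal sketch of the unary embedding and bit-sampling hash above; `unary_embed` and `make_bit_sampler` are hypothetical names, and a real index would keep L such samplers, each backed by a hash table:

```python
import random

def unary_embed(x, C):
    """Unary code: integer x in 0..C becomes x ones then C - x zeros, so the
    Hamming distance between two codes equals the L1 distance |x1 - x2|."""
    return [1] * x + [0] * (C - x)

def make_bit_sampler(d, k, seed=0):
    """One table's hash G_j(p) = p|I_j: p restricted to k random bit positions."""
    rng = random.Random(seed)
    idx = rng.sample(range(d), k)
    return lambda bits: tuple(bits[i] for i in idx)

# The slides' example: C = 11, point with coordinates (2, 8).
p = unary_embed(2, 11) + unary_embed(8, 11)   # d' = C * d = 22 bits
g = make_bit_sampler(d=22, k=3, seed=1)
bucket = g(p)                                  # one of 2^k = 8 buckets
```

Close points share most bits, so they agree on a random k-bit sample with high probability, which is exactly the locality-sensitivity property.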
The above hashing is locality-sensitive
• Pr[p and q fall in the same bucket] = (1 − D(p, q)/d')^k, where D is the Hamming distance and d' is the number of dimensions.
• The collision probability decays faster with distance for larger k (compare k = 1 and k = 2).
(Adapted from Piotr Indyk's slides)

Preview
• General solution - locality sensitive hashing.
• Implementation for Hamming space.
• Generalization to l2.

Direct L2 solution
• A new hashing function, still based on sampling, using a mathematical trick:
• a p-stable distribution for the Lp distance; the Gaussian distribution for the L2 distance.

Central limit theorem
• v1·X1 + v2·X2 + … + vn·Xn = a weighted sum of Gaussians = a weighted Gaussian.
• v1..vn are real numbers; X1..Xn are independent identically distributed (i.i.d.).
• Σ_i v_i·X_i ~ (Σ_i |v_i|^2)^{1/2}·X = ||v||_2·X: a dot product turns into a norm.
• For two feature vectors u, v: Σ_i u_i·X_i − Σ_i v_i·X_i ~ (Σ_i |u_i − v_i|^2)^{1/2}·X: a dot product turns into a distance.

The full hashing
• a = d random numbers, i.i.d. from a p-stable distribution (here Gaussian); v = the features vector; b = Random[0, w]; w = the discretization step.
• h_{a,b}(v) = ⌊(a·v + b)/w⌋.
• Example: a·v = 7944, b = 34, w = 100 → bucket ⌊7978/100⌋ = 79 on the grid 7800, 7900, 8000, 8100, 8200.

Generalization: p-stable distributions
• L2: Gaussian (normal) distribution - by the Central Limit Theorem.
• Lp, p = ε..2: p-stable distributions (e.g. Cauchy for L1) - by the Generalized Central Limit Theorem.

p-stable summary: (r, ε)-Near Neighbor
• Works for, and generalizes to, 0 < p ≤ 2.
• Improves the query time from O(d·n^{1/(1+ε)}·log n) to O(d·n^{1/(1+ε)²}·log n) (latest results, reported in an e-mail by Alexander Andoni).

Parameters selection
• For Euclidean space, pick k and L for the best query-time performance at, e.g., 90% success probability.
• A single projection hits an ε-near neighbor with Pr = p1.
• k projections (one table) hit it with Pr = p1^k.
• All L hash tables fail to collide with Pr = (1 − p1^k)^L.
• To ensure a collision with probability at least 1 − δ (e.g. 90%): 1 − (1 − p1^k)^L ≥ 1 − δ, i.e. L ≥ log(δ)/log(1 − p1^k).
• Query time = candidate extraction + candidate verification; k and L trade these two costs off against each other.

Pros & cons
+ Better query time than spatial data structures.
+ Scales well to higher dimensions and larger data sizes (sub-linear dependence).
+ Predictable running time.
− Extra storage overhead.
− Inefficient for data with distances concentrated around the average.
− Works best for Hamming distance (although it can be generalized to Euclidean space).
− In secondary storage, a linear scan is pretty much all we can do (for high dimension).
− Requires the radius r to be fixed in advance.
(From Piotr Indyk's slides)

Conclusion
• …but at the end, everything depends on your data set.

Try it at home
• Visit http://web.mit.edu/andoni/www/LSH/index.html
• Email Alex Andoni: Andoni@mit.edu
• Test over your own data (C code under Red Hat Linux).

LSH - applications
• Searching video clips in databases ("Hierarchical, Non-Uniform Locality Sensitive Hashing and Its Application to Video Identification", Yang, Ooi, Sun).
• Searching image databases (see the following).
• Image segmentation (see the following).
• Image classification ("Discriminant Adaptive Nearest Neighbor Classification", T. Hastie, R. Tibshirani).
• Texture classification (see the following).
• Clustering (see the following).
• Embedding and manifold learning (LLE, and many others).
• Compression - vector quantization.
• Search engines ("LSH Forest: Self-Tuning Indexes for Similarity Search", M. Bawa, T. Condie, P. Ganesan).
• Genomics ("Efficient Large-Scale Sequence Comparison by Locality-Sensitive Hashing", J. Buhler).
• In short: whenever k-Nearest Neighbors (KNN) are needed.

Motivation
• A variety of procedures in learning require KNN computation.
• KNN search is a computational bottleneck.
• LSH provides a fast approximate solution to the problem.
• LSH requires hash-function construction and parameter tuning.
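A sketch of the two building blocks from the preceding sections, assuming nothing beyond the slides: the 2-stable (Gaussian) hash h_{a,b}(v) = ⌊(a·v + b)/w⌋, and the number of tables L needed so that at least one table collides with probability 1 − δ. Function names and the toy parameters are illustrative only:

```python
import math
import random

def make_pstable_hash(d, w, seed=0):
    """One L2 LSH function: h_{a,b}(v) = floor((a . v + b) / w),
    with a ~ N(0, I) (a 2-stable distribution) and b ~ Uniform[0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    def h(v):
        return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)
    return h

def tables_needed(p1, k, delta=0.1):
    """Smallest L with 1 - (1 - p1**k)**L >= 1 - delta,
    i.e. L = ceil(log(delta) / log(1 - p1**k))."""
    return math.ceil(math.log(delta) / math.log(1.0 - p1 ** k))

h = make_pstable_hash(d=3, w=4.0, seed=42)
# Nearby points usually land in the same slot (not guaranteed for any single h).
print(h([1.0, 2.0, 3.0]), h([1.01, 2.0, 3.0]))

# e.g. per-function collision probability p1 = 0.9 and k = 10 functions per table:
print(tables_needed(0.9, 10))   # -> 6 tables for >= 90% overall collision
```

Increasing k sharpens each table (fewer false candidates to verify) at the price of a larger L, which is exactly the extraction-vs-verification trade-off above.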
Outline
• Fast Pose Estimation with Parameter Sensitive Hashing - G. Shakhnarovich, P. Viola, and T. Darrell: finding sensitive hash functions.
• Mean Shift Based Clustering in High Dimensions: A Texture Classification Example - B. Georgescu, I. Shimshoni, and P. Meer: tuning LSH parameters; the LSH data structure is used for algorithm speedups.

Fast Pose Estimation with Parameter Sensitive Hashing
G. Shakhnarovich, P. Viola, and T. Darrell
The problem: given an image x, what are the parameters θ in this image? I.e. the angles of joints, the orientation of the body, etc.

Ingredients
• Input query image with unknown angles (parameters).
• Database of human poses with known angles.
• Image feature extractor - edge detector.
• Distance metric in feature space: d_x.
• Distance metric in angle space: d_θ(θ1, θ2) = Σ_{i=1..m} (1 − cos(θ1_i − θ2_i)).

Example-based learning
• Construct a database of example images with their known angles.
• Given a query image, run your favorite feature extractor.
• Compute the KNN from the database.
• Use these KNNs to compute the average angles of the query.
• Flow: input query → find KNN in a database of examples → output: average angles of the KNN.

The algorithm flow
• Input query → feature extraction → processed query → PSH match against the database of examples → LWR → output.

The image features
• Image features are multi-scale edge direction histograms, computed over image regions (A, B, …) at the orientations 0, π/4, π/2, 3π/4.

PSH: the basic assumption
• There are two metric spaces here: feature space (d_x) and parameter space (d_θ).
• We want similarity to be measured in the angle space, whereas LSH works on the feature space.
• Assumption: the feature space is closely related to the parameter space.

Insight: manifolds
• A manifold is a space in which every point has a neighborhood resembling a Euclidean space, but the global structure may be complicated: curved.
• For example: lines are 1D manifolds, planes are 2D manifolds, etc.
• Is this magic? A query's neighborhood in feature space corresponds to a neighborhood in parameter (angle) space.
Parameter Sensitive Hashing (PSH)
• The trick: estimate the performance of different hash functions on examples, and select those sensitive to d_θ.
• The hash functions are applied in feature space, but the KNN are valid in angle space.

PSH as a classification problem
• Label pairs of examples with similar angles.
• Define hash functions h on the feature space.
• Predict the labeling of similar / non-similar examples by using h, and compare with the true labels.
• If the labeling by h is good, accept h; else change h.

Pair labeling
• A pair of examples (x_i, θ_i), (x_j, θ_j) is labeled:
  y_ij = +1 if d_θ(θ_i, θ_j) ≤ r;
  y_ij = −1 if d_θ(θ_i, θ_j) > r·(1 + ε).
• Example labels: +1, +1, −1, −1 (r = 0.25).

Binary hash functions
• A binary hash function over a feature φ: h_{φ,T}(x) = 1 if φ(x) ≥ T, −1 otherwise.
• Predicted label: ŷ_h(x_i, x_j) = +1 if h_{φ,T}(x_i) = h_{φ,T}(x_j), −1 otherwise.
• h_{φ,T} will place both examples in the same bin or separate them, depending on the threshold T on φ(x).
• Find the best T* that predicts the true labeling subject to the probability constraints.

Local Weighted Regression (LWR)
• Given a query image x_0, PSH returns its KNNs.
• LWR uses the KNN to compute a weighted average of the estimated angles of the query:
  β* = argmin_β Σ_{x_i ∈ N(x_0)} d_θ(g(x_i, β), θ_i)·K(d_X(x_i, x_0)),
  where g is the local model and K is a distance-based weight kernel.

Results
• Synthetic data were generated: 13 angles (1 for the rotation of the torso, 12 for the joints); 150,000 images.
• Nuisance parameters added: clothing, illumination, face expression.
• 1,775,000 example pairs.
• Selected 137 out of 5,123 meaningful features (how??).
• 18-bit hash functions (k), 150 hash tables (L). Recall: p1 is the probability of a positive hash, p2 the probability of a bad hash, B the maximal number of points in a bucket.
• Without selection, 40 bits and 1,000 hash tables would be needed.
• Test on 1,000 synthetic examples: PSH searched only 3.4% of the data per query.

Results - real data
• 800 images, processed by a segmentation algorithm.
• 1.3% of the data were searched.
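A sketch of two of the pieces above: the angle-space metric d_θ, and a zeroth-order version of the LWR step - a plain kernel-weighted average of the neighbors' angles (the paper fits a local model g, which is simplified away here). The function names and the (feature, angles) data layout are assumptions:

```python
import math

def angle_distance(t1, t2):
    """d_theta(t1, t2) = sum_i (1 - cos(t1_i - t2_i)): zero for identical poses,
    insensitive to 2*pi wrap-around, contributing up to 2 per opposite joint."""
    return sum(1.0 - math.cos(a - b) for a, b in zip(t1, t2))

def lwr_angle_estimate(neighbors, query_feat, bandwidth=1.0):
    """Weight each KNN's known angles by a Gaussian kernel K of its feature-space
    distance to the query, then return the weighted average angle vector."""
    num, den = None, 0.0
    for feat, angles in neighbors:
        wgt = math.exp(-(math.dist(feat, query_feat) / bandwidth) ** 2)
        if num is None:
            num = [wgt * a for a in angles]
        else:
            num = [n + wgt * a for n, a in zip(num, angles)]
        den += wgt
    return [n / den for n in num]
```

Note the division of labor: hashing and candidate retrieval happen in feature space, while d_θ and the averaged output live in angle space.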
Results - real data: interesting mismatches.

Fast pose estimation - summary
• A fast way to compute the angles of a human body figure.
• Moving from one representation space to another.
• Training a sensitive hash function.
• KNN smart averaging.

Food for thought
• The basic assumption may be problematic (distance metric, representations).
• The training set should be dense.
• Texture and clutter.
• In general: some features are more important than others and should be weighted.

Food for thought: Point Location in Different Spheres (PLDS)
• Given: n spheres in R^d, centered at P = {p1, …, pn}, with radii {r1, …, rn}.
• Goal: given a query q, preprocess the spheres so as to find a point p_i whose sphere covers q.
(Courtesy of Mohamad Hegaze)

Mean-Shift Based Clustering in High Dimensions: A Texture Classification Example
B. Georgescu, I. Shimshoni, and P. Meer

Motivation
• Clustering high-dimensional data by using local density measurements (e.g. in feature space).
• Statistical curse of dimensionality: sparseness of the data.
• Computational curse of dimensionality: expensive range queries.
• LSH parameters should be adjusted for optimal performance.

Outline
• Mean-shift in a nutshell + examples.
• Our scope: mean-shift in high dimensions - using LSH.
• Speedups: 1. finding optimal LSH parameters; 2. data-driven partitions into buckets; 3. additional speedup by using the LSH data structure.

Mean-shift in a nutshell
• Repeatedly move each point to the weighted mean of the points inside a bandwidth window around it, until it converges to a mode.

KNN in mean-shift
• The bandwidth should be inversely proportional to the density in the region: high density - small bandwidth; low density - large bandwidth.
• The bandwidth is based on the kth nearest neighbor of the point.
• Adaptive mean-shift vs. non-adaptive.
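A minimal flat-kernel sketch of the mean-shift iteration above, with a fixed bandwidth for brevity (the paper's adaptive version would instead set each point's bandwidth from its kth nearest neighbor; function names are illustrative):

```python
import math

def mean_shift_point(x, data, bandwidth):
    """One mean-shift step: move x to the mean of the data points inside the
    bandwidth window (flat kernel: weight 1 inside, 0 outside)."""
    inside = [p for p in data if math.dist(p, x) <= bandwidth]
    if not inside:
        return x
    return [sum(c) / len(inside) for c in zip(*inside)]

def mean_shift_mode(x, data, bandwidth, tol=1e-6, max_iter=100):
    """Iterate the step until the shift is negligible: x drifts to a density mode."""
    for _ in range(max_iter):
        nxt = mean_shift_point(x, data, bandwidth)
        if math.dist(nxt, x) < tol:
            return nxt
        x = nxt
    return x

# Two well-separated 1-D clusters: each starting point converges to its cluster mean.
data = [[0.0], [0.2], [0.4], [10.0], [10.2], [10.4]]
```

The range query inside `mean_shift_point` is the expensive part in high dimensions; that is exactly the query the paper replaces with LSH.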
Image segmentation algorithm
1. Input: data in 5D (3 color + 2 x,y) or 3D (1 gray + 2 x,y).
2. Resolution controlled by the bandwidths: h_s (spatial), h_r (color).
3. Apply filtering.
(Mean-shift: A Robust Approach Towards Feature Space Analysis, D. Comaniciu et al., TPAMI '02)

Image segmentation algorithm (cont.)
• Filtering: each pixel takes the value of the nearest mode, found by following the mean-shift trajectories.
• Pipeline: original → filtered → segmented.

Filtering and segmentation examples
• Original squirrel → filtered; original baboon → filtered; further segmentation examples.
(Mean-shift: A Robust Approach Towards Feature Space Analysis, D. Comaniciu et al., TPAMI '02)

Mean-shift in high dimensions
• Statistical curse of dimensionality: sparseness of the data → variable bandwidth.
• Computational curse of dimensionality: expensive range queries → implemented with LSH.

LSH-based data structure
• Choose L random partitions; each partition includes K pairs (d_k, v_k).
• For each point x_i and each pair we check whether x_{i,d_k} ≤ v_k; the K answers partition the data into cells.

Choosing the optimal K and L
• For a query q, compute the smallest number of distances to points in its buckets.
• As L increases, the union C∪ of the query's cells grows, but each individual cell C shrinks; C∪ determines the resolution of the data structure.
• Large K → a smaller number of points in a cell.
• If L is too small, points might be missed; if L is too big, C∪ might include extra points.

Choosing the optimal K and L (procedure)
• Determine accurately the KNN, and hence the bandwidth distance, for m randomly selected data points.
• Choose an error threshold ε; the optimal K and L should satisfy the approximate-distance constraint.
• For each K, estimate the error; in one run over all L's, find the minimal L satisfying the constraint: L(K).
• Minimize the running time t(K, L(K)) over K, e.g. the approximation L(K) for ε = 0.05 against the running time t[K, L(K)].

Data-driven partitions
• In the original LSH, cut values are random in the range of the data.
• Suggestion: randomly select a point from the data and use one of its coordinates as the cut value.
• This evens out the points-per-bucket distribution compared with uniform cuts.

Additional speedup
• Assume that all points in C∪ will converge to the same mode (C∪ acts as a type of aggregate).

Speedup results
• 65,536 points, 1,638 points sampled, k = 100.

A thought for food…
• Food for thought: low dimension vs. high dimension.
• Choose K, L by sample learning, or take the traditional values.
• Can one estimate K, L without sampling?
• Does it help to know the data dimensionality or the data manifold? Intuitively, dimensionality implies the number of hash functions needed.
• The catch: efficient dimensionality learning requires KNN.
• 15:30 cookies…

Summary
• LSH suggests a compromise on accuracy for a gain in complexity.
• Applications that involve massive data in high dimension require the fast performance of LSH.
• Extension of LSH to different spaces (PSH).
• Learning the LSH parameters and hash functions for different applications.
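A sketch of the data-driven partition idea above: cut values are coordinates of randomly drawn data points rather than uniform random values, so bucket boundaries follow the data density. The function names are illustrative:

```python
import random

def data_driven_cuts(points, K, seed=0):
    """Pick K (dimension, value) cut pairs, each value taken from a coordinate of
    a randomly chosen data point (vs. uniform random cuts in plain LSH)."""
    rng = random.Random(seed)
    d = len(points[0])
    cuts = []
    for _ in range(K):
        dim = rng.randrange(d)
        pt = rng.choice(points)
        cuts.append((dim, pt[dim]))
    return cuts

def cell_of(x, cuts):
    """Bit vector of which side of each cut x falls on: the point's cell id."""
    return tuple(int(x[dim] >= val) for dim, val in cuts)

points = [[0.0, 5.0], [1.0, 6.0], [2.0, 7.0]]
cuts = data_driven_cuts(points, K=4, seed=3)
print(cell_of([1.5, 6.5], cuts))
```

Repeating this with L independent `cuts` lists gives the L partitions of the LSH data structure; dense regions get more, smaller cells.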
Conclusion
• …but at the end, everything depends on your data set.
• Try it at home: visit http://web.mit.edu/andoni/www/LSH/index.html, email Alex Andoni (Andoni@mit.edu), and test over your own data (C code under Red Hat Linux).

Thanks
• Ilan Shimshoni (Haifa).
• Mohamad Hegaze (Weizmann).
• Alex Andoni (MIT).
• Mica and Denis.