Machine Learning
Instance Based Learning

Outline
• k-Nearest Neighbor
• Locally weighted learning
• Radial basis functions
• Case-based reasoning
• Lazy and eager learning

Key Idea
• Instance-based learning divides into two simple steps:
  1. Store all examples in the training set.
  2. When a new example arrives, retrieve those examples similar to the new example and look at their classes.
• Instance-based learning is often termed lazy learning, as there is typically no "transformation" of training instances into more general "statements".
• Instance-based learners never form an explicit general hypothesis regarding the target function
  – They simply compute the classification of each new query instance as needed.

Similarity Metric
• We need a measure of distance in order to know which instances are the nearest neighbours.
• Assume that we have T attributes for the learning problem. Then one example point x has elements x_t ∈ ℝ, t = 1, …, T.
• The distance between two points x_i and x_j is often defined as the Euclidean distance:
  d(x_i, x_j) = sqrt( Σ_{t=1..T} (x_{t,i} − x_{t,j})² )

k-NN Algorithm for Discrete Target Functions
• For each training instance t = (x, f(x))
  – Add t to the set Tr_instances
• Given a query instance q to be classified
  – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q
  – Return f̂(q) = argmax_{v ∈ V} Σ_{i=1..k} δ(v, f(x_i))
• where V is the finite set of target class values, and δ(a, b) = 1 if a = b, and 0 otherwise
• Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors

Voronoi Diagram
[Figure: Voronoi tessellation of the instance space around the query point q_f and its nearest neighbour q_i]

3-Nearest Neighbors
[Figure: query point q_f and its 3 nearest neighbours, 2 x's and 1 o]

k-NN Algorithm for Real-valued Target Functions
• For each training instance t = (x, f(x))
  – Add t to the set Tr_instances
• Given a query instance q
  – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q
  – Return f̂(x_q) = ( Σ_{i=1..k} f(x_i) ) / k

7-Nearest Neighbors
[Figure: query point q_f and its 7 nearest neighbours, 3 x's and 4 o's]

Nearest Neighbor (continuous)
[Figures: 1-, 3- and 5-nearest-neighbour fits to a one-dimensional real-valued target function]

When to Consider Nearest Neighbors
• Instances map to points in ℝ^N
• Fewer than 20 attributes per instance
• Lots of training data
Advantages:
• Training is very fast
• Learn complex target functions
• Do not lose information
Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes

How to Choose k
• Large k:
  – less sensitive to noise (particularly class noise)
  – better probability estimates for discrete classes
  – larger training sets allow larger values of k
• Small k:
  – captures the fine structure of the problem space better
  – may be necessary with small training sets
• A balance must be struck between large and small k
• As the training set approaches infinity and k grows large, k-NN becomes Bayes optimal

Distance Weighted k-NN
• Give more weight to neighbors closer to the query point
• Discrete: f̂(q) = argmax_{v ∈ V} Σ_{i=1..k} w_i δ(v, f(x_i))
• Real-valued: f̂(q) = ( Σ_{i=1..k} w_i f(x_i) ) / ( Σ_{i=1..k} w_i ), with w_i = 1 / d(x_q, x_i)²
• Instead of only the k nearest neighbors, use all training examples (Shepard's method)
• (A Python sketch of the plain and distance-weighted variants follows below.)
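As a concrete illustration of the k-NN slides above, here is a minimal Python sketch: the Euclidean distance from the Similarity Metric slide, a majority vote for discrete targets, and the mean of the neighbours' values for real-valued targets. Function names such as knn_classify and knn_regress are illustrative, not from the slides.

```python
import math
from collections import Counter

def euclidean(a, b):
    # d(x_i, x_j) = sqrt( sum_t (x_{t,i} - x_{t,j})^2 )
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(tr_instances, q, k):
    # tr_instances is a list of (x, f(x)) pairs; return the k pairs nearest to q
    return sorted(tr_instances, key=lambda t: euclidean(t[0], q))[:k]

def knn_classify(tr_instances, q, k=3):
    # Discrete target: majority class among the k nearest neighbours
    neighbours = k_nearest(tr_instances, q, k)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def knn_regress(tr_instances, q, k=3):
    # Real-valued target: mean of the neighbours' target values
    neighbours = k_nearest(tr_instances, q, k)
    return sum(y for _, y in neighbours) / len(neighbours)

# Tiny usage example on made-up data
train = [((1.0, 1.0), 'o'), ((1.2, 0.9), 'o'),
         ((3.0, 3.1), 'x'), ((2.9, 3.3), 'x'), ((3.2, 2.8), 'x')]
print(knn_classify(train, (3.0, 3.0), k=3))   # -> 'x'
```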
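The distance-weighted variant from the slide above changes only how votes and averages are formed. The sketch below reuses euclidean and k_nearest from the previous block, uses w_i = 1/d², and handles the edge case of a query that coincides with a training point (not spelled out on the slides) by returning that point's value directly.

```python
def weighted_knn(tr_instances, q, k=3, discrete=True):
    # Distance-weighted k-NN with w_i = 1 / d(x_q, x_i)^2
    neighbours = k_nearest(tr_instances, q, k)
    weights = []
    for x, y in neighbours:
        d = euclidean(x, q)
        if d == 0.0:                 # query coincides with a training point
            return y
        weights.append((1.0 / d ** 2, y))
    if discrete:
        # Weighted vote: sum the weights per class, return the heaviest class
        scores = {}
        for w, y in weights:
            scores[y] = scores.get(y, 0.0) + w
        return max(scores, key=scores.get)
    # Real-valued target: weighted average of the neighbours' values
    return sum(w * y for w, y in weights) / sum(w for w, _ in weights)
```

Setting k to the size of the training set, with the same weights, gives Shepard's method mentioned on the slide.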
Curse of Dimensionality
• Imagine instances described by 20 attributes, but only 2 are relevant to the target function
• Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional
• One approach:
  – Stretch the j-th axis by weight z_j, where z_1, …, z_n are chosen to minimize prediction error
  – Use cross-validation to automatically choose the weights z_1, …, z_n
  – Note that setting z_j to zero eliminates this dimension altogether (feature subset selection)

Euclidean Distance?
[Figure: two scatter plots of o's and +'s over attribute_1 and attribute_2]
• What if classes are not spherical?
• What if some attributes are more/less important than other attributes?
• What if some attributes have more/less noise in them than other attributes?

Weighted Euclidean Distance
• D(c_1, c_2) = sqrt( Σ_{i=1..N} w_i · (attr_i(c_1) − attr_i(c_2))² )
• large weights => attribute is more important
• small weights => attribute is less important
• zero weights => attribute doesn't matter
• Weights allow kNN to be effective with axis-parallel elliptical classes

Learning Attribute Weights
• Scale attribute ranges or attribute variances to make them uniform (fast and easy)
• Prior knowledge
• Numerical optimization:
  – gradient descent, simplex methods, genetic algorithms
  – criterion is cross-validation performance
• Information Gain or Gain Ratio of single attributes

Locally Weighted Regression
• k-NN approximates the target function at the single query instance
• Locally weighted regression extends k-NN by constructing an explicit approximation of the target function over a local region surrounding the query instance:
  – Local: considers neighboring instances
  – Weighted: uses distance-weighting
  – Regression: predicts real-valued target functions

Linear Regression Example
[Figure: a simple linear regression fit to example data]

Locally-weighted regression
• Uses a linear function to approximate the target function near the query instance:
  f̂(x) = w_0 + w_1 a_1(x) + … + w_n a_n(x)
• What does this functional form remind us of?
  – The perceptron
  – Choose weights that minimize: E(w) = ½ Σ_{x ∈ D} (f(x) − f̂(x))²
  – Gradient descent rule: Δw_i = η Σ_{x ∈ D} (f(x) − f̂(x)) a_i(x)

Locally-weighted LR
[Figure: training data with the predicted values of a simple (global) regression and of a locally weighted, piece-wise regression]

Local vs. Global
• The delta rule is a global approximation procedure
• We seek a local approximation procedure (i.e., in the neighborhood of the query instance)
• Can the global procedure be modified? How?

Local Error Criteria
• The simplest way is to localize the error criterion
• There are 3 alternatives:
  E_1(x_q) = ½ Σ_{x ∈ k-NN(x_q)} (f(x) − f̂(x))²
  E_2(x_q) = ½ Σ_{x ∈ D} (f(x) − f̂(x))² K(d(x_q, x))
  E_3(x_q) = ½ Σ_{x ∈ k-NN(x_q)} (f(x) − f̂(x))² K(d(x_q, x))
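To make locally weighted linear regression concrete, here is a rough Python sketch of a local linear fit in the spirit of criterion E_3: only the k nearest neighbours contribute, each weighted by a Gaussian kernel of its distance to the query. It solves the weighted least-squares problem in closed form with NumPy rather than with the gradient rule on the slides, and the kernel width sigma and function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(d, sigma=1.0):
    # K(d) = exp(-d^2 / (2 sigma^2)); sigma is an assumed smoothing parameter
    return np.exp(-0.5 * (d / sigma) ** 2)

def lwlr_predict(X, y, x_q, k=10, sigma=1.0):
    """Locally weighted linear regression at a single query point x_q."""
    X, y, x_q = np.asarray(X, float), np.asarray(y, float), np.asarray(x_q, float)
    k = min(k, len(X))
    d = np.linalg.norm(X - x_q, axis=1)          # distances d(x_q, x)
    idx = np.argsort(d)[:k]                      # indices of the k nearest neighbours
    Xk, yk, wk = X[idx], y[idx], gaussian_kernel(d[idx], sigma)
    A = np.hstack([np.ones((k, 1)), Xk])         # design matrix with bias term a_0 = 1
    # Kernel-weighted least squares: scale each row by sqrt(w) and solve with lstsq
    coef, *_ = np.linalg.lstsq(np.sqrt(wk)[:, None] * A, np.sqrt(wk) * yk, rcond=None)
    return coef @ np.concatenate([[1.0], x_q])   # f_hat(x_q) = w_0 + sum_i w_i a_i(x_q)
```

The fit is redone for every query instance, which is exactly what makes the method "lazy".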
Weight Update Rule
• E_1 ignores distances
• E_2 is most attractive but most expensive
• E_3 offers a reasonable trade-off (i.e., a good approximation of E_2)
• LWLR delta rule: Δw_i = η Σ_{x ∈ k-NN(x_q)} K(d(x_q, x)) (f(x) − f̂(x)) a_i(x)

Kernel Functions
[Figure: examples of kernel functions K(d(x_q, x))]

Radial Basis Functions
• Global approximation to the target function in terms of a linear combination of local approximations
• Can approximate any function with arbitrarily small error
• Used, e.g., for image classification
• Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid
• Closely related to distance-weighted regression, but "eager" instead of "lazy"

Radial Basis Function Network
[Figure: network with input layer x_i, kernel functions K_n(d(x_n, x)) = exp(−½ d(x_n, x)² / σ²), linear parameters w_n, and output f(x)]
f(x) = w_0 + Σ_{n=1..k} w_n K_n(d(x_n, x))

Training Radial Basis Function Networks
• How to choose the center x_n for each kernel function K_n?
  – scatter uniformly across the instance space
  – use the distribution of training instances (clustering)
• How to train the weights?
  – Choose the mean x_n and variance σ_n for each K_n: non-linear optimization or EM
  – Hold K_n fixed and use local linear regression to compute the optimal weights w_n

Case-Based Reasoning
• Can apply instance-based learning even when X ≠ ℝ^n
• A different distance metric is needed
• Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions
• Applications:
  – Design: landscape, building, mechanical, conceptual design of aircraft sub-systems
  – Planning: repair schedules
  – Diagnosis: medical
  – Adversarial reasoning: legal

Case-Based Reasoning in CADET
• CADET stored examples of mechanical devices
• Training examples:
  – (qualitative function, mechanical structure)
• New query: desired function
• Target value: mechanical structure for this function
• Distance metric: match qualitative function descriptions

Case-Based Reasoning
• Cases represented by rich structural descriptions
• Multiple cases retrieved (and combined) to form a solution to the new problem
• Tight coupling between case retrieval and problem solving
• Bottom line
  – Simple matching of cases is useful for some applications (e.g., answering help-desk queries)
  – Area of ongoing research

Lazy and Eager Learning
• Lazy: wait for the query before generalizing
  – k-nearest neighbors, weighted linear regression
• Eager: generalize before seeing the query
  – Radial basis function networks, decision trees, back-propagation, ID3, NaiveBayes
• Does it matter?
• An eager learner must create a global approximation
• A lazy learner can create local approximations
• If they use the same hypothesis space, lazy can represent more complex functions (e.g., H = linear functions)

Literature & Software
• T. Mitchell, "Machine Learning", chapter 8, "Instance-Based Learning"
• C. Atkeson, A. Moore, S. Schaal, "Locally Weighted Learning", ftp:/ftp.cc.gatech.edu/pub/people/cga/air.html
• R. Duda et al., "Pattern Recognition", chapter 4, "Non-Parametric Techniques"
• Netlab toolbox
  – k-nearest neighbor classification
  – Radial basis function networks
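The Netlab toolbox listed above provides RBF networks; as a self-contained complement, here is a rough Python sketch of the "eager" training recipe from the Training Radial Basis Function Networks slides: pick centers from the training data (a random subsample here, a simplification of the clustering suggestion), hold the Gaussian kernels fixed, and fit the linear output weights by least squares. This is not the Netlab API; the class name, parameters, and defaults are illustrative.

```python
import numpy as np

class RBFNetwork:
    """Minimal RBF network: f(x) = w_0 + sum_n w_n * exp(-d(x_n, x)^2 / (2 sigma^2))."""

    def __init__(self, n_centers=10, sigma=1.0):
        self.n_centers, self.sigma = n_centers, sigma

    def _phi(self, X):
        # Gaussian kernel activations for every (instance, center) pair, plus a bias column
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-0.5 * d2 / self.sigma ** 2)
        return np.hstack([np.ones((len(X), 1)), K])

    def fit(self, X, y, seed=0):
        X, y = np.asarray(X, float), np.asarray(y, float)
        # Centers: a random subsample of the training instances
        # (the slides suggest clustering; a subsample keeps the sketch short)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(self.n_centers, len(X)), replace=False)
        self.centers = X[idx]
        # Hold the kernels fixed and fit the linear output weights by least squares
        self.w, *_ = np.linalg.lstsq(self._phi(X), y, rcond=None)
        return self

    def predict(self, X):
        return self._phi(np.asarray(X, float)) @ self.w

# Tiny usage example on synthetic one-dimensional data
X = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
y = np.sin(X).ravel()
model = RBFNetwork(n_centers=8, sigma=0.7).fit(X, y)
print(model.predict([[1.0], [2.0]]))
```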