# Machine Learning: Instance-Based Learning


## Outline

• k-Nearest Neighbor
• Locally weighted learning
• Radial basis functions
• Case-based reasoning
• Lazy and eager learning

## Key Idea

• Instance-based learning divides into two simple steps:
  1. Store all examples in the training set.
  2. When a new example arrives, retrieve those examples similar to the new example and look at their classes.
• Instance-based learning is often termed lazy learning, as there is typically no "transformation" of training instances into more general "statements".
• Instance-based learners never form an explicit general hypothesis regarding the target function.
  – They simply compute the classification of each new query instance as needed.

## Similarity Metric

• We need a measure of distance in order to know which instances are the nearest neighbours.
• Assume that we have T attributes for the learning problem. Then one example point x has elements x_t ∈ ℜ, t = 1,…,T.
• The distance between two points x_i, x_j is often defined as the Euclidean distance:

  $d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{ti} - x_{tj})^2}$
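A minimal sketch of this distance computation (plain Python; the function name is ours, not from the slides):

```python
from math import sqrt

def euclidean(xi, xj):
    """Euclidean distance between two points given as equal-length sequences of reals."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))
```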

## k-NN Algorithm for Discrete Target Functions

• For each training instance t = (x, f(x)):
  – Add t to the set Tr_instances.
• Given a query instance q to be classified:
  – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q.
  – Return
    $\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$
• where V is the finite set of target class values, and δ(a,b) = 1 if a = b, and 0 otherwise.
• Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors.
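A minimal sketch of this discrete k-NN classifier (plain Python; `knn_classify` and the `dist` argument are our names, with `dist` standing for a distance function such as `euclidean` above):

```python
from collections import Counter

def knn_classify(tr_instances, q, k, dist):
    """Return the majority class among the k training instances nearest to query q.
    tr_instances: list of (x, f_x) pairs; dist: distance function over instance points."""
    neighbors = sorted(tr_instances, key=lambda t: dist(t[0], q))[:k]
    votes = Counter(f_x for _, f_x in neighbors)
    return votes.most_common(1)[0][0]
```

For example, `knn_classify([((1.0, 1.0), "o"), ((1.2, 0.8), "o"), ((4.0, 4.2), "x")], (1.1, 1.0), k=3, dist=euclidean)` returns `"o"`, the majority class among the three neighbors.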

## Voronoi Diagram

(Figure: a query point q_f and its nearest neighbor q_i, shown on the Voronoi diagram induced by the training instances.)

## 3-Nearest Neighbors

(Figure: the query point q_f and its 3 nearest neighbors: 2 labeled x, 1 labeled o.)

## 7-Nearest Neighbors

(Figure: the query point q_f and its 7 nearest neighbors: 3 labeled x, 4 labeled o.)

## k-NN Algorithm for Real-valued Target Functions

• For each training instance t = (x, f(x)):
  – Add t to the set Tr_instances.
• Given a query instance q:
  – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q.
  – Return the mean target value of the neighbors:
    $\hat{f}(q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)$
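A minimal sketch of the real-valued variant (same hypothetical names and `dist` convention as the classification sketch above):

```python
def knn_regress(tr_instances, q, k, dist):
    """Predict the mean of f over the k training instances nearest to q."""
    neighbors = sorted(tr_instances, key=lambda t: dist(t[0], q))[:k]
    return sum(f_x for _, f_x in neighbors) / k
```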

## Nearest Neighbor (continuous)

(Figures: nearest-neighbor fits to a continuous target function with k = 3, k = 5, and k = 1.)

## When to Consider Nearest Neighbors

• Instances map to points in ℜ^N
• Less than 20 attributes per instance
• Lots of training data

Advantages:
• Training is very fast
• Learns complex target functions
• Does not lose information

Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes

## How to Choose k

• Large k:
  – less sensitive to noise (particularly class noise)
  – better probability estimates for discrete classes
  – larger training sets allow larger values of k
• Small k:
  – captures the fine structure of the problem space better
  – may be necessary with small training sets
• A balance must be struck between large and small k
• As the training set approaches infinity and k grows large, k-NN becomes Bayes optimal

## Distance Weighted k-NN

• Give more weight to neighbors closer to the query point
• Discrete: $\hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))$
• Real-valued: $\hat{f}(q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}$, with $w_i = \frac{1}{d(x_q, x_i)^2}$
• Instead of only the k nearest neighbors, all training examples can be used (Shepard's method)
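A sketch of the distance-weighted variants under the same conventions as before; the small `eps` guard against a query coinciding exactly with a training point is our addition, not something the slides discuss:

```python
def weighted_knn_classify(tr_instances, q, k, dist, eps=1e-12):
    """Distance-weighted vote: each of the k nearest neighbors votes with weight 1/d^2."""
    neighbors = sorted(tr_instances, key=lambda t: dist(t[0], q))[:k]
    votes = {}
    for x, f_x in neighbors:
        votes[f_x] = votes.get(f_x, 0.0) + 1.0 / (dist(x, q) ** 2 + eps)
    return max(votes, key=votes.get)

def weighted_knn_regress(tr_instances, q, k, dist, eps=1e-12):
    """Distance-weighted average of the k nearest neighbors' target values."""
    neighbors = sorted(tr_instances, key=lambda t: dist(t[0], q))[:k]
    weights = [1.0 / (dist(x, q) ** 2 + eps) for x, _ in neighbors]
    return sum(w * f_x for w, (_, f_x) in zip(weights, neighbors)) / sum(weights)
```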

## Curse of Dimensionality

• Imagine instances described by 20 attributes, but only 2 are relevant to the target function
• Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional
• One approach:
  – Stretch the j-th axis by weight z_j, where z_1,…,z_n are chosen to minimize prediction error
  – Use cross-validation to automatically choose the weights z_1,…,z_n (a minimal sketch follows below)
  – Note that setting z_j to zero eliminates this dimension altogether (feature subset selection)

## Euclidean Distance?

(Figures: two scatter plots of + and o classes over attribute_1 and attribute_2, illustrating non-spherical class shapes.)

• What if classes are not spherical?
• What if some attributes are more or less important than other attributes?
• What if some attributes have more or less noise in them than other attributes?
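One way to act on the cross-validation idea above: score a candidate weight vector z by leave-one-out error and keep the vector with the lowest score. A minimal sketch (the weights here multiply the squared coordinate differences, and all names are ours):

```python
from collections import Counter
from math import sqrt

def loo_error(tr_instances, z, k):
    """Leave-one-out error of k-NN classification under per-attribute weights z_1..z_n."""
    def stretched(a, b):
        return sqrt(sum(zj * (u - v) ** 2 for zj, u, v in zip(z, a, b)))
    mistakes = 0
    for i, (xq, f_q) in enumerate(tr_instances):
        rest = tr_instances[:i] + tr_instances[i + 1:]          # leave one instance out
        neighbors = sorted(rest, key=lambda t: stretched(t[0], xq))[:k]
        if Counter(f_x for _, f_x in neighbors).most_common(1)[0][0] != f_q:
            mistakes += 1
    return mistakes / len(tr_instances)
```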

## Weighted Euclidean Distance

• $D(c_1, c_2) = \sqrt{\sum_{i=1}^{N} w_i \, (\mathrm{attr}_i(c_1) - \mathrm{attr}_i(c_2))^2}$
• large weight => attribute is more important
• small weight => attribute is less important
• zero weight => attribute doesn't matter
• Weights allow kNN to be effective with axis-parallel elliptical classes

## Learning Attribute Weights

• Scale attribute ranges or attribute variances to make them uniform (fast and easy)
• Prior knowledge
• Numerical optimization:
  – gradient descent, simplex methods, genetic algorithms
  – criterion is cross-validation performance
• Information Gain or Gain Ratio of single attributes
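A sketch of the weighted distance, plus the "fast and easy" scheme mentioned above (scaling attribute variances to be uniform); names are ours:

```python
from math import sqrt

def weighted_euclidean(c1, c2, w):
    """Weighted Euclidean distance D(c1, c2) with one weight w_i per attribute."""
    return sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, c1, c2)))

def variance_weights(points):
    """Weight each attribute by the inverse of its variance, so all attributes
    contribute on a comparable scale."""
    n = len(points)
    means = [sum(p[i] for p in points) / n for i in range(len(points[0]))]
    var = [sum((p[i] - m) ** 2 for p in points) / n for i, m in enumerate(means)]
    return [1.0 / v if v > 0 else 0.0 for v in var]
```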

## Locally Weighted Regression

• k-NN approximates the target function at the single query instance
• Locally weighted regression extends k-NN by constructing an explicit approximation of the target function over a local region surrounding the query instance:
  – Local: considers neighboring instances
  – Weighted: uses distance-weighting
  – Regression: predicts real-valued target functions

## Linear Regression Example

(Figure: a linear regression example.)

## Locally-weighted Regression

(Figure: training data, the predicted values from simple (global) regression, and the predicted values from locally weighted, piece-wise regression.)

## Locally-weighted Linear Regression

• Uses a linear function to approximate the target function near the query instance:
  $\hat{f}(x) = w_0 + w_1 a_1(x) + \ldots + w_n a_n(x)$
• What does this functional form remind us of?
  – A perceptron
• Choose weights that minimize:
  $E(\vec{w}) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2$
• Gradient descent rule:
  $\Delta w_i = \eta \sum_{x \in D} (f(x) - \hat{f}(x)) \, a_i(x)$
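A minimal sketch of one batch gradient-descent (delta rule) step for this global linear fit; `a(x)` returns the attribute values a_1(x)..a_n(x), and all names are ours:

```python
def delta_rule_step(data, a, w, eta):
    """One batch gradient-descent step for global linear regression.
    data: list of (x, f_x); a(x): attribute values a_1(x)..a_n(x);
    w: current weights [w0, w1, ..., wn]; eta: learning rate."""
    def f_hat(x):
        return w[0] + sum(wi * ai for wi, ai in zip(w[1:], a(x)))
    grad = [0.0] * len(w)
    for x, f_x in data:
        err = f_x - f_hat(x)
        grad[0] += err                          # a_0(x) = 1 for the constant term w_0
        for i, ai in enumerate(a(x), start=1):
            grad[i] += err * ai
    return [wi + eta * g for wi, g in zip(w, grad)]
```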

## Local vs. Global

• The delta rule is a global approximation procedure
• We seek a local approximation procedure (i.e., one restricted to the neighborhood of the query instance)
• Can the global procedure be modified? How?

## Local Error Criteria

• The simplest way is to localize the error criterion
• There are 3 alternatives:

  $E_1(x_q) = \frac{1}{2} \sum_{x \in \mathrm{kNN}(x_q)} (f(x) - \hat{f}(x))^2$

  $E_2(x_q) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))$

  $E_3(x_q) = \frac{1}{2} \sum_{x \in \mathrm{kNN}(x_q)} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))$
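A direct transcription of the three criteria, mainly to make the differences explicit (only E1 and E3 restrict the sum to the k nearest neighbors; only E2 and E3 apply the kernel). All names are ours; `f_hat`, `kernel`, and `dist` are passed in:

```python
def local_errors(tr_instances, xq, f_hat, k, kernel, dist):
    """Return (E1, E2, E3) for query point xq given the current approximation f_hat."""
    knn = sorted(tr_instances, key=lambda t: dist(t[0], xq))[:k]
    e1 = 0.5 * sum((f_x - f_hat(x)) ** 2 for x, f_x in knn)
    e2 = 0.5 * sum((f_x - f_hat(x)) ** 2 * kernel(dist(xq, x)) for x, f_x in tr_instances)
    e3 = 0.5 * sum((f_x - f_hat(x)) ** 2 * kernel(dist(xq, x)) for x, f_x in knn)
    return e1, e2, e3
```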

## Weight Update Rule

• E1 ignores distances
• E2 is the most attractive criterion but also the most expensive
• E3 offers a reasonable trade-off (i.e., a good approximation of E2)
• LWLR-delta rule (sketched below):
  $\Delta w_i = \eta \sum_{x \in \mathrm{kNN}(x_q)} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, a_i(x)$

## Kernel Functions

(Figure: examples of kernel functions K(d) used for distance weighting.)
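A minimal sketch of the LWLR-delta rule: the delta-rule step from before, restricted to the k nearest neighbors of the query and weighted by a kernel K(d). The `kernel` and `dist` arguments are placeholders for whichever kernel and distance functions are chosen; all names are ours:

```python
def lwlr_delta_step(tr_instances, xq, a, w, eta, k, kernel, dist):
    """One LWLR-delta-rule step for the local linear fit around query xq.
    a(x): attribute values a_1(x)..a_n(x); w: weights [w0, ..., wn]."""
    def f_hat(x):
        return w[0] + sum(wi * ai for wi, ai in zip(w[1:], a(x)))
    neighbors = sorted(tr_instances, key=lambda t: dist(t[0], xq))[:k]
    grad = [0.0] * len(w)
    for x, f_x in neighbors:
        err = kernel(dist(xq, x)) * (f_x - f_hat(x))   # kernel-weighted error term
        grad[0] += err                                 # a_0(x) = 1 for the constant term
        for i, ai in enumerate(a(x), start=1):
            grad[i] += err * ai
    return [wi + eta * g for wi, g in zip(w, grad)]
```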

## Radial Basis Function Networks

(Figure: a radial basis function network with input layer x_i, kernel functions $K_n(d(x_n, x)) = \exp(-\tfrac{1}{2} \, d(x_n, x)^2 / \sigma^2)$ as hidden units, linear parameters w_n, and output $f(x) = w_0 + \sum_{n=1}^{k} w_n K_n(d(x_n, x))$.)

• Global approximation to the target function in terms of a linear combination of local approximations
• Can approximate any function with arbitrarily small error
• Used, e.g., for image classification
• Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid
• Closely related to distance-weighted regression
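A minimal sketch of the RBF network's output computation with Gaussian kernels, following the formula in the figure (names and the shared-sigma simplification are ours):

```python
from math import exp

def rbf_predict(x, centers, w0, weights, sigma, dist):
    """f(x) = w0 + sum_n w_n * K_n(d(x_n, x)), with K_n(d) = exp(-0.5 * d^2 / sigma^2).
    centers: kernel centers x_n; weights: linear parameters w_n; dist: distance function."""
    return w0 + sum(wn * exp(-0.5 * dist(xn, x) ** 2 / sigma ** 2)
                    for xn, wn in zip(centers, weights))
```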

## Training Radial Basis Function Networks

• How to choose the center x_n for each kernel function K_n?
  – scatter the centers uniformly across the instance space
  – use the distribution of the training instances (clustering)
• How to train the weights?
  – Choose the mean x_n and variance σ_n for each K_n: non-linear optimization or EM
  – Hold the K_n fixed and use local linear regression to compute the optimal weights w_n (sketched below)

## Case-Based Reasoning

• Instance-based learning can be applied even when X ≠ ℜ^n
  – a different distance metric is needed
• Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions
• Applications:
  – Design: landscape, building, mechanical, conceptual design of aircraft sub-systems
  – Planning: repair schedules
  – Diagnosis: medical
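Returning to the RBF training recipe above: one common way to realize the "hold the K_n fixed" step is a single least-squares fit of the output weights once the centers have been chosen (e.g., by clustering the training inputs). A sketch, assuming Gaussian kernels with a shared sigma and using NumPy for the linear solve (all names are ours):

```python
import numpy as np

def train_rbf_weights(X, y, centers, sigma):
    """Fit w_0 and w_1..w_k by least squares, holding the kernels fixed.
    X: (m, d) training inputs; y: (m,) targets; centers: (k, d) kernel centers."""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (m, k) distances
    Phi = np.exp(-0.5 * d ** 2 / sigma ** 2)                          # kernel activations
    Phi = np.hstack([np.ones((len(X), 1)), Phi])                      # bias column for w_0
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w[0], w[1:]                                                # w_0, [w_1..w_k]
```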

## CADET

• CADET stored examples of mechanical devices
• Training examples: (qualitative function, mechanical structure) pairs
• New query: the desired function
• Target value: a mechanical structure for this function
• Distance metric: match between qualitative function descriptions

## Case-Based Reasoning

• Cases are represented by rich structural descriptions
• Multiple cases are retrieved (and combined) to form the solution to a new problem
• Tight coupling between case retrieval and problem solving
• Bottom line:
  – Simple matching of cases is useful for some applications (e.g., answering help-desk queries)
  – Area of ongoing research

## Lazy and Eager Learning

• Lazy: wait for the query before generalizing
  – k-nearest neighbors, weighted linear regression
• Eager: generalize before seeing the query
  – Radial basis function networks, decision trees, back-propagation, ID3, Naive Bayes
• Does it matter?
• An eager learner must commit to a single global approximation
• A lazy learner can create a new local approximation for each query
• If they use the same hypothesis space (e.g., H = linear functions), the lazy learner can therefore represent more complex behaviour by combining many local approximations

## Literature & Software

• T. Mitchell, "Machine Learning", chapter 8, "Instance-Based Learning"
• C. Atkeson, A. Moore, S. Schaal, "Locally Weighted Learning", ftp://ftp.cc.gatech.edu/pub/people/cga/air.html
• R. Duda et al., "Pattern Classification", chapter 4, "Non-Parametric Techniques"
• Netlab toolbox
  – k-nearest neighbor classification