Machine Learning
Instance-Based Learning


Outline
• k-Nearest Neighbor
• Locally weighted learning
• Radial basis functions
• Case-based reasoning
• Lazy and eager learning




Key Idea
• Instance-based learning divides into two simple steps:
  1. Store all examples in the training set.
  2. When a new example arrives, retrieve the stored examples most similar to it and look at their classes.
• Instance-based learning is often termed lazy learning, as there is typically no "transformation" of training instances into more general "statements"
• Instance-based learners never form an explicit general hypothesis regarding the target function
  – They simply compute the classification of each new query instance as needed


Similarity Metric
• We need a measure of distance in order to know which instances are the nearest neighbours
• Assume that we have T attributes for the learning problem. Then one example point x has elements x_t ∈ ℜ, t = 1, …, T.
• The distance between two points x_i and x_j is often defined as the Euclidean distance:

    d(x_i, x_j) = \sqrt{\sum_{t=1}^{T} (x_{ti} - x_{tj})^2}
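Not part of the original slides: a minimal NumPy sketch of the Euclidean distance defined above (the function name euclidean_distance is an assumption):

    import numpy as np

    def euclidean_distance(x_i, x_j):
        """Euclidean distance between two points described by T real-valued attributes."""
        x_i, x_j = np.asarray(x_i, dtype=float), np.asarray(x_j, dtype=float)
        return float(np.sqrt(np.sum((x_i - x_j) ** 2)))

    # Example: euclidean_distance([1.0, 2.0], [4.0, 6.0]) -> 5.0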




Example
[Figure: example illustration]


k-NN Algorithm for Discrete Target Functions
• For each training instance t = (x, f(x))
  – Add t to the set Tr_instances
• Given a query instance q to be classified
  – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q
  – Return

      \hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))

• where V is the finite set of target class values, and δ(a, b) = 1 if a = b, and 0 otherwise
• Intuitively, the k-NN algorithm assigns to each new query instance the majority class among its k nearest neighbors
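Not from the slides: a short sketch of the discrete k-NN rule above, returning the majority class among the k nearest training instances (function and variable names are assumptions):

    import numpy as np
    from collections import Counter

    def knn_classify(query, instances, labels, k=3):
        """Majority class among the k training instances nearest to the query."""
        instances = np.asarray(instances, dtype=float)
        dists = np.sqrt(((instances - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]          # indices of the k nearest neighbours
        return Counter(labels[i] for i in nearest).most_common(1)[0][0]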




Voronoi Diagram
[Figure: query point q_f and its nearest neighbor q_i]


3-Nearest Neighbors
[Figure: query point q_f and its 3 nearest neighbors: 2 x and 1 o]




7-Nearest Neighbors
[Figure: query point q_f and its 7 nearest neighbors: 3 x and 4 o]


k-NN Algorithm for Real-valued Target Functions
• For each training instance t = (x, f(x))
  – Add t to the set Tr_instances
• Given a query instance q to be classified
  – Let x_1, …, x_k be the k training instances in Tr_instances nearest to q
  – Return

      \hat{f}(x_q) = \frac{1}{k} \sum_{i=1}^{k} f(x_i)
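Again as an illustrative sketch (not from the slides), the real-valued variant simply averages the target values of the k nearest neighbours:

    import numpy as np

    def knn_regress(query, instances, targets, k=3):
        """Mean target value of the k training instances nearest to the query."""
        instances = np.asarray(instances, dtype=float)
        targets = np.asarray(targets, dtype=float)
        dists = np.sqrt(((instances - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
        return float(targets[np.argsort(dists)[:k]].mean())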




Nearest Neighbor (continuous)
[Figure: 3-nearest-neighbor approximation of a continuous target function]


Nearest Neighbor (continuous)
[Figure: 5-nearest-neighbor approximation of a continuous target function]




Nearest Neighbor (continuous)
[Figure: 1-nearest-neighbor approximation of a continuous target function]


When to Consider Nearest Neighbors
• Instances map to points in ℜ^N
• Fewer than 20 attributes per instance
• Lots of training data
Advantages:
• Training is very fast
• Can learn complex target functions
• Does not lose information
Disadvantages:
• Slow at query time
• Easily fooled by irrelevant attributes




How to Choose k
• Large k:
  – less sensitive to noise (particularly class noise)
  – better probability estimates for discrete classes
  – larger training sets allow larger values of k
• Small k:
  – captures fine structure of the problem space better
  – may be necessary with small training sets
• A balance must be struck between large and small k
• As the training set size approaches infinity and k grows large, k-NN becomes Bayes optimal


Distance Weighted k-NN
• Give more weight to neighbors closer to the query point
• Discrete:

    \hat{f}(q) = \arg\max_{v \in V} \sum_{i=1}^{k} w_i \, \delta(v, f(x_i))

• Real-valued:

    \hat{f}(q) = \frac{\sum_{i=1}^{k} w_i f(x_i)}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x_q, x_i)^2}

• Instead of only the k nearest neighbors, use all training examples (Shepard's method)
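A possible sketch of the distance-weighted, real-valued rule above (not from the slides; the small eps term is an added assumption to avoid division by zero when the query coincides with a training point):

    import numpy as np

    def weighted_knn_regress(query, instances, targets, k=3, eps=1e-12):
        """Distance-weighted k-NN prediction with w_i = 1 / d(x_q, x_i)^2."""
        instances = np.asarray(instances, dtype=float)
        targets = np.asarray(targets, dtype=float)
        dists = np.sqrt(((instances - np.asarray(query, dtype=float)) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        weights = 1.0 / (dists[nearest] ** 2 + eps)   # closer neighbours get larger weights
        return float(np.sum(weights * targets[nearest]) / np.sum(weights))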




Curse of Dimensionality
• Imagine instances described by 20 attributes, but only 2 are relevant to the target function
• Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional
• One approach:
  – Stretch the j-th axis by weight z_j, where z_1, …, z_n are chosen to minimize prediction error
  – Use cross-validation to automatically choose the weights z_1, …, z_n
  – Note that setting z_j to zero eliminates that dimension altogether (feature subset selection)


Euclidean Distance?
[Figure: two scatter plots of attribute_1 vs. attribute_2 with elongated, overlapping clusters of o and + examples]
• What if classes are not spherical?
• What if some attributes are more/less important than other attributes?
• What if some attributes have more/less noise in them than other attributes?




Weighted Euclidean Distance

    D(c_1, c_2) = \sqrt{\sum_{i=1}^{N} w_i \, (attr_i(c_1) - attr_i(c_2))^2}

• large weights  => attribute is more important
• small weights  => attribute is less important
• zero weights   => attribute doesn't matter
• Weights allow kNN to be effective with axis-parallel elliptical classes


Learning Attribute Weights
• Scale attribute ranges or attribute variances to make them uniform (fast and easy)
• Prior knowledge
• Numerical optimization:
  – gradient descent, simplex methods, genetic algorithms
  – criterion is cross-validation performance
• Information Gain or Gain Ratio of single attributes
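For illustration only (not in the slides), the weighted distance could look like this; setting a weight to zero drops that attribute, as noted above:

    import numpy as np

    def weighted_euclidean(c1, c2, w):
        """Weighted Euclidean distance; w_i = 0 eliminates attribute i."""
        c1, c2, w = (np.asarray(a, dtype=float) for a in (c1, c2, w))
        return float(np.sqrt(np.sum(w * (c1 - c2) ** 2)))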




Locally Weighted Regression
• k-NN approximates the target function at the single query instance
• Locally weighted regression extends k-NN by constructing an explicit approximation of the target function over a local region surrounding the query instance:
  – Local: considers neighboring instances
  – Weighted: uses distance-weighting
  – Regression: predicts real-valued target functions


Linear Regression Example
[Figure: linear regression fit to a set of training points]




Locally-Weighted Regression
[Figure: training data with predicted values from simple (global) regression and from locally weighted (piece-wise) regression]


Locally-Weighted LR
• Uses a linear function to approximate the target function near the query instance:

    \hat{f}(x) = w_0 + w_1 a_1(x) + \dots + w_n a_n(x)

• What does this functional form remind us of?
  – Perceptron
  – Choose weights that minimize:

        E(\vec{w}) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2

  – Gradient descent rule:

        \Delta w_i = \eta \sum_{x \in D} (f(x) - \hat{f}(x)) \, a_i(x)
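A rough sketch of the global delta rule above, fitting f̂(x) = w_0 + Σ w_i a_i(x) by batch gradient descent over all of D (not from the slides; the learning rate and epoch count are arbitrary assumptions):

    import numpy as np

    def delta_rule_fit(X, y, eta=0.01, epochs=500):
        """Fit a global linear approximation with the gradient descent (delta) rule."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # a_0(x) = 1 carries the bias w_0
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            errors = y - Xb @ w                          # f(x) - f̂(x) for every x in D
            w += eta * Xb.T @ errors                     # Δw_i = η Σ_x (f(x) - f̂(x)) a_i(x)
        return w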




Local vs. Global
• The delta rule is a global approximation procedure
• We seek a local approximation procedure (i.e., in the neighborhood of the query instance)
• Can the global procedure be modified? How?


Local Error Criteria
• The simplest way is to localize the error criterion
• There are 3 alternatives:

    E_1(x_q) = \frac{1}{2} \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2

    E_2(x_q) = \frac{1}{2} \sum_{x \in D} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))

    E_3(x_q) = \frac{1}{2} \sum_{x \in kNN(x_q)} (f(x) - \hat{f}(x))^2 \, K(d(x_q, x))




Weight Update Rule
• E1 ignores distances
• E2 is most attractive but most expensive
• E3 offers a reasonable trade-off (i.e., a good approximation of E2)
• LWLR delta rule:

    \Delta w_i = \eta \sum_{x \in kNN(x_q)} K(d(x_q, x)) \, (f(x) - \hat{f}(x)) \, a_i(x)


Kernel Functions
[Figure: examples of kernel functions K(d) as a function of distance d]
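A sketch of how the LWLR delta rule above might be used to answer a single query (not from the slides; the Gaussian kernel width, learning rate and epoch count are assumptions):

    import numpy as np

    def lwlr_predict(query, X, y, k=10, sigma=1.0, eta=0.01, epochs=200):
        """Fit local linear weights with the kernel-weighted delta rule (criterion E3),
        then return the prediction f̂ at the query point."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        q = np.asarray(query, dtype=float)
        d = np.sqrt(((X - q) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]                                # k nearest neighbours of the query
        K = np.exp(-0.5 * (d[nn] / sigma) ** 2)               # Gaussian kernel K(d(x_q, x))
        Xb = np.hstack([np.ones((nn.size, 1)), X[nn]])        # a_0(x) = 1 for the bias term
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            errors = y[nn] - Xb @ w
            w += eta * Xb.T @ (K * errors)                    # Δw_i = η Σ K(d)(f(x) - f̂(x)) a_i(x)
        return float(np.concatenate(([1.0], q)) @ w)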




Radial Basis Functions
• Global approximation to the target function in terms of a linear combination of local approximations
• Can approximate any function with arbitrarily small error
• Used, e.g., for image classification
• Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid
• Closely related to distance-weighted regression, but "eager" instead of "lazy"


Radial Basis Function Network
[Figure: network with input layer x_i, hidden kernel units K_n(d(x_n, x)) = exp(-d(x_n, x)^2 / (2σ^2)), and linear output weights w_n producing the output f(x)]

    f(x) = w_0 + \sum_{n=1}^{k} w_n K_n(d(x_n, x))
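Not from the slides: a minimal sketch of the forward pass of the network above, assuming a shared kernel width σ.

    import numpy as np

    def rbf_predict(x, centers, weights, w0=0.0, sigma=1.0):
        """f(x) = w0 + Σ_n w_n exp(-d(x_n, x)^2 / (2 σ^2))."""
        x = np.asarray(x, dtype=float)
        centers = np.asarray(centers, dtype=float)
        d2 = ((centers - x) ** 2).sum(axis=1)          # squared distances to kernel centers x_n
        return float(w0 + np.dot(np.asarray(weights, dtype=float),
                                 np.exp(-0.5 * d2 / sigma ** 2)))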




Training Radial Basis Function Networks
• How to choose the center x_n for each kernel function K_n?
  – scatter uniformly across instance space
  – use the distribution of training instances (clustering)
• How to train the weights?
  – Choose the mean x_n and variance σ_n for each K_n: non-linear optimization or EM
  – Hold K_n fixed and use local linear regression to compute the optimal weights w_n


Case-Based Reasoning
• Can apply instance-based learning even when X ≠ ℜ^n
• A different distance metric is needed
• Case-Based Reasoning is instance-based learning applied to instances with symbolic logic descriptions
• Applications:
  – Design: landscape, building, mechanical, conceptual design of aircraft sub-systems
  – Planning: repair schedules
  – Diagnosis: medical
  – Adversarial reasoning: legal
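As a simplified illustration of the second training step above (not the slides' method, which uses local linear regression): with the centers and widths held fixed, the output weights can be obtained by an ordinary least-squares fit on the kernel activations.

    import numpy as np

    def rbf_fit_weights(X, y, centers, sigma=1.0):
        """Least-squares fit of w_0 and w_1..w_k with the kernels K_n held fixed."""
        X = np.asarray(X, dtype=float)
        centers = np.asarray(centers, dtype=float)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # |x - x_n|^2 for all pairs
        Phi = np.hstack([np.ones((X.shape[0], 1)), np.exp(-0.5 * d2 / sigma ** 2)])
        w, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float), rcond=None)
        return w[0], w[1:]                     # bias w_0 and kernel weights w_n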




Case-Based Reasoning in CADET
• CADET stored examples of mechanical devices
• Training examples:
  – (qualitative function, mechanical structure)
• New query: desired function
• Target value: mechanical structure for this function
• Distance metric: match qualitative function descriptions




Case-Based Reasoning
• Cases represented by rich structural descriptions
• Multiple cases retrieved (and combined) to form a solution to the new problem
• Tight coupling between case retrieval and problem solving
• Bottom line:
  – Simple matching of cases is useful for some applications (e.g., answering help-desk queries)
  – Area of ongoing research


Lazy and Eager Learning
• Lazy: wait for the query before generalizing
  – k-nearest neighbors, weighted linear regression
• Eager: generalize before seeing the query
  – Radial basis function networks, decision trees, back-propagation, ID3, Naive Bayes
• Does it matter?
• An eager learner must create a single global approximation
• A lazy learner can create many local approximations
• If they use the same hypothesis space, lazy can effectively represent more complex functions (e.g., H = linear functions)




Literature & Software
• T. Mitchell, "Machine Learning", chapter 8, "Instance-Based Learning"
• C. Atkeson, A. Moore, S. Schaal, "Locally Weighted Learning",
  ftp:/ftp.cc.gatech.edu/pub/people/cga/air.html
• R. Duda et al., "Pattern Classification", chapter 4, "Non-Parametric Techniques"
• Netlab toolbox
  – k-nearest neighbor classification
  – Radial basis function networks



