Last lecture summary

Basic terminology
• tasks
  – classification
  – regression
• learner, algorithm
  – each has one or several parameters influencing its
    behavior
• model
  – one concrete combination of learner and parameters
  – tune the parameters using the training set
  – the generalization is assessed using test set
    (previously unseen data)
• learning (training)
  – supervised
     • a target vector t is known, parameters are tuned to
       achieve the best match between prediction and the
       target vector
  – unsupervised
     • training data consists of a set of input vectors x without
       any corresponding target value
     • clustering, visualization
• for most applications, the original input
  variables must be preprocessed
      – feature selection
      – feature extraction

[Figure: feature selection keeps a subset of the original variables
(e.g. x1, x5, x103, x456 out of x1 … x784); feature extraction first
computes new variables x*1 … x*784 from the originals and then keeps
a subset of those (e.g. x*18, x*152, x*309, x*666).]
• feature selection/extraction = dimensionality reduction
   – generally a good thing
   – helps to fight the curse of dimensionality
• example:
   – learner: regression (polynomial, y = w0 + w1·x + w2·x² + w3·x³ + …)
   – parameters: weights (coefficients) w, order of polynomial
• weights
   – adjusted so that the sum of the squares of the errors E(w)
     (the error function) is as small as possible

E(w) = \frac{1}{2} \sum_{n=1}^{N} ( y(x_n, w) - t_n )^2

(y(x_n, w) is the predicted value, t_n the known target)
• order of polynomial
  – problem of model selection
  – for model comparison use MSE or RMS
    (independent of N)
MSE = \frac{1}{N} \sum_{n=1}^{N} ( y(x_n, w) - t_n )^2

RMS = \sqrt{MSE}

  – training error always goes down with the
    increasing polynomial order
  – however, test error gets worse for high orders of
    polynomial (overfitting)
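A minimal numpy sketch of these three quantities for a least-squares
polynomial fit; the data, noise level and polynomial order below are
made-up illustrations, not the lecture's actual example:

```python
import numpy as np

# Toy data: noisy samples of an assumed underlying function.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Fit a polynomial of order M by least squares.
M = 3
w = np.polyfit(x, t, deg=M)      # weights w minimizing the squared error
y = np.polyval(w, x)             # predictions y(x_n, w)

E = 0.5 * np.sum((y - t) ** 2)   # error function E(w)
mse = np.mean((y - t) ** 2)      # MSE, independent of N
rms = np.sqrt(mse)               # RMS
print(f"E(w) = {E:.3f}, MSE = {mse:.3f}, RMS = {rms:.3f}")
```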
[Figure: training-set and test-set fits for a polynomial of order
M = 9 with N = 15 points (severe overfitting) and with N = 100 points
(well-behaved fit).]

For a given model complexity, the overfitting problem becomes less
severe as the size of the data set increases. In other words, the
larger the data set is, the more complex (flexible) a model can be
fitted to it.
          Bias-variance tradeoff
• large bias – the model is not flexible enough to represent the
  data accurately (large training error)
• large variance – overfitting occurs (the predictions of the model
  depend strongly on the particular sample that was used to build
  the model)
• tradeoff
  – low flexibility models have large bias and low variance
  – high flexibility models have low bias and large
    variance
• A polynomial with too few parameters (too
  low degree) will make large errors because of
  a large bias.
• A polynomial with too many parameters (too
  high degree) will make large errors because of
  a large variance.

• MSE is a good error measure because
           MSE = variance + bias²
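This decomposition can be checked by simulation: refit the model on
many independent training sets and measure how far the average
prediction is from the truth (bias²) and how much predictions scatter
(variance). A sketch under assumed conditions (a sine "true" function,
Gaussian noise, 200 resampled training sets):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                                   # assumed "true" function
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0, 1, 50)

for M in (1, 9):                            # low vs. high flexibility
    preds = []
    for _ in range(200):                    # 200 independent training sets
        x_tr = rng.uniform(0, 1, 15)
        t_tr = f(x_tr) + rng.normal(scale=0.3, size=15)
        preds.append(np.polyval(np.polyfit(x_tr, t_tr, M), x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"M={M}: bias^2 = {bias2:.3f}, variance = {var:.3f}")
```

The low-order model shows large bias and small variance; the
high-order model the reverse, matching the tradeoff above.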
Test-data and Cross Validation
columns = attributes (input/independent variables, features)
rows    = objects / instances / samples
the Cheat column is the class

 Tid | Refund | Marital Status | Taxable Income | Cheat (class)
 ----+--------+----------------+----------------+--------------
  1  | Yes    | Single         | 125K           | No
  2  | No     | Married        | 100K           | No
  3  | No     | Single         | 70K            | No
  4  | Yes    | Married        | 120K           | No
  5  | No     | Divorced       | 95K            | Yes
  6  | No     | Married        | 60K            | No
  7  | Yes    | Divorced       | 220K           | No
  8  | No     | Single         | 85K            | Yes
  9  | No     | Married        | 75K            | No
 10  | No     | Single         | 90K            | Yes
                  Attribute types
• discrete
  – Has only a finite or countably infinite set of values.
  – nominal (also categorical)
     • the values are just different labels (e.g. ID number, eye color)
     • central tendency given by mode (median, mean not defined)
  – ordinal
     • their values reflect the order (e.g. ranking, height in {tall,
       medium, short})
     • central tendency given by median, mode (mean not defined)
  – binary attributes - special case of discrete attributes
• continuous (also quantitative)
  – Has real numbers as attribute values.
  – central tendency given by the mean (spread by the standard deviation, …)
A regression problem

y = f(x) + noise
Can we learn from this data?

[Figure: scatter plot of noisy (x, y) data]

Consider three methods.

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Linear regression

What will the regression model look like?

y = ax + b

Univariate linear regression with a constant term.

[Figure: the same data with a fitted straight line]

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Quadratic regression

What will the regression model look like?

y = ax² + bx + c

[Figure: the same data with a fitted parabola]

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Join-the-dots

Also known as piecewise linear nonparametric regression, if that
makes you feel better.

[Figure: the same data with neighboring points joined by line segments]

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Which is best?

[Figure: the three fitted models side by side]

Why not choose the method with the best fit to the data?

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
What do we really want?

Why not choose the method with the best fit to the data?
Because what matters is: how well are you going to predict future data?

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
The test set method

1. Randomly choose 30% of the data to be in the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.

linear regression: test MSE = 2.4

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
The test set method (same procedure)

quadratic regression: test MSE = 0.9

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
The test set method (same procedure)

join-the-dots: test MSE = 2.2

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Test set method

• good news
  – very simple
  – then just choose the method with the best test-set score

• bad news
  – wastes data (we get an estimate of the best method while using
    30% less data)
  – if you don't have enough data, the test set may be just
    lucky/unlucky

→ the test set estimator of performance has high variance

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
[Figure: training error decreases monotonically with model
complexity, while testing error first falls and then rises again
(overfitting).]
• stratified division
   – the same class proportions in the training and test sets
• Training error cannot be used as an indicator of the model's
  performance due to overfitting.
• Training data set – used to train a range of models, or a given
  model with a range of values for its parameters.
• Compare them on independent data – the validation set.
   – If the model design is iterated many times, some overfitting to
     the validation data can occur, so it may be necessary to keep
     aside a third set.
• Test set – the set on which the performance of the selected model
  is finally evaluated.
LOOCV (Leave-one-out Cross Validation)

1. Choose one data point.
2. Remove it from the set.
3. Fit a model to the remaining data points.
4. Note your error on the left-out point.

Repeat these steps for all points. When you are done, report the
mean square error.

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
[Figures: LOOCV fits for the three methods, in the order used above]

linear regression:  MSE_LOOCV = 2.12
quadratic:          MSE_LOOCV = 0.962
join-the-dots:      MSE_LOOCV = 3.33

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Which kind of Cross Validation?

           Good                  Bad
Test set   Cheap.                High variance; wastes data.
LOOCV      Doesn't waste data.   Expensive.

Can we get the best of both worlds?

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
k-fold Cross Validation

Randomly break the data set into k partitions (in our case k = 3).

Red partition: train on all points not in the red partition; find
the test-set sum of errors on the red points.

Blue partition: train on all points not in the blue partition; find
the test-set sum of errors on the blue points.

Green partition: train on all points not in the green partition;
find the test-set sum of errors on the green points.

Then report the mean error.

linear regression: MSE_3fold = 2.05

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Results of 3-fold Cross Validation

               MSE_3fold
linear         2.05
quadratic      1.11
join-the-dots  2.93

taken from Cross Validation tutorial by Andrew Moore
http://www.autonlab.org/tutorials/overfit.html
Which kind of Cross Validation?

           Good                          Bad
Test set   Cheap.                        High variance; wastes data.
LOOCV      Doesn't waste data.           Expensive.
3-fold     Slightly better than          Wastes more data than LOOCV;
           test-set.                     more expensive than test-set.
10-fold    Only wastes 10%. Only 10      Still wastes 10%; still 10
           times more expensive than     times more expensive than
           the test-set method,          the test-set method.
           instead of R times.

R-fold (R = number of data points) is identical to LOOCV.

taken from Cross Validation tutorial by Andrew Moore,
http://www.autonlab.org/tutorials/overfit.html
Model selection via CV

• We are trying to decide which model to use. For polynomial
  regression this means deciding on the degree of the polynomial.
• Train each model and make a table (a sketch of this loop follows
  below).

   degree  MSE_train  MSE_10-fold  Choice
   1
   2
   3
   4
   5
   6

• Whichever model gave the best CV score: train it with all the
  data. That's the predictive model you'll use.
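A sketch of the procedure, assuming numpy polynomial fitting; the
helper kfold_mse and the synthetic cubic data are illustrative only:

```python
import numpy as np

# 10-fold CV score for a polynomial of a given degree
# (folds assigned by a random permutation).
def kfold_mse(x, y, deg, k=10, seed=0):
    folds = np.random.default_rng(seed).permutation(x.size) % k
    errs = []
    for f in range(k):
        tr, te = folds != f, folds == f
        w = np.polyfit(x[tr], y[tr], deg)
        errs.append(np.mean((np.polyval(w, x[te]) - y[te]) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 50)
y = 1 - 2 * x + 3 * x**3 + rng.normal(scale=0.2, size=x.size)

scores = {d: kfold_mse(x, y, d) for d in range(1, 7)}   # the table
best = min(scores, key=scores.get)
w_final = np.polyfit(x, y, best)   # final predictive model: ALL data
print(f"best degree = {best}")
```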
Selection and testing

• Complete procedure for algorithm selection and estimation of its
  quality:
  1. Divide the data into Train / Test.
  2. By cross validation on the Train part, choose the algorithm
     (splitting Train further into Train / Validation).
  3. Use this algorithm to construct a classifier using the whole
     Train part.
  4. Estimate its quality on the Test part.
[Figure: scatter plot of two classes (blue and orange) in the (x, y)
plane, with an unlabeled query point marked "?"]

• Which class (blue or orange) would you predict for this point?
• And why?
• classification boundary
[Figure: the same two classes with a different query point "?"]

• And now?
• The classification boundary is quadratic.
[Figure: the same two classes with yet another query point "?"]

• And now?
• And why?
Nearest Neighbors Classification

• We assign a new point to the class of the most similar training
  instances.
• But what does "similar" mean?

[Figure: four example instances A, B, C, D]

source: Kardi Teknomo's Tutorials, http://people.revoledu.com/kardi/tutorial/index.html
• Similarity s_ij is a quantity that reflects the strength of the
  relationship between two objects or two features.
   – It usually ranges from −1 to +1, or is normalized into [0, 1].
• Distance d_ij measures dissimilarity.
   – Dissimilarity measures the discrepancy between two objects
     based on several features.
   – Distance is a quantitative variable that satisfies the
     following conditions:
      • distance is always positive or zero (d_ij ≥ 0)
      • distance is zero if and only if it is measured from an
        object to itself (d_ii = 0)
      • distance is symmetric (d_ij = d_ji)
• In addition, if a distance satisfies the triangle inequality
  (d_ik ≤ d_ij + d_jk), then it is called a metric.
• Not all distances are metrics, but all metrics are distances.
Distances for binary variables

 Fruit  | Sphere shape | Sweet | Sour | Crunchy
 Apple  | Yes          | Yes   | Yes  | Yes
 Banana | No           | Yes   | No   | No

 encoded:
 Apple  | 1 | 1 | 1 | 1
 Banana | 0 | 1 | 0 | 0

 → p = 1, q = 3, r = 0, s = 0

• p – number of variables positive for both objects
• q – positive for the i-th object and negative for the j-th object
• r – negative for the i-th object and positive for the j-th object
• s – negative for both objects
• t = p + q + r + s (total number of variables)
• Simple matching coefficient/distance

  s_ij = (p + s) / t        d_ij = 1 − s_ij = (q + r) / t

• Jaccard coefficient/distance

  s_ij = p / (p + q + r)    d_ij = (q + r) / (p + q + r)

• Hamming distance

  d_ij = q + r
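These formulas are easy to verify in code. A small sketch (the helper
name binary_counts is made up) reproducing the apple/banana counts
above:

```python
import numpy as np

# Count p, q, r, s for two binary vectors.
def binary_counts(a, b):
    a, b = np.asarray(a), np.asarray(b)
    p = np.sum((a == 1) & (b == 1))   # positive for both
    q = np.sum((a == 1) & (b == 0))   # positive for i, negative for j
    r = np.sum((a == 0) & (b == 1))   # negative for i, positive for j
    s = np.sum((a == 0) & (b == 0))   # negative for both
    return p, q, r, s

apple, banana = [1, 1, 1, 1], [0, 1, 0, 0]
p, q, r, s = binary_counts(apple, banana)   # -> 1, 3, 0, 0
t = p + q + r + s
print("simple matching d =", (q + r) / t)           # 3/4
print("Jaccard d         =", (q + r) / (p + q + r)) # 3/4
print("Hamming d         =", q + r)                 # 3
```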
Distances for quantitative variables

• Minkowski distance (L_p norm)

  L_p = ( \sum_{i=1}^{n} |x_i − y_i|^p )^{1/p}
• distance matrix – matrix with all pairwise distances

         p1     p2     p3     p4
   p1    0      2.828  3.162  5.099
   p2    2.828  0      1.414  3.162
   p3    3.162  1.414  0      2
   p4    5.099  3.162  2      0
Manhattan distance

• How do we measure the distance between two bikers in Manhattan?

[Figure: Manhattan street grid; source: wikipedia]

  L_1 = d(x, y) = \sum_{i=1}^{n} |x_i − y_i|

[Figure: points x = (x1, x2) and y = (y1, y2); the L1 distance
follows the axis-parallel path between them]
Euclidean distance

  L_2 = d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i − y_i)^2 }

[Figure: the same two points; the L2 distance is the straight-line
segment between them]
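A sketch of the Minkowski distance covering both special cases; the
2-D points below are illustrative choices that happen to reproduce
the distance matrix shown earlier:

```python
import numpy as np

# Minkowski distance (L_p norm); p=1 gives Manhattan, p=2 Euclidean.
def minkowski(x, y, p):
    return np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p)

a, b = [0, 2], [2, 0]
print("Manhattan:", minkowski(a, b, 1))   # |0-2| + |2-0| = 4
print("Euclidean:", minkowski(a, b, 2))   # sqrt(8) ≈ 2.828

# Pairwise distance matrix for a set of points.
pts = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])
D = np.array([[minkowski(p, q, 2) for q in pts] for p in pts])
print(np.round(D, 3))
```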
Back to k-NN

• supervised learning
• the target function f may be
  – discrete-valued (classification)
  – real-valued (regression)
• We assign the new point to the class of the training instance(s)
  most similar to it.
Discrete-valued target function

• The unknown sample x is assigned the class that is most common
  among the k training examples closest to x.

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor,
(c) 3-nearest neighbor neighborhoods of a query point X]

Tan, Steinbach, Kumar – Introduction to Data Mining
• k-NN never forms an explicit general hypothesis f' regarding the
  target function f.
  – It simply computes the classification of each new instance as
    needed (a minimal sketch follows below).
• Nevertheless, we can still ask what classification would be
  assigned if we held the training examples constant and queried
  the algorithm with every possible instance x.
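A minimal sketch of such a lazy classifier; the function name
knn_classify and the toy data are made up for illustration:

```python
import numpy as np
from collections import Counter

# Lazy k-NN: no training step, each query scans the stored examples.
def knn_classify(X_train, y_train, x_query, k=3):
    d = np.sqrt(((np.asarray(X_train) - np.asarray(x_query)) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]           # indices of the k closest points
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]

X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]]
y = ["blue", "blue", "blue", "orange", "orange", "orange"]
print(knn_classify(X, y, [2, 2], k=3))    # -> "blue"
```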
[Figure: 1-NN induces a Voronoi tessellation of the input space]
[Figure: the resulting 1-NN classification boundary]
Which k is best?

[Figure: decision boundaries for k = 1 and k = 15]

k = 1 fits noise and outliers (overfitting); k = 15 gives a smoother
boundary, though too large a value smooths out distinctive behavior.

Hastie et al., Elements of Statistical Learning
Real-valued target function

• The algorithm calculates the mean value of the k nearest training
  examples.

[Figure: k = 3; the three nearest neighbors have values 12, 14 and
10, so the predicted value = (12 + 14 + 10)/3 = 12]
Distance-weighted NN

• Refinement: weight the contribution of each of the k nearest
  neighbors according to their distance to the query point.
   – Give greater weight to closer neighbors.

[Figure: k = 4; two neighbors of one class at distances 1 and 2,
two of the other class at distances 4 and 5]

  unweighted: 2 votes vs. 2 votes (a tie)
  weighted by 1/d²: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² ≈ 0.102 votes
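A tiny sketch of the weighted vote with 1/d² weights, reproducing
the numbers above (the class labels and helper name are placeholders):

```python
# Sum of 1/d^2 weights per class, given each neighbor's distance.
def weighted_votes(distances_by_class):
    return {c: sum(1.0 / d**2 for d in ds)
            for c, ds in distances_by_class.items()}

votes = weighted_votes({"class A": [1, 2], "class B": [4, 5]})
print(votes)   # class A: 1/1 + 1/4 = 1.25, class B: 1/16 + 1/25 ≈ 0.102
```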
Euclidean distance issues

• Attributes with large values can overwhelm the influence of
  attributes measured on a smaller scale.
• Solution: normalize the values

  X* = (X − min(X)) / (max(X) − min(X))    min-max normalization

  X* = (X − mean(X)) / SD(X)               Z-score standardization
                  k-NN issues
• Distance is calculated based on ALL attributes.
• Example:
  – each instance is described by 20 attributes,
    however only 2 are relevant
  – instances with identical values of the 2 relevant attributes
    (i.e. zero distance in the 2-D space) may be distant in the
    20-D space
  – thus, the similarity metric will be misleading
  – this is a manifestation of the curse of dimensionality
                k-NN issues
• Significant computation may be required to
  process each new query.
• To find the nearest neighbors, one has to evaluate the full
  distance matrix.
• Efficient indexing of the stored training examples helps
  – e.g. a kd-tree
• instance-based learning (memory-based learning)
   – a family of learning algorithms that, instead of performing
     explicit generalization, compare new problem instances with
     instances seen in training, which have been stored in memory
   – it is a kind of lazy learning
• lazy learning
   – generalization beyond the training data is delayed until a
     query is made to the system
   – opposed to eager learning, where the system tries to generalize
     the training data before receiving queries
• lazy learners – e.g. k-NN