lecture3
Last lecture summary
Basic terminology
– classification
– regression
• learner, algorithm
– each has one or several parameters influencing its
behavior
• model
– one concrete combination of learner and parameters
– tune the parameters using the training set
– the generalization is assessed using test set
(previously unseen data)
• learning (training)
– supervised
• a target vector t is known, parameters are tuned to
achieve the best match between prediction and the
target vector
– unsupervised
• training data consists of a set of input vectors x without
any corresponding target value
• clustering, visualization
• for most applications, the original input
variables must be preprocessed
– feature selection
– feature extraction

[figure: feature selection keeps a subset of the original inputs (e.g. x1, x5, x103, x456 out of x1 … x784); feature extraction builds new features x*1 … x*784 from all inputs and then keeps a subset of those (e.g. x*18, x*152, x*309, x*666)]
• feature selection/extraction = dimensionality reduction
– generally good thing
– curse of dimensionality
• example:
– learner: regression (polynomial, y = w0 + w1x + w2x² + w3x³ + …)
– parameters: weights (coefficients) w, order of the polynomial
• weights
– adjusted so that the sum of the squares of the errors E(w)
(the error function) is as small as possible

E(w) = (1/2) Σ_{n=1..N} [y(x_n, w) − t_n]²

(y(x_n, w) … predicted value, t_n … known target)
• order of polynomial
– problem of model selection
– for model comparison use MSE or RMS
(independent from N)
MSE = (1/N) Σ_{n=1..N} [y(x_n, w) − t_n]²

RMS = √MSE

(y(x_n, w) … predicted value, t_n … known target)

– training error always goes down with the
increasing polynomial order
– however, test error gets worse for high orders of
polynomial (overfitting)
[figure: training-set vs. test-set fit of an order M = 9 polynomial to N = 15 points; the fit overfits]

for a given model complexity the
overfitting problem becomes less
severe as the size of the data set
increases

[figure: the same M = 9 polynomial fitted to N = 100 points no longer overfits]

or in other words: the larger the data set, the more
complex (flexible) a model can be fitted
• large bias – model is not accurate enough, it is
not able to accurately represent the data (large
training error)
• large variance – overfitting occurs (the
predictions of the model depend a lot on the
particular sample that was used for building the
model)
– low flexibility models have large bias and low variance
– high flexibility models have low bias and large
variance
• A polynomial with too few parameters (too
low degree) will make large errors because of
a large bias.
• A polynomial with too many parameters (too
high degree) will make large errors because of
a large variance.

• MSE is a good error measure because
MSE = variance + bias²
Test-data and Cross Validation
attributes, input/independent variables, features

Tid | Refund | Marital Status | Taxable Income | Cheat
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

(each row is an object/instance/sample; the Cheat column is the class)
Attribute types
• discrete
– Has only a finite or countably infinite set of values.
– nominal (also categorical)
• the values are just different labels (e.g. ID number, eye color)
• central tendency given by mode (median, mean not defined)
– ordinal
• their values reflect the order (e.g. ranking, height in {tall,
medium, short})
• central tendency given by median, mode (mean not defined)
– binary attributes - special case of discrete attributes
• continuous (also quantitative)
– Has real numbers as attribute values.
– central tendency given by mean, + stdev, …
A regression problem

y = f(x) + noise
Can we learn from this data?
Consider three methods.

[figure: noisy scatter of (x, y) data]

taken from Cross Validation tutorial by Andrew Moore,
http://www.autonlab.org/tutorials/overfit.html
Linear regression

What will the regression model look like?

y = ax + b

Univariate linear regression with a constant term.

[figure: straight-line fit through the data]

What will the regression model look like?

y = ax² + bx + c

[figure: quadratic fit through the data]
Join-the-dots

Also known as piecewise linear nonparametric
regression, if that makes you feel better.

[figure: join-the-dots fit through every data point]
Which is best?

Why not choose the method with the best fit to the data?
What do we really want ?

Why not choose the method with the best fit to the data?

How well are you going to predict future data?
The test set method
1. Randomly choose 30% of the data to be the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.

[figure: linear regression fitted to the training points]
linear regression, MSE = 2.4
The test set method
1. Randomly choose 30% of the data to be the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.

[figure: quadratic regression fitted to the training points]
quadratic regression, MSE = 0.9
The test set method
1. Randomly choose 30% of the data to be the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.

[figure: join-the-dots fitted to the training points]
join-the-dots, MSE = 2.2
Test set method
• good news
– very simple
– just choose the method with the best test-set score
• bad news
– wastes data (we got an estimate of the best method by
using 30% less data)        [Train | Test]
– if you don’t have enough data, the test set may be just
lucky/unlucky

the test-set estimator of performance has high variance
[figure: training error keeps decreasing with model complexity, while testing error rises again past the optimum]
• stratified division
– same proportion of data in the training and test
sets
• Training error cannot be used as an indicator of
the model’s performance, due to overfitting.
• Training set – used to train a range of models, or
a given model with a range of values for its
parameters.
• Validation set – independent data on which the
models are compared.
– If the model design is iterated many times, some
overfitting to the validation data can occur, so it
may be necessary to keep aside a third set.
• Test set – the set on which the performance of the
selected model is finally evaluated.
LOOCV (Leave-one-out Cross Validation)

1. Choose one data point.
2. Remove it from the set.
3. Fit the remaining data points and note the error on the removed point.

Repeat these steps for all points.
When you are done, report the mean square error.

[figure: regression fit with a single point left out]
MSE_LOOCV = 2.12 (linear regression)
MSE_LOOCV = 0.962 (quadratic regression)
MSE_LOOCV = 3.33 (join-the-dots)
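LOOCV itself is only a few lines of code. In the sketch below, the constant-mean "model" is a stand-in chosen just to keep the example short and checkable by hand; any fit/predict pair (e.g. the regressions above) could be plugged in:

```python
def loocv_mse(data, fit, predict):
    # leave one point out, fit the remaining points, record the squared
    # error on the held-out point; repeat for all points, report the mean
    errors = []
    for i, (x, t) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        model = fit(rest)
        errors.append((predict(model, x) - t) ** 2)
    return sum(errors) / len(errors)

# minimal stand-in: a "model" that always predicts the training mean
fit_mean = lambda pts: sum(t for _, t in pts) / len(pts)
predict_mean = lambda model, x: model

print(loocv_mse([(0, 1.0), (1, 2.0), (2, 3.0)], fit_mean, predict_mean))  # 1.5
```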
Which kind of Cross Validation?

Test set   Cheap.               Variance. Wastes data.
LOOCV      Doesn’t waste data.  Expensive.

Can we get the best of both worlds?
k-fold Cross Validation
Randomly break the data set into k partitions.
In our case k = 3.

Red partition: train on all points not in the red partition.
Find the test-set sum of errors on the red points.

Blue partition: train on all points not in the blue partition.
Find the test-set sum of errors on the blue points.

Green partition: train on all points not in the green partition.
Find the test-set sum of errors on the green points.

Then report the mean error.

[figure: linear regression, MSE_3fold = 2.05]
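A sketch of the same procedure for a general k. For determinism the partitions below are taken round-robin rather than randomly, and the constant-mean "model" is again only a stand-in:

```python
def kfold_mse(data, k, fit, predict):
    # break the data into k partitions (round-robin here, random in practice)
    folds = [data[i::k] for i in range(k)]
    total, count = 0.0, 0
    for i in range(k):
        # train on all points not in fold i
        train = [p for j in range(k) if j != i for p in folds[j]]
        model = fit(train)
        # find the sum of errors on fold i's points
        for x, t in folds[i]:
            total += (predict(model, x) - t) ** 2
            count += 1
    return total / count  # mean error over all held-out points

fit_mean = lambda pts: sum(t for _, t in pts) / len(pts)
predict_mean = lambda model, x: model

data = [(float(i), float(i + 1)) for i in range(6)]  # toy data
print(kfold_mse(data, 3, fit_mean, predict_mean))    # 3.75 on this toy data
```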
Results of 3-fold Cross Validation

               MSE_3fold
linear         2.05
join-the-dots  2.93
Which kind of Cross Validation?
Test set   Cheap.                          Variance. Wastes data.
LOOCV      Doesn’t waste data.             Expensive.
3-fold     Slightly better than test set.  Wastes more than LOOCV.
                                           More expensive than test set.
10-fold    Only wastes 10%. Only 10 times  Still wastes 10%, still 10 times
           more expensive than test set.   more expensive (though not N
                                           times, as LOOCV is).

N-fold cross validation is identical to LOOCV.
taken from Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html
Model selection via CV
• We are trying to decide which model to use. For
polynomial regression, decide on the degree of the
polynomial.
• Train each machine and make a table.
degree   MSEtrain   MSE10-fold   Choice
1
2
3
4
5
6

• Whichever model gave best CV score: train it with all
the data. That’s the predictive model you’ll use.
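The table-filling procedure can be sketched with numpy's polyfit/polyval; the sine-plus-noise data set below is an assumption made up for illustration, not the lecture's data:

```python
import numpy as np

def cv_mse(x, t, degree, k=10):
    # 10-fold CV estimate of MSE for a polynomial of the given degree
    folds = np.array_split(np.arange(len(x)), k)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(x)), test_idx)
        coeffs = np.polyfit(x[train_idx], t[train_idx], degree)
        pred = np.polyval(coeffs, x[test_idx])
        errs.append(float(np.mean((pred - t[test_idx]) ** 2)))
    return float(np.mean(errs))

# hypothetical data: a noisy sine curve
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, x.size)

table = {d: cv_mse(x, t, d) for d in range(1, 7)}   # degree -> MSE_10-fold
best = min(table, key=table.get)
# whichever degree gave the best CV score: train it with ALL the data
final_model = np.polyfit(x, t, best)
```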
Selection and testing
• Complete procedure for algorithm selection and
estimation of its quality:
1. Divide the data into Train/Test.   [Train | Test]
2. By cross-validation on Train, choose the algorithm.   [Train | Val]
3. Use this algorithm to construct a classifier using the whole Train.   [Train]
4. Estimate its quality on Test.   [Test]
[figure: scatter of blue and orange points with an unlabeled query point marked “?”]

• Which class (Blue or Orange) would you predict
for this point?
• And why?
• classification boundary
[figure: the same scatter with the query point at a different location]
• And now?
[figure: the same scatter with the query point at yet another location]
• And now?
• And why?
Nearest Neighbors Classification
• But what does it mean to be similar?

[figure: four pairs of instances, A–D, with different degrees of similarity]

source: Kardi Teknomo’s Tutorials, http://people.revoledu.com/kardi/tutorial/index.html
• Similarity s_ij is a quantity that reflects the
strength of the relationship between two objects
or two features.
– This quantity usually ranges from −1 to +1,
or is normalized into 0 to 1.
• Distance d_ij measures dissimilarity.
– Dissimilarity measures the discrepancy between
two objects based on several features.
– Distance is a quantitative variable that satisfies the
following conditions:
• distance is always positive or zero (d_ij ≥ 0)
• distance is zero if and only if it is measured from an object to itself
• distance is symmetric (d_ij = d_ji)
• In addition, if the distance satisfies the triangle
inequality d_ik ≤ d_ij + d_jk, then it is called a
metric.
• Not all distances are metrics, but all metrics
are distances.
Distances for binary variables
Fruit  | Sphere shape | Sweet | Sour | Crunchy
Apple  | Yes          | Yes   | Yes  | Yes
Banana | No           | Yes   | No   | No

coded as binary:
Apple  | 1 | 1 | 1 | 1
Banana | 0 | 1 | 0 | 0

for this pair: p = 1, q = 3, r = 0, s = 0

•   p – number of variables positive for both objects
•   q – positive for the ith object and negative for the jth object
•   r – negative for the ith object and positive for the jth object
•   s – negative for both objects
•   t = p + q + r + s (total number of variables)
• Simple matching coefficient/distance

s_ij = (p + s) / t        d_ij = 1 − s_ij = (q + r) / t

• Jaccard coefficient/distance

s_ij = p / (p + q + r)    d_ij = (q + r) / (p + q + r)
• Hamming distance

d_ij = q + r

(p, q, r, s and t = p + q + r + s as defined above)
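All three binary measures follow directly from the counts p, q, r, s; the sketch below recomputes the apple/banana example from the table above:

```python
def binary_counts(a, b):
    # p: both 1, q: a=1 b=0, r: a=0 b=1, s: both 0
    p = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    r = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return p, q, r, s

def simple_matching_dist(a, b):
    p, q, r, s = binary_counts(a, b)
    return (q + r) / (p + q + r + s)

def jaccard_dist(a, b):
    p, q, r, s = binary_counts(a, b)
    return (q + r) / (p + q + r)

def hamming_dist(a, b):
    p, q, r, s = binary_counts(a, b)
    return q + r

apple  = [1, 1, 1, 1]   # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(simple_matching_dist(apple, banana))  # (3 + 0) / 4 = 0.75
print(jaccard_dist(apple, banana))          # 3 / (1 + 3 + 0) = 0.75
print(hamming_dist(apple, banana))          # 3
```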
Distances for quantitative variables
• Minkowski distance (Lp norm)

L_p = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

• distance matrix – matrix with all pairwise
distances
     p1     p2     p3     p4
p1   0      2.828  3.162  5.099
p2   2.828  0      1.414  3.162
p3   3.162  1.414  0      2
p4   5.099  3.162  2      0
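A short sketch of the Minkowski distance and a distance matrix; the 2-D coordinates below are hypothetical, chosen so that the p = 2 (Euclidean) case reproduces the matrix above:

```python
def minkowski(x, y, p):
    # Lp norm of the difference between two points
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# hypothetical coordinates for p1..p4
points = [(0, 2), (2, 0), (3, 1), (5, 1)]
matrix = [[round(minkowski(a, b, 2), 3) for b in points] for a in points]
for row in matrix:
    print(row)
```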
Manhattan distance
• How to measure distance of two bikers in
Manhattan ?

source: wikipedia
L1 = d(x, y) = Σ_{i=1..n} |x_i − y_i|

[figure: the Manhattan distance between x = (x1, x2) and y = (y1, y2) is measured along the grid axes]
Euclidean distance
L2 = d(x, y) = √( Σ_{i=1..n} (x_i − y_i)² )

[figure: the Euclidean distance between x = (x1, x2) and y = (y1, y2) is the length of the straight line connecting them]
Back to k-NN
• supervised learning
• target function f may be
– discrete-valued (classification)
– real-valued (regression)
• We assign the given point to the class of the training
instance most similar to it.
Discrete-valued target function
• The unknown sample x is assigned a class that
is most common among the k training
examples closest to x.

[figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods of a query point X]

Tan, Steinbach, Kumar – Introduction to Data Mining
• k-NN never forms an explicit general
hypothesis f’ regarding the target function f.
– It simply computes classification of each new
instance as needed.
• Nevertheless, we can still ask what
classification would be assigned if we hold the
training examples constant and query the
algorithm with every possible instance x.
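Classification of a single query can be sketched as a sort by distance plus a majority vote; the 2-D points and the blue/orange labels below are made up for illustration:

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    # train: list of ((x, y), label) pairs; assign the class most common
    # among the k training examples closest to the query point
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical 2-D points in two classes
train = [((0, 0), 'blue'), ((1, 0), 'blue'),
         ((5, 5), 'orange'), ((6, 5), 'orange')]
print(knn_classify(train, (1, 1), 1))  # 'blue'
print(knn_classify(train, (1, 1), 3))  # 'blue' (2 of the 3 nearest are blue)
```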
1-NN … Voronoi tessellation
1-NN … classification boundary
Which k is best?

[figure: k = 1 – fitting noise and outliers, overfitting; k = 15 – value not too small, smooths out distinctive behavior]

Hastie et al., Elements of Statistical Learning
Real-valued target function
• Algorithms calculate the mean value of the k
nearest training examples.
k = 3 example: the three nearest neighbors have values 12, 14 and 10, so the predicted
value = (12 + 14 + 10) / 3 = 12
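The same example as a code sketch (the x-coordinates attached to the target values 12, 14 and 10 are made up):

```python
def knn_regress(train, x, k):
    # average the target values of the k nearest training examples
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(t for _, t in nearest) / k

# hypothetical 1-D training data: (x, target value)
train = [(1.0, 12.0), (2.0, 14.0), (3.0, 10.0), (10.0, 50.0)]
print(knn_regress(train, 2.0, 3))  # (12 + 14 + 10) / 3 = 12.0
```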
Distance-weighted NN
• Refinement: weight the contribution of each of k
nearest neighbors according to their distance to the
query point.
– Give greater weight to closer neighbors.

k = 4; two neighbors of one class at distances 1 and 2,
two of the other class at distances 4 and 5:
• unweighted: each neighbor gets one vote
• weighted: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² ≈ 0.102 votes
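The weighted vote can be recomputed directly; the class names 'A' and 'B' below are placeholders for the two classes:

```python
def distance_weighted_vote(neighbors):
    # neighbors: list of (distance, label); each neighbor contributes
    # a vote of weight 1/d^2, so closer neighbors count for more
    votes = {}
    for d, label in neighbors:
        votes[label] = votes.get(label, 0.0) + 1.0 / d ** 2
    winner = max(votes, key=votes.get)
    return winner, votes

# the k = 4 example: one class at distances 1 and 2, the other at 4 and 5
winner, votes = distance_weighted_vote(
    [(1, 'A'), (2, 'A'), (4, 'B'), (5, 'B')])
print(votes['A'])  # 1/1^2 + 1/2^2 = 1.25 votes
print(votes['B'])  # 1/4^2 + 1/5^2 = 0.1025 votes
print(winner)      # 'A'
```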
Euclidean distance issues
• Certain attributes with large values can
overwhelm the influence of other attributes
measured on smaller scale.
• Solution: normalize the values

min-max normalization:     X* = (X − min(X)) / (max(X) − min(X))
Z-score standardization:   X* = (X − mean(X)) / SD(X)
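Both normalizations in a short sketch; the income values echo the Taxable Income column from the earlier table (in thousands), a scale that would otherwise dwarf a 0/1 attribute like Refund:

```python
import statistics

def min_max(xs):
    # min-max normalization: rescale the values into [0, 1]
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    # Z-score standardization: zero mean, unit standard deviation
    m, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - m) / sd for x in xs]

income = [125.0, 100.0, 70.0, 120.0, 95.0]
print(min_max(income))   # values now span exactly [0, 1]
print(z_score(income))
```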
k-NN issues
• Distance is calculated based on ALL attributes.
• Example:
– each instance is described by 20 attributes,
however only 2 are relevant
– instances with identical 2 relevant attributes (i.e.
their distance is zero in 2-D space) may be distant
in 20-D space
– Thus, the similarity metrics will be misleading
– This is the manifestation of the curse of
dimensionality
k-NN issues
• Significant computation may be required to
process each new query.
• To find the nearest neighbors, one has to evaluate
the full distance matrix.
• Efficient indexing of stored training examples
helps
– kd-tree
• instance based learning (memory based learning)
– family of learning algorithms that, instead of
performing explicit generalization, compare new
problem instances with instances seen in training
which have been stored in memory
– it is a kind of lazy learning
• lazy learning
– generalization beyond the training data is delayed
until a query is made to the system
– as opposed to eager learning, where the system tries to
generalize the training data before receiving queries
• lazy learners – e.g. k-NN
Literature
