Posted on: 12/30/2010
Last lecture summary

Basic terminology
• tasks
  – classification
  – regression
• learner, algorithm
  – each has one or several parameters influencing its behavior
• model
  – one concrete combination of learner and parameters
  – tune the parameters using the training set
  – generalization is assessed using the test set (previously unseen data)
• learning (training)
  – supervised
    • a target vector t is known; parameters are tuned to achieve the best match between prediction and the target vector
  – unsupervised
    • training data consist of a set of input vectors x without any corresponding target values
    • clustering, visualization
• for most applications, the original input variables must be preprocessed
  – feature selection (keep a subset of the original variables, e.g. x1, x5, x103, x456 out of x1 … x784)
  – feature extraction (derive new variables x*1 … x*784 from the originals, then keep a subset, e.g. x*18, x*152, x*309, x*666)
• feature selection/extraction = dimensionality reduction
  – generally a good thing
  – curse of dimensionality
• example:
  – learner: polynomial regression, y = w0 + w1·x + w2·x² + w3·x³ + …
  – parameters: weights (coefficients) w, order of the polynomial
• weights
  – adjusted so that the sum of the squares of the errors E(w) (error function) is as small as possible:

      E(w) = (1/2) Σn=1..N [y(xn, w) − tn]²

    where y(xn, w) is the prediction and tn the known target
• order of the polynomial
  – a problem of model selection
  – for model comparison use MSE or RMS (independent of N):

      MSE = (1/N) Σn=1..N [y(xn, w) − tn]²
      RMS = √MSE

  – training error always goes down with increasing polynomial order
  – however, test error gets worse for high orders of the polynomial (overfitting)
• [figure: training vs. test error; an order M = 9 polynomial overfits N = 15 points]
  – for a given model complexity, the overfitting problem becomes less severe as the size of the data set increases (M = 9, N = 100)
  – in other words, the larger the data set, the more complex (flexible) a model can be fitted

Bias-variance tradeoff
• large bias – the model is not accurate enough; it is not able to accurately represent the
data (large training error)
• large variance – overfitting occurs (the predictions of the model depend strongly on the particular sample that was used for building the model)
• tradeoff
  – low-flexibility models have large bias and low variance
  – high-flexibility models have low bias and large variance
• A polynomial with too few parameters (too low a degree) will make large errors because of a large bias.
• A polynomial with too many parameters (too high a degree) will make large errors because of a large variance.
• MSE is a good error measure because MSE = variance + bias²

Test data and Cross Validation

• example data set (rows = objects/instances/samples; columns = attributes/input/independent variables/features; "Cheat" is the class):

  Tid | Refund | Marital Status | Taxable Income | Cheat
   1  | Yes    | Single         | 125K           | No
   2  | No     | Married        | 100K           | No
   3  | No     | Single         | 70K            | No
   4  | Yes    | Married        | 120K           | No
   5  | No     | Divorced       | 95K            | Yes
   6  | No     | Married        | 60K            | No
   7  | Yes    | Divorced       | 220K           | No
   8  | No     | Single         | 85K            | Yes
   9  | No     | Married        | 75K            | No
  10  | No     | Single         | 90K            | Yes

Attribute types
• discrete
  – has only a finite or countably infinite set of values
  – nominal (also categorical)
    • the values are just different labels (e.g. ID number, eye color)
    • central tendency given by mode (median, mean not defined)
  – ordinal
    • the values reflect an order (e.g. ranking, height in {tall, medium, short})
    • central tendency given by median, mode (mean not defined)
  – binary attributes – a special case of discrete attributes
• continuous (also quantitative)
  – has real numbers as attribute values
  – central tendency given by mean, + stdev, …

A regression problem
• y = f(x) + noise
• Can we learn f from this data? Consider three methods.
(this and the following slides taken from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html)

Linear regression
• What will the regression model look like? y = ax + b
• Univariate linear regression with a constant term.
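Before comparing regression methods, the error measures from the summary above can be made concrete. A minimal sketch (the noisy-sine data, the random seed, and the chosen orders are my own assumptions, not the lecture's) that fits polynomials of several orders and reports training and test RMS:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumed): a noisy sine curve, N = 15 training points.
x_train = np.linspace(0, 1, 15)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=15)
x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=100)

def rms(w, x, t):
    """RMS = sqrt(MSE) between prediction y(x, w) and target t."""
    return np.sqrt(np.mean((np.polyval(w, x) - t) ** 2))

for order in (1, 3, 9):
    w = np.polyfit(x_train, t_train, order)  # least-squares fit: minimizes E(w)
    print(order, rms(w, x_train, t_train), rms(w, x_test, t_test))
```

Training RMS can only decrease as the order grows; with only 15 points the order-9 test RMS is typically much worse — the overfitting pattern described above.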
Quadratic regression
• What will the regression model look like? y = ax² + bx + c

Join-the-dots
• Also known as piecewise linear nonparametric regression, if that makes you feel better.

Which is best?
• Why not choose the method with the best fit to the data?

What do we really want?
• Why not choose the method with the best fit to the data?
• How well are you going to predict future data?

The test set method
1. Randomly choose 30% of the data to be in the test set.
2. The remainder is the training set.
3. Perform regression on the training set.
4. Estimate future performance with the test set.
• linear regression: MSE = 2.4
• quadratic regression: MSE = 0.9
• join-the-dots: MSE = 2.2

Test set method
• good news
  – very simple
  – then choose the method with the best score
• bad news
  – wastes data (we got an estimate of the best method by using 30% less data)
  – if you don't have enough data, the test set may be just lucky/unlucky – the test set estimator of performance has high variance
• [figure: training error decreases with model complexity, while testing error first decreases and then rises again]
• stratified division – same proportion of data in the training and test sets
• Training error cannot be used as an indicator of a model's performance, due to overfitting.
• Training set – train a range of models, or a given model with a range of values for its parameters.
• Compare them on independent data – the validation set.
  – If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third set.
• Test set – on which the performance of the selected model is finally evaluated.

LOOCV (Leave-one-out Cross Validation)
1. Choose one data point.
2. Remove it from the set.
3. Fit the remaining data points.
4. Note your error.
Repeat these steps for all points. When you are done, report the mean square error.
• linear regression: MSE_LOOCV = 2.12
• quadratic regression: MSE_LOOCV = 0.962
• join-the-dots: MSE_LOOCV = 3.33

Which kind of Cross Validation?
           | Good               | Bad
  Test set | Cheap              | Variance; wastes data
  LOOCV    | Doesn't waste data | Expensive
Can we get the best of both worlds?
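The LOOCV procedure above, and the k-fold variant discussed next, differ only in how many partitions are held out. A minimal sketch (hypothetical linear data and helper names of my own; with k = N this is exactly LOOCV):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
y = 1.0 + 0.7 * x + rng.normal(0, 1, 30)  # assumed noisy linear data

def cv_mse(x, y, degree, k):
    """k-fold CV for polynomial regression; k = len(x) gives LOOCV."""
    idx = rng.permutation(len(x))
    sq_errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)            # fit on all points not in the fold
        w = np.polyfit(x[train], y[train], degree)
        sq_errors.extend((np.polyval(w, x[fold]) - y[fold]) ** 2)
    return np.mean(sq_errors)                      # report the mean squared error

print("3-fold:", cv_mse(x, y, degree=1, k=3))
print("LOOCV: ", cv_mse(x, y, degree=1, k=len(x)))
```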
k-fold Cross Validation
• Randomly break the data set into k partitions. In our case k = 3.
  – Red partition: train on all points not in the red partition; find the test-set sum of errors on the red points.
  – Blue partition: train on all points not in the blue partition; find the test-set sum of errors on the blue points.
  – Green partition: train on all points not in the green partition; find the test-set sum of errors on the green points.
• Then report the mean error.
• linear regression: MSE_3fold = 2.05

Results of 3-fold Cross Validation
  method        | MSE_3fold
  linear        | 2.05
  quadratic     | 1.11
  join-the-dots | 2.93

Which kind of Cross Validation?
           | Good                             | Bad
  Test set | Cheap.                           | Variance; wastes data.
  LOOCV    | Doesn't waste data.              | Expensive.
  3-fold   | Slightly better than test set.   | Wastes more data than LOOCV; more expensive than test set.
  10-fold  | Only wastes 10%; only 10 times   | Wastes 10%; 10 times more expensive
           | more expensive instead of R times.| instead of R times (as LOOCV is).
• R-fold is identical to LOOCV.

Model selection via CV
• We are trying to decide which model to use. For polynomial regression, decide the degree of the polynomial.
• Train each machine and fill in a table with columns: degree | MSE_train | MSE_10-fold | Choice (degrees 1 to 6).
• Whichever model gave the best CV score: train it with all the data. That's the predictive model you'll use.

Selection and testing
• Complete procedure for algorithm selection and estimation of its quality:
1. Divide the data into Train/Test.
2. By Cross Validation on the Train part choose the algorithm (Train/Validation splits).
3. Use this algorithm to construct a classifier using the whole Train part.
4.
Estimate its quality on the Test part.

Nearest Neighbors Classification
• Which class (blue or orange) would you predict for this point? And why?
• [figure: a classification boundary separating the two classes – first linear, then quadratic]
• We classify by similarity to the nearest training instances. But what does it mean to be similar?
(source: Kardi Teknomo's Tutorials, http://people.revoledu.com/kardi/tutorial/index.html)

• Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features.
  – This quantity usually has a range of either −1 to +1, or is normalized into 0 to 1.
• Distance dij measures dissimilarity.
  – Dissimilarity measures the discrepancy between two objects based on several features.
  – Distance is a quantitative variable that satisfies the following conditions:
    • distance is always positive or zero (dij ≥ 0)
    • distance is zero if and only if it is measured from an object to itself
    • distance is symmetric (dij = dji)
• In addition, if a distance satisfies the triangle inequality dik ≤ dij + djk, then it is called a metric.
• Not all distances are metrics, but all metrics are distances.
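To make the distance-versus-metric distinction concrete, here is a small sketch (my own helper, not from the tutorial) that brute-force checks the metric axioms on a finite point set:

```python
import itertools
import math

def is_metric(points, dist):
    """Brute-force check of the metric axioms on a finite point set."""
    for x, y, z in itertools.product(points, repeat=3):
        if dist(x, y) < 0:                       # non-negativity
            return False
        if (dist(x, y) == 0) != (x == y):        # zero iff measured to itself
            return False
        if dist(x, y) != dist(y, x):             # symmetry
            return False
        if dist(x, z) > dist(x, y) + dist(y, z) + 1e-12:  # triangle inequality
            return False
    return True

pts = [(0, 0), (1, 0), (0, 2), (3, 1)]
print(is_metric(pts, math.dist))                          # True: Euclidean is a metric
print(is_metric(pts, lambda a, b: math.dist(a, b) ** 2))  # False: squared Euclidean
                                                          # violates the triangle inequality
```

Squared Euclidean distance is a perfectly valid dissimilarity, but the check exposes a triple of points where it breaks the triangle inequality, so it is not a metric.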
Distances for binary variables

  Fruit  | Sphere shape | Sweet | Sour | Crunchy
  Apple  | Yes (1)      | Yes (1) | Yes (1) | Yes (1)
  Banana | No (0)       | Yes (1) | No (0)  | No (0)

  → p = 1, q = 3, r = 0, s = 0

• p – number of variables positive for both objects
• q – positive for the ith object and negative for the jth object
• r – negative for the ith object and positive for the jth object
• s – negative for both objects
• t = p + q + r + s (total number of variables)

• Simple matching coefficient/distance
    sij = (p + s) / t        dij = 1 − sij = (q + r) / t
• Jaccard coefficient/distance
    sij = p / (p + q + r)    dij = (q + r) / (p + q + r)
• Hamming distance
    dij = q + r

Distances for quantitative variables
• Minkowski distance (Lp norm)
    Lp = ( Σi=1..n |xi − yi|^p )^(1/p)
• distance matrix – matrix with all pairwise distances

        p1     p2     p3     p4
  p1    0      2.828  3.162  5.099
  p2    2.828  0      1.414  3.162
  p3    3.162  1.414  0      2
  p4    5.099  3.162  2      0

Manhattan distance
• How do you measure the distance between two bikers in Manhattan? (source: Wikipedia)
    L1 = d(x, y) = Σi=1..n |xi − yi|
Euclidean distance
    L2 = d(x, y) = √( Σi=1..n (xi − yi)² )

Back to k-NN
• supervised learning
• the target function f may be
  – discrete-valued (classification)
  – real-valued (regression)
• We assign the point to the class whose instances are most similar to it.

Discrete-valued target function
• The unknown sample x is assigned the class that is most common among the k training examples closest to x.
• [figure: 1-, 2-, and 3-nearest neighbors of a query point X (Tan, Steinbach, Kumar – Introduction to Data Mining)]
• k-NN never forms an explicit general hypothesis f′ regarding the target function f.
  – It simply computes the classification of each new instance as needed.
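The distance measures defined above can be computed directly. A short sketch (helper names are my own) reproducing the apple/banana counts and the Minkowski family:

```python
def binary_distances(a, b):
    """Simple matching, Jaccard, and Hamming distances for 0/1 vectors."""
    p = sum(x == 1 and y == 1 for x, y in zip(a, b))  # positive in both
    q = sum(x == 1 and y == 0 for x, y in zip(a, b))  # positive in i, negative in j
    r = sum(x == 0 and y == 1 for x, y in zip(a, b))  # negative in i, positive in j
    s = sum(x == 0 and y == 0 for x, y in zip(a, b))  # negative in both
    t = p + q + r + s
    return {
        "simple_matching": (q + r) / t,
        "jaccard": (q + r) / (p + q + r),
        "hamming": q + r,
    }

def minkowski(x, y, p):
    """Lp distance between two numeric vectors (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

apple  = [1, 1, 1, 1]   # sphere shape, sweet, sour, crunchy
banana = [0, 1, 0, 0]
print(binary_distances(apple, banana))
# simple matching = 3/4, Jaccard = 3/4, Hamming = 3

print(minkowski((0, 0), (3, 4), 1))  # Manhattan: 7
print(minkowski((0, 0), (3, 4), 2))  # Euclidean: 5.0
```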
• Nevertheless, we can still ask what classification would be assigned if we hold the training examples constant and query the algorithm with every possible instance x.
  – 1-NN induces a Voronoi tessellation of the input space; its cell borders form the classification boundary.

Which k is best?
• k = 1: fits noise and outliers – overfitting
• k = 15: smooth, but a value that is too large smooths out distinctive behavior
(Hastie et al., Elements of Statistical Learning)

Real-valued target function
• The algorithm calculates the mean value of the k nearest training examples.
  – e.g. k = 3, neighbor values 12, 14, 10 → predicted value = (12 + 14 + 10)/3 = 12

Distance-weighted NN
• Refinement: weight the contribution of each of the k nearest neighbors according to their distance to the query point.
  – Give greater weight to closer neighbors.
• Example (k = 4, neighbors at distances 1, 2, 4, 5):
  – unweighted: 2 votes vs. 2 votes (a tie)
  – weighted: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² ≈ 0.102 votes

Euclidean distance issues
• Attributes with large values can overwhelm the influence of attributes measured on a smaller scale.
• Solution: normalize the values
  – min-max normalization:      X* = (X − min X) / (max X − min X)
  – Z-score standardization:    X* = (X − mean X) / SD(X)

k-NN issues
• Distance is calculated based on ALL attributes.
• Example:
  – each instance is described by 20 attributes, but only 2 are relevant
  – instances with identical values of the 2 relevant attributes (i.e. zero distance in 2-D space) may be distant in 20-D space
  – thus the similarity metric will be misleading
  – this is a manifestation of the curse of dimensionality

k-NN issues
• Significant computation may be required to process each new query.
• To find the nearest neighbors one has to evaluate the full distance matrix.
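The weighted-vote refinement above (four neighbors at distances 1, 2, 4, 5) can be sketched as follows; the coordinates and class labels are hypothetical, chosen only to produce those distances:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k, weighted=False):
    """k-NN vote; with weighted=True each neighbor votes with weight 1/d^2."""
    nearest = sorted((math.dist(x, query), label) for x, label in train)[:k]
    votes = defaultdict(float)
    for d, label in nearest:
        votes[label] += 1.0 / d ** 2 if weighted else 1.0
    return max(votes, key=votes.get)

# Hypothetical points: two "blue" neighbors at distances 1 and 2 from the origin,
# two "orange" neighbors at distances 4 and 5.
train = [((1.0, 0.0), "blue"), ((2.0, 0.0), "blue"),
         ((-4.0, 0.0), "orange"), ((-5.0, 0.0), "orange")]

print(knn_predict(train, (0.0, 0.0), k=4))                 # unweighted: 2-2 tie,
                                                           # broken arbitrarily
print(knn_predict(train, (0.0, 0.0), k=4, weighted=True))  # "blue": 1.25 vs 0.1025 votes
```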
• Efficient indexing of stored training examples helps – e.g. a kd-tree.
• instance-based learning (memory-based learning)
  – a family of learning algorithms that, instead of performing explicit generalization, compare new problem instances with instances seen in training, which have been stored in memory
  – it is a kind of lazy learning
• lazy learning
  – generalization beyond the training data is delayed until a query is made to the system
  – opposed to eager learning, where the system tries to generalize the training data before receiving queries
• lazy learners – e.g. k-NN

Literature