POSTED ON: 7/10/2014
Chapter 3
Data Mining Concepts: Data Preparation and Model Evaluation
Data Mining 2011 - Volinsky - Columbia University

Data Preparation
• Data in the real world is dirty
  – incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  – noisy: containing errors or outliers
  – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
  – Quality decisions must be based on quality data
  – Data warehouse needs consistent integration of quality data
  – Assessment of quality reflects on confidence in results

Preparing Data for Analysis
• Think about your data
  – how is it measured, what does it mean?
  – nominal or categorical
    • jersey numbers, ids, colors, simple labels
    • sometimes recoded into integers - careful!
  – ordinal
    • rank has meaning - numeric value not necessarily
    • educational attainment, military rank
  – interval
    • distances between numeric values have meaning
    • temperature, time
  – ratio
    • zero value has meaning - means that fractions and ratios are sensible
    • money, age, height
• It might seem obvious what a given data value is, but not always
  – pain index, movie ratings, etc.

Investigate your data carefully!
• Example: lapsed donors to a charity (KDD Cup 1998)
  – Made their last donation to PVA 13 to 24 months prior to June 1997
  – 200,000 (training and test sets)
  – Who should get the current mailing?
  – What is the cost-effective strategy?
  – “tcode” was an important variable…

Tasks in Data Preprocessing
• Data cleaning
  – Check for data quality
  – Missing data
• Data transformation
  – Normalization and aggregation
• Data reduction
  – Obtains reduced representation in volume but produces the same or similar analytical results
• Data discretization
  – Combination of reduction and transformation but with particular importance, especially for numerical data

Data Cleaning / Quality
• Individual measurements
  – Random noise in individual measurements
    • Outliers
    • Random data entry errors
    • Noise in label assignment (e.g., class labels in medical data sets)
    • can be corrected or smoothed out
  – Systematic errors
    • E.g., all ages > 99 recorded as 99
    • More individuals aged 20, 30, 40, etc. than expected
  – Missing information
    • Missing at random
      – Questions on a questionnaire that people randomly forget to fill in
    • Missing systematically
      – Questions that people don’t want to answer
      – Patients who are too ill for a certain test

Missing Data
• Data is not always available
  – E.g., many records have no recorded value for several attributes
    • survey respondents
    • disparate sources of data
• Missing data may be due to
  – equipment malfunction
  – data not entered properly
  – data not available
  – Different versions of data have been merged
  – Try and figure it out!!!

How to Handle Missing Data?
• Ignore the tuple
  – Only feasible for a small % of missing values
• Use a global constant (such as variable mean) to fill in the missing value:
  – “unknown” as a category
  – For continuous data, this will decrease variance significantly
• Use a random value to fill in the missing value
  – Preserves variance, and ‘does no harm’
• Use imputation
  – nearest neighbor
  – model based (regression or Bayesian based)

Missing Data
• What do I choose for a given situation?
• What you do depends on
  – the data - how much is missing? are they ‘important’ values?
  – the model - can it handle missing values?
  – Is the data missing at random?
  – there is no right answer!
  – Always check robustness of results

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values (outliers) may be due to
  – faulty data collection
  – data entry problems
  – technology limitation
  – YOU!
  – Try and figure it out
• Other data problems that require data cleaning
  – duplicate records
  – incomplete data
  – inconsistent data

Data Transformation
• Can help reduce influence of extreme values
• Variance reduction:
  – Often very useful when dealing with skewed data (e.g. incomes)
  – square root, reciprocal, logarithm, raising to a power
  – Logit: maps probabilities in (0, 1) to the real line
• Normalization: scaled to fall within a small, specified range
  – Sometimes we like to have all variables on the same scale
  – min-max normalization
  – Standardization / z-score normalization
• Attribute/feature construction
  – New attributes constructed from the given ones

Dealing with massive data
• What if the data simply does not fit on my computer (or R crashes)?
  – Sample sample sample
    • be careful to do proper randomization and stratification
  – Find a smaller question
    • Use tools to reduce dataset and reframe question
  – Use a database
    • MySQL is a good (and free) one
  – Investigate data reduction strategies
    • Can reduce either n or p

Data Reduction: Dimension Reduction
• In general, incurs loss of information about x
• If dimensionality p is very large (e.g., 1000’s), representing the data in a lower-dimensional space may make learning more reliable
  – e.g., clustering example
    • 100 dimensional data
    • if cluster structure is only present in 2 of the dimensions, the others are just noise
    • if other 98 dimensions are just noise (relative to cluster structure), then clusters will be much easier to discover if we just focus on the 2d space
• Dimension reduction can also provide interpretation/insight
  – e.g., for 2d visualization purposes

Data Reduction: Dimension Reduction
• Feature selection (i.e., attribute subset selection):
  – Use EDA to find useless variables
  – Use exhaustive search on a simple model (e.g. regression)
    • Can be computationally expensive
  – Use heuristic methods like stepwise methods (forward / backward selection)
    • Can get trapped in local minima

Data Reduction (n): Sampling
• Don’t forget about sampling!
• Choose a representative subset of the data
  – Simple random sampling may be ok but beware of skewed variables.
• Stratified sampling methods
  – Approximate the percentage of each class (or subpopulation of interest) in the overall database
  – Used in conjunction with skewed data
  – Propensity scores may be useful if response is unbalanced.
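The stratified sampling idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's own code: the function name `stratified_sample`, the 1% "fraud" data, and the rule of keeping at least one record per stratum are all hypothetical choices made for the example.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Sample roughly `fraction` of the records within each stratum.

    `key(record)` returns the stratum label (hypothetical helper, for illustration).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[key(rec)].append(rec)
    sample = []
    for members in strata.values():
        # Assumption: keep at least one record per stratum so rare classes survive.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 1% "fraud", 99% "ok".
records = [("fraud", i) for i in range(10)] + [("ok", i) for i in range(990)]
sample = stratified_sample(records, key=lambda r: r[0], fraction=0.1)

counts = defaultdict(int)
for label, _ in sample:
    counts[label] += 1
print(dict(counts))
```

A simple random sample of 100 records could easily contain no fraud cases at all; sampling within each class keeps the class percentages close to those of the full data.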
Data Reduction: Projection Methods
• Projections: the shadow of a multidimensional object on a lower dimensional space
• Mathematically: multiplying an n x p data matrix by an orthonormal p x d projection matrix
Alternatively:
[Figure. Courtesy: Cook, Buja, Lee, Wickham]

Projections
[Figure. Courtesy: Cook, Buja, Lee, Wickham]

Data Reduction: Principal Components
• One of several projection methods
• Idea: Find a projection of your data in a lower dimension that maximizes the amount of information retained
• Information = variance
• Works for numeric data only
• Used when the number of dimensions is large

PCA Example
[Figure: axes x1 and x2; direction of 1st principal component vector (highest variance projection)]

PCA Example
[Figure: axes x1 and x2; direction of 1st principal component vector (highest variance projection); direction of 2nd principal component vector]

Principal Components
• Sequentially extracts optimal maximal variance “direction”
• All directions (‘principal components’) are orthogonal
• Note: variables must be standardized!!
[original points in p-dimensional space, n x p] x [projection matrix of orthogonal directions, p x d] = [original points projected into d dimensions, n x d]
Principal components are related to the covariance of the original data
  – Technically: the first PC is the eigenvector for the largest eigenvalue of the covariance matrix of X
  – Highly correlated data reduces nicely
A ‘scree’ plot can help assess how many PCs to use…

Example: Music Data
[Figure panels: Left variables; Scree plot]
What’s wrong with this picture?
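The note that variables must be standardized can be illustrated with a small two-variable sketch. The data below are hypothetical (they are not the music data from the slides), and the helper names `standardize` and `first_pc` are made up for the example. For two variables the first principal component has a closed form, so no linear algebra library is needed.

```python
import math

def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def first_pc(xs, ys):
    """First principal component of two variables.

    Returns (share of total variance explained, unit loading vector).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of the symmetric 2 x 2 covariance matrix.
    lam1 = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    # An eigenvector for lam1 is (sxy, lam1 - sxx); fall back for the uncorrelated case.
    vx, vy = sxy, lam1 - sxx
    if vx == 0 and vy == 0:
        vx, vy = 1.0, 0.0
    norm = math.hypot(vx, vy)
    return lam1 / (sxx + syy), (vx / norm, vy / norm)

# Hypothetical variables on very different scales.
duration_ms = [200000, 250000, 180000, 300000, 220000, 270000]
rating = [0.2, 0.5, 0.1, 0.9, 0.4, 0.6]

raw_share, raw_loadings = first_pc(duration_ms, rating)
std_share, std_loadings = first_pc(standardize(duration_ms), standardize(rating))

print("raw loadings:", raw_loadings)            # dominated by the large-scale variable
print("standardized loadings:", std_loadings)   # both variables contribute equally
```

On the raw data the first component is essentially the variable measured in the largest units, so the scree plot would suggest one component explains everything. After standardization the loadings reflect correlation rather than units.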
Example: Music Data
[Figure panels: Scaled data; Scree plot]
Loadings = coefficients (weights) of variables in projection vector

Data Reduction: Multidimensional Scaling
• Start with an n x p matrix of observations and variables
• Create an n x n matrix of distances (similarities)
  – Feasible when n small(ish)
  – 0’s on the diagonal
  – Symmetric
• Or, you may have a matrix of distances to start with
  – Relationships, networks, etc.
• MDS:
  – finds a representation of these points in a lower-dimensional space (usually 2), where the distances in this space best represent the original distances

            Price   Fuel   FuelTank
  Cadillac  34.7    16     18.0
  Camaro    15.1    19     15.5
  Corsica   11.4    25     15.6
  Civic     12      42     11.9

• Example:
  – Distance between X and Y?

            Cadillac  Camaro  Corsica  Civic
  Cadillac  0         20.0    25.09    38.1
  Camaro    20.0      0       7.05     26.9
  Corsica   25.09     7.05    0        21.84
  Civic     38.1      26.9    21.84    0

Multidimensional Scaling (MDS)
• MDS score function (“stress”)
  [Formula: compares the original dissimilarities with the Euclidean distances in the “embedded” k-dim space]
• Local minimum is found via algorithmic methods
  – (the algorithm is gradient descent)
• Morse code example

MDS: face data
[Figure]

MDS: 2d embedding of face images
Similar faces are close to each other
Sometimes the axes can have an interpretation

Model Evaluation

Evaluating Models: in-sample
How good is (a,b)?
For a given (x,y), the score function S measures how well the model fits:
This is just one of many possible score functions

Evaluating Models: In-Sample
• In-sample: error goes to zero with enough parameters (k): goodness of fit increases with parameters (k)
• High bias: doesn’t fit data well, but generalizable and robust
• High variance: not robust to changes or new data, but low error
Score function should embody the compromise:
  score(model) = Goodness-of-fit - penalty(k)
  e.g. Bayesian Information Criterion

In v. Out
• In-sample evaluation
  – Uses all of the data to fit parameters
  – Focus: how well does my model ‘fit’ the data
  – Penalties to decide on number of parameters
• Out-of-sample evaluation
  – Split data into training and test sets
  – Focus: how well does my model predict things
  – Prediction error is all that matters
• Statistics traditionally looks at in-sample, whereas data mining / machine learning typically uses out-of-sample

Evaluating Models: Out-of-sample
• Fit model on part of data
• Evaluate on out-of-sample data
• If model is overfit, will not perform well on out-of-sample data

Data Partitioning
• Randomly partition data into training and test set
• Training set – data used to train/build the model.
  – Estimate parameters (e.g., for a linear regression), build decision tree, build artificial neural network, etc.
• Test set – a set of examples not used for model induction. The model’s performance is evaluated on unseen data. Aka out-of-sample data.
• Generalization Error: Model error on the test data.
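The random partition into training and test sets described above can be sketched as follows. This is an illustration only: the model (a straight line y = a + b*x), the squared-error score, the one-third test fraction, and the simulated data are all assumptions made for the example, and the function names are hypothetical.

```python
import random

def train_test_split(data, test_fraction=1/3, seed=0):
    """Randomly partition data into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(points):
    """Least-squares estimates (a, b) for the line y = a + b*x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    b = sxy / sxx
    return my - b * mx, b

def mean_squared_error(points, a, b):
    """Average squared error of the line (a, b) on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)

# Simulated data (assumption): y = 2 + 3x plus Gaussian noise.
noise = random.Random(1)
data = [(x, 2 + 3 * x + noise.gauss(0, 1)) for x in range(30)]

train, test = train_test_split(data)
a, b = fit_line(train)                          # parameters estimated on training data only
train_mse = mean_squared_error(train, a, b)     # in-sample error
test_mse = mean_squared_error(test, a, b)       # estimate of generalization error
print(a, b, train_mse, test_mse)
```

The test points play no part in estimating (a, b), so `test_mse` is an out-of-sample measure, while `train_mse` is the in-sample fit.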
[Figure: set of training examples; set of test examples]

Complexity and Generalization
[Figure: score function (e.g., squared error) against complexity, with curves Strain(θ) and Stest(θ) and the optimal model complexity marked]
Complexity = degrees of freedom in the model (e.g., number of variables)

Holding out data
• The holdout method reserves a certain amount for testing and uses the remainder for training
  – Usually: one third for testing, the rest for training
• For “unbalanced” datasets, random samples might not be representative
  – Few or no instances of some classes
• Stratified sample:
  – Make sure that each class is represented with approximately equal proportions in both subsets

Repeated holdout method
• Holdout estimate can be made more reliable by repeating the process with different subsamples
  – In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  – The error rates on the different iterations are averaged to yield an overall error rate
• This is called the repeated holdout method

Cross-validation
• Most popular and effective type of repeated holdout is cross-validation
• Cross-validation avoids overlapping test sets
  – First step: data is split into k subsets of equal size
  – Second step: each subset in turn is used for testing and the remainder for training
• This is called k-fold cross-validation
• Often the subsets are stratified before the cross-validation is performed

Cross-validation example:
[Figure]

More on cross-validation
• Standard data-mining method for evaluation: stratified ten-fold cross-validation
• Why ten?
  Extensive experiments have shown that this is the best choice to get an accurate estimate
• Stratification reduces the estimate’s variance
• Even better: repeated stratified cross-validation
  – E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the sampling variance)
• Error estimate is the mean across all repetitions

Leave-One-Out cross-validation
• Leave-One-Out: a particular form of cross-validation:
  – Set number of folds to number of training instances
  – I.e., for n training instances, build classifier n times
• Makes best use of the data
• Involves no random subsampling
• Computationally expensive, but good performance

Leave-One-Out-CV and stratification
• Disadvantage of Leave-One-Out-CV: stratification is not possible
  – It guarantees a non-stratified sample because there is only one instance in the test set!
• Extreme example: random dataset split equally into two classes
  – Best model predicts majority class
  – 50% accuracy on fresh data
  – Leave-One-Out-CV estimate is 100% error!

Three-way data splits
• One problem with CV is that, since the data is used jointly to fit the model and to estimate error, the error could be biased downward.
• If the goal is a real estimate of error (as opposed to which model is best), you may want a three-way split:
  – Training set: examples used for learning
  – Validation set: used to tune parameters
  – Test set: never used in the model fitting process, used at the end for unbiased estimate of hold out error

Classification Evaluation
• Score for continuous response based on squared error
• What if response is binary or categorical?
  – classification problems
  – e.g., fraud or not, boy or girl, etc.
simple example:

  Inputs                                  Output          Model’s prediction   Correct/incorrect
  Single  No of cards  Age  Income>50K    Good/Bad risk   Good/Bad risk        prediction
  0       1            28   1             1               1                    :)
  1       2            56   0             0               0                    :)
  0       5            61   1             0               1                    :(
  0       1            28   1             1               1                    :)
  …       …            …    …             …               …                    …

Evaluation of Classification
                        actual outcome
                        1     0
  predicted outcome  1  a     b
                     0  c     d
Accuracy = (a+d) / (a+b+c+d)
  – Not always the best choice
    • Assume 1% fraud,
    • model predicts no fraud
    • What is the accuracy?

                              Actual Class
                              Fraud   No Fraud
  Predicted Class  Fraud      0       0
                   No Fraud   10      990

Evaluation of Classification
Other options (a, b, c, d as in the table of predicted vs. actual outcome):
  – recall or sensitivity (how many of those that are really positive did you predict?):
    • a/(a+c)
  – precision (how many of those predicted positive really are?)
    • a/(a+b)
Precision and recall are always in tension
  – Increasing one tends to decrease the other
  – Document retrieval example

Evaluation of Classification
Yet another option:
  – recall or sensitivity (how many of the positives did you get right?):
    • a/(a+c)
  – Specificity (how many of the negatives did you get right?)
    • d/(b+d)
Sensitivity and specificity have the same tension
Different fields use different metrics

Evaluation for a Thresholded Response
• Many classification models output probabilities
• These probabilities get thresholded to make a prediction.
• Classification accuracy depends on the threshold
  – good models give low probabilities to Y=0 and high probabilities to Y=1.
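The metrics above can be computed directly from thresholded probabilities. The sketch below uses the slides' a, b, c, d cell names; the function names, the ">= threshold" convention for predicting 1, and the choice to return None for an undefined ratio are assumptions made for illustration. The example data reproduce the fraud table: 10 fraud cases among 1000, and a model that never predicts fraud.

```python
def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Return (a, b, c, d): a = predicted 1 / actual 1, b = predicted 1 / actual 0,
    c = predicted 0 / actual 1, d = predicted 0 / actual 0."""
    a = b = c = d = 0
    for y, p in zip(actual, predicted_prob):
        pred = 1 if p >= threshold else 0   # assumption: ties go to the positive class
        if pred == 1 and y == 1:
            a += 1
        elif pred == 1 and y == 0:
            b += 1
        elif pred == 0 and y == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d

def metrics(a, b, c, d):
    """Accuracy, sensitivity (recall), specificity and precision from the four cells."""
    def ratio(num, den):
        return num / den if den else None   # undefined when the denominator is 0
    return {
        "accuracy": ratio(a + d, a + b + c + d),
        "sensitivity": ratio(a, a + c),
        "specificity": ratio(d, b + d),
        "precision": ratio(a, a + b),
    }

# Fraud example: 1% fraud, and the model gives every case probability 0.
actual = [1] * 10 + [0] * 990
predicted_prob = [0.0] * 1000

counts = confusion_counts(actual, predicted_prob)
m = metrics(*counts)
print(counts, m)
```

Accuracy comes out at 99% even though not a single fraud case is caught (sensitivity 0), which is why accuracy is not always the best choice.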
[Figure: predicted probabilities for the test data]

Suppose we use a cutoff of 0.5…
                        actual outcome
                        1     0
  predicted outcome  1  8     3
                     0  0     9
sensitivity: 8 / (8+0) = 100%
specificity: 9 / (9+3) = 75%
we want both of these to be high

Suppose we use a cutoff of 0.8…
                        actual outcome
                        1     0
  predicted outcome  1  6     2
                     0  2     10
sensitivity: 6 / (6+2) = 75%
specificity: 10 / (10+2) = 83%

• Note there are 20 possible thresholds
• Plotting all values of sensitivity vs. specificity gives a sense of model performance by seeing the tradeoff with different thresholds
• Note if threshold = minimum, c=d=0 so sens=1; spec=0
• If threshold = maximum, a=b=0 so sens=0; spec=1
• If model is perfect, sens=1; spec=1
                        actual outcome
                        1     0
  predicted outcome  1  a     b
                     0  c     d

ROC curve plots sensitivity vs. (1-specificity)
  – (1-specificity) is also known as the false positive rate
Always goes from (0,0) to (1,1)
The more area in the upper left, the better
Random model is on the diagonal
“Area under the curve” (AUC) is a common measure of predictive performance

Another Look at ROC Curves
[Figure: distributions of the test result for patients without the disease and patients with the disease]

Threshold
[Figure: test result split at a threshold; call these patients “negative” / call these patients “positive”]

Some definitions ...
[Figures: test result distributions without the disease and with the disease, split at a threshold into call these patients “negative” / call these patients “positive”; successive panels highlight:]
• True Positives
• False Positives
• True negatives
• False negatives

Moving the Threshold: right
[Figure: ‘‘-’’ and ‘‘+’’ regions of the test result; without the disease, with the disease]

Moving the Threshold: left
[Figure: ‘‘-’’ and ‘‘+’’ regions of the test result; without the disease, with the disease]

ROC curve
[Figure: True Positive Rate (sensitivity) against False Positive Rate (1-specificity), both axes from 0% to 100%]

Comparing Models
• Highest AUC wins
• But pay attention to ‘Occam’s Razor’
  – ‘the best theory is the smallest one that describes all the facts’
  – Also known as the ‘parsimony principle’
  – If two models are similar, pick the simpler one

Incorporating cost functions
• Not all errors are the same:
  – Loan payments
    • A bad loan costs us much more than a lost customer
  – Medical tests
    • Cost of false alarm vs. missed diagnosis
  – Spam
    • Cost of spam getting through vs. filtering important mail
• Building algorithm to minimize cost is the same as adding weight to false neg and false pos
                        actual outcome
                        1        0
  predicted outcome  1  0        C(FP)
                     0  C(FN)    0
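The threshold sweep behind the ROC curve and the cost matrix above can be combined in one short sketch. Everything here is illustrative: the scores and labels are made up, the function names are hypothetical, AUC is computed with the trapezoid rule, and the cost of a threshold is taken to be C(FP) times the false positives plus C(FN) times the false negatives. The code assumes the data contain at least one positive and one negative case.

```python
def sweep_thresholds(actual, scores):
    """One (threshold, a, b, c, d) row per candidate threshold.

    a = true positives, b = false positives, c = false negatives, d = true negatives.
    """
    # The extra infinite threshold yields the "predict nothing positive" corner.
    thresholds = sorted(set(scores)) + [float("inf")]
    rows = []
    for t in thresholds:
        a = sum(1 for y, s in zip(actual, scores) if s >= t and y == 1)
        b = sum(1 for y, s in zip(actual, scores) if s >= t and y == 0)
        c = sum(1 for y, s in zip(actual, scores) if s < t and y == 1)
        d = sum(1 for y, s in zip(actual, scores) if s < t and y == 0)
        rows.append((t, a, b, c, d))
    return rows

def roc_points(rows):
    """(false positive rate, sensitivity) pairs, ordered from (0,0) to (1,1)."""
    return sorted((b / (b + d), a / (a + c)) for _, a, b, c, d in rows)

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def cheapest_threshold(rows, cost_fp, cost_fn):
    """Row whose total cost cost_fp * b + cost_fn * c is smallest."""
    return min(rows, key=lambda r: cost_fp * r[2] + cost_fn * r[3])

# Hypothetical test data: actual outcomes and predicted probabilities.
actual = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

rows = sweep_thresholds(actual, scores)
points = roc_points(rows)
area = auc(points)
fn_expensive = cheapest_threshold(rows, cost_fp=1, cost_fn=5)
fp_expensive = cheapest_threshold(rows, cost_fp=5, cost_fn=1)
print(area, fn_expensive[0], fp_expensive[0])
```

When a missed positive is the expensive error the cheapest threshold moves down (more cases called positive); when a false alarm is the expensive error it moves up. The AUC does not change, since it summarizes all thresholds at once.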