Chapter 3
Data Mining Concepts:
Data Preparation and Model Evaluation

Data Mining 2011 - Volinsky - Columbia University




                   Data Preparation

• Data in the real world is dirty
   – incomplete: lacking attribute values, lacking certain attributes
     of interest, or containing only aggregate data
   – noisy: containing errors or outliers
   – inconsistent: containing discrepancies in codes or names
• No quality data, no quality mining results!
   – Quality decisions must be based on quality data
   – Data warehouse needs consistent integration of quality data
   – Assessment of quality reflects on confidence in results


                 Preparing Data for Analysis
• Think about your data
    – how is it measured, what does it mean?
    – nominal or categorical
         • jersey numbers, ids, colors, simple labels
         • sometimes recoded into integers - careful!
    – ordinal
         • rank has meaning - numeric value not necessarily
         • educational attainment, military rank
    – interval (often integer valued)
         • distances between numeric values have meaning
         • temperature, time
    – ratio
         • zero value has meaning - means that fractions and ratios are sensible
         • money, age, height

• It might seem obvious what a given data value means, but not always
    – pain index, movie ratings, etc.



        Investigate your data carefully!

• Example: lapsed donors to a charity (KDD Cup 1998)
  – Made their last donation to PVA 13 to 24
    months prior to June 1997
  – 200,000 records (training and test sets)
  – Who should get the current mailing?
  – What is the cost-effective strategy?
  – “tcode” was an important variable…


                   Tasks in Data Preprocessing

• Data cleaning
   – Check for data quality
   – Missing data
• Data transformation
   – Normalization and aggregation
• Data reduction
   – Obtains a representation reduced in volume that produces the same or
     similar analytical results
• Data discretization
   – Part reduction, part transformation; particularly important for
     numerical data

                     Data Cleaning / Quality
• Individual measurements
   – Random noise in individual measurements
       •   Outliers
       •   Random data entry errors
       •   Noise in label assignment (e.g., class labels in medical data sets)
       •   can be corrected or smoothed out

   – Systematic errors
       • E.g., all ages > 99 recorded as 99
        • More individuals aged 20, 30, 40, etc. than expected (“heaping”)

   – Missing information
       • Missing at random
             – Questions on a questionnaire that people randomly forget to fill in
       • Missing systematically
             – Questions that people don’t want to answer
             – Patients who are too ill for a certain test
                     Missing Data
• Data is not always available
   – E.g., many records have no recorded value for several attributes:
       • survey respondents
       • disparate sources of data

• Missing data may be due to
   – equipment malfunction
   – data not entered properly
   – data not available
   – Different versions of data have been merged
   – Try and figure it out!!!



                    How to Handle Missing Data?

• Ignore the tuple
    – Only feasible for a small % of missing values

• Use a global constant (such as the variable mean) to fill in the missing value:
    – “unknown” as a category
    – For continuous data, this will decrease variance significantly

• Use a random value to fill in the missing value
    – Preserves variance, and ‘does no harm’

• Use imputation
    – nearest neighbor
    – model based (regression or Bayesian based)




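A minimal sketch in R of the fill-in strategies above, on a made-up vector (the data and values are toy assumptions, not from the slides):

    x <- c(2.1, NA, 3.5, 4.0, NA, 2.8)    # toy data with missing entries
    miss <- is.na(x)

    # global constant (the variable mean): simple, but shrinks the variance
    x_mean <- x
    x_mean[miss] <- mean(x, na.rm = TRUE)

    # random fill from the observed values: preserves the variance
    x_rand <- x
    x_rand[miss] <- sample(x[!miss], sum(miss), replace = TRUE)

    # imputation would instead predict each missing entry from the other
    # variables (nearest neighbor, regression, Bayesian models, ...)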
                           Missing Data

• What do I choose for a given situation?


• What you do depends on
   – the data - how much is missing? are they ‘important’ values?
   – the model - can it handle missing values?
   – Is the data missing at random?
   – there is no right answer!
   – Always check robustness of results



                     Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values (outliers) may be due to
   –   faulty data collection
   –   data entry problems
   –   technology limitation
   –   YOU!
   –   Try and figure it out


• Other data problems which require data cleaning
   – duplicate records
   – incomplete data
   – inconsistent data



                  Data Transformation

• Can help reduce the influence of extreme values
• Variance reduction:
   – Often very useful when dealing with skewed data (e.g., incomes)
   – square root, reciprocal, logarithm, raising to a power
   – Logit: maps probabilities in (0, 1) onto the real line


• Normalization: scaled to fall within a small, specified range
   – Sometimes we like to have all variables on the same scale
   – min-max normalization
   – Standardization / z-score normalization


• Attribute/feature construction
   – New attributes constructed from the given ones
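As a rough sketch in R of the transformations above (the income vector is a made-up example):

    inc <- c(12, 18, 25, 40, 950) * 1000       # skewed toy "incomes"

    log_inc <- log(inc)                        # tames the skew
    minmax  <- (inc - min(inc)) / (max(inc) - min(inc))   # min-max: to [0, 1]
    z       <- (inc - mean(inc)) / sd(inc)     # z-score standardization

    logit <- function(p) log(p / (1 - p))      # probabilities -> real line
    logit(c(0.1, 0.5, 0.9))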
             Dealing with massive data

• What if the data simply does not fit on my
  computer (or R crashes)?

   – Sample sample sample
      • be careful to do proper randomization and stratification
   – Find a smaller question
      • Use tools to reduce dataset and reframe question
   – Use a database
       • MySQL is a good (and free) one
   – Investigate data reduction strategies
      • Can reduce either n or p

         Data Reduction: Dimension Reduction

• In general, incurs loss of information about x

• If dimensionality p is very large (e.g., 1000’s), representing the data in a lower-
  dimensional space may make learning more reliable,
   – e.g., clustering example
        • 100 dimensional data
        • if cluster structure is only present in 2 of the dimensions, the others
          are just noise
        • if other 98 dimensions are just noise (relative to cluster structure),
          then clusters will be much easier to discover if we just focus on the 2d
          space

• Dimension reduction can also provide interpretation/insight
   – e.g., for 2-d visualization purposes



       Data Reduction: Dimension Reduction
• Feature selection (i.e., attribute subset selection):
    – Use EDA to find useless variables
    – Use exhaustive search on a simple model (e.g. regression)
        • Can be computationally expensive
    – Use heuristic methods like stepwise methods (forward / backward selection)
       • Can get trapped in local minima




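A small illustration in R of heuristic (stepwise) selection on a linear model, using the built-in mtcars data; this is one possible instantiation, not the course's own example:

    full <- lm(mpg ~ ., data = mtcars)             # start with all features
    reduced <- step(full, direction = "backward",
                    trace = 0)                     # greedy AIC-guided search
    formula(reduced)                               # the surviving subset

Being greedy, step() can stop at a local minimum, which is exactly the caveat above.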
        Data Reduction (n): Sampling
• Don’t forget about sampling!
• Choose a representative subset of the data
   – Simple random sampling may be ok but beware of skewed
     variables.
• Stratified sampling methods
   – Approximate the percentage of each class (or
     subpopulation of interest) in the overall database
   – Used in conjunction with skewed data
   – Propensity scores may be useful if response is
     unbalanced.


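A base-R sketch of simple vs. stratified row sampling; df and its class column are hypothetical stand-ins for your data:

    set.seed(1)
    idx_srs <- sample(nrow(df), 1000)    # simple random sample of 1000 rows

    # stratified: sample 10% within each class, preserving class proportions
    # (assumes every class has more than a handful of rows)
    by_class  <- split(seq_len(nrow(df)), df$class)
    idx_strat <- unlist(lapply(by_class, function(i)
                        sample(i, ceiling(0.1 * length(i)))))
    df_small  <- df[idx_strat, ]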
                   Data Reduction: Projection Methods

• Projections: the shadow of a multidimensional object on a
  lower-dimensional space
• Mathematically: multiplying an n × p data matrix X by an
  orthonormal p × d projection matrix P, i.e. Y = XP

  [slide images: the matrix product, and an alternative formulation]

Courtesy: Cook, Buja, Lee, Wickham
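A minimal sketch of such a projection in R, using random orthonormal directions from a QR decomposition (the matrix X here is simulated stand-in data):

    n <- 100; p <- 10; d <- 2
    X <- matrix(rnorm(n * p), n, p)            # n x p data matrix
    P <- qr.Q(qr(matrix(rnorm(p * d), p, d)))  # orthonormal p x d projection
    Y <- X %*% P                               # the n x d "shadow" of X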
Projections

[slide images: example two-dimensional projections of higher-dimensional data]

Courtesy: Cook, Buja, Lee, Wickham
     Data Reduction: Principal Components

• One of several projection methods
• Idea: Find a projection of your data into a lower dimension
  that maximizes the amount of information retained
• Information = variance


• Works for numeric data only
• Used when the number of dimensions is large




PCA Example

[plot: scatter of data in (x1, x2); an arrow marks the direction of the 1st
principal component vector (the highest-variance projection)]
PCA Example

[same plot, adding the direction of the 2nd principal component vector,
orthogonal to the 1st]
Principal Components
• Sequentially extracts the maximal-variance “direction” at each step
• All directions (‘principal components’) are orthogonal
• Note: variables must be standardized!!

  [diagram: (original points in p-dimensional space, n × p) × (projection
  matrix of orthogonal directions, p × d) = (original points projected
  into d dimensions, n × d)]

• Principal components are related to the covariance of the original data
   – Technically: the first PC is the eigenvector for the largest eigenvalue
     of the covariance of X
   – Highly correlated data reduces nicely
• A ‘scree’ plot can help assess how many PCs to use…
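In R, prcomp() does all of this; a sketch on the built-in USArrests data (an assumed example, not the course's music data):

    pc <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardizes first
    summary(pc)                      # proportion of variance per component
    screeplot(pc, type = "lines")    # scree plot: how many PCs to keep?
    pc$rotation                      # loadings: variable weights per direction
    head(pc$x[, 1:2])                # data projected onto the first two PCs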
Example: Music Data

[output: PCA of the Left variables; scree plot]

What’s wrong with this picture?
Example: Music Data

[output: PCA of the scaled data; scree plot]

Loadings = coefficients (weights) of the variables in the projection vector
     Data Reduction: Multidimensional Scaling

• Start with an n x p matrix of observations and variables
• Create an n x n matrix of distances (similarities)
   – Feasible when n small(ish)
   – 0’s on the diagonal
   – Symmetric
• Or, you may have a matrix of distances to start with
   – Relationships, networks, etc.

• MDS:
   – finds a representation of the points in a lower-dimensional space (usually
     2-d) where the distances in that space best represent the original distances
               Price    Fuel    FuelTank
  Cadillac     34.7     16      18.0
  Camaro       15.1     19      15.5
  Corsica      11.4     25      15.6
  Civic        12.0     42      11.9

• Example:
   – Euclidean distance between each pair of cars:

               Cadillac    Camaro    Corsica    Civic
  Cadillac     0           20.0      25.1       38.1
  Camaro       20.0        0         7.1        26.9
  Corsica      25.1        7.1       0          21.8
  Civic        38.1        26.9      21.8       0
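A sketch of this computation in R; dist() on just the three columns shown reproduces most of the matrix (e.g., Cadillac–Camaro ≈ 20.0), though the slide's full matrix may have used additional variables:

    cars <- data.frame(Price    = c(34.7, 15.1, 11.4, 12),
                       Fuel     = c(16, 19, 25, 42),
                       FuelTank = c(18.0, 15.5, 15.6, 11.9),
                       row.names = c("Cadillac", "Camaro", "Corsica", "Civic"))
    d <- dist(cars)            # pairwise Euclidean distances between rows
    round(as.matrix(d), 1)     # the n x n distance matrix
    fit <- cmdscale(d, k = 2)  # classical MDS embedding in 2 dimensions
    plot(fit, type = "n"); text(fit, labels = rownames(cars))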
            Multidimensional Scaling (MDS)

• MDS score function (“stress”):

      Stress = Σ_{i<j} ( δ_ij − d_ij )²

  where δ_ij are the original dissimilarities and d_ij is the Euclidean
  distance between points i and j in the “embedded” k-dimensional space

• A local minimum is found via algorithmic methods
   – (e.g., gradient descent)


• Morse code example

MDS: face data

[image: the set of face photographs used as input]
           MDS: 2d embedding of face images




Similar faces are close
     to each other

 Sometimes the axes
    can have an
   interpretation




Model Evaluation




              Evaluating Models: in-sample




[plot: scatter of (x, y) data with a fitted line y = a·x + b]

How good is (a, b)?
For a given model, the score function S measures how well the model fits,
e.g. squared error:

      S(a, b) = Σ_i [ y_i − (a·x_i + b) ]²

This is just one of many possible score functions
               Evaluating Models: In-Sample
• In-sample: error goes to zero with enough parameters (k)
   – goodness of fit increases with the number of parameters (k)
• High bias: doesn’t fit the data well, but generalizable and robust
• High variance: low error, but not robust to changes or new data
• The score function should embody the compromise:

        score(model) = goodness-of-fit − penalty(k)

   – e.g., the Bayesian Information Criterion (BIC; sketch below)
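A tiny R illustration of penalized comparison with BIC, on the built-in mtcars data (an assumed example, not from the slides):

    small <- lm(mpg ~ wt, data = mtcars)   # one predictor
    big   <- lm(mpg ~ .,  data = mtcars)   # all ten predictors
    BIC(small); BIC(big)                   # lower is better: the log(n)·k
                                           # penalty punishes extra terms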
                               In v. Out

• In-sample evaluation
   – Uses all of the data to fit parameters
   – Focus: how well does my model ‘fit’ the data
   – Penalties to decide on number of parameters

• Out-of-sample evaluation
   – Split data into training and test sets
   – Focus: how well does my model predict things
   – Prediction error is all that matters


• Statistics traditionally looks at in-sample evaluation, whereas
  data mining / machine learning typically uses out-of-sample
      Evaluating Models: Out-of-sample

• Fit model on part of data
• Evaluate on out-of-sample
• If model is overfit, will not perform well on
  out-of-sample data




                           Data Partitioning


• Randomly partition data into training and test set
• Training set – data used to train/build the model.
    – Estimate parameters (e.g., for a linear regression), build a decision tree,
      build an artificial neural network, etc.
• Test set – a set of examples not used for model induction. The model’s
  performance is evaluated on unseen data. Aka out-of-sample data.
• Generalization Error: Model error on the test data.




[diagram: the data divided into a set of training examples and a set of
test examples]
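In R, a holdout split is a couple of lines; df with response y is a hypothetical data frame:

    set.seed(1)
    n <- nrow(df)
    train <- sample(n, round(2/3 * n))       # two thirds for training
    fit   <- lm(y ~ ., data = df[train, ])   # fit on the training set only
    pred  <- predict(fit, newdata = df[-train, ])
    mean((df$y[-train] - pred)^2)            # generalization error (test MSE)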
Complexity and Generalization

[plot: score function (e.g., squared error) vs. model complexity (degrees of
freedom in the model, e.g., number of variables); S_train(θ) keeps decreasing
with complexity, while S_test(θ) is U-shaped, and the optimal model complexity
sits at the minimum of S_test(θ)]
                            Holding out data

     • The holdout method reserves a certain amount for
       testing and uses the remainder for training
        – Usually: one third for testing, the rest for training
     • For “unbalanced” datasets, random samples might
       not be representative
        – Few or none instances of some classes
     • Stratified sample:
        – Make sure that each class is represented with
          approximately equal proportions in both subsets



                 Repeated holdout method

     • Holdout estimate can be made more reliable
       by repeating the process with different
       subsamples
        – In each iteration, a certain proportion is
          randomly selected for training (possibly with
          stratification)
        – The error rates on the different iterations are
          averaged to yield an overall error rate
     • This is called the repeated holdout method

                         Cross-validation

     • Most popular and effective type of repeated holdout
       is cross-validation
     • Cross-validation avoids overlapping test sets
        – First step: data is split into k subsets of equal size
        – Second step: each subset in turn is used for testing and
          the remainder for training
     • This is called k-fold cross-validation
     • Often the subsets are stratified before the cross-
       validation is performed



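A hand-rolled k-fold cross-validation sketch in R (df with response y is again a hypothetical data frame; a linear model stands in for whatever is being fit):

    k <- 10
    set.seed(1)
    folds <- sample(rep(1:k, length.out = nrow(df)))  # random fold labels
    cv_err <- sapply(1:k, function(f) {
      fit <- lm(y ~ ., data = df[folds != f, ])       # train on k-1 folds
      mean((df$y[folds == f] -
            predict(fit, df[folds == f, ]))^2)        # test on fold f
    })
    mean(cv_err)    # cross-validated estimate of prediction error

Setting k <- nrow(df) turns this into Leave-One-Out.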
Cross-validation example:

[diagram: the data split into k folds; each fold in turn is held out for
testing while the remaining folds are used for training]
                   More on cross-validation

     • Standard data-mining method for evaluation:
       stratified ten-fold cross-validation
     • Why ten? Extensive experiments have shown that
       this is the best choice to get an accurate estimate
     • Stratification reduces the estimate’s variance
     • Even better: repeated stratified cross-validation
        – E.g. ten-fold cross-validation is repeated ten times and
          results are averaged (reduces the sampling variance)
     • Error estimate is the mean across all
       repetitions
               Leave-One-Out cross-validation

     •   Leave-One-Out:
         a particular form of cross-validation:
          –   Set number of folds to number of training instances
          –   I.e., for n training instances, build classifier n times
     •   Makes best use of the data
     •   Involves no random subsampling
     •   Computationally expensive, but good performance




             Leave-One-Out-CV and stratification

     •   Disadvantage of Leave-One-Out-CV: stratification is not
         possible
         –    It guarantees a non-stratified sample because there is only one
              instance in the test set!
     •   Extreme example: random dataset split equally into two
         classes
         –    Best model predicts majority class
         –    50% accuracy on fresh data
         –    Leave-One-Out-CV estimate is 100% error!




                 Three way data splits

• One problem with CV: because the data are used jointly to fit the
  model and to estimate its error, the error estimate can be biased
  downward.
• If the goal is a real estimate of error (as opposed to
  which model is best), you may want a three way
  split:
   – Training set: examples used for learning
   – Validation set: used to tune parameters
   – Test set: never used in the model fitting process, used at
     the end for unbiased estimate of hold out error


               Classification Evaluation
• Score for continuous response based on squared error
• What if response is binary or categorical?
     – classification problems
     – e.g., fraud or not, boy or girl, etc.
simple example:

  Inputs                                    Output       Model’s      Correct/
  Single   No. of   Age   Income>50K       (Good/Bad    prediction   incorrect
           cards                            risk)       (Good/Bad)   prediction
  0        1        28    1                 1            1            :)
  1        2        56    0                 0            0            :)
  0        5        61    1                 0            1            :(
  0        1        28    1                 1            1            :)
  …        …        …     …                 …            …            …
                     Evaluation of Classification
                 actual outcome
                   1        0
  predicted   1    a        b
  outcome     0    c        d

• Accuracy = (a + d) / (a + b + c + d)
   – Not always the best choice
       • Assume 1% fraud and a model that predicts “no fraud” for everyone:

                               Actual Class
                               Fraud     No Fraud
    Predicted    Fraud         0         0
    Class        No Fraud      10        990

       • What is the accuracy? (990/1000 = 99%, despite catching no fraud)
              Evaluation of Classification

Other options:
   – recall or sensitivity (how many of those that are really positive did
     you predict?):
       • a / (a + c)
   – precision (how many of those predicted positive really are?):
       • a / (a + b)

• Precision and recall are always in tension
   – Increasing one tends to decrease the other
   – Document retrieval example

                 actual outcome
                   1        0
  predicted   1    a        b
  outcome     0    c        d
              Evaluation of Classification

Yet another option:
   – recall or sensitivity (how many of the positives did you get right?):
       • a / (a + c)
   – specificity (how many of the negatives did you get right?):
       • d / (b + d)

• Sensitivity and specificity have the same tension
• Different fields use different metrics

                 actual outcome
                   1        0
  predicted   1    a        b
  outcome     0    c        d
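These quantities fall straight out of a confusion matrix in R; pred and truth are hypothetical 0/1 vectors of predictions and labels:

    tab <- table(predicted = pred, actual = truth)  # rows: predicted; cols: actual

    accuracy    <- sum(diag(tab)) / sum(tab)        # (a + d) / (a + b + c + d)
    sensitivity <- tab["1", "1"] / sum(tab[, "1"])  # recall: a / (a + c)
    precision   <- tab["1", "1"] / sum(tab["1", ])  # a / (a + b)
    specificity <- tab["0", "0"] / sum(tab[, "0"])  # d / (b + d)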
Evaluation for a Thresholded Response

                               • Many classification models
                                 output probabilities
                               • These probabilities get
                                 thresholded to make a
                                 prediction.
                               • Classification accuracy
                                 depends on the threshold –
                                 good models give low
                                 probabilities to Y=0 and high
                                 probabilities to Y=1.



                           predicted probabilities


[image: predicted probabilities for the 20 test cases, by actual outcome]

Suppose we use a cutoff of 0.5…

                 actual outcome
                   1        0
  predicted   1    8        3
  outcome     0    0        9

  (Test Data)
Suppose we use a cutoff of 0.5…

                 actual outcome
                   1        0
  predicted   1    8        3
  outcome     0    0        9

  sensitivity: 8 / (8 + 0) = 100%
  specificity: 9 / (9 + 3) = 75%

  we want both of these to be high
Suppose we use a cutoff of 0.8…

                 actual outcome
                   1        0
  predicted   1    6        2
  outcome     0    2        10

  sensitivity: 6 / (6 + 2) = 75%
  specificity: 10 / (10 + 2) = 83%
• Note there are 20 possible thresholds (one per test case)
• Plotting all values of sensitivity vs. specificity shows the tradeoff
  across thresholds and gives a sense of model performance

• If threshold = minimum: c = d = 0, so sens = 1, spec = 0
• If threshold = maximum: a = b = 0, so sens = 0, spec = 1
• If the model is perfect: sens = 1 and spec = 1

                 actual outcome
                   1        0
  predicted   1    a        b
  outcome     0    c        d
• ROC curve plots sensitivity vs. (1 − specificity), also known as the
  false positive rate
• Always goes from (0,0) to (1,1)
• The more area in the upper left, the better
• A random model lies on the diagonal
• “Area under the curve” (AUC) is a common measure of predictive
  performance

[plot: example ROC curve]
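A bare-hands ROC/AUC sketch in R; p is a vector of predicted probabilities and y the true 0/1 labels (both hypothetical):

    th   <- sort(unique(c(0, p, 1)))                      # candidate cutoffs
    sens <- sapply(th, function(t) mean(p[y == 1] >= t))  # true positive rate
    fpr  <- sapply(th, function(t) mean(p[y == 0] >= t))  # 1 - specificity
    plot(fpr, sens, type = "l",
         xlab = "False positive rate", ylab = "Sensitivity")
    abline(0, 1, lty = 2)                  # the random-model diagonal

    # AUC via the rank (Mann-Whitney) identity
    r <- rank(p); n1 <- sum(y == 1); n0 <- sum(y == 0)
    (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)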
Another Look at ROC Curves

[plot: two overlapping distributions of the test result: patients without
the disease (left curve) and patients with the disease (right curve)]

Threshold

[figure sequence: a cutoff on the test result calls patients below it
“negative” and above it “positive”. The portions of the two distributions
falling on each side of the cutoff give the true positives, false positives,
true negatives, and false negatives. Moving the threshold right trades
false positives for false negatives; moving it left does the reverse.]
ROC curve

[plot: true positive rate (sensitivity) on the vertical axis vs. false
positive rate (1 − specificity) on the horizontal axis, both 0% to 100%]
                           Comparing Models


• Highest AUC wins
• But pay attention to
  ‘Occam’s Razor’
   – ‘the best theory is the
     smallest one that describes
     all the facts’
   – Also known as the
     ‘parsimony principle’
   – If two models are similar,
     pick the simpler one
             Incorporating cost functions

• Not all errors are the same:
   – Loan payments
       • A bad loan costs us much more than a lost customer
   – Medical tests
       • Cost of a false alarm vs. a missed diagnosis
   – Spam
       • Cost of spam getting through vs. filtering important mail

• Building an algorithm to minimize cost is the same as attaching
  weights to false negatives and false positives:

                 actual outcome
                   1        0
  predicted   1    0        C(FP)
  outcome     0    C(FN)    0
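A sketch of cost-weighted evaluation in R; the cost values and the pred/truth vectors are made-up placeholders:

    c_fp <- 1; c_fn <- 10    # assume a false negative costs 10x a false alarm
    total_cost <- c_fp * sum(pred == 1 & truth == 0) +
                  c_fn * sum(pred == 0 & truth == 1)

Choosing the threshold (or classifier) that minimizes this cost then replaces maximizing raw accuracy.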

				