Data Mining With Decision Trees


               Craig A. Struble, Ph.D.
               Marquette University
    Overview

       Decision Trees
       Rules and Language Bias
       Constructing Decision Trees
       Some Analyses
       Heuristics
       Quality Assessment
       Extensions

                   MSCS282 - Data Mining With
2                      Decision Trees
    Goals

       Explore the complete data mining process
       Understand decision trees as a model
       Understand how to construct a decision tree
       Recognize the language bias, search bias, and
        overfitting avoidance bias for decision trees
       Be able to assess the performance of decision
        trees

                   MSCS282 - Data Mining With
3                      Decision Trees
    Decision Trees

       A graph (tree) based model used primarily for
        classification
       Extensively studied
        –   Quinlan is the primary contributor to the field
       Applications are wide ranging
        –   Data mining
        –   Aircraft flying
        –   Medical diagnosis
        –   Etc.

                          MSCS282 - Data Mining With
4                             Decision Trees
    Decision Trees




             MSCS282 - Data Mining With
5                Decision Trees
    What kind of data?

       Initially, we will restrict the data to having only
        nominal values
        –   We’ll explore numeric/continuous values later
       Number of attributes doesn’t matter
        –   Beware of the “curse of dimensionality” though
        –   We’ll see this later



                      MSCS282 - Data Mining With
6                         Decision Trees
    Classification Rules

       It is relatively straightforward to convert a
        decision tree into a set of rules for classification
             If tear production rate = reduced then recommendation = none.
             If age = young and astigmatic = no and tear production rate = normal
                then recommendation = soft
             If age = pre-presbyopic and astigmatic = no and tear production
                rate = normal then recommendation = soft
             If age = presbyopic and spectacle prescription = myope and
                astigmatic = no then recommendation = none
             If spectacle prescription = hypermetrope and astigmatic = no and
                tear production rate = normal then recommendation = soft
             If spectacle prescription = myope and astigmatic = yes and
                tear production rate = normal then recommendation = hard
             If age = young and astigmatic = yes and tear production rate = normal
                then recommendation = hard
             If age = pre-presbyopic and spectacle prescription = hypermetrope
                and astigmatic = yes then recommendation = none
             If age = presbyopic and spectacle prescription = hypermetrope
                and astigmatic = yes then recommendation = none

                         MSCS282 - Data Mining With
7                            Decision Trees
    Language Bias

       Decision trees are restricted to functions that can be
        represented by rules of the form
        if X and Y then A
        if X and W and V then B
        if Y and V then A
       That is, decision trees represent collections of
        implications
       The rules can be combined with or
        if Y and (X or V) then A

                       MSCS282 - Data Mining With
8                          Decision Trees
    Language Bias

       Examples of functions not well represented by
        decision trees
        –   Parity: output is true if an even number of attributes
            are true
        –   Majority: output is true if more than half of the
            attributes are true




                       MSCS282 - Data Mining With
9                          Decision Trees
     Propositional Logic

        Essentially, decision trees can represent any function
         in propositional logic
         –   A, B, C: propositional variables
         –   and, or, not, => (implies), <=> (equivalent): connectives
        A proposition is a statement that is either true or false
          Example: “The sky is blue.” corresponds to color of sky = blue
        Hence, decision trees are an example of a
         propositional learner.


                            MSCS282 - Data Mining With
10                              Decision Trees
     Constructing Decision Trees
     Example   Alt   Bar   Fri     Hun   Pat     Price   Rain   Res   Type      Est     Wait?
     1         Yes   No    No      Yes   Some    $$$     No     Yes   French    0-10    Yes
     2         Yes   No    No      Yes   Full    $       No     No    Thai      30-60   No
     3         No    Yes   No      No    Some    $       No     No    Burger    0-10    Yes
     4         Yes   No    Yes     Yes   Full    $       No     No    Thai      10-30   Yes
     5         Yes   No    Yes     No    Full    $$$     No     Yes   French    >60     No
     6         No    Yes   No      Yes   Some    $$      Yes    Yes   Italian   0-10    Yes
     7         No    Yes   No      No    None    $       Yes    No    Burger    0-10    No
     8         No    No    No      Yes   Some    $$      Yes    Yes   Thai      0-10    Yes
     9         No    Yes   Yes     No    Full    $       Yes    No    Burger    >60     No
     10        Yes   Yes   Yes     Yes   Full    $$$     No     Yes   Italian   10-30   No
     11        No    No    No      No    None    $       No     No    Thai      0-10    No
     12        Yes   Yes   Yes     Yes   Full    $       No     No    Burger    30-60   Yes
                                 MSCS282 - Data Mining With
11                                   Decision Trees
     Select an Attribute

        [Tree diagram: the attribute Alt is chosen as the root test]
               MSCS282 - Data Mining With
12                 Decision Trees
     Partition The Data

        [Tree diagram: Alt at the root — the Yes branch holds instances {1,2,4,5,10,12}, the No branch holds {3,6,7,8,9,11}]
                      MSCS282 - Data Mining With
13                        Decision Trees
     Select Next Attribute

        [Tree diagram: Alt at the root; on the Alt = Yes branch ({1,2,4,5,10,12}) the attribute Res is selected next — Res = Yes holds {1,5,10}, Res = No holds {2,4,12}]
                         MSCS282 - Data Mining With
14                           Decision Trees
     Continue Selecting Attributes

        [Tree diagram: under Alt = Yes and Res = Yes ({1,5,10}), the attribute Fri is tested next — Fri = Yes holds {5,10} and becomes a No leaf, Fri = No holds {1} and becomes a Yes leaf]

        This process continues along a subtree until all instances have the same label.
                                MSCS282 - Data Mining With
15                                  Decision Trees
     Basic Algorithm
     algorithm LearnDecisionTree(examples, attributes, default) returns a decision tree
         inputs: examples,   a set of examples
                 attributes, a set of attributes
                 default,    the default value for the goal attribute

         if examples is empty then return default
         else if all examples have the same value for the goal attribute then return that value
         else
             best = ChooseAttribute(attributes, examples)
             tree = a new decision tree with root test best
             for each value vi of best do
                 examplesi = {elements of examples with best = vi}
                 subtree = LearnDecisionTree(examplesi, attributes - best, MajorityValue(examples))
                 add a branch to tree with label vi and subtree subtree
             return tree
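
     As a concrete illustration of the pseudocode, here is a minimal Python sketch. The dictionary-based
     tree representation, the goal keyword argument, the extra "no attributes left" guard, and the
     placeholder choose_attribute (an arbitrary choice until the heuristics discussed later are plugged
     in) are my own assumptions, not part of the slides.

         from collections import Counter

         def majority_value(examples, goal):
             """Most common value of the goal attribute among the examples."""
             return Counter(e[goal] for e in examples).most_common(1)[0][0]

         def choose_attribute(attributes, examples, goal):
             """Placeholder for ChooseAttribute: arbitrary choice (a heuristic version comes later)."""
             return next(iter(attributes))

         def learn_decision_tree(examples, attributes, default, goal="Wait?"):
             """examples: list of dicts mapping attribute names to values; attributes: a set of names."""
             if not examples:
                 return default
             goal_values = {e[goal] for e in examples}
             if len(goal_values) == 1:                  # all examples share one goal value
                 return goal_values.pop()
             if not attributes:                         # extra guard: no attributes left to test
                 return majority_value(examples, goal)
             best = choose_attribute(attributes, examples, goal)
             tree = {best: {}}                          # root test on `best`
             for v in {e[best] for e in examples}:
                 subset = [e for e in examples if e[best] == v]
                 tree[best][v] = learn_decision_tree(
                     subset, attributes - {best}, majority_value(examples, goal), goal)
             return tree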

                                 MSCS282 - Data Mining With
16                                   Decision Trees
     Analysis of Basic Algorithm

        Let m be the number of attributes
        Let n be the number of instances
        Assumption: Depth of tree is O(log n)
        For each level of the tree all n instances are
         considered (best = vi)
         –   O(n log n) work for a single attribute over the entire tree
        Total cost is O(mn log n) since all attributes are
         eventually considered.


                           MSCS282 - Data Mining With
17                             Decision Trees
     How Many Possible Decision
     Trees?
        Assume a set of m non-goal boolean attributes
        We can construct a decision tree for each boolean
         function with m non-goal attributes
        There are $2^m$ possible assignments of truth values to the attributes (rows of a truth table)
        Each distinct Boolean function corresponds to a subset of those rows — the rows assigned the
         value true
        So there must be $2^{2^m}$ possible decision trees!
        How do we select the best one?

                      MSCS282 - Data Mining With
18                        Decision Trees
     Applying Heuristics

        In the basic algorithm, the ChooseAttribute
         function makes an arbitrary choice of an
         attribute to build the tree.
        We can make this function try to choose the
         “best” attribute to avoid making poor choices
        This in effect biases the search.



                     MSCS282 - Data Mining With
19                       Decision Trees
     Information Theory
        One method for assessing attribute quality
        Described by Shannon and Weaver (1949)
        Measurement of the expected amount of information in
         terms of bits
         –   These are not your ordinary computer bits
         –   Often information is fractional
        Other Applications
         –   Compression
         –   Feature selection
        Using information gain to choose attributes gives the ID3 algorithm for decision tree construction.
                         MSCS282 - Data Mining With
20                           Decision Trees
     Notation

        Let vi be a possible answer (value of attribute)
        Let P(vi) be the probability of getting answer vi
         from a random data element
        The information content I of knowing the
         actual answer is

             $I(P(v_1), \ldots, P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)$


                       MSCS282 - Data Mining With
21                         Decision Trees
     Example
        Consider a fair coin, P(heads) = P(tails) = ½

            $I(1/2, 1/2) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
        Consider an unfair coin, P(heads) = 0.99 and
         P(tails)=0.01
            $I(0.99, 0.01) = -0.99\log_2 0.99 - 0.01\log_2 0.01 \approx 0.08$ bits
        The value of the actual answer is reduced if you know
         there is a bias

                          MSCS282 - Data Mining With
22                            Decision Trees
     Application to Decision Trees
        Measure the value of information after splitting the
         instances by an attribute A
        Attribute A splits the instances E into subsets E1, …, Ea
         where a is the number of values A can have
               $\mathrm{Remainder}(A) = \sum_{i=1}^{a} \frac{|E_i|}{|E|} \, I(P(v_1^i), \ldots, P(v_n^i))$

          where $P(v_1^i)$ is the probability of an element in $E_i$ having
          value $v_1$ for the goal attribute, etc.
          –   Number of elements in $E_i$ having $v_1$ divided by $|E_i|$
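
        A sketch of the Remainder calculation in the same style (it reuses information_content from the
        earlier sketch; the argument names are my own):

            from collections import Counter

            def remainder(examples, attribute, goal):
                """Expected information still needed after splitting `examples` on `attribute`."""
                total = len(examples)
                result = 0.0
                for value in {e[attribute] for e in examples}:
                    subset = [e for e in examples if e[attribute] == value]   # E_i
                    counts = Counter(e[goal] for e in subset)                 # goal-value counts in E_i
                    result += (len(subset) / total) * information_content(
                        *(c / len(subset) for c in counts.values()))
                return result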

                          MSCS282 - Data Mining With
23                            Decision Trees
     Application to Decision Trees

        The information gain of an attribute A is
            $\mathrm{Gain}(A) = I(P(v_1), \ldots, P(v_n)) - \mathrm{Remainder}(A)$
         or the amount of information before selecting
         the attribute minus how much is still needed
         afterwards (the values are for the goal
         attribute)
        Heuristic: select attribute with highest gain
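
        Continuing the sketch, information gain and a gain-based ChooseAttribute (this replaces the
        arbitrary placeholder used in the earlier sketch):

            def gain(examples, attribute, goal):
                """Information about the goal attribute before the split minus Remainder(attribute)."""
                counts = Counter(e[goal] for e in examples)
                before = information_content(*(c / len(examples) for c in counts.values()))
                return before - remainder(examples, attribute, goal)

            def choose_attribute(attributes, examples, goal):
                """Heuristic: select the attribute with the highest information gain."""
                return max(attributes, key=lambda a: gain(examples, a, goal))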

                       MSCS282 - Data Mining With
24                         Decision Trees
      Example

           Calculate for Patrons and Type
      $\mathrm{Gain}(Patrons) = I(6/12, 6/12) - [\,2/12\, I(0/2, 2/2) + 4/12\, I(4/4, 0/4) + 6/12\, I(2/6, 4/6)\,] \approx 0.541$ bits
      $\mathrm{Gain}(Type) = I(6/12, 6/12) - [\,2/12\, I(1/2, 1/2) + 2/12\, I(1/2, 1/2) + 4/12\, I(2/4, 2/4) + 4/12\, I(2/4, 2/4)\,] = 0$ bits


           Which attribute would be chosen?
           Exercise: calculate information gain of Alt
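
        The Gain(Patrons) figure can be checked numerically with the sketches above; the subset counts
        below are read off the restaurant table on slide 11 (Patrons = None: 2 instances, 0 Yes;
        Some: 4 instances, all Yes; Full: 6 instances, 2 Yes):

            before = information_content(6/12, 6/12)          # 1.0 bit: 6 Yes, 6 No overall
            after = (2/12) * information_content(2/2) \
                  + (4/12) * information_content(4/4) \
                  + (6/12) * information_content(2/6, 4/6)    # pure branches contribute 0 bits
            print(before - after)                             # ~0.541 bits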



                                               MSCS282 - Data Mining With
25                                                 Decision Trees
     Carrying On

        When you use information gain in lower levels
         of the tree, remember your set of instances
         under consideration changes
         –   The decision tree construction procedure is
             recursive
         –   This is the single most common mistake when
             calculating information gain by hand



                       MSCS282 - Data Mining With
26                         Decision Trees
     Highly Branching Attributes

        Attributes with many branches (many values) can show spuriously
         high information gain
        Correct for this by using the gain ratio
         –   Calculate the information of the split
                   $\mathrm{Split}(A) = I\!\left(\frac{|E_1|}{|E|}, \frac{|E_2|}{|E|}, \ldots, \frac{|E_a|}{|E|}\right)$
         –   Calculate Gain(A)/Split(A)
         –   Choose attribute with highest gain ratio
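
        A sketch of this correction, again building on the earlier functions (the guard against a zero
        split value, for attributes with only one observed value, is my own addition):

            def split_info(examples, attribute):
                """Information of the split itself: I(|E1|/|E|, ..., |Ea|/|E|)."""
                total = len(examples)
                sizes = Counter(e[attribute] for e in examples).values()
                return information_content(*(s / total for s in sizes))

            def gain_ratio(examples, attribute, goal):
                s = split_info(examples, attribute)
                return gain(examples, attribute, goal) / s if s > 0 else 0.0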


                          MSCS282 - Data Mining With
27                            Decision Trees
     Assessing Decision Trees

        Two kinds of assessments that we may want
         –   Assess the performance of a single model
         –   Assess the performance of a data mining technique
        What kinds of metrics can we use?
         –   Model size
         –   Accuracy



                          MSCS282 - Data Mining With
28                            Decision Trees
     Comparing Model Size
        Suppose two models with the same accuracy
        Choose the model with smaller size
         –   Ockham’s razor: The most likely hypothesis is the simplest
             one that is consistent with all observations.
         –   Can be used as a heuristic (other data mining techniques)
        Why?
         –   Efficiency
         –   Generality
        The problem of finding the smallest model is often
         intractable
         –   NP-complete for decision tree learning
                          MSCS282 - Data Mining With
29                            Decision Trees
     Accuracy
        Measurement of the correctness of the technique
         –   Success rate
        Definitions
         –   True positive: a positive instance that is correctly classified
         –   True negative: a negative instance correctly classified
         –   False positive: a negative instance classified as a positive one
         –   False negative: a positive instance classified as a negative one
        Accuracy is f = (|tp| + |tn|) / |E|
        Sometimes we’re more accepting of some errors
         –   Spam filter

                           MSCS282 - Data Mining With
30                             Decision Trees
     Testing Procedures
        In general, instances are split into two disjoint sets
         –   Training set: the set of instances used to build the model
         –   Test set: the set of instances used to test the accuracy


            [Diagram: the available instances are divided into a training set and a test set]


        In both sets, the correct labeling is known

                          MSCS282 - Data Mining With
31                            Decision Trees
     Testing Dilemma

        We’d like both sets to be as large as possible
        Try to create sets that are representative of
         possible data
        As the number of attributes grows, the size of a
         representative set grows exponentially. (Why?)




                     MSCS282 - Data Mining With
32                       Decision Trees
     Assessing a Single Model

        Each test instance constitutes a Bernoulli trial of the
         model.
         –   Mean and variance of single trial are p and p(1-p)
         –   For N instances, f is a random variable with mean p, variance
             is p(1-p)/N
         –   For large N (>100), the distribution of f approaches a normal
             distribution (bell curve)
        Calculate P(-z <= X <= z) = c, where z defines the
         confidence interval and c defines the confidence


                          MSCS282 - Data Mining With
33                            Decision Trees
     Assessing a Single Model

        Standardize the accuracy f so that it has zero mean and unit
         variance:

             $P\!\left(-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right) = c$
        Values for c and z can be found in standard
         statistical texts
        Solve for p, which is shown in the text
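
        Solving the quadratic in p gives the usual normal-approximation bounds; the sketch below is my
        own phrasing of that step (function name and example numbers are assumptions), with z = 1.96
        corresponding to c = 95%:

            import math

            def bounds_on_p(f, n, z=1.96):
                """Confidence interval for the true success rate p, given observed accuracy f
                over n test instances; z = 1.96 corresponds to c = 95%."""
                center = f + z * z / (2 * n)
                spread = z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
                denom = 1 + z * z / n
                return (center - spread) / denom, (center + spread) / denom

            print(bounds_on_p(0.75, 1000))   # roughly (0.72, 0.78) for 75% accuracy on 1000 instances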

                     MSCS282 - Data Mining With
34                       Decision Trees
     Assessing a Single Model

        Two models are significantly different if their
         confidence intervals for p do not overlap
        Choose the model with a “better” confidence
         interval for p




                     MSCS282 - Data Mining With
35                       Decision Trees
     Assessing a Method
        n-fold cross-validation
         –   Split the instances into n equal sized partitions
                 Make sure each partition is as representative as possible
         –   Run n training and testing sessions, treating each partition as
             a testing set during one session
         –   Calculate accuracy and error rates
                 Means and standard deviation
         –   10 fold tests are common
        Leave-one-out (or jackknife)
         –   Special case of n-fold cross validation
         –   Use for small datasets
         –   Each instance is its own test set.
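
        A sketch of n-fold cross-validation; train_and_test is an assumed callback that builds a model on
        the training folds and returns its accuracy on the test fold, and the simple shuffling used here
        does not stratify the folds to keep them representative:

            import random

            def cross_validate(examples, n_folds, train_and_test):
                """Run n training/testing sessions, each fold serving as the test set once.
                Returns the mean accuracy and its standard deviation across folds."""
                shuffled = examples[:]
                random.shuffle(shuffled)
                folds = [shuffled[i::n_folds] for i in range(n_folds)]
                accuracies = []
                for i in range(n_folds):
                    test = folds[i]
                    train = [e for j, fold in enumerate(folds) if j != i for e in fold]
                    accuracies.append(train_and_test(train, test))
                mean = sum(accuracies) / n_folds
                sd = (sum((a - mean) ** 2 for a in accuracies) / n_folds) ** 0.5
                return mean, sd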
                            MSCS282 - Data Mining With
36                              Decision Trees
     WEKA Output




             MSCS282 - Data Mining With
37               Decision Trees
     WEKA Output




             MSCS282 - Data Mining With
38               Decision Trees
     Extensions to Basic Algorithm

        Numeric Attributes
        Missing Values
        Overfitting Avoidance (Pruning)
        Interpreting Decision Trees




                    MSCS282 - Data Mining With
39                      Decision Trees
     Handling Numeric Attributes

        Recall that decision trees work for nominal
         attributes
         –   Can’t have infinite number of branches
        Our approach is to convert numeric attributes
         into ordinal (nominal) attributes
        This process is called discretization



                       MSCS282 - Data Mining With
40                         Decision Trees
     Discretization

        Binary split (weather data)
        Select a breakpoint between values with
         maximum information gain (equivalently,
         lowest Remainder)
         –   For each breakpoint calculate gain for less than and
             greater than the breakpoint.
         –   For n values, this is an O(n) process (assuming
             instances are sorted already).
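
        A sketch of the breakpoint search (it reuses Counter and information_content from the earlier
        sketches; for clarity it recomputes the class counts at every candidate breakpoint, so it is
        quadratic rather than the O(n) incremental version the slide assumes):

            def subset_info(labels):
                """I(...) over the class distribution of a list of goal values."""
                counts = Counter(labels)
                return information_content(*(c / len(labels) for c in counts.values()))

            def best_breakpoint(values, labels):
                """Return (breakpoint, remainder) with the lowest Remainder for a sorted numeric
                attribute; breakpoints are midpoints between adjacent distinct values."""
                pairs = list(zip(values, labels))
                best = None
                for i in range(1, len(pairs)):
                    if pairs[i - 1][0] == pairs[i][0]:
                        continue                               # no breakpoint between equal values
                    bp = (pairs[i - 1][0] + pairs[i][0]) / 2
                    left = [lbl for v, lbl in pairs if v < bp]
                    right = [lbl for v, lbl in pairs if v >= bp]
                    rem = (len(left) * subset_info(left)
                           + len(right) * subset_info(right)) / len(pairs)
                    if best is None or rem < best[1]:
                        best = (bp, rem)
                return best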

                        MSCS282 - Data Mining With
41                          Decision Trees
     Discretization

        Example (the numeric temperature attribute of the weather data, sorted, with the
         play label below each value; 72 and 75 each occur twice):

          Temp:  64   65   68   69   70   71   72       75       80   81   83   85
          Play:  Yes  No   Yes  Yes  Yes  No   No, Yes  Yes, No  No   Yes  Yes  No

     Remainder ( 70 .5)  5 / 14 I (4 / 5,1 / 5)  9 / 14 I (4 / 9,5 / 9)  0.254
     Remainder ( 73 .5)  8 / 14 I (5 / 8,3 / 8)  6 / 14 I (3 / 6,3 / 6)  0.309

        You can reuse a continuous attribute at lower levels of the tree, but
         doing so makes the results harder to interpret.
                          MSCS282 - Data Mining With
42                            Decision Trees
     Discretization

        Equal-interval (equiwidth) binning splits the
         range into n equal sized ranges
         –   (max – min) / n is the range width
         –   Often distributes the instances unevenly
        Equal-frequency (equidepth) binning splits into
         n bins containing an equal (or close to equal)
         number of instances
         –   Identify splits until the histogram is flat
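
        Sketches of the two unsupervised schemes (the bin counts and boundary handling are my own
        choices, not prescribed by the slides):

            def equal_width_bins(values, n):
                """Equiwidth binning: assign each value to one of n ranges of width (max - min) / n."""
                lo, hi = min(values), max(values)
                width = (hi - lo) / n or 1.0                       # avoid division by zero if all equal
                return [min(int((v - lo) / width), n - 1) for v in values]

            def equal_frequency_cuts(sorted_values, n):
                """Equidepth binning: return n - 1 cut points so each bin holds roughly the same
                number of values; assumes the input list is sorted."""
                size = len(sorted_values)
                return [sorted_values[(i * size) // n] for i in range(1, n)]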

                         MSCS282 - Data Mining With
43                           Decision Trees
     Discretization




               MSCS282 - Data Mining With
44                 Decision Trees
     Discretization

        Entropy (information content) based
         –   Requires class labeling (goal attribute)
        Recursively apply the approach on slide 41
         –   Select the breakpoint B with lowest Remainder
         –   Recursively select breakpoint with lowest remainder on each
             of the two partitions
         –   Stop splitting when some criterion is met
                 Minimum description length in section 5.9
                 If Gain(B) < t, for some threshold t
                    –   A formula for determining t is given in the book.


                               MSCS282 - Data Mining With
45                                 Decision Trees
     Handling Missing Values
        Ignore instances with missing values
         –   Pretty harsh, and missing value might not be important
        Ignore attributes with missing values
         –   Again, may not be feasible
        Treat missing value as another nominal value
         –   Fine if missing a value has significant meaning
        Estimate missing values
         –   Data imputation: regression, nearest neighbor, mean, mode,
             etc.
         –   We’ll cover this in more detail later in the semester

                          MSCS282 - Data Mining With
46                            Decision Trees
     Handling Missing Values

        Follow the leader
         –   An instance with a missing value for a tested
             attribute is sent down the branch with the most
             instances

            [Diagram: a Temp node with branches Temp < 75 (5 instances) and Temp >= 75 (3 instances);
             the instance with the missing Temp value is sent down the left (< 75) branch, which has
             the most instances]

                         MSCS282 - Data Mining With
47                           Decision Trees
     Handling Missing Values

        “Partition” the instance: send a fraction of it down every branch, weighted
         by the number of training instances on each branch
            [Diagram: a Temp node sends 5 5/8 instances down one branch and 3 3/8 down the other;
             the first branch then splits on Sunny into 2 5/8 and 3, the second splits on Wind into
             1, 1, and 1 3/8 — the fractional 5/8 and 3/8 pieces come from the missing-value instance]
                        MSCS282 - Data Mining With
48                          Decision Trees
     Pruning

        To avoid overfitting, we can prune or simplify a
         decision tree.
         –   More efficient, Ockham’s Razor
        Prepruning tries to decide a priori when to stop
         creating subtrees
         –   This turns out to be fairly difficult to do well in
             practice
        Postpruning simplifies an existing decision tree
                         MSCS282 - Data Mining With
49                           Decision Trees
     Postpruning

        Subtree replacement replaces a subtree with a
         single leaf node
            [Diagram: before — Alt = Yes leads to a Price node with branches $ → Yes (4/5),
             $$ → Yes (7/8), $$$ → No (1/2); after — the Price subtree is replaced by a single
             leaf Yes (12/15)]
                               MSCS282 - Data Mining With
50                                 Decision Trees
     Postpruning

          Subtree raising moves a subtree to a higher
           level in the decision tree, subsuming its parent
              [Diagram: before — Alt = Yes leads to a Res node; Res = No leads to a Price subtree
               ($ → Yes 4/5, $$ → Yes 7/8, $$$ → No 1/2) and Res = Yes leads to a leaf No (4/4);
               after — the Price subtree is raised to replace the Res node, with leaves
               $ → Yes (4/5), $$ → Yes (7/9), $$$ → No (4/5)]
                                      MSCS282 - Data Mining With
51                                        Decision Trees
     Postpruning

        When do we want to perform subtree replacement or
         subtree raising?
         –   Consider the estimated error of the pruning operation
        Estimating error
          –   With a test set: proceed as for accuracy, except use the error rate
              f = (|fp| + |fn|) / |E| instead of the success rate, with a confidence of 25%
         –   The confidence can be tweaked to achieve better performance
         –   Without a test set, consider number of misclassified training
             instances as errors, and take pessimistic estimate of error rate.


                          MSCS282 - Data Mining With
52                            Decision Trees
     Using Error Estimate

        To determine if a node should be replaced,
         compare the error rate estimate for the node
         with the combined error rates of the children.
         Replace the node if its error rate is less than
         combined rates of its children.
          Example: a Price node whose children are the leaves Yes (4/5), Yes (7/8), and No (1/2)
            Combined error estimate of the children: 5/15 err(1/5, 5) + 8/15 err(1/8, 8) + 2/15 err(1/2, 2) = 0.33
            Error estimate of a single replacing leaf: err(3/15, 15) = 0.28
            Since 0.28 < 0.33, the Price node is replaced by a leaf
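
        A sketch of the err(...) estimate used above. It follows the pessimistic upper-confidence-limit
        formula described in Witten & Frank's text with the default 25% confidence (z ≈ 0.69); under
        that assumption it reproduces the 0.33 and 0.28 figures on this slide.

            import math

            def err(errors, n, z=0.69):
                """Pessimistic (upper confidence limit) error estimate for `errors` misclassified
                training instances out of n; z ~= 0.69 corresponds to the default 25% confidence."""
                f = errors / n
                num = f + z * z / (2 * n) + z * math.sqrt(f / n - f * f / n + z * z / (4 * n * n))
                return num / (1 + z * z / n)

            children = [(1, 5), (1, 8), (1, 2)]                 # (errors, instances) at the three leaves
            total = sum(n for _, n in children)
            combined = sum(n / total * err(e, n) for e, n in children)
            print(round(combined, 2), round(err(3, 15), 2))     # 0.33 0.28 -> replace the node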
                             MSCS282 - Data Mining With
53                               Decision Trees
     Interpreting Decision Trees
        Although the decision tree is used for classification, the classification
         rules derived from it can also be used to describe concepts
                 If tear production rate = reduced then recommendation = none.
                 If age = young and astigmatic = no and tear production rate = normal
                    then recommendation = soft
                 If age = pre-presbyopic and astigmatic = no and tear production
                    rate = normal then recommendation = soft
                 If age = presbyopic and spectacle prescription = myope and
                    astigmatic = no then recommendation = none
                 If spectacle prescription = hypermetrope and astigmatic = no and
                    tear production rate = normal then recommendation = soft
                 If spectacle prescription = myope and astigmatic = yes and
                    tear production rate = normal then recommendation = hard
                 If age = young and astigmatic = yes and tear production rate = normal
                    then recommendation = hard
                 If age = pre-presbyopic and spectacle prescription = hypermetrope
                    and astigmatic = yes then recommendation = none
                 If age = presbyopic and spectacle prescription = hypermetrope
                    and astigmatic = yes then recommendation = none


                            MSCS282 - Data Mining With
54                              Decision Trees
     Interpreting Decision Trees

        A description of hard contact wearers,
         appropriate for “regular people”

          In general, a nearsighted person with an astigmatism
              and normal tear production should be prescribed
                               hard contacts.




                      MSCS282 - Data Mining With
55                        Decision Trees
     Summary

        Decision trees are a classification technique
        They can represent any function representable
         with propositional logic
        Heuristics such as information content are
         used to select relevant attributes
        Pruning is used to avoid overfitting
        The output of decision trees can be used for
         descriptive as well as predictive purposes

                    MSCS282 - Data Mining With
56                      Decision Trees

				