DSCI 4520/5240 DBDSS (DATA MINING)

Lecture 4: Decision Tree Algorithms

Some slide material taken from: Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005
Objective

           Review of some Decision Tree algorithms.




Decision Trees: Credit risk example

This example is about assessing credit risk. We have a total of 10 people: 6 are good risks and 4 are bad. We first split the tree on employment status and find that 7 people are employed and 3 are not. All 3 of the not-employed people are bad credit risks, so we have learned something about our data.

The not-employed node cannot be split any further, since all of its records fall into a single class; this is called a pure node. The employed node, however, can be split again on a different attribute (here, marital status), so we can continue to grow that side of the tree.

CORRESPONDING RULES:
• IF employed = yes AND married = yes THEN risk = good
• IF employed = yes AND married = no THEN risk = good
• IF employed = no THEN risk = bad

Decision Tree performance

Confidence is the accuracy of a rule: the fraction of records satisfying the IF part that also satisfy the THEN part.
Support is the fraction of all records in the data that satisfy the rule's conditions.
EXAMPLE: if 10 customers purchased Zane Grey's The Young Pitcher and 8 of them also purchased The Short Stop, the rule {IF basket has The Young Pitcher THEN basket has The Short Stop} has confidence 8/10 = 0.80. If those were the only 10 purchases involving these books out of 10,000,000 purchases, the support is only 10/10,000,000 = 0.000001.

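To make the arithmetic concrete, here is a small sketch (not part of the original slides; the purchase counts are the ones from the example above):

```python
# Confidence and support for {IF basket has The Young Pitcher
#                             THEN basket has The Short Stop}
total_purchases = 10_000_000
baskets_with_young_pitcher = 10      # baskets satisfying the IF part
baskets_with_both = 8                # of those, baskets also satisfying the THEN part

confidence = baskets_with_both / baskets_with_young_pitcher
support = baskets_with_young_pitcher / total_purchases   # how often the rule's conditions occur

print(confidence)   # 0.8
print(support)      # 1e-06
```
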
                       Rule Interestingness




    Interestingness is the idea that Data Mining discovers
        something unexpected.
    Consider the rule: {IF basket has eggs THEN basket has
        bacon}. Suppose the confidence level is 0.90 and the
        support level is 0.20. This may be a useful rule, but
        it may not be interesting if the grocer was already
        aware of this association.
    Recall the definition of DM as the discovery of
        previously unknown knowledge!


Rule Induction algorithms

Rule induction algorithms are recursive algorithms that identify data partitions with progressively better separation with respect to the outcome.
The partitions are then organized into a decision tree.
Common algorithms:

            1R                  CHAID
            ID3                 CN2
            C4.5/C5.0           BruteDL
            CART                SDL

Illustration of Tree algorithms

                    1R and Discretization in 1R
                    Naïve Bayes Classification
                    ID3: Min Entropy and Max Info Gain








                 1R




1R: Inferring Rudimentary Rules

1R learns a 1-level decision tree
   • In other words, it generates a set of rules that all test one particular attribute
Basic version (assuming nominal attributes):
   • One branch for each of the attribute's values
   • Each branch assigns the most frequent class
   • Error rate: the proportion of instances that don't belong to the majority class of their branch
   • Choose the attribute with the lowest error rate





Pseudo-code for 1R




For each attribute,
   For each value of the attribute, make a rule as follows:
      count how often each class appears
      find the most frequent class
      make the rule assign that class to this attribute-value
   Calculate the error rate of the rules
Choose the rules with the smallest error rate

Let's apply 1R to the weather data:
   • Consider the first (outlook) of the 4 attributes (outlook, temp, humidity, windy). Consider all of its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you get all 4 sets of rules.

A simple example: Weather Data




                 Outlook    Temp   Humidity   Windy   Play?
                  Sunny     Hot      High     False    No
                  Sunny     Hot      High     True     No
                 Overcast   Hot      High     False   Yes
                  Rainy     Mild     High     False   Yes
                  Rainy     Cool   Normal     False   Yes
                  Rainy     Cool   Normal     True     No
                 Overcast   Cool   Normal     True    Yes
                  Sunny     Mild     High     False    No
                  Sunny     Cool   Normal     False   Yes
                  Rainy     Mild   Normal     False   Yes
                  Sunny     Mild   Normal     True    Yes
                 Overcast   Mild     High     True    Yes
                 Overcast   Hot    Normal     False   Yes
                  Rainy     Mild     High     True     No

Evaluating the Weather Attributes in 1R

The 1R error counts for each attribute (derived from the weather data above):

   Attribute     Rules                    Errors    Total errors
   Outlook       sunny -> no               2/5
                 overcast -> yes           0/4
                 rainy -> yes              2/5       4/14
   Temperature   hot -> no (*)             2/4
                 mild -> yes               2/6
                 cool -> yes               1/4       5/14
   Humidity      high -> no                3/7
                 normal -> yes             1/7       4/14
   Windy         false -> yes              2/8
                 true -> no (*)            3/6       5/14

   (*) indicates a random choice between two equally likely outcomes

1R chooses one of the attributes with the fewest total errors (Outlook or Humidity, both at 4/14); a code sketch reproducing these counts follows.

Decision tree for the weather data

   Outlook
      sunny    -> Humidity:  high -> no,  normal -> yes
      overcast -> yes
      rainy    -> Windy:     false -> yes,  true -> no

Discretization in 1R

Consider the continuous Temperature data, sorted in ascending order:
            65  65  68  69  70  71  72  72  75  75  80  81  83  85
            Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No

One way to discretize temperature is to place breakpoints wherever the class changes:
      Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No

To avoid overfitting, 1R instead requires a minimum number of majority-class observations in each partition (here 3), and extends a partition past that minimum as long as the next observation has the same class:
      Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

If adjacent partitions have the same majority class, they are merged:
      Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No

The final discretization leads to the rule set (the breakpoint 77.5 is the midpoint between 75 and 80):
      IF temperature <= 77.5 THEN Yes
      IF temperature >  77.5 THEN No
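A minimal sketch of this partition-and-merge procedure (not from the original slides; it assumes the minimum of 3 majority-class observations described above, and the final 2-2 partition's tie resolves to No, as on the slide):

```python
# 1R-style discretization of the temperature values above.
from collections import Counter

temps   = [65, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
classes = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]
MIN_MAJORITY = 3

def partition(labels):
    parts, start = [], 0
    while start < len(labels):
        counts, i = Counter(), start
        # Grow the partition until some class has MIN_MAJORITY observations.
        while i < len(labels) and max(counts.values(), default=0) < MIN_MAJORITY:
            counts[labels[i]] += 1
            i += 1
        # Extend while the next observation has the same class as the last one.
        while i < len(labels) and labels[i] == labels[i - 1]:
            i += 1
        parts.append((start, i))
        start = i
    return parts

def majority(labels, part):
    # Ties are broken by first appearance; the final 2-2 partition gives "No" here.
    return Counter(labels[part[0]:part[1]]).most_common(1)[0][0]

parts = partition(classes)
merged = [parts[0]]
for p in parts[1:]:          # merge adjacent partitions with the same majority class
    if majority(classes, p) == majority(classes, merged[-1]):
        merged[-1] = (merged[-1][0], p[1])
    else:
        merged.append(p)

for (start, end) in merged[:-1]:
    print("breakpoint at", (temps[end - 1] + temps[end]) / 2)   # expected: 77.5
```
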
Comments on 1R

• 1R was described in a paper by Holte (1993)
• The paper contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
• In the discretization step, the minimum number of instances was set to 6 after some experimentation
• 1R's simple rules performed not much worse than much more complex decision trees
• Simplicity first pays off!







                 Naïve Bayes Classification





Statistical Decision Tree Modeling




 1R uses one attribute at a time and chooses the one that works best.
 Consider the “opposite” of 1R: Use all the attributes.

 Let’s first make two assumptions: Attributes are
       • Equally important
       • Statistically independent

 Although based on assumptions that are almost
 never correct, this scheme works well in practice!





Probabilities for the Weather Data

Conditional probabilities (contingencies) derived from the counts in the weather data:

   Outlook:      P(Sunny|Yes) = 2/9      P(Sunny|No) = 3/5
                 P(Overcast|Yes) = 4/9   P(Overcast|No) = 0/5
                 P(Rainy|Yes) = 3/9      P(Rainy|No) = 2/5
   Temperature:  P(Hot|Yes) = 2/9        P(Hot|No) = 2/5
                 P(Mild|Yes) = 4/9       P(Mild|No) = 2/5
                 P(Cool|Yes) = 3/9       P(Cool|No) = 1/5
   Humidity:     P(High|Yes) = 3/9       P(High|No) = 4/5
                 P(Normal|Yes) = 6/9     P(Normal|No) = 1/5
   Windy:        P(False|Yes) = 6/9      P(False|No) = 2/5
                 P(True|Yes) = 3/9       P(True|No) = 3/5
   Play:         P(Yes) = 9/14           P(No) = 5/14

A new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True, Play = ?

Suppose the answer is Play = Yes. How likely is it to get the attribute values of this new day?

Bayes' Rule

Probability of event H given evidence E:

      P(H|E) = P(E|H) P(H) / P(E)

   WHERE: H = target value, E = input variable values

"A priori" probability of H: P(H)
   (probability of the event before evidence has been seen)

"A posteriori" probability of H: P(H|E)
   (probability of the event after evidence has been seen)


Naïve Bayes Classification




Classification learning: what's the probability of the class given an instance?
   • Evidence E = instance
   • Event H = class value for the instance
Naïve Bayes assumption: the evidence can be split into independent parts (i.e., the attributes of the instance!)

      P(H|E) = P(E1|H) P(E2|H) … P(En|H) P(H) / P(E)


Naïve Bayes on the Weather Data

The new day (evidence E): Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True

 P( Yes | E) = (P( Outlook = Sunny | Yes) ×
                    P( Temperature = Cool | Yes) ×
                    P( Humidity = High | Yes) ×
                    P( Windy = True | Yes) × P(Yes)) / P(E)
 P( Yes | E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
                = 0.0053 / P(E)
 P( No | E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E)
                = 0.0206 / P(E)


  Note that P(E) will disappear when we
     normalize!

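As a check, a minimal sketch (not part of the original slides) that recomputes these two scores from the weather data and then normalizes them, so that P(E) cancels:

```python
# Naive Bayes on the weather data for the new day above.
from collections import Counter, defaultdict

data = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", "False", "No"),     ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"), ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),  ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Rainy", "Mild", "High", "True", "No"),
]

class_counts = Counter(row[-1] for row in data)      # counts behind P(H)
value_counts = defaultdict(Counter)                  # counts behind P(Ei|H)
for row in data:
    for i, value in enumerate(row[:-1]):
        value_counts[(i, row[-1])][value] += 1

def score(new_day, cls):
    """Unnormalized P(cls|E) = P(E1|cls)...P(En|cls) P(cls), ignoring P(E)."""
    p = class_counts[cls] / len(data)
    for i, value in enumerate(new_day):
        p *= value_counts[(i, cls)][value] / class_counts[cls]
    return p

new_day = ("Sunny", "Cool", "High", "True")
scores = {cls: score(new_day, cls) for cls in class_counts}
print(scores)                                        # ~ {'No': 0.0206, 'Yes': 0.0053}
total = sum(scores.values())
print({cls: s / total for cls, s in scores.items()}) # normalized: No ~0.795, Yes ~0.205
```
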
Comments on Naïve Bayes Classification

• Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
• Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class
• However: adding too many redundant attributes will cause problems (e.g. identical attributes)








Entropy and Information Gain





Constructing Decision Trees in ID3, C4.5, C5.0

Normal procedure: top-down, in recursive divide-and-conquer fashion (a code sketch follows below)
   • First: an attribute is selected for the root node and a branch is created for each possible attribute value
   • Then: the instances are split into subsets (one for each branch extending from the node)
   • Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch

The process stops if all instances have the same class



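A minimal sketch of this divide-and-conquer procedure (an illustration, not the exact ID3/C4.5 code; the attribute-selection criterion is passed in as a function and is discussed on the next slides):

```python
# Top-down recursive tree construction. `rows` are dicts mapping attribute
# names to values, plus a "class" key for the target.
from collections import Counter

def build_tree(rows, attributes, choose_best_attribute):
    classes = [r["class"] for r in rows]
    # Stop if all instances have the same class (pure node) or no attributes remain.
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]      # leaf: majority class
    best = choose_best_attribute(rows, attributes)        # e.g. maximum information gain
    tree = {best: {}}
    for value in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == value]    # instances that reach this branch
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_best_attribute)
    return tree

# Usage idea: build_tree(weather_rows, ["Outlook", "Temp", "Humidity", "Windy"],
#                        choose_best_attribute)   # criterion defined on later slides
```
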
                                    Which attribute to select?




Class distributions produced by each candidate split (from the weather data):

   Outlook:      sunny: 2 yes, 3 no     overcast: 4 yes, 0 no    rainy: 3 yes, 2 no
   Temperature:  hot: 2 yes, 2 no       mild: 4 yes, 2 no        cool: 3 yes, 1 no
   Humidity:     high: 3 yes, 4 no      normal: 6 yes, 1 no
   Windy:        false: 6 yes, 2 no     true: 3 yes, 3 no

A criterion for attribute selection

• Which is the best attribute?
   • The one that will result in the smallest tree.
   • Heuristic: choose the attribute that produces the "purest" nodes!
• Popular impurity criterion: Information. This is the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes.
• We can then compare the tree before and after the split using Information Gain = Info(before) – Info(after).
• Information Gain increases with the average purity of the subsets that an attribute produces.
• Strategy: choose the attribute that results in the greatest information gain.

Computing Information

• Information is measured in bits
• Given a probability distribution, the info required to predict an event is the distribution's entropy
• Entropy gives the additional required information (i.e., the information deficit) in bits
   • This can involve fractions of bits!
• The negative sign in the entropy formula is needed to convert all the negative logs back to positive values

Formula for computing the entropy (all logs are base 2):
   Entropy(p1, p2, …, pn) = –p1 log p1 – p2 log p2 … – pn log pn

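A quick numeric check of the formula (a sketch, not part of the slides); the 0.940 figure reappears in the information-gain calculations below:

```python
from math import log2

def entropy(*probs):
    # Terms with p = 0 are skipped, i.e. 0 log 0 is taken as 0.
    return sum(-p * log2(p) for p in probs if p > 0)

print(entropy(9/14, 5/14))   # info([9,5]) ~ 0.940 bits
print(entropy(1.0, 0.0))     # 0.0 bits: a pure node needs no extra information
```
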

Weather example: attribute "Outlook"



• Outlook = "Sunny":
   Info([2,3]) = entropy(2/5, 3/5) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971 bits

• Outlook = "Overcast":
   Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (by definition, 0 log 0 = 0)

• Outlook = "Rainy":
   Info([3,2]) = entropy(3/5, 2/5) = –3/5 log(3/5) – 2/5 log(2/5) = 0.971 bits

Expected Information for attribute Outlook (weighted by branch sizes):
   Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits

Computing the Information Gain




• Information Gain = Information(before) – Information(after)
   Gain(Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
• Information Gain for the attributes of the Weather Data:
   Gain(Outlook) = 0.247 bits
   Gain(Temperature) = 0.029 bits
   Gain(Humidity) = 0.152 bits
   Gain(Windy) = 0.048 bits





Continuing to split




Within the Outlook = sunny branch (2 yes, 3 no), the candidate splits give:

   Temperature:  hot: 0 yes, 2 no     mild: 1 yes, 1 no    cool: 1 yes, 0 no
   Windy:        false: 1 yes, 2 no   true: 1 yes, 1 no
   Humidity:     high: 0 yes, 3 no    normal: 2 yes, 0 no

   Gain(Temperature) = 0.571 bits
   Gain(Humidity) = 0.971 bits
   Gain(Windy) = 0.020 bits

Humidity gives the largest gain, so the sunny branch is split on Humidity.


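A minimal sketch (not part of the original slides) that reproduces the gain figures on this and the previous slide, both for the full weather data and for the Outlook = sunny subset:

```python
# Entropy and information gain on the weather data.
from collections import Counter
from math import log2

data = [  # (Outlook, Temp, Humidity, Windy, Play)
    ("Sunny", "Hot", "High", "False", "No"),     ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"), ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),  ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Rainy", "Mild", "High", "True", "No"),
]
attributes = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Windy": 3}

def entropy(labels):
    counts = Counter(labels)
    return sum(-(c / len(labels)) * log2(c / len(labels)) for c in counts.values())

def gain(rows, attr_index):
    labels = [r[-1] for r in rows]
    before = entropy(labels)
    after = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r[-1] for r in rows if r[attr_index] == value]
        after += len(subset) / len(rows) * entropy(subset)   # weighted by branch size
    return before - after

for name, i in attributes.items():
    print(f"Gain({name}) = {gain(data, i):.3f} bits")         # 0.247, 0.029, 0.152, 0.048

sunny = [r for r in data if r[0] == "Sunny"]
for name, i in attributes.items():
    if name != "Outlook":
        print(f"sunny subset: Gain({name}) = {gain(sunny, i):.3f} bits")  # 0.571, 0.971, 0.020
```
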

Final Decision Tree




   Outlook
      sunny    -> Humidity:  high -> no,  normal -> yes
      overcast -> yes
      rainy    -> Windy:     false -> yes,  true -> no

• Not all leaves need to be pure; sometimes identical instances belong to different classes
• Splitting stops when the data cannot be split any further





Another example: Loan Application Data




[Table of twenty loan application cases — shown as an image on the original slide and not reproduced here.]

Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time.





Loan Example: probability calculations

All possible values of the three attributes (Age, Income, Risk) are listed, and for each value the probability that the loan is paid on time (OnTime = yes) is calculated. [The table of these probabilities appeared as an image and is not reproduced here.]





Loan Example: Entropy calculations




Information calculations for attribute Age (the worked table appeared as an image and is not reproduced here):
• First we calculate, for each value of Age, the probability of resulting in Yes,
• and the probability of resulting in No.
• Then we compute the entropy for that value as:
            E = –p(yes) log p(yes) – p(no) log p(no)
• Finally we calculate the Information for the entire attribute as the weighted average
            Info = p1 E1 + p2 E2 + p3 E3
  where pi is the proportion of cases taking the i-th value of Age.




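The same weighted-entropy formula can be packaged as a small helper. Since the loan table is not reproduced in this text version, the sketch below (not from the original slides) demonstrates it on the known Outlook counts from the weather data; the same call applies to the Age, Income, and Risk counts once they are read off the loan table:

```python
from math import log2

def value_entropy(yes, no):
    total = yes + no
    return sum(-p * log2(p) for p in (yes / total, no / total) if p > 0)

def attribute_information(value_counts):
    """value_counts: list of (yes, no) count pairs, one pair per attribute value."""
    n = sum(yes + no for yes, no in value_counts)
    return sum((yes + no) / n * value_entropy(yes, no) for yes, no in value_counts)

# Outlook from the weather data: sunny (2,3), overcast (4,0), rainy (3,2)
print(attribute_information([(2, 3), (4, 0), (3, 2)]))   # ~0.693 bits
```
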
                   Loan Example: The first split

• The calculations continue until we have, for each competing attribute, the Information required to predict the outcome.
• The attribute with the lowest required information is also the attribute with the largest information gain, when we compare the required information before and after the split.

The first split is on Risk, with branches for low, average, and high risk.

Suggested readings




    •      Verify the entropy, information, and information gain
           calculations we did in these slides
    •      Hint: All logs are base 2!!!
    •      Read the SAS GSEM 5.3 text, chapter 4 (pp. 61-102)
    •      Read the Sarma text, chapter 4 (pp. 113-168). Pay
           particular attention to:
           • Entropy calculations (p. 126)
           • Profit Matrix (p. 136)
           • Expected profit calculations (p. 137)
           • How to use SAS EM and grow a decision tree
               (pp. 143-158)
