



                     Data Mining - CSE5230

                              Decision Trees

CSE5230 - Data Mining, 2002                    Lecture 7.1
                              Lecture Outline

       Why use Decision Trees?
       What is a Decision Tree?
       Examples
       Use as a data mining technique
       Popular Models
           CART, CHAID, ID3 & C4.5

CSE5230 - Data Mining, 2002                     Lecture 7.2
                  Why use Decision Trees? - 1
       Whereas neural networks compute a mathematical
         function of their inputs to generate their outputs,
         decision trees use logical rules, e.g. for the Iris data:

            Petal-length <= 2.6  -> Iris setosa
            Petal-length > 2.6   -> test Petal-width:
               Petal-width > 1.65  -> Iris virginica
               Petal-width <= 1.65 -> test Petal-length:
                  Petal-length <= 5 -> Iris versicolor
                  Petal-length > 5  -> test Sepal-length:
                     Sepal-length <= 6.05 -> Iris versicolor
                     Sepal-length > 6.05  -> Iris virginica

        IF   Petal-length > 2.6 AND Petal-width <= 1.65 AND
             Petal-length > 5 AND Sepal-length > 6.05
        THEN the flower is Iris virginica

        NB. This is not the only rule for this species.
        What is the other?
                                             Figure adapted from [SGI2001]
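
The tree above can be read directly as nested conditionals. A minimal sketch in Python (the function and argument names are illustrative, not from the lecture):

```python
def classify_iris(petal_length, petal_width, sepal_length):
    """Classify an iris flower using the decision tree above."""
    if petal_length <= 2.6:
        return "Iris setosa"
    # Petal-length > 2.6: test petal width
    if petal_width > 1.65:
        return "Iris virginica"
    # Petal-width <= 1.65: test petal length again
    if petal_length <= 5:
        return "Iris versicolor"
    # Petal-length > 5: test sepal length
    if sepal_length <= 6.05:
        return "Iris versicolor"
    return "Iris virginica"

print(classify_iris(1.4, 0.2, 5.1))   # short petals -> Iris setosa
print(classify_iris(6.0, 1.5, 6.5))   # matches the rule -> Iris virginica
```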
CSE5230 - Data Mining, 2002                                                         Lecture 7.3
               Why use Decision Trees? - 2

       For some applications, accuracy of classification
         or prediction is sufficient, e.g.:
           Direct mail firm needing to find a model for identifying
            customers who will respond to mail
           Predicting the stock market using past data
       In other applications it is better (sometimes
         essential) that the decision be explained, e.g.:
           Rejection of a credit application
           Medical diagnosis
       Humans generally require explanations for most
         decisions that affect them
CSE5230 - Data Mining, 2002                               Lecture 7.4
               Why use Decision Trees? - 3

       Example: When a bank rejects a credit card
         application, it is better to explain to the customer
         that the rejection was because:
           He/she is not a permanent resident of Australia AND
            He/she has been residing in Australia for < 6 months AND
            He/she does not have a permanent job.
       This is better than saying:
       “We are very sorry, but our neural network thinks
        that you are not a credit-worthy customer.” (In
        which case the customer might become angry and
        move to another bank)

CSE5230 - Data Mining, 2002                           Lecture 7.5
                                   What is a Decision Tree?
       Built from the root node (top) down to the leaf
         nodes (bottom)
       A record first enters the root node
       A test is applied to determine to which child node
         it should go next
           A variety of algorithms for choosing the initial test
            exists. The aim is to discriminate best between the
            target classes
       The process is repeated until a record arrives at a
         leaf node
       The path from the root to a leaf node provides an
         expression of a rule

        [Figure: the Iris decision tree from slide 7.3, with the
        root node, tests, child nodes, a path and the leaf nodes
        labelled]

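The traversal described above can be sketched generically. The tree encoding below (a tuple of a test function and its child nodes) is an illustrative assumption, not from the lecture:

```python
# Each internal node is (test_function, {outcome: child}); a leaf is a class label.
def classify(node, record):
    """Route a record from the root node down to a leaf label."""
    while not isinstance(node, str):      # internal node: keep testing
        test, children = node
        node = children[test(record)]     # follow the branch the test selects
    return node                           # leaf node: the predicted class

# The upper part of the Iris tree, encoded in this form
# (lower subtree simplified for brevity):
tree = (lambda r: r["petal_length"] > 2.6,
        {False: "Iris setosa",
         True: (lambda r: r["petal_width"] > 1.65,
                {True: "Iris virginica",
                 False: "Iris versicolor"})})

print(classify(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # Iris setosa
```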
CSE5230 - Data Mining, 2002                                                                Lecture 7.6
                Building a Decision Tree - 1
       Algorithms for building decision trees (DTs) begin
         by trying to find the test which does the “best job”
         of splitting the data into the desired classes
            The desired classes have to be identified at the start
       Example:    we need to describe the profiles of
         customers of a telephone company who “churn”
         (do not renew their contracts). The DT building
         algorithm examines the customer database to
         find the best splitting criterion:
            Candidate predictors: Phone technology, Age of
             customer, Time has been a customer, Gender

        The DT algorithm may discover that the
          “Phone technology” variable is best for
          separating churners from non-churners
CSE5230 - Data Mining, 2002                                            Lecture 7.7
                Building a Decision Tree - 2
   The process is repeated to discover the best splitting
      criterion for the records assigned to each node:

            Phone technology = old -> Churners
            Phone technology = new -> test “Time has been a customer”
                                      (split at 2.3 years: <= 2.3 vs > 2.3)


     Once built, the effectiveness of a decision tree can be
     measured by applying it to a collection of previously
     unseen records and observing the percentage of
     correctly classified records
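
The evaluation step can be sketched as a simple accuracy function. The holdout records and the one-test classifier here are toy assumptions, not from the lecture:

```python
def accuracy(classify, records):
    """Percentage of previously unseen records classified correctly."""
    correct = sum(1 for rec, label in records if classify(rec) == label)
    return 100.0 * correct / len(records)

# Hypothetical holdout set: (record, true label) pairs
holdout = [({"phone": "old"}, "churner"),
           ({"phone": "new"}, "non-churner"),
           ({"phone": "old"}, "non-churner")]

def toy_tree(rec):
    """A one-test tree: old phone technology -> churner."""
    return "churner" if rec["phone"] == "old" else "non-churner"

print(round(accuracy(toy_tree, holdout), 1))   # 2 of 3 correct -> 66.7
```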
CSE5230 - Data Mining, 2002                                            Lecture 7.8
                                  Example - 1

   Requirement: Classify customers who churn, i.e. do not
     renew their phone contracts (adapted from [BeS1997])

         Phone Technology (50 churners, 50 non-churners)
            old -> 20 churners, 0 non-churners
            new -> Time has been a Customer
                   (30 churners, 50 non-churners)
               > 2.3 years  -> 5 churners, 40 non-churners
               <= 2.3 years -> Age (25 churners, 10 non-churners)
                  <= 35 -> 20 churners, 0 non-churners
                  > 35  -> 5 churners, 10 non-churners

CSE5230 - Data Mining, 2002                                             Lecture 7.9
                               Example - 2

       The number of records in a given parent node
        equals the sum of the records contained in the
        child nodes
       It is quite easy to understand how the model is
        built (unlike NNs)
       It is easy to use the model
           say for a targeted marketing campaign aimed at
            customers likely to churn
       Provides intuitive ideas about the customer base
           e.g.: “Customers who have been with the company for a
            couple of years and have new phones are pretty loyal”
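
The parent/child count relationship can be checked mechanically. A small sketch using the counts from the example tree (the data layout is an illustrative assumption):

```python
# (churners, non-churners) counts from the example tree
parent = (50, 50)                      # root: Phone technology
children = [(20, 0),                   # old branch
            (30, 50)]                  # new branch

# Each class count in the parent equals the sum over its children,
# because a split partitions records and never duplicates them.
sums = tuple(sum(child[i] for child in children) for i in range(2))
print(sums == parent)   # True
```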

CSE5230 - Data Mining, 2002                            Lecture 7.10
        Use as a data mining technique - 1

       Exploration
           Analyzing the predictors and splitting criteria selected
            by the algorithm may provide interesting insights which
            can be acted upon
           e.g. if the following rule was identified:

                        IF   time as a customer < 1.1 years AND
                             sales channel = telesales
                        THEN chance of churn is 65%

           It might be worthwhile conducting a study on the way
            the telesales operators are making their calls

CSE5230 - Data Mining, 2002                                Lecture 7.11
        Use as a data mining technique - 2
  Exploration           (continued)
      Gleaning information from rules that fail
      e.g. from the phone example we obtained the rule:
                     IF   Phone technology = new AND
                          Time has been a customer <= 2.3 years AND
                          Age > 35
                     THEN there are only 15 customers (15% of total)
      Can this rule be useful?
        » Perhaps we can attempt to build up this small market
          segment. If this is possible then we have the edge over
          competitors since we have a head start in this knowledge
        » We can remove these customers from our direct
          marketing campaign since there are so few of them
CSE5230 - Data Mining, 2002                              Lecture 7.12
        Use as a data mining technique - 3
       Exploration           (continued)
           Again from the phone company example we noticed
              » There was no combination of rules to reliably
                 discriminate between churners and non-churners
                 for the small market segment mentioned on the
                 previous slide (5 churners, 10 non-churners).
           Do we consider this as an occasion where it was not
            possible to achieve our objective?
           From this failure we have learnt that age is not all that
            important for this category of churners (unlike those
            under 35)
           Perhaps we were asking the wrong questions all along -
            this warrants further analysis

CSE5230 - Data Mining, 2002                             Lecture 7.13
        Use as a data mining technique - 4

       Data Pre-processing
           Decision trees are very robust at handling different
            predictor types (numeric/categorical), and run quickly.
            Therefore they can be good for a first pass over the data
            in a data mining operation
           This will create a subset of the possibly useful
            predictors which can then be fed into another model,
            say a neural network
       Prediction
           Once the decision tree is built, it can then be used as
            a prediction tool, by using it on a new set of data

CSE5230 - Data Mining, 2002                             Lecture 7.14
            Popular Decision Tree Models:
       CART:   Classification And Regression Trees,
         developed in 1984 by a team of researchers (Leo
         Breiman et al.) from Stanford University
           Used in the DM software Darwin - from Thinking
            Machines Corporation (recently bought by Oracle)
       Often uses an entropy measure to determine the
         split point (Shannon’s Information theory):

          measure of disorder (MOD) = - Σ p log2(p)

        where p is the probability of each prediction
        value occurring in a particular node of the tree,
        and the sum runs over the prediction values.
        Other measures used include Gini and twoing.
       CART produces a binary tree
CSE5230 - Data Mining, 2002                          Lecture 7.15
                              CART - 2
  Consider the “Churn” problem from slide 7.9
       At the first node there are 100 customers to split, 50 who
        churn and 50 who don’t churn
        The MOD of this node is:
                    MOD = -0.5*log2(0.5) - 0.5*log2(0.5) = 1.00
       The algorithm will try each predictor
       For each predictor the algorithm will calculate the MOD of the
        split produced by several values to identify the optimum
       Splitting on “Phone technology” produces two nodes, one with
        30 churners and 50 non-churners, the other with 20 churners
        and 0 non-churners. The first of these has:
                    MOD = -5/8*log2(5/8) - 3/8*log2(3/8) = 0.95
        and the second has a MOD of 0
       CART will select the predictor producing nodes with the lowest
        MOD as the split point
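
The MOD values quoted above can be verified directly. A short sketch (the function name is illustrative):

```python
from math import log2

def mod(counts):
    """Measure of disorder (entropy) of a node, given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(round(mod([50, 50]), 2))   # root node -> 1.0
print(round(mod([30, 50]), 2))   # "new" node -> 0.95
print(mod([20, 0]) == 0)         # True: a pure node has no disorder
```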
CSE5230 - Data Mining, 2002                           Lecture 7.16
                              Node splitting
    An ideally good split
    Name               Churned?      Name      Churned?
    Jim                Yes           Bob       No
    Sally              Yes           Betty     No
    Steve              Yes           Sue       No
    Joe                Yes           Alex      No

    An ideally bad split
    Name               Churned?      Name      Churned?
    Jim                Yes           Bob       No
    Sally              Yes           Betty     No
    Steve              No            Sue       Yes
    Joe                No            Alex      Yes

CSE5230 - Data Mining, 2002                          Lecture 7.17
            Popular Decision Tree Models:
       CHAID:   Chi-squared Automatic Interaction
         Detector, developed by J. A. Hartigan in 1975.
           Widely used since it is distributed as part of the popular
            statistical packages SAS and SPSS
       Differs from CART in the way it identifies the split
        points. Instead of the information measure, it
        uses a chi-squared test to identify the split points
        (a statistical measure used for identifying
        independent variables)
       All predictors must be categorical or put into
        categorical form by binning
       The accuracy of the two methods, CHAID and
        CART, has been found to be similar
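
The chi-squared statistic for a candidate split can be computed from a contingency table of branch vs class counts. A minimal sketch, applied to the churn example counts (the helper name is illustrative):

```python
def chi_squared(table):
    """Chi-squared statistic for a contingency table
    (rows = split branches, columns = target classes)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# "Phone technology" split: rows new/old, columns churners/non-churners
print(round(chi_squared([[30, 50], [20, 0]]), 2))   # 25.0 - a strong split
```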
CSE5230 - Data Mining, 2002                              Lecture 7.18
            Popular Decision Tree Models:
                     ID3 & C4.5
       ID3: Iterative Dichotomiser 3, developed by the
         Australian researcher Ross Quinlan in 1979
           Used in the data mining software Clementine of Integral
            Solutions Ltd. (taken over by SPSS)
       ID3 picks predictors and their splitting values on
         the basis of the information gain provided
           Gain is the difference between the amount of
            information needed to make a correct prediction
            before and after the split has been made
           If the amount of information required is much lower after
            the split is made, then the split is said to have
            decreased the disorder of the original data

CSE5230 - Data Mining, 2002                             Lecture 7.19
                                 ID3 & C4.5 - 2
    Two candidate splits of ten records (+ and -):

        Split A:  left = ++++-        right = +----
        Split B:  left = +++++----    right = -

    Using the entropy  - Σ p log2(p):

                 left entropy             right entropy            start entropy
    A    -4/5*log2(4/5)              -4/5*log2(4/5)           -5/10*log2(5/10)
          -1/5*log2(1/5) = 0.72       -1/5*log2(1/5) = 0.72    -5/10*log2(5/10) = 1.0
    B    -5/9*log2(5/9)              -1/1*log2(1/1)
          -4/9*log2(4/9) = 0.99       = 0
CSE5230 - Data Mining, 2002                                            Lecture 7.20
                              ID3 & C4.5 - 3
                           Weighted Entropy              Gain
                    A      (5/10)*0.72 + (5/10)*0.72     1.0 - 0.72 = 0.28
                           = 0.72
                    B      (9/10)*0.99 + (1/10)*0        1.0 - 0.89 = 0.11
                           = 0.89

        Split A will be selected
       C4.5 introduces a number of extensions to ID3:
           Handles unknown field values in training set
           Tree pruning method
           Automated rule generation
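
The weighted-entropy and gain calculation above can be sketched as (function names are illustrative):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent, children):
    """Information gain: parent entropy minus weighted child entropy."""
    total = sum(parent)
    weighted = sum(sum(c) / total * entropy(c) for c in children)
    return entropy(parent) - weighted

# Split A: (4+, 1-) and (1+, 4-); Split B: (5+, 4-) and (0+, 1-)
print(round(gain([5, 5], [[4, 1], [1, 4]]), 2))   # 0.28 -> A is selected
print(round(gain([5, 5], [[5, 4], [0, 1]]), 2))   # 0.11
```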

CSE5230 - Data Mining, 2002                            Lecture 7.21
                 Strengths and Weaknesses

       Strengths of decision trees
           Able to generate understandable rules
           Classify with very little computation
           Handle both continuous and categorical data
           Provide a clear indication of which variables are
            most important for prediction or classification
       Weaknesses
           Not appropriate for estimation or prediction tasks
            (income, interest rates, etc.)
           Problematic with time series data (much pre-processing
            required), can be computationally expensive

CSE5230 - Data Mining, 2002                             Lecture 7.22
                               References

       [SGI2001] Silicon Graphics Inc. MLC++ Utilities
        Manual, 2001
       [BeL1997] J. A. Berry and G. Linoff, Data Mining
        Techniques: For Marketing, Sales, and Customer
        Support, John Wiley & Sons Inc.,1997
       [BeS1997] A. Berson and S. J. Smith, Data
        Warehousing, Data Mining and OLAP, McGraw
        Hill, 1997

CSE5230 - Data Mining, 2002                   Lecture 7.23
