DSCI 4520/5240 DATA MINING (DBDSS)
University of North Texas

DSCI 4520/5240 Lecture 4

Decision Tree Algorithms

Some slide material taken from: Witten & Frank 2000, Olson & Shi 2007,
de Ville 2006, SAS Education 2005
Lecture 4 - 1
Objective

    Review of some Decision Tree algorithms.

Lecture 4 - 2
Decision Trees: Credit risk example

This example is related to determining credit risks. We have a total of 10 people: 6 are good risks and 4 are bad. We apply the first split to the tree based on employment status. When we break this down, we find that 7 are employed and 3 are not employed. Of the 3 that are not employed, all are bad credit risks.

Note that here we cannot split the not-employed node any further, since all of its data is grouped into one class. This is called a pure node. The other node (employed = yes), however, can be split again based on a different criterion, so we can continue to grow that side of the tree.

CORRESPONDING RULES:
•IF employed = yes AND married = yes THEN risk = good
•IF employed = yes AND married = no THEN risk = good
•IF employed = no THEN risk = bad
Lecture 4 - 3
Decision Tree performance

Confidence is the degree of accuracy of a rule.
Support is the degree to which the rule conditions occur in the data.
EXAMPLE: if 10 customers purchased Zane Grey’s The Young Pitcher and 8 of them also purchased The Short Stop, the rule {IF basket has The Young Pitcher THEN basket has The Short Stop} has confidence of 0.80. If these were the only 10 purchases involving these books out of 10,000,000 purchases, the support is only 0.000001.
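
For concreteness, here is a minimal Python sketch (not part of the original slides) that computes these two measures from the counts in the example above; the variable names are illustrative only.

```python
# Confidence and support for the rule
# {IF basket has The Young Pitcher THEN basket has The Short Stop},
# using the (hypothetical) counts from the example above.
antecedent_count = 10          # baskets containing The Young Pitcher
both_count = 8                 # baskets containing both books
total_baskets = 10_000_000     # all purchases in the database

confidence = both_count / antecedent_count     # 0.80
support = antecedent_count / total_baskets     # 0.000001, as defined on this slide

print(f"confidence = {confidence:.2f}, support = {support:.7f}")
```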

Lecture 4 - 4
Rule Interestingness

Interestingness is the idea that Data Mining discovers something unexpected.
EXAMPLE: consider a grocery rule of the form {IF basket has … THEN basket has bacon}. Suppose the confidence level is 0.90 and the support level is 0.20. This may be a useful rule, but it may not be interesting if the grocer was already aware of this association.
Recall the definition of DM as the discovery of previously unknown knowledge!

Lecture 4 - 5
Rule Induction algorithms

Rule induction algorithms are recursive algorithms that identify data partitions that progressively separate the classes of the outcome.
The partitions are then organized into a decision tree.
Common Algorithms:

   1R                   CHAID
   ID3                  CN2
   C4.5/C5.0            BruteDL
   CART                 SDL

Lecture 4 - 6
Illustration of two Tree algorithms

   1R and Discretization in 1R
   Naïve Bayes Classification
   ID3: Min Entropy and Max Info Gain

Lecture 4 - 7

1R

Lecture 4 - 8
1R: Inferring Rudimentary Rules

1R learns a 1-level decision tree.
   In other words, it generates a set of rules that all test on one particular attribute.
Basic version (assuming nominal attributes):
   One branch for each of the attribute’s values
   Each branch assigns the most frequent class
   Error rate: the proportion of instances that don’t belong to the majority class of their corresponding branch
   Choose the attribute with the lowest error rate

Lecture 4 - 9
Pseudo-code for 1R

For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Let’s apply 1R on the weather data:
   Consider the first (outlook) of the 4 attributes (outlook, temp, humidity, windy). Consider all of its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you get all 4 sets of rules. (A short runnable sketch of this procedure appears after the weather table below.)
Lecture 4 - 10
A simple example: Weather Data

Outlook    Temp   Humidity   Windy   Play?
Sunny     Hot      High     False    No
Sunny     Hot      High     True     No
Overcast   Hot      High     False   Yes
Rainy     Mild     High     False   Yes
Rainy     Cool   Normal     False   Yes
Rainy     Cool   Normal     True     No
Overcast   Cool   Normal     True    Yes
Sunny     Mild     High     False    No
Sunny     Cool   Normal     False   Yes
Rainy     Mild   Normal     False   Yes
Sunny     Mild   Normal     True    Yes
Overcast   Mild     High     True    Yes
Overcast   Hot    Normal     False   Yes
Rainy     Mild     High     True     No
Lecture 4 - 11
Evaluating the Weather Attributes in 1R

[Table of the 1R rule sets and error rates for each of the four attributes; not reproduced in this text version. (*) indicates a random choice between two equally likely outcomes.]
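
The error rates in that table can be recomputed with a short script. Below is a minimal Python sketch (my own illustration, following the 1R pseudo-code two slides back) applied to the weather table above; it prints each attribute’s rule set and error count.

```python
from collections import Counter

# Weather data transcribed from the table above.
columns = ["Outlook", "Temp", "Humidity", "Windy"]
data = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]

def one_r(data, attr):
    """Build the 1R rule set for one attribute and count its errors."""
    rules, errors = {}, 0
    for value in {row[attr] for row in data}:
        classes = Counter(row[-1] for row in data if row[attr] == value)
        majority, hits = classes.most_common(1)[0]
        rules[value] = majority
        errors += sum(classes.values()) - hits   # instances outside the majority class
    return rules, errors

for attr, name in enumerate(columns):
    rules, errors = one_r(data, attr)
    print(f"{name}: {errors}/{len(data)} errors, rules = {rules}")
```

For this table, outlook and humidity tie at 4 errors out of 14, while temperature and windy each make 5, so 1R keeps outlook or humidity (ties are broken arbitrarily).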

Lecture 4 - 12
Decision tree for the weather data

[Tree diagram: Outlook at the root; the sunny branch tests Humidity (high → no, normal → yes); the overcast branch → yes; the rainy branch tests Windy (false → yes, true → no).]

Lecture 4 - 13
Discretization in 1R

Consider continuous Temperature data, after sorting them in ascending order:
65 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No

One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No

To avoid overfitting, 1R requires a minimum number (here 3) of observations of the majority class in each partition; a partition extends past this minimum when there is a “run” of the same class:
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No

If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No
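
As a quick check (a sketch, not from the slides), the error rate of this final rule set can be computed directly from the 14 sorted temperatures:

```python
# Temperatures and classes from the sorted list above.
temps = [65, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
         "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

# Final discretized 1R rule: temperature <= 77.5 -> Yes, otherwise -> No.
predictions = ["Yes" if t <= 77.5 else "No" for t in temps]
errors = sum(p != y for p, y in zip(predictions, play))
print(f"{errors}/{len(temps)} errors")   # 5/14 errors for the temperature rule set
```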
Lecture 4 - 14

1R was described in a paper by Holte (1993).
   It contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data).
   The minimum number of instances was set to 6 after some experimentation.
   1R’s simple rules performed not much worse than much more complex decision trees.
   Simplicity-first pays off!

Lecture 4 - 15

Naïve Bayes Classification

Lecture 4 - 16
Statistical Decision Tree Modeling

1R uses one attribute at a time and chooses the one that works best.
Consider the “opposite” of 1R: Use all the attributes.

Let’s first make two assumptions: Attributes are
• Equally important
• Statistically independent

Although based on assumptions that are almost
never correct, this scheme works well in practice!

Lecture 4 - 17
Probabilities for the Weather Data

[Table of counts and conditional probabilities (contingencies) for each attribute value, given Play = Yes and Play = No; not reproduced in this text version.]

A new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True.
How likely is it to get the attribute values of this new day?
Lecture 4 - 18
Bayes’ Rule

Probability of event H given evidence E:
P(H|E) = P(E|H) P(H) / P(E)
WHERE: H = target value, E = input variable values

“A priori” probability of H: P(H)
(Probability of event before evidence has been seen)

“A posteriori” probability of H: P(H|E)
(Probability of event after evidence has been seen)

Lecture 4 - 19
Naïve Bayes Classification

Classification learning: what’s the probability of the class
given an instance?
Evidence E = instance
Event H = class value for instance
Naïve Bayes assumption: evidence can be split into
independent parts (i.e. attributes of instance!)

P(H|E) = P(E1|H) P(E2|H) … P(En|H) P(H) / P(E)

Lecture 4 - 20
Naïve Bayes on the Weather Data

Evidence E: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True

P( Yes | E) = (P( Outlook = Sunny | Yes) ×
P( Temperature = Cool | Yes) ×
P( Humidity = High | Yes) ×
P( Windy = True | Yes) × P(Yes)) / P(E)
P( Yes | E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
= 0.0053 / P(E)
P( No | E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E)
= 0.0206 / P(E)

Note that P(E) will disappear when we
normalize!
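
A minimal Python sketch (my own illustration) that reproduces this calculation and then normalizes the two scores:

```python
# Conditional probabilities for the new day (Sunny, Cool, High, True),
# read from the weather-data counts: 9 "yes" days and 5 "no" days.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # ~ 0.0053
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # ~ 0.0206

# P(E) cancels when we normalize the two scores so that they sum to 1.
total = p_yes + p_no
print(f"P(Yes | E) = {p_yes / total:.3f}")   # ~ 0.205
print(f"P(No  | E) = {p_no  / total:.3f}")   # ~ 0.795
```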

Lecture 4 - 21

Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated).
Why? Because classification doesn’t require accurate probability estimates, as long as maximum probability is assigned to the correct class.
However: adding too many redundant attributes will cause problems (e.g. identical attributes).

Lecture 4 - 22
Entropy and Information Gain

Lecture 4 - 23
Constructing Decision Trees in ID3, C4.5, C5.0

Normal procedure: top down, in recursive divide-and-conquer fashion.
   First: an attribute is selected for the root node and a branch is created for each possible attribute value.
   Then: the instances are split into subsets (one for each branch extending from the node).
   Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch.

The process stops if all instances have the same class (a minimal sketch of this recursion follows).
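
To make the recursion concrete, here is a minimal Python sketch of this top-down, divide-and-conquer procedure (an illustration only, not the actual ID3/C4.5 code); `rows` holds attribute-value tuples, `labels` the classes, and `attrs` the indices of attributes still available for splitting:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information before the split minus expected information after it."""
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        after += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    # Stop when the node is pure (or no attributes remain): return a class leaf.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in {r[best] for r in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        branches[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return (best, branches)   # node = (attribute index, {value: subtree})
```

Applied to the weather table with information gain as the selection criterion, this puts Outlook at the root, consistent with the tree shown a few slides later.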

Lecture 4 - 24
Which attribute to select?

[Figure: the weather data split four ways, by Outlook (sunny / overcast / rainy), Temperature (hot / mild / cool), Humidity (high / normal), and Windy (false / true), showing the mix of yes/no instances in each resulting node.]
Lecture 4 - 25
A criterion for attribute selection

•      Which is the best attribute? The one that will result in the smallest tree.
       Heuristic: choose the attribute that produces the “purest” nodes!
•      Popular impurity criterion: Information. This is the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes.
•      We can then compare the tree before the split and after the split using Information Gain = Info (before) – Info (after).
•      Information Gain increases with the average purity of the subsets that an attribute produces.
•      Strategy: choose the attribute that results in the greatest information gain.

Lecture 4 - 26
Computing Information

Information is measured in bits.
Given a probability distribution, the information required to predict an event is the distribution’s entropy.
Entropy gives the additional required information (i.e., the information deficit) in bits.
   This can involve fractions of bits!
   (The minus signs turn the negative logs back into positive values.)

Formula for computing the entropy:
Entropy (p1, p2, …, pn) = –p1 log p1 – p2 log p2 … – pn log pn
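
For example, a small Python version of this formula (base-2 logarithms, as in all the calculations on these slides):

```python
import math

def entropy(*probs):
    """Entropy in bits of a probability distribution (p1, ..., pn)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)   # treat 0*log(0) as 0

print(entropy(2/5, 3/5))    # ~ 0.971 bits
print(entropy(9/14, 5/14))  # ~ 0.940 bits
```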

Lecture 4 - 27
Weather example: attribute “outlook”

• Outlook = “Sunny”:
Info([2,3]) = entropy(2/5, 3/5) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971 bits
• Outlook = “Overcast”:
Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (by definition)
• Outlook = “Rainy”:
Info([3,2]) = entropy(3/5, 2/5) = –3/5 log(3/5) – 2/5 log(2/5) = 0.971 bits
Expected Information for attribute Outlook:
Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits.
Lecture 4 - 28
Computing the Information Gain

• Information Gain = Information Before – Information After
Gain (Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693
= 0.247 bits
• Information Gain for attributes from the Weather Data:
Gain (Outlook) = 0.247 bits
Gain (Temperature) = 0.029 bits
Gain (Humidity) = 0.152 bits
Gain (Windy) = 0.048 bits
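
These four gains can be checked with a short script; the sketch below (an illustration, not from the slides) feeds in the [yes, no] counts per attribute value read off the weather table:

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(splits):
    """Info([9,5]) minus the expected info of the per-value class counts."""
    before = entropy([9, 5])                       # 14 days: 9 yes, 5 no
    total = sum(sum(s) for s in splits)
    after = sum(sum(s) / total * entropy(s) for s in splits)
    return before - after

# [yes, no] counts per attribute value, read off the weather table.
print(round(gain([[2, 3], [4, 0], [3, 2]]), 3))   # Outlook      -> 0.247
print(round(gain([[2, 2], [4, 2], [3, 1]]), 3))   # Temperature  -> 0.029
print(round(gain([[3, 4], [6, 1]]), 3))           # Humidity     -> 0.152
print(round(gain([[6, 2], [3, 3]]), 3))           # Windy        -> 0.048
```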

Lecture 4 - 29
Continuing to split

[Figure: the outlook = sunny subset split three ways, by Temperature (hot / mild / cool), by Windy (false / true), and by Humidity (high / normal), showing the yes/no instances in each resulting node.]

Gain (Temperature) = 0.571 bits
Gain (Humidity) = 0.971 bits
Gain (Windy) = 0.020 bits
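
The same calculation restricted to the five outlook = sunny days reproduces these three gains; a small self-contained sketch (my own illustration, with the sunny rows taken from the weather table):

```python
import math
from collections import Counter

# The five outlook = sunny rows of the weather table: (temp, humidity, windy, play)
sunny = [("Hot",  "High",   "False", "No"),
         ("Hot",  "High",   "True",  "No"),
         ("Mild", "High",   "False", "No"),
         ("Cool", "Normal", "False", "Yes"),
         ("Mild", "Normal", "True",  "Yes")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [r[-1] for r in sunny]
    after = 0.0
    for value in {r[col] for r in sunny}:
        subset = [r[-1] for r in sunny if r[col] == value]
        after += len(subset) / len(sunny) * entropy(subset)
    return entropy(labels) - after

for name, col in [("Temperature", 0), ("Humidity", 1), ("Windy", 2)]:
    print(f"Gain ({name}) = {gain(col):.3f} bits")   # 0.571, 0.971, 0.020
```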

Lecture 4 - 30
Final Decision Tree

[Tree diagram: Outlook at the root; the sunny branch tests Humidity (high → no, normal → yes); the overcast branch → yes; the rainy branch tests Windy (false → yes, true → no).]

•       Not all leaves need to be pure. Sometimes identical
instances belong to different classes
•       Splitting stops when data cannot split any further

Lecture 4 - 31
Another example: Loan Application Data

Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time.

Lecture 4 - 32
Loan Example: probability calculations

All possible values for the three attributes (Age, Income, Risk)
are shown below. For each value, the probability for the
loan to be On Time (OnTime = yes) is calculated:

Lecture 4 - 33
Loan Example: Entropy calculations

Information calculations for attribute Age are shown below.
• First we calculate the probability for each value to result in Yes
• Also the probability for this value to result in No.
• Then we compute the entropy for this value as:
E = –p(yes) logp(yes) –p(no) logp(no)
• Finally we calculate Information for the entire attribute:
Inform = E1p1 + E2p2 + E3p3

Lecture 4 - 34
Loan Example: The first split

The calculations continue until we have, for each competing
attribute, the Information required to predict the outcome.
The attribute with lowest required information is also the attribute
with largest information gain, when we compare the required
information before and after the split.

[Figure: the first split, on attribute Risk, with branches low / average / high.]

Lecture 4 - 35

•      Verify the entropy, information, and information gain
calculations we did in these slides
•      Hint: All logs are base 2!!!
•      Read the SAS GSEM 5.3 text, chapter 4 (pp. 61-102)
•      Read the Sarma text, chapter 4 (pp. 113-168). Pay
particular attention to:
• Entropy calculations (p. 126)
• Profit Matrix (p. 136)
• Expected profit calculations (p. 137)
• How to use SAS EM and grow a decision tree
(pp. 143-158)

Lecture 4 - 36
