DSCI 4520/5240 DATA MINING
Lecture 4: Decision Tree Algorithms
Some slide material taken from: Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005

Objective
Review of some Decision Tree algorithms.

Decision Trees: Credit risk example
This example is related to determining credit risks. We have a total of 10 people: 6 are good risks and 4 are bad. We first split the tree on employment status and find that 7 are employed and 3 are not. Of the 3 that are not employed, all are bad credit risks, and thus we have learned something about our data. Note that this node cannot be split any further, since all of its data belong to one class; it is called a pure node. The other node, however, can be split again based on a different criterion (marital status), so we can continue to grow the tree on the left-hand side.

CORRESPONDING RULES:
• IF employed = yes AND married = yes THEN risk = good
• IF employed = yes AND married = no THEN risk = good
• IF employed = no THEN risk = bad

Decision Tree performance
Confidence is the degree of accuracy of a rule. Support is the degree to which the rule conditions occur in the data.
EXAMPLE: If 10 customers purchased Zane Grey’s The Young Pitcher and 8 of them also purchased The Short Stop, the rule {IF basket has The Young Pitcher THEN basket has The Short Stop} has confidence of 0.80. If these purchases were the only 10 to cover these books out of 10,000,000 purchases, the support is only 0.000001.

Rule Interestingness
Interestingness is the idea that Data Mining discovers something unexpected. Consider the rule {IF basket has eggs THEN basket has bacon}. Suppose the confidence level is 0.90 and the support level is 0.20.
This may be a useful rule, but it may not be interesting if the grocer was already aware of the association. Recall the definition of Data Mining as the discovery of previously unknown knowledge!

Rule Induction algorithms
These are recursive algorithms that identify data partitions of progressively better separation with respect to the outcome. The partitions are then organized into a decision tree. Common algorithms: 1R, CHAID, ID3, CN2, C4.5/C5.0, BruteDL, CART, SDL.

Illustration of two Tree algorithms
• 1R and discretization in 1R
• Naïve Bayes classification
• ID3: minimum entropy and maximum information gain

1R: Inferring Rudimentary Rules
1R learns a 1-level decision tree; in other words, it generates a set of rules that all test one particular attribute. Basic version (assuming nominal attributes):
• One branch for each of the attribute’s values
• Each branch assigns the most frequent class
• Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch
• Choose the attribute with the lowest error rate

Pseudo-code for 1R
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate

Let’s apply 1R to the weather data: consider the first (outlook) of the 4 attributes (outlook, temp, humidity, windy). Consider all its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you have all 4 sets of rules.

A simple example: Weather Data
Outlook   Temp  Humidity  Windy  Play?
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Evaluating the Weather Attributes in 1R
[Table of candidate rules and error rates not reproduced in this extraction.] (*) indicates a random choice between two equally likely outcomes.

Decision tree for the weather data
Outlook
  sunny    -> Humidity: high -> no, normal -> yes
  overcast -> yes
  rainy    -> Windy: false -> yes, true -> no

Discretization in 1R
Consider continuous Temperature data, sorted in ascending order:
65 65 68 69 70 71 72 72 75 75 80 81 83 85
Yes No Yes Yes Yes No No Yes Yes Yes No Yes Yes No
One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
To avoid overfitting, 1R instead requires a minimum number (here 3) of majority-class observations in each partition, extending a partition past any immediately following observations of the same class (a “run”):
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No

Comments on 1R
• 1R was described in a paper by Holte (1993), which contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data).
• The minimum number of instances was set to 6 after some experimentation.
• 1R’s simple rules performed not much worse than much more complex decision trees.
• Simplicity first pays off!
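The 1R pseudo-code above can be sketched in Python. This is a minimal illustration, not the course’s official code: the weather table is hard-coded and the function name `one_r` is our own.

```python
from collections import Counter

# Weather data rows: (outlook, temp, humidity, windy, play)
DATA = [
    ("sunny", "hot", "high", "false", "no"),     ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),  ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
ATTRS = ["outlook", "temp", "humidity", "windy"]

def one_r(data):
    """Return (attribute, rules, errors) for the best one-attribute rule set."""
    best = None
    for i, attr in enumerate(ATTRS):
        # Count class frequencies for each value of this attribute.
        counts = {}
        for row in data:
            counts.setdefault(row[i], Counter())[row[-1]] += 1
        # Each value predicts its most frequent class.
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: instances outside the majority class of their branch.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

attr, rules, errors = one_r(DATA)
print(f"1R chooses {attr}: {rules} ({errors}/14 errors)")
```

On this data 1R picks outlook (sunny -> no, overcast -> yes, rainy -> yes) with 4/14 errors; humidity also scores 4/14, and the tie is broken by attribute order.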
Naïve Bayes Classification

Statistical Decision Tree Modeling
1R uses one attribute at a time and chooses the one that works best. Consider the “opposite” of 1R: use all the attributes. Let’s first make two assumptions. Attributes are:
• Equally important
• Statistically independent
Although based on assumptions that are almost never correct, this scheme works well in practice!

Probabilities for the Weather Data
Table showing counts and conditional probabilities (contingencies). [Table not reproduced in this extraction.]
A new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True. Suppose the answer is Play = Yes. How likely is it to get the attribute values of this new day?

Bayes’ Rule
Probability of event H given evidence E:
P(H|E) = P(E|H) P(H) / P(E)
where H = target value and E = input variable values.
“A priori” probability of H: P(H) (probability of the event before evidence has been seen).
“A posteriori” probability of H: P(H|E) (probability of the event after evidence has been seen).

Naïve Bayes Classification
Classification learning: what’s the probability of the class given an instance?
Evidence E = instance; event H = class value for the instance.
Naïve Bayes assumption: the evidence can be split into independent parts (i.e. the attributes of the instance!):
P(H|E) = P(E1|H) P(E2|H) … P(En|H) P(H) / P(E)

Naïve Bayes on the Weather Data
Evidence E: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True.
P(Yes|E) = (P(Outlook = Sunny | Yes) × P(Temperature = Cool | Yes) × P(Humidity = High | Yes) × P(Windy = True | Yes) × P(Yes)) / P(E)
         = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0053 / P(E)
P(No|E)  = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0206 / P(E)
Note that P(E) disappears when we normalize!

Comments on Naïve Bayes Classification
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated). Why?
Because classification doesn’t require accurate probability estimates, as long as the maximum probability is assigned to the correct class. However, adding too many redundant attributes will cause problems (e.g. identical attributes).

Entropy and Information Gain

Constructing Decision Trees in ID3, C4.5, C5.0
Normal procedure: top-down, in recursive divide-and-conquer fashion.
• First: an attribute is selected for the root node and a branch is created for each possible attribute value
• Then: the instances are split into subsets (one for each branch extending from the node)
• Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch
• The process stops when all instances have the same class

Which attribute to select?
[Figure: the four candidate splits of the weather data, on Outlook, Temperature, Humidity, and Windy, showing the class labels of the instances reaching each branch.]

A criterion for attribute selection
• Which is the best attribute? The one that will result in the smallest tree. Heuristic: choose the attribute that produces the “purest” nodes!
• A popular impurity criterion is information: the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes.
• We can then compare the tree before and after the split using Information Gain = Info(before) – Info(after).
• Information Gain increases with the average purity of the subsets that an attribute produces
• Strategy: choose the attribute that results in the greatest information gain

Computing Information
• Information is measured in bits.
• Given a probability distribution, the information required to predict an event is the distribution’s entropy.
• Entropy gives the additional required information (i.e., the information deficit) in bits; this can involve fractions of bits!
• The negative signs in the entropy formula convert the negative logs back to positive values.
Formula for computing the entropy (all logs base 2):
Entropy(p1, p2, …, pn) = –p1 log p1 – p2 log p2 – … – pn log pn

Weather example: attribute “outlook”
• Outlook = “Sunny”: Info([2,3]) = entropy(2/5, 3/5) = –2/5 log(2/5) – 3/5 log(3/5) = 0.971 bits
• Outlook = “Overcast”: Info([4,0]) = entropy(1, 0) = –1 log(1) – 0 log(0) = 0 bits (taking 0 log 0 = 0 by definition)
• Outlook = “Rainy”: Info([3,2]) = entropy(3/5, 2/5) = –3/5 log(3/5) – 2/5 log(2/5) = 0.971 bits
Expected information for attribute Outlook:
Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits
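The entropy and expected-information numbers above can be checked with a few lines of Python. The helper names (`entropy`, `info`, `expected_info`) are ours, not from the slides:

```python
from math import log2

def entropy(*probs):
    """Entropy in bits; terms with p = 0 contribute 0 by convention."""
    return sum(-p * log2(p) for p in probs if p > 0)

def info(counts):
    """Information of a node holding the given [yes, no] class counts."""
    n = sum(counts)
    return entropy(*(c / n for c in counts))

def expected_info(partitions):
    """Weighted average information after splitting into the given partitions."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * info(p) for p in partitions)

print(info([2, 3]))                             # ≈ 0.971 bits (Outlook = Sunny)
print(info([4, 0]))                             # 0 bits (pure node)
print(expected_info([[2, 3], [4, 0], [3, 2]]))  # ≈ 0.693 bits for Outlook
```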
Computing the Information Gain
Information Gain = Information Before – Information After
Gain(Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
Information gain for the attributes of the Weather Data:
• Gain(Outlook) = 0.247 bits
• Gain(Temperature) = 0.029 bits
• Gain(Humidity) = 0.152 bits
• Gain(Windy) = 0.048 bits

Continuing to split
Within the Outlook = sunny branch:
[Figure: candidate sub-splits of the sunny branch on Temperature, Humidity, and Windy.]
• Gain(Temperature) = 0.571 bits
• Gain(Humidity) = 0.971 bits
• Gain(Windy) = 0.020 bits

Final Decision Tree
Outlook
  sunny    -> Humidity: high -> no, normal -> yes
  overcast -> yes
  rainy    -> Windy: false -> yes, true -> no
• Not all leaves need to be pure; sometimes identical instances belong to different classes.
• Splitting stops when the data cannot be split any further.

Another example: Loan Application Data
Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time. [Table of the 20 cases not reproduced in this extraction.]

Loan Example: probability calculations
All possible values of the three attributes (Age, Income, Risk) are considered. For each value, the probability that the loan is on time (OnTime = yes) is calculated. [Table not reproduced in this extraction.]

Loan Example: Entropy calculations
Information calculations for attribute Age:
• First we calculate the probability for each value to result in Yes,
• and also the probability for this value to result in No.
• Then we compute the entropy for this value as: E = –p(yes) log p(yes) – p(no) log p(no)
• Finally we calculate the information for the entire attribute: Info = E1·p1 + E2·p2 + E3·p3, where pi is the proportion of cases taking the i-th value of the attribute.

Loan Example: The first split
The calculations continue until we have, for each competing attribute, the information required to predict the outcome. The attribute with the lowest required information is also the attribute with the largest information gain, when we compare the required information before and after the split. Here the first split is on Risk, with branches for its values low, average, and high.

Suggested readings
• Verify the entropy, information, and information gain calculations we did in these slides. Hint: all logs are base 2!
• Read the SAS GSEM 5.3 text, chapter 4 (pp. 61-102).
• Read the Sarma text, chapter 4 (pp. 113-168). Pay particular attention to:
  • Entropy calculations (p. 126)
  • Profit Matrix (p. 136)
  • Expected profit calculations (p. 137)
  • How to use SAS EM and grow a decision tree (pp. 143-158)
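The verification exercise suggested above can be scripted. A sketch for the weather data, with the [yes, no] counts per attribute value transcribed from the slides (function names are ours):

```python
from math import log2

def info(counts):
    """Information (entropy, in bits) of a node with the given class counts."""
    n = sum(counts)
    return sum(-c / n * log2(c / n) for c in counts if c > 0)

def gain(before, partitions):
    """Information gain: info before the split minus expected info after it."""
    n = sum(before)
    return info(before) - sum(sum(p) / n * info(p) for p in partitions)

# [yes, no] counts for each value of each attribute, read off the weather data
splits = {
    "Outlook":     [[2, 3], [4, 0], [3, 2]],  # sunny, overcast, rainy
    "Temperature": [[2, 2], [4, 2], [3, 1]],  # hot, mild, cool
    "Humidity":    [[3, 4], [6, 1]],          # high, normal
    "Windy":       [[6, 2], [3, 3]],          # false, true
}
for attribute, parts in splits.items():
    print(f"Gain({attribute}) = {gain([9, 5], parts):.3f} bits")
```

The printed gains should reproduce the slide values (0.247, 0.029, 0.152, and 0.048 bits), with Outlook the winning attribute for the first split.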