
Data Mining With Decision Trees
Craig A. Struble, Ph.D.
Marquette University

Overview
Decision Trees
Rules and Language Bias
Constructing Decision Trees
Some Analyses
Heuristics
Quality Assessment
Extensions

Goals
Explore the complete data mining process
Understand decision trees as a model
Understand how to construct a decision tree
Recognize the language bias, search bias, and overfitting avoidance bias for decision trees
Be able to assess the performance of decision trees

Decision Trees
A graph (tree) based model used primarily for classification
Extensively studied
– Quinlan is the primary contributor to the field
Applications are wide ranging
– Data mining
– Aircraft flying
– Medical diagnosis
– Etc.

Decision Trees
[Figure: an example decision tree]

What Kind of Data?
Initially, we will restrict the data to having only nominal values
– We'll explore numeric/continuous values later
The number of attributes doesn't matter
– Beware of the "curse of dimensionality", though
– We'll see this later

Classification Rules
It is relatively straightforward to convert a decision tree into a set of rules for classification:
If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope and astigmatic = yes then recommendation = none

Language Bias
Decision trees are restricted to functions that can be represented by rules of the form
if X and Y then A
if X and W and V then B
if Y and V then A
That is, decision trees represent collections of implications
The rules can be combined with or:
if Y and (X or V) then A

Language Bias
Examples of functions not well represented by decision trees
– Parity: output is true if an even number of attributes are true
– Majority: output is true if more than half of the attributes are true

Propositional Logic
Essentially, decision trees can represent any function in propositional logic
– A, B, C: propositional variables
– and, or, not, => (implies), <=> (equivalent): connectives
A proposition is a statement that is either true or false
"The sky is blue." corresponds to color of sky = blue
Hence, decision trees are an example of a propositional learner.
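To make the rule view concrete, here is a minimal sketch in Python of the contact lens rule list above written as nested conditionals. The function name and the string encodings of the attribute values are illustrative, not part of the original slides:

    def recommend(age, prescription, astigmatic, tear_rate):
        # Hypothetical encoding of the rule list above; the tree tests
        # tear production rate first.
        if tear_rate == "reduced":
            return "none"
        # tear production rate = normal from here on
        if astigmatic == "no":
            if age in ("young", "pre-presbyopic"):
                return "soft"
            # age = presbyopic: the outcome depends on the prescription
            return "none" if prescription == "myope" else "soft"
        # astigmatic = yes
        if prescription == "myope":
            return "hard"
        # spectacle prescription = hypermetrope
        return "hard" if age == "young" else "none"

    print(recommend("young", "myope", "no", "normal"))  # soft

Each path from the root of the tree to a leaf corresponds to one rule, and each nested if corresponds to one attribute test along that path.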
Constructing Decision Trees
Example:

Ex   Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    Wait?
 1   Yes  No   No   Yes  Some  $$$    No    Yes  French   0-10   Yes
 2   Yes  No   No   Yes  Full  $      No    No   Thai     30-60  No
 3   No   Yes  No   No   Some  $      No    No   Burger   0-10   Yes
 4   Yes  No   Yes  Yes  Full  $      No    No   Thai     10-30  Yes
 5   Yes  No   Yes  No   Full  $$$    No    Yes  French   >60    No
 6   No   Yes  No   Yes  Some  $$     Yes   Yes  Italian  0-10   Yes
 7   No   Yes  No   No   None  $      Yes   No   Burger   0-10   No
 8   No   No   No   Yes  Some  $$     Yes   Yes  Thai     0-10   Yes
 9   No   Yes  Yes  No   Full  $      Yes   No   Burger   >60    No
10   Yes  Yes  Yes  Yes  Full  $$$    No    Yes  Italian  10-30  No
11   No   No   No   No   None  $      No    No   Thai     0-10   No
12   Yes  Yes  Yes  Yes  Full  $      No    No   Burger   30-60  Yes

Select an Attribute
Select an attribute to test at the root, e.g. Alt.

Partition the Data
Alt = Yes: {1, 2, 4, 5, 10, 12}
Alt = No: {3, 6, 7, 8, 9, 11}

Select Next Attribute
Within the Alt = Yes branch, select another attribute, e.g. Res:
Res = Yes: {1, 5, 10}
Res = No: {2, 4, 12}

Continue Selecting Attributes
Within the Res = Yes branch, test Fri:
Fri = Yes: {5, 10} -> No
Fri = No: {1} -> Yes
This process continues along a subtree until all instances have the same label.

Basic Algorithm
algorithm LearnDecisionTree(examples, attributes, default) returns a decision tree
  inputs: examples, a set of examples
          attributes, a set of attributes
          default, the default value for the goal attribute
  if examples is empty then
    return default
  else if all examples have the same value for the goal attribute then
    return that value
  else
    best = ChooseAttribute(attributes, examples)
    tree = a new decision tree with root test best
    for each value v_i of best do
      examples_i = {elements of examples with best = v_i}
      subtree = LearnDecisionTree(examples_i, attributes - best, MajorityValue(examples))
      add a branch to tree with label v_i and subtree subtree
    return tree
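The pseudocode translates almost line for line into Python. The sketch below is illustrative: examples are represented as dictionaries, ChooseAttribute simply takes the first remaining attribute (the arbitrary choice discussed under heuristics), and an out-of-attributes fallback to a majority vote is added, a case the pseudocode leaves implicit:

    from collections import Counter

    def majority_value(examples, goal):
        # MajorityValue: the most common goal value among the examples
        return Counter(e[goal] for e in examples).most_common(1)[0][0]

    def learn_decision_tree(examples, attributes, default, goal):
        # Returns a leaf label or an (attribute, {value: subtree}) pair.
        if not examples:
            return default
        labels = {e[goal] for e in examples}
        if len(labels) == 1:            # all examples agree on the goal
            return labels.pop()
        if not attributes:              # no tests left (implicit in the
            return majority_value(examples, goal)  # pseudocode): majority
        best = attributes[0]            # arbitrary ChooseAttribute
        tree = {}
        for v in sorted({e[best] for e in examples}):
            subset = [e for e in examples if e[best] == v]
            tree[v] = learn_decision_tree(
                subset, [a for a in attributes if a != best],
                majority_value(examples, goal), goal)
        return (best, tree)

Run on the restaurant table above (with the twelve rows as dictionaries, the attribute list ["Alt", "Bar", ..., "Est"], and goal "Wait?"), it builds a tree like the one grown step by step in the preceding slides.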
Analysis of Basic Algorithm
Let m be the number of attributes
Let n be the number of instances
Assumption: the depth of the tree is O(log n)
At each level of the tree, all n instances are considered (in the step examples_i = {elements of examples with best = v_i})
– O(n log n) work for a single attribute over the entire tree
The total cost is O(mn log n), since all attributes are eventually considered.

How Many Possible Decision Trees?
Assume a set of m non-goal boolean attributes
We can construct a decision tree for each boolean function over the m non-goal attributes
There are 2^m possible assignments of the attributes (rows of a truth table)
The number of different functions is the number of subsets of those rows (assign the rows in the subset the value true)
So there must be 2^(2^m) possible decision trees!
How do we select the best one?

Applying Heuristics
In the basic algorithm, the ChooseAttribute function makes an arbitrary choice of an attribute to build the tree.
We can make this function try to choose the "best" attribute, to avoid making poor choices
This in effect biases the search.

Information Theory
One method for assessing attribute quality
Described by Shannon and Weaver (1949)
Measures the expected amount of information in terms of bits
– These are not your ordinary computer bits
– Often information is fractional
Other applications
– Compression
– Feature selection
Decision tree construction with this measure is the ID3 algorithm.

Notation
Let v_i be a possible answer (value of an attribute)
Let P(v_i) be the probability of getting answer v_i from a random data element
The information content I of knowing the actual answer is

I(P(v_1), \ldots, P(v_n)) = -\sum_{i=1}^{n} P(v_i) \log_2 P(v_i)

Example
Consider a fair coin, P(heads) = P(tails) = 1/2:

I(1/2, 1/2) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1 \text{ bit}

Consider an unfair coin, P(heads) = 0.99 and P(tails) = 0.01:

I(0.99, 0.01) = -0.99 \log_2 0.99 - 0.01 \log_2 0.01 \approx 0.08 \text{ bits}

The value of the actual answer is reduced if you know there is a bias.

Application to Decision Trees
Measure the value of information after splitting the instances by an attribute A
Attribute A splits the instances E into subsets E_1, ..., E_a, where a is the number of values A can have:

\mathrm{Remainder}(A) = \sum_{i=1}^{a} \frac{|E_i|}{|E|} \, I(P(v_1^i), \ldots, P(v_n^i))

where P(v_1^i) is the probability of an element of E_i having value v_1 for the goal attribute, etc.
– the number of elements of E_i having v_1, divided by |E_i|

Application to Decision Trees
The information gain of an attribute A is

\mathrm{Gain}(A) = I(P(v_1), \ldots, P(v_n)) - \mathrm{Remainder}(A)

or the amount of information before selecting the attribute, minus how much is still needed afterwards (the values v_i are values of the goal attribute)
Heuristic: select the attribute with the highest gain

Example
Calculate the gain for Patrons and Type:

\mathrm{Gain}(Patrons) = I(6/12, 6/12) - [\, \tfrac{2}{12} I(0/2, 2/2) + \tfrac{4}{12} I(4/4, 0/4) + \tfrac{6}{12} I(2/6, 4/6) \,] \approx 0.541 \text{ bits}

\mathrm{Gain}(Type) = I(6/12, 6/12) - [\, \tfrac{2}{12} I(1/2, 1/2) + \tfrac{2}{12} I(1/2, 1/2) + \tfrac{4}{12} I(2/4, 2/4) + \tfrac{4}{12} I(2/4, 2/4) \,] = 0 \text{ bits}

Which attribute would be chosen?
Exercise: calculate the information gain of Alt.

Carrying On
When you use information gain at lower levels of the tree, remember that the set of instances under consideration changes
– The decision tree construction procedure is recursive
– This is the single most common mistake when calculating information gain by hand

Highly Branching Attributes
Highly branching attributes can show spuriously high gain
Correct for this by using the gain ratio
– Calculate the information content of the split itself:

\mathrm{Split}(A) = I\!\left(\frac{|E_1|}{|E|}, \frac{|E_2|}{|E|}, \ldots, \frac{|E_a|}{|E|}\right)

– Calculate Gain(A) / Split(A)
– Choose the attribute with the highest gain ratio
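A short Python sketch of these formulas, assuming a two-class (Yes/No) goal attribute. The class counts in the example calls are read off the restaurant table and reproduce the Gain(Patrons) and Gain(Type) values above:

    from math import log2

    def info(*probs):
        # I(p1, ..., pn) = -sum(p * log2(p)), with 0*log2(0) taken as 0
        return -sum(p * log2(p) for p in probs if p > 0)

    def gain(counts, total):
        # counts: one (yes, no) pair per attribute value
        yes = sum(y for y, n in counts)
        no = sum(n for y, n in counts)
        remainder = sum((y + n) / total * info(y / (y + n), n / (y + n))
                        for y, n in counts)
        return info(yes / total, no / total) - remainder

    # Patrons: None -> (0 yes, 2 no), Some -> (4, 0), Full -> (2, 4)
    print(gain([(0, 2), (4, 0), (2, 4)], 12))          # ~0.541
    # Type: French (1, 1), Italian (1, 1), Thai (2, 2), Burger (2, 2)
    print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 12))  # 0.0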
Assessing Decision Trees
Two kinds of assessments we may want:
– Assess the performance of a single model
– Assess the performance of a data mining technique
What kinds of metrics can we use?
– Model size
– Accuracy

Comparing Model Size
Suppose two models have the same accuracy
Choose the model with the smaller size
– Ockham's razor: the most likely hypothesis is the simplest one that is consistent with all observations
– Can be used as a heuristic (in other data mining techniques as well)
Why?
– Efficiency
– Generality
The problem of finding the smallest model is often intractable
– NP-complete for decision tree learning

Accuracy
Measurement of the correctness of the technique
– Success rate
Definitions
– True positive: a positive instance that is correctly classified
– True negative: a negative instance that is correctly classified
– False positive: a negative instance classified as a positive one
– False negative: a positive instance classified as a negative one
Accuracy is f = (|TP| + |TN|) / |E|
Sometimes we're more accepting of some errors
– e.g. a spam filter, where a false positive (legitimate mail discarded) is worse than a false negative (spam let through)

Testing Procedures
In general, instances are split into two disjoint sets
– Training set: the set of instances used to build the model
– Test set: the set of instances used to test the accuracy
In both sets, the correct labeling is known

Testing Dilemma
We'd like both sets to be as large as possible
Try to create sets that are representative of the possible data
As the number of attributes grows, the size of a representative set grows exponentially. (Why?)

Assessing a Single Model
Each test instance constitutes a Bernoulli trial of the model
– The mean and variance of a single trial are p and p(1-p)
– For N instances, the success rate f is a random variable with mean p and variance p(1-p)/N
– For large N (> 100), the distribution of f approaches a normal distribution (bell curve)
Calculate P(-z <= X <= z) = c, where z defines the confidence interval and c defines the confidence

Assessing a Single Model
The accuracy f must be standardized to have zero mean and unit variance:

P\!\left(-z \le \frac{f - p}{\sqrt{p(1-p)/N}} \le z\right) = c

Values for c and z can be found in standard statistical texts
Solve for p, as shown in the text

Assessing a Single Model
Two models are significantly different if their confidence intervals for p do not overlap
Choose the model with the "better" confidence interval for p

Assessing a Method
n-fold cross-validation
– Split the instances into n equal-sized partitions
  Make each partition as representative as possible
– Run n training and testing sessions, treating each partition as the test set during one session
– Calculate accuracy and error rates
  Means and standard deviations
– 10-fold tests are common
Leave-one-out (or jackknife)
– A special case of n-fold cross-validation
– Use for small datasets
– Each instance is its own test set

WEKA Output
[Figures: sample WEKA decision tree output, two slides]

Extensions to Basic Algorithm
Numeric attributes
Missing values
Overfitting avoidance (pruning)
Interpreting decision trees

Handling Numeric Attributes
Recall that decision trees work on nominal attributes
– We can't have an infinite number of branches
Our approach is to convert numeric attributes into ordinal (nominal) attributes
This process is called discretization

Discretization
Binary split (weather data)
Select a breakpoint between values with maximum information gain (equivalently, lowest Remainder)
– For each breakpoint, calculate the gain for "less than" versus "greater than" the breakpoint
– For n values, this is an O(n) process (assuming the instances are already sorted)
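A sketch of this breakpoint search in Python, assuming a two-class goal and pre-sorted values; candidate breakpoints are midpoints between adjacent distinct values, and the one with the lowest remainder wins. The example call uses the temperature readings from the slide that follows:

    from math import log2

    def info(p):
        # Binary information content I(p, 1-p)
        return 0.0 if p in (0.0, 1.0) else -(p*log2(p) + (1-p)*log2(1-p))

    def best_breakpoint(values, labels):
        # values: sorted numeric readings; labels: parallel 0/1 class labels
        n = len(values)
        best = (None, float("inf"))
        for i in range(1, n):
            if values[i] == values[i - 1]:
                continue  # no breakpoint inside a run of equal values
            cut = (values[i - 1] + values[i]) / 2
            left, right = labels[:i], labels[i:]
            rem = ((len(left) / n) * info(sum(left) / len(left))
                   + (len(right) / n) * info(sum(right) / len(right)))
            if rem < best[1]:
                best = (cut, rem)
        return best  # (breakpoint, remainder)

    temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
    play  = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
    print(best_breakpoint(temps, play))  # (70.5, ~0.895)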
Discretization Example

Temp   64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play?  Yes  No   Yes  Yes  Yes  No   No   Yes  No   Yes  Yes  No   Yes  No

\mathrm{Remainder}(70.5) = \tfrac{5}{14} I(4/5, 1/5) + \tfrac{9}{14} I(4/9, 5/9) \approx 0.895 \text{ bits}

\mathrm{Remainder}(73.5) = \tfrac{8}{14} I(5/8, 3/8) + \tfrac{6}{14} I(3/6, 3/6) \approx 0.974 \text{ bits}

so 70.5 is the better breakpoint.
You can reuse continuous attributes at lower levels of the tree, but doing so makes the results harder to interpret.

Discretization
Equal-interval (equiwidth) binning splits the range into n equal-sized ranges
– (max - min) / n is the range width
– Often distributes the instances unevenly
Equal-frequency (equidepth) binning splits the range into n bins containing an equal (or nearly equal) number of instances
– Identify splits until the histogram is flat

Discretization
[Figure: equal-interval vs. equal-frequency binning]

Discretization
Entropy (information content) based
– Requires class labeling (a goal attribute)
Recursively apply the binary-split approach above
– Select the breakpoint B with the lowest Remainder
– Recursively select the breakpoint with the lowest Remainder on each of the two partitions
– Stop splitting when some criterion is met
  Minimum description length (section 5.9 of the text)
  If Gain(B) < t, for some threshold t
  – A formula for determining t is given in the book

Handling Missing Values
Ignore instances with missing values
– Pretty harsh, and the missing value might not be important
Ignore attributes with missing values
– Again, may not be feasible
Treat a missing value as another nominal value
– Fine if missing a value has significant meaning
Estimate missing values
– Data imputation: regression, nearest neighbor, mean, mode, etc.
– We'll cover this in more detail later in the semester

Handling Missing Values
Follow the leader
– An instance with a missing value for a tested attribute is sent down the branch with the most instances
Example: Temp splits into < 75 (5 instances) and >= 75 (3 instances); an instance missing Temp is included on the left (< 75) branch

Handling Missing Values
"Partition" the instance
– Send fractions of the instance down each branch, in proportion to the number of training instances on that branch
– Example: with 5 and 3 training instances on the two Temp branches, the instance counts as 5/8 on one branch and 3/8 on the other; these fractions are split further at deeper tests (the original figure shows them propagating through subsequent Sunny and Wind tests)

Pruning
To avoid overfitting, we can prune or simplify a decision tree
– More efficient, and in line with Ockham's razor
Prepruning tries to decide a priori when to stop creating subtrees
– This turns out to be fairly difficult to do well in practice
Postpruning simplifies an existing decision tree

Postpruning
Subtree replacement replaces a subtree with a single leaf node
[Figure: a Price subtree with leaves Yes (4/5), Yes (7/8), and No (1/2) is replaced by a single Yes leaf (12/15)]

Postpruning
Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent
[Figure: a Price subtree below a Res test is raised to replace the Res node; instances from the other Res branch are reclassified into the raised subtree, changing the leaf counts from 4/5, 7/8, 1/2 to 4/5, 7/9, 4/5]

Postpruning
When do we want to perform subtree replacement or subtree raising?
– Consider the estimated error of the pruning operation
Estimating error
– With a test set: similar to accuracy, except replace f = (|TP| + |TN|) / |E| with the error rate f = (|FP| + |FN|) / |E|, and use a confidence of 25%
– The confidence can be tweaked to achieve better performance
– Without a test set: count the misclassified training instances as errors, and take a pessimistic estimate of the error rate
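A sketch of the pessimistic estimate in Python, assuming the C4.5-style upper confidence limit on the binomial error rate with z ≈ 0.69 (the normal deviate for a 25% confidence level); plugging in the counts from the next slide reproduces the 0.33 vs. 0.28 comparison shown there:

    from math import sqrt

    def pessimistic_error(errors, n, z=0.69):
        # Upper confidence limit on the error rate of a node covering n
        # training instances with `errors` of them misclassified
        f = errors / n
        return ((f + z*z/(2*n)
                 + z * sqrt(f/n - f*f/n + z*z/(4*n*n)))
                / (1 + z*z/n))

    # Children of the Price node: leaves covering 5, 8 and 2 instances,
    # each with 1 misclassified instance, weighted by coverage:
    children = (5/15 * pessimistic_error(1, 5)
                + 8/15 * pessimistic_error(1, 8)
                + 2/15 * pessimistic_error(1, 2))  # ~0.33
    node = pessimistic_error(3, 15)                # ~0.28
    # node < children, so the subtree is replaced with a single leaf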
Using Error Estimate
To determine whether a node should be replaced, compare the error rate estimate for the node with the combined error rates of its children
Replace the node if its error rate is less than the combined rate of its children
Example (a Price node whose branches $, $$$, $$ cover 5, 8, and 2 instances):
Children: 5/15 err(1/5, 5) + 8/15 err(1/8, 8) + 2/15 err(1/2, 2) = 0.33
Node: err(3/15, 15) = 0.28
Since 0.28 < 0.33, the subtree is replaced with a single leaf

Interpreting Decision Trees
Although the decision tree is used for classification, you can use the classification rules derived from the tree to describe concepts
(See the rule list under Classification Rules above for the contact lens example)

Interpreting Decision Trees
A description of hard contact lens wearers, appropriate for "regular people":
In general, a nearsighted person with astigmatism and normal tear production should be prescribed hard contacts.

Summary
Decision trees are a classification technique
They can represent any function representable in propositional logic
Heuristics such as information content are used to select relevant attributes
Pruning is used to avoid overfitting
The output of decision trees can be used for descriptive as well as predictive purposes
