					Decision Trees

				 Uncertainty
• We are watching a set of independent random samples of X
• We see that X has four possible values

• So we might see: BAACBADCDADDDA…
• We transmit data over a binary serial link. We can encode
  each reading with two bits (e.g. A=00, B=01, C=10, D = 11)

                     Fewer Bits
• Someone tells us that the probabilities are not equal

• It’s possible to invent a coding for your transmission that uses
  only 1.75 bits on average per symbol. Here is one.
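A minimal sketch of such a code. The slide's original probability table is not reproduced here, so the values below are an assumption: the classic probabilities P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8, which are consistent with the 1.75-bit average.

```python
# One prefix code achieving 1.75 bits/symbol on average.
# Assumption: P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8 (the original slide's
# probability table was lost; these values match the 1.75-bit figure).
probs = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
code  = {"A": "0", "B": "10", "C": "110", "D": "111"}

avg_bits = sum(probs[s] * len(code[s]) for s in probs)
print(avg_bits)  # 1.75, versus 2.0 for the fixed-length encoding
```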
                   General Case
• Suppose X can have one of m values…

• What’s the smallest possible number of bits, on average, per
  symbol, needed to transmit a stream of symbols drawn from X’s
  distribution? It’s

$\text{entropy}(p_1, \ldots, p_m) = -p_1 \log_2 p_1 - \cdots - p_m \log_2 p_m$

• H(X) is called the entropy of X
• Shannon arrived at this formula by writing down several
  desirable properties for a measure of uncertainty and then
  finding the formula that satisfies them.
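As a sketch, the formula transcribes directly to code (using the usual convention that 0·log 0 = 0):

```python
import math

def entropy(*ps):
    """entropy(p1, ..., pm) = -sum(p_i * log2(p_i)), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(entropy(0.25, 0.25, 0.25, 0.25))   # 2.0  bits: uniform over 4 values
print(entropy(0.5, 0.25, 0.125, 0.125))  # 1.75 bits: matches the coding above
```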
  Constructing decision trees
• Normal procedure: top down in recursive divide-and-
  conquer fashion
   – First: an attribute is selected for the root node and a branch
     is created for each possible attribute value
   – Then: the instances are split into subsets (one for each
     branch extending from the node)
   – Finally: the same procedure is repeated recursively for each
     branch, using only instances that reach the branch
• Process stops if all instances have the same class
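A minimal sketch of this recursion. The data layout (instances as dicts with a "class" key) and the select_attribute helper are illustrative assumptions, not a specific library's API:

```python
def build_tree(instances, attributes, select_attribute):
    classes = [inst["class"] for inst in instances]
    # Stop: all instances share a class, or no attributes remain to split on.
    if len(set(classes)) == 1 or not attributes:
        return {"leaf": max(set(classes), key=classes.count)}  # majority class

    best = select_attribute(instances, attributes)   # e.g. by information gain
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "branches": {}}

    # One branch per attribute value; recurse using only the instances
    # that reach that branch.
    for value in {inst[best] for inst in instances}:
        subset = [inst for inst in instances if inst[best] == value]
        node["branches"][value] = build_tree(subset, remaining, select_attribute)
    return node
```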
        Which attribute to select?


[Figure: candidate tree stumps (a)–(d), one split per attribute]
        A criterion for attribute selection
    Which is the best attribute?

• The one which will result in the smallest tree
    – Heuristic: choose the attribute that produces the “purest”
      nodes

• Popular impurity criterion: entropy of nodes
    – The lower the entropy, the purer the node.

• Strategy: choose the attribute that results in the lowest
  entropy of the child nodes.
Example: attribute “Outlook”
               Information gain
• Usually the entropy of a node is not used directly; rather, the
  information gain is used: the entropy of the parent node minus
  the weighted average entropy of its children.

• Clearly, the greater the information gain, the purer the
  resulting nodes. So we choose “Outlook” for the root.
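A sketch of the computation. The slide's figures are not reproduced here, so the counts below are an assumption: the standard weather data has 9 yes / 5 no overall, and “Outlook” splits them 2/3, 4/0, and 3/2.

```python
import math

def entropy_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    n = sum(parent)
    weighted = sum(sum(c) / n * entropy_counts(c) for c in children)
    return entropy_counts(parent) - weighted

# Assumed weather-data counts: 9 yes / 5 no; Outlook = sunny/overcast/rainy
print(info_gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ~0.247 bits
```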
Continuing to split
         The final decision tree

• Note: not all leaves need to be pure; sometimes identical
  instances have different classes
Splitting stops when data can’t be split any further
   Highly-branching attributes
• The weather data with ID code
Tree stump for ID code
   Highly-branching attributes
• Subsets are more likely to be pure if there is a large
  number of values
   – Information gain is biased towards choosing attributes with a
     large number of values
   – This may result in overfitting (selection of an attribute that is
     non-optimal for prediction)
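The bias is easy to demonstrate: with one instance per ID-code branch, every child is trivially pure, so the gain equals the full entropy of the parent (class counts again assumed to be 9 yes / 5 no):

```python
import math

def entropy_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

parent = [9, 5]                           # assumed class counts, 14 instances
children = [[1, 0]] * 9 + [[0, 1]] * 5    # one pure single-instance branch per ID
gain = entropy_counts(parent) - sum(
    sum(c) / 14 * entropy_counts(c) for c in children)
print(gain)  # ~0.940 bits: the maximum possible, yet useless for prediction
```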
                  The gain ratio
• Gain ratio: a modification of the information gain that
  reduces its bias
• Gain ratio takes number and size of branches into account
  when choosing an attribute
   – It corrects the information gain by taking the intrinsic
     information of a split into account
• Intrinsic information: the entropy of the distribution of
  instances across the split’s branches, i.e., entropy with respect
  to the attribute’s values rather than the class.
Computing the gain ratio
Gain ratios for weather data
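The figures with the actual numbers are not reproduced here, but as a sketch, the gain ratio for “Outlook” under the assumed counts (branches of size 5, 4, and 5 out of 14 instances):

```python
import math

def entropy_counts(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

gain = 0.247                           # information gain of "Outlook" from before
intrinsic = entropy_counts([5, 4, 5])  # ~1.577 bits: entropy of the split itself
print(gain / intrinsic)                # gain ratio ~0.157
```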
         More on the gain ratio
• “Outlook” still comes out top but “Humidity” is now a much
  closer contender because it splits the data into two
  subsets instead of three.

• However, “ID code” still has a greater gain ratio. But its
  advantage is greatly reduced.

• Problem with gain ratio: it may overcompensate
   – May choose an attribute just because its intrinsic information
     is very low
   – Standard fix: choose an attribute that maximizes the gain
     ratio, provided the information gain for that attribute is at
     least as great as the average information gain for all the
     attributes examined.
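A sketch of that fix; the gain and gain_ratio dictionaries are assumed to be precomputed for the candidate attributes:

```python
def choose_attribute(attributes, gain, gain_ratio):
    # Standard fix: maximize the gain ratio, but only among attributes whose
    # information gain is at least the average gain over all candidates.
    avg_gain = sum(gain[a] for a in attributes) / len(attributes)
    eligible = [a for a in attributes if gain[a] >= avg_gain]
    return max(eligible, key=lambda a: gain_ratio[a])
```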
• The algorithm for top-down induction of decision trees (“ID3”)
  was developed by Ross Quinlan (University of Sydney)

• Gain ratio is just one modification of this basic algorithm
   – Led to development of C4.5, which can deal with numeric
     attributes, missing values, and noisy data

• There are many other attribute selection criteria! (But they
  make almost no difference to the accuracy of the resulting trees.)
