					Decision Trees
                          Bits
• We are watching a set of independent random samples of X
• We see that X has four possible values

     P(X=A) = 1/4     P(X=B) = 1/4     P(X=C) = 1/4     P(X=D) = 1/4

• So we might see: BAACBADCDADDDA…
• We transmit data over a binary serial link. We can encode
  each reading with two bits (e.g. A=00, B=01, C=10, D = 11)

0100001001001110110011111100…
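
As a quick check, the bit string above is just this two-bit table applied symbol by symbol to the sample sequence; a minimal sketch in Python (nothing assumed beyond the code table on this slide):

    # Fixed-length code from the slide: two bits per symbol.
    CODE = {"A": "00", "B": "01", "C": "10", "D": "11"}

    sample = "BAACBADCDADDDA"                    # sample sequence from the slide
    encoded = "".join(CODE[s] for s in sample)

    print(encoded)        # 0100001001001110110011111100
    print(len(encoded))   # 28 bits = 2 bits/symbol x 14 symbols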
                     Fewer Bits
• Someone tells us that the probabilities are not equal

     P(X=A) = 1/2     P(X=B) = 1/4     P(X=C) = 1/8     P(X=D) = 1/8

• It’s possible to invent a coding for your transmission that uses
  only 1.75 bits on average per symbol. Here is one (see the sketch
  below).
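
The coding table for this slide did not survive extraction. One prefix code consistent with the 1.75-bit figure, assuming the probabilities above, is A=0, B=10, C=110, D=111 (an assumption, not necessarily the exact table from the slide); its average length can be checked directly:

    # Assumed probabilities and a prefix code consistent with 1.75 bits/symbol.
    P    = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
    CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}

    avg_bits = sum(P[s] * len(CODE[s]) for s in P)
    print(avg_bits)   # 1.75 bits per symbol on average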
                   General Case
• Suppose X can have one of m values, V1, V2, …, Vm, with

     P(X=V1) = p1     P(X=V2) = p2     …     P(X=Vm) = pm

• What’s the smallest possible number of bits, on average, per
  symbol, needed to transmit a stream of symbols drawn from X’s
  distribution? It’s

 entropy(p1, …, pm) = − p1 log2(p1) − p2 log2(p2) − … − pm log2(pm)

• H(X) is called the entropy of X
• Shannon arrived at this formula by setting down several desirable
  properties for a measure of uncertainty and then finding the formula
  that satisfies them; a small computational check follows below
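
A minimal sketch of the entropy formula above in plain Python (the two distributions are the ones from the earlier slides):

    import math

    def entropy(probs):
        # H(p1, ..., pm) = -sum_i p_i * log2(p_i); terms with p_i = 0 contribute 0.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0  -> the "two bits" case
    print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 -> the "fewer bits" case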
  Constructing decision trees
• Normal procedure: top down in recursive divide-and-
  conquer fashion
   – First: an attribute is selected for the root node and a branch is
     created for each possible attribute value
   – Then: the instances are split into subsets (one for each
     branch extending from the node)
   – Finally: the same procedure is repeated recursively for each
     branch, using only instances that reach the branch
• The process stops when all instances reaching a node have the same
  class (see the sketch below)
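
A minimal sketch of this top-down procedure, under the assumption that instances are dicts of attribute values plus a "class" key and that choose_attribute is some selection heuristic (for example the information gain discussed later); an illustration, not the exact pseudocode of any particular system:

    from collections import Counter

    def majority_class(instances):
        return Counter(inst["class"] for inst in instances).most_common(1)[0][0]

    def build_tree(instances, attributes, choose_attribute):
        # Stop: every instance reaching this node has the same class.
        classes = {inst["class"] for inst in instances}
        if len(classes) == 1:
            return classes.pop()
        if not attributes:                 # nothing left to split on
            return majority_class(instances)

        # Select an attribute for this node and create one branch per value.
        attr = choose_attribute(instances, attributes)
        remaining = [a for a in attributes if a != attr]
        tree = {attr: {}}
        for value in {inst[attr] for inst in instances}:
            subset = [inst for inst in instances if inst[attr] == value]
            tree[attr][value] = build_tree(subset, remaining, choose_attribute)
        return tree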
        Which attribute to select?

  [Figure: four candidate tree stumps, (a)–(d), one per attribute]
         A criterion for attribute selection
• Which is the best attribute?

• The one which will result in the smallest tree
    – Heuristic: choose the attribute that produces the “purest”
      nodes


• Popular impurity criterion: entropy of nodes
    – The lower the entropy, the purer the node.


• Strategy: choose the attribute that results in the lowest weighted
  average entropy of the child nodes.
Example: attribute “Outlook”
               Information gain
• In practice, the entropy of a node is not used directly; rather, the
  information gain of a split is used:

     gain(A) = entropy(parent) − Σ_i (n_i / n) · entropy(child_i)

  where n_i is the number of instances reaching child i and n is the
  number reaching the parent.

• Clearly, the greater the information gain, the purer the resulting
  nodes. Since “Outlook” gives the highest gain, we choose it for the
  root (a computational sketch follows below).
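
A sketch of the information-gain computation referenced above. The class counts are assumed from the weather data mentioned later in these slides (9 yes / 5 no overall; “Outlook” splits them as sunny [2, 3], overcast [4, 0], rainy [3, 2]):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def info_gain(parent_counts, child_counts):
        # gain = entropy(parent) - weighted average entropy of the children
        n = sum(parent_counts)
        weighted = sum(sum(c) / n * entropy(c) for c in child_counts)
        return entropy(parent_counts) - weighted

    parent  = [9, 5]                              # assumed (yes, no) counts
    outlook = [[2, 3], [4, 0], [3, 2]]            # sunny, overcast, rainy

    print(round(info_gain(parent, outlook), 3))   # ~0.247 bits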
Continuing to split
         The final decision tree




• Note: not all leaves need to be pure; sometimes identical
  instances have different classes
Splitting stops when data can’t be split any further
   Highly-branching attributes
• The weather data with ID code
Tree stump for ID code attribute
   Highly-branching attributes
• So, subsets are more likely to be pure if the attribute has a
  large number of values
   – Information gain is biased towards choosing attributes with a
     large number of values
   – This may result in overfitting (selection of an attribute that is
     non-optimal for prediction); the sketch below illustrates the bias
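
A small demonstration of the bias, using the same assumed [9, 5] class counts as above: an ID-code-like attribute puts each instance in its own branch, so every child is trivially pure and the gain equals the full entropy of the parent, the largest value any attribute can achieve.

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    parent      = [9, 5]                           # assumed (yes, no) counts
    id_children = [[1, 0]] * 9 + [[0, 1]] * 5      # ID code: one instance per branch

    weighted = sum(sum(c) / 14 * entropy(c) for c in id_children)
    print(weighted)                     # 0.0 -- every child is pure
    print(entropy(parent) - weighted)   # ~0.940, the maximum possible gain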
                  The gain ratio
• Gain ratio: a modification of the information gain that
  reduces its bias
• Gain ratio takes number and size of branches into account
  when choosing an attribute
   – It corrects the information gain by taking the intrinsic
     information of a split into account
• Intrinsic information: the entropy of the split itself, i.e. the
  entropy of how the instances of the node are distributed over the
  attribute’s branches, ignoring class labels (see the sketch below)
Computing the gain ratio
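
The worked example for this slide did not survive extraction; below is a minimal sketch of the computation, assuming the same weather-data counts for “Outlook” as before (intrinsic information = entropy of the branch sizes, ignoring class labels):

    import math

    def entropy(counts):
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    def gain_ratio(parent_counts, child_counts):
        n = sum(parent_counts)
        weighted  = sum(sum(c) / n * entropy(c) for c in child_counts)
        gain      = entropy(parent_counts) - weighted
        intrinsic = entropy([sum(c) for c in child_counts])   # split entropy
        return gain / intrinsic

    outlook = [[2, 3], [4, 0], [3, 2]]                 # assumed counts
    print(round(gain_ratio([9, 5], outlook), 3))       # ~0.156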
Gain ratios for weather data
         More on the gain ratio
• “Outlook” still comes out top but “Humidity” is now a much
  closer contender because it splits the data into two
  subsets instead of three.

• However, “ID code” still has the greatest gain ratio, but its
  advantage is greatly reduced.

• Problem with gain ratio: it may overcompensate
   – May choose an attribute just because its intrinsic information
     is very low
   – Standard fix: choose the attribute that maximizes the gain
     ratio, provided the information gain for that attribute is at
     least as great as the average information gain for all the
     attributes examined (see the sketch below).
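
A sketch of this standard fix. The gain and gain-ratio numbers are illustrative stand-ins (assumed, roughly in line with the discussion above), not values taken from the missing slides:

    def select_attribute(gains, gain_ratios):
        # Maximize gain ratio, but only over attributes whose information gain
        # is at least the average gain of all attributes examined.
        avg_gain   = sum(gains.values()) / len(gains)
        candidates = [a for a in gains if gains[a] >= avg_gain]
        return max(candidates, key=lambda a: gain_ratios[a])

    # Illustrative (assumed) values:
    gains  = {"Outlook": 0.247, "Temperature": 0.029, "Humidity": 0.152, "Windy": 0.048}
    ratios = {"Outlook": 0.156, "Temperature": 0.019, "Humidity": 0.152, "Windy": 0.049}
    print(select_attribute(gains, ratios))   # Outlook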
                    Discussion
• The algorithm for top-down induction of decision trees (“ID3”)
  was developed by Ross Quinlan (University of Sydney, Australia)

• Gain ratio is just one modification of this basic algorithm
   – Led to development of C4.5, which can deal with numeric
     attributes, missing values, and noisy data


• There are many other attribute selection criteria! (But
  almost no difference in accuracy of result.)

				