• We are watching a set of independent random samples of X
• We see that X has four possible values
• So we might see: BAACBADCDADDDA…
• We transmit data over a binary serial link. We can encode
each reading with two bits (e.g. A=00, B=01, C=10, D=11)
• Someone tells us that the probabilities are not equal
• It’s possible…
…to invent a coding for your transmission that only uses
1.75 bits on average per symbol. Here is one.
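A minimal Python sketch of one such coding, assuming the distribution P(A)=1/2, P(B)=1/4, P(C)=P(D)=1/8 that the 1.75-bit figure implies, and the prefix code A=0, B=10, C=110, D=111 (both are assumptions, since the slide's own table is not reproduced here):

```python
# Hypothetical prefix code for the assumed distribution
code = {"A": "0", "B": "10", "C": "110", "D": "111"}
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}

# Expected number of bits per transmitted symbol
avg_bits = sum(probs[s] * len(code[s]) for s in code)
print(avg_bits)  # 1.75
```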
• Suppose X can have one of m values…
• What’s the smallest possible number of bits, on average, per
symbol, needed to transmit a stream of symbols drawn from X’s
distribution? It’s the entropy of that distribution:
  H(p_1, ..., p_m) = −p_1 log_2 p_1 − p_2 log_2 p_2 − ... − p_m log_2 p_m
• H(X) is called the entropy of X
• Well, Shannon got to this formula by setting down several
desirable properties that a measure of uncertainty should have,
and then finding the formula that satisfies them.
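A small helper that evaluates the formula (a sketch; the function name is ours, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy H(p1, ..., pm) in bits; terms with p == 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits, matching the coding sketched above
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits when all four values are equally likely
```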
Constructing decision trees
• Normal procedure: top down, in recursive divide-and-conquer
fashion (a sketch follows this list)
– First: an attribute is selected for root node and a branch is
created for each possible attribute value
– Then: the instances are split into subsets (one for each
branch extending from the node)
– Finally: the same procedure is repeated recursively for each
branch, using only instances that reach the branch
• Process stops if all instances have the same class
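A minimal sketch of this recursive procedure, with the attribute-selection criterion left as a pluggable function (the purity-based choice is discussed next); all names and the dict-based data layout are illustrative assumptions:

```python
from collections import Counter

def build_tree(instances, attributes, target, select_attribute):
    """Top-down divide-and-conquer induction. `instances` is a list of dicts,
    `attributes` a list of attribute names, `target` the class attribute,
    `select_attribute` the criterion used to pick the attribute to split on."""
    classes = [inst[target] for inst in instances]
    # Stop if all instances share the same class or no attributes remain
    if len(set(classes)) == 1 or not attributes:
        return Counter(classes).most_common(1)[0][0]   # leaf: majority class
    best = select_attribute(instances, attributes, target)
    tree = {best: {}}
    for value in sorted({inst[best] for inst in instances}):
        subset = [inst for inst in instances if inst[best] == value]
        remaining = [a for a in attributes if a != best]
        # Recurse using only the instances that reach this branch
        tree[best][value] = build_tree(subset, remaining, target, select_attribute)
    return tree
```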
Which attribute to select?
A criterion for attribute selection
Which is the best attribute?
• The one which will result in the smallest tree
– Heuristic: choose the attribute that produces the “purest” nodes
• Popular impurity criterion: entropy of nodes
– The lower the entropy, the purer the node.
• Strategy: choose the attribute that results in the lowest weighted
average entropy of the child nodes (sketched below).
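A sketch of that strategy (helper names are ours); `lowest_entropy_attribute` could be passed as the `select_attribute` argument of the earlier `build_tree` sketch:

```python
import math
from collections import Counter

def node_entropy(labels):
    """Entropy of a node's class distribution, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def avg_child_entropy(instances, attribute, target):
    """Weighted average entropy of the child nodes created by splitting on `attribute`."""
    total = len(instances)
    result = 0.0
    for value in {inst[attribute] for inst in instances}:
        subset = [inst[target] for inst in instances if inst[attribute] == value]
        result += len(subset) / total * node_entropy(subset)
    return result

def lowest_entropy_attribute(instances, attributes, target):
    """Choose the attribute whose children have the lowest weighted entropy."""
    return min(attributes, key=lambda a: avg_child_entropy(instances, a, target))
```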
Example: attribute “Outlook”
Usually the entropy of a node is not used directly; rather, the
information gain is used: the entropy of the parent node minus the
weighted average entropy of its child nodes. Clearly, the greater the
information gain, the purer the resulting child nodes. So we choose
“Outlook” for the root (the numbers are worked out in the sketch below).
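For instance, assuming the standard 14-instance weather data behind these slides (9 “yes”, 5 “no”; Outlook = sunny with 2 yes / 3 no, overcast with 4 yes / 0 no, rainy with 3 yes / 2 no), the numbers work out roughly as follows:

```python
import math

def h(p):
    """Entropy of a two-class node in which a fraction p belongs to one class."""
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

parent = h(9 / 14)                                            # ≈ 0.940 bits
children = 5/14 * h(2/5) + 4/14 * h(4/4) + 5/14 * h(3/5)      # ≈ 0.693 bits
gain_outlook = parent - children                              # ≈ 0.247 bits
print(round(gain_outlook, 3))
```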
Continuing to split
The final decision tree
• Note: not all leaves need to be pure; sometimes identical
instances have different classes
Splitting stops when the data can’t be split any further
• The weather data with ID code
Tree stump for ID code
• Subsets are more likely to be pure if there is a large
number of values
– Information gain is biased towards choosing attributes with a
large number of values
– This may result in overfitting (selection of an attribute that is
non-optimal for prediction)
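As an illustration of this bias, again assuming the 14-instance weather data, now with a unique ID code per instance:

```python
import math

# Splitting on a unique ID code gives 14 single-instance branches, each of which
# is trivially pure, so the weighted child entropy is 0 and the gain equals the
# full parent entropy; higher than any genuine attribute, yet useless for prediction.
parent_entropy = -(9/14) * math.log2(9/14) - (5/14) * math.log2(5/14)   # ≈ 0.940 bits
child_entropy = 14 * (1/14) * 0.0                                       # every child is pure
gain_id_code = parent_entropy - child_entropy
print(round(gain_id_code, 3))                                           # ≈ 0.940
```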
The gain ratio
• Gain ratio: a modification of the information gain that
reduces its bias
• Gain ratio takes number and size of branches into account
when choosing an attribute
– It corrects the information gain by taking the intrinsic
information of a split into account
• Intrinsic information: the entropy of the node to be split, computed
with respect to the values of the attribute in question (i.e. how the
instances are distributed over its branches) rather than the class.
Computing the gain ratio
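A minimal sketch of the computation (helper names are ours): the intrinsic information is the entropy of the branch sizes, and the gain ratio divides the information gain by it. Using the gains from the earlier weather-data sketches, the ID code's 14 single-instance branches give an intrinsic information of log2(14) ≈ 3.807 bits:

```python
import math

def intrinsic_info(branch_sizes):
    """Entropy of the split itself: how the instances are spread over the branches."""
    total = sum(branch_sizes)
    return -sum((n / total) * math.log2(n / total) for n in branch_sizes if n > 0)

def gain_ratio(gain, branch_sizes):
    """Information gain corrected by the intrinsic information of the split."""
    return gain / intrinsic_info(branch_sizes)

# ID code: 14 branches of one instance each (gain ≈ 0.940 from the earlier sketch)
print(round(gain_ratio(0.940, [1] * 14), 3))   # ≈ 0.247

# Outlook: branches of 5, 4 and 5 instances (gain ≈ 0.247)
print(round(gain_ratio(0.247, [5, 4, 5]), 3))  # ≈ 0.157
```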
Gain ratios for weather data
More on the gain ratio
• “Outlook” still comes out top but “Humidity” is now a much
closer contender because it splits the data into two
subsets instead of three.
• However: “ID code” still has a greater gain ratio. But its
advantage is greatly reduced.
• Problem with gain ratio: it may overcompensate
– May choose an attribute just because its intrinsic information
is very low
– Standard fix: choose the attribute that maximizes the gain
ratio, provided the information gain for that attribute is at
least as great as the average information gain over all the
attributes examined (a sketch follows below)
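One possible sketch of that selection rule (function and variable names are ours; the gain and gain-ratio figures are approximate textbook values for the weather data, assumed for illustration):

```python
def choose_split_attribute(gains, gain_ratios):
    """Pick the attribute with the highest gain ratio, restricted to attributes
    whose information gain is at least the average gain over all attributes."""
    avg_gain = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= avg_gain]
    return max(candidates, key=lambda a: gain_ratios[a])

# Illustrative call with approximate weather-data figures
gains = {"Outlook": 0.247, "Temperature": 0.029, "Humidity": 0.152, "Windy": 0.048}
ratios = {"Outlook": 0.157, "Temperature": 0.019, "Humidity": 0.152, "Windy": 0.049}
print(choose_split_attribute(gains, ratios))   # "Outlook"
```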
• The algorithm for top-down induction of decision trees (“ID3”)
was developed by Ross Quinlan (University of Sydney)
• Gain ratio is just one modification of this basic algorithm
– Led to development of C4.5, which can deal with numeric
attributes, missing values, and noisy data
• There are many other attribute selection criteria! (But
almost no difference in accuracy of result.)