Decision Tree Learning
• Widely used, practical
• Method of approximating discrete-valued functions
• Robust to noisy data
• Capable of learning disjunctive expressions
• Typical bias: prefer smaller trees
Decision trees
• Classify instances
– sorting them down the tree to a leaf node containing the class (value)
– based on attributes of instances
– branch for each value
• In general:
– disjunction of conjunctions of constraints on attribute values of instances
When to use?
• Instances presented as attribute-value pairs
• Target function has discrete values
– classification problems
• Disjunctive descriptions required
• Training data may contain
– errors
– missing attribute values
What follows?
• Basic learning algorithm (ID3)
• Hypothesis space
• Inductive bias
– Occam’s razor in general
• Overfit problem & extensions
– post-pruning
– real values, missing values, attribute costs, …
Basic DT Learning Alg.
• Most improved algorithms are variations of this one
– top-down greedy search in H
– ID3, C4.5 (Quinlan 1986, 1993)
• Top-down greedy construction
– Which attribute should be tested?
– Statistical testing with current data
– repeat for descendants
Best attribute
• Most useful in classification
– how to measure the ‘worth’
– information gain
– how well the attribute separates examples according to their classification
• Next
– precise definition for gain
– example
Entropy
• Homogeneity measure for set S
• Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)
– p(+): proportion of positive examples
– p(-): proportion of negative examples
– note: 0 log 0 is defined to be 0
– 0 if all examples are in the same class
– 1 if p(+) = p(-) = 0.5
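
A minimal sketch (not part of the original slides) of this entropy computation in Python; the helper name and the list-of-labels representation are assumptions, and the same function covers the m-ary case mentioned on the next slide:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits.

    Works for binary and m-ary classes; 0 * log(0) is treated as 0
    simply by skipping classes that do not occur.
    """
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A half-and-half split has entropy 1, a pure set has entropy 0.
assert entropy(["+", "+", "-", "-"]) == 1.0
assert entropy(["+", "+", "+"]) == 0.0
```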
Entropy...
• Information-theoretic concept
– expected minimal number of bits required to code the class of a randomly drawn member of S
– the optimal coding for 'an event with probability p' has length -log2 p
– entropy = expected length of the optimal code for the class ('+' or '-')
– generalizes to m-ary classes
Information Gain
• Expected reduction in entropy
• Gain(S, A) = Entropy(S) - sum over values v of (|Sv|/|S|) * Entropy(Sv)
– v ranges over the values of A
– Sv: members of S with A = v
– the 2nd term: expected entropy after partitioning S with A
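
A sketch of Gain(S, A) on top of the entropy helper from the previous sketch; representing examples as dicts of attribute values plus a parallel list of class labels is an assumption made for illustration:

```python
def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv|/|S|) * Entropy(Sv).

    `examples` is a list of dicts mapping attribute names to values,
    `labels` the parallel list of class labels, `attribute` the name A.
    """
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder
```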
Interpretations of gain
• Gain(S,A)
– expected reduction in entropy caused by knowing A
– information provided about the target function value given the value of A
– number of bits saved when coding a member of S knowing the value of A
• Measure used by ID3 algorithm
Example
• Gains for each attribute
– Outlook 0.246, Humidity 0.151, Wind 0.048, Temperature 0.029
• Node creation
– Outlook selected at the root node
– 3 descendants are created
– S is sorted down to descendants
– one becomes a leaf node (0 entropy)
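
A rough numerical check, assuming these gains refer to the standard 14-example PlayTennis data (9 positive, 5 negative) used in Mitchell's textbook; the helpers are the sketches above:

```python
# Entropy of the full training set for a 9+/5- split:
s = 9 * ["+"] + 5 * ["-"]
print(round(entropy(s), 3))          # 0.94  (about 0.940 bits)

# With the full attribute-value table loaded into `examples`/`labels`,
# the same call pattern reproduces the per-attribute figures, e.g.
#   information_gain(examples, labels, "Outlook")  -> about 0.246
```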
Example...
• At inner nodes
– same steps as earlier, but
– only examples sorted to that node are used in the Gain computations
– i.e. the values of attributes tested higher in the tree are already 'known'
• Continues until
– entropy = 0 (all have same class)
– all attributes are used
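
A compact sketch of the whole top-down construction described on the last few slides, reusing `information_gain` from above; the `Node` class and the handling of edge cases are illustrative choices, not ID3's actual implementation:

```python
class Node:
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute   # attribute tested here (None for a leaf)
        self.label = label           # class label (leaves only)
        self.children = {}           # attribute value -> subtree

def id3(examples, labels, attributes):
    """Top-down greedy decision-tree construction in the style of ID3."""
    if len(set(labels)) == 1:                      # entropy 0: make a leaf
        return Node(label=labels[0])
    if not attributes:                             # no attributes left
        return Node(label=max(set(labels), key=labels.count))
    # Pick the attribute with the highest information gain.
    best = max(attributes, key=lambda a: information_gain(examples, labels, a))
    node = Node(attribute=best)
    for v in set(ex[best] for ex in examples):
        rows = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == v]
        sub_ex = [ex for ex, _ in rows]
        sub_lab = [lab for _, lab in rows]
        node.children[v] = id3(sub_ex, sub_lab, [a for a in attributes if a != best])
    return node
```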
Hypothesis space of ID3
• Set of possible decision trees
– ‘simple-to-complex’ hill-climbing
– evaluation function: inf. gain
• Complete!
– contains all discrete functions of the available attributes
– including the target function
Hypothesis space...
• Maintains only one hypothesis
– cannot tell how many other consistent DTs there are
– cannot determine which query instances would be most informative
• No backtracking
– local minima possible --> extensions
• Statistics-based choices
– uses all data at each step --> robustness to noise
– compare to incremental methods that process one example at a time
Inductive Bias
• Usually many DTs are consistent with the training data
– the inductive bias is the basis by which ID3 chooses one of them
• Roughly: prefer
– shorter trees over longer ones
– ones with high gain attributes at root
• Difficult to characterize precisely
– attribute selection heuristics
– interacts closely with given data
Approx. bias of ID3
• Shorter trees are better
– this bias alone would correspond to breadth-first search of H by tree size
– ID3: an efficient greedy approximation of that BFS
• Compare the bias to Candidate-Elimination (C-E)
– ID3: complete space, incomplete search --> bias from the search strategy
– C-E: incomplete space, complete search --> bias from the expressive power of H
Restriction & preference
• ID3: preference bias, search bias
• C-E: restriction bias, language bias
• Which one is better?
– preference allows us to work with a complete hypothesis space
– restriction: the target concept c may not be in H at all
– combinations are possible (e.g. linear functions + LMS)
Why prefer short hypotheses?
• William of Occam (ca. 1320)
– Occam’s razor
– “Prefer the simplest hypothesis fitting the data”
• Sound principle?
– there are fewer short hypotheses than long ones
– a short hypothesis that fits the data is therefore less likely to be a coincidence
Difficulties
• Many other ‘small sets of hypotheses’ fit the same argument, e.g.
– DTs with exactly m nodes and n leaves
– attribute A1 at the root, A2 at node 2, …
– few such trees --> small probability that one fits the data by coincidence
– so why is the set of short trees in particular the right one to prefer?
Difficulties...
• Size of a hypothesis?
– depends on the learner’s internal representation
– two learners with different representations may reach different conclusions
• Example case
– L1: as before
– L2: boolean attribute XYZ & one node
Reject altogether?
• ‘Natural’ internal representations?
– (artificial) evolution of learning algorithms
– more successful descendants obtained by modifying the internal representation
– result: internal representations that work well with the algorithm and bias in use
– if the algorithm uses Occam’s razor, evolution creates internal representations suitable for Occam’s razor
– reason: it is easier to change the representation than the algorithm
Issues in DT learning
• Facing the real world
– how deeply to grow the DT
– continuous attributes
– attribute selection measures
– missing attribute values
– attributes with differing costs
– computational efficiency
• ID3 + these issues --> C4.5
Overfit
• The basic algorithm can overfit the training examples
• This creates problems with
– noise
– small training sets
• Informal definition
– a hypothesis fitting the training data less well may actually perform better over the whole instance space X
Overfit…
• h in H overfits D if
– exists h’ in H such that
– error(h,D) < error(h’,D) but
– error(h,X) > error(h’,X)
• Example figure
– accuracy & tree size
– on training data & test data
How can such happen?
• One reason: noise
– fitting noisy data creates a large tree h
– a smaller h’ that does not fit the noise is likely to work better
• Small samples
– coincidences are possible
– attributes unrelated to c may partition the training data well
How to avoid overfit?
• Several approaches
– stop growing the tree earlier
– allow overfitting but post-prune after construction
– the latter has been found more successful
How to decide tree size?
• What criterion to use
– use a separate test set to evaluate the effect of pruning
– use all data, apply a statistical test to estimate whether expanding/pruning is likely to produce an improvement
– use an explicit complexity measure (coding length of data & tree), stop growth when it is minimized
Training/validation sets
• Available data split
– training set: apply learning to this
– validation set: evaluate result
• accuracy
• impact of pruning
• ‘safety check’ against overfit
– common strategy: 2/3 for training
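
A small sketch of that split; the 2/3 fraction and the helper name are just the common convention mentioned above:

```python
import random

def split_data(examples, labels, frac=2/3, seed=0):
    """Shuffle and split the available data: about `frac` for training,
    the rest as a validation set for pruning decisions."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    cut = int(frac * len(idx))
    pick = lambda ids: ([examples[i] for i in ids], [labels[i] for i in ids])
    return pick(idx[:cut]), pick(idx[cut:])
```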
Reduced error pruning
• Pruning
– make an inner node a leaf node
– assign it the most common class
• Procedure
– a node is a candidate if the pruned tree performs no worse on the validation set
• leaves created by coincidences are likely to be removed
– prune the candidate giving the best accuracy
– continue until no further pruning helps
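
A rough sketch of this procedure over the `Node` trees from the earlier sketch; `classify`, `accuracy`, and labelling the new leaf with the majority class among its former leaves (rather than the node's affiliated training examples) are simplifying assumptions:

```python
def classify(node, example):
    """Sort an example down the tree to a leaf; assumes every attribute
    value seen here also appeared during training."""
    while node.attribute is not None:
        node = node.children[example[node.attribute]]
    return node.label

def accuracy(tree, examples, labels):
    hits = sum(classify(tree, ex) == lab for ex, lab in zip(examples, labels))
    return hits / len(labels)

def inner_nodes(node):
    if node.attribute is None:
        return []
    return [node] + [m for c in node.children.values() for m in inner_nodes(c)]

def leaf_labels(node):
    if node.attribute is None:
        return [node.label]
    return [l for c in node.children.values() for l in leaf_labels(c)]

def reduced_error_prune(tree, val_x, val_y):
    """Repeatedly turn into a leaf the inner node whose pruning helps
    validation accuracy most; stop when no pruning performs at least
    as well as the current tree."""
    while True:
        best_node, best_acc = None, accuracy(tree, val_x, val_y)
        for node in inner_nodes(tree):
            saved = (node.attribute, node.children, node.label)
            labels_below = leaf_labels(node)
            # Temporarily prune: make the node a majority-class leaf.
            node.attribute, node.children = None, {}
            node.label = max(set(labels_below), key=labels_below.count)
            acc = accuracy(tree, val_x, val_y)
            node.attribute, node.children, node.label = saved
            if acc >= best_acc:
                best_node, best_acc = node, acc
        if best_node is None:
            return tree
        labels_below = leaf_labels(best_node)
        best_node.attribute, best_node.children = None, {}
        best_node.label = max(set(labels_below), key=labels_below.count)
```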
Reduced error pruning…
• If plenty of data is available
– training set
– validation set used for pruning
– test set to measure accuracy
• If not
– alternative methods (will follow)
– multiple partitioning & averaging
Rule Post-Pruning
• Procedure (C4.5 uses a variant)
– infer the DT as usual (allow overfitting)
– convert the tree to rules (one per root-to-leaf path)
– prune each rule independently
• remove preconditions if the result is more accurate
– sort the rules by estimated accuracy
– apply the rules in this order when classifying
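
A sketch of the conversion and per-rule pruning, using trees from the earlier `Node` sketch and plain validation-set accuracy as the estimate (the next slide discusses C4.5's pessimistic alternative):

```python
def tree_to_rules(node, path=()):
    """One rule per root-to-leaf path: ([(attribute, value), ...], class)."""
    if node.attribute is None:
        return [(list(path), node.label)]
    rules = []
    for value, child in node.children.items():
        rules += tree_to_rules(child, path + ((node.attribute, value),))
    return rules

def rule_accuracy(preconds, label, examples, labels):
    """Accuracy of one rule over the examples its preconditions cover."""
    covered = [lab for ex, lab in zip(examples, labels)
               if all(ex[a] == v for a, v in preconds)]
    return sum(lab == label for lab in covered) / len(covered) if covered else 0.0

def prune_rule(preconds, label, examples, labels):
    """Greedily drop preconditions while estimated accuracy does not drop."""
    preconds = list(preconds)
    improved = True
    while improved and preconds:
        improved = False
        base = rule_accuracy(preconds, label, examples, labels)
        for p in list(preconds):
            trial = [q for q in preconds if q != p]
            if rule_accuracy(trial, label, examples, labels) >= base:
                preconds, improved = trial, True
                break
    return preconds, label
```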
Rule post-pruning…
• Estimating the accuracy
– separate validation set
– training data & pessimistic estimates
• data is too favorable for the rules
• compute accuracy & standard deviation
• take lower bound from given confidence
level (e.g. 95%) as the measure
• very close to observed one for large sets
• not statistically valid but works
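
A sketch of the lower-bound idea: subtract a multiple of the standard error from the observed accuracy, so rules covering few examples are penalized; the exact procedure in C4.5 differs in details, and z ≈ 1.96 here is just the usual two-sided 95% value:

```python
import math

def pessimistic_accuracy(correct, n, z=1.96):
    """Lower end of an approximate confidence interval around the
    observed accuracy; the penalty shrinks as n grows, so for large
    sets the estimate is very close to the observed accuracy."""
    acc = correct / n
    return acc - z * math.sqrt(acc * (1.0 - acc) / n)

print(pessimistic_accuracy(14, 16))     # well below the observed 0.875
print(pessimistic_accuracy(875, 1000))  # close to the observed 0.875
```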
Why convert to rules?
• Distinguishes the different contexts in which a node is used
– a separate pruning decision can be made for each path
• Removes the distinction between the root and inner nodes
– no bookkeeping on how to reorganize the tree if the root node is pruned
Continuous values
• Define a new discrete-valued attribute
– partition the continuous value into a discrete set of intervals
– Ac = true iff A < c
– how to select the best c? (information gain)
• Example case
– sort the examples by the continuous value
– identify boundaries where the classification changes
Continuous values…
• Fact
– the value of c maximizing information gain lies on such a boundary
• Evaluation
– compute gain for each boundary
• Extensions
– multiple values
– LTUs based on many attributes
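
A sketch of the threshold search just described, reusing the `entropy` helper; the numbers in the comment are illustrative only:

```python
def best_threshold(values, labels):
    """Candidate thresholds are midpoints between adjacent sorted values
    whose class labels differ; return the candidate maximizing the
    information gain of the derived boolean test `value < c`."""
    pairs = sorted(zip(values, labels))
    candidates = [(a + b) / 2
                  for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
    if not candidates:                 # all examples share one class
        return None
    def gain(c):
        left = [lab for v, lab in pairs if v < c]
        right = [lab for v, lab in pairs if v >= c]
        n = len(pairs)
        return (entropy(labels)
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
    return max(candidates, key=gain)

# e.g. best_threshold([40, 48, 60, 72, 80, 90],
#                     ["No", "No", "Yes", "Yes", "Yes", "No"])  -> 54.0
```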
Alternative selection measures
• The information gain measure favors attributes with many values
– they separate the data into many small subsets
– high gain, poor prediction
• Gain ratio measure
– penalizes gain with the ‘split information’
– sensitive to how broadly & uniformly the attribute splits the data
Split information
• Entropy of S with respect to the values of A
– earlier: entropy of S with respect to the target values
• GR(S, A) = Gain(S, A) / SI(S, A)
• Discourages selection of attributes with
– many uniformly distributed values
– SI for n uniform values: log2 n; for a boolean attribute: 1
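
A sketch of SI and GR on top of the earlier helpers; returning infinity in the degenerate case is just a placeholder for 'undefined' (see the next slide):

```python
def split_information(examples, attribute):
    """Entropy of S with respect to the values of A rather than the class."""
    return entropy([ex[attribute] for ex in examples])

def gain_ratio(examples, labels, attribute):
    si = split_information(examples, attribute)
    if si == 0.0:            # a single value covers all of S
        return float("inf")
    return information_gain(examples, labels, attribute) / si
```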
Practical issues on SI
• Degenerate case: one value of A covers nearly all of S
– some |Si| is close to |S|
– SI is 0 or very small
– GR becomes undefined or very large
• Apply a heuristic to select attributes
– compute Gain first
– compute GR only when Gain is large enough (e.g. above average)
Another alternative
• Distance-based measure
– define a metric between partitions of the data
– evaluate an attribute by the distance between the partition it creates and the perfect partition
– choose the attribute whose partition is closest
• Shown
– not to be biased towards attributes with large value sets
Missing values
• Estimate the value
– from other examples with a known value
• To compute Gain(S, A) when A(x) is unknown
– assign the most common value of A in S
– or the most common value among examples with class c(x)
– or assign a probability to each value and distribute fractional counts of x down the branches
• Similar techniques are used during classification
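
A sketch of the simplest strategy above; treating `None` as the missing-value marker is an assumption of this illustration:

```python
from collections import Counter

def fill_missing(examples, attribute):
    """Replace missing values of `attribute` with the most common known
    value; a per-class variant would restrict the count to examples
    sharing the class of the example being filled."""
    known = [ex[attribute] for ex in examples if ex[attribute] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    for ex in examples:
        if ex[attribute] is None:
            ex[attribute] = most_common
    return examples
```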
Attributes with differing costs
• Measuring an attribute costs something
– prefer cheap attributes if possible
– use costly ones only if they give good gain
– introduce a cost term into the selection measure
– no guarantee of finding the optimum, but biases the search towards cheaper attributes
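
One way the cost term has appeared in the literature (as recounted in Mitchell's textbook) is to divide the squared gain by the measurement cost; a sketch reusing `information_gain`, with the exact trade-off formula best treated as an assumption:

```python
def cost_sensitive_gain(examples, labels, attribute, cost):
    """Bias selection towards cheap attributes by dividing the squared
    information gain by the attribute's measurement cost; this gives a
    preference, not a guarantee of a cost-optimal tree."""
    return information_gain(examples, labels, attribute) ** 2 / cost
```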
Attributes with costs...
• Example applications
– robot & sonar: time required to
position
– medical diagnosis: cost of a
laboratory test
Summary
• Practical learning method
– discrete-valued functions
– ID3: greedy
• Complete hypothesis space
• Preference bias
• Overfit & pruning
– methods using preference bias
• Extensions
