# 20091102140505029


Lecture 5: Machine Learning

5.1 Introduction
5.2 Supervised Learning
5.3 Parametric Methods
5.4 Clustering
5.5 Nonparametric Methods
5.6 Decision Trees
5.1 Introduction
Why “Learn” ?
 Machine learning is programming computers to
optimize a performance criterion using example
data or past experience.
 There is no need to “learn” to calculate payroll
 Learning is used when:
 Human expertise does not exist (navigating on Mars),
 Humans are unable to explain their expertise (speech
recognition)
 Solution changes in time (routing on a computer
network)
 Solution needs to be adapted to particular cases (user
biometrics)

3
What We Talk About When We Talk About “Learning”
 Learning general models from data of particular
examples
 Data is cheap and abundant (data warehouses, data
marts); knowledge is expensive and scarce.
 Example in retail: Customer transactions to
consumer behavior:
People who bought “Da Vinci Code” also bought “The
Five People You Meet in Heaven” (www.amazon.com)
 Build a model that is a good and useful
approximation to the data.

4
Data Mining
 Retail: Market basket analysis, Customer
relationship management (CRM)
 Finance: Credit scoring, fraud detection
 Manufacturing: Optimization, troubleshooting
 Medicine: Medical diagnosis
 Telecommunications: Quality of service
optimization
 Bioinformatics: Motifs, alignment
 Web mining: Search engines
 ...

5
What is Machine Learning?
 Optimize a performance criterion using
example data or past experience.
 Role of Statistics: Inference from a sample
 Role of Computer science: Efficient algorithms to
solve the optimization problem and to
represent and evaluate the model for inference

6
Applications
 Association
 Supervised Learning
Classification
Regression
 Unsupervised Learning
 Reinforcement Learning

7
Learning Associations
P (Y | X ): the probability that somebody who buys X
also buys Y, where X and Y are products/services.

Example: P ( chips | beer ) = 0.7

8
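
As a rough illustration, this conditional probability can be estimated from transaction counts. The following sketch (the baskets and product names are invented purely for illustration) computes P(chips | beer) from a small list of market baskets.

```python
# Estimate P(chips | beer) from a toy list of market baskets.
baskets = [
    {"beer", "chips"}, {"beer", "chips", "salsa"},
    {"beer"}, {"milk", "chips"}, {"beer", "chips"},
]

n_beer = sum("beer" in b for b in baskets)
n_beer_and_chips = sum({"beer", "chips"} <= b for b in baskets)

# Confidence of the rule beer -> chips
p_chips_given_beer = n_beer_and_chips / n_beer
print(p_chips_given_beer)   # 3/4 = 0.75 for this toy data
```
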
Classification
 Example: Credit
scoring
 Differentiating
between low-risk
and high-risk
customers from
their income and
savings

Discriminant: IF income > θ1 AND savings > θ2
THEN low-risk ELSE high-risk
9
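
The two-threshold discriminant above can be written directly as a function. A minimal sketch; the threshold values θ1 and θ2 below are placeholders chosen only for illustration, not values from the slides.

```python
# Rule-based discriminant from the slide:
# IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk
THETA1 = 30000.0   # illustrative income threshold
THETA2 = 5000.0    # illustrative savings threshold

def credit_risk(income: float, savings: float) -> str:
    """Classify a customer as 'low-risk' or 'high-risk'."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(45000, 8000))   # low-risk
print(credit_risk(45000, 1000))   # high-risk
```
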
Classification: Applications
 Aka Pattern recognition
 Face recognition: Pose, lighting, occlusion
(glasses, beard), make-up, hair style
 Character recognition: Different handwriting
styles.
 Speech recognition: Temporal dependency.
 Use of a dictionary or the syntax of the language.
 Sensor fusion: Combine multiple modalities; e.g.,
visual (lip image) and acoustic for speech
 Medical diagnosis: From symptoms to illnesses
 ...

10
Face Recognition
Training examples of a person

Test images

AT&T Laboratories, Cambridge UK
http://www.uk.research.att.com/facedatabase.html

11
Regression

 Example: Price of a used car
 x : car attributes
y : price
y = g (x | θ )
g ( ): model, θ: parameters
e.g., a linear model: y = w x + w0

12
Regression Applications

 Navigating a car: Angle of the steering
wheel (CMU NavLab)
 Kinematics of a robot arm: given the hand position
(x, y), predict the joint angles α1 = g1(x,y), α2 = g2(x,y)

 Response surface design
13
Supervised Learning: Uses
 Prediction of future cases: Use the rule to
predict the output for future inputs
 Knowledge extraction: The rule is easy to
understand
 Compression: The rule is simpler than the
data it explains
 Outlier detection: Exceptions that are not
covered by the rule, e.g., fraud

14
Unsupervised Learning

 Learning “what normally happens”
 No output
 Clustering: Grouping similar instances
 Example applications
Customer segmentation in CRM
Image compression: Color quantization
Bioinformatics: Learning motifs

15
Reinforcement Learning

 Learning a policy: A sequence of outputs
 No supervised output but delayed reward
 Credit assignment problem
 Game playing
 Robot in a maze
 Multiple agents, partial observability, ...

16
Resources: Datasets

 UCI Repository:
http://www.ics.uci.edu/~mlearn/MLRepository.h
tml
 UCI KDD Archive:
http://kdd.ics.uci.edu/summary.data.application.
html
 Statlib: http://lib.stat.cmu.edu/
 Delve: http://www.cs.utoronto.ca/~delve/

17
Resources: Journals
 Journal of Machine Learning Research
www.jmlr.org
 Machine Learning
 Neural Computation
 Neural Networks
 IEEE Transactions on Neural Networks
 IEEE Transactions on Pattern Analysis and
Machine Intelligence
 Annals of Statistics
 Journal of the American Statistical Association
 ...
18
Resources: Conferences
 International Conference on Machine Learning (ICML)
 ICML05: http://icml.ais.fraunhofer.de/
 European Conference on Machine Learning (ECML)
 ECML05: http://ecmlpkdd05.liacc.up.pt/
 Neural Information Processing Systems (NIPS)
 NIPS05: http://nips.cc/
 Uncertainty in Artificial Intelligence (UAI)
 UAI05: http://www.cs.toronto.edu/uai2005/
 Computational Learning Theory (COLT)
 COLT05: http://learningtheory.org/colt2005/
 International Joint Conference on Artificial Intelligence (IJCAI)
 IJCAI05: http://ijcai05.csd.abdn.ac.uk/
 International Conference on Neural Networks (Europe)
 ICANN05: http://www.ibspan.waw.pl/ICANN-2005/
 ...

19
5.2 Supervised Learning
Learning a Class from Examples
 Class C of a “family car”
Prediction: Is car x a family car?
Knowledge extraction: What do people
expect from a family car?
 Output:
Positive (+) and negative (–) examples
 Input representation:
x1: price, x2 : engine power
21
Training set X

$X = \{ x^t, r^t \}_{t=1}^{N}$

$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}$

$x = [x_1, x_2]^T$
22
Class C

$(p_1 \le \text{price} \le p_2)$ AND $(e_1 \le \text{engine power} \le e_2)$
23
Hypothesis class H

$h(x) = \begin{cases} 1 & \text{if } h \text{ classifies } x \text{ as positive} \\ 0 & \text{if } h \text{ classifies } x \text{ as negative} \end{cases}$

Error of h on X:

$E(h \mid X) = \sum_{t=1}^{N} 1\left( h(x^t) \ne r^t \right)$
24
S, G, and the Version Space

most specific hypothesis, S
most general hypothesis, G

h  H, between S and G is
consistent

and make up the
version space

(Mitchell, 1997)

25
VC Dimension
 N points can be labeled in 2^N ways as +/–
 H shatters N points if there exists h ∈ H consistent
with any of these labelings:
VC(H ) = N

An axis-aligned rectangle shatters 4 points only!

26
Probably Approximately Correct
(PAC) Learning
 How many training examples N should we have, such
that with probability at least 1 ‒ δ, h has error at most ε ?
(Blumer et al., 1989)

 Each strip is at most ε/4
 Pr that we miss a strip: 1 ‒ ε/4
 Pr that N instances miss a strip: (1 ‒ ε/4)^N
 Pr that N instances miss 4 strips: 4(1 ‒ ε/4)^N
 4(1 ‒ ε/4)^N ≤ δ and (1 ‒ x) ≤ exp(‒x)
 4 exp(‒εN/4) ≤ δ and N ≥ (4/ε) log(4/δ)

27
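
Plugging numbers into the bound N ≥ (4/ε) log(4/δ) gives a feel for how many examples are needed; the sketch below uses the natural logarithm (consistent with (1 ‒ x) ≤ exp(‒x)), and the ε, δ values are arbitrary examples.

```python
import math

def pac_sample_size(eps: float, delta: float) -> int:
    """Smallest N with N >= (4/eps) * ln(4/delta), from the rectangle-learning bound."""
    return math.ceil((4.0 / eps) * math.log(4.0 / delta))

# e.g. error at most 0.05 with probability at least 0.95
print(pac_sample_size(0.05, 0.05))   # 351
```
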
Noise and Model Complexity
Use the simpler one because
 Simpler to use
(lower computational
complexity)
 Easier to train (lower
space complexity)
 Easier to explain
(more interpretable)
 Generalizes better (lower
variance - Occam’s razor)

28
Multiple Classes, Ci i=1,...,K

$X = \{ x^t, r^t \}_{t=1}^{N}$

$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \ne i \end{cases}$

Train hypotheses hi(x), i =1,...,K:

$h_i(x^t) = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \ne i \end{cases}$
29
g x   w1x  w 0
Regression
g x   w 2x 2  w 1x  w 0


X  x ,r  t

t N
t 1

rt 
 
rt  f xt  

 r          
1      N
E g | X                 t
g x   t 2

N      t 1

 r                          
1            N
E w 1 , w 0 | X                          t          t         2
 w 1x  w 0
N           t 1

30
Model Selection &
Generalization
 Learning is an ill-posed problem; data is
not sufficient to find a unique solution
 The need for inductive bias, assumptions
 Generalization: How well a model
performs on new data
 Overfitting: H more complex than C or f
 Underfitting: H less complex than C or f

31

 There is a trade-off between three factors
(Dietterich, 2003):
1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
 As N↑, E↓
 As c (H)↑, first E↓ and then E↑

32
Cross-Validation

 To estimate generalization error, we need
data unseen during training. We split the
data as
Training set (50%)
Validation set (25%)
Test (publication) set (25%)
 Resampling when there is little data

33
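
A minimal sketch of the 50/25/25 split described above, assuming X and y are NumPy arrays of the same length.

```python
import numpy as np

def split_data(X, y, seed=0):
    """Shuffle and split into 50% training, 25% validation, 25% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = len(X) // 2
    n_val = len(X) // 4
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```
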
Dimensions of a Supervised
Learner
1. Model: $g(x \mid \theta)$

2. Loss function: $E(\theta \mid X) = \sum_t L\left( r^t, g(x^t \mid \theta) \right)$

3. Optimization procedure: $\theta^* = \arg\min_\theta E(\theta \mid X)$
34
5.3 Parametric Methods
Parametric Estimation
 X = { xt }t where xt ~ p (x)
 Parametric estimation:
Assume a form for p (x | θ) and estimate
θ, its sufficient statistics, using X
e.g., N ( μ, σ2) where θ = { μ, σ2}

36
Maximum Likelihood Estimation
 Likelihood of θ given the sample X
l (θ|X) = p (X |θ) = ∏t p (xt|θ)

 Log likelihood
L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)

 Maximum likelihood estimator (MLE)
θ* = argmaxθ L(θ|X)

37
Examples: Bernoulli/Multinomial
 Bernoulli: Two states, failure/success, x in {0,1}

$P(x) = p_o^x (1 - p_o)^{1-x}$

$\mathcal{L}(p_o \mid X) = \log \prod_t p_o^{x^t} (1 - p_o)^{1 - x^t}$

MLE: $p_o = \sum_t x^t / N$

 Multinomial: K>2 states, xi in {0,1}

$P(x_1, x_2, ..., x_K) = \prod_i p_i^{x_i}$

$\mathcal{L}(p_1, p_2, ..., p_K \mid X) = \log \prod_t \prod_i p_i^{x_i^t}$

MLE: $p_i = \sum_t x_i^t / N$
38
Gaussian (Normal) Distribution
 p(x) = N ( μ, σ2):

$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu)^2}{2\sigma^2} \right]$

 MLE for μ and σ2:

$m = \frac{\sum_t x^t}{N}, \qquad s^2 = \frac{\sum_t (x^t - m)^2}{N}$
39
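
A short sketch of these maximum-likelihood estimates for 1-D data: m is the sample mean and s² the biased sample variance (dividing by N, not N ‒ 1).

```python
import numpy as np

def gaussian_mle(x):
    """ML estimates for a univariate Gaussian: m = mean, s2 = (1/N) * sum (x - m)^2."""
    m = np.mean(x)
    s2 = np.mean((x - m) ** 2)      # divides by N, not N-1
    return m, s2

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)
print(gaussian_mle(x))   # roughly (2.0, 2.25)
```
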
Bias and Variance
Unknown parameter θ
Estimator di = d (Xi) on sample Xi

Bias: bθ(d) = E [d] – θ
Variance: E [(d–E [d])2]

Mean square error:
r (d,θ) = E [(d–θ)2]
= (E [d] – θ)2 + E [(d–E [d])2]
= Bias2 + Variance

40
Bayes’ Estimator
 Treat θ as a random var with prior p (θ)
 Bayes’ rule: p (θ|X) = p(X|θ) p(θ) / p(X)

 Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
 Maximum a Posteriori (MAP): θMAP = argmaxθ
p(θ|X)
 Maximum Likelihood (ML): θML = argmaxθ
p(X|θ)
 Bayes’: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ

41
Bayes’ Estimator: Example
 xt ~ N (θ, σo2) and θ ~ N ( μ, σ2)
 θML = m
 θMAP = θBayes’ =

$E[\theta \mid X] = \frac{N/\sigma_o^2}{N/\sigma_o^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_o^2 + 1/\sigma^2}\, \mu$
42
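
A sketch of the posterior mean above: it mixes the sample mean m and the prior mean μ with weights N/σo² and 1/σ².

```python
import numpy as np

def bayes_estimate(x, mu_prior, var_prior, var_noise):
    """Posterior mean E[theta|X] for x^t ~ N(theta, var_noise), theta ~ N(mu_prior, var_prior)."""
    N, m = len(x), np.mean(x)
    w_data = N / var_noise          # N / sigma_o^2
    w_prior = 1.0 / var_prior       # 1 / sigma^2
    return (w_data * m + w_prior * mu_prior) / (w_data + w_prior)

x = np.random.default_rng(1).normal(3.0, 1.0, size=20)
print(bayes_estimate(x, mu_prior=0.0, var_prior=4.0, var_noise=1.0))
```
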
Parametric Classification

$g_i(x) = p(x \mid C_i)\, P(C_i)$

or equivalently

$g_i(x) = \log p(x \mid C_i) + \log P(C_i)$

$p(x \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right]$

$g_i(x) = -\frac{1}{2}\log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)$
43
 Given the sample $X = \{ x^t, r^t \}_{t=1}^{N}$ with

$r_i^t = \begin{cases} 1 & \text{if } x^t \in C_i \\ 0 & \text{if } x^t \in C_j,\ j \ne i \end{cases}$

 ML estimates are

$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}, \qquad s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}$

 Discriminant becomes

$g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)$
44
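
A compact sketch of this 1-D Gaussian classifier: estimate priors, means, and variances from labelled data, then pick the class with the largest discriminant gi(x). Function names are illustrative.

```python
import numpy as np

def fit_gaussian_classes(x, r):
    """x: (N,) inputs, r: (N,) integer class labels. Returns per-class (prior, mean, variance)."""
    params = {}
    for c in np.unique(r):
        xc = x[r == c]
        params[c] = (len(xc) / len(x), xc.mean(), xc.var())  # P(C_i), m_i, s_i^2 (MLE)
    return params

def discriminant(x, prior, m, s2):
    """g_i(x) = -0.5*log(2*pi) - log(s_i) - (x - m_i)^2 / (2 s_i^2) + log P(C_i)."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
            - (x - m) ** 2 / (2 * s2) + np.log(prior))

def predict(x, params):
    scores = {c: discriminant(x, *p) for c, p in params.items()}
    return max(scores, key=scores.get)
```
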
Equal variances

Single boundary at
halfway between
means

45
Variances are different

Two boundaries

46
Regression

$r = f(x) + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2)$

estimator: $g(x \mid \theta)$

$p(r \mid x) \sim N\left( g(x \mid \theta),\ \sigma^2 \right)$

$\mathcal{L}(\theta \mid X) = \log \prod_{t=1}^{N} p(x^t, r^t) = \log \prod_{t=1}^{N} p(r^t \mid x^t) + \log \prod_{t=1}^{N} p(x^t)$
47
Regression: From LogL to Error

$\mathcal{L}(\theta \mid X) = \log \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{\left( r^t - g(x^t \mid \theta) \right)^2}{2\sigma^2} \right]$

$= -N \log\left( \sqrt{2\pi}\,\sigma \right) - \frac{1}{2\sigma^2} \sum_{t=1}^{N} \left[ r^t - g(x^t \mid \theta) \right]^2$

Maximizing the log likelihood is therefore equivalent to minimizing the squared error

$E(\theta \mid X) = \frac{1}{2} \sum_{t=1}^{N} \left[ r^t - g(x^t \mid \theta) \right]^2$
48
Linear Regression

$g(x^t \mid w_1, w_0) = w_1 x^t + w_0$

Setting the derivatives of the squared error to zero gives the normal equations:

$\sum_t r^t = N w_0 + w_1 \sum_t x^t$

$\sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2$

In matrix form, $A\, w = y$ with

$A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}, \qquad w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}, \qquad y = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}$

$w = A^{-1} y$
49
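
The 2×2 system above can be solved directly; a minimal NumPy sketch with synthetic data (the true line 2x + 1 is chosen only for illustration):

```python
import numpy as np

def fit_line(x, r):
    """Solve A w = y for w = [w0, w1] using the normal equations above."""
    N = len(x)
    A = np.array([[N,       x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    y = np.array([r.sum(), (r * x).sum()])
    w0, w1 = np.linalg.solve(A, y)
    return w0, w1

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
r = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)
print(fit_line(x, r))   # approximately (1.0, 2.0)
```
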
Polynomial Regression

$g(x^t \mid w_k, \ldots, w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0$

$D = \begin{bmatrix} 1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\ 1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\ \vdots & & & & \vdots \\ 1 & x^N & (x^N)^2 & \cdots & (x^N)^k \end{bmatrix}, \qquad r = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix}$

$w = (D^T D)^{-1} D^T r$
50
Other Error Measures

 Square Error: $E(\theta \mid X) = \frac{1}{2} \sum_{t=1}^{N} \left[ r^t - g(x^t \mid \theta) \right]^2$

 Relative Square Error: $E(\theta \mid X) = \dfrac{\sum_{t=1}^{N} \left[ r^t - g(x^t \mid \theta) \right]^2}{\sum_{t=1}^{N} \left[ r^t - \bar{r} \right]^2}$

 Absolute Error: $E(\theta \mid X) = \sum_t \left| r^t - g(x^t \mid \theta) \right|$

 ε-sensitive Error: $E(\theta \mid X) = \sum_t 1\left( \left| r^t - g(x^t \mid \theta) \right| > \varepsilon \right) \left( \left| r^t - g(x^t \mid \theta) \right| - \varepsilon \right)$
51
Bias and Variance

$E\left[ (r - g(x))^2 \mid x \right] = \underbrace{E\left[ (r - E[r \mid x])^2 \mid x \right]}_{\text{noise}} + \underbrace{\left( E[r \mid x] - g(x) \right)^2}_{\text{squared error}}$

$E_X\left[ \left( E[r \mid x] - g(x) \right)^2 \mid x \right] = \underbrace{\left( E[r \mid x] - E_X[g(x)] \right)^2}_{\text{bias}^2} + \underbrace{E_X\left[ \left( g(x) - E_X[g(x)] \right)^2 \right]}_{\text{variance}}$
52
Estimating Bias and Variance
 M samples Xi={xti , rti}, i=1,...,M
are used to fit gi (x), i =1,...,M

$\bar{g}(x) = \frac{1}{M} \sum_{i} g_i(x)$

$\text{Bias}^2(g) = \frac{1}{N} \sum_t \left[ \bar{g}(x^t) - f(x^t) \right]^2$

$\text{Variance}(g) = \frac{1}{N M} \sum_t \sum_i \left[ g_i(x^t) - \bar{g}(x^t) \right]^2$
53
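
A sketch of this estimate: fit M models on M datasets resampled from a known function, then evaluate bias² and variance on a grid. The true function (sin), the noise level, and the degree-1 fits are all illustrative choices, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                            # "true" function, chosen for illustration
x_grid = np.linspace(0, np.pi, 50)

M, N = 100, 20
fits = []
for _ in range(M):
    x = rng.uniform(0, np.pi, N)
    r = f(x) + rng.normal(0, 0.3, N)
    w = np.polyfit(x, r, deg=1)       # g_i: a straight-line fit to sample i
    fits.append(np.polyval(w, x_grid))
fits = np.array(fits)                  # shape (M, len(x_grid))

g_bar = fits.mean(axis=0)                           # average fit over the M samples
bias2 = np.mean((g_bar - f(x_grid)) ** 2)           # (1/N) sum_t (g_bar - f)^2
variance = np.mean((fits - g_bar) ** 2)             # (1/(N M)) sum_t sum_i (g_i - g_bar)^2
print(bias2, variance)
```
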
Bias/Variance Dilemma
 Example: gi(x)=2 has no variance and high bias
gi(x)= ∑t rti/N has lower bias with variance

 As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
 Bias/Variance dilemma: (Geman et al., 1992)

54
Figure: bias is the gap between the true function f and the average fit ḡ; variance is the spread of the individual fits gi around ḡ.

55
Polynomial Regression

Best fit “min error”

56
Best fit, “elbow”

57
Model Selection
 Cross-validation: Measure generalization
accuracy by testing on data unused during
training
 Regularization: Penalize complex models
E’=error on data + λ model complexity

Akaike’s information criterion (AIC), Bayesian
information criterion (BIC)
 Minimum description length (MDL):
Kolmogorov complexity, shortest description of
data
 Structural risk minimization (SRM)
58
Bayesian Model Selection
 Prior on models, p(model)

$p(\text{model} \mid \text{data}) = \frac{p(\text{data} \mid \text{model})\, p(\text{model})}{p(\text{data})}$

 Regularization, when prior favors simpler
models
 Bayes, MAP of the posterior, p(model|data)
 Average over a number of models with high
posterior (voting, ensembles: Chapter 15)

59
5.4 Clustering
Semiparametric Density
Estimation
 Parametric: Assume a single model for p (x | Ci)
(Chapter 4 and 5)
 Semiparametric: p (x | Ci) is a mixture of
densities
Multiple possible explanations/prototypes:
Different handwriting styles, accents in
speech
 Nonparametric: No model; data speaks for itself
(Chapter 8)

61
Mixture Densities

$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$

where Gi are the components/groups/clusters,
P ( Gi ) the mixture proportions (priors), and
p ( x | Gi) the component densities

Gaussian mixture where p(x|Gi) ~ N ( μi , ∑i )
parameters Φ = {P ( Gi ), μi , ∑i }ki=1
unlabeled sample X={xt}t (unsupervised
learning)
62
Classes vs. Clusters
 Supervised: X = { xt ,rt }t
Classes Ci, i=1,...,K

$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$

where p ( x | Ci) ~ N ( μi , ∑i )
Φ = {P (Ci ), μi , ∑i }Ki=1, estimated by

$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \quad m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t}, \quad S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}$

 Unsupervised: X = { xt }t
Clusters Gi, i=1,...,k

$p(x) = \sum_{i=1}^{k} p(x \mid G_i)\, P(G_i)$

where p ( x | Gi) ~ N ( μi , ∑i )
Φ = {P ( Gi ), μi , ∑i }ki=1, but the labels r ti are unknown.

63
k-Means Clustering
 Find k reference vectors (prototypes/codebook
vectors/codewords) which best represent data
 Reference vectors, mj, j =1,...,k
 Use nearest (most similar) reference:

$\| x^t - m_i \| = \min_j \| x^t - m_j \|$

 Reconstruction error:

$E\left( \{m_i\}_{i=1}^{k} \mid X \right) = \sum_t \sum_i b_i^t \left\| x^t - m_i \right\|^2$

$b_i^t = \begin{cases} 1 & \text{if } \| x^t - m_i \| = \min_j \| x^t - m_j \| \\ 0 & \text{otherwise} \end{cases}$
64
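
A minimal k-means sketch that alternates the two steps implied by this reconstruction error: assign each xt to its nearest reference vector, then move each mi to the mean of its assigned points. Initialisation from random data points is just one simple choice.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """X: (N, d) data. Returns (k, d) reference vectors and (N,) assignments."""
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # initial prototypes
    for _ in range(n_iter):
        # assignment step: b_i^t = 1 for the closest reference vector
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: move each m_i to the mean of its assigned instances
        for i in range(k):
            if np.any(labels == i):
                m[i] = X[labels == i].mean(axis=0)
    return m, labels
```
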
Encoding/Decoding

x is encoded by the index i of its nearest reference vector and decoded as mi:

$b_i^t = \begin{cases} 1 & \text{if } \| x^t - m_i \| = \min_j \| x^t - m_j \| \\ 0 & \text{otherwise} \end{cases}$
65
k-means Clustering

66
Expectation-Maximization (EM)
 Log likelihood with a mixture model

$\mathcal{L}(\Phi \mid X) = \log \prod_t p(x^t \mid \Phi) = \sum_t \log \sum_{i=1}^{k} p(x^t \mid G_i)\, P(G_i)$

 Assume hidden variables z, which when known,
make optimization much simpler
 Complete likelihood, Lc(Φ |X,Z), in terms of x
and z
 Incomplete likelihood, L(Φ |X), in terms of x
68
E- and M-steps
 Iterate the two steps
1. E-step: Estimate z given X and current Φ
2. M-step: Find new Φ’ given z, X, and old Φ

$\text{E-step:} \quad Q(\Phi \mid \Phi^l) = E\left[ \mathcal{L}_C(\Phi \mid X, Z) \mid X, \Phi^l \right]$

$\text{M-step:} \quad \Phi^{l+1} = \arg\max_\Phi Q(\Phi \mid \Phi^l)$

An increase in Q increases the incomplete
likelihood:

$\mathcal{L}(\Phi^{l+1} \mid X) \ge \mathcal{L}(\Phi^{l} \mid X)$
69
EM in Gaussian Mixtures
 zti = 1 if xt belongs to Gi, 0 otherwise (analogous to
the labels r ti of supervised learning); assume
p(x|Gi)~N(μi,∑i)
 E-step:

$h_i^t \equiv P(G_i \mid x^t, \Phi^l) = \frac{p(x^t \mid G_i, \Phi^l)\, P(G_i)}{\sum_j p(x^t \mid G_j, \Phi^l)\, P(G_j)}$

 M-step:

$P(G_i) = \frac{\sum_t h_i^t}{N}, \qquad m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}, \qquad S_i^{l+1} = \frac{\sum_t h_i^t \left( x^t - m_i^{l+1} \right)\left( x^t - m_i^{l+1} \right)^T}{\sum_t h_i^t}$

Use the estimated soft labels hti in place of the
unknown labels.
70
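
A 1-D sketch of these E- and M-steps (k Gaussians with scalar variances). The Gaussian pdf is written out explicitly, and initialising the means from random data points is just one simple choice.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                     # P(G_i)
    mu = rng.choice(x, size=k, replace=False)    # component means
    var = np.full(k, x.var())                    # component variances
    for _ in range(n_iter):
        # E-step: soft labels h_i^t = P(G_i | x^t)
        dens = np.array([pi[i] * normal_pdf(x, mu[i], var[i]) for i in range(k)])
        h = dens / (dens.sum(axis=0) + 1e-300)   # shape (k, N)
        # M-step: re-estimate priors, means, variances with h as weights
        Nk = h.sum(axis=1)
        pi = Nk / len(x)
        mu = (h * x).sum(axis=1) / Nk
        var = (h * (x - mu[:, None]) ** 2).sum(axis=1) / Nk
    return pi, mu, var
```
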
P(G1|x)=h1=0.5

71
Mixtures of Latent Variable
Models
 Regularize clusters
1. Assume shared/diagonal covariance matrices
2. Use PCA/FA to decrease dimensionality:
Mixtures of PCA/FA

p xt | Gi   N mi , Vi ViT  ψi 
Can use EM to learn Vi (Ghahramani and
Hinton, 1997; Tipping and Bishop, 1999)

72
After Clustering
 Dimensionality reduction methods find
correlations between features and group features
 Clustering methods find similarities between
instances and group instances
 Allows knowledge extraction through
number of clusters,
prior probabilities,
cluster parameters, i.e., center, range of
features.
Example: CRM, customer segmentation
73
Clustering as Preprocessing
 Estimated group labels hj (soft) or bj (hard)
may be seen as the dimensions of a new k
dimensional space, where we can then
learn our discriminant or regressor.
 Local representation (only one bj is 1, all
others are 0; only few hj are nonzero) vs
Distributed representation (After PCA; all
zj are nonzero)
74
Mixture of Mixtures
 In classification, the input comes from a
mixture of classes (supervised).
 If each class is also a mixture, e.g., of
Gaussians, (unsupervised), we have a
mixture of mixtures:
$p(x \mid C_i) = \sum_{j=1}^{k_i} p(x \mid G_{ij})\, P(G_{ij})$

$p(x) = \sum_{i=1}^{K} p(x \mid C_i)\, P(C_i)$

75
Hierarchical Clustering
 Cluster based on similarities/distances
 Distance measure between instances xr
and xs
Minkowski (Lp) (Euclidean for p = 2):

$d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|^p \right]^{1/p}$

City-block distance:

$d_{cb}(x^r, x^s) = \sum_{j=1}^{d} \left| x_j^r - x_j^s \right|$
76
Agglomerative Clustering
 Start with each instance in its own group and
merge the two closest groups at each iteration
 Distance between two groups Gi and Gj:

Single-link:
$d(G_i, G_j) = \min_{x^r \in G_i,\ x^s \in G_j} d(x^r, x^s)$

Complete-link:
$d(G_i, G_j) = \max_{x^r \in G_i,\ x^s \in G_j} d(x^r, x^s)$
77
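
A brute-force single-link sketch (illustration only; real implementations use more efficient data structures): start with every instance as its own group and repeatedly merge the two closest groups until k groups remain.

```python
import numpy as np

def single_link_agglomerative(X, k):
    """X: (N, d) data. Merge the closest pair of groups (min inter-point distance) until k remain."""
    groups = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    while len(groups) > k:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = D[np.ix_(groups[a], groups[b])].min()        # single-link distance
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a] += groups.pop(b)                               # merge G_b into G_a
    return groups
```
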

Dendrogram

78
Choosing k
 Defined by the application, e.g., image
quantization
 Plot data (after PCA) and check for clusters
 Incremental algorithm: Add clusters one at a time
until the “elbow” (reconstruction error/log
likelihood/intergroup distances)
 Manual check for meaning

79
5.5 Nonparametric Methods
Nonparametric Estimation
 Parametric (single global model),
semiparametric (small number of local models)
 Nonparametric: Similar inputs have similar
outputs
 Functions (pdf, discriminant, regression) change
smoothly
 Keep the training data;“let the data speak for
itself”
 Given x, find a small number of closest training
instances and interpolate from these
 Aka lazy/memory-based/case-based/instance-
based learning
81
Density Estimation
 Given the training set X={xt}t drawn iid
from p(x)
 Divide data into bins of size h
 Histogram:

$\hat{p}(x) = \frac{\#\{ x^t \text{ in the same bin as } x \}}{N h}$

 Naive estimator:

$\hat{p}(x) = \frac{\#\{ x - h < x^t \le x + h \}}{2 N h}$

or

$\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right), \qquad w(u) = \begin{cases} 1/2 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$
82
Kernel Estimator
 Kernel function, e.g., Gaussian kernel:

$K(u) = \frac{1}{\sqrt{2\pi}} \exp\left[ -\frac{u^2}{2} \right]$

 Kernel estimator (Parzen windows):

$\hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right)$
85
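
A sketch of the Parzen-window estimate with a Gaussian kernel; the bandwidth h is a free parameter to be tuned (e.g., by cross-validation).

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def parzen_density(x_query, x_train, h):
    """p_hat(x) = (1/(N h)) * sum_t K((x - x^t) / h), evaluated at each query point."""
    u = (x_query[:, None] - x_train[None, :]) / h        # shape (Q, N)
    return gaussian_kernel(u).sum(axis=1) / (len(x_train) * h)

x_train = np.random.default_rng(0).normal(0, 1, size=500)
xs = np.linspace(-3, 3, 7)
print(parzen_density(xs, x_train, h=0.5))     # should roughly follow the N(0,1) pdf
```
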
k-Nearest Neighbor Estimator
 Instead of fixing bin width h and counting
the number of instances, fix the number of
instances (neighbors) k and check bin width

$\hat{p}(x) = \frac{k}{2 N d_k(x)}$

dk(x), distance to kth closest instance to x
87
Multivariate Data
 Kernel density estimator:

$\hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right)$

 Multivariate Gaussian kernel:

spheric: $K(u) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left[ -\frac{\|u\|^2}{2} \right]$

ellipsoid: $K(u) = \frac{1}{(2\pi)^{d/2} |S|^{1/2}} \exp\left[ -\frac{1}{2} u^T S^{-1} u \right]$
89
Nonparametric Classification
 Estimate p(x|Ci) and use Bayes’ rule
 Kernel estimator:

$\hat{p}(x \mid C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) r_i^t, \qquad \hat{P}(C_i) = \frac{N_i}{N}$

$g_i(x) = \hat{p}(x \mid C_i)\, \hat{P}(C_i) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) r_i^t$

 k-NN estimator:

$\hat{p}(x \mid C_i) = \frac{k_i}{N_i V^k(x)} \qquad \Rightarrow \qquad \hat{P}(C_i \mid x) = \frac{\hat{p}(x \mid C_i)\, \hat{P}(C_i)}{\hat{p}(x)} = \frac{k_i}{k}$
90
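
The last identity, P̂(Ci | x) = ki / k, is exactly a majority vote among the k nearest neighbours; a minimal sketch:

```python
import numpy as np
from collections import Counter

def knn_predict(x_query, X_train, r_train, k=3):
    """Classify x_query by a majority vote of its k nearest training instances."""
    d = np.linalg.norm(X_train - x_query, axis=1)    # Euclidean distances
    nearest = np.argsort(d)[:k]                      # indices of the k closest
    votes = Counter(r_train[nearest])                # k_i: neighbours per class
    return votes.most_common(1)[0][0]
```
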
Condensed Nearest Neighbor
 Time/space complexity of k-NN is O (N)
 Find a subset Z of X that is small and is
accurate in classifying X (Hart, 1968)

$E'(Z \mid X) = E(X \mid Z) + \lambda |Z|$

91
Condensed Nearest Neighbor
 Incremental algorithm: Add instance if
needed

92
Nonparametric Regression
 Aka smoothing models
 Regressogram:

$\hat{g}(x) = \frac{\sum_{t=1}^{N} b(x, x^t)\, r^t}{\sum_{t=1}^{N} b(x, x^t)}$

where

$b(x, x^t) = \begin{cases} 1 & \text{if } x^t \text{ is in the same bin with } x \\ 0 & \text{otherwise} \end{cases}$
93
Running Mean/Kernel Smoother
 Running mean smoother:

$\hat{g}(x) = \frac{\sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right) r^t}{\sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right)}, \qquad w(u) = \begin{cases} 1 & \text{if } |u| < 1 \\ 0 & \text{otherwise} \end{cases}$

 Kernel smoother:

$\hat{g}(x) = \frac{\sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) r^t}{\sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right)}$

where K( ) is Gaussian

 Running line smoother: fit a local regression line in
each neighborhood (Hastie and Tibshirani, 1990)
96
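
A sketch of the kernel smoother above with Gaussian weights; a tiny constant is added to the denominator only to avoid division by zero far from the data.

```python
import numpy as np

def kernel_smoother(x_query, x_train, r_train, h):
    """g_hat(x) = sum_t K((x - x^t)/h) r^t / sum_t K((x - x^t)/h), with a Gaussian K."""
    u = (x_query[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                 # unnormalised Gaussian weights
    return (K * r_train).sum(axis=1) / (K.sum(axis=1) + 1e-12)
```
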
How to Choose k or h?
 When k or h is small, single instances
matter; bias is small, variance is large
(undersmoothing): High complexity
 As k or h increases, we average over more
instances and variance decreases but bias
increases (oversmoothing): Low
complexity
 Cross-validation is used to finetune k or h.

100
5.6 Decision Trees
Tree Uses Nodes, and Leaves

102
Divide and Conquer
 Internal decision nodes
Univariate: Uses a single attribute, xi
Numeric xi : Binary split : xi > wm
Discrete xi : n-way split for n possible values
Multivariate: Uses all attributes, x
 Leaves
Classification: Class labels, or proportions
Regression: Numeric; r average, or local fit
 Learning is greedy; find the best split recursively
(Breiman et al, 1984; Quinlan, 1986, 1993)
103
Classification Trees
(ID3, CART, C4.5)
 For node m, Nm instances reach m, Nim of
them belong to Ci:

$\hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}$

 Node m is pure if pim is 0 or 1
 Measure of impurity is entropy:

$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$
104
Best Split
 If node m is pure, generate a leaf and stop,
otherwise split and continue recursively
 Impurity after split: Nmj of Nm take branch j, Nimj
of them belong to Ci:

$\hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}$

$I'_m = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i$

 Find the variable and split that minimize impurity
(among all variables -- and split positions for
numeric variables)
105
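
A sketch of the split search for one numeric attribute: try the midpoints between sorted values and keep the threshold with the lowest weighted entropy I'm. Function names are illustrative.

```python
import numpy as np

def entropy(labels):
    """I = -sum_i p_i log2 p_i over the classes present in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_numeric_split(x, r):
    """Return (threshold, impurity) minimising the weighted entropy of the two branches."""
    best = (None, np.inf)
    xs = np.unique(x)
    for thr in (xs[:-1] + xs[1:]) / 2:               # candidate midpoints
        left, right = r[x <= thr], r[x > thr]
        I = (len(left) * entropy(left) + len(right) * entropy(right)) / len(r)
        if I < best[1]:
            best = (thr, I)
    return best
```
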
Regression Trees
 Error at node m:

$b_m(x) = \begin{cases} 1 & \text{if } x \in X_m: x \text{ reaches node } m \\ 0 & \text{otherwise} \end{cases}$

$E_m = \frac{1}{N_m} \sum_t \left( r^t - g_m \right)^2 b_m(x^t), \qquad g_m = \frac{\sum_t b_m(x^t)\, r^t}{\sum_t b_m(x^t)}$

 After splitting:

$b_{mj}(x) = \begin{cases} 1 & \text{if } x \in X_{mj}: x \text{ reaches node } m \text{ and branch } j \\ 0 & \text{otherwise} \end{cases}$

$E'_m = \frac{1}{N_m} \sum_j \sum_t \left( r^t - g_{mj} \right)^2 b_{mj}(x^t), \qquad g_{mj} = \frac{\sum_t b_{mj}(x^t)\, r^t}{\sum_t b_{mj}(x^t)}$
107
Model Selection in Trees:

108
Pruning Trees
 Remove subtrees for better generalization
(decrease variance)
Prepruning: Early stopping
Postpruning: Grow the whole tree then
prune subtrees which overfit on the
pruning set
 Prepruning is faster, postpruning is more
accurate (requires a separate pruning set)

109
Rule Extraction from Trees
C4.5Rules
(Quinlan, 1993)

110
Learning Rules
 Rule induction is similar to tree induction but
 rule induction is depth-first; one rule at a
time
 Rule set contains rules; rules are conjunctions of
terms
 Rule covers an example if all terms of the rule
evaluate to true for the example
 Sequential covering: Generate rules one at a time
until all positive examples are covered
 IREP (Fürnkranz and Widmer, 1994), Ripper
(Cohen, 1995)
111
Multivariate Trees

114
