					    Lecture 5: Machine Learning
5.1 Introduction
5.2 Supervised Learning
5.3 Parametric Methods
5.4 Clustering
5.5 Nonparametric Methods
5.6 Decision Trees
5.1 Introduction
Why “Learn” ?
 Machine learning is programming computers to
  optimize a performance criterion using example
  data or past experience.
 There is no need to “learn” to calculate payroll
 Learning is used when:
    Human expertise does not exist (navigating on Mars),
    Humans are unable to explain their expertise (speech
     recognition)
    Solution changes in time (routing on a computer
     network)
    Solution needs to be adapted to particular cases (user
     biometrics)

                                                              3
What We Talk About When We
Talk About “Learning”
  Learning general models from data of particular
   examples
  Data is cheap and abundant (data warehouses, data
   marts); knowledge is expensive and scarce.
  Example in retail: Customer transactions to
   consumer behavior:
      People who bought “Da Vinci Code” also bought “The
      Five People You Meet in Heaven” (www.amazon.com)
  Build a model that is a good and useful
   approximation to the data.

                                                           4
Data Mining
  Retail: Market basket analysis, Customer
   relationship management (CRM)
  Finance: Credit scoring, fraud detection
  Manufacturing: Optimization, troubleshooting
  Medicine: Medical diagnosis
  Telecommunications: Quality of service
   optimization
  Bioinformatics: Motifs, alignment
  Web mining: Search engines
  ...

                                                  5
What is Machine Learning?
 Optimize a performance criterion using
  example data or past experience.
 Role of Statistics: Inference from a sample
 Role of Computer science: Efficient
  algorithms to
    Solve the optimization problem
    Represent and evaluate the model for
     inference


                                                6
Applications
  Association
  Supervised Learning
    Classification
    Regression
  Unsupervised Learning
  Reinforcement Learning



                            7
Learning Associations
 Basket analysis:
  P (Y | X ) probability that somebody who
  buys X also buys Y where X and Y are
  products/services.

  Example: P ( chips | beer ) = 0.7
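
A minimal sketch (not from the lecture) of how such a conditional probability can be estimated from transaction data by simple counting; the basket contents below are made up for illustration:

```python
# Toy market-basket data (illustrative only).
baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "diapers"},
    {"chips", "salsa"},
    {"beer", "chips", "nuts"},
]

def conditional_prob(baskets, x, y):
    """Estimate P(y in basket | x in basket) by counting."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_prob(baskets, "beer", "chips"))  # 3 of the 4 beer baskets contain chips -> 0.75
```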



                                             8
Classification
 Example: Credit
  scoring
 Differentiating
  between low-risk
  and high-risk
  customers from
  their income and
  savings



 Discriminant: IF income > θ1 AND savings > θ2
                    THEN low-risk ELSE high-risk
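
The discriminant above is just a pair of thresholds. A minimal sketch; θ1 and θ2 below are assumed illustrative values, not parameters given in the lecture:

```python
THETA1 = 30_000   # assumed income threshold (theta1)
THETA2 = 10_000   # assumed savings threshold (theta2)

def credit_risk(income, savings, theta1=THETA1, theta2=THETA2):
    """IF income > theta1 AND savings > theta2 THEN low-risk ELSE high-risk."""
    return "low-risk" if income > theta1 and savings > theta2 else "high-risk"

print(credit_risk(45_000, 15_000))  # low-risk
print(credit_risk(45_000, 2_000))   # high-risk
```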
                                                   9
Classification: Applications
 Aka Pattern recognition
 Face recognition: Pose, lighting, occlusion
  (glasses, beard), make-up, hair style
 Character recognition: Different handwriting
  styles.
 Speech recognition: Temporal dependency.
    Use of a dictionary or the syntax of the language.
    Sensor fusion: Combine multiple modalities; eg,
     visual (lip image) and acoustic for speech
 Medical diagnosis: From symptoms to illnesses
 ...

                                                          10
 Face Recognition
[Figure: training examples of a person; test images]
AT&T Laboratories, Cambridge UK
http://www.uk.research.att.com/facedatabase.html


                                                                              11
Regression

 Example: Price of a used car
 x: car attributes, y: price
 Linear model: y = w x + w_0
 In general: y = g(x | θ), where g(·) is the model and θ its parameters


                                   12
Regression Applications

     Navigating a car: Angle of the steering
      wheel (CMU NavLab)
     Kinematics of a robot arm: given a target position (x, y),
      find the joint angles α1 = g1(x, y) and α2 = g2(x, y)


  Response surface design
                                                13
Supervised Learning: Uses
 Prediction of future cases: Use the rule to
  predict the output for future inputs
 Knowledge extraction: The rule is easy to
  understand
 Compression: The rule is simpler than the
  data it explains
 Outlier detection: Exceptions that are not
  covered by the rule, e.g., fraud

                                                14
Unsupervised Learning

  Learning “what normally happens”
  No output
  Clustering: Grouping similar instances
  Example applications
   Customer segmentation in CRM
   Image compression: Color quantization
   Bioinformatics: Learning motifs


                                            15
Reinforcement Learning

   Learning a policy: A sequence of outputs
   No supervised output but delayed reward
   Credit assignment problem
   Game playing
   Robot in a maze
   Multiple agents, partial observability, ...


                                                  16
Resources: Datasets

   UCI Repository:
    http://www.ics.uci.edu/~mlearn/MLRepository.html
   UCI KDD Archive:
    http://kdd.ics.uci.edu/summary.data.application.html
   Statlib: http://lib.stat.cmu.edu/
   Delve: http://www.cs.utoronto.ca/~delve/

                                                       17
Resources: Journals
 Journal of Machine Learning Research
  www.jmlr.org
 Machine Learning
 Neural Computation
 Neural Networks
 IEEE Transactions on Neural Networks
 IEEE Transactions on Pattern Analysis and
  Machine Intelligence
 Annals of Statistics
 Journal of the American Statistical Association
 ...
                                                    18
Resources: Conferences
    International Conference on Machine Learning (ICML)
        ICML05: http://icml.ais.fraunhofer.de/
    European Conference on Machine Learning (ECML)
        ECML05: http://ecmlpkdd05.liacc.up.pt/
    Neural Information Processing Systems (NIPS)
        NIPS05: http://nips.cc/
    Uncertainty in Artificial Intelligence (UAI)
        UAI05: http://www.cs.toronto.edu/uai2005/
    Computational Learning Theory (COLT)
        COLT05: http://learningtheory.org/colt2005/
    International Joint Conference on Artificial Intelligence (IJCAI)
        IJCAI05: http://ijcai05.csd.abdn.ac.uk/
    International Conference on Neural Networks (Europe)
        ICANN05: http://www.ibspan.waw.pl/ICANN-2005/
    ...


                                                                         19
5.2 Supervised Learning
Learning a Class from Examples
 Class C of a “family car”
  Prediction: Is car x a family car?
  Knowledge extraction: What do people
    expect from a family car?
 Output:
     Positive (+) and negative (–) examples
 Input representation:
     x1: price, x2 : engine power
                                              21
Training set X

    X = \{ x^t, r^t \}_{t=1}^{N}

    r = 1 if x is positive, 0 if x is negative

    x = [x_1, x_2]^T
                                            22
Class C

    (p_1 \le price \le p_2)  AND  (e_1 \le engine power \le e_2)
                                                           23
Hypothesis class H

    h(x) = 1 if h classifies x as positive, 0 if h classifies x as negative

 Error of h on the training set X:

    E(h | X) = \sum_{t=1}^{N} 1\big( h(x^t) \ne r^t \big)
                                                      24
S, G, and the Version Space

   S: the most specific hypothesis
   G: the most general hypothesis

   Any h \in H between S and G is consistent with the
   training set, and together they make up the
   version space (Mitchell, 1997)



                                                          25
   VC Dimension
  N points can be labeled in 2^N ways as +/–
  H shatters the N points if, for every such labeling, there
   exists an h \in H consistent with it; VC(H) is the largest
   number of points that can be shattered

 An axis-aligned rectangle can shatter at most 4 points,
 so its VC dimension is 4

                                                     26
Probably Approximately Correct
(PAC) Learning
 How many training examples N should we have, such
   that with probability at least 1 − δ, h has error at most ε?
   (Blumer et al., 1989)

 Each strip (between the tightest rectangle and the class C) has probability at most ε/4
 Pr that a random instance misses a strip: ≤ 1 − ε/4
 Pr that N instances all miss a strip: ≤ (1 − ε/4)^N
 Pr that N instances miss any of the 4 strips: ≤ 4(1 − ε/4)^N
 Require 4(1 − ε/4)^N ≤ δ; using (1 − x) ≤ exp(−x):
 4 exp(−εN/4) ≤ δ, which gives N ≥ (4/ε) ln(4/δ)
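
A quick numeric check of the resulting bound; the values of ε and δ below are illustrative:

```python
import math

def pac_sample_size(epsilon, delta):
    """Smallest integer N satisfying N >= (4/epsilon) * ln(4/delta)."""
    return math.ceil((4.0 / epsilon) * math.log(4.0 / delta))

print(pac_sample_size(0.1, 0.05))   # 176 examples for error <= 0.1 with probability >= 0.95
```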




                                                                   27
Noise and Model Complexity
Use the simpler one because
 Simpler to use
  (lower computational
  complexity)
 Easier to train (lower
  space complexity)
 Easier to explain
  (more interpretable)
 Generalizes better (lower
  variance - Occam’s razor)

                              28
Multiple Classes, C_i, i = 1,...,K

    X = \{ x^t, r^t \}_{t=1}^{N}

    r_i^t = 1 if x^t \in C_i, 0 if x^t \in C_j, j \ne i

 Train K hypotheses h_i(x), i = 1,...,K:

    h_i(x^t) = 1 if x^t \in C_i, 0 if x^t \in C_j, j \ne i
                                                           29
Regression

 Linear model:     g(x) = w_1 x + w_0
 Quadratic model:  g(x) = w_2 x^2 + w_1 x + w_0

    X = \{ x^t, r^t \}_{t=1}^{N},   r^t \in \mathbb{R},   r^t = f(x^t) + \epsilon

 Empirical error:

    E(g | X) = \frac{1}{N} \sum_{t=1}^{N} \big[ r^t - g(x^t) \big]^2

    E(w_1, w_0 | X) = \frac{1}{N} \sum_{t=1}^{N} \big[ r^t - (w_1 x^t + w_0) \big]^2
                                                                               30
Model Selection &
Generalization
    Learning is an ill-posed problem; data is
     not sufficient to find a unique solution
    The need for inductive bias, assumptions
     about H
    Generalization: How well a model
     performs on new data
    Overfitting: H more complex than C or f
    Underfitting: H less complex than C or f

                                                 31
Triple Trade-Off

   There is a trade-off between three factors
    (Dietterich, 2003):
   1. Complexity of H, c (H),
   2. Training set size, N,
   3. Generalization error, E, on new data
   As N increases, E decreases
   As c(H) increases, E first decreases and then increases

                                                 32
Cross-Validation

   To estimate generalization error, we need
    data unseen during training. We split the
    data as
     Training set (50%)
     Validation set (25%)
     Test (publication) set (25%)
   Resampling when there is little data

                                                33
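
A minimal sketch of the 50%/25%/25% split described above; the array shapes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X, r = rng.normal(size=(N, 4)), rng.normal(size=N)   # illustrative inputs and outputs

idx = rng.permutation(N)                              # shuffle before splitting
train, valid, test = np.split(idx, [N // 2, 3 * N // 4])

X_train, r_train = X[train], r[train]    # 50%: fit candidate models
X_valid, r_valid = X[valid], r[valid]    # 25%: choose among them
X_test,  r_test  = X[test],  r[test]     # 25%: report final (publication) error
```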
Dimensions of a Supervised
Learner
1.   Model:                   g(x | \theta)

2.   Loss function:           E(\theta | X) = \sum_t L\big( r^t, g(x^t | \theta) \big)

3.   Optimization procedure:  \theta^* = \arg\min_\theta E(\theta | X)
                                                            34
5.3 Parametric Methods
Parametric Estimation
 X = { xt }t where xt ~ p (x)
 Parametric estimation:
    Assume a form for p (x | θ) and estimate
    θ, its sufficient statistics, using X
    e.g., N ( μ, σ2) where θ = { μ, σ2}




                                               36
Maximum Likelihood Estimation
 Likelihood of θ given the sample X
      l (θ|X) = p (X |θ) = ∏t p (xt|θ)

 Log likelihood
      L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)

 Maximum likelihood estimator (MLE)
      θ* = argmaxθ L(θ|X)

                                               37
Examples: Bernoulli/Multinomial
 Bernoulli: Two states, failure/success, x in {0,1}
    P(x) = p_o^x (1 - p_o)^{1-x}
    L(p_o | X) = \log \prod_t p_o^{x^t} (1 - p_o)^{1 - x^t}
    MLE: p_o = \sum_t x^t / N

 Multinomial: K > 2 states, x_i in {0,1}
    P(x_1, x_2, ..., x_K) = \prod_i p_i^{x_i}
    L(p_1, p_2, ..., p_K | X) = \log \prod_t \prod_i p_i^{x_i^t}
    MLE: p_i = \sum_t x_i^t / N

                                                           38
Gaussian (Normal) Distribution
    p(x) = N(\mu, \sigma^2)

    p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)

 MLE for \mu and \sigma^2:

    m = \frac{\sum_t x^t}{N}

    s^2 = \frac{\sum_t (x^t - m)^2}{N}
                                                       39
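
A quick check of the MLE formulas for m and s^2 on simulated data; the true μ and σ below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)   # sample from N(2.0, 1.5^2)

N = len(x)
m = x.sum() / N                    # MLE of the mean
s2 = ((x - m) ** 2).sum() / N      # MLE of the variance (divides by N, not N - 1)

print(m, s2)                       # close to 2.0 and 1.5**2 = 2.25
```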
   Bias and Variance
Unknown parameter θ
Estimator di = d (Xi) on sample Xi

Bias: bθ(d) = E [d] – θ
Variance: E [(d–E [d])2]

Mean square error:
r (d,θ) = E [(d–θ)2]
        = (E [d] – θ)2 + E [(d–E [d])2]
        = Bias2 + Variance




                                          40
Bayes’ Estimator
 Treat θ as a random var with prior p (θ)
 Bayes’ rule: p (θ|X) = p(X|θ) p(θ) / p(X)

 Full: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
 Maximum a Posteriori (MAP): θMAP = argmaxθ
  p(θ|X)
 Maximum Likelihood (ML): θML = argmaxθ
  p(X|θ)
 Bayes’: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ

                                               41
  Bayes’ Estimator: Example
 x^t ~ N(\theta, \sigma_o^2) and \theta ~ N(\mu, \sigma^2)
 \theta_{ML} = m
 \theta_{MAP} = \theta_{Bayes'} =

    E[\theta | X] = \frac{N/\sigma_o^2}{N/\sigma_o^2 + 1/\sigma^2}\, m + \frac{1/\sigma^2}{N/\sigma_o^2 + 1/\sigma^2}\, \mu


                                                         42
Parametric Classification

    g_i(x) = p(x | C_i) P(C_i)

 or equivalently

    g_i(x) = \log p(x | C_i) + \log P(C_i)

    p(x | C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right)

    g_i(x) = -\frac{1}{2}\log 2\pi - \log \sigma_i - \frac{(x - \mu_i)^2}{2\sigma_i^2} + \log P(C_i)
                                                             43
 Given the sample X = \{ x^t, r^t \}_{t=1}^{N}, with

    r_i^t = 1 if x^t \in C_i, 0 if x^t \in C_j, j \ne i

 ML estimates are

    \hat{P}(C_i) = \frac{\sum_t r_i^t}{N}
    m_i = \frac{\sum_t x^t r_i^t}{\sum_t r_i^t}
    s_i^2 = \frac{\sum_t (x^t - m_i)^2 r_i^t}{\sum_t r_i^t}

 The discriminant becomes

    g_i(x) = -\frac{1}{2}\log 2\pi - \log s_i - \frac{(x - m_i)^2}{2 s_i^2} + \log \hat{P}(C_i)
                                                                                    44
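
A minimal sketch of this parametric classifier for two 1-D Gaussian classes; the class means, variances, and sample sizes below are illustrative, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.normal(2.0, 1.0, size=200)                 # class 0 samples
x1 = rng.normal(5.0, 1.5, size=300)                 # class 1 samples
x = np.concatenate([x0, x1])
r = np.concatenate([np.zeros(200), np.ones(300)])   # class labels

def fit_class(xc, n_total):
    """ML estimates P(Ci), m_i, s_i^2 from the samples of one class."""
    return len(xc) / n_total, xc.mean(), xc.var()

def g(x, prior, m, s2):
    """Discriminant g_i(x); the common -0.5*log(2*pi) term is dropped."""
    return -0.5 * np.log(s2) - (x - m) ** 2 / (2 * s2) + np.log(prior)

params = [fit_class(x[r == c], len(x)) for c in (0, 1)]
scores = np.stack([g(x, *p) for p in params])       # shape (2, N)
pred = scores.argmax(axis=0)
print("training accuracy:", (pred == r).mean())
```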
         Equal variances




Single boundary at
halfway between
means



                           45
Variances are different




   Two boundaries




                          46
 Regression

    r = f(x) + \epsilon,   estimator: g(x | \theta),   \epsilon \sim N(0, \sigma^2)

    p(r | x) \sim N\big( g(x | \theta), \sigma^2 \big)

    L(\theta | X) = \log \prod_{t=1}^{N} p(x^t, r^t)
                  = \log \prod_{t=1}^{N} p(r^t | x^t) + \log \prod_{t=1}^{N} p(x^t)
                                                      47
  Regression: From LogL to Error

    L(\theta | X) = \log \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{ \big( r^t - g(x^t | \theta) \big)^2 }{2\sigma^2} \right]

                  = -N \log\big(\sqrt{2\pi}\,\sigma\big) - \frac{1}{2\sigma^2} \sum_{t=1}^{N} \big[ r^t - g(x^t | \theta) \big]^2

 Maximizing the log likelihood is equivalent to minimizing

    E(\theta | X) = \frac{1}{2} \sum_{t=1}^{N} \big[ r^t - g(x^t | \theta) \big]^2
                                                                   48
Linear Regression

    g(x^t | w_1, w_0) = w_1 x^t + w_0

 Setting the derivatives of the error to zero gives the normal equations:

    \sum_t r^t = N w_0 + w_1 \sum_t x^t

    \sum_t r^t x^t = w_0 \sum_t x^t + w_1 \sum_t (x^t)^2

 In matrix form A w = y:

    A = \begin{bmatrix} N & \sum_t x^t \\ \sum_t x^t & \sum_t (x^t)^2 \end{bmatrix}, \quad
    w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}, \quad
    y = \begin{bmatrix} \sum_t r^t \\ \sum_t r^t x^t \end{bmatrix}

    w = A^{-1} y
                                                           49
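
A minimal numerical sketch of solving the normal equations A w = y for (w0, w1); the synthetic data below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
r = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=100)   # true w1 = 3, w0 = 2, plus noise

N = len(x)
A = np.array([[N,       x.sum()],
              [x.sum(), (x ** 2).sum()]])
y = np.array([r.sum(), (r * x).sum()])

w0, w1 = np.linalg.solve(A, y)    # solve A w = y (preferable to explicitly inverting A)
print(w0, w1)                     # close to 2.0 and 3.0
```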
         Polynomial Regression

    g(x^t | w_k, ..., w_2, w_1, w_0) = w_k (x^t)^k + \cdots + w_2 (x^t)^2 + w_1 x^t + w_0

    D = \begin{bmatrix}
          1 & x^1 & (x^1)^2 & \cdots & (x^1)^k \\
          1 & x^2 & (x^2)^2 & \cdots & (x^2)^k \\
          \vdots \\
          1 & x^N & (x^N)^2 & \cdots & (x^N)^k
        \end{bmatrix}, \quad
    r = \begin{bmatrix} r^1 \\ r^2 \\ \vdots \\ r^N \end{bmatrix}

    w = (D^T D)^{-1} D^T r
                                                                               50
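
The same idea for a degree-k polynomial, using the design matrix D; k and the synthetic data below are illustrative, and np.linalg.lstsq is used instead of forming (D^T D)^{-1} explicitly, which is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=200)
r = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(scale=0.1, size=200)

k = 2
D = np.vander(x, k + 1, increasing=True)     # columns: 1, x, x^2, ..., x^k
w, *_ = np.linalg.lstsq(D, r, rcond=None)    # least-squares solution of D w = r
print(w)                                     # close to [1.0, -2.0, 0.5]
```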
Other Error Measures

 Square Error:

    E(\theta | X) = \frac{1}{2} \sum_{t=1}^{N} \big[ r^t - g(x^t | \theta) \big]^2

 Relative Square Error:

    E(\theta | X) = \frac{ \sum_{t=1}^{N} \big[ r^t - g(x^t | \theta) \big]^2 }{ \sum_{t=1}^{N} \big[ r^t - \bar{r} \big]^2 }

 Absolute Error:  E(\theta | X) = \sum_t | r^t - g(x^t | \theta) |

 ε-sensitive Error:
    E(\theta | X) = \sum_t 1\big( | r^t - g(x^t | \theta) | > \epsilon \big) \big( | r^t - g(x^t | \theta) | - \epsilon \big)
                                                                                      51
         Bias and Variance

    E\big[ (r - g(x))^2 \mid x \big] = E\big[ (r - E[r|x])^2 \mid x \big] + \big( E[r|x] - g(x) \big)^2
                                         noise                               squared error

    E_X\big[ ( E[r|x] - g(x) )^2 \mid x \big] = \big( E[r|x] - E_X[g(x)] \big)^2 + E_X\big[ ( g(x) - E_X[g(x)] )^2 \big]
                                                  bias^2                           variance
                                                                                    52
Estimating Bias and Variance

 M samples X_i = \{ x^t_i, r^t_i \}, i = 1,...,M,
  are used to fit g_i(x), i = 1,...,M

    Bias^2(g) = \frac{1}{N} \sum_t \big[ \bar{g}(x^t) - f(x^t) \big]^2

    Variance(g) = \frac{1}{N M} \sum_t \sum_i \big[ g_i(x^t) - \bar{g}(x^t) \big]^2

    \bar{g}(x) = \frac{1}{M} \sum_i g_i(x)
                                                  53
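
A minimal sketch of estimating Bias^2 and Variance from M samples, following the formulas above; the target function f, the noise level, and the deliberately crude constant estimator are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
f = np.sin                                   # "true" function
x_eval = np.linspace(0, np.pi, 50)           # N evaluation points
M, n = 100, 30                               # M samples, n instances per sample

g_vals = []
for _ in range(M):
    x_i = rng.uniform(0, np.pi, size=n)
    r_i = f(x_i) + rng.normal(scale=0.2, size=n)
    g_vals.append(np.full_like(x_eval, r_i.mean()))   # g_i(x): constant fit
g_vals = np.array(g_vals)                    # shape (M, N)

g_bar = g_vals.mean(axis=0)                  # average fit over the M samples
bias2 = np.mean((g_bar - f(x_eval)) ** 2)
variance = np.mean((g_vals - g_bar) ** 2)
print(bias2, variance)                       # a constant fit: high bias, low variance
```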
Bias/Variance Dilemma
 Example: gi(x)=2 has no variance and high bias
  gi(x)= ∑t rti/N has lower bias with variance

 As we increase complexity,
      bias decreases (a better fit to data) and
      variance increases (fit varies more with data)
 Bias/Variance dilemma: (Geman et al., 1992)


                                                       54
[Figure: the target function f, the individual fits g_i, and their average \bar{g};
 the bias is the gap between \bar{g} and f, the variance is the spread of the g_i around \bar{g}]
                        55
Polynomial Regression


               Best fit “min error”




                                      56
Best fit, “elbow”




                    57
Model Selection
 Cross-validation: Measure generalization
  accuracy by testing on data unused during
  training
 Regularization: Penalize complex models
       E’=error on data + λ model complexity

  Akaike’s information criterion (AIC), Bayesian
  information criterion (BIC)
 Minimum description length (MDL):
  Kolmogorov complexity, shortest description of
  data
 Structural risk minimization (SRM)
                                                   58
Bayesian Model Selection
 Prior on models, p(model)

    p(model | data) = \frac{ p(data | model)\, p(model) }{ p(data) }
 Regularization, when prior favors simpler
  models
 Bayes, MAP of the posterior, p(model|data)
 Average over a number of models with high
  posterior (voting, ensembles: Chapter 15)

                                                      59
5.4 Clustering
Semiparametric Density
Estimation
 Parametric: Assume a single model for p (x | Ci)
  (Chapter 4 and 5)
 Semiparametric: p (x | Ci) is a mixture of
  densities
  Multiple possible explanations/prototypes:
      Different handwriting styles, accents in
  speech
 Nonparametric: No model; data speaks for itself
  (Chapter 8)

                                                     61
Mixture Densities

    p(x) = \sum_{i=1}^{k} p(x | G_i) P(G_i)


where Gi the components/groups/clusters,
      P ( Gi ) mixture proportions (priors),
      p ( x | Gi) component densities

Gaussian mixture where p(x|Gi) ~ N ( μi , ∑i )
  parameters Φ = {P ( Gi ), μi , ∑i }ki=1
  unlabeled sample X={xt}t (unsupervised
  learning)
                                                 62
    Classes vs. Clusters
 Supervised: X = \{ x^t, r^t \}_t ; classes C_i, i = 1,...,K

    p(x) = \sum_{i=1}^{K} p(x | C_i) P(C_i),  where p(x | C_i) \sim N(\mu_i, \Sigma_i)
    \Phi = \{ P(C_i), \mu_i, \Sigma_i \}_{i=1}^{K}

    The labels r_i^t give the estimates:
    \hat{P}(C_i) = \frac{\sum_t r_i^t}{N},
    m_i = \frac{\sum_t r_i^t x^t}{\sum_t r_i^t},
    S_i = \frac{\sum_t r_i^t (x^t - m_i)(x^t - m_i)^T}{\sum_t r_i^t}

 Unsupervised: X = \{ x^t \}_t ; clusters G_i, i = 1,...,k

    p(x) = \sum_{i=1}^{k} p(x | G_i) P(G_i),  where p(x | G_i) \sim N(\mu_i, \Sigma_i)
    \Phi = \{ P(G_i), \mu_i, \Sigma_i \}_{i=1}^{k}

    but now the labels r_i^t are unknown
                                                                                   63
  k-Means Clustering
 Find k reference vectors (prototypes/codebook
  vectors/codewords) which best represent the data
 Reference vectors m_j, j = 1,...,k
 Assign each instance to the nearest (most similar) reference:

    \| x^t - m_i \| = \min_j \| x^t - m_j \|

 Reconstruction error:

    E\big( \{ m_i \}_{i=1}^{k} \mid X \big) = \sum_t \sum_i b_i^t \| x^t - m_i \|^2

    b_i^t = 1 if \| x^t - m_i \| = \min_j \| x^t - m_j \|, 0 otherwise
                                                  64
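
A minimal k-means sketch that alternates the two steps above: assign each x^t to its nearest m_i, then recompute each m_i as the mean of the instances assigned to it. The data, k, and iteration count are illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m = X[rng.choice(len(X), size=k, replace=False)]        # initialize from random instances
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)   # (N, k) distances
        b = d.argmin(axis=1)                                 # hard assignments b_i^t
        m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                      for i in range(k)])                    # recompute the means
    return m, b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(60, 2)) for c in ((0, 0), (3, 3), (0, 3))])
centers, labels = kmeans(X, k=3)
print(centers)
```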
Encoding/Decoding

    b_i^t = 1 if \| x^t - m_i \| = \min_j \| x^t - m_j \|, 0 otherwise
                                         65
k-means Clustering




                     66
67
Expectation-Maximization (EM)
 Log likelihood with a mixture model:

    L(\Phi | X) = \log \prod_t p(x^t | \Phi)
                = \sum_t \log \sum_{i=1}^{k} p(x^t | G_i) P(G_i)
 Assume hidden variables z, which when known,
  make optimization much simpler
 Complete likelihood, Lc(Φ |X,Z), in terms of x
  and z
 Incomplete likelihood, L(Φ |X), in terms of x
                                                   68
E- and M-steps
 Iterate the two steps
1. E-step: Estimate z given X and current Φ
2. M-step: Find new Φ’ given z, X, and old Φ

    E-step:  Q(\Phi | \Phi^l) = E\big[ L_C(\Phi | X, Z) \mid X, \Phi^l \big]

    M-step:  \Phi^{l+1} = \arg\max_\Phi Q(\Phi | \Phi^l)

   An increase in Q increases the incomplete likelihood:

    L(\Phi^{l+1} | X) \ge L(\Phi^l | X)
                                                           69
EM in Gaussian Mixtures
 z_i^t = 1 if x^t belongs to G_i, 0 otherwise
  (the labels r_i^t of supervised learning); assume
  p(x | G_i) \sim N(\mu_i, \Sigma_i)

 E-step:

    E\big[ z_i^t \mid X, \Phi^l \big] = \frac{ p(x^t | G_i, \Phi^l) P(G_i) }{ \sum_j p(x^t | G_j, \Phi^l) P(G_j) } = P(G_i | x^t, \Phi^l) \equiv h_i^t

 M-step (use the estimated "labels" h_i^t in place of the unknown labels):

    P(G_i) = \frac{\sum_t h_i^t}{N}

    m_i^{l+1} = \frac{\sum_t h_i^t x^t}{\sum_t h_i^t}

    S_i^{l+1} = \frac{\sum_t h_i^t (x^t - m_i^{l+1})(x^t - m_i^{l+1})^T}{\sum_t h_i^t}
                                                                                               70
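
A minimal EM sketch for a one-dimensional Gaussian mixture, following the E- and M-steps above; the data, k, and iteration count are illustrative:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1.5, 200)])
N, k = len(x), 2

P = np.full(k, 1 / k)                        # mixture proportions P(G_i)
m = rng.choice(x, size=k, replace=False)     # component means
s2 = np.full(k, x.var())                     # component variances

for _ in range(50):
    # E-step: soft labels h_i^t = P(G_i | x^t)
    dens = np.stack([P[i] * gauss_pdf(x, m[i], s2[i]) for i in range(k)])   # (k, N)
    h = dens / dens.sum(axis=0, keepdims=True)
    # M-step: re-estimate the parameters using h in place of the unknown labels
    Nk = h.sum(axis=1)
    P = Nk / N
    m = (h * x).sum(axis=1) / Nk
    s2 = (h * (x - m[:, None]) ** 2).sum(axis=1) / Nk

print(P, m, s2)
```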
P(G1|x)=h1=0.5




                 71
Mixtures of Latent Variable
Models
 Regularize clusters
1. Assume shared/diagonal covariance matrices
2. Use PCA/FA to decrease dimensionality:
   Mixtures of PCA/FA

             p(x^t | G_i) = N\big( m_i, V_i V_i^T + \Psi_i \big)
   Can use EM to learn Vi (Ghahramani and
   Hinton, 1997; Tipping and Bishop, 1999)

                                                    72
After Clustering
 Dimensionality reduction methods find
  correlations between features and group features
 Clustering methods find similarities between
  instances and group instances
 Allows knowledge extraction through
     number of clusters,
     prior probabilities,
     cluster parameters, i.e., center, range of
     features.
  Example: CRM, customer segmentation
                                                     73
Clustering as Preprocessing
 Estimated group labels hj (soft) or bj (hard)
  may be seen as the dimensions of a new k
  dimensional space, where we can then
  learn our discriminant or regressor.
 Local representation (only one bj is 1, all
  others are 0; only few hj are nonzero) vs
  Distributed representation (After PCA; all
  zj are nonzero)
                                                  74
Mixture of Mixtures
 In classification, the input comes from a
  mixture of classes (supervised).
 If each class is also a mixture, e.g., of
  Gaussians, (unsupervised), we have a
   mixture of mixtures:

             p(x | C_i) = \sum_{j=1}^{k_i} p(x | G_{ij}) P(G_{ij})

             p(x) = \sum_{i=1}^{K} p(x | C_i) P(C_i)


                                                   75
Hierarchical Clustering
 Cluster based on similarities/distances
 Distance measure between instances xr
  and xs
   Minkowski (L_p) distance (Euclidean for p = 2):

       d_m(x^r, x^s) = \left[ \sum_{j=1}^{d} | x_j^r - x_j^s |^p \right]^{1/p}

   City-block (L_1) distance:

       d_{cb}(x^r, x^s) = \sum_{j=1}^{d} | x_j^r - x_j^s |
                                                     76
Agglomerative Clustering
 Start with N groups each with one instance and
  merge two closest groups at each iteration
 Distance between two groups Gi and Gj:
     Single-link:

         d(G_i, G_j) = \min_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)

     Complete-link:

         d(G_i, G_j) = \max_{x^r \in G_i,\, x^s \in G_j} d(x^r, x^s)

     Average-link, centroid

                                                          77
Example: Single-Link Clustering




                     Dendrogram


                                  78
Choosing k
 Defined by the application, e.g., image
  quantization
 Plot data (after PCA) and check for clusters
 Incremental (leader-cluster) algorithm: Add one
  at a time until “elbow” (reconstruction error/log
  likelihood/intergroup distances)
 Manual check for meaning




                                                      79
5.5 Nonparametric Methods
Nonparametric Estimation
 Parametric (single global model),
  semiparametric (small number of local models)
 Nonparametric: Similar inputs have similar
  outputs
 Functions (pdf, discriminant, regression) change
  smoothly
 Keep the training data; “let the data speak for
  itself”
 Given x, find a small number of closest training
  instances and interpolate from these
 Aka lazy/memory-based/case-based/instance-
  based learning
                                                     81
Density Estimation
 Given the training set X={xt}t drawn iid
  from p(x)
 Divide data into bins of size h
 Histogram:

    \hat{p}(x) = \frac{ \#\{ x^t \text{ in the same bin as } x \} }{ N h }

 Naive estimator:

    \hat{p}(x) = \frac{ \#\{ x - h < x^t \le x + h \} }{ 2 N h }

  or equivalently

    \hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right),
    where w(u) = 1/2 if |u| < 1, 0 otherwise
                                                              82
83
84
Kernel Estimator
 Kernel function, e.g., Gaussian kernel:

    K(u) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{u^2}{2} \right)

 Kernel estimator (Parzen windows):

    \hat{p}(x) = \frac{1}{N h} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right)
                                            85
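
A minimal Parzen-window estimate with a Gaussian kernel, following the formula above; the sample and the bandwidth h are illustrative:

```python
import numpy as np

def kernel_density(x_query, x_train, h):
    u = (x_query[:, None] - x_train[None, :]) / h        # (Q, N) scaled differences
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)       # Gaussian kernel
    return K.sum(axis=1) / (len(x_train) * h)            # p_hat at each query point

rng = np.random.default_rng(0)
x_train = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 0.5, 100)])
xs = np.linspace(-3, 6, 200)
p_hat = kernel_density(xs, x_train, h=0.3)
print(p_hat.max(), p_hat.sum() * (xs[1] - xs[0]))        # integrates to ~1 (Riemann sum)
```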
86
k-Nearest Neighbor Estimator
 Instead of fixing bin width h and counting
  the number of instances, fix the instances
  (neighbors) k and check bin width

    \hat{p}(x) = \frac{k}{2 N d_k(x)}

  where d_k(x) is the distance from x to its k-th closest training instance

                                                 87
88
  Multivariate Data
 Kernel density estimator:

    \hat{p}(x) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right)

 Multivariate Gaussian kernel:

  spheric:    K(u) = \left( \frac{1}{\sqrt{2\pi}} \right)^{d} \exp\left( -\frac{\| u \|^2}{2} \right)

  ellipsoid:  K(u) = \frac{1}{(2\pi)^{d/2} | S |^{1/2}} \exp\left( -\frac{1}{2} u^T S^{-1} u \right)

                                                     89
Nonparametric Classification
 Estimate p(x|Ci) and use Bayes’ rule
 Kernel estimator:

    \hat{p}(x | C_i) = \frac{1}{N_i h^d} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) r_i^t,
    \quad \hat{P}(C_i) = \frac{N_i}{N}

    g_i(x) = \hat{p}(x | C_i) \hat{P}(C_i) = \frac{1}{N h^d} \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) r_i^t

 k-NN estimator:

    \hat{p}(x | C_i) = \frac{k_i}{N_i V^k(x)},
    \quad \hat{P}(C_i | x) = \frac{ \hat{p}(x | C_i) \hat{P}(C_i) }{ \hat{p}(x) } = \frac{k_i}{k}
                                                                    90
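
A minimal k-NN classifier in the same spirit: classify a query by the majority class among its k nearest training instances. The data and k below are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)  # (Q, N)
    nn = np.argsort(d, axis=1)[:, :k]                  # indices of the k nearest neighbors
    votes = y_train[nn]                                # (Q, k) neighbor labels
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal((0, 0), 1.0, size=(100, 2)),
                     rng.normal((3, 3), 1.0, size=(100, 2))])
y_train = np.array([0] * 100 + [1] * 100)

print(knn_predict(X_train, y_train, np.array([[0.2, 0.1], [2.8, 3.1]])))   # expected: [0 1]
```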
Condensed Nearest Neighbor
 Time/space complexity of k-NN is O (N)
 Find a subset Z of X that is small and is
  accurate in classifying X (Hart, 1968)

                      E'(Z | X) = E(X | Z) + \lambda |Z|




                                                       91
Condensed Nearest Neighbor
 Incremental algorithm: Add instance if
  needed




                                           92
Nonparametric Regression
 Aka smoothing models
 Regressogram:

    \hat{g}(x) = \frac{ \sum_{t=1}^{N} b(x, x^t)\, r^t }{ \sum_{t=1}^{N} b(x, x^t) }

    where b(x, x^t) = 1 if x^t is in the same bin with x, 0 otherwise
                                                   93
94
95
   Running Mean/Kernel Smoother
 Running mean smoother:

    \hat{g}(x) = \frac{ \sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right) r^t }{ \sum_{t=1}^{N} w\left( \frac{x - x^t}{h} \right) },
    where w(u) = 1 if |u| < 1, 0 otherwise

 Kernel smoother:

    \hat{g}(x) = \frac{ \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) r^t }{ \sum_{t=1}^{N} K\left( \frac{x - x^t}{h} \right) },
    where K(·) is Gaussian

 Running line smoother
 Additive models (Hastie and Tibshirani, 1990)
                                                                96
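
A minimal kernel (Nadaraya-Watson) smoother following the formula above; the data and bandwidth h are illustrative:

```python
import numpy as np

def kernel_smoother(x_query, x_train, r_train, h):
    u = (x_query[:, None] - x_train[None, :]) / h
    K = np.exp(-0.5 * u ** 2)                  # Gaussian kernel (the constant factor cancels)
    return (K * r_train).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 2 * np.pi, 200)
r_train = np.sin(x_train) + rng.normal(scale=0.2, size=200)

xs = np.linspace(0, 2 * np.pi, 100)
g_hat = kernel_smoother(xs, x_train, r_train, h=0.3)
print(np.abs(g_hat - np.sin(xs)).mean())       # small average error on this toy data
```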
97
98
99
How to Choose k or h?
 When k or h is small, single instances
  matter; bias is small, variance is large
  (undersmoothing): High complexity
 As k or h increases, we average over more
  instances and variance decreases but bias
  increases (oversmoothing): Low
  complexity
 Cross-validation is used to finetune k or h.

                                                 100
5.6 Decision Trees
Tree Uses Nodes, and Leaves




                              102
Divide and Conquer
 Internal decision nodes
   Univariate: Uses a single attribute, xi
      Numeric xi : Binary split : xi > wm
      Discrete xi : n-way split for n possible values
   Multivariate: Uses all attributes, x
 Leaves
   Classification: Class labels, or proportions
   Regression: Numeric; the average of r at the leaf, or a local fit
 Learning is greedy; find the best split recursively
  (Breiman et al, 1984; Quinlan, 1986, 1993)
                                                         103
Classification Trees
(ID3, CART, C4.5)
 For node m, N_m instances reach m, and N_m^i of
  them belong to C_i

    \hat{P}(C_i | x, m) \equiv p_m^i = \frac{N_m^i}{N_m}

 Node m is pure if p_m^i is 0 or 1
 Measure of impurity is entropy:

    I_m = - \sum_{i=1}^{K} p_m^i \log_2 p_m^i
                                          104
Best Split
 If node m is pure, generate a leaf and stop,
  otherwise split and continue recursively
 Impurity after split: N_{mj} of the N_m instances take
  branch j, and N_{mj}^i of them belong to C_i

    \hat{P}(C_i | x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}}

    I'_m = - \sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i

 Find the variable and split that minimize impurity
  (among all variables -- and split positions for
  numeric variables)                                    105
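
A sketch of choosing the best split of a numeric attribute by the weighted entropy I'_m above; the toy data is illustrative:

```python
import numpy as np

def entropy(labels):
    p = np.bincount(labels) / len(labels)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def best_split(x, y):
    """Threshold on x that minimizes the weighted entropy of the two branches."""
    best_imp, best_thr = np.inf, None
    for thr in np.unique(x)[:-1]:                      # candidate split positions
        left, right = y[x <= thr], y[x > thr]
        imp = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if imp < best_imp:
            best_imp, best_thr = imp, thr
    return best_thr, best_imp

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))      # threshold 3.0 separates the classes (zero impurity)
```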
106
Regression Trees
 Error at node m:

    b_m(x) = 1 if x \in X_m (x reaches node m), 0 otherwise

    E_m = \frac{1}{N_m} \sum_t \big( r^t - g_m \big)^2 b_m(x^t),
    \quad g_m = \frac{ \sum_t b_m(x^t)\, r^t }{ \sum_t b_m(x^t) }

 After splitting:

    b_{mj}(x) = 1 if x \in X_{mj} (x reaches node m and takes branch j), 0 otherwise

    E'_m = \frac{1}{N_m} \sum_j \sum_t \big( r^t - g_{mj} \big)^2 b_{mj}(x^t),
    \quad g_{mj} = \frac{ \sum_t b_{mj}(x^t)\, r^t }{ \sum_t b_{mj}(x^t) }
                                                                                     107
Model Selection in Trees:




                            108
Pruning Trees
 Remove subtrees for better generalization
  (decrease variance)
   Prepruning: Early stopping
   Postpruning: Grow the whole tree then
    prune subtrees which overfit on the
    pruning set
 Prepruning is faster, postpruning is more
  accurate (requires a separate pruning set)

                                               109
      Rule Extraction from Trees
C4.5Rules
(Quinlan, 1993)




                                   110
Learning Rules
 Rule induction is similar to tree induction but
   tree induction is breadth-first,
   rule induction is depth-first; one rule at a
     time
 Rule set contains rules; rules are conjunctions of
  terms
 Rule covers an example if all terms of the rule
  evaluate to true for the example
 Sequential covering: Generate rules one at a time
  until all positive examples are covered
 IREP (Fürnkranz and Widmer, 1994), Ripper
  (Cohen, 1995)
                                                       111
112
113
Multivariate Trees




                     114

				