
                    Lecture 5

            Non-Parametric Estimation
            for Supervised Learning –
              Parzen Windows, KNN


                   Outline
Introduction
Density Estimation
Parzen Windows Estimation
Probabilistic Neural Network based on Parzen Window
K-Nearest-Neighbor Estimation
Nearest Neighbor for Classification
     – 1-NN
     – K-NN



                         Introduction

• All classical parametric densities are unimodal (have a single peak),
  whereas many practical problems involve multi-modal densities

• Nonparametric procedures can be used with arbitrary
  distributions and without the assumption that the forms
  of the underlying densities are known

• There are two types of nonparametric methods:
     – Estimating the class-conditional density p(x | ωj)
     – Estimating the posterior probability P(ωj | x)
• Density estimation from samples
     – Learning the density function from samples

                   Density Estimation
• Basic idea: estimate the class-conditional densities from a finite set of
  discrete samples, i.e., learn a density function from samples. Assume:
     – p(x) is continuous
     – p(x) is approximately constant within the small region R
     – V is the volume enclosed by R



                        P = ∫_R p(x') dx'                    (1)

        If R is small enough that p(x) is nearly constant over it:

                        ∫_R p(x') dx' ≈ p(x)·V

        With k of the n samples falling inside R, P ≈ k/n, so:

                        p(x) ≈ (k/n) / V                     (4)
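As a concrete illustration, here is a minimal Python/NumPy sketch of the estimate p(x) ≈ (k/n)/V with a fixed hypercube region; the function name density_estimate and the half_width parameter are illustrative choices, not part of the lecture.

```python
import numpy as np

def density_estimate(x, samples, half_width=0.5):
    """Estimate p(x) as (k/n)/V using a fixed hypercube region R centered at x."""
    n, d = samples.shape
    # a sample falls in R if every coordinate is within half_width of x
    inside = np.all(np.abs(samples - x) <= half_width, axis=1)
    k = np.count_nonzero(inside)
    V = (2.0 * half_width) ** d        # volume of the d-dimensional cube
    return (k / n) / V

# toy usage: 1-D samples from N(0,1); the estimate near 0 should be roughly 0.38-0.40
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))
print(density_estimate(np.array([0.0]), data))
```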

• How to choose the right volume for density estimation?
     – A volume that is too big or too small is not good for density
       estimation
     – The choice depends on the availability of data samples
• Two popular methods to choose the volume
     – Fix the volume size (Parzen windows)
     – Fix the number of samples falling in the volume (KNN), which makes
       the volume data dependent

• The volume V needs to approach 0 anyway if we
  want to use this estimation
   – Practically, V cannot be allowed to become arbitrarily small since the
     number of samples is always limited

   – One will have to accept a certain amount of variance in the ratio k/n

   – Theoretically, if an unlimited number of samples is available, we can
     circumvent this difficulty
     To estimate the density of x, we form a sequence of regions
      R1, R2,…containing x: the first region contains one sample, the second two
      samples and so on.
      Let Vn be the volume of Rn, kn the number of samples falling in Rn and pn(x)
      be the nth estimate for p(x):

                pn(x) = (kn/n)/Vn                                 (7)

 Aug. 2006                          ECE5907-NUS                                 7
Three necessary conditions should apply if we want pn(x) to converge to
   p(x):
                           1)  lim Vn = 0        (as n → ∞)

                           2)  lim kn = ∞        (as n → ∞)

                           3)  lim kn / n = 0    (as n → ∞)

There are two different ways of obtaining sequences of regions that satisfy
  these conditions:

   (a) Shrink an initial region by specifying the volume Vn = 1/√n and show that

                              pn(x) → p(x)   as n → ∞
      This is called “the Parzen-window estimation method”

   (b) Specify kn as some function of n, such as kn = √n; the volume Vn is
       grown until it encloses kn neighbors of x. This is called “the kn-
       nearest-neighbor estimation method”
  – Condition for convergence

            The fraction k/(nV) is a space averaged value of p(x).
            p(x) is obtained only if V approaches zero.

                          lim p(x) = 0    as V → 0 with k = 0   (n fixed)

            This is the case where no samples are included in    R: it is an
            uninteresting case!


                          lim p(x) = ∞    as V → 0 with k ≠ 0

            In this case, the estimate diverges: it is an uninteresting case!


Parzen Windows Estimation
• The Parzen-window approach to density estimation assumes that the
  region Rn is a d-dimensional hypercube

             Vn = hn^d      (hn: the length of the edge of Rn)

             Let φ(u) be the following window function:

                      φ(u) = 1   if |uj| ≤ 1/2,  j = 1, ..., d
                      φ(u) = 0   otherwise

• φ((x − xi)/hn) equals 1 if the sample xi falls within the hypercube of
  volume Vn centered at x, and 0 otherwise
• hn controls the kernel width: a smaller hn requires more samples, while a
  bigger hn produces a smoother density estimate

– The number of samples in this hypercube is:

                         kn = Σ_{i=1}^{n} φ((x − xi)/hn)                 (10)
By substituting kn in equation (7), we obtain the following estimate:




                     pn(x) = (1/n) Σ_{i=1}^{n} (1/Vn) φ((x − xi)/hn)     (11)

      pn(x) estimates p(x) as an average of window functions of x and
      the samples xi (i = 1, …, n). These window functions φ can be general!
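A minimal sketch of equation (11) with the hypercube window defined above (Python/NumPy assumed; hypercube_window and parzen_estimate are illustrative names):

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_estimate(x, samples, h):
    """p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h), with V_n = h**d."""
    n, d = samples.shape
    V = h ** d
    contributions = hypercube_window((x - samples) / h)
    return contributions.sum() / (n * V)

# toy usage: 2-D uniform samples on [-1, 1]^2, true density 0.25 everywhere
rng = np.random.default_rng(0)
data = rng.uniform(-1, 1, size=(500, 2))
print(parzen_estimate(np.zeros(2), data, h=0.5))   # should be close to 0.25
```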


Example 1: Parzen Window Estimation for a
Normal Density p(x) ~ N(0,1)

 •       Use a Gaussian window function: φ(u) = (1/√(2π)) exp(−u²/2)
 •       hn = h1/√n, where h1 is a user-chosen parameter (n ≥ 1)

                          pn(x) = (1/n) Σ_{i=1}^{n} (1/hn) φ((x − xi)/hn)
             pn(x) is an average of normal densities centered at the samples xi.
 •       n is the number of samples used for density estimation
 •       The more samples used, the better the estimate
 •       A small window width h1 sharpens the density estimate but requires
         more samples (see the sketch below)
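A minimal sketch of this example (Python/NumPy assumed; parzen_gaussian is an illustrative name), averaging normal densities centered at the samples:

```python
import numpy as np

def parzen_gaussian(x, samples, h1):
    """1-D Parzen estimate with a Gaussian window and h_n = h1 / sqrt(n)."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    u = (x - samples) / hn
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # N(0,1) window
    return phi.sum() / (n * hn)

# samples drawn from N(0,1); the estimate at 0 should approach 1/sqrt(2*pi) ~ 0.40 as n grows
rng = np.random.default_rng(1)
data = rng.normal(size=100)
print(parzen_gaussian(0.0, data, h1=1.0))
```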
•          For n = 1 and h1=1


         p1(x) = φ(x − x1) = (1/√(2π)) exp(−(x − x1)²/2)  ~  N(x1, 1)


      – High bias due to small n


• For n = 10 and h = 0.1, the contributions of the individual
  samples are clearly observable (see the figure below)

[Figure: Parzen-window estimates of a N(0,1) density for various n and h1]
      Analogous results are also
      obtained in two dimensions




[Figure: two-dimensional Parzen-window estimates]
Example 2: Density estimation for a
mixture of a uniform and a triangle
density

• Case where p(x) = λ1·U(a,b) + λ2·T(c,d)
  (an unknown mixture density)




[Figure: Parzen-window estimates for Example 2]
Parzen Window Estimation for classification


– Classification example
     • We estimate the densities for each category
       and classify a test point by the label
       corresponding to the maximum posterior

     • The decision region for a Parzen-window
       classifier depends upon the choice of window
       function as illustrated in the following figure.


[Figure: decision regions of a Parzen-window classifier for different window choices]
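A rough sketch of such a classifier (Python/NumPy assumed, with a Gaussian window and class priors taken as training frequencies; all names are illustrative, not the lecture's notation):

```python
import numpy as np

def parzen_classify(x, train_by_class, h1):
    """Pick the class whose Parzen density times its prior is largest.

    train_by_class: dict mapping class label -> array of training samples (n_c, d).
    """
    n_total = sum(len(s) for s in train_by_class.values())
    best_label, best_score = None, -np.inf
    for label, samples in train_by_class.items():
        n, d = samples.shape
        hn = h1 / np.sqrt(n)
        u = (x - samples) / hn
        # d-dimensional Gaussian window centered at each training sample
        phi = np.exp(-0.5 * np.sum(u ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
        density = phi.sum() / (n * hn ** d)
        score = density * (n / n_total)     # p(x | class) * P(class)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A narrower window produces more jagged class densities and hence more irregular decision regions, which is the effect illustrated in the figure.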
       Probabilistic Neural Networks
• PNN is based on Parzen estimation
     – Input with d-dimensional features
     – n training patterns
     – c classes
     – Three layers: input (d units), pattern (n units, one per training
       pattern), and category output (c units)
        Training the network

     1. Normalize each pattern x of the training set to unit length

     2. Place the first training pattern on the input units

     3. Set the weights linking the input units and the first
        pattern unit such that: w1 = x1

     4. Make a single connection from the first pattern unit to
        the category unit corresponding to the known class of
        that pattern

     5. Repeat the process for all remaining training patterns
        by setting the weights such that wk = xk (k = 1, 2, …, n)
Testing the network

 1. Normalize the test pattern x and place it at the input units
 2. Each pattern unit computes the inner product to yield the net activation

                       net_k = w_k^t · x

    and emits the nonlinear activation

                       f(net_k) = exp((net_k − 1) / σ²)

 3. Each output unit sums the contributions from all pattern
    units connected to it:

                       pn(x | ωj) = Σ_i φi  ∝  P(ωj | x)

    where the sum runs over the pattern units belonging to class ωj



 4. Classify by selecting the maximum value of pn(x | ωj)
      (j = 1, …, c)
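A compact sketch of the training and testing steps together (Python/NumPy assumed; pnn_train, pnn_classify, and the default sigma are illustrative choices):

```python
import numpy as np

def pnn_train(patterns, labels):
    """Store each normalized training pattern as the weight vector of one pattern unit."""
    W = patterns / np.linalg.norm(patterns, axis=1, keepdims=True)
    return W, np.asarray(labels)

def pnn_classify(x, W, labels, sigma=0.5):
    """Forward pass: inner products, nonlinear activation, then a sum per category."""
    x = x / np.linalg.norm(x)
    net = W @ x                               # net_k = w_k . x
    act = np.exp((net - 1.0) / sigma ** 2)    # Parzen-style Gaussian activation
    classes = np.unique(labels)
    scores = np.array([act[labels == c].sum() for c in classes])
    return classes[np.argmax(scores)]

# toy usage with two hypothetical 2-D classes
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
y = np.array([0, 0, 1, 1])
W, lab = pnn_train(X, y)
print(pnn_classify(np.array([1.0, 0.0]), W, lab))   # expected: 0
```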
                 PNN summary

• Advantages
     – Fast training and classification
     – Easy to add more training samples by adding
       more pattern nodes
     – Good for online applications
     – Much simpler than a back-propagation NN
• Disadvantages
     – High memory usage when many training samples are used

K-Nearest-Neighbor estimation (KNN)
• Goal: a solution to the problem of the unknown “best”
  window function
     – Let the cell volume be a function of the training data
     – Center a cell about x and let it grow until it captures kn samples
       (kn = f(n))
     – These kn samples are called the kn nearest neighbors of x

• Two possibilities can occur:
     – The density is high near x; the cell will then be small, which
       provides good resolution
     – The density is low near x; the cell will then grow until it reaches
       regions of higher density

We can obtain a family of estimates by setting kn = k1·√n and
 choosing different values for k1 (see the sketch below)
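A minimal sketch of the kn-nearest-neighbor estimate, using a Euclidean ball as the cell (Python/NumPy assumed; knn_density is an illustrative name):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, samples, k):
    """Grow a ball around x until it contains k samples, then p(x) ~ (k/n)/V."""
    n, d = samples.shape
    dists = np.linalg.norm(samples - x, axis=1)
    r = np.sort(dists)[k - 1]                          # radius reaching the k-th neighbor
    V = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d    # volume of the d-dimensional ball
    return (k / n) / V

# toy usage: 1-D samples from N(0,1), k chosen as roughly sqrt(n)
rng = np.random.default_rng(0)
data = rng.normal(size=(400, 1))
print(knn_density(np.array([0.0]), data, k=20))        # should be near 0.40
```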
K-NN for Classification

      Goal: estimate P(ωi | x) from a set of n labeled samples
      – Place a cell of volume V around x and capture k samples
      – If ki samples amongst the k turn out to be labeled ωi, then:
                            pn(x, ωi) = (ki/n) / V
        An estimate for Pn(ωi | x) is:

                     Pn(ωi | x) = pn(x, ωi) / Σ_{j=1}^{c} pn(x, ωj) = ki / k

• ki/k is the fraction of the samples within the
  cell that are labeled ωi

• For minimum error rate, the most frequently
  represented category within the cell is
  selected

• If k is large and the cell sufficiently small, the
  performance will approach the best possible


The 1-NN (nearest-neighbor) classifier

  • Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes
  • Let x' ∈ Dn be the closest prototype to a test point x; then
    the nearest-neighbor rule for classifying x is to assign it the
    label associated with x'
  • The nearest-neighbor rule leads to an error rate greater
    than the minimum possible: the Bayes rate
  • If the number of prototypes is large (unlimited), the error rate
    of the nearest-neighbor classifier is never worse than twice
    the Bayes rate (this can be demonstrated!)
  • If n → ∞, it is always possible to find x' sufficiently close to x so
    that: P(ωi | x') ≈ P(ωi | x)
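The rule itself is a one-liner; a minimal Python/NumPy sketch (nn_classify is an illustrative name):

```python
import numpy as np

def nn_classify(x, prototypes, labels):
    """1-NN rule: assign x the label of the closest prototype x'."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return labels[np.argmin(dists)]
```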
The KNN rule
  Goal: Classify x by assigning it the label most
   frequently represented among its k nearest
   samples, i.e., use a voting scheme




       Example:
       k = 3 (odd value) and x = (0.10, 0.25)^t

                     Prototypes                  Labels
                     (0.15, 0.35)                  ω1
                     (0.10, 0.28)                  ω2
                     (0.09, 0.30)                  ω5
                     (0.12, 0.20)                  ω2
       The three prototypes closest to x, with their labels, are:
             {(0.10, 0.28, ω2); (0.09, 0.30, ω5); (0.12, 0.20, ω2)}
       The voting scheme assigns the label ω2 to x since ω2 is the most
       frequently represented among these neighbors
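The same worked example as a short sketch (Python/NumPy assumed; knn_classify is an illustrative name, and plain integers 1, 2, 5 stand in for the labels ω1, ω2, ω5):

```python
import numpy as np

def knn_classify(x, prototypes, labels, k=3):
    """k-NN rule: majority vote among the k nearest prototypes; P(w_i | x) ~ k_i / k."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    nearest = np.argsort(dists)[:k]
    classes, counts = np.unique(labels[nearest], return_counts=True)
    posteriors = dict(zip(classes.tolist(), (counts / k).tolist()))
    return classes[np.argmax(counts)], posteriors

protos = np.array([[0.15, 0.35], [0.10, 0.28], [0.09, 0.30], [0.12, 0.20]])
labs = np.array([1, 2, 5, 2])
label, post = knn_classify(np.array([0.10, 0.25]), protos, labs, k=3)
print(label, post)   # -> 2 {2: 0.67, 5: 0.33} (approximately)
```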
                    More on K-NN

• The simplest classifier, often used as a baseline
  for performance comparison with more
  sophisticated classifiers
• High computation cost, especially when the number of samples
  is large
• Only became practical in the 1980s
• Methods to improve efficiency
     – NN editing
     – Vector quantization (VQ), developed in the early 1990s

                            Summary
• Advantages of Parzen Window Density Estimation
     –   No assumption about the form of the underlying distribution
     –   A general density estimator
     –   Based only on the samples
     –   High accuracy if enough samples are available
• Disadvantages
     – Requires many samples
     – High computation cost
     – Curse of dimensionality
• How to choose the best window function?
     – Use KNN (k-nearest-neighbor) estimation

                      Reading
• Chapter 4, Pattern Classification by Duda, Hart, Stork,
  2001, Sections 4.1-4.5



