# Lecture 5: Non-Parametric Estimation for Supervised Learning


Lecture 5
Non-Parametric Estimation for Supervised Learning –
Parzen Windows, KNN

Aug. 2006    ECE5907-NUS
## Outline
- Introduction
- Density Estimation
- Parzen Window Estimation
- Probabilistic Neural Network based on Parzen Window
- K-Nearest Neighbor Estimation
- Nearest Neighbor for Classification
  - 1-NN
  - KNN
## Introduction

- All classical parametric densities are unimodal (have a single peak), whereas many practical problems involve multi-modal densities.
- Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known.
- There are two types of nonparametric methods:
  - Estimating the class-conditional density P(x | ωj)
  - Estimating the a-posteriori probability P(ωj | x)
- Density estimation from samples: learning the density function directly from the samples.
## Density Estimation

- Basic idea: estimate a density function (e.g. a class-conditional density) from a set of discrete samples, assuming:
  - p(x) is continuous
  - p(x) is approximately constant within a small region R
  - V is the volume enclosed by R

The probability mass falling in R is

    P = ∫_R p(x′) dx′                          (1)

If R is small enough that p is roughly constant over it,

    ∫_R p(x′) dx′ ≈ p(x)·V                     (4)

so, with k of the n samples falling in R,

    p(x) ≈ (k/n) / V
- How to choose the right volume for density estimation?
  - A volume that is too big or too small is not good for density estimation.
  - The choice depends on the availability of data samples.
- Two popular methods for choosing the volume:
  - Fix the volume size (Parzen windows)
  - Fix the number of samples falling in the volume (KNN); the volume is then data dependent
- The volume V needs to approach 0 if this estimate is to converge to p(x).
  - Practically, V cannot be allowed to become arbitrarily small, since the number of samples is always limited.
  - One will have to accept a certain amount of variance in the ratio k/n.
  - Theoretically, if an unlimited number of samples were available, we could circumvent this difficulty.

To estimate the density at x, we form a sequence of regions R1, R2, … containing x: the first region is used with one sample, the second with two samples, and so on.

Let Vn be the volume of Rn, kn the number of samples falling in Rn, and pn(x) the n-th estimate of p(x):

    pn(x) = (kn/n) / Vn                        (7)
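The ratio estimate above can be sketched in a few lines of Python. This is a hypothetical 1-D illustration; the N(0, 1) sample source and the interval width are assumptions for demonstration, not part of the slides:

```python
import numpy as np

# Hypothetical data: n samples drawn from a standard normal
rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=10_000)

def density_estimate(x, samples, h):
    """Estimate p(x) as (k/n)/V using the 1-D region R = [x - h/2, x + h/2]."""
    n = len(samples)
    V = h                                        # "volume" of R in one dimension
    k = np.count_nonzero(np.abs(samples - x) <= h / 2)
    return (k / n) / V

# The true density at 0 is 1/sqrt(2*pi) ≈ 0.399
print(density_estimate(0.0, samples, h=0.5))
```

With a fixed h the estimate is a space-averaged value of p over R, which is exactly why the convergence conditions below require the volume to shrink as n grows.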
Three conditions are necessary if we want pn(x) to converge to p(x):

    1) lim_{n→∞} Vn = 0
    2) lim_{n→∞} kn = ∞
    3) lim_{n→∞} kn/n = 0

There are two different ways of obtaining sequences of regions that satisfy these conditions:

(a) Shrink an initial region by specifying the volume as a function of n, e.g. Vn = 1/√n, and show that

    pn(x) → p(x)  as  n → ∞

This is called "the Parzen-window estimation method".

(b) Specify kn as some function of n, such as kn = √n; the volume Vn is grown until it encloses the kn nearest neighbors of x. This is called "the kn-nearest-neighbor estimation method".
- Conditions for convergence

The fraction k/(nV) is a space-averaged value of p(x); p(x) itself is obtained only if V approaches zero.

    lim_{V→0, k=0} pn(x) = 0        (n fixed)

This is the case where no samples are included in R: it is an uninteresting case!

    lim_{V→0, k≠0} pn(x) = ∞

In this case the estimate diverges: it is also an uninteresting case!
## Parzen Window Estimation

- The Parzen-window approach to density estimation assumes that the region Rn is a d-dimensional hypercube:

      Vn = hn^d        (hn: length of an edge of Rn)

Let φ(u) be the following window function:

    φ(u) = 1   if |uj| ≤ 1/2,  j = 1, …, d
           0   otherwise

- φ((x − xi)/hn) is a unit window function: it equals 1 exactly when xi falls within the hypercube of volume Vn centered at x.
- hn controls the kernel width: a smaller hn requires more samples, while a bigger hn produces a smoother density estimate.
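The hypercube window can be written directly from its definition. A minimal sketch (the test points are illustrative assumptions):

```python
import numpy as np

def phi(u):
    """Hypercube window: 1 if |u_j| <= 1/2 for every coordinate j, else 0."""
    return 1.0 if np.all(np.abs(np.atleast_1d(u)) <= 0.5) else 0.0

# phi((x - xi)/h) is 1 exactly when xi lies inside the hypercube
# of edge length h centered at x
x, xi, h = np.array([0.0, 0.0]), np.array([0.3, -0.2]), 1.0
print(phi((x - xi) / h))   # xi is inside the unit square around x -> 1.0
print(phi([0.2, 0.6]))     # second coordinate exceeds 1/2 -> 0.0
```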
- The number of samples in this hypercube is:

      kn = Σ_{i=1}^{n} φ((x − xi)/hn)                    (10)

By substituting kn into equation (7), we obtain the following estimate:

    pn(x) = (1/n) Σ_{i=1}^{n} (1/Vn) φ((x − xi)/hn)      (11)

pn(x) estimates p(x) as an average of functions of x and the samples xi (i = 1, …, n). These window functions φ can be quite general!
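Equation (11) translates almost line for line into code. A minimal sketch, assuming the hypercube window above and hypothetical 1-D normal data:

```python
import numpy as np

def hypercube_window(u):
    """phi(u) = 1 if |u_j| <= 1/2 for all j, else 0."""
    return 1.0 if np.all(np.abs(u) <= 0.5) else 0.0

def parzen_estimate(x, samples, h, window=hypercube_window):
    """Equation (11): p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h_n)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    d = x.shape[0]
    V = h ** d                                   # V_n = h_n^d for a hypercube
    return np.mean([window((x - xi) / h) for xi in samples]) / V

# Hypothetical data: 2000 samples from a 1-D standard normal
rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 1))
print(parzen_estimate([0.0], data, h=0.4))       # near 1/sqrt(2*pi) ≈ 0.399
```

Any window that integrates to 1 can be swapped in for `hypercube_window`, which is what the slide means by "these functions can be general".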
## Example 1: Parzen-Window Estimation for a Normal Density p(x) ~ N(0, 1)

- Use a Gaussian window function: φ(u) = (1/√(2π)) exp(−u²/2)
- hn = h1/√n, where h1 is a design parameter (n ≥ 1)

      pn(x) = (1/n) Σ_{i=1}^{n} (1/hn) φ((x − xi)/hn)

is an average of normal densities centered at the samples xi.
- n is the number of samples used for density estimation.
- The more samples used, the better the estimate.
- A small window width h1 sharpens the density estimate, but requires more samples.
- For n = 1 and h1 = 1:

      p1(x) = φ(x − x1) = (1/√(2π)) e^(−(x − x1)²/2)  ~  N(x1, 1)

  - High bias due to small n.
- For n = 10 and h = 0.1, the contributions of the individual samples are clearly observable (see figures on the following slides).
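A Gaussian-window Parzen estimate along these lines can be sketched as follows; the RNG seed and sample sizes are illustrative assumptions. As n grows (with hn = h1/√n), the estimate at x = 0 approaches the true value 1/√(2π) ≈ 0.399:

```python
import numpy as np

def gaussian_window(u):
    """phi(u) = (1/sqrt(2*pi)) * exp(-u**2 / 2)"""
    return np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)

def parzen_gaussian(x, samples, h1):
    """1-D Parzen estimate with Gaussian windows and h_n = h_1 / sqrt(n)."""
    n = len(samples)
    hn = h1 / np.sqrt(n)
    return np.mean(gaussian_window((x - samples) / hn)) / hn

rng = np.random.default_rng(0)
for n in (10, 100, 10_000):                      # more samples -> better estimate
    samples = rng.normal(size=n)
    print(n, parzen_gaussian(0.0, samples, h1=1.0))
```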
Analogous results are also obtained in two dimensions.
## Example 2: Density Estimation for a Mixture of a Uniform and a Triangle Density

- Case where p(x) = π1·U(a, b) + π2·T(c, d) (unknown density)
## Parzen-Window Estimation for Classification

- Classification example:
  - We estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior.
  - The decision region of a Parzen-window classifier depends upon the choice of window function, as illustrated in the following figure.
## Probabilistic Neural Networks

- A PNN implements Parzen estimation as a network with:
  - inputs of d-dimensional features
  - n training patterns
  - c classes
- Three layers: input (d units), (training) pattern (one unit per training pattern, n units), and category output (c units).
## Training the network

1. Normalize each pattern x of the training set to unit length.
2. Place the first training pattern on the input units.
3. Set the weights linking the input units and the first pattern unit such that w1 = x1.
4. Make a single connection from the first pattern unit to the category unit corresponding to the known class of that pattern.
5. Repeat the process for all remaining training patterns, setting the weights such that wk = xk (k = 1, 2, …, n).
## Testing the network

1. Normalize the test pattern x and place it at the input units.
2. Each pattern unit computes the inner product to yield the net activation

       net_k = w_k^t · x

   and emits a nonlinear function of it:

       f(net_k) = exp((net_k − 1) / σ²)

3. Each output unit sums the contributions from all pattern units connected to it:

       Pn(x | ωj) = Σ_{i=1}^{n} φi  ∝  P(ωj | x)

4. Classify by selecting the maximum value of Pn(x | ωj), j = 1, …, c.
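The train/test procedure above can be sketched compactly: "training" just stores the normalized patterns as weights, and testing sums exp((net − 1)/σ²) per category. The two-class toy data, labels, and σ value are illustrative assumptions:

```python
import numpy as np

def normalize(x):
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def pnn_classify(x, patterns, labels, sigma=0.5):
    """PNN decision: each pattern unit computes net_k = w_k . x and emits
    exp((net_k - 1) / sigma**2); each category unit sums its pattern units."""
    x = normalize(x)
    scores = {}
    for w, label in zip(patterns, labels):
        net = float(np.dot(normalize(w), x))     # inner product of unit vectors
        scores[label] = scores.get(label, 0.0) + np.exp((net - 1.0) / sigma ** 2)
    return max(scores, key=scores.get)           # maximum P_n(x | w_j)

# Hypothetical two-class training set (the "stored weights" w_k = x_k)
patterns = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
labels = ["A", "A", "B"]
print(pnn_classify((1.0, 0.2), patterns, labels))   # -> "A"
```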
## PNN summary

- Fast training and classification (though classification slows with more pattern nodes)
- Good for online applications
- Much simpler than the back-propagation NN
- High memory cost if many training samples are used
## K-Nearest Neighbor Estimation (KNN)

- Goal: a solution to the problem of the unknown "best" window function.
  - Let the cell volume be a function of the training data.
  - Center a cell about x and let it grow until it captures kn samples (kn = f(n)).
  - These kn samples are called the kn nearest neighbors of x.
- Two possibilities can occur:
  - The density is high near x: the cell will be small, which provides good resolution.
  - The density is low: the cell will grow large, stopping only when higher-density regions are reached.

We can obtain a family of estimates by setting kn = k1√n and choosing different values for k1.
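A 1-D version of the kn-nearest-neighbor density estimate can be sketched as follows; the N(0, 1) samples and k1 = 1 (so kn = √n) are illustrative assumptions:

```python
import numpy as np

def knn_density(x, samples, k):
    """k_n-NN density estimate in 1-D: grow an interval centered at x until it
    captures k samples, then p_n(x) = (k/n) / V."""
    n = len(samples)
    dists = np.sort(np.abs(np.asarray(samples, dtype=float) - x))
    V = 2.0 * dists[k - 1]            # half-width = distance to k-th neighbor
    return (k / n) / V

rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
k = int(np.sqrt(len(samples)))        # k_n = sqrt(n), i.e. k_1 = 1
print(knn_density(0.0, samples, k))   # near 1/sqrt(2*pi) ≈ 0.399
```

Note how the cell adapts: in low-density regions `dists[k - 1]` is large, so V grows automatically.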
## K-NN for Classification

Goal: estimate P(ωi | x) from a set of n labeled samples.
- Place a cell of volume V around x and capture k samples.
- If ki samples among the k turn out to be labeled ωi, then:

      pn(x, ωi) = (ki/n) / V

An estimate for the posterior pn(ωi | x) is:

    Pn(ωi | x) = pn(x, ωi) / Σ_{j=1}^{c} pn(x, ωj) = ki / k
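Because the volume V cancels in the ratio, the posterior estimate reduces to counting labels among the captured neighbors. A minimal sketch (the label list is an illustrative assumption):

```python
from collections import Counter

def knn_posteriors(neighbor_labels):
    """P_n(w_i | x) = k_i / k, where k_i counts captured samples labeled w_i."""
    k = len(neighbor_labels)
    return {label: count / k for label, count in Counter(neighbor_labels).items()}

# Hypothetical cell capturing k = 5 samples
posteriors = knn_posteriors(["A", "B", "A", "A", "B"])
print(posteriors)                             # -> {'A': 0.6, 'B': 0.4}
print(max(posteriors, key=posteriors.get))    # minimum-error decision -> 'A'
```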
- ki/k is the fraction of the samples within the cell that are labeled ωi.
- For minimum error rate, the most frequently represented category within the cell is selected.
- If k is large and the cell is sufficiently small, the performance will approach the best possible.
## The 1-NN (Nearest-Neighbor) Classifier

- Let Dn = {x1, x2, …, xn} be a set of n labeled prototypes.
- Let x′ ∈ Dn be the closest prototype to a test point x; the nearest-neighbor rule for classifying x is to assign it the label associated with x′.
- The nearest-neighbor rule leads to an error rate greater than the minimum possible: the Bayes rate.
- If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be demonstrated!).
- As n → ∞, it is always possible to find x′ sufficiently close to x so that P(ωi | x′) ≈ P(ωi | x).
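The 1-NN rule is a one-liner over the prototype set. A minimal sketch with hypothetical prototypes and labels:

```python
import numpy as np

def nn_classify(x, prototypes, labels):
    """1-NN rule: assign x the label of its closest prototype x'."""
    prototypes = np.asarray(prototypes, dtype=float)
    dists = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    return labels[int(np.argmin(dists))]

# Hypothetical labeled prototype set D_n
prototypes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
labels = ["red", "blue", "red"]
print(nn_classify((0.9, 0.8), prototypes, labels))   # nearest is (1, 1) -> "blue"
```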
## The KNN Rule

Goal: classify x by assigning it the label most frequently represented among its k nearest samples, i.e. use a voting scheme.
Example: k = 3 (an odd value) and x = (0.10, 0.25)^t

    Prototypes        Labels
    (0.15, 0.35)        1
    (0.10, 0.28)        2
    (0.09, 0.30)        5
    (0.12, 0.20)        2

The three closest prototypes to x, with their labels, are:
{(0.10, 0.28), 2; (0.09, 0.30), 5; (0.12, 0.20), 2}
The voting scheme assigns the label 2 to x, since 2 is the most frequently represented among the neighbors.
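The voting rule applied to this worked example can be sketched as:

```python
import numpy as np
from collections import Counter

def knn_classify(x, prototypes, labels, k):
    """k-NN rule: majority vote among the k nearest prototypes."""
    prototypes = np.asarray(prototypes, dtype=float)
    dists = np.linalg.norm(prototypes - np.asarray(x, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# The prototypes and labels from the example above
prototypes = [(0.15, 0.35), (0.10, 0.28), (0.09, 0.30), (0.12, 0.20)]
labels = [1, 2, 5, 2]
print(knn_classify((0.10, 0.25), prototypes, labels, k=3))   # -> 2
```

Choosing an odd k (for two-class problems) or breaking ties by distance avoids ambiguous votes.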
## More on K-NN

- The simplest classifier; often used as a baseline for performance comparison with more sophisticated classifiers.
- High computational cost, especially when the number of samples is large.
- Only became practical in the 1980s.
- Methods to improve efficiency:
  - NN editing
  - Vector quantization (VQ), developed in the early 1990s
## Summary

- Advantages of Parzen-window density estimation:
  - No assumption about the underlying distribution
  - A general density estimator
  - Based only on samples
  - High accuracy if enough samples are available
- Disadvantages:
  - Requires too many samples
  - High computation cost
  - Curse of dimensionality
- How to choose the best window function?
  - KNN (k-nearest neighbor) estimation offers one answer
## Reference

- Chapter 4 (Sections 4.1–4.5), Pattern Classification by Duda, Hart, and Stork, 2001.
