Lecture 5
Non-Parametric Estimation for Supervised Learning – Parzen Windows, KNN
ECE5907-NUS, Aug. 2006

Outline
• Introduction
• Density Estimation
• Parzen Window Estimation
• Probabilistic Neural Network based on the Parzen Window
• K-Nearest-Neighbor Estimation
• Nearest Neighbor for Classification
  – 1-NN
  – KNN

Introduction
• All classical parametric densities are unimodal (have a single peak), whereas many practical problems involve multi-modal densities
• Nonparametric procedures can be used with arbitrary distributions and without the assumption that the forms of the underlying densities are known
• There are two types of nonparametric methods:
  – Estimating the class-conditional density $p(x \mid \omega_j)$
  – Estimating the a-posteriori probability $P(\omega_j \mid x)$
• Density estimation from samples
  – Learning the density function from samples

Density Estimation
• Basic idea: given samples, estimate the class-conditional densities; that is, estimate a density function from discrete samples, assuming
  – $p(x)$ is continuous
  – $p(x)$ is approximately constant within the small region $R$
  – $V$ is the volume enclosed by $R$
• The probability that a sample falls in $R$ is
  $P = \int_R p(x')\,dx'$   (1)
• If $R$ is small enough that $p$ varies little over it,
  $\int_R p(x')\,dx' \simeq p(x)\,V$   (4)
• If $k$ of the $n$ samples fall in $R$, then $P \simeq k/n$ and
  $p(x) \simeq \dfrac{k/n}{V}$

• How do we choose the right volume for density estimation?
  – A volume that is too big or too small is bad for density estimation
  – The choice depends on the availability of data samples
• Two popular methods to choose the volume:
  – Fix the volume size (Parzen windows)
  – Fix the number of samples falling in the volume (KNN); the volume becomes data dependent

• The volume $V$ needs to approach 0 if this estimate is to converge to $p(x)$
  – Practically, $V$ cannot be allowed to become arbitrarily small, since the number of samples is always limited
  – One has to accept a certain amount of variance in the ratio $k/n$
  – Theoretically, if an unlimited number of samples is available, we can circumvent this difficulty

To estimate the density at $x$, we form a sequence of regions $R_1, R_2, \dots$ containing $x$: the first region is used with one sample, the second with two samples, and so on. Let $V_n$ be the volume of $R_n$, $k_n$ the number of samples falling in $R_n$, and $p_n(x)$ the $n$th estimate of $p(x)$:
  $p_n(x) = \dfrac{k_n/n}{V_n}$   (7)

Three conditions are necessary if we want $p_n(x)$ to converge to $p(x)$:
  1) $\lim_{n\to\infty} V_n = 0$
  2) $\lim_{n\to\infty} k_n = \infty$
  3) $\lim_{n\to\infty} k_n/n = 0$

There are two different ways of obtaining sequences of regions that satisfy these conditions:
(a) Shrink an initial region, e.g. $V_n = 1/\sqrt{n}$, and show that $p_n(x) \to p(x)$ as $n \to \infty$. This is called the "Parzen-window estimation method".
(b) Specify $k_n$ as some function of $n$, such as $k_n = \sqrt{n}$; the volume $V_n$ is grown until it encloses the $k_n$ nearest neighbors of $x$. This is called the "$k_n$-nearest-neighbor estimation method".

– Condition for convergence: the fraction $k/(nV)$ is a space-averaged value of $p(x)$, and $p(x)$ itself is obtained only as $V$ approaches zero. With $n$ fixed, however, this limit misbehaves:
  • If $V \to 0$ with $k = 0$, then $p_n(x) = 0$: no samples are included in $R$, an uninteresting case
  • If $V \to 0$ with $k \ne 0$ (a sample coincides with $x$), the estimate diverges: again an uninteresting case

Parzen Windows Estimation
• The Parzen-window approach to density estimation assumes that the region $R_n$ is a $d$-dimensional hypercube:
  $V_n = h_n^d$   ($h_n$: length of an edge of $R_n$)
• Let $\varphi(u)$ be the following window function:
  $\varphi(u) = 1$ if $|u_j| \le 1/2$ for $j = 1, \dots, d$, and $0$ otherwise
• $\varphi((x - x_i)/h_n)$ is a unit window function: it equals 1 exactly when $x_i$ falls within the hypercube of side $h_n$ centered at $x$
• $h_n$ controls the kernel width: a smaller $h_n$ requires more samples, while a bigger $h_n$ produces a smoother density estimate

The number of samples in this hypercube is
  $k_n = \sum_{i=1}^{n} \varphi\left(\dfrac{x - x_i}{h_n}\right)$   (10)
and substituting $k_n$ into equation (7) gives the estimate
  $p_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{V_n}\, \varphi\left(\dfrac{x - x_i}{h_n}\right)$   (11)
$p_n(x)$ estimates $p(x)$ as an average of window functions of $x$ and the samples $x_i$ ($i = 1, \dots, n$). These window functions can be quite general.
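As a concrete illustration of equation (11), here is a minimal sketch of a 1-D Parzen-window estimate using the hypercube window above. The function name and the choice $h_n = h_1/\sqrt{n}$ with $h_1 = 1$ are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Evaluate p_n(x) = (1/n) * sum_i (1/V_n) * phi((x - x_i)/h)
    in one dimension, where phi is the unit hypercube window
    (phi(u) = 1 if |u| <= 1/2, else 0) and V_n = h."""
    x = np.atleast_1d(x).astype(float)[:, None]      # query points, shape (m, 1)
    u = (x - samples[None, :]) / h                   # scaled offsets, shape (m, n)
    phi = (np.abs(u) <= 0.5).astype(float)           # hypercube window hits
    return phi.sum(axis=1) / (len(samples) * h)      # average of windows over V_n

# Estimate a standard normal density from n samples
rng = np.random.default_rng(0)
n = 1000
samples = rng.standard_normal(n)
h = 1.0 / np.sqrt(n)                                 # h_n = h_1 / sqrt(n), h_1 = 1
print(parzen_estimate([0.0, 1.0, 2.0], samples, h))  # roughly the N(0,1) values
```

Replacing the hard hypercube window with the Gaussian window of Example 1 below only changes the `phi` line; the averaging structure of equation (11) stays the same.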
Example 1: Parzen-Window Estimation for a Normal Density $p(x) \sim N(0,1)$
• Use the Gaussian window function $\varphi(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}$
• Let $h_n = h_1/\sqrt{n}$, where $h_1$ is a free parameter ($n \ge 1$); then
  $p_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{h_n}\, \varphi\left(\dfrac{x - x_i}{h_n}\right)$
  is an average of normal densities centered at the samples $x_i$
• $n$ is the number of samples used for density estimation
• The more samples used, the better the estimate
• A small window width $h_1$ sharpens the density estimate but requires more samples

• For $n = 1$ and $h_1 = 1$,
  $p_1(x) = \varphi(x - x_1) = \frac{1}{\sqrt{2\pi}}\, e^{-(x - x_1)^2/2} \sim N(x_1, 1)$
  – High bias due to the small $n$
• For $n = 10$ and $h = 0.1$, the contributions of the individual samples are clearly observable (see the figures below)

[Figures: 1-D Parzen-window estimates of $N(0,1)$ for increasing $n$ and various $h_1$. Analogous results are obtained in two dimensions.]

Example 2: Density Estimation for a Mixture of a Uniform and a Triangle Density
• Case where $p(x) = \lambda_1 U(a,b) + \lambda_2 T(c,d)$ (unknown density)

[Figure: Parzen-window estimates of the mixture density]

Parzen Window Estimation for Classification
• We estimate the densities for each category and classify a test point by the label corresponding to the maximum posterior
• The decision region of a Parzen-window classifier depends on the choice of window function, as illustrated in the figure below

[Figure: decision regions of a Parzen-window classifier for different window choices]

Probabilistic Neural Networks
• A PNN is a network implementation of Parzen estimation
  – $d$ input units, one per feature dimension
  – $n$ (training) pattern units, one per training pattern
  – $c$ category output units, one per class
  – Three layers in total: input, pattern, and category output

Training the network
1. Normalize each pattern $x$ of the training set to unit length
2. Place the first training pattern on the input units
3. Set the weights linking the input units and the first pattern unit such that $w_1 = x_1$
4. Make a single connection from the first pattern unit to the category unit corresponding to the known class of that pattern
5. Repeat the process for all remaining training patterns, setting the weights such that $w_k = x_k$ ($k = 1, 2, \dots, n$)

Testing the network
1. Normalize the test pattern $x$ and place it at the input units
2. Each pattern unit computes the inner product $\mathrm{net}_k = w_k^t x$ and emits the nonlinear activation $f(\mathrm{net}_k) = \exp\left[\frac{\mathrm{net}_k - 1}{\sigma^2}\right]$
3. Each output unit sums the contributions from all pattern units connected to it, yielding the Parzen estimate $P_n(x \mid \omega_j)$
4. Classify $x$ by selecting the maximum of $P_n(x \mid \omega_j)$ ($j = 1, \dots, c$)

PNN summary (a code sketch follows this list)
• Advantages
  – Fast training and classification
  – Easy to add more training samples by adding more pattern nodes
  – Good for online applications
  – Much simpler than the back-propagation NN
• Disadvantages
  – High memory cost if many training samples are used
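To make the training and testing procedures above concrete, here is a minimal sketch of a PNN classifier. The function name, the value of $\sigma$, and the toy data are illustrative assumptions, not from the lecture:

```python
import numpy as np

def pnn_classify(x, W, labels, sigma=0.5):
    """Classify x with a trained PNN. Each row of W is one stored,
    unit-normalized training pattern (w_k = x_k); labels[k] is the
    class index of pattern k."""
    x = x / np.linalg.norm(x)                  # step 1: normalize the test pattern
    net = W @ x                                # step 2: net_k = w_k . x
    act = np.exp((net - 1.0) / sigma**2)       # activation f(net_k)
    scores = np.zeros(labels.max() + 1)
    np.add.at(scores, labels, act)             # step 3: output units sum their patterns
    return int(np.argmax(scores))              # step 4: maximum P_n(x | w_j)

# "Training" just stores the normalized patterns: w_k = x_k
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0]])
W = X / np.linalg.norm(X, axis=1, keepdims=True)
y = np.array([0, 0, 1])
print(pnn_classify(np.array([0.2, 0.9]), W, y))   # -> 1
```

Adding a new training sample is just appending a row to `W` and an entry to `labels`, which is why PNN training is fast; the price is that every stored pattern participates in every classification, hence the memory cost noted above.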
K-Nearest-Neighbor Estimation (KNN)
• Goal: a solution to the problem of the unknown "best" window function
  – Let the cell volume be a function of the training data
  – Center a cell about $x$ and let it grow until it captures $k_n$ samples ($k_n = f(n)$)
  – These captured samples are the $k_n$ nearest neighbors of $x$
• Two possibilities can occur:
  – The density is high near $x$: the cell will be small, which provides good resolution
  – The density is low: the cell will grow large, stopping only when it reaches regions of higher density
• We can obtain a family of estimates by setting $k_n = k_1\sqrt{n}$ and choosing different values for $k_1$

K-NN for Classification
Goal: estimate $P(\omega_i \mid x)$ from a set of $n$ labeled samples
– Place a cell of volume $V$ around $x$ and capture $k$ samples
– If $k_i$ samples among the $k$ turn out to be labeled $\omega_i$, then
  $p_n(x, \omega_i) = \dfrac{k_i/n}{V}$
– An estimate for $P_n(\omega_i \mid x)$ is
  $P_n(\omega_i \mid x) = \dfrac{p_n(x, \omega_i)}{\sum_{j=1}^{c} p_n(x, \omega_j)} = \dfrac{k_i}{k}$

• $k_i/k$ is the fraction of the samples within the cell that are labeled $\omega_i$
• For a minimum error rate, the most frequently represented category within the cell is selected
• If $k$ is large and the cell is sufficiently small, the performance will approach the best possible

The 1-NN (Nearest-Neighbor) Classifier
• Let $D_n = \{x_1, x_2, \dots, x_n\}$ be a set of $n$ labeled prototypes
• Let $x' \in D_n$ be the closest prototype to a test point $x$; the nearest-neighbor rule classifies $x$ by assigning it the label associated with $x'$
• The nearest-neighbor rule leads to an error rate greater than the minimum possible, the Bayes rate
• If the number of prototypes is large (unlimited), the error rate of the nearest-neighbor classifier is never worse than twice the Bayes rate (this can be proved!)
• As $n \to \infty$, it is always possible to find an $x'$ sufficiently close to $x$ that $P(\omega_i \mid x') \simeq P(\omega_i \mid x)$

The KNN Rule
Goal: classify $x$ by assigning it the label most frequently represented among its $k$ nearest samples, using a voting scheme (a code sketch is given at the end of these notes)

Example: $k = 3$ (an odd value, to avoid ties) and $x = (0.10, 0.25)^t$

Prototypes      Labels
(0.15, 0.35)    $\omega_1$
(0.10, 0.28)    $\omega_2$
(0.09, 0.30)    $\omega_5$
(0.12, 0.20)    $\omega_2$

The three prototypes closest to $x$ by Euclidean distance, with their labels, are
{(0.10, 0.28), $\omega_2$; (0.09, 0.30), $\omega_5$; (0.12, 0.20), $\omega_2$}
so the voting scheme assigns the label $\omega_2$ to $x$, since $\omega_2$ is the most frequently represented.

More on K-NN
• The simplest classifier, often used as a baseline for performance comparison with more sophisticated classifiers
• High computation cost, especially when the number of samples is large
• Only became practical in the 1980s
• Methods to improve efficiency:
  – NN editing
  – Vector quantization (VQ), developed in the early 1990s

Summary
• Advantages of Parzen-window density estimation
  – No assumption on the underlying distribution
  – A general density estimator, based only on samples
  – High accuracy if enough samples are available
• Disadvantages
  – Requires many samples
  – High computation cost
  – Curse of dimensionality
• How to choose the best window function?
  – KNN (k-nearest-neighbor) estimation

Reading
• Chapter 4 (Sections 4.1-4.5), Pattern Classification, by Duda, Hart, and Stork, 2001
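To close, here is a minimal sketch of the k-NN voting rule applied to the worked example above. The function name is an illustrative assumption, and any distance metric could replace the Euclidean one:

```python
import numpy as np

def knn_classify(x, prototypes, labels, k=3):
    """k-nearest-neighbor rule: assign x the label most frequently
    represented among its k closest labeled prototypes."""
    dists = np.linalg.norm(prototypes - x, axis=1)  # Euclidean distances to all prototypes
    nearest = np.argsort(dists)[:k]                 # indices of the k closest
    votes = np.bincount(labels[nearest])            # vote counts per label
    return int(np.argmax(votes))                    # most frequent label wins

# The worked example: x = (0.10, 0.25)^t with k = 3
P = np.array([[0.15, 0.35], [0.10, 0.28], [0.09, 0.30], [0.12, 0.20]])
y = np.array([1, 2, 5, 2])                          # labels w_1, w_2, w_5, w_2
print(knn_classify(np.array([0.10, 0.25]), P, y))   # -> 2
```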