7.4 Discriminant Rules (More Than Two Populations)

Almost all of the discrimination rules given in Section 7.1 can be extended to more than two populations (there is no longer a single linear discriminant function rule).

Example: Wheat kernels (wheat.sas, wheat.R, wheat.dat)

A researcher at Kansas State University wanted to classify wheat kernels into "healthy", "sprout", or "scab" types. The researcher developed an automated method to take the following measurements: kernel density, hardness index, size index, weight index, and moisture content. The purpose of the discriminant analysis was to see if the wheat kernels could accurately be classified into the three types using the five measurements.

Partial listing of the data set (s=single, k=kernel):

  lot   class  kernel  type1    skden1  skden2  skden3  skhard  sksize  skwt   skmst
  5911  hrw    1       Healthy  1.35    1.35    1.36    60.33   2.30    24.65  12.02
  5911  hrw    2       Healthy  1.29    1.29    1.29    56.09   2.73    33.30  12.17
  5911  hrw    3       Healthy  1.24    1.23    1.23    43.99   2.51    31.76  11.88
  5911  hrw    4       Healthy  1.34    1.34    1.34    53.82   2.27    32.71  12.11
  5911  hrw    5       Healthy  1.26    1.26    1.26    44.39   2.35    26.07  12.06
  ...
  srw4  srw    276     Scab     1.03    1.03    1.03    -9.57   2.06    23.82  12.65

Lot is the batch of the wheat.
Class is hrw=hard red winter wheat and srw=soft red winter wheat; a new variable called hrw is created to be a binary numerical variable denoting the different classes.
Kernel is an identification number.
type1 is the kernel type: Healthy, Sprout, Scab.
skden1, skden2, and skden3 are the density of the kernel (the measurement was repeated three times); a new variable called skden is created to be the average of the three measurements.
skhard is the kernel's hardness.
sksize is the kernel's size.
skwt is the kernel's weight.
skmst is the kernel's moisture content.

Initial R summary measures (see program for code)

[Bubble plot of the first three principal components for the wheat data: PC1 on the horizontal axis, PC2 on the vertical axis, with points labeled by observation number.]

[Star plot of the wheat data (every 10th or 20th kernel labeled) using the hrw, skhard, skden, sksize, and skmst variables as rays.]

[Parallel coordinate plot for the wheat data over wheat2.kernel, wheat2.skden, wheat2.skhard, wheat2.sksize, wheat2.skwt, wheat2.skmst, and wheat2.hrw, with lines colored by Healthy, Sprout, and Scab.]

[Parallel coordinate plot for the wheat data with kernel 31 highlighted.]

[Parallel coordinate plot for the wheat data, sorted by type1.]
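The actual plotting code is in wheat.R; purely as a hedged illustration, a parallel coordinate plot like the ones above could be drawn with MASS::parcoord(). The wheat2 data frame, its variable names, and the way the data are read in are assumptions, not the course's official code:

library(MASS)  # for parcoord()

# Assumed: wheat2 contains kernel, skden, skhard, sksize, skwt, skmst,
# hrw, and type1 (hypothetical read; see wheat.R for the real data steps)
wheat2 <- read.table(file = "wheat.dat", header = TRUE)

# Color each kernel's line by its type
wheat.colors <- ifelse(wheat2$type1 == "Healthy", "black",
                ifelse(wheat2$type1 == "Sprout", "red", "green"))

# One vertical axis per variable; each observation is a connected line
parcoord(x = wheat2[, c("kernel", "skden", "skhard", "sksize",
                        "skwt", "skmst", "hrw")],
         col = wheat.colors,
         main = "Parallel coordinate plot for wheat data")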
Initial SAS summary measures (see program for code)

The PRINCOMP Procedure

Observations  276
Variables       6

Simple Statistics

        skden          skhard         sksize         skwt           skmst          hrw
Mean    1.191571450    25.91269460    2.200355072    27.43524638    11.18883322    0.5217391304
StD     0.140570141    27.91452693    0.495121266     7.97612238     2.03025511    0.5004345937

Correlation Matrix

          skden     skhard    sksize    skwt      skmst     hrw
skden     1.0000    0.0989    0.2073    0.2764    0.0089    -.0151
skhard    0.0989    1.0000    -.0980    -.3395    -.1883    0.3909
sksize    0.2073    -.0980    1.0000    0.7566    0.0046    0.0248
skwt      0.2764    -.3395    0.7566    1.0000    0.1442    -.1410
skmst     0.0089    -.1883    0.0046    0.1442    1.0000    -.6816
hrw       -.0151    0.3909    0.0248    -.1410    -.6816    1.0000

Eigenvalues of the Correlation Matrix

     Eigenvalue    Difference    Proportion    Cumulative
1    2.14089924    0.45206317    0.3568        0.3568
2    1.68883607    0.68799736    0.2815        0.6383
3    1.00083871    0.30098689    0.1668        0.8051
4    0.69985182    0.41680304    0.1166        0.9217
5    0.28304878    0.09652340    0.0472        0.9689
6    0.18652538                  0.0311        1.0000

Eigenvectors

          Prin1       Prin2       Prin3       Prin4       Prin5       Prin6
skden     0.182219    0.315600    0.734515    -.557323    0.042130    0.123681
skhard    -.394409    0.195882    0.547652    0.630087    -.198030    -.264431
sksize    0.419028    0.501916    -.115373    0.384230    -.232840    0.597778
skwt      0.542541    0.382239    -.126822    0.079601    0.156216    -.716044
skmst     0.394311    -.461683    0.319975    0.372033    0.601871    0.168304
hrw       -.431070    0.500849    -.169666    0.001902    0.719821    0.128047

3-4 principal components appear to be the true dimension of the data. Note the possible interpretations of the principal components.

Kernel #31 is the outlier on the plot. Notice there is some separation between the different kernel classes. The division is actually between the hard and soft red winter wheat (hrw variable). It may be of interest to try a separate analysis for the two wheat classes; I will not do that here and will just use hrw as a variable for discriminating between the wheat kernel types.

What should be done about kernel #31? Talk to the researcher to make sure this kernel's data values are correct.

Discriminant Analysis:

title2 'Discriminant analysis on the wheat data set - priors proportional';
proc discrim data=set1 method=normal crossvalidate out=list_set outcross=cross_set;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors proportional;
run;

title2 'Misclassifications from crossvalidation';
proc print data=cross_set;
  where type1 ne _into_;
  var type1 _into_ healthy sprout scab;
run;
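Before walking through the SAS output below, note that a similar analysis can be run in R with MASS::lda(). This is a hedged sketch only, not the course's wheat.R code; the wheat2 data frame and its variable names are assumptions carried over from the plotting sketch above:

library(MASS)

# Assumed: wheat2 holds type1 and the six predictors.
# lda() uses the sample class proportions as priors by default, which
# matches the PRIORS PROPORTIONAL option; prior = rep(1/3, 3) would
# mimic PRIORS EQUAL instead.
fit <- lda(type1 ~ skden + skhard + sksize + skwt + skmst + hrw, data = wheat2)

# Resubstitution classifications and confusion matrix
pred <- predict(fit)
table(actual = wheat2$type1, classified = pred$class)

# Leave-one-out classifications, analogous to the CROSSVALIDATE option
fit.cv <- lda(type1 ~ skden + skhard + sksize + skwt + skmst + hrw,
              data = wheat2, CV = TRUE)
table(actual = wheat2$type1, classified = fit.cv$class)

The two tables correspond to the resubstitution and crossvalidation summaries in the SAS output that follows.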
Chris Bilder, STAT 873
Discriminant analysis on the wheat data set - priors proportional

The DISCRIM Procedure

Observations   276    DF Total             275
Variables        6    DF Within Classes    273
Classes          3    DF Between Classes     2

Class Level Information

           Variable                                    Prior
type1      Name       Frequency    Weight     Proportion    Probability
Healthy    Healthy       96        96.0000    0.347826      0.347826
Scab       Scab          84        84.0000    0.304348      0.304348
Sprout     Sprout        96        96.0000    0.347826      0.347826

Pooled Covariance Matrix Information

Covariance Matrix Rank: 6    Natural Log of the Determinant of the Covariance Matrix: 2.83517

Pairwise Generalized Squared Distances Between Groups

  D^2(i|j) = (Xbar_i - Xbar_j)' COV^-1 (Xbar_i - Xbar_j) - 2 ln PRIOR_j

Generalized Squared Distance to type1
From type1    Healthy    Scab       Sprout
Healthy       2.11211    7.28413    2.75637
Scab          7.01707    2.37917    5.44346
Sprout        2.75637    5.71052    2.11211

Linear Discriminant Function

  Constant_j = -0.5 Xbar_j' COV^-1 Xbar_j + ln PRIOR_j    Coefficient Vector_j = COV^-1 Xbar_j

Linear Discriminant Function for type1
Variable     Healthy       Scab         Sprout
Constant    -108.89245    -88.62721    -101.75140
skden         92.60313     79.65724      86.94798
skhard        -0.06784     -0.07288      -0.08135
sksize        10.47804     10.41204      10.98232
skwt           0.04965     -0.19559       0.01680
skmst          5.68324      5.79647       5.67197
hrw           18.26673     18.63187      18.46743

Classification Summary for Calibration Data: WORK.SET1
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

  D^2_j(X) = (X - Xbar_j)' COV^-1 (X - Xbar_j) - 2 ln PRIOR_j

Posterior Probability of Membership in Each type1

  Pr(j|X) = exp(-0.5 D^2_j(X)) / Σ_k exp(-0.5 D^2_k(X))

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Total
Healthy        74          7        15         96
              77.08       7.29     15.63     100.00
Scab           10         64        10         84
              11.90      76.19     11.90     100.00
Sprout         23         19        54         96
              23.96      19.79     56.25     100.00
Total         107         90        79        276
              38.77      32.61     28.62     100.00
Priors        0.34783    0.30435   0.34783

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.2292     0.2381    0.4375    0.3043
Priors    0.3478     0.3043    0.3478

Classification Summary for Calibration Data: WORK.SET1
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

  D^2_j(X) = (X - Xbar_(X)j)' COV^-1_(X) (X - Xbar_(X)j) - 2 ln PRIOR_j

(quantities subscripted by (X) are computed with observation X excluded)

Posterior Probability of Membership in Each type1

  Pr(j|X) = exp(-0.5 D^2_j(X)) / Σ_k exp(-0.5 D^2_k(X))

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Total
Healthy        67          7        22         96
              69.79       7.29     22.92     100.00
Scab           11         62        11         84
              13.10      73.81     13.10     100.00
Sprout         25         19        52         96
              26.04      19.79     54.17     100.00
Total         103         88        85        276
              37.32      31.88     30.80     100.00
Priors        0.34783    0.30435   0.34783

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.3021     0.2619    0.4583    0.3442
Priors    0.3478     0.3043    0.3478

Chris Bilder, STAT 873
Misclassifications from crossvalidation

Obs    type1      _INTO_     Healthy    Sprout     Scab
  8    Healthy    Sprout     0.45678    0.48732    0.05591
 11    Healthy    Scab       0.29606    0.25829    0.44564
 14    Sprout     Healthy    0.65448    0.33797    0.00755
 15    Sprout     Scab       0.11832    0.25361    0.62806
 16    Sprout     Healthy    0.54047    0.42140    0.03813
...
271    Scab       Healthy    0.37123    0.28543    0.34334
273    Scab       Healthy    0.47232    0.34115    0.18653
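To make the distance and posterior formulas in the output above concrete, here is a small hedged R sketch (mirroring, not reproducing, the PROC IML code in wheat.sas mentioned later; all object names are assumptions) that classifies one new observation x with the linear rule:

# Assumed inputs: xbar is a named list of the three class mean vectors,
# Sp is the pooled covariance matrix, and prior holds the prior probabilities.
classify.ldr <- function(x, xbar, Sp, prior) {
  Sp.inv <- solve(Sp)
  # Generalized squared distance to each class:
  # D2_j(x) = (x - xbar_j)' Sp^-1 (x - xbar_j) - 2 log(prior_j)
  D2 <- sapply(seq_along(xbar), function(j) {
    d <- x - xbar[[j]]
    drop(t(d) %*% Sp.inv %*% d) - 2 * log(prior[j])
  })
  # Posterior probabilities: Pr(j|x) = exp(-0.5 D2_j) / sum_k exp(-0.5 D2_k)
  post <- exp(-0.5 * D2) / sum(exp(-0.5 * D2))
  list(class = names(xbar)[which.max(post)], posterior = post)
}

The observation is assigned to the class with the largest posterior probability, which is equivalent to choosing the smallest D^2_j(x).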
title2 'Discriminant analysis on the wheat data set';
proc discrim data=set1 method=normal crossvalidate outcross=cross_set;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors equal;
run;

title2 'Misclassifications from crossvalidation';
proc print data=cross_set;
  where type1 ne _into_;
  var type1 _into_ healthy sprout scab;
run;

The DISCRIM Procedure

Observations   276    DF Total             275
Variables        6    DF Within Classes    273
Classes          3    DF Between Classes     2

Class Level Information

           Variable                                    Prior
type1      Name       Frequency    Weight     Proportion    Probability
Healthy    Healthy       96        96.0000    0.347826      0.333333
Scab       Scab          84        84.0000    0.304348      0.333333
Sprout     Sprout        96        96.0000    0.347826      0.333333

Pooled Covariance Matrix Information

Covariance Matrix Rank: 6    Natural Log of the Determinant of the Covariance Matrix: 2.83517

Pairwise Generalized Squared Distances Between Groups

  D^2(i|j) = (Xbar_i - Xbar_j)' COV^-1 (Xbar_i - Xbar_j)

Generalized Squared Distance to type1
From type1    Healthy    Scab       Sprout
Healthy       0          4.90497    0.64426
Scab          4.90497    0          3.33136
Sprout        0.64426    3.33136    0

Linear Discriminant Function

  Constant_j = -0.5 Xbar_j' COV^-1 Xbar_j    Coefficient Vector_j = COV^-1 Xbar_j

Linear Discriminant Function for type1
Variable     Healthy       Scab         Sprout
Constant    -107.83640    -87.43762    -100.69535
skden         92.60313     79.65724      86.94798
skhard        -0.06784     -0.07288      -0.08135
sksize        10.47804     10.41204      10.98232
skwt           0.04965     -0.19559       0.01680
skmst          5.68324      5.79647       5.67197
hrw           18.26673     18.63187      18.46743

Classification Summary for Calibration Data: WORK.SET1
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

  D^2_j(X) = (X - Xbar_j)' COV^-1 (X - Xbar_j)

Posterior Probability of Membership in Each type1

  Pr(j|X) = exp(-0.5 D^2_j(X)) / Σ_k exp(-0.5 D^2_k(X))

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Total
Healthy        74          7        15         96
              77.08       7.29     15.63     100.00
Scab           10         65         9         84
              11.90      77.38     10.71     100.00
Sprout         23         20        53         96
              23.96      20.83     55.21     100.00
Total         107         92        77        276
              38.77      33.33     27.90     100.00
Priors        0.33333    0.33333   0.33333

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.2292     0.2262    0.4479    0.3011
Priors    0.3333     0.3333    0.3333

Classification Summary for Calibration Data: WORK.SET1
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

  D^2_j(X) = (X - Xbar_(X)j)' COV^-1_(X) (X - Xbar_(X)j)

Posterior Probability of Membership in Each type1

  Pr(j|X) = exp(-0.5 D^2_j(X)) / Σ_k exp(-0.5 D^2_k(X))

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Total
Healthy        66          9        21         96
              68.75       9.38     21.88     100.00
Scab           10         65         9         84
              11.90      77.38     10.71     100.00
Sprout         25         20        51         96
              26.04      20.83     53.13     100.00
Total         101         94        81        276
              36.59      34.06     29.35     100.00
Priors        0.33333    0.33333   0.33333

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.3125     0.2262    0.4688    0.3358
Priors    0.3333     0.3333    0.3333

Obs    type1      _INTO_     Healthy    Sprout     Scab
  8    Healthy    Sprout     0.45316    0.48345    0.06339
 11    Healthy    Scab       0.27834    0.24283    0.47882
 14    Sprout     Healthy    0.65378    0.33760    0.00862
 15    Sprout     Scab       0.10858    0.23273    0.65869
 16    Sprout     Healthy    0.53754    0.41912    0.04334
 17    Sprout     Scab       0.15010    0.12690    0.72299
 18    Sprout     Healthy    0.56431    0.35650    0.07919
 19    Sprout     Scab       0.18301    0.16055    0.65644
...
268    Scab       Healthy    0.52590    0.13806    0.33604
273    Scab       Healthy    0.46007    0.33229    0.20764
Notes:

The classification error rates are a little better using the PRIORS=PROPORTIONAL option. Note that the proportion in each wheat class is approximately the same.

Using the linear discriminant rules is a little more complicated when the number of populations is more than 2. For this example, 2 different linear discriminant functions are needed. See wheat.sas for the PROC IML code used to classify the wheat kernels. Also contained in wheat.sas is the PROC IML code needed to show how the Mahalanobis distance and the posterior probability are found. Examine this on your own.

No costs of misclassification are used here.

The covariance matrices for healthy, sprout, and scab were found to be unequal (p-value < 0.0001) using the POOL=TEST option in PROC DISCRIM. When the quadratic discriminant rule is used, the classification error rates are a little smaller. The actual code and output used to find these rates are excluded from the notes.

Examine the 3D plot of the principal components for justification of why some classification error rates are larger than others.

Summary of classification errors

Classification Error Rates (by actual type)
                                                Healthy    Scab      Sprout    Overall Error
Σ1=Σ2=Σ3, priors=prop.      Resubstitution     22.92%     23.81%    43.75%    30.43%
                            Crossvalidation    30.21%     26.19%    45.83%    34.42%
Different Σi, priors=equal  Resubstitution     26.04%     20.24%    33.33%    26.54%
                            Crossvalidation    31.25%     22.62%    41.67%    31.85%
Σ1=Σ2=Σ3, priors=equal      Resubstitution     22.92%     22.62%    44.79%    30.11%
                            Crossvalidation    31.25%     22.62%    46.88%    33.58%
Different Σi, priors=prop.  Resubstitution     25.00%     21.43%    31.25%    26.09%
                            Crossvalidation    31.25%     23.81%    40.63%    32.25%

7.5 Variable Selection Procedures

In order to find the most parsimonious model that best estimates the dependent variable in regression analysis, variable selection procedures are used to narrow down the number of independent variables. Similar variable selection procedures can be used for discriminant analysis. This helps to eliminate variables that do not help to discriminate between the different populations.

ANCOVA REVIEW (STAT 801)

One-way ANOVA model: Y_ij = μ + τ_i + ε_ij

where ε_ij ~ ind. N(0, σ²)
τ_i is the effect of treatment i
μ is the grand mean
Y_ij is the response of the jth object to treatment i

Example: Wheat kernels

Let Y_ij be the hardness of the jth kernel from the ith classification. Y_11 = hardness of kernel 1 from the healthy class; τ_1 = healthy effect, τ_2 = sprout effect, τ_3 = scab effect.

Note that if τ_1 = τ_2 = τ_3, there are no mean differences among the kernel types. In this case, would hardness be a good discriminator between the kernel types?

One-way ANCOVA model: Y_ij = μ + τ_i + β_i x_ij + ε_ij

β_i = slope coefficient
x_ij = covariate

Example: Wheat kernels

x_ij = variable that has an effect on hardness

Note that if τ_1 = τ_2 = τ_3, there are no mean differences among the kernel types when x_ij is accounted for. In this case, would hardness be a good discriminator between the kernel types?

Forward selection

1. Find the variable that is the best discriminator among all the variables. This variable produces the largest F statistic value in a one-way ANOVA model.

Example: Wheat kernels (sorry about the notation)

skden_ij  = μ + τ_i + ε_ij
skhard_ij = μ + τ_i + ε_ij
sksize_ij = μ + τ_i + ε_ij
skwt_ij   = μ + τ_i + ε_ij
skmst_ij  = μ + τ_i + ε_ij
hrw_ij    = μ + τ_i + ε_ij

For each model, perform an F test for H_0: τ_1 = τ_2 = τ_3 vs. H_a: not all equal. In other words, test for equality of the mean density for the 3 kernel types; test for equality of the mean hardness for the 3 kernel types; and so on. The largest F statistic value somewhat equates to the "biggest" difference and thus the variable that will discriminate between the 3 kernel types the best.

2. Find the next best discriminator among all the remaining variables. This variable produces the largest F statistic value in a one-way ANCOVA model with the first variable as a covariate.

Example: Wheat kernels (sorry about the notation)

Suppose hardness had the largest F statistic from the previous step.

skden_ij  = μ + τ_i + β_i skhard_ij + ε_ij
sksize_ij = μ + τ_i + β_i skhard_ij + ε_ij
skwt_ij   = μ + τ_i + β_i skhard_ij + ε_ij
skmst_ij  = μ + τ_i + β_i skhard_ij + ε_ij
hrw_ij    = μ + τ_i + β_i skhard_ij + ε_ij

For each model, perform an F test for H_0: τ_1 = τ_2 = τ_3 vs. H_a: not all equal. In other words, test for equality of the mean density for the 3 kernel types adjusting for the hardness effect; test for equality of the mean size for the 3 kernel types adjusting for the hardness effect; and so on. The largest F statistic value somewhat equates to the "biggest" difference and thus the variable that will discriminate between the 3 kernel types the best after adjusting for the hardness effect.

3. Find the next best discriminator among all the remaining variables. This variable produces the largest F statistic value in a one-way ANCOVA model with the first and second variables as covariates.

4. Continue the process.

The forward selection procedure stops when the F-tests no longer produce any significant outcomes. When there are no significant variables in step 1, the variables will not be helpful in discriminating between the populations. A small R sketch of the step 1 computation is given below.
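As a rough illustration of step 1 (a hedged sketch, not the course code; the wheat2 data frame and variable names are carried over from earlier assumptions), the per-variable ANOVA F statistics can be computed in R as:

# One-way ANOVA of each candidate variable on the kernel type;
# the variable with the largest F statistic would enter first.
vars <- c("skden", "skhard", "sksize", "skwt", "skmst", "hrw")
F.stats <- sapply(vars, function(v) {
  fit <- aov(formula(paste(v, "~ type1")), data = wheat2)
  summary(fit)[[1]][["F value"]][1]  # F statistic for the type1 effect
})
sort(F.stats, decreasing = TRUE)  # best single discriminator first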
Backward elimination

1. Find the variable that is the worst discriminator among all the variables. This variable produces the smallest F statistic value in a one-way ANCOVA model using the other p-1 variables as covariates. Remove this variable from the variable pool.

Example: Wheat kernels (sorry about the notation)

Example models:

skden_ij  = μ + τ_i + β_i1 skhard_ij + β_i2 sksize_ij + β_i3 skwt_ij + β_i4 skmst_ij + β_i5 hrw_ij + ε_ij
sksize_ij = μ + τ_i + β_i1 skden_ij + β_i2 skhard_ij + β_i3 skwt_ij + β_i4 skmst_ij + β_i5 hrw_ij + ε_ij

2. Find the next worst discriminator among all the remaining variables. This variable produces the smallest F statistic value in a one-way ANCOVA model using the other p-2 variables as covariates. Remove this variable from the variable pool.

3. Continue the process.

The backward elimination procedure stops when all of the F-tests produce a significant outcome.

Stepwise selection

After each forward selection step, backward elimination is used to determine if any variables can be removed from the variable pool. A sketch of a single backward elimination step is given below.
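Here is a minimal hedged sketch of one backward elimination step in R (illustrative only; the significance-to-stay level and the wheat2 names are assumptions carried from earlier, and a common-slope ANCOVA is used for simplicity):

# One step of backward elimination: for each variable still in the pool,
# fit an ANCOVA with the other pool variables as covariates and record
# the partial F test for the type1 (class) effect.
backward.step <- function(pool, data, sls = 0.01) {
  out <- t(sapply(pool, function(v) {
    covariates <- setdiff(pool, v)  # assumes at least 2 variables in pool
    full <- lm(formula(paste(v, "~ type1 +", paste(covariates, collapse = "+"))),
               data = data)
    reduced <- lm(formula(paste(v, "~", paste(covariates, collapse = "+"))),
                  data = data)
    a <- anova(reduced, full)       # partial F test for the class effect
    c(F = a$F[2], p.value = a$`Pr(>F)`[2])
  }))
  worst <- rownames(out)[which.min(out[, "F"])]
  # Remove the worst variable only if its test is non-significant
  if (out[worst, "p.value"] > sls) setdiff(pool, worst) else pool
}

Repeated application of backward.step() until the pool stops shrinking mimics the METHOD=BACKWARD run shown in the wheat example below.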
Johnson's recommendations

Use backward elimination when the number of variables is at most 15. Use stepwise selection when the number of variables is more than 15. Use α = 0.01 for backward elimination. When using stepwise selection, use 0.25 ≤ α ≤ 0.50 for entry and α = 0.15 for elimination of a variable. Use variable selection procedures only as a guide to help choose the variables.

Caveats

These procedures do not measure how well the variables actually discriminate! These procedures are examining differences between means, not differences between entire populations. Means can be different, but populations can "overlap" - see the example below.

Example: Discrimination between 4 populations using 2 variables

In the plot below, both x1 and x2 would probably be found to be "significant" by the variable selection procedures. Which variable will do the better job of discriminating between the 4 populations? Which variable do you think would be found to be the "most significant" by the variable selection procedures?

[Plot: f(x) = ε contours of the 4 different populations (ε a very small number greater than 0), drawn in the (X1, X2) plane with "+" marking each population mean. There is a difference in means for X2, but an overlap of the populations in that direction; there is a difference in means for X1 and a separation of the populations in that direction.]

See Figures 7.5 and 7.6 of Johnson (1998).

Example: Wheat kernels (wheat.sas)

Reduce the number of variables used for the discriminant analysis without creating a larger classification error. PROC STEPDISC can be used to perform the variable selection procedures. Below is part of the SAS code and output used to perform backward elimination. See Johnson (1998, p. 249) for a stepwise selection example.

title2 'Backward elimination using alpha=0.01';
proc stepdisc data=set1 method=backward sls=0.01;
  class type1;
  var skden skhard sksize skwt skmst hrw;
run;

The STEPDISC Procedure

The Method for Selecting Variables is BACKWARD

Observations    276    Variable(s) in the Analysis       6
Class Levels      3    Variable(s) will be Included      0
                       Significance Level to Stay     0.01

Class Level Information

           Variable
type1      Name       Frequency    Weight     Proportion
Healthy    Healthy       96        96.0000    0.347826
Scab       Scab          84        84.0000    0.304348
Sprout     Sprout        96        96.0000    0.347826

Backward Elimination: Step 0

All variables have been entered.

Multivariate Statistics
Statistic                               Value       F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                           0.491901     19.02       12       536      <.0001
Pillai's Trace                          0.543834     16.74       12       538      <.0001
Average Squared Canonical Correlation   0.271917

Backward Elimination: Step 1

Statistics for Removal, DF = 2, 268
             Partial
Variable     R-Square    F Value    Pr > F
skden         0.2115      35.95     <.0001
skhard        0.0158       2.15     0.1186
sksize        0.0055       0.74     0.4790
skwt          0.1123      16.96     <.0001
skmst         0.0037       0.49     0.6105
hrw           0.0015       0.20     0.8164

Variable hrw will be removed.

Variable(s) that have been Removed: hrw

Multivariate Statistics
Statistic                               Value       F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                           0.492646     22.85       10       538      <.0001
Pillai's Trace                          0.542885     20.12       10       540      <.0001
Average Squared Canonical Correlation   0.271442

Backward Elimination: Step 2

Statistics for Removal, DF = 2, 269
             Partial
Variable     R-Square    F Value    Pr > F
skden         0.2142      36.67     <.0001
skhard        0.0162       2.21     0.1114
sksize        0.0055       0.75     0.4741
skwt          0.1118      16.94     <.0001
skmst         0.0042       0.57     0.5640

Variable skmst will be removed.

Variable(s) that have been Removed: skmst hrw

Multivariate Statistics
Statistic                               Value       F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                           0.494749     28.46        8       540      <.0001
Pillai's Trace                          0.539764     25.04        8       542      <.0001
Average Squared Canonical Correlation   0.269882

Backward Elimination: Step 3

Statistics for Removal, DF = 2, 270
             Partial
Variable     R-Square    F Value    Pr > F
skden         0.2143      36.82     <.0001
skhard        0.0151       2.06     0.1290
sksize        0.0068       0.92     0.3983
skwt          0.1097      16.63     <.0001

Variable sksize will be removed.

Variable(s) that have been Removed: sksize skmst hrw

Multivariate Statistics
Statistic                               Value       F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                           0.498133     37.66        6       542      <.0001
Pillai's Trace                          0.533732     33.00        6       544      <.0001
Average Squared Canonical Correlation   0.266866

Backward Elimination: Step 4

Statistics for Removal, DF = 2, 271
             Partial
Variable     R-Square    F Value    Pr > F
skden         0.2150      37.10     <.0001
skhard        0.0112       1.54     0.2160
skwt          0.2556      46.52     <.0001

Variable skhard will be removed.

Variable(s) that have been Removed: skhard sksize skmst hrw

Multivariate Statistics
Statistic                               Value       F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                           0.503800     55.61        4       544      <.0001
Pillai's Trace                          0.523153     48.35        4       546      <.0001
Average Squared Canonical Correlation   0.261577

Backward Elimination: Step 5

Statistics for Removal, DF = 2, 272
             Partial
Variable     R-Square    F Value    Pr > F
skden         0.2289      40.37     <.0001
skwt          0.2926      56.25     <.0001

No variables can be removed. No further steps are possible.

Backward Elimination Summary
                            Partial                           Wilks'        Pr <
Step    In    Removed      R-Square    F Value    Pr > F      Lambda        Lambda
0        6    .               .           .         .        0.49190126    <.0001
1        5    hrw           0.0015       0.20     0.8164     0.49264634    <.0001
2        4    skmst         0.0042       0.57     0.5640     0.49474856    <.0001
3        3    sksize        0.0068       0.92     0.3983     0.49813344    <.0001
4        2    skhard        0.0112       1.54     0.2160     0.50379958    <.0001

                            Average Squared
                            Canonical          Pr >
Step    In    Removed       Correlation        ASCC
0        6                  0.27191712         <.0001
1        5    hrw           0.27144225         <.0001
2        4    skmst         0.26988206         <.0001
3        3    sksize        0.26686581         <.0001
4        2    skhard        0.26157659         <.0001

Notes:

All variables are used at first. The "worst" discriminator is hrw since it has the lowest F statistic value.

The partial R-square measures the proportional reduction in the error sum of squares from adding the corresponding variable to the model, given that the other variables of interest are already in the model. See a regression book such as Neter, Kutner, Nachtsheim, and Wasserman (Chapter 7, 1996). A low partial R² means that the variable is not helpful, and a large partial R² means that the variable is helpful.

The Wilks' lambda and Pillai's trace statistics are used in multivariate analysis of variance (MANOVA). In their first appearance in the output, they are testing

  H_0: μ_Healthy = μ_Sprout = μ_Scab  vs.  H_a: not all equal,

where μ_i = (μ_i,skden, μ_i,skhard, μ_i,sksize, μ_i,skwt, μ_i,skmst, μ_i,hrw)' for i = Healthy, Sprout, and Scab. These statistics will be discussed in Chapter 11.

The average squared canonical correlation will be discussed in Chapter 12. For now it can be interpreted as a measure of how well the remaining variables discriminate between the wheat classes: the closer to 1, the better the variables will discriminate.

Only skden and skwt remain at the end of the backward elimination procedure.

Next, PROC DISCRIM should be used again to determine how well the remaining variables discriminate between the three wheat classes. Hopefully, the classification error rate has not substantially increased with the removal of the four variables. A hedged R sketch of this follow-up fit is given below, followed by the error rates.
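As a hedged sketch (same assumed wheat2 objects as before, not the course's SAS runs), the reduced-variable analysis and its leave-one-out error rate could be checked in R with:

library(MASS)

# Refit with only the two retained variables, using leave-one-out CV;
# equal priors are assumed here to match one of the SAS runs below.
fit2.cv <- lda(type1 ~ skden + skwt, data = wheat2, CV = TRUE,
               prior = rep(1/3, 3))

# Crossvalidation confusion matrix and overall error rate
conf <- table(actual = wheat2$type1, classified = fit2.cv$class)
conf
1 - sum(diag(conf)) / sum(conf)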
Previous error rates (use all 6 variables):

Classification Error Rates (by actual type)
                                                Healthy    Scab      Sprout    Overall Error
Σ1=Σ2=Σ3, priors=prop.      Crossvalidation    30.21%     26.19%    45.83%    34.42%
Different Σi, priors=equal  Crossvalidation    31.25%     22.62%    41.67%    31.85%
Σ1=Σ2=Σ3, priors=equal      Crossvalidation    31.25%     22.62%    46.88%    33.58%
Different Σi, priors=prop.  Crossvalidation    31.25%     23.81%    40.63%    32.25%

New error rates (use only density and weight):

Classification Error Rates (by actual type)
                                                Healthy    Scab      Sprout    Overall Error
Σ1=Σ2=Σ3, priors=prop.      Crossvalidation    29.17%     23.81%    45.83%    33.33%
Different Σi, priors=equal  Crossvalidation    37.50%     25.00%    37.50%    33.33%
Σ1=Σ2=Σ3, priors=equal      Crossvalidation    29.17%     23.81%    45.83%    32.94%
Different Σi, priors=prop.  Crossvalidation    37.50%     26.19%    37.50%    34.06%

It does not look like much (if anything) is lost by using only 2 of the variables instead of all 6!

Notes:

The test for equal covariances again rejected equality.

Since there are only two variables used for the discrimination, we can view the discrimination process in two dimensions. Below are three plots summarizing the discriminant analysis. The plots are done in R, with the classified data first exported out of SAS:

**********************************************************;
* Export information out into ASCII text file to be used
  with R;
data smaller_set;
  set cross_set1;
  keep skden skwt type1 _into_;
run;

proc export data=smaller_set
            outfile = "c:\chris\unl\stat873\chapter 7\class_wheat.txt"
            DBMS=DLM Replace;
run;

Code is in the classify_plots.R file.

#I first exported an ASCII text file from SAS which
#  contained the density, weight, class, and _INTO_
wheat<-read.table(file = "C:\\chris\\UNL\\STAT873\\Chapter 7\\class_wheat.txt",
                  header=TRUE)

par(pty="s")

#Original obs.
plot(x = wheat$skwt, y = wheat$skden, xlab="Weight", ylab="Density", type="n",
     main = "Density vs. Weight \n Original observations",
     panel.first=grid(col="gray", lty="dotted"))
points(x = wheat$skwt[wheat$type1=="Healthy"],
       y = wheat$skden[wheat$type1=="Healthy"], pch = 1, col = 1, cex = 1)
points(x = wheat$skwt[wheat$type1=="Sprout"],
       y = wheat$skden[wheat$type1=="Sprout"], pch = 2, col = 2, cex = 1)
points(x = wheat$skwt[wheat$type1=="Scab"],
       y = wheat$skden[wheat$type1=="Scab"], pch = 5, col = 3, cex = 1)
legend(locator(1), legend=c("Healthy", "Sprout", "Scab"), bg = "white",
       pch = c(1,2,5), col=c(1,2,3))

#Classified obs.
plot(x = wheat$skwt, y = wheat$skden, xlab="Weight", ylab="Density", type="n",
     main = "Density vs. Weight \n Classified observations",
     panel.first=grid(col="gray", lty="dotted"))
points(x = wheat$skwt[wheat$X_INTO_=="Healthy"],
       y = wheat$skden[wheat$X_INTO_=="Healthy"], pch = 1, col = 1, cex = 1)
points(x = wheat$skwt[wheat$X_INTO_=="Sprout"],
       y = wheat$skden[wheat$X_INTO_=="Sprout"], pch = 2, col = 2, cex = 1)
points(x = wheat$skwt[wheat$X_INTO_=="Scab"],
       y = wheat$skden[wheat$X_INTO_=="Scab"], pch = 5, col = 3, cex = 1)
legend(locator(1), legend=c("Healthy", "Sprout", "Scab"), bg = "white",
       pch = c(1,2,5), col=c(1,2,3))

#Overlaid
plot(x = wheat$skwt, y = wheat$skden, xlab="Weight", ylab="Density", type="n",
     main = "Density vs. Weight \n Classified (large points) overlaid on the
             original (small points) observations",
     panel.first=grid(col="gray", lty="dotted"))
points(x = wheat$skwt[wheat$type1=="Healthy"],
       y = wheat$skden[wheat$type1=="Healthy"], pch = 1, col = 1, cex = 0.75)
points(x = wheat$skwt[wheat$type1=="Sprout"],
       y = wheat$skden[wheat$type1=="Sprout"], pch = 2, col = 2, cex = 0.75)
points(x = wheat$skwt[wheat$type1=="Scab"],
       y = wheat$skden[wheat$type1=="Scab"], pch = 5, col = 3, cex = 0.75)
points(x = wheat$skwt[wheat$X_INTO_=="Healthy"],
       y = wheat$skden[wheat$X_INTO_=="Healthy"], pch = 1, col = 1, cex = 1.5)
points(x = wheat$skwt[wheat$X_INTO_=="Sprout"],
       y = wheat$skden[wheat$X_INTO_=="Sprout"], pch = 2, col = 2, cex = 1.5)
points(x = wheat$skwt[wheat$X_INTO_=="Scab"],
       y = wheat$skden[wheat$X_INTO_=="Scab"], pch = 5, col = 3, cex = 1.5)
legend(locator(1), legend=c("Healthy", "Sprout", "Scab"), bg = "white",
       pch = c(1,2,5), col=c(1,2,3))

[Plot: Density vs. Weight - Original observations, with Healthy, Sprout, and Scab kernels marked by different symbols.]

[Plot: Density vs. Weight - Classified observations.]

[Plot: Density vs. Weight - Classified (large points) overlaid on the original (small points) observations.]

Question: Suppose there were more than 3 variables left after the variable selection procedure. How could you show the results of the discriminant analysis?

7.6 Canonical Discriminant Functions

New variables (called canonical variables) are created from an original set in order to reduce the dimensionality of the problem. These new variables are used in the discriminant analysis. Creating the new variables has similar goals as with PCA and FA, but the computations are different. Hopefully, the canonical variables can be interpreted. The objectives are similar to what is shown on the graph on the previous page: reduce the dimension of the problem so that one can graphically view the discriminant analysis.

This section will not be discussed in class. If you would like to know more about it, lecture notes for it are available on the schedule web page of the course website.

7.7 Nearest Neighbor Discriminant Analysis

The previous methods rely on a multivariate normal distribution assumption. The methods discussed next do not.

Consider observation r. Find the closest observation, call it observation s. Classify observation r as coming from the population corresponding to observation s.

Example: Use Euclidean distance (Mahalanobis distance with the covariance matrix equal to the identity matrix) to classify the RED DOT observation. Suppose a bivariate random sample is taken from 3 populations. The observations are plotted below, labeled by their population number. The goal is to classify the RED DOT observation.

[Plot: "Nearest Neighbors using Euclidean distance" - observations from the 3 populations plotted in the (x1, x2) plane and labeled by population number, with the closest, 2nd closest, and 3rd closest observations to the RED DOT marked.]

Since the nearest observation corresponds to population #1, the observation is classified into population #1.

In general, nearest neighbor discriminant analysis uses the Mahalanobis distance (using a pooled covariance matrix) to measure the distance between observations. In addition, the "k" nearest neighbors are found. The largest proportion of nearest neighbors corresponds to the classification for the observation of interest. If there is a tie, the "k+1" nearest neighbors are used. If there is still a tie, the "k+2" nearest neighbors are used, and so on. A minimal sketch of this rule is given below.

Example: Use Euclidean distance to classify the RED DOT observation. If k=3 is used, the RED DOT is classified into population #2.

Note that none of this relies on any distributional assumptions!!! Therefore, this is a "nonparametric" procedure.
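A minimal hedged sketch of the rule just described (an illustration of the idea only, not PROC DISCRIM's implementation; all names are assumptions, the pooled covariance matrix Sp is assumed available, and the tie handling assumes a tie eventually breaks):

# Classify one observation x by its k nearest neighbors, with distance
# measured by Mahalanobis distance using a pooled covariance matrix Sp.
# X is the training matrix and grp the vector of population labels.
knn.mahal <- function(x, X, grp, Sp, k = 3) {
  Sp.inv <- solve(Sp)
  d2 <- apply(X, 1, function(row) {
    diff <- row - x
    drop(t(diff) %*% Sp.inv %*% diff)
  })
  repeat {
    votes <- table(grp[order(d2)[1:k]])        # labels of the k closest points
    winners <- names(votes)[votes == max(votes)]
    if (length(winners) == 1) return(winners)  # unique plurality
    k <- k + 1                                 # tie: use k+1 neighbors, etc.
  }
}

Setting Sp to the identity matrix gives the Euclidean-distance version used in the picture above.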
Example: Wheat kernels (wheat.sas and wheat.dat)

Equal priors are used for now to help demonstrate the method.

title2 'k nearest neighbor';
proc discrim data=set1 method=npar k=5 list crosslist crossvalidate;
  class type1;
  var skden skhard sksize skwt skmst hrw;
  priors equal;
run;

title2 'k nearest neighbor';
proc discrim data=set1 method=npar k=5 crossvalidate outcross=cross_set1;
  class type1;
  var skden skwt;
  priors equal;
run;

The DISCRIM Procedure

Observations   276    DF Total             275
Variables        6    DF Within Classes    273
Classes          3    DF Between Classes     2

Class Level Information

           Variable                                    Prior
type1      Name       Frequency    Weight     Proportion    Probability
Healthy    Healthy       96        96.0000    0.347826      0.333333
Scab       Scab          84        84.0000    0.304348      0.333333
Sprout     Sprout        96        96.0000    0.347826      0.333333

Classification Results for Calibration Data: WORK.SET1
Resubstitution Results using 5 Nearest Neighbors

Squared Distance Function

  D^2(X,Y) = (X - Y)' COV^-1 (X - Y)

Posterior Probability of Membership in Each type1

  m_k(X) = proportion of obs in group k among the 5 nearest neighbors of X
  Pr(j|X) = m_j(X) PRIOR_j / Σ_k m_k(X) PRIOR_k

Posterior Probability of Membership in type1
         From       Classified
Obs      type1      into type1    Healthy    Scab      Sprout
  1      Healthy    Scab *        0.3784     0.4324    0.1892
  2      Healthy    Healthy       0.6000     0.0000    0.4000
  3      Healthy    Sprout *      0.2000     0.0000    0.8000
  4      Healthy    Healthy       1.0000     0.0000    0.0000
  5      Healthy    Scab *        0.1892     0.4324    0.3784
  6      Healthy    Healthy       0.8000     0.0000    0.2000
...
* Misclassified observation    T Tie for largest probability

Classification Summary for Calibration Data: WORK.SET1
Resubstitution Summary using 5 Nearest Neighbors

(same squared distance and posterior probability formulas as above)

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Other    Total
Healthy        68          3        22         3        96
              70.83       3.13     22.92      3.13    100.00
Scab            9         66         8         1        84
              10.71      78.57      9.52      1.19    100.00
Sprout         27          6        57         6        96
              28.13       6.25     59.38      6.25    100.00
Total         104         75        87        10       276
              37.68      27.17     31.52      3.62    100.00
Priors        0.33333    0.33333   0.33333

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.2917     0.2143    0.4063    0.3041
Priors    0.3333     0.3333    0.3333

Classification Results for Calibration Data: WORK.SET1
Cross-validation Results using 5 Nearest Neighbors
(same squared distance and posterior probability formulas as above)

Posterior Probability of Membership in type1
         From       Classified
Obs      type1      into type1    Healthy    Scab      Sprout
  1      Healthy    Scab *        0.3808     0.4307    0.1884
  2      Healthy    Healthy       0.6025     0.0000    0.3975
  3      Healthy    Sprout *      0.0000     0.0000    1.0000
...
276      Scab       Sprout *      0.0000     0.4354    0.5646
* Misclassified observation    T Tie for largest probability

Classification Summary for Calibration Data: WORK.SET1
Cross-validation Summary using 5 Nearest Neighbors

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Other    Total
Healthy        56          7        33         0        96
              58.33       7.29     34.38      0.00    100.00
Scab            9         57        14         4        84
              10.71      67.86     16.67      4.76    100.00
Sprout         36         11        49         0        96
              37.50      11.46     51.04      0.00    100.00
Total         101         75        96         4       276
              36.59      27.17     34.78      1.45    100.00
Priors        0.33333    0.33333   0.33333

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.4167     0.3214    0.4896    0.4092
Priors    0.3333     0.3333    0.3333

2nd PROC DISCRIM output (skden and skwt only):

The DISCRIM Procedure

Observations   276    DF Total             275
Variables        2    DF Within Classes    273
Classes          3    DF Between Classes     2

Class Level Information

           Variable                                    Prior
type1      Name       Frequency    Weight     Proportion    Probability
Healthy    Healthy       96        96.0000    0.347826      0.333333
Scab       Scab          84        84.0000    0.304348      0.333333
Sprout     Sprout        96        96.0000    0.347826      0.333333

Classification Summary for Calibration Data: WORK.SET1
Resubstitution Summary using 5 Nearest Neighbors

(same squared distance and posterior probability formulas as above)

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Other    Total
Healthy        69          6        16         5        96
              71.88       6.25     16.67      5.21    100.00
Scab            9         63        11         1        84
              10.71      75.00     13.10      1.19    100.00
Sprout         13          9        71         3        96
              13.54       9.38     73.96      3.13    100.00
Total          91         78        98         9       276
              32.97      28.26     35.51      3.26    100.00
Priors        0.33333    0.33333   0.33333

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.2813     0.2500    0.2604    0.2639
Priors    0.3333     0.3333    0.3333

Classification Summary for Calibration Data: WORK.SET1
Cross-validation Summary using 5 Nearest Neighbors

(same squared distance and posterior probability formulas as above)

Number of Observations and Percent Classified into type1
From type1    Healthy    Scab      Sprout    Other    Total
Healthy        62         10        24         0        96
              64.58      10.42     25.00      0.00    100.00
Scab            9         58        16         1        84
              10.71      69.05     19.05      1.19    100.00
Sprout         22         15        59         0        96
              22.92      15.63     61.46      0.00    100.00
Total          93         83        99         1       276
              33.70      30.07     35.87      0.36    100.00
Priors        0.33333    0.33333   0.33333

Error Count Estimates for type1
          Healthy    Scab      Sprout    Total
Rate      0.3542     0.3095    0.3854    0.3497
Priors    0.3333     0.3333    0.3333

Notes:

The METHOD=NPAR and K=5 options specify that nearest neighbor discriminant analysis should be used with k=5 nearest neighbors. Why was 5 used? Just for illustrative purposes. In practice, try a few different values and see which one does the best job with classification; a hedged sketch of such a search is given below.
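For example, here is a hedged R sketch of trying several k values with leave-one-out crossvalidation using class::knn.cv(). Note that knn.cv() uses Euclidean distance on the supplied columns rather than the Mahalanobis distance SAS uses, so the variables are scaled here as a rough substitute; the object names are assumptions:

library(class)  # for knn.cv()

# Leave-one-out CV error rate for several choices of k.
# scale() is used so no single variable dominates the Euclidean distance.
X <- scale(wheat2[, c("skden", "skhard", "sksize", "skwt", "skmst", "hrw")])
err <- sapply(c(1, 3, 5, 7), function(k) {
  pred <- knn.cv(train = X, cl = wheat2$type1, k = k)
  mean(pred != wheat2$type1)
})
names(err) <- paste("k =", c(1, 3, 5, 7))
err  # pick the k with the lowest crossvalidation error rate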
The LIST and CROSSLIST options allow for the printing of the posterior probabilities. These are calculated as

  Pr(j|X) = (prop. from group j) / Σ_{i=1 to 3} (prop. from group i)

For example, the posterior probabilities (resubstitution) for observation #1 using PRIORS EQUAL are:

  Pr(Healthy|X) = (prop. from Healthy) / Σ_{i=1 to 3} (prop. from group i)
                = (2/96) / (2/96 + 2/84 + 1/96) = 0.3784

  Pr(Scab|X)    = (2/84) / (2/96 + 2/84 + 1/96) = 0.4324

  Pr(Sprout|X)  = (1/96) / (2/96 + 2/84 + 1/96) = 0.1892

If the PRIORS=PROPORTIONAL option is specified, the posterior probabilities are calculated as

  Pr(j|X) = (prop. from group j) p_j / Σ_{i=1 to K} (prop. from group i) p_i

Notice that using this option results in posterior probabilities equal to 0, 0.2, 0.4, 0.6, 0.8, and 1 only for this data set. For example, the posterior probability (resubstitution) that observation #1 is classified as healthy is:

  (prop. from Healthy) p_Healthy / Σ_{i=1 to 3} (prop. from group i) p_i
    = (2/96)(96/276) / [(2/96)(96/276) + (2/84)(84/276) + (1/96)(96/276)]
    = 2/5

This corresponding output is not presented here; however, this helps to illustrate why proportional priors may be more desirable!
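As a quick numerical check of the worked example above (a throwaway sketch; the neighbor counts 2, 2, and 1 are taken from observation #1's five nearest neighbors as shown in the output):

# Five nearest neighbors of observation #1: 2 Healthy, 2 Scab, 1 Sprout
m <- c(Healthy = 2/96, Scab = 2/84, Sprout = 1/96)  # proportions within group

# Equal priors: posterior is proportional to m alone
round(m / sum(m), 4)                  # 0.3784 0.4324 0.1892

# Proportional priors: the group sizes cancel, leaving neighbor counts / 5
p <- c(Healthy = 96, Scab = 84, Sprout = 96) / 276
round(m * p / sum(m * p), 4)          # 0.4 0.4 0.2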
When does data come from a multivariate normal distribution? Usually only when it is simulated! What should you do? Try all types of discriminant analysis and use the one that gives the lowest classification error rates where the 2005 Christopher R. Bilder 7.92 error rate is measured by an unbiased or close to unbiased classification method. To visually see the classification error rates, trellis dot plots can be used. For example, below is a simple plot showing the crossvalidation error rates for a number of different nearest neighbor discriminant analysis procedures. The code is in err_plots.R and the error rates are in err_rates_NN.csv. #Typed in all error rates into a Excel file and then saved it as a comma delimited file nn.err.rates<-read.table(file = "C:\\chris\\UNL\\STAT873\\Chapter 7\\err_rates_NN.csv", sep = ",", header=TRUE) #Load package library(lattice) lset(col.whitebg()) #Set color theme for plots dotplot(formula = Method ~ crossval.err | var.used, data = nn.err.rates, xlab = "Crossvalidation error rate", ylab = "Method", groups = priors, layout = c(1, 2, 1), auto.key = TRUE) 2005 Christopher R. Bilder 7.93 Priors = equal Priors = proportional Variables = skden and skwt NN - K=7 NN - K=5 NN - K=3 NN - K=1 Method Variables = all NN - K=7 NN - K=5 NN - K=3 NN - K=1 0.35 0.40 0.45 Crossvalidation error rate Other types of nearest neighbor methods: Instead of using the k nearest neighbors, a region around the observation can be examined. This region can be specified to be a particular radius around the observation of interest. See the R= and KERNEL= options for PROC DISCRIM. 2005 Christopher R. Bilder