Face Recognition using Multivariate Statistical Techniques

Introduction:
Face recognition by digital computers has received a lot of attention in recent years. There are many practical applications of such a technique: face recognition can be used in criminal identification, in security systems, and to improve the interaction between humans and computers. Face images are complex, and they also come with a variety of lighting effects, backgrounds, etc. Including all these variations in a model is difficult, which makes face recognition a challenging task. The methods followed here are simple. They are based on finding a small set of faces that explains most of the variance in the data set and then applying different classification rules based on these few faces. The classification methods are also compared.

The faces analyzed are obtained from the ORL database of faces at http://www.uk.research.att.com/facedatabase.html . This website provides the following information about the database: “The database contains a set of face images taken between April 1992 and April 1994 at the AT&T laboratories. The database was originally used in the context of a face recognition project carried out in collaboration with the Speech, Vision and Robotics Group of the Cambridge University Engineering Department. There are ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, varying the lighting, facial expressions (open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position (with tolerance for some side movement)”. This makes a total of 400 face images, with 10 images corresponding to each person. The resolution of each image is 112x92 pixels. All the images are in .pgm image format.
All the images are grayscale; the data in each image consist of pixel intensity values ranging from 0 to 255, with 0 corresponding to black and 255 to white. Some images from the data set are shown below:

Initial steps:
The data are first unpacked from the .pgm image format as follows. For any image, the pixel intensity values are read in sequence starting from pixel position (1,1), moving along the first row one column at a time, then along the next row, and so on; finally all the values are arranged into a single row vector. All the images are unpacked in this fashion, which gives a data matrix X of dimensions 400x10304 (i.e. 400 images, each having 112x92=10304 pixels). Hence, each image can be considered to be a vector in the 10304 dimensional space. For statistical analysis, the pixels (columns of the data matrix) can be considered as random variables and the different images (rows of the data matrix) as samples of these variables. Most of the variables are expected to be very highly correlated. This is because most of the pixels which form the background have the same intensity within an image, so they are highly correlated. Similarly, many of the pixels that make up the hair are also correlated. So, there is a lot of redundancy in the data.

Basic Idea:
The objective is to correctly identify the individual given an image of his/her face. To this end, one of the classification procedures listed below is followed.
1) Linear discriminant analysis (LDA) assuming multivariate normality of features given groups and common covariance.
2) Fisher discriminants procedure, which does not assume multivariate normality of features given groups but assumes common covariance.
A brief description of each of the methods mentioned above follows:

Method 1: The method, based on the assumptions, obtains the posterior probabilities in terms of the pooled covariance matrix of the groups and the individual group means.
The classification rule is then based on these posterior probabilities. The group which minimizes the expected loss of misclassification is chosen by the classifier (equal losses are assumed). The method depends heavily on the assumptions mentioned, so it is sensitive to deviations from them.

Method 2: This method is based on Fisher’s discriminants. The discriminants give a low dimensional representation of the data, just as in PCA. The difference here is that the low dimensional space is chosen so that the groups are separated from each other as much as possible (based on certain criteria). The classification rule is then based on the distance between a new observation and the individual group means in this discriminant space. The classifier chooses the group whose mean is nearest to the new observation. The maximum number of discriminants obtained is given by s=min(g-1,p), where g is the number of groups and p is the number of variables in each group.

In both methods the prior probabilities are assumed to be equal; this is reasonable because every subject is taken to be equally likely to be presented for identification (recognition). It turns out that the classification procedure based on Fisher’s discriminants is the same as linear discriminant analysis assuming multivariate normality of features given groups, if the prior probabilities are all the same and if all the discriminants are used for classification (Johnson & Wichern pg.694 [2]). Hence, method 2 and method 1 give equivalent classification rules under such conditions.

Even though method 1 can theoretically be performed on the raw pixel data, which is 400x10304 in dimensions, it is not computationally feasible to estimate the covariance matrix, which would be of 10304x10304 dimensions. This would require huge amounts of memory. A way out is to use PCA to reduce the dimensionality before applying LDA.
Also, as the principal components are linear combinations of 10304 variables, they would be closer to normality. This is suggested by the central limit theorem, which states that a suitably normalized sum of many independent random variables is, under mild conditions on their distributions, approximately normally distributed. Even though the variables are not independent here, the principal components are still close to normality, as will be shown later. Hence, using PCA we can reduce dimensionality and improve the approximation to normality in one step. The two methods are discussed in detail below.

Method 1:
We want to classify the individuals based on their face images by Linear Discriminant Analysis (LDA). Each 10304 pixel intensity vector can be thought of as a vector of features and each individual can be thought of as a class into which the features are classified. Even with the assumptions already made, LDA cannot be performed on the raw pixel intensity values (the features), as the covariance matrix is of large dimensions (10304x10304). Computing the covariance matrix and computing its inverse are both computationally infeasible. To overcome this problem, Principal Component Analysis (PCA) is first performed on the pixel intensity data. But to get the principal components of X, which is 400x10304, we have to obtain the covariance matrix, which is of 10304x10304 dimensions, and then find its eigenvectors. So, the problem still remains. Nevertheless, we note that the rank of X is at most 400 (the number of images), which means that at most 400 eigenvectors have non-zero eigenvalues. A simple way of finding these eigenvectors is proposed by Turk and Pentland [1]: if v is an eigenvector of $XX^T$, then $X^T v$ is an eigenvector of $X^T X$. Mathematically, if v is an eigenvector of $XX^T$ with eigenvalue $\lambda$, then

$$XX^T v = \lambda v$$

Multiplying both sides by $X^T$ we get

$$(X^T X)(X^T v) = \lambda (X^T v)$$

Hence $X^T v$ is an eigenvector of $X^T X$. Note that X, as used here, is mean centered. The mean of X (before mean centering), i.e.
the mean of all faces, is called the average face. The average face is shown below.

Figure 8: The average face

Also, note that the eigenvalues do not change, i.e. the non-zero eigenvalues of $XX^T$ are the same as those of $X^T X$. The eigenvectors calculated this way are then normalized. The order of the matrix from which the eigenvectors have to be estimated is reduced from 10304x10304 to 400x400 (i.e. from the square of the number of pixels to the square of the number of images). The eigenvectors and eigenvalues are estimated using MATLAB. The larger an eigenvalue, the more important the corresponding eigenvector, in the sense that the variance of the data set in that direction is greater than the variance in the directions of eigenvectors with smaller eigenvalues. These eigenvectors are called eigenfaces (Turk and Pentland [1]). The components of each eigenvector can be regarded as the weight of the corresponding pixel in forming the total image. The components of the calculated eigenvectors generally do not lie between 0 and 255, which is required to visualize an image. Hence, for visualization, the eigenvectors are transformed so that their components lie in the 0-255 range. These eigenfaces can be regarded as the faces that make up all the faces in the database, i.e. a linear combination of these faces can be used to represent any face in the data set. Also, at least one face can be represented by a linear combination of all the eigenfaces. The first few eigenfaces (after transformation) are shown below:

Figure 9: Eigenfaces

The eigenfaces represent vectors in the 10304 dimensional space spanned by the variables (pixels). If we consider these new vectors to be our new axes to represent the data, these new axes are the principal components we are seeking. They are just a rotation of the original axes to more meaningful directions. To illustrate how the principal components are closer to normality, the normal probability plots of some of the first few principal components are given.
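The eigenface computation described above (eigen-decomposing the small matrix $XX^T$ and mapping each eigenvector v to $X^T v$) can be sketched in a few lines. This is only an illustrative sketch in Python with NumPy, not the MATLAB code used for the results; the random matrix standing in for the face data, the reduced sizes `n` and `p`, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the data matrix: n images (rows), p pixels (columns).
# The text uses n = 400 and p = 10304; smaller sizes keep the sketch fast.
n, p = 20, 500
X = rng.normal(size=(n, p))

average_face = X.mean(axis=0)
Xc = X - average_face                 # mean-centred data, as assumed in the text

# Small n x n problem: eigen-decompose X X^T instead of the p x p matrix X^T X.
gram = Xc @ Xc.T
eigvals, V = np.linalg.eigh(gram)     # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # re-sort so the largest eigenvalue comes first
eigvals, V = eigvals[order], V[:, order]

# Mean centring removes one degree of freedom, so at most n - 1
# eigenvalues are non-zero; drop the numerically zero ones.
keep = eigvals > 1e-10 * eigvals[0]
eigvals, V = eigvals[keep], V[:, keep]

# Map each small eigenvector v to X^T v and normalise to unit length.
eigenfaces = Xc.T @ V                 # shape (p, number of kept eigenvectors)
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)

# Check (X^T X) u = lambda u without ever forming the p x p matrix.
lhs = Xc.T @ (Xc @ eigenfaces)
rhs = eigenfaces * eigvals
print(np.allclose(lhs, rhs))
```

The columns of `eigenfaces` play the role of the eigenfaces, and `eigvals` holds the shared non-zero eigenvalues of the two matrices.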
To check the multivariate normality of the principal components, we check for univariate normality of the individual principal components (PC’s). The normal (unconditional) probability plots of some of the important PC’s are shown below, and for comparison the normal probability plots of some raw variables are shown as well.

Figure 1: Normal probability plot of the first pixel
Figure 2: Normal probability plot of the second pixel
Figure 3: Normal probability plot of the third pixel
Figure 4: Normal probability plot of PC 5
Figure 5: Normal probability plot of PC 2
Figure 6: Normal probability plot of PC 1
Figure 7: Normal probability plot of PC 9

The plots (Figures 1, 2, 3) show clearly that the unconditional distribution of the raw pixels is not normal. Strictly, we want to test whether the conditional distribution of the features given a class is multivariate normal, but instead we test the unconditional distribution. This is because each class has only 10 data points (10 face images), and not much could be read from a conditional plot with such a small amount of data. Also, it is unreasonable to expect huge differences between the conditional densities. If the unconditional distributions look normal, the conditional distributions are unlikely to be far removed from normality. Hence we can test the unconditional distribution instead of the conditional one.

The plots (Figures 5, 6, 7) clearly show that the principal components are closer to normality than the raw pixels. Also, all the principal components are uncorrelated with each other (by construction). Hence the new features (principal components) can be treated as approximately independent, because uncorrelated jointly normal random variables are independent. Hence the features are taken to be approximately multivariate normal, i.e. f(x) has approximately a multivariate normal density, where x is the vector of principal components.
It has already been said that the distribution of the various pixels does not vary much across the classes; hence we can assume f(x|y) to be close to multivariate normal too. The first few eigenfaces clearly separate the face from the hair and the background, and as such explain most of the variance in the data set. The distinction between the face, the hair and the background becomes blurred for later eigenfaces. The following plot shows the % variance explained versus the number of principal components. The first few principal components, which together explain up to 75% of the variance of the data set, are selected (the basis for this is explained later).

Figure 10: A scree plot for the first few principal components

The % variance explained by any principal component is calculated as the ratio of the corresponding eigenvalue to the sum of all the eigenvalues. The number of principal components to be considered is based on how good a classification one obtains using them. Considering just five components based on the scree plot does not give good classification.

Projections (Reconstruction of an Image)
The projection of a vector v onto a vector u is given by:

$$\mathrm{proj}_u(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u$$

If v is any image (after subtracting the average face) and u is any eigenface, then the above formula gives the projection of the face v onto the eigenface u. Any image can be projected onto each of the first few eigenfaces. All the projected vectors are added to get a final (reconstructed) image. After this, the average face has to be added again to get the actual projection of the face onto the lower dimensional space spanned by the eigenfaces. As an example, a face and its projection into 5, 20, 50, 150, 400 dimensions (i.e. using 5,10,15,20,400 eigenfaces respectively) is shown below:

Figure 11: Projected Faces

Any image in the database can be completely reconstructed by using all the eigenfaces. The last image above is a complete reconstruction of the original image.
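The reconstruction procedure and the % variance calculation can be illustrated with a small sketch. It is written in Python with NumPy under stated assumptions: random data stands in for the faces, the sizes are reduced, and the eigenfaces are obtained from a singular value decomposition of the centred data as a shortcut (its right singular vectors span the same directions as the eigenvectors discussed above). The function name `reconstruct` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for the face data: n images with p pixels each.
n, p = 20, 300
X = rng.normal(size=(n, p))
average_face = X.mean(axis=0)
Xc = X - average_face

# Shortcut for this sketch: orthonormal eigenfaces from the SVD of the
# centred data. Squared singular values are the eigenvalues of X^T X.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenfaces = Vt[: n - 1]              # rows; rank is n - 1 after centring
eigenvalues = s[: n - 1] ** 2

# % variance explained: each eigenvalue over the sum of all eigenvalues.
explained = eigenvalues / eigenvalues.sum()

def reconstruct(image, k):
    """Project an image onto the first k eigenfaces and add the average face back."""
    v = image - average_face
    total = np.zeros_like(v)
    for u in eigenfaces[:k]:
        total += (v @ u) / (u @ u) * u    # proj_u(v) = (<v,u> / <u,u>) u
    return total + average_face

# Reconstruction error of the first image for an increasing number of eigenfaces.
errors = [np.linalg.norm(X[0] - reconstruct(X[0], k)) for k in (5, 10, n - 1)]
print(errors)
```

With all available eigenfaces the error drops to numerical zero, which mirrors the statement that an image in the database can be completely reconstructed.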
Performing LDA:
With the above basic framework and the additional assumption that the covariance matrices of all groups are equal, we can use LDA for classification. The assumption of equal covariance matrices is quite strong. No statistical test has been performed to check this assumption, as most such tests are not robust enough (e.g. Box’s test). Instead, two-fold cross validation has been performed to estimate the error rate of the final classification, and the assumptions are judged on the basis of these results.

The standard procedure for LDA was followed. The covariance matrix $\Sigma$ has been estimated using a pooled estimator. The prior probabilities have been taken to be equal. With this pooled estimator and the means $\mu_y$ of the individual groups, the $\alpha_y$’s and $\beta_y$’s for each group have been calculated by

$$\alpha_y = -\tfrac{1}{2}\, \mu_y' \Sigma^{-1} \mu_y, \qquad \beta_y = \Sigma^{-1} \mu_y$$

The classification rule (Bayes’s minimum cost rule) using equal losses then becomes

$$\hat{y}(x) = \arg\min_j \sum_{y=1}^{J} I(j \neq y)\, P(y \mid x) = \arg\min_j \sum_{y \neq j} P(y \mid x) = \arg\max_j P(j \mid x) = \arg\max_j \left( \alpha_j + \beta_j' x \right)$$

where x is a feature vector and J is the total number of classes (=40).

The classification obtained depends on the number of principal components considered. The number of principal components to be used for classification is chosen so that the APER (apparent error rate) is almost 0%. Using 23 principal components, it was found that the apparent error rate (APER) is 0.02%. The APER is 0% when 60 or more principal components are considered. A closer estimate of the AER (actual error rate) can be calculated using two-fold cross validation. These error rates were calculated from the confusion matrices. The confusion matrices are not shown here, as they are of 40x40 dimensions. Instead, the final cross validated error rates are shown.
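The LDA rule with a pooled covariance estimate and equal priors can be sketched as follows. This is a minimal illustration in Python with NumPy rather than the code behind the reported error rates; the simulated principal component scores, the group layout, and the helper names `fit_lda` and `predict_lda` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical PC scores: g groups, m images per group, q principal components.
g, m, q = 4, 10, 3
centres = 10.0 * np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
scores = np.vstack([c + rng.normal(size=(m, q)) for c in centres])
labels = np.repeat(np.arange(g), m)

def fit_lda(scores, labels):
    """Estimate group means, the pooled covariance, and the linear score coefficients."""
    classes = np.unique(labels)
    means = np.array([scores[labels == c].mean(axis=0) for c in classes])
    # Pooled covariance: within-group residuals, divided by (N - number of groups).
    resid = scores - means[np.searchsorted(classes, labels)]
    pooled = resid.T @ resid / (len(scores) - len(classes))
    beta = np.linalg.solve(pooled, means.T).T      # row y holds Sigma^{-1} mu_y
    alpha = -0.5 * np.sum(means * beta, axis=1)    # -1/2 mu_y' Sigma^{-1} mu_y
    return classes, alpha, beta

def predict_lda(model, x):
    """Choose the class with the largest linear score alpha_j + beta_j' x (equal priors)."""
    classes, alpha, beta = model
    return classes[np.argmax(alpha + x @ beta.T, axis=1)]

model = fit_lda(scores, labels)
classes, alpha, beta = model

# Apparent error rate: error on the same data used for fitting.
aper = np.mean(predict_lda(model, scores) != labels)
print(aper)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the pooled covariance matrix explicitly, which is one way to limit the rounding problems that grow with the number of principal components.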
The following table shows the number of principal components used and the AER:

No. of PC’s     23     60      100
Error rate %    11     10.25   11.5

Table 1: Cross validated error rates

It can be seen that the error rate does not decrease much with an increase in the number of principal components. In fact, the error rate when 100 PC’s are used is higher than when only 23 PC’s are used. This can be attributed to the larger rounding errors incurred when inverting covariance matrices of large dimension (when using 100 PC’s the pooled covariance matrix becomes 100x100). As the error rate does not decrease much with an increasing number of PC’s, 23 PC’s can be considered to be optimal. These 23 PC’s explain up to 75% of the variance of the data set. Also, as the APER as well as the AER is quite small, we can conclude that our assumptions of multivariate normality of features given a group and of equal covariance matrices are not unreasonable.

Method 2:
The Fisher discriminants are the eigenvectors of $\Sigma^{-1}B$, where $\Sigma$ is the common covariance matrix and B is the between-groups matrix. As in method 1, a pooled estimator of the covariance matrix $\Sigma$ is used. Again, this method is performed after reducing the dimensionality using PCA. This method is based on a simple distance criterion. Mathematically, the classification rule chooses the class j for which

$$\sum_{l=1}^{r} \left[ a_l' (x - \bar{x}_j) \right]^2$$

is minimum, where r is the total number of discriminants used, $a_l$ is the l-th discriminant, x is the new observation and $\bar{x}_j$ is the mean of group j, i.e. the class whose mean is closest to the new observation (in a squared distance sense) is chosen.

It has already been mentioned that method 2 is the same as method 1 if all Fisher discriminants are used and if all the groups are assumed to have equal prior probabilities. To check this, all 23 discriminants are used (with 23 PC’s) and the observations are classified. The error rate obtained was 12.5%, as shown in the last entry of Table 2. This is approximately the same as the error rate obtained from LDA.
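The Fisher discriminant rule of method 2, and its agreement with LDA when all discriminants are used, can be illustrated with a short sketch. It is written in Python with NumPy on simulated scores; the data, the variable names, and the choice to solve the eigenproblem in a symmetric form (which scales each discriminant so that $a_l' \Sigma a_l = 1$) are assumptions of this example, not a description of the original computation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical PC scores: g groups, m images per group, q principal components.
g, m, q = 4, 10, 3
centres = 10.0 * np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
scores = np.vstack([c + rng.normal(size=(m, q)) for c in centres])
labels = np.repeat(np.arange(g), m)

means = np.array([scores[labels == j].mean(axis=0) for j in range(g)])
resid = scores - means[labels]
pooled = resid.T @ resid / (len(scores) - g)     # pooled covariance estimate
dev = means - scores.mean(axis=0)
between = dev.T @ dev                            # between-groups matrix B

# Symmetric form of Sigma^{-1} B a = lambda a: eigen-decompose
# Sigma^{-1/2} B Sigma^{-1/2}, then map back with a = Sigma^{-1/2} e.
w_vals, w_vecs = np.linalg.eigh(pooled)
inv_sqrt = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
lam, E = np.linalg.eigh(inv_sqrt @ between @ inv_sqrt)
order = np.argsort(lam)[::-1]
lam = lam[order]
s = min(g - 1, q)                                # maximum number of discriminants
A = (inv_sqrt @ E[:, order])[:, :s]              # columns are the discriminants a_l

def classify(x, r):
    """Nearest group mean using the first r discriminants."""
    z = (x[:, None, :] - means[None, :, :]) @ A[:, :r]   # shape (N, g, r)
    return np.argmin(np.sum(z ** 2, axis=2), axis=1)

pred_all = classify(scores, s)

# Equal-prior LDA is equivalent to the smallest Mahalanobis distance to a group mean.
d = scores[:, None, :] - means[None, :, :]
maha = np.einsum("ngi,ij,ngj->ng", d, np.linalg.inv(pooled), d)
pred_maha = np.argmin(maha, axis=1)
print(np.array_equal(pred_all, pred_maha))
```

Calling `classify(scores, r)` with a smaller `r` gives the reduced-discriminant classifiers whose error rates are compared in the tables that follow.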
Further, to check the efficiency of this method, the number of Fisher discriminants as well as the number of PC’s considered was varied. The tables below show the cross validated error rates in each case:

No. of discriminants   10      15     23
Error rate %           13.25   13.5   12.5

Table 2: Cross validated error rates using 23 PC’s

No. of discriminants   15   30   60
Error rate %           11   12   11

Table 3: Cross validated error rates using 60 PC’s

No. of discriminants   20     50   100
Error rate %           12.5   10   10

Table 4: Cross validated error rates using 100 PC’s

The cross validated error rates show that, even though classification tends to improve as the number of discriminants and/or the number of PC’s increases, the improvement is negligible. Hence, using 10 discriminants and 23 PC’s is enough for good classification.

It was also found that new observations from groups 17, 39, 14 and 27 were misclassified more often than not. Observations from class 14 were mostly classified into classes 28 and 22. Likewise, observations from class 17 were classified mostly into classes 29 and 23; similarly, observations from class 39 were mostly classified into class 30. To see the proximity between these groups, the data are plotted in the two dimensional Fisher discriminant space. The plot is shown below:

The plot shows all the groups and their centroids. It can be seen that the centroid of group 14 is closest to the centroids of groups 24 and 28. Similarly, groups 30 and 39 are near each other. Groups 27 and 23 are also close to group 17. The proximity between the groups is clear. Hence the rule performs poorly in these cases. Based on the error rates of the two methods, it is clear that method 1 and method 2 give almost the same classification.

Conclusions:
1) LDA and Fisher’s method for classification have been successfully applied to face recognition.
2) The cross validated error rates are small when these methods are used.
3) Both methods give approximately the same classification.
4) The assumptions of multivariate normality of features given groups and of equal covariance matrices can be regarded as reasonable, based on the evidence given by the cross validated error rates.

References
1. M. Turk and A. Pentland. “Eigenfaces for Recognition”. Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991.
2. Johnson and Wichern. Applied Multivariate Statistical Analysis. 1992.

Method 3: Multiple logistic regression:
This rule makes no distributional assumptions about the features. The method is compared with the other two methods, as it is expected to give better results than either of them. The parameters in the posterior probability model are estimated by maximizing the conditional likelihood, treating the feature vectors as fixed. Mathematically, the posterior probabilities take the same linear-logistic form as in method 1,

$$P(y \mid x) = \frac{\exp(\alpha_y + \beta_y' x)}{\sum_{j=1}^{J} \exp(\alpha_j + \beta_j' x)}$$

The classification rule then becomes

$$\hat{y}(x) = \arg\max_j P(j \mid x)$$

Using all the 40 classes for classification proved to be very expensive computationally. Hence for this method only the first ten classes were considered. For the purpose of estimating the parameters, RPLC software was used. Multiple logistic regression is the same as reference point logistic classification with one reference point per class. Using RPLC with nk=10 gives one reference point per class. The data have been divided into training and test sets. The principal component scores of the training set as well as the test set have been obtained from the principal components of the training set. The test and training sets have been interchanged to do two-fold cross validation.
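The two-fold cross validation scheme, with the principal components estimated on the training half only and then applied to both halves, can be sketched as below. This is an illustrative Python/NumPy sketch on simulated data. A nearest-group-mean classifier is used purely as a stand-in so that the example stays self-contained; it is not the RPLC software, and the helper names `fit_pca` and `nearest_mean_error` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: g subjects, m images each, p pixels, k principal components.
g, m, p, k = 5, 10, 50, 4
centres = rng.normal(scale=4.0, size=(g, p))
X = np.vstack([c + rng.normal(size=(m, p)) for c in centres])
y = np.repeat(np.arange(g), m)

# Two folds with five images of every subject in each fold.
fold = np.tile(np.arange(m) % 2, g)

def fit_pca(X_train, k):
    """Mean and first k principal directions, estimated from the training set only."""
    mean = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:k]

def nearest_mean_error(Z_train, y_train, Z_test, y_test):
    """Stand-in classifier: assign each test point to the closest training group mean."""
    classes = np.unique(y_train)
    means = np.array([Z_train[y_train == c].mean(axis=0) for c in classes])
    d2 = ((Z_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return np.mean(classes[np.argmin(d2, axis=1)] != y_test)

fold_errors = []
for test_fold in (0, 1):                       # interchange training and test sets
    train, test = fold != test_fold, fold == test_fold
    mean, components = fit_pca(X[train], k)
    Z_train = (X[train] - mean) @ components.T   # scores of the training set
    Z_test = (X[test] - mean) @ components.T     # test scores use the training PCs
    fold_errors.append(nearest_mean_error(Z_train, y[train], Z_test, y[test]))

cv_error = float(np.mean(fold_errors))
print(fold_errors, cv_error)
```

Centring and projecting the test images with the training-set mean and components keeps the test half from influencing the features, which is the point of estimating the principal components on the training set alone.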
The following tables give the results produced by the RPLC software for different numbers of principal components (one row per run):

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0000  0.0003  0.0003  0.0000  16.6038  0.0067  0.9000
1     0    10  0.0000  0.0003  0.0003  0.0000  11.7149  0.1427  0.6600

Table 1: Using 23 PC’s

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0002  0.0002  0.0000  0.0000  12.6038  0.0288  0.8500
1     0    10  0.0000  0.0004  0.0004  0.0000  12.9544  0.1189  0.6200

Table 2: Using 15 PC’s

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0000  0.0013  0.0013  0.0000  38.3964  0.0025  0.9000
1     0    10  0.0584  0.0002  0.0492  0.0000  30.7654  0.0867  0.7000

Table 3: Using 5 PC’s

From the tables it can be seen that the error rates on the training sets (column trn3) are zero even with 5 PC’s, but the error rates on the test sets are very high. Even with 23 PC’s, an error rate of 90% is reported by the software. Hence method 1 and method 2 perform well when compared to multiple logistic regression.

Method 3: This method is the most robust of the three. The posterior probabilities are modeled directly, without any distributional assumptions concerning the marginal density of the features. The form of the posterior probabilities is assumed to be the same as that obtained in method 1; the conditional distribution is not restricted to being multivariate normal but may belong to a larger exponential family of distributions. The classification rule used here is the same as in method 1: the classifier chooses the group which minimizes the expected loss of misclassification (equal losses are assumed). As the method does not assume equal covariance matrices or multivariate normality, it is more robust to deviations from these assumptions.

3) Multiple logistic regression (MLR), which assumes neither of the above.