

									              Face Recognition using Multivariate Statistical Techniques


Face recognition by digital computers has received considerable attention in recent years.
The technique has many practical applications: it can be used in criminal identification,
in security systems, and to improve the interaction between humans and computers. Face
images are complex, and they also vary in lighting, background and other conditions.
Including all these variations in a model is difficult, which makes face recognition a
challenging task. The methods followed here are simple. They are based on finding a small
set of faces that explain most of the variance in the data set and then applying different
classification rules based on these few faces. The classification methods are also compared.

The faces analyzed are obtained from the ORL database of faces at
http://www.uk.research.att.com/facedatabase.html . This website provides the following
information about the database: “The database contains a set of face images taken
between April 1992 and April 1994 at the AT&T laboratories. The database was
originally used in the context of a face recognition project carried out in collaboration
with the Speech, Vision and Robotics Group of the Cambridge University Engineering
Department. There are ten different images of each of 40 distinct subjects. For some
subjects, the images were taken at different times, varying the lighting, facial expressions
(open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the
images were taken against a dark homogeneous background with the subjects in an
upright, frontal position (with tolerance for some side movement)”. This makes a total of
400 face images, with 10 images corresponding to each person. The resolution of each
image is 112x92 pixels. All the images were in .pgm image format. All the images are
grayscale images; the data in each image consists of pixel intensity values ranging from 0
to 255, with 0 corresponding to black and 255 to white. Some images from the data set
are shown below:

Initial steps:

The data are first unpacked from the .pgm image format as follows. For each image, the
pixel intensity values are read starting from pixel position (1,1), moving across the
columns of a row until the row ends and then continuing with the next row. The values are
finally arranged into a single row vector. Unpacking all the images in this fashion gives a
data matrix X of dimensions 400x10304 (400 images, each having 112x92 = 10304 pixels).
Hence, each image can be considered a vector in a 10304-dimensional space.
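
The unpacking step can be sketched as follows. This is only an illustration in Python: the function names and the two tiny 2x3 "images" are hypothetical stand-ins for the 112x92 faces, and reading the .pgm files themselves is not shown.

```python
def image_to_row_vector(pixels):
    """Flatten a 2-D grid of intensities (a list of rows) in row-major order."""
    return [value for row in pixels for value in row]

def build_data_matrix(images):
    """Stack the flattened images, one image per row of the data matrix."""
    return [image_to_row_vector(img) for img in images]

# Two hypothetical 2x3 "images" stand in for the 112x92 faces.
tiny_images = [
    [[0, 10, 20], [30, 40, 50]],
    [[255, 0, 255], [0, 255, 0]],
]
X = build_data_matrix(tiny_images)
print(len(X), len(X[0]))  # number of images, number of pixels per image
```

With the real data the same procedure would give 400 rows of 10304 values each.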

For statistical analysis, the pixels (columns of the data matrix) can be considered as
random variables and the images (rows of the data matrix) as samples of these variables.
Most of the variables are expected to be very highly correlated. For example, most of the
pixels that form the background have the same intensity within an image, so they are
highly correlated. Similarly, many of the pixels that make up the hair are also correlated.
So, there is a lot of redundancy in the data.

Basic Idea:

The objective is to correctly identify an individual given an image of his/her face. To this
end, one of the classification procedures listed below is followed.
   1) Linear discriminant analysis (LDA), which assumes multivariate normality of the
       features given the groups and a common covariance matrix.
   2) Fisher's discriminant procedure, which does not assume multivariate normality of
       the features given the groups but does assume a common covariance matrix.

A brief description of each of the methods mentioned above follows:

Method 1: Under the assumptions above, this method obtains the posterior probabilities in
terms of the pooled covariance matrix of the groups and the individual group means.
The classification rule is then based on these posterior probabilities: the classifier
chooses the group that minimizes the expected loss of misclassification (equal losses are
assumed). The method depends heavily on its assumptions and is sensitive to deviations
from them.

Method 2: This method is based on Fisher's discriminants. The discriminants give a
low dimensional representation of the data, just as in PCA. The difference is that the
low dimensional space is chosen so that the groups are separated from each other as
much as possible (in the sense of maximizing the between-group variation relative to the
within-group variation). The classification rule is then based on the distance between a
new observation and the individual group means in this discriminant space. The classifier
chooses the group whose mean is nearest to the new observation. The maximum number of
discriminants is s = min(g-1, p), where g is the number of groups and p is the number of
variables.

In both methods the prior probabilities are assumed to be equal. This is reasonable
because every subject is taken to be equally likely to be presented for identification
(recognition). It turns out that the classification procedure based on Fisher's discriminants
is the same as linear discriminant analysis assuming multivariate normality of the features
given the groups, if the prior probabilities are all the same and if all the discriminants are
used for classification (Johnson & Wichern, p. 694 [2]). Hence, method 1 and method 2
give equivalent classification rules under these conditions.

Even though method 1 can in theory be applied to the raw pixel data, which has
dimensions 400x10304, it is not computationally feasible to estimate the covariance
matrix, which would have dimensions 10304x10304 and would require a huge amount of
memory. A way out is to use PCA to reduce the dimensionality before applying LDA.
Also, as each principal component is a linear combination of 10304 variables, the
components would be closer to normality. This follows from the central limit theorem,
which states that a suitably normalized sum of many independent random variables with
finite variance is approximately normally distributed, whatever their individual
distributions. Even though the variables are not independent here, the principal
components are still close to normality, as is shown later. Hence PCA both reduces the
dimensionality and improves the approximation to normality in one step.

The two methods are discussed in detail below.

Method 1:

We want to classify the individuals based on their face images by linear discriminant
analysis (LDA). Each vector of 10304 pixel intensities can be thought of as a feature
vector, and each individual as a class into which the feature vectors are classified. Even
with the assumptions already made, LDA cannot be performed on the raw pixel intensity
values because the covariance matrix is too large (10304x10304): computing this matrix
and its inverse is not computationally feasible.

To overcome this problem, principal component analysis (PCA) is first performed on the
pixel intensity data. But to get the principal components of X, which is 400x10304, we
have to obtain the covariance matrix, which is 10304x10304, and then find its
eigenvectors. So the problem still remains. Note, however, that the rank of X is at most
400, which means there are at most 400 eigenvectors with non-zero eigenvalues.

A simple way of finding these eigenvectors is proposed by Turk and Pentland [1]: if v
is an eigenvector of XX^T, then X^T v is an eigenvector of X^T X. Mathematically, if v is
an eigenvector of XX^T with eigenvalue \lambda, then
\[ X X^T v = \lambda v \]
Multiplying both sides by X^T we get
\[ (X^T X)(X^T v) = \lambda (X^T v) \]
Hence X^T v is an eigenvector of X^T X.
Note that X, as used here, is mean centered. The mean of X (before mean centering), i.e.
the mean of all faces, is called the average face. The average face is shown below.

                                          Figure 8
                                      The average face
Also, note that the eigenvalues do not change, i.e. the non-zero eigenvalues of X^T X are
the same as those of XX^T. The eigenvectors calculated this way are then normalized to
unit length. The order of the matrix whose eigenvectors have to be computed is reduced
from 10304x10304 to 400x400 (i.e. from the square of the number of pixels to the square
of the number of images). The eigenvectors and eigenvalues are computed using
MATLAB. The larger an eigenvalue, the more important the corresponding eigenvector, in
the sense that the variance of the data set in that direction is larger than the variance in
the directions of eigenvectors with smaller eigenvalues.
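
A minimal sketch of this computation is given here, written in Python with NumPy rather than the MATLAB used for the analysis (an assumption of this example). The function name `eigenfaces_via_gram` and the small random data matrix are hypothetical and only illustrate the idea of working with the small n x n matrix instead of the large p x p one.

```python
import numpy as np

def eigenfaces_via_gram(X, k):
    """Top-k eigenpairs of Xc^T Xc obtained from the small matrix Xc Xc^T."""
    Xc = X - X.mean(axis=0)              # mean-centre: subtract the average face
    gram = Xc @ Xc.T                     # n x n instead of p x p
    evals, V = np.linalg.eigh(gram)      # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]  # keep the k largest
    evals, V = evals[order], V[:, order]
    U = Xc.T @ V                         # map each v to X^T v
    U /= np.linalg.norm(U, axis=0)       # normalise each eigenvector to unit length
    return evals, U

# Hypothetical stand-in data: 6 "images" with 50 "pixels" each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 50))
evals, U = eigenfaces_via_gram(X_demo, 3)
print(U.shape)  # one column per eigenface
```

The eigenvalues are returned unscaled. Dividing them by n-1 would give the eigenvalues of the sample covariance matrix, which does not change their ordering or the eigenvectors.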

These eigenvectors are called eigenfaces (Turk and Pentland [1]). The components of each
eigenvector can be regarded as the weights of the corresponding pixels in forming the
total image. The components of the computed eigenvectors generally do not lie in the
range 0-255, which is required to visualize an image. Hence, for visualization, the
eigenvectors are transformed so that their components lie in the 0-255 range. These
eigenfaces can be regarded as the faces that make up all the faces in the database, i.e. a
linear combination of these faces can be used to represent any face in the data set. Also, at
least one face can be represented by a linear combination of all the eigenfaces. The first
few eigenfaces (after transformation) are shown below:

                                         Figure 9
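
The text does not say which transformation was used to bring the eigenvector components into the 0-255 range. One simple possibility, assumed here purely for illustration, is linear min-max scaling; the function name `to_gray_levels` is hypothetical.

```python
def to_gray_levels(vec):
    """Linearly rescale the components of vec to integers in the range 0..255."""
    lo, hi = min(vec), max(vec)
    if hi == lo:                      # constant vector: avoid division by zero
        return [0 for _ in vec]
    return [round(255 * (v - lo) / (hi - lo)) for v in vec]

print(to_gray_levels([0.0, 1.0, 5.0]))
```

The rescaled values are used only for display; the classification works with the original eigenvectors.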

The eigenfaces are vectors in the 10304-dimensional space spanned by the variables
(pixels). If we take these vectors as new axes to represent the data, the new axes are the
principal components we are seeking. They are just a rotation of the original axes to more
meaningful directions. To illustrate how the principal components are closer to normality,
normal probability plots of some of the first few principal components are given. As a
partial check on the multivariate normality of the principal components, we check the
univariate normality of individual principal components (PC's). The normal
(unconditional) probability plots of some of the important PC's are shown below, together
with the normal probability plots of some raw variables for comparison.

                                       Figure 1
                         Normal probability plot of the first pixel

                                       Figure 2
                         Normal probability plot of the second pixel

                                       Figure 3
                         Normal probability plot of the third pixel

                                       Figure 4
                         Normal probability plot of PC 5

                                       Figure 5
                         Normal probability plot of PC 2

                                       Figure 6
                         Normal probability plot of PC 1

                                       Figure 7
                         Normal probability plot of PC 9

The plots (Figures 1, 2, 3) show clearly that the unconditional distribution of the raw
pixels is not normal. Strictly, we want to test whether the conditional distribution of the
features given a class is multivariate normal, but we examine the unconditional
distribution instead. This is because each class has only 10 data points (10 face images),
and not much can be read from a conditional plot with so little data. Also, it is
unreasonable to expect huge differences between the conditional densities. So when the
unconditional distribution looks normal, the conditional distributions are unlikely to be
far from normality. Hence we can test the unconditional distribution instead of the
conditional distributions.

The plots (Figures 4 to 7) show that the principal components are closer to normality
than the raw pixels. Also, the principal components are uncorrelated with each other by
construction. If the principal components are taken to be jointly normal, they are
therefore also independent, because uncorrelated jointly normal random variables are
independent. Under this approximation the features are multivariate normal, i.e. f(x) is a
multivariate normal density, where x is the vector of principal components. As already
noted, the distribution of the pixels is not expected to vary much across the classes, hence
we can assume f(x|y) to be close to multivariate normal too.

The first few eigenfaces clearly separate the face from the hair and the background, and
as such explain most of the variance in the data set. The distinction between the face, hair
and background becomes blurred for later eigenfaces. The following plot shows the %
variance explained versus the number of principal components. The first few principal
components, which together explain up to 75% of the variance of the data set, are selected
(the basis for this is explained later).
                                              Figure 10
                          A scree plot for the first few principal components

The % variance explained by any principal component is the ratio of the corresponding
eigenvalue to the sum of all the eigenvalues. The number of principal components to
retain is chosen according to how good a classification one obtains using them.
Considering just five components based on the scree plot does not give good
classification.
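
The ratio described above, and the number of components needed to reach a target such as 75%, can be sketched as follows. The function names and the four eigenvalues are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def variance_explained(eigenvalues):
    """Percentage of the total variance explained by each eigenvalue."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]

def components_for(eigenvalues, target_percent):
    """Smallest number of leading components whose cumulative share reaches the target."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, pct in enumerate(variance_explained(ordered), start=1):
        cumulative += pct
        if cumulative >= target_percent:
            return k
    return len(ordered)

demo_eigenvalues = [5.0, 3.0, 1.0, 1.0]   # hypothetical values
print(variance_explained(demo_eigenvalues))
print(components_for(demo_eigenvalues, 75.0))
```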

Projections (Reconstruction of an Image)

The projection of a vector v onto a vector u is given by:
\[ \mathrm{proj}_u(v) = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u \]
If v is any image (after subtracting the average face) and u is any eigenface, then the
above formula gives the projection of the face v onto the eigenface u. Any image can be
projected onto each of the first few eigenfaces. Because the eigenfaces are mutually
orthogonal, the projected vectors can simply be added to get the reconstructed image.
After this, the average face has to be added back to get the projection of the actual face
onto the lower dimensional space spanned by the eigenfaces.
As an example, a face and its projection into 5, 20, 50, 150, 400 dimensions (i.e. using
5,10,15,20,400 eigenfaces respectively) is shown below:

                                         Figure 11
                                      Projected Faces

Any image in the database can be completely reconstructed by using all the eigenfaces.
The last image above is a complete reconstruction of the original image.
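
The reconstruction procedure can be sketched as follows. The helper names and the three-pixel example are hypothetical, and the sketch assumes, as holds for eigenfaces, that the basis vectors are mutually orthogonal.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(v, u):
    """Projection of v onto u: (<v, u> / <u, u>) u."""
    coeff = dot(v, u) / dot(u, u)
    return [coeff * x for x in u]

def reconstruct(image, average_face, eigenfaces):
    """Sum the projections of the centred image, then add the average face back."""
    centred = [a - m for a, m in zip(image, average_face)]
    total = [0.0] * len(image)
    for u in eigenfaces:              # assumes the eigenfaces are mutually orthogonal
        total = [t + p for t, p in zip(total, project(centred, u))]
    return [t + m for t, m in zip(total, average_face)]

# Hypothetical 3-pixel example with an orthogonal basis.
image = [3.0, 4.0, 5.0]
average_face = [1.0, 1.0, 1.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(reconstruct(image, average_face, basis[:2]))  # partial reconstruction
print(reconstruct(image, average_face, basis))      # complete reconstruction
```

Using only part of the basis loses the component along the omitted direction, while using the full basis recovers the image exactly.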

Performing LDA:

With the above basic framework and the additional assumption that the covariance
matrices of all groups are equal, we can use LDA for classification. The assumption of
equal covariance matrices is quite strong. No statistical test has been performed to check
it, as most such tests (e.g. Box's test) are not robust enough. Instead, two-fold cross
validation has been performed to estimate the error rate of the final classification, and the
assumptions are judged on the basis of these results.
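
A sketch of two-fold cross validation is given here. The text does not say how the images were divided into the two halves, so the alternating split below is an assumption, as are the function names and the toy one-dimensional nearest-mean classifier used to exercise the routine.

```python
def two_fold_error_rate(samples, labels, fit, predict):
    """Train on one half, test on the other, swap the halves, and pool the errors."""
    half_a = list(range(0, len(samples), 2))   # assumed split: alternating samples
    half_b = list(range(1, len(samples), 2))
    errors = 0
    for train, test in ((half_a, half_b), (half_b, half_a)):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        errors += sum(predict(model, samples[i]) != labels[i] for i in test)
    return 100.0 * errors / len(samples)

def fit_means(xs, ys):
    """Toy classifier: store the mean of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c) for c in set(ys)}

def predict_nearest(means, x):
    """Toy classifier: choose the class whose mean is nearest."""
    return min(means, key=lambda c: abs(x - means[c]))

demo_x = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
demo_y = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(two_fold_error_rate(demo_x, demo_y, fit_means, predict_nearest))
```

With ten consecutive images per subject, an alternating split would put five images of every subject in each half.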

The standard procedure for LDA was followed. The covariance matrix has been
estimated using a pooled estimator, and the prior probabilities have been taken to be
equal. With this pooled estimator \Sigma and the means \mu_y of the individual groups, the
\alpha_y's and \beta_y's for each group have been calculated by
\[ \alpha_y = -\frac{1}{2}\, \mu_y' \Sigma^{-1} \mu_y \]
\[ \beta_y = \Sigma^{-1} \mu_y \]
The classification rule (Bayes's minimum cost rule) using equal losses then becomes
\[ \hat{y}(x) = \arg\min_j \sum_{y=1}^{J} I(j \neq y)\, P(y \mid x) = \arg\min_j \sum_{y \neq j} P(y \mid x) = \arg\max_j P(j \mid x) = \arg\max_j \left( \alpha_j + \beta_j' x \right) \]
where x is a feature vector and J is the total number of classes (= 40). The classification
obtained depends on the number of principal components considered. The number of
principal components used for classification is chosen so that the APER (apparent
error rate) is almost 0%. Using 23 principal components, the APER was found to be
0.02%. The APER is 0% when 60 or more principal components are considered. A closer
estimate of the AER (actual error rate) can be obtained using two-fold cross validation.
The error rates were calculated from the confusion matrices, which are not shown here as
they are of dimensions 40x40. Instead, the final cross validated error rates are shown. The
following table shows the number of principal components used and the estimated AER:

          No of PC’s           23                   60                    100
          Error rate %         11                   10.25                 11.5
                                    Cross validated error rates
                                              Table 1
It can be seen that the error rate does not decrease much as the number of principal
components increases. In fact, the error rate when 100 PC's are used is higher than when
only 23 PC's are used. This may be attributable to larger rounding errors when inverting
covariance matrices of large dimension (with 100 PC's the pooled covariance matrix is
100x100). As the error rate does not decrease much with an increasing number of PC's,
23 PC's can be considered optimal. These 23 PC's explain up to 75% of the variance of
the data set. Also, as both the APER and the AER are quite small, we can take our
assumptions of multivariate normality of the features given a group and of equal
covariance matrices to be not unreasonable.
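
The LDA rule of method 1, with a pooled covariance estimate and equal priors, can be sketched as follows. This uses Python with NumPy rather than MATLAB, and the function names and the synthetic two-dimensional scores are hypothetical; it is not the code used for the results reported here.

```python
import numpy as np

def fit_lda(scores, labels):
    """Estimate alpha_y and beta_y from group means and the pooled covariance."""
    classes = np.unique(labels)
    n, p = scores.shape
    means = np.array([scores[labels == c].mean(axis=0) for c in classes])
    within = np.zeros((p, p))
    for c, m in zip(classes, means):
        d = scores[labels == c] - m
        within += d.T @ d
    pooled = within / (n - len(classes))            # pooled covariance estimate
    beta = np.linalg.solve(pooled, means.T).T       # row y holds Sigma^-1 mu_y
    alpha = -0.5 * np.sum(means * beta, axis=1)     # -1/2 mu_y' Sigma^-1 mu_y
    return classes, alpha, beta

def predict_lda(model, scores):
    """Choose the class maximising alpha_j + beta_j' x (equal priors and losses)."""
    classes, alpha, beta = model
    return classes[np.argmax(alpha + scores @ beta.T, axis=1)]

# Synthetic stand-in for principal component scores: 3 groups, 20 samples each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 20)
scores = centres[labels] + 0.5 * rng.normal(size=(60, 2))
model = fit_lda(scores, labels)
aper = 100.0 * np.mean(predict_lda(model, scores) != labels)
print(aper)  # apparent error rate on the training data, in percent
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the pooled covariance matrix explicitly, which is usually the numerically safer choice.
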
Method 2:

The Fisher discriminants are the eigenvectors of \Sigma^{-1}B, where \Sigma is the common
covariance matrix and B is the between-groups matrix. As in method 1, a pooled
estimator of the covariance matrix \Sigma is used. Again, this method is applied after
reducing the dimensionality using PCA. The method is based on a simple distance
criterion. Mathematically, the classification rule chooses class j if
\[ \sum_{l=1}^{r} \left[ a_l' (x - \bar{x}_j) \right]^2 \]
is minimum, where r is the total number of discriminants used, a_l is the l-th
discriminant, x is the new observation and \bar{x}_j is the mean of group j. That is, the
class whose mean is closest to the new observation (in a squared distance sense) is chosen.
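
A sketch of this rule is given here, in Python with NumPy. The function names, the size-weighted form of the between-groups matrix B, and the synthetic data are assumptions of this example. The discriminants are scaled so that a_l' S a_l = 1 for the pooled covariance S, which is the usual convention.

```python
import numpy as np

def fisher_discriminants(scores, labels, r):
    """First r eigenvectors of S^-1 B, scaled so that a' S a = 1."""
    classes = np.unique(labels)
    n, p = scores.shape
    grand = scores.mean(axis=0)
    means = np.array([scores[labels == c].mean(axis=0) for c in classes])
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for c, m in zip(classes, means):
        d = scores[labels == c] - m
        W += d.T @ d
        B += np.sum(labels == c) * np.outer(m - grand, m - grand)
    pooled = W / (n - len(classes))
    L = np.linalg.cholesky(pooled)                 # pooled = L L'
    Linv = np.linalg.inv(L)
    evals, E = np.linalg.eigh(Linv @ B @ Linv.T)   # symmetric form of the problem
    order = np.argsort(evals)[::-1][:r]
    A = Linv.T @ E[:, order]                       # columns are the discriminants a_l
    return classes, means, A

def classify_nearest_mean(model, x):
    """Choose the group whose mean is closest to x in the discriminant space."""
    classes, means, A = model
    d2 = np.sum(((x - means) @ A) ** 2, axis=1)
    return classes[np.argmin(d2)]

# Synthetic stand-in for principal component scores: 3 groups, 20 samples each.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
labels = np.repeat(np.arange(3), 20)
scores = centres[labels] + 0.5 * rng.normal(size=(60, 2))
fisher_model = fisher_discriminants(scores, labels, r=2)
print(classify_nearest_mean(fisher_model, np.array([9.0, 1.0])))
```

The Cholesky factor turns the non-symmetric eigenproblem for S^-1 B into a symmetric one, so `np.linalg.eigh` can be used and the eigenvalues are guaranteed to be real.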

It has already been mentioned that method 2 is the same as method 1 if all Fisher
discriminants are used and all the groups are assumed to have equal prior probabilities.
To check this, all 23 discriminants are used and the observations are classified. The error
rate obtained was 12.5%, as shown in the last entry of Table 2. This is approximately the
same as the error rate obtained from LDA. Further, to check the efficiency of this method,
the number of Fisher discriminants as well as the number of PC's considered was varied.
The tables below show the cross validated error rates in each case:

          No. of discriminants    10                 15                23
          Error rate %            13.25              13.5              12.5
                          Cross validated error rates using 23 PC’s
                                            Table 2

          No. of discriminants    15                 30                60
          Error rate %            11                 12                11
                          Cross validated error rates using 60 PC’s
                                            Table 3

          No. of discriminants    20                 50                100
          Error rate %            12.5               10                10
                          Cross validated error rates using 100 PC’s
                                            Table 4

The cross validated error rates show that, even though classification tends to improve as
the number of discriminants and/or the number of PC's increases, the improvement is
negligible. Hence, using 10 discriminants and 23 PC's is enough for good classification.
It was also found that new observations from groups 14, 17, 27 and 39 were misclassified
more often than not. Observations from class 14 were mostly classified into classes 28
and 22. Likewise, observations from class 17 were mostly classified into classes 29 and
23, and observations from class 39 were mostly classified into class 30.

To see the proximity between these groups, the data are plotted in the two dimensional
Fisher discriminant space. The plot is shown below:

The plot shows all the groups and their centroids. It can be seen that the centroid of group
14 is closest to the centroids of groups 24 and 28. Similarly, groups 30 and 39 are near
each other. Groups 27 and 23 are also close to group 17. The proximity between the
groups is clear, and hence the rule performs poorly in these cases. Based on the error rates
of the two methods, it is clear that method 1 and method 2 give almost the same
classification.

Conclusions:

    1) LDA and Fisher's method for classification have been successfully applied to face
        recognition.
    2) The cross validated error rates were small when these methods were used.
    3) Both methods give approximately the same classification.
    4) The assumptions of multivariate normality of the features given the groups and of
        equal covariance matrices can be considered reasonable, based on the evidence
        given by the cross validated error rates.

References:

1. M. Turk, A. Pentland. “Eigenfaces for Recognition”. Journal of Cognitive
   Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991.
2. R. A. Johnson, D. W. Wichern. Applied Multivariate Statistical Analysis, 1992.

Method 3:
Multiple logistic regression: This rule makes no distributional assumptions about the
features. It is compared with the other two methods, as it is expected to give better results
than they do. The parameters of the posterior probability model are estimated by
maximizing the conditional likelihood, treating the feature vectors as fixed. Mathematically,

The classification rule then becomes

Using all 40 classes for classification proved to be computationally very expensive.
Hence only the first ten classes were considered for this method. The RPLC software was
used to estimate the parameters. Multiple logistic regression is the same as reference
point logistic classification with one reference point per class; using RPLC with nk=10
gives one reference point per class. The data have been divided into training and test sets.
The principal component scores of the training set as well as the test set have been
obtained from the principal components of the training set. The test and training sets have
then been interchanged to perform two-fold cross validation. The following tables give
the results produced by the RPLC software for different numbers of principal components:

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0000  0.0003  0.0003  0.0000  16.6038  0.0067  0.9000

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0000  0.0003  0.0003  0.0000  11.7149  0.1427  0.6600
                                       Using 23 PC’s
                                          Table 1

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0002  0.0002  0.0000  0.0000  12.6038  0.0288  0.8500

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0000  0.0004  0.0004  0.0000  12.9544  0.1189  0.6200
                                       Using 15 PC’s
                                          Table 2

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0000  0.0013  0.0013  0.0000  38.3964  0.0025  0.9000

loop  icv  nk  lambda  trn1    trn2    trn3    tst1     tst2    tst3
1     0    10  0.0584  0.0002  0.0492  0.0000  30.7654  0.0867  0.7000
                                       Using 5 PC’s
                                          Table 3

From the tables it can be seen that the error rates on the training sets (column trn3) are
zero even with 5 PC's, but the error rates on the test sets are very high. Even with 23
PC's, the software reports an error rate of 90%.

Hence method 1 and method 2 perform well when compared to multiple logistic
regression.

Method 3: This method is the most robust of the three. The posterior probabilities are
modeled directly, without any distributional assumptions concerning the marginal density
of the features. The form of the posterior probabilities is assumed to be the same as that
obtained in method 1, but the conditional distribution is not restricted to being
multivariate normal; it need only belong to a larger exponential family of distributions.
The classification rule used here is the same as in method 1: the classifier chooses the
group that minimizes the expected loss of misclassification (equal losses are assumed). As
the method does not assume equal covariance matrices or multivariate normality, it is
more robust to deviations from these assumptions.
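
The text states only that the posterior probabilities have the same form as in method 1. The sketch below assumes the usual multinomial logistic (softmax) form, P(j|x) proportional to exp(alpha_j + beta_j' x), with hypothetical parameter values. It illustrates the classification rule only; it does not estimate the parameters and does not reproduce the RPLC software.

```python
import math

def posteriors(x, alphas, betas):
    """Posterior probabilities proportional to exp(alpha_j + beta_j' x)."""
    scores = [a + sum(b_i * x_i for b_i, x_i in zip(b, x))
              for a, b in zip(alphas, betas)]
    top = max(scores)                            # subtract the maximum for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def classify(x, alphas, betas):
    """With equal losses, choose the class with the largest posterior probability."""
    p = posteriors(x, alphas, betas)
    return p.index(max(p))

# Hypothetical parameters for two classes and two features.
demo_alphas = [0.0, 0.0]
demo_betas = [[1.0, 0.0], [0.0, 1.0]]
print(posteriors([0.0, 0.0], demo_alphas, demo_betas))
print(classify([2.0, 0.0], demo_alphas, demo_betas))
```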

   3) Multiple logistic regression (MLR), which assumes neither of the above.
