Face Recognition using Multivariate Statistical Techniques









Introduction:





Face recognition by digital computers has received considerable attention in recent years. The technique has many practical applications: it can be used in criminal identification, in security systems, and to improve the interaction between humans and computers. Face images are complex, and they also vary in lighting, background and other conditions. Including all these variations in a model is difficult, which makes face recognition a challenging task. The methods followed here are simple. They are based on finding a small set of faces that explain most of the variance in the data set and then applying different classification rules based on these few faces. The classification methods are also compared.





The faces analyzed were obtained from the ORL database of faces at

http://www.uk.research.att.com/facedatabase.html. This website provides the following

information about the database: “The database contains a set of face images taken

between April 1992 and April 1994 at the AT&T laboratories. The database was

originally used in the context of a face recognition project carried out in collaboration

with the Speech, Vision and Robotics Group of the Cambridge University Engineering

Department. There are ten different images of each of 40 distinct subjects. For some

subjects, the images were taken at different times, varying the lighting, facial expressions

(open / closed eyes, smiling / not smiling) and facial details (glasses / no glasses). All the

images were taken against a dark homogeneous background with the subjects in an

upright, frontal position (with tolerance for some side movement)”. This makes a total of

400 face images, with 10 images for each person. Each image has a resolution of 112x92 pixels and is stored in the .pgm image format. All the images are grayscale; each pixel intensity value ranges from 0 to 255, with 0 corresponding to black and 255 to white. Some images from the data set are shown below:

Initial steps:





The data are first unpacked from the .pgm image format as follows: for any image, the pixel intensity values are read continuously starting from pixel position (1,1), moving across the columns of the first row, then across the columns of the next row, and so on; finally all the values are placed in a single row vector. Unpacking all the images in this fashion gives a data matrix X of dimensions 400x10304 (i.e. 400 images, each having 112x92=10304 pixels). Hence, each image can be considered a vector in a 10304 dimensional space.
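The unpacking step can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the report; the function name `images_to_matrix` is hypothetical, and small random arrays stand in for the actual .pgm files, which are assumed to have been read into 2-D arrays already.

```python
import numpy as np

def images_to_matrix(images):
    """Stack equally sized 2-D grayscale images into a data matrix.

    Each image is flattened row by row (row-major order), so every
    image becomes one row of the returned matrix.
    """
    rows = [np.asarray(img, dtype=float).reshape(-1) for img in images]
    return np.vstack(rows)

# Synthetic stand-ins for face images: 4 images of 112x92 pixels
# with intensities in 0..255 (assumption: real data would come from .pgm files).
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(112, 92)) for _ in range(4)]

X = images_to_matrix(images)
print(X.shape)  # (number of images, 112 * 92)
```

With the full database the same call would give a 400x10304 matrix, one row per face image.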





For statistical analysis, the pixels (columns of the data matrix) can be considered as random variables and the different images (rows of the data matrix) as samples of these variables. Most of the variables are expected to be very highly correlated. For example, most of the pixels that form the background have the same intensity within an image, so they are highly correlated. Similarly, many of the pixels that make up the hair are correlated. So, there is a lot of redundancy in the data.





Basic Idea:





The objective is to correctly identify an individual given an image of his/her face. To this end, one of the classification procedures listed below is followed.

1) Linear discriminant analysis (LDA) assuming multivariate normality of features

given groups and common covariance.

2) Fisher discriminants procedure which does not assume multivariate normality of

features given groups but assumes common covariance.





A brief description of each of the methods mentioned above follows:





Method 1: Under these assumptions, the method obtains the posterior probabilities in terms of the pooled covariance matrix of the groups and the individual group means. The classification rule is then based on these posterior probabilities: the classifier chooses the group that minimizes the expected loss of misclassification. The method depends heavily on the assumptions mentioned and is sensitive to deviations from them. (Equal losses are assumed.)





Method 2: This method is based on Fisher’s discriminants. The discriminants give a low dimensional representation of the data, just as in PCA. The difference is that the low dimensional space is chosen so that the groups are separated from each other as much as possible (based on certain criteria). The classification rule is then based on the distance between a new observation and the individual group means in this discriminant space: the classifier chooses the group whose mean is nearest to the new observation. The maximum number of discriminants is s=min(g-1,p), where g is the number of groups and p is the number of variables.





In both methods the prior probabilities are all assumed to be equal. This is reasonable because every subject is taken to be equally likely to be presented for identification (recognition). It turns out that the classification procedure based on Fisher’s discriminants is the same as linear discriminant analysis assuming multivariate normality of features given groups, if the prior probabilities are all the same and if all the discriminants are used for classification (Johnson & Wichern pg.694 [2]). Hence, method 2 and method 1 give equivalent classification rules under such conditions.

Even though method 1 can theoretically be performed on the raw pixel data, which is 400x10304 in dimensions, it is not computationally feasible to estimate the covariance matrix, which would be of 10304x10304 dimensions. This would require huge amounts of memory. A way out is to use PCA to reduce the dimensionality before applying LDA. Also, as the principal components are linear combinations of 10304 variables, they would be closer to normality. This follows from the central limit theorem, which states that, under suitable conditions, a linear combination of many independent variables with arbitrary distributions is approximately normally distributed. Even though the variables are not independent here, the principal components are still close to normality, as will be shown later. Hence PCA both reduces the dimensionality and improves the approximation to normality in one step.





The two methods are discussed in detail below.





Method 1:





We want to classify the individuals based on their face images by Linear Discriminant Analysis (LDA). Each vector of 10304 pixel intensities can be thought of as a vector of features, and each individual as a class into which the features are classified. Even with the assumptions already made, LDA cannot be performed on the raw pixel intensity values because the covariance matrix is too large (10304x10304). Neither computing the covariance matrix nor inverting it is computationally feasible.





To overcome this problem, Principal Component Analysis (PCA) is first performed on the pixel intensity data. But to get the principal components of X, which is 400x10304, we would have to obtain the covariance matrix, which is of 10304x10304 dimensions, and then find its eigenvectors. So the problem still remains. Note, however, that the rank of X is at most 400 (at most 399 once X is mean centered), which means that at most 400 eigenvectors have non-zero eigenvalues.

A simple way of finding these eigenvectors is proposed by Turk and Pentland [1]. If v is an eigenvector of $XX^T$ then $X^T v$ is an eigenvector of $X^T X$. Mathematically, if v is an eigenvector of $XX^T$ with eigenvalue $\lambda$, then

$$XX^T v = \lambda v$$

Multiplying both sides by $X^T$ we get

$$(X^T X)\, X^T v = \lambda\, (X^T v)$$

Hence $X^T v$ is an eigenvector of $X^T X$.
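The identity above can be checked numerically. The following is a minimal sketch in Python with NumPy (the report itself used MATLAB); the matrix sizes are small stand-ins for 400 and 10304, the data are random, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_pixels = 6, 50                 # stand-ins for 400 and 10304
X = rng.normal(size=(n_images, n_pixels))
X = X - X.mean(axis=0)                     # mean-center the columns (pixels)

# Eigen-decomposition of the small n_images x n_images matrix X X^T.
eigvals, V = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
eigvals, V = eigvals[order], V[:, order]

# Drop the (numerically) zero eigenvalues; centering removes one rank.
keep = eigvals > 1e-10 * eigvals[0]
eigvals, V = eigvals[keep], V[:, keep]

# Map each small eigenvector v to X^T v and normalize to unit length.
U = X.T @ V
U /= np.linalg.norm(U, axis=0)

# Each column of U should satisfy (X^T X) u = lambda u.
residual = (X.T @ X) @ U - U * eigvals
print(np.abs(residual).max())
```

The large matrix `X.T @ X` is formed here only to verify the result; in practice the point of the method is that it never has to be decomposed.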

Note that X, as used here, is mean centered. The mean of X (before mean centering), i.e. the mean of all faces, is called the average face. The average face is shown below:









Figure 8

The average face

Also, note that the non-zero eigenvalues do not change, i.e. the non-zero eigenvalues of $XX^T$ are the same as those of $X^T X$. The eigenvectors calculated this way are then normalized. The order of the matrix from which eigenvectors have to be estimated is reduced from 10304x10304 to 400x400 (i.e. from the square of the number of pixels to the square of the number of images). The eigenvectors and eigenvalues are estimated using MATLAB. The larger an eigenvalue, the more important the corresponding eigenvector, in the sense that the variance of the data set in that direction is larger than the variance in the directions of eigenvectors with smaller eigenvalues.





These eigenvectors are called eigenfaces (Turk and Pentland [1]). The components of each eigenvector can be regarded as the weights of the corresponding pixels in forming the total image. The components of the calculated eigenvectors generally do not lie between 0 and 255, which is required to visualize an image. Hence, for visualization, the eigenvectors are transformed so that their components lie in the 0-255 range. These eigenfaces can be regarded as the faces that make up all the faces in the database, i.e. a linear combination of these faces can be used to represent any face in the data set. Also, at least one face can be represented by a linear combination of all the eigenfaces. The first few eigenfaces (after transformation) are shown below:









Figure 9

Eigenfaces
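The text does not say which transformation was used to bring the eigenvector components into the 0-255 range. One common choice, assumed here purely for illustration, is linear min-max scaling; the function name `to_gray_levels` is hypothetical.

```python
import numpy as np

def to_gray_levels(eigenface, shape=(112, 92)):
    """Linearly rescale an eigenvector to 0..255 and reshape it to an image.

    Min-max scaling is an assumption; the original transformation is not stated.
    """
    u = np.asarray(eigenface, dtype=float)
    span = u.max() - u.min()
    if span == 0:
        # A constant vector carries no contrast; show it as black.
        return np.zeros(shape, dtype=np.uint8)
    scaled = (u - u.min()) / span * 255.0
    return np.round(scaled).astype(np.uint8).reshape(shape)

# A random unit vector stands in for a real eigenface.
rng = np.random.default_rng(0)
u = rng.normal(size=112 * 92)
u /= np.linalg.norm(u)

img = to_gray_levels(u)
print(img.shape, img.min(), img.max())
```

The reshape undoes the row-by-row unpacking, so the result can be displayed as a 112x92 grayscale image.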





The eigenfaces are vectors in the 10304 dimensional space spanned by the variables (pixels). If we take these vectors as new axes to represent the data, these new axes are the principal components we are seeking. They are just a rotation of the original axes to more meaningful directions. To illustrate how the principal components are closer to normality, normal probability plots of some of the first few principal components are given. As a check on the multivariate normality of the principal components, we check the univariate normality of individual principal components (PC’s). The normal (unconditional) probability plots of some of the important PC’s are shown below, and for comparison the normal probability plots of some raw variables are shown as well.









Figure 1

Normal probability plot of the first pixel

Figure 2

Normal probability plot of the second pixel









Figure 3

Normal probability plot of the third pixel









Figure 4

Normal probability plot of PC 5.

Figure 5

Normal probability plot of PC 2









Figure 6

Normal Probability plot of PC 1









Figure 7

Normal probability plot of PC 9

The plots (Figures 1, 2, 3) show clearly that the unconditional distribution of the raw pixels is not normal. Strictly, we want to test whether the conditional distribution of the features given a class is multivariate normal, but instead we examine the unconditional distribution. This is because each class has only 10 data points (10 face images), and not much can be read from a conditional plot with so little data. Also, it is unreasonable to expect huge differences between the conditional densities. If the unconditional distributions look normal, the conditional distributions are unlikely to be far removed from normality. Hence we can examine the unconditional distribution instead of the conditional one.





The plots (Figures 4 to 7) show that the principal components are closer to normality than the raw pixels. Also, all the principal components are uncorrelated with each other (by definition). If the principal components are taken to be jointly normal, the new features are therefore independent, because uncorrelated jointly normal random variables are independent. Hence the features are treated as multivariate normal, i.e. f(x) has a multivariate normal density, where x is the vector of principal components. It has already been said that the distribution of the pixels is not expected to vary much across the classes; hence we can assume f(x|y) to be close to multivariate normal too.





The first few eigenfaces clearly separate the face from the hair and the background, and as such explain most of the variance in the data set. The distinction between the face, hair and background becomes blurred in later eigenfaces. The following plot shows the % variance explained versus the number of principal components. The first few principal components, which together explain up to 75% of the variance of the data set, are selected (the basis for this is explained later).

Figure 10

A scree plot for the first few principal components





The % variance explained by any principal component is the ratio of the corresponding eigenvalue to the sum of all the eigenvalues. The number of principal components to be considered is based on how good a classification one obtains using them. Considering just five components, as the scree plot alone would suggest, does not give good classification.
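The ratio described above, and the number of components needed to reach a given share of the variance, can be computed as in the following sketch. The eigenvalues are made up for illustration and are not those of the face data; the function name `components_for_variance` is hypothetical.

```python
import numpy as np

def components_for_variance(eigenvalues, target):
    """Return per-component variance ratios, their cumulative sum, and the
    smallest number of leading components whose cumulative ratio reaches target."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    ratio = vals / vals.sum()
    cumulative = np.cumsum(ratio)
    k = int(np.searchsorted(cumulative, target)) + 1
    return ratio, cumulative, min(k, len(vals))

# Made-up eigenvalues (illustration only).
eigenvalues = [40.0, 25.0, 15.0, 10.0, 6.0, 4.0]
ratio, cumulative, k = components_for_variance(eigenvalues, 0.75)
print(ratio)
print(cumulative)
print(k)
```

A plot of `ratio` against the component index is the scree plot; the cumulative curve is what a threshold such as 75% is read from.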









Projections (Reconstruction of an Image)





The projection of a vector v onto a vector u is given by:

$$\mathrm{Proj}_u v = \frac{\langle v, u \rangle}{\langle u, u \rangle}\, u$$

If v is any image (after subtracting the average face) and u is any eigenface, then the above formula gives the projection of the face v onto the eigenface u. Any image can be projected onto each of the first few eigenfaces, and the projected vectors are added to get a reconstructed image. After this, the average face has to be added back to get the projection of the actual face onto the lower dimensional space spanned by the eigenfaces.

As an example, a face and its projection into 5, 20, 50, 150, 400 dimensions (i.e. using 5,10,15,20,400 eigenfaces respectively) are shown below:









Figure 11

Projected Faces





Any image in the database can be completely reconstructed by using all the eigenfaces.

The last image above is a complete reconstruction of the original image.
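The reconstruction procedure can be sketched as follows. Random data stand in for the face images, and the eigenfaces are obtained here from an SVD of the centered data purely for brevity (an assumption of this sketch, not the route taken in the report). Adding the individual projections is valid because the eigenfaces are mutually orthogonal.

```python
import numpy as np

def reconstruct(image, mean_face, eigenfaces, k):
    """Project a face onto the first k eigenfaces and add the average face back.

    eigenfaces holds one eigenface per column; they must be mutually orthogonal.
    """
    v = image - mean_face
    U = eigenfaces[:, :k]
    coeffs = (U.T @ v) / np.sum(U * U, axis=0)   # <v, u> / <u, u> for each u
    return mean_face + U @ coeffs

# Synthetic stand-in data: 5 "images" of 30 pixels each.
rng = np.random.default_rng(1)
X_raw = rng.normal(size=(5, 30))
mean_face = X_raw.mean(axis=0)
X = X_raw - mean_face

# Orthonormal directions ordered by decreasing variance (rows of Vt).
_, _, Vt = np.linalg.svd(X, full_matrices=False)
eigenfaces = Vt.T

errors = [
    np.linalg.norm(X_raw[0] - reconstruct(X_raw[0], mean_face, eigenfaces, k))
    for k in (1, 2, 4)
]
print(errors)
```

The reconstruction error shrinks as more eigenfaces are used, and it vanishes (up to rounding) once all eigenfaces with non-zero eigenvalues are included, which is four here because the five centered images span only four dimensions.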





Performing LDA:





With the above basic framework and with the additional assumption of equal covariance matrices for all groups, we can use LDA for classification. The assumption of equal covariance matrices is quite strong. No statistical test has been performed to check this assumption, as most such tests are not robust enough (e.g. Box’s test). Instead, two-fold cross validation has been performed to check the error rate of the final classification, and the assumptions are judged on the basis of these results.





The standard procedure for LDA was followed. The covariance matrix has been estimated using a pooled estimator. The prior probabilities have been considered to be equal. With this pooled estimator $\Sigma$ and the means $\mu_y$ of the individual groups, the $\alpha_y$’s and $\beta_y$’s for each group have been calculated by

$$\alpha_y = -\frac{1}{2}\, \mu_y' \Sigma^{-1} \mu_y$$

$$\beta_y = \Sigma^{-1} \mu_y$$

The classification rule (Bayes’s minimum cost rule) using equal losses then becomes,

$$\delta(x) = \arg\min_j \sum_{y=1}^{J} I(j \neq y)\, P(y \mid x) = \arg\min_j \sum_{y \neq j} P(y \mid x) = \arg\max_j P(j \mid x) = \arg\max_j \left( \alpha_j + \beta_j' x \right)$$

where x is a feature vector and J is the total number of classes (=40). The classification obtained depends on the number of principal components considered. The number of principal components to be used for classification is chosen so that the APER (apparent error rate) is almost 0%. Using 23 principal components, the APER was found to be 0.02%. The APER is 0% when 60 or more principal components are considered. A closer estimate of the AER (actual error rate) can be calculated using two-fold cross validation. These estimates were calculated from the confusion matrices. The confusion matrices are not shown here as they are of 40x40 dimensions; instead, the final cross validated error rates are shown. The following table shows the number of principal components used and the AER:





No. of PC’s      23      60      100
Error rate %     11      10.25   11.5

Cross validated error rates
Table 1

It can be seen that the error rate does not decrease much as the number of principal components increases. In fact, the error rate when 100 PC’s are used is higher than when only 23 PC’s are used. This may be attributed to larger rounding errors when inverting covariance matrices of large dimension. (When using 100 PC’s the pooled covariance matrix becomes 100x100.) As the error rate does not decrease much with an increasing number of PC’s, 23 PC’s can be considered optimal. These 23 PC’s explain up to 75% of the variance of the data set. Also, as both the APER and the AER are quite small, we can regard our assumptions of multivariate normality of features given a group and of equal covariance matrices as not unreasonable.
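The LDA rule of method 1, together with the apparent and two-fold cross validated error rates, can be sketched as below. This is an illustrative sketch on synthetic, well separated groups, not the MATLAB code behind the reported figures. It assumes the principal component scores are already available as the rows of `Z`; the function names and the even/odd fold split are assumptions of the sketch.

```python
import numpy as np

def fit_lda(Z, labels):
    """Estimate alpha_y and beta_y from group means and the pooled covariance."""
    classes = np.unique(labels)
    means = np.array([Z[labels == c].mean(axis=0) for c in classes])
    p = Z.shape[1]
    S = np.zeros((p, p))
    for c, m in zip(classes, means):
        D = Z[labels == c] - m
        S += D.T @ D
    S /= (len(Z) - len(classes))                 # pooled covariance estimate
    beta = np.linalg.solve(S, means.T).T         # row y is S^{-1} mu_y
    alpha = -0.5 * np.sum(means * beta, axis=1)  # -1/2 mu_y' S^{-1} mu_y
    return classes, alpha, beta

def predict_lda(Z, classes, alpha, beta):
    """Choose the class maximizing alpha_j + beta_j' x (equal priors and losses)."""
    scores = alpha + Z @ beta.T
    return classes[np.argmax(scores, axis=1)]

def two_fold_error(Z, labels):
    """Train on one half, test on the other, swap the halves, pool the errors."""
    idx = np.arange(len(Z))
    wrong = 0
    for test in (idx % 2 == 0, idx % 2 == 1):
        model = fit_lda(Z[~test], labels[~test])
        wrong += np.sum(predict_lda(Z[test], *model) != labels[test])
    return wrong / len(Z)

# Synthetic data: 3 groups of 20 observations in 3 dimensions.
rng = np.random.default_rng(2)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0]], dtype=float)
labels = np.repeat(np.arange(3), 20)
Z = centers[labels] + rng.normal(size=(60, 3))

classes, alpha, beta = fit_lda(Z, labels)
aper = np.mean(predict_lda(Z, classes, alpha, beta) != labels)
cv_error = two_fold_error(Z, labels)
print(aper, cv_error)
```

Solving the linear system with `np.linalg.solve` avoids forming the inverse of the pooled covariance matrix explicitly, which is usually the numerically safer choice when the matrix is large.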

Method 2:





The Fisher discriminants are the eigenvectors of $\Sigma^{-1}B$, where $\Sigma$ is the common covariance matrix and B is the between groups matrix. As in method 1, a pooled estimator of the covariance matrix $\Sigma$ is used. Again, this method is performed after reducing the dimensionality using PCA. This method is based on a simple distance criterion. Mathematically, the classification rule chooses class j if

$$\sum_{l=1}^{r} \left( a_l' (x - \bar{x}_j) \right)^2$$

is minimum, where r is the total number of discriminants used, $a_l$ is the l-th discriminant, x is the new observation and $\bar{x}_j$ is the mean of group j; i.e. the class whose mean is closest to the new observation (in a squared distance sense) is chosen.
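The Fisher rule can be sketched in the same spirit. The sketch below works on synthetic groups and makes two assumptions of its own: the between groups matrix is weighted by group size, and the eigenvectors of the pooled-covariance-inverse times B are obtained through a Cholesky factor of the pooled covariance, which turns the problem into a symmetric one. Function names are hypothetical.

```python
import numpy as np

def fisher_discriminants(Z, labels, r):
    """Return classes, group means and the first r Fisher discriminants."""
    classes = np.unique(labels)
    grand_mean = Z.mean(axis=0)
    p = Z.shape[1]
    W = np.zeros((p, p))   # within-groups sums of squares
    B = np.zeros((p, p))   # between-groups matrix (size-weighted: an assumption)
    means = []
    for c in classes:
        G = Z[labels == c]
        m = G.mean(axis=0)
        means.append(m)
        W += (G - m).T @ (G - m)
        d = (m - grand_mean)[:, None]
        B += len(G) * (d @ d.T)
    S = W / (len(Z) - len(classes))           # pooled covariance estimate

    # With S = L L', the eigenvectors w of L^{-1} B L^{-T} give a = L^{-T} w,
    # which are eigenvectors of S^{-1} B scaled so that a' S a = 1.
    L = np.linalg.cholesky(S)
    Linv = np.linalg.inv(L)
    evals, Wv = np.linalg.eigh(Linv @ B @ Linv.T)
    order = np.argsort(evals)[::-1][:r]
    A = Linv.T @ Wv[:, order]
    return classes, np.array(means), A

def predict_fisher(Znew, classes, means, A):
    """Assign each observation to the group with the nearest mean in discriminant space."""
    Y = Znew @ A
    C = means @ A
    d2 = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

# Synthetic data: 3 groups of 20 observations in 3 dimensions.
rng = np.random.default_rng(3)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0]], dtype=float)
labels = np.repeat(np.arange(3), 20)
Z = centers[labels] + rng.normal(size=(60, 3))

r = min(len(centers) - 1, Z.shape[1])         # s = min(g-1, p)
classes, means, A = fisher_discriminants(Z, labels, r)
error = np.mean(predict_fisher(Z, classes, means, A) != labels)
print(A.shape, error)
```

With equal group sizes, as in the face data, weighting B by group size only rescales the eigenvalues and leaves the discriminant directions unchanged.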





It has already been mentioned that method 2 is the same as method 1 if all Fisher discriminants are used and if all the groups are assumed to have equal prior probabilities. To check this, all 23 discriminants are used and the observations are classified. The error rate obtained was 12.5%, as shown in the last entry of table 2. This is approximately the same as the error rate obtained from LDA. Further, to check the efficiency of this method, the number of Fisher discriminants as well as the number of PC’s considered was varied. The tables below show the cross validated error rates in each case:





No. of discriminants    10       15      23
Error rate %            13.25    13.5    12.5

Cross validated error rates using 23 PC’s
Table 2

No. of discriminants    15    30    60
Error rate %            11    12    11

Cross validated error rates using 60 PC’s
Table 3

No. of discriminants    20      50    100
Error rate %            12.5    10    10

Cross validated error rates using 100 PC’s
Table 4





The cross validated error rates show that, even though classification tends to improve as the number of discriminants and/or the number of PC’s increases, the improvement is negligible. Hence, using 10 discriminants and 23 PC’s is enough for good classification. It was also found that new observations from groups 17, 39, 14 and 27 were misclassified more often than not. Observations from class 14 were mostly classified into classes 28 and 22. Likewise, observations from class 17 were mostly classified into classes 29 and 23, and observations from class 39 were mostly classified into class 30.





To see the proximity between these groups, the data are plotted in the two dimensional Fisher discriminant space. The plot is shown below:

The plot shows all the groups and their centroids. It can be seen that the centroid of group 14 is closest to the centroids of groups 24 and 28. Similarly, groups 30 and 39 are near each other. Groups 27 and 23 are also close to group 17. Because these groups lie so close together, the rule performs poorly in these cases. Based on the error rates of the two methods, it is clear that method 1 and method 2 give almost the same classification.





Conclusions:

1) LDA and Fisher’s method for classification have been successfully applied to face recognition.

2) The cross validated error rates are small when these methods are used.

3) Both methods give approximately the same classification.

4) The assumptions of multivariate normality of features given groups and of equal covariance matrices can be regarded as reasonable, based on the evidence given by the cross validated error rates.

References





1. M. Turk, A. Pentland. “Eigenfaces for Recognition”. Journal of Cognitive Neuroscience. Vol 3, No. 1. 71-86, 1991.

2. R. A. Johnson, D. W. Wichern. “Applied Multivariate Statistical Analysis”. 1992.

Method 3:

Multiple logistic regression: This rule makes no distributional assumptions about the features. This method is compared with the other two methods, as it is expected to give better results than either of them. The parameters in the posterior probability model are estimated by maximizing the conditional likelihood, treating the feature vectors as fixed. Mathematically,









The classification rule then becomes









Using all 40 classes for classification proved to be computationally very expensive. Hence for this method only the first ten classes were considered. The parameters were estimated with the RPLC software. Multiple logistic regression is the same as reference point logistic classification with one reference point per class. Using RPLC with nk=10 gives one reference point per class. The data have been divided into training and test sets. The principal component scores of the training set as well as the test set have been obtained from the principal components of the training set. The test and training sets have been interchanged to do two-fold cross validation. The following tables give the results reported by the RPLC software for different numbers of principal components:









loop icv nk lambda trn1 trn2 trn3 tst1 tst2 tst3

1 0 10 0.0000 0.0003 0.0003 0.0000 16.6038 0.0067 0.9000





loop icv nk lambda trn1 trn2 trn3 tst1 tst2 tst3

1 0 10 0.0000 0.0003 0.0003 0.0000 11.7149 0.1427 0.6600

Using 23 PC’s

Table 1





loop icv nk lambda trn1 trn2 trn3 tst1 tst2 tst3

1 0 10 0.0002 0.0002 0.0000 0.0000 12.6038 0.0288 0.8500





loop icv nk lambda trn1 trn2 trn3 tst1 tst2 tst3

1 0 10 0.0000 0.0004 0.0004 0.0000 12.9544 0.1189 0.6200

Using 15 PC’s

Table 2





loop icv nk lambda trn1 trn2 trn3 tst1 tst2 tst3

1 0 10 0.0000 0.0013 0.0013 0.0000 38.3964 0.0025 0.9000





loop icv nk lambda trn1 trn2 trn3 tst1 tst2 tst3

1 0 10 0.0584 0.0002 0.0492 0.0000 30.7654 0.0867 0.7000

Using 5 PC’s

Table 3

From the tables it can be seen that the error rates on the training sets (column trn3) are zero even with 5 PC’s. But the error rates on the test sets are very high: even with 23 PC’s, the software reports an error rate of 90%.

Hence method 1 and method 2 perform well when compared to multiple logistic regression.









Method 3: This method is the most robust of the three. The posterior probabilities are modeled directly without any distributional assumptions concerning the marginal density of the features. The form of the posterior probabilities is assumed to be the same as that obtained in method 1, but the conditional distribution is not restricted to being multivariate normal; it need only belong to a larger exponential family of distributions. The classification rule used here is the same as in method 1: the classifier chooses the group that minimizes the expected loss of misclassification. As the method does not assume equal covariance matrices or multivariate normality, it is more robust to deviations from these assumptions. (Equal losses are assumed.)





3) Multiple logistic regression (MLR), which does not assume either of the above.


