
                       Computer Project for Digital Image Processing
                                EE368 Spring 2001/2002

                Face Detection Project Report
                       Ana Bertran, Huanzhou Yu, Paolo Sacchetto
                          {nuska, hzhyu, paolos}@stanford.edu

Human face detection by computer systems has become a major field of interest. Face
detection algorithms are used in a wide range of applications, such as security control, video
retrieval, biometric signal processing, human-computer interfaces, face recognition and
image database management. However, it is difficult to develop a completely robust face
detector due to varying lighting conditions, face sizes, face orientations, backgrounds and skin
colors. In this report, we propose a face detection method for color images. Our method
detects skin regions over the entire image, and then generates face candidates based
on a connected component analysis. Finally, the face candidates are divided into human faces
and non-faces by an enhanced version of the template-matching method.
Experimental results demonstrate successful face detection over the EE368 training images.

1. Introduction
    There have been many attempts to solve the human face detection problem. The early
approaches were aimed at gray-level images only, and image pyramid schemes are necessary to
cope with unknown face sizes. View-based detectors are popular in this category, including
Rowley's neural network classifier [1], Sung and Poggio's correlation template matching
scheme based on image invariants [2] and Eigenface decomposition [3]. Model-based
detection is another category of face detectors [4].
    For color images, the literature has shown that it is possible to separate human skin
regions from complex backgrounds based on either the YCbCr or the HSV color space [5, 6, 7, 8].
The face candidates can be generated from the identified skin regions. Numerous approaches
can then be applied to separate faces from non-faces among the candidates, such as wavelet
packet analysis [6], template matching for faces, eyes and mouths [8, 9, 10], and feature
extraction using watersheds and projections [5].
    In this project, a new face detector for color images is developed. The objective of this
project is to develop a very efficient algorithm, in terms of low computational complexity,
with the maximum number of face detections and the minimum number of false alarms. To
achieve these objectives, the image is first transformed to the HSV color space, where the skin
pixels are determined. The skin regions in HSV space are described by the intersection of
several 3-D linear constraints, which are found using training data. Also, the median
luminance (Y) value of the image is determined. For high-luminance images, it is more likely
that non-skin pixels are classified as skin, so an additional but simple classification in
YCbCr space is performed to remove hair pixels. Hence, a binary mask of the original image
is obtained. This binary mask is then filtered with image morphology operations to break
connections between faces and remove scattered noise. A connected component analysis
follows to determine the face candidates. The final step is to
determine real faces from the face candidates using a multi-layer classification scheme. The
application of this project justifies the assumption that all the faces have approximately the
same size, so we use correlation template matching for the face candidates that are close to
the median size. For large boxes, convolution template matching is used instead, because it is
more likely that only part of the face candidate box contains the face. Another, finer level of
template matching is applied to remove hand-like non-faces, and five more face templates are
tested to avoid missing a human face. Moreover, the standard deviation of the pixel gray
levels for each face candidate is used to remove non-faces caused by uniform skin-color-like
regions, such as floors, buildings and clothes.
     In the following sections, we present the detailed algorithm of our face detector. We
show that our detector gives 100% accuracy on six out of the seven project training images;
the only face missed, in one of the images, is due to very dark glasses. No false alarms are
raised in any of the seven images. Finally, conclusions and future work are discussed.

2. Color Segmentation
    The first step is color segmentation of the image. Several color spaces are available, but
the Hue-Saturation-Value (HSV) color map is the most adequate for differentiating the skin
regions from the rest of the photo contents. A set of equations that maximizes the number of
skin pixels while minimizing the number of background pixels can be found using plots of
skin regions vs. non-skin regions in H vs. S, S vs. V and H vs. V. These bounding equations
are used to generate the first binary image. However, some face candidate boxes contained
two people because their black hair was connected and included as skin region. In order to
balance removing the hair against losing some of the face skin pixels, the luminance-
chrominance (YCbCr) color space is also used to differentiate the black hair pixels from the
skin pixels in high-luminance images.
    Figure 2-1 through Figure 2-3 show the skin vs. non-skin regions in the HSV, YCbCr and
RGB color spaces for training image no. 1.

      Figure 2-1: Skin data (blue) vs. background data (red) in HSV color space.


Figure 2-2: Skin Data (blue) vs. background data (red) in YCbCr color space. A lot of
   the background pixels (red) occupy the same space as the skin pixels (blue).

 Figure 2-3: Skin Data (blue) vs. background data (red) in RGB color space. A lot of the
              background pixels occupy the same space as the skin pixels.
    As can be seen from the previous plots, the HSV space has fewer non-skin pixels
overlapping with skin pixels versus the YCbCr and RGB color spaces. Moreover, the RGB
color space doesn't separate the luminance information from the color information. It was
observed that the color of some image background contents, such as floors, buildings and
clothes, is similar to skin color. In the HSV space it is very convenient to get rid of these skin-
color-like pixels, because they mostly fall between H values of 0.1 and 0.2, while most skin
pixels fall below 0.1.


          Figure 2-4 compares skin pixels with close-to-skin (wall) pixels in HSV and YCbCr.
In HSV space, setting the threshold at H<0.1 rejects the wall pixels; in YCbCr space, the wall
pixels fall right on the skin area.

                             Figure 2-4: Wall pixels vs. Skin pixels in HSV and YCbCr.

     The H parameter contains the color information, and as can be seen from the plot, the
majority of the skin pixels in the training images fall in a range below 0.1 and above 0.8 for H.
The wall presented the greatest problem, but the wall's pixels fall between 0.1 and 0.2.
     The use of linear equations to delimit the skin vs. non-skin regions is another advantage of
the HSV color space. There is a linear trend in S vs. V, where the majority of the skin pixels
fall within two bounding equations. These linear equations are simple, so the complexity of
our algorithm is reduced, which leads to a short processing time.
                    Figure 2-5: S vs. V for training images with different light conditions
                                 (plots shown for training images 3 and 4).

     As can be seen in Figure 2-5, there is a vertical V offset from one training image to another.
To eliminate this offset, V is normalized by subtracting the whole image's mean V from the
data points. The S and H parameters don't have to be normalized, since there is no significant
offset from training image to training image.
      Figure 2-6 shows the skin data points from all the training images. The S vs. V trends are
separated into two populations: one for data points with H<0.1 and one for data points with
H>0.8. The population separation provides a more precise segmentation, so that the number
of face candidates is reduced. The overall computation time is reduced because fewer template-
matching correlation and convolution operations are needed, at the cost of extra computation
in the first step, which is minimal since it involves only logical operations.

    Figure 2-6: Graphs of the skin pixel samples from all images and the bounding
                                    equations used.

    The bounding equations are a trade-off between removing the non-skin pixels and keeping
the skin ones. An example of such optimization is that instead of H>0.9 we have chosen H>0.8,
because some faces would otherwise go undetected. On the other hand, the lower H limit is set
at 0.1 instead of 0.2 in order to get rid of most of the wall pixels, at the cost of losing some
unimportant face pixels on faces that are already well segmented.

    The final sets of bounding equations are:

    Population 1)       H < 0.1                  S <= 0.8
                        V < -1.33*S + 0.986      V > -0.603*S - 0.039

    Population 2)       H > 0.8                  S < 0.7
                        V < -1.51*S + 0.853      V > -0.671*S - 0.062

   The two population equation sets are combined with an OR operation. Figure 2-7
shows the first iteration of binary images without hair removal.

             Figure 2-7: First iteration of binary images without hair removal.
                     (Some faces did not have enough separating pixels.)
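The two population tests and their OR combination can be sketched as follows. This is an illustrative NumPy version (the project itself was implemented in Matlab); `h`, `s` and `v` are assumed to be float arrays in [0, 1], with `v` already normalized by subtracting the image's mean V as described above.

```python
import numpy as np

def skin_mask_hsv(h, s, v):
    """Binary skin mask from the two HSV populations of Section 2.

    h, s, v: float arrays; v is assumed mean-normalized."""
    # Population 1: low-H skin pixels.
    pop1 = (h < 0.1) & (s <= 0.8) \
        & (v < -1.33 * s + 0.986) & (v > -0.603 * s - 0.039)
    # Population 2: high-H (reddish) skin pixels.
    pop2 = (h > 0.8) & (s < 0.7) \
        & (v < -1.51 * s + 0.853) & (v > -0.671 * s - 0.062)
    # The two population equation sets are combined with an OR.
    return pop1 | pop2
```

Note that a wall-like pixel with H between 0.1 and 0.2 fails both population tests, which is exactly the rejection behavior discussed above.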

    The first set gave an unsatisfactory result, because some face candidate boxes contained
two people. These people could have been separated, since there is hair between them. Plots
of hair vs. skin samples indicated that it is difficult to differentiate the two in the HSV space.
Instead the YCbCr space, as can be seen in Figure 2-8, shows a clear line separating the hair
pixels from the skin pixels. A Cb value that gives the best tradeoff between removing enough
hair pixels and leaving sufficient skin pixels was chosen experimentally.
    It was determined that no single tradeoff between removing enough hair pixels and
keeping enough skin pixels works for all luminance conditions. Thus the training images are
divided, based on a threshold, into a high-luminance set and a low-luminance set. For the
images whose luminance is higher than the threshold we remove the hair; for the other group
we keep it, since removing it would lose too many skin pixels, and that group has no problems
with face separation.
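The luminance-gated hair removal can be sketched as below (again a Python stand-in for the project's Matlab). The Cb cutoff, the direction of the Cb test, and the median-luminance split are all assumptions on our part: the report chose its values experimentally and does not list them.

```python
import numpy as np

# Both constants are illustrative assumptions, not the report's tuned values.
CB_HAIR_CUTOFF = 120.0   # assumed: hair-like pixels sit at or above this Cb
Y_BRIGHT_MEDIAN = 130.0  # assumed: median-Y value separating bright images

def remove_hair(skin_mask, y, cb):
    """Drop hair-like pixels from the skin mask, but only when the image
    is bright, as decided by its median luminance (Section 2)."""
    if np.median(y) > Y_BRIGHT_MEDIAN:
        # Bright image: connected hair merges neighboring faces, so trade
        # some skin pixels for hair removal.
        return skin_mask & (cb < CB_HAIR_CUTOFF)
    # Dark image: keep the mask untouched to avoid losing skin pixels.
    return skin_mask
```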

                     Figure 2-8: Skin (blue) vs. black hair (red) pixels.
     The hair removal procedure gives satisfying results; one of the resulting binary images
 can be seen in Figure 3-1.
     The color segmentation scheme has several limitations. One is that the facial features
 (mouth and eyes) in our training images fall in the same color space region as the skin in
 HSV, YCbCr and RGB. Thus, it is not possible to make these facial features more prominent
 in order to increase the correlation with the template. A similar limitation is observed with
 certain hair colors, such as light brown.

Figure 2-9: Plot of skin (blue) vs. eyes      Figure 2-10: Plot of skin (blue) vs. eyes (green) vs.
(green) vs. mouth(black) vs. fair hair        mouth(yellow) vs. fair hair (cyan) vs. background
(magenta) samples in YCbCr space.                        (red) samples in HSV space.

    Edge filtering was investigated as an alternative to removing the hair, since it highlights the
face edges. Figure 2-11 shows that this methodology introduces too many black pixels within
the face candidates (due to edges within the face areas being highlighted), which caused some
faces to be divided in half. Among the three edge filtering possibilities (horizontal, vertical or
both), horizontal filtering gives the best results because it introduces the fewest face divisions.
However, none of the edge filtering options is good enough, so this technique was abandoned
in favor of removing the hair.

  Figure 2-11: All edge filtering, Vertical Filtering only and Horizontal Filtering only
   Overall we are satisfied with the segmentation results, since they provide a good enough
compromise across the different training images to successfully carry out the next steps.

3. Connected Component Analysis
   The color segmentation generates a binary mask with the same size as the original image.
Figure 3-1 shows an example of the binary mask generated from training image no. 5.

                                      Figure 3-1: Binary Mask

    Figure 3-1 includes most of the skin regions, such as faces, hands and arms. However,
some regions similar to skin also appear white: pseudo-skin pixels belonging to clothes, floors
and buildings. The goal of the connected component algorithm is to analyze the connectivity
of the skin regions and identify the face candidates, which are described by rectangular boxes.
     Ideally, each face is a connected region separated from the others. However, in some
circumstances two or even three faces can be connected by ears or high-luminance hair. In
addition, pseudo-skin pixels are scattered and generate hundreds of connected components,
which cost unnecessary computation if they are identified as face candidates. Therefore, pre-
processing of the binary mask before the connected component analysis is necessary.
     Figure 3-2 shows two faces that are connected. However, the connection is thin compared
to the inner regions of the faces, and it can be broken by image morphology operations. In
particular, one row-direction and one column-direction erosion are applied, so that more
pixels are eroded in the column direction. This is based on the observation that faces are
usually connected horizontally. In addition, within a face, the connection between the parts
above and below the eyes is fragile, and it is desirable not to erode it. At the same time, the
erosion operations act similarly to a median filter and can remove pseudo-skin pixels, thanks
to their scattered and weakly connected nature.
     The lighting condition of the image plays an important role in the quality of the binary
mask. In strong light there tend to be more pseudo-skin pixels, which calls for more erosion;
but too much erosion makes faces fall apart in weak-light images. Therefore, we perform the
erosion adaptively depending on the lighting condition, determined from the median
luminance value of all the pixels in the image. For strong-light images, two additional column
erosions are included. Between the first- and second-level erosions, holes are filled so that
later erosions act only at the edges of the connected components and do not cause regions
inside faces to fall apart.
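The adaptive erosion sequence described above might look as follows in Python with SciPy (the project used Matlab). The structuring-element sizes and the luminance threshold are assumptions; the report tuned its values on the training images.

```python
import numpy as np
from scipy import ndimage

def preprocess_mask(mask, y, bright_median=130.0):
    """Adaptive morphology sketch for Section 3."""
    row_se = np.ones((1, 3), dtype=bool)  # erodes along each row
    col_se = np.ones((3, 1), dtype=bool)  # erodes along each column
    # First-level erosion: one row pass plus one column pass.
    mask = ndimage.binary_erosion(mask, structure=row_se)
    mask = ndimage.binary_erosion(mask, structure=col_se)
    # Fill holes so that later erosions act only on component edges and
    # do not make regions inside faces fall apart.
    mask = ndimage.binary_fill_holes(mask)
    # Strong-light images get two extra column erosions (second and
    # third level) to break face-to-face connections.
    if np.median(y) > bright_median:
        mask = ndimage.binary_erosion(mask, structure=col_se)
        mask = ndimage.binary_erosion(mask, structure=col_se)
    return mask
```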
     Pre-processing of the binary image that breaks the connection between faces in strong
light conditions is shown in Figure 3-2 through Figure 3-5. In particular, Figure 3-2 is the
binary image from skin segmentation, with two faces connected; Figure 3-3 is the result of
the first-level column and row erosion; Figure 3-4 the result of hole filling and the second-
level column erosion; and Figure 3-5 the result of the third-level erosion, with two separate
connected components for the two faces.


         Figure 3-2: Connected Faces                Figure 3-3: First Level Erosion

         Figure 3-4: Hole Filling and               Figure 3-5: Third Level Erosion
             Second Level Erosion

     The connected component analysis consists of labeling the pre-processed mask by
looking at the connectivity of neighboring pixels. Each connected component is considered a
potential face candidate, and a rectangular boundary box is computed for it.
     An adaptive scheme filters out some non-face boxes based on size information. Assuming
that no face is much larger or smaller than the median size, it is possible to remove the boxes
with unreasonably large or small area, width, height or width-to-height ratio. The remaining
boundary boxes are considered face candidates and passed to the template-matching step.
Figure 3-6 shows the face candidates obtained by applying the connected component analysis
to the Figure 3-1 binary mask.
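The labeling and size-based filtering can be sketched as below (a Python/SciPy stand-in for the project's Matlab code). The tolerance factors around the median are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def face_candidate_boxes(mask, area_lo=0.4, area_hi=2.5, ratio_max=2.0):
    """Label the mask, compute a bounding box per component, and keep
    only boxes reasonable relative to the median box (Section 3)."""
    labeled, num = ndimage.label(mask)
    boxes = ndimage.find_objects(labeled)
    sizes = [((b[1].stop - b[1].start), (b[0].stop - b[0].start)) for b in boxes]
    areas = np.array([w * h for w, h in sizes], dtype=float)
    median_area = np.median(areas)
    candidates = []
    for box, (w, h), area in zip(boxes, sizes, areas):
        if not (area_lo * median_area <= area <= area_hi * median_area):
            continue  # unreasonably large or small component
        if max(w / h, h / w) > ratio_max:
            continue  # too elongated to be a face
        candidates.append(box)
    return candidates
```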


                                 Figure 3-6: Face Candidates
    The above algorithm identifies all faces but one across the seven training images, without
two or more faces falling into a single face candidate. The missed face is due to the dark
glasses the person wears.

4. Template Matching
    The template-matching step compares the face candidate image with a face template,
measures the level of similarity and decides whether the candidate is a human face or a
non-face. Several enhancements have been made to optimize the template-matching algorithm
for the training images given by the EE368 instructors. A multi-layer classification scheme has
been implemented to avoid missing faces or accepting non-faces. The template matching is
performed on grayscale images, since this experimentally gave the best results. The template-
matching algorithm loads the face and non-face template images, then computes either the
2-dimensional (2-D) cross-correlation or the 2-D convolution.
    The face template is an image made by averaging all the faces in the training images.
Figure 4-1 shows the face template image.

                                   Figure 4-1: Face Template image

    A few human faces are not detected if only one face template is used, due to the very
different skin colors or face profiles found across subjects. Additional face templates are used
to detect these missing faces. Figure 4-2 shows the additional face template images.

                            Figure 4-2: Additional Face Template images
    A few non-faces, such as hands or clothes whose color is similar to skin, are detected
as human faces if only one face template is used. To avoid this, hand templates have
been created to remove hand-like non-faces.

                                Figure 4-3: Non-face Template images
     After loading all template images, the median box size of the face candidates present in
the image under test is determined. Then each face candidate is analyzed one by one. No
rotation of the face template or test candidate is performed, because no face in the training
images requires it.
     If the face candidate box size is similar to or smaller than the median face size, the face
candidate is resized to the face template size and the 2-D cross-correlation is applied. If the
cross-correlation with the face templates is greater than a predetermined threshold, the face
candidate is concluded to be a human face; otherwise it is a non-face. If the cross-correlation
with a non-face template is greater than a predetermined threshold, the face candidate is
concluded to be a non-face. Moreover, the standard deviation of the gray pixels is also
computed to remove non-faces with uniform skin-like color, such as clothes; if it is less
than a predetermined threshold, the face candidate is concluded to be a non-face.
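The multi-layer decision for near-median candidates can be sketched as follows (a Python stand-in for the project's Matlab code). The three thresholds are illustrative assumptions, not the report's tuned values, and the patch is assumed already resized to the template size.

```python
import numpy as np

def classify_candidate(patch, face_tmpl, nonface_tmpl,
                       face_thr=0.5, nonface_thr=0.6, std_min=15.0):
    """Multi-layer decision for a near-median-size candidate (Section 4)."""
    def ncc(a, b):
        # Normalized 2-D cross-correlation at zero lag.
        a = a - a.mean()
        b = b - b.mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return float((a * b).sum() / denom) if denom > 0 else 0.0
    if patch.std() < std_min:
        return "non-face"   # uniform skin-colored region (clothes, floor)
    if ncc(patch, nonface_tmpl) > nonface_thr:
        return "non-face"   # hand-like shape matched a non-face template
    if ncc(patch, face_tmpl) > face_thr:
        return "face"
    return "non-face"
```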
     On the other hand, if the face candidate is larger than the median face size, convolution
template matching is used, because it is more likely that only part of the candidate box
contains the face. There could be a face inside the large box because the candidate might
consist of faces with long shining hair, or of two faces that are very close or even superposed.
Applying the cross-correlation function doesn't work for large boxes, because the face would
be resized to a very small size. To avoid missing a face inside a big box, the 2-D convolution
is applied: the convolution between the inverse of the face template and the face candidate is
carried out, and the peak of the convolution is computed and normalized using the face
template weight.
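A sketch of this convolution matching (Python stand-in for the Matlab implementation): convolving the candidate with the flipped template is equivalent to sliding cross-correlation, and reading the template "weight" as its energy is our assumption about the normalization.

```python
import numpy as np
from scipy.signal import fftconvolve

def convolution_peak(patch, face_tmpl):
    """Normalized convolution peak for a large candidate box (Section 4)."""
    tmpl = face_tmpl - face_tmpl.mean()
    kernel = tmpl[::-1, ::-1]              # flipped template: conv == correlation
    response = fftconvolve(patch, kernel, mode="valid")
    weight = np.sqrt((tmpl * tmpl).sum())  # assumed template "weight" (energy)
    return float(response.max() / weight)
```

A candidate containing the face somewhere inside produces a high peak; a uniform region produces a peak near zero.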

    Figure 4-4 shows an example of a big image box that contains a human face.

                     Figure 4-4: Large Size Image that contains a Human Face

    If the peak value is greater than a predetermined threshold, it is concluded that there is
one or more faces; otherwise it is a non-face. The Matlab code could be improved by detecting
the number of peaks greater than the threshold, in order to detect multiple faces inside the
big box; this optimization is not made at present because no case of multiple faces inside one
box occurs in the training images. The following results were obtained by applying training
image no. 4 to the Matlab procedure and the evaluate.m procedure given by the EE368
instructors.
    Figure 4-5 shows the resulting image. The green boxes represent the detected human
faces and the red boxes the non-faces. Case numbers have been introduced in Figure 4-5 to
help the following review of the resulting image:
Case 1:      a human face detected by the cross-correlation operation with the face template
             presented in Figure 4-1,
Case 2:      a human face detected by the cross-correlation operation with a face template
             presented in Figure 4-2,
Case 3:      a non-face, a hand in particular, removed by the cross-correlation operation with
             a non-face template presented in Figure 4-3,
Case 4:      a non-face, a pair of pants in particular, removed by the standard deviation
             requirement,
Case 5:      a human face detected by the convolution operation with the face template
             presented in Figure 4-1,
Case 6:      a non-face removed by the convolution operation with the face template presented
             in Figure 4-1.
    This enhanced template-matching algorithm is capable of detecting all human faces with
no false alarms. We classified half faces as non-faces; in order to detect half faces, one could
determine whether the candidate box is located at the edge of the image, and then use a half
template for the cross-correlation operation.




                          Figure 4-5: Resulting Image. Green boxes are faces,
                                      red boxes are non-faces.


  5. Conclusion
      We have presented a face detection algorithm for color images that uses color
  segmentation, connected component analysis and multi-layer template matching. Our method
  uses the color information in HSV space, compensates for the luminance condition of the
  image, and overcomes the difficulty of separating connected faces using image morphology
  processing. Finally, an enhanced version of the template-matching algorithm detects the
  human faces and rejects non-faces such as hands and clothes.
      Experimental results show that our approach detects 164 of the 165 faces present in the
  seven project training images (half faces are classified as non-faces). The only missed face is
  due to very dark glasses. No false alarms are raised in any of the seven images. The average
  run time on an ISE lab workstation is ~12 seconds.
      Future work will focus on verifying the algorithm's performance on general images and
  on the modifications required to make the algorithm robust to arbitrary images.

  6. References
[1]  H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. In Proc. IEEE Conf.
     on Computer Vision and Pattern Recognition, pages 203-207, San Francisco, CA, 1996.
[2]  K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection.
     A.I. Memo 1521, CBCL Paper 112, MIT, December 1994.
[3]  B. Moghaddam and A. Pentland. Probabilistic visual learning for object representation. In S. K.
     Nayar and T. Poggio, editors, Early Visual Learning, pages 99-130. Oxford Univ. Press.
[4]  O. Jesorsky, K. J. Kirchberg, and R. W. Frischholz. Robust face detection using the Hausdorff
     distance. In Proc. Third International Conference on Audio- and Video-based Biometric
     Person Authentication, Halmstad, Sweden, 2001.
[5]  K. Sobottka and I. Pitas. Looking for faces and facial features in color images. Pattern Recognition
     and Image Analysis: Advances in Mathematical Theory and Applications, Russian
     Academy of Sciences, 1996.
[6]  C. Garcia, G. Zikos, and G. Tziritas. Face detection in color images using wavelet packet analysis.
     In Proc. 6th IEEE International Conference on Multimedia Computing and Systems
     (ICMCS'99), Florence, June 7-11, 1999, pages 703-708.
[7]  J.-C. Terrillon, M. David, and S. Akamatsu. Automatic detection of human faces in natural scene
     images by use of a skin color model and of invariant moments. In Proc. Third International
     Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pages 112-117.
[8]  R.-L. Hsu, M. Abdel-Mottaleb, and A. K. Jain. Face detection in color images. IEEE Trans.
     Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pages 696-706, May 2002.
[9]  C.-L. Chang, E. Li, and Z. Wen. Rendering novel views of faces using disparity estimation.
     Stanford EE368 Spring 2000/2001 Final Project.
[10] X. Mu, M. Artiklar, M. Artiklar, M. Hassoun, and P. Watta. Training algorithms for robust
     face recognition using a template-matching approach. In Proc. IJCNN'01, Washington DC,
     July 15-19, 2001.


Appendix I: Detection results of our algorithm on the seven EE368 training images

                          Training_1 Score: 23/23


         Training_2 Score: 23/23

         Training_3 Score: 24/24


         Training_4 Score: 21/22

         Training_5 Score: 26/26


         Training_6 Score: 25/25

         Training_7 Score: 22/22


Appendix II (Work breakdown)

The project was broken down into three parts, and each of us was designated the main
person responsible for one part (Color Segmentation – Ana Bertran; From Binary Image to
Face Candidates – Huanzhou Yu; Face vs. Non-face Decision – Paolo Sacchetto). However,
we helped each other with the parts that needed extra work, so we all put the same amount
of time and effort into the project. We also each wrote one part of the report, but we all
revised the final draft and modified each other's parts. The same applies to the slides.

