Reliable and Fast Tracking of Faces under Varying Pose

Tao Yang1, Stan Z. Li2, Quan Pan1, Jing Li1, Chunhui Zhao1
1 College of Automatic Control, Northwestern Polytechnical University, Xi'an, China, 710072
2 Center for Biometrics and Security Research & National Laboratory of Pattern Recognition,
Institute of Automation, Chinese Academy of Sciences, Beijing, China, 100080

                       Abstract

   This paper presents a system that is able to track multiple faces under varying pose (tilted and rotated) reliably in real-time. The system consists of two interactive modules. The first module performs detection of faces subject to rotations. The second performs online learning based face tracking. A mechanism for switching between the two modules is embedded in the system to automatically decide the best strategy for reliable tracking. The mechanism enables a smooth transition between the detection and tracking modules when one of them gives no results or unreliable results. Results demonstrate that the system can reliably track multiple faces in real-time in complex backgrounds under out-of-plane rotation, up to 90 degree tilting, fast nonlinear motion, partial occlusion, large scale changes, and camera motion.

1. Introduction

   Real-time object tracking in complex environments has many practical applications, such as visual surveillance and biometric identification, and is a challenging research topic in computer vision. Accurate real-time face tracking will improve the performance of face recognition, human activity analysis, and high-level event understanding.
   Face detection and tracking have recently received much attention. The system of Toyama [1] achieved successful face tracking using Incremental Focus of Attention (IFA), a state-based architecture which allows fast recovery of lost targets within a unified framework. Viola and Jones [2] use AdaBoost for face detection, which is related to an earlier work of Tieu and Viola [3] on boosting for image retrieval. These systems much advanced previous techniques in accuracy and achieve real-time performance. However, this work deals primarily with frontal faces.
   The ability to detect and track faces of varying head pose (termed "multiview" faces hereafter) is important for many real applications. To address this problem, Wang and Ji [4] propose a graphical model based method which combines the factorial and the switching Hidden Markov Model (HMM). Feraud et al. [5] adopt a view-based representation for face detection. Wiskott et al. [6] build elastic bunch graph templates for multiview face detection and tracking. Gong et al. [7] study the trajectories of faces in linear PCA feature spaces as they rotate, and use kernel support vector machines (SVMs) for multipose face detection and pose estimation [8]. Viola et al. [9] train a decision tree to determine the viewpoint class. Li et al. [10] propose the FloatBoost algorithm and use a detector pyramid to handle rotated faces. Huang et al. [11] develop a nested cascade detector for multiview face detection.
   Many vision-based tracking methods use low level features such as color and contour to track objects, including faces [12],[13],[14],[15]. Monte Carlo methods [13] adopt sampling techniques to model the posterior probability distribution of the object state and track objects through inference in a dynamical Bayesian network. A robust non-parametric technique, the mean shift algorithm, has also been proposed for visual tracking. In [14] human faces are tracked by projecting the face color distribution model onto the color frame and moving the search window to the mode (peak) of the probability distribution by climbing density gradients. In [15] tracking of non-rigid objects is done by finding the most probable target position through minimizing a metric based on the Bhattacharyya coefficient. Other methods have been presented to track human heads; for example, Birchfield [16] presents an algorithm combining the intensity gradient and the color histogram. Although much progress has been made in face tracking and detection, none of the existing algorithms and systems can handle multiple largely tilted and rotated faces reliably in real-time. Tracked faces can be lost easily when they are tilted or rotated to a large degree, or when they are heavily occluded. Moreover, those algorithms are not fast enough to handle abrupt changes such as jumping and running, especially for non-frontal faces. These limitations must be overcome for a wide range of real applications.
[Figure 1: flowchart. Input Frame -> Face Detection -> (Is Face?) -> (Under Tracking?) -> New Face Confirm -> New Tracker Initialization, or Update Face Pattern of Tracker -> Online Learning Based Face Tracking -> Output Detection and Tracking Result -> Next Frame]

Figure 1. Diagram of the real-time face tracking system.

   This paper presents a novel real-time face tracking system to solve the above problems. The system consists of two interactive modules: (1) a face detection module, and (2) an online learning based face tracking module. The advantage of the first module is its high accuracy of face detection in position and scale; however, it may fail under large tilts and rotations. The second module can track multiple faces in real-time under various head poses, but it is sensitive to large changes in face scale. To overcome the weaknesses of the two modules, the system transitions between them when one module gives no results or unreliable results.
   Results demonstrate that the system can reliably track faces in video sequences in real-time under out-of-plane rotation, up to 90 degree tilting, partial occlusion, large scale changes, camera motion, and multiple persons in complex backgrounds. The speed is 10~12 frames per second for images of size 320x240.
   The remainder of this paper is organized as follows. Section 2 introduces the architecture of the system. Section 3 presents the online learning based face tracking module in detail. Section 4 discusses extensive results. Section 5 describes conclusions and future extensions.

2. System Overview

   The system consists of two interactive modules (Figure 1): (1) a face detection module, and (2) an online learning based face tracking module. The detection module incorporates ideas from Viola [9] and Li [10] for the detection of faces under rotations. In the second module, tracking is performed by a dominant color feature selection method based on mean shift analysis. Unlike other mean shift tracking methods, which minimize the distance between the kernel distribution of the object in the current frame and the model, our system learns the color distributions of the tracked objects online and computes the weight of each pixel by fusing the probabilities of the pixels in the tracked region and the surrounding area. In this way, salient color features of the faces can be selected automatically and dynamically for each frame, making the tracker robust even in complex environments.
   The system operates through interaction between the above two modules. For each input frame, the detection module is used to find all possible faces and to update or initialize trackers in the tracking module. To reduce the influence of false detections (false alarms of the face detection module), we analyze the time-prints of each new face in consecutive frames and use the result to remove noise. Once a new face is confirmed, it is added to the list of tracked objects, and the color distributions of the face and its surrounding area are computed to initialize the tracker. If a tracked face is not detected in a certain frame, for instance under rotation or partial occlusion, the recorded face pattern is used to estimate the target position. Once a tracker receives a detection, its parameters, such as scale, position, and the color distributions of the face and its surroundings, are updated from the detection result. To avoid tracking failures, a tracker is considered to have lost its target if it is not detected for several consecutive frames. The final output is the integration of the two modules.
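The interaction just described can be sketched as the following per-frame loop. This is an illustrative reconstruction of the control flow, not the authors' code; the threshold values, the class and function names, and the exact-rectangle matching of detections to trackers are all simplifying assumptions:

```python
# Sketch of the detection/tracking switching logic (illustrative only).
CONFIRM_THRESHOLD = 3   # detections needed to confirm a new face (assumed value)
LOST_THRESHOLD = 5      # consecutive missed detections before a tracker is dropped (assumed)

class Tracker:
    def __init__(self, box):
        self.box = box      # (x, y, w, h) face rectangle
        self.missed = 0     # counter d: recent frames with no detection

def overlaps(a, b):
    """True if two (x, y, w, h) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def process_frame(detections, trackers, candidates):
    """One step of the switching logic: match detections to trackers,
    confirm repeated new faces, and drop long-lost trackers."""
    for det in detections:
        matched = [t for t in trackers if overlaps(det, t.box)]
        if matched:
            # A detection confirms an existing tracker: update its state.
            for t in matched:
                t.box, t.missed = det, 0
        else:
            # Possible new face: confirm only after repeated detections in
            # consecutive frames (suppresses detector false alarms).
            # Exact-tuple matching is a simplification of the paper's
            # "time-prints" analysis.
            candidates[det] = candidates.get(det, 0) + 1
            if candidates[det] >= CONFIRM_THRESHOLD:
                trackers.append(Tracker(det))
                del candidates[det]
    # Faces without a detection fall back to the online-learning tracker;
    # a tracker unmatched for too many consecutive frames is dropped.
    for t in trackers:
        if not any(overlaps(d, t.box) for d in detections):
            t.missed += 1
    trackers[:] = [t for t in trackers if t.missed < LOST_THRESHOLD]
    return trackers
```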
3. Online Learning Based Face Tracking

   The mean-shift algorithm is a nonparametric statistical method for seeking the nearest mode of a point sample distribution [15],[17],[18]. It has been adopted as an efficient technique for real-time object tracking. One of the key issues in the mean shift algorithm is how to produce the sample weight image at each time step. In the sample weight image, pixels on the object have high weight, while pixels on the background have low weight. Any feature that separates the object from the background can be used to produce the weight map: for instance, motion features when tracking moving objects with a static camera, skin color models in face tracking, texture similarity, or the correlation output of a detection module such as a classifier.
   The problem we address is how to produce a sample weight image in which the face's weight is much higher than that of the dynamic scene for mean shift analysis. Although pixel motion characteristics can be used to separate a moving person from a static background, and thus improve the accuracy of the weight image, most existing motion segmentation algorithms are based on background subtraction with a static camera. Since the camera is active in many application fields, we prefer to build a system that places few constraints on camera motion.
   In our system, we take the color distribution of the object as the main feature. Typically, a color-based weight image is determined by computing the Bhattacharyya coefficient between the color histogram of the object model and the current mean shift window. Instead of using the Bhattacharyya coefficient, we compute the weight image by fusing the probabilities of each pixel under the color distributions of the object model and its surroundings. First, the color distribution of the face model is taken as the feature space. Then the color distribution of the surroundings in the current frame is used to select the dominant colors in the feature space for the next frame during mean-shift tracking. This online learning based method has two main advantages: (1) it does not need to compute a similarity coefficient between two color distributions, and can therefore achieve real-time object tracking even on a common PC; (2) the online surrounding-learning mechanism makes tracking robust even in complex environments.

3.1. Candidate Initialization

   Usually the face is represented by a region of the image, whose shape can be chosen as an ellipse [16] or a rectangle. In our system, the face is modeled as a rectangular window whose size changes online. Its state is defined as S = {x, y, w, h, d}, where (x, y) is the center of the window, (w, h) are its width and height, and d is the number of recent frames in which the tracked face was not detected. I(x, y) denotes the intensity of the input image I at (x, y). During the tracking process, the state S is initialized by the face detection result. We use (1) to compute the height; the face in the tracker is modeled as a rectangle with aspect ratio α:

        h = α ⋅ w,  α ∈ [1, 1.5]                               (1)

where w is the width of the detected face and α is an empirical parameter of the face model, fixed at 1.2 in our system.
   We use the output of the face detection module to initialize the starting values x, y, w, h, and d = 0. If a detected face does not overlap with any existing tracker, it is treated as a possible new face. If the number of times the new face is detected in the following frames exceeds a threshold, the face is confirmed and a new tracker is initialized.

3.2. Dominant Feature Selection

   The goal of feature selection is to find the dominant features in the feature space, so as to produce the sample weight image for mean shift tracking. Its two main components are feature space creation and online feature selection. Considering camera motion and dynamic backgrounds, the color cue is selected to build the feature space. Without loss of generality, many color spaces, such as RGB, HSV, and YUV, could serve as the feature space. Because our experiments gave almost the same tracking results in each of these color spaces, the RGB color distribution of the object is chosen. We build an RGB histogram with N = Nr ⋅ Ng ⋅ Nb bins. Thus we have the object color model q_t:

        q_t = {q_t(n)},  n = 1, ..., N                         (2)

where Σ_{n=1}^{N} q_t(n) = 1. The observation model M(k) is then defined as

        M(k) = Σ_x Σ_y W_q^k(x, y)                             (3)

where W_q^k is the weight image under the object color model q at time k, and x, y range over the rectangular region. Given a pixel I(x, y), let P_t^k(x, y) denote its probability under the object color model. Instead of using the Bhattacharyya coefficient, we compute the pixel weight at (x, y) from (4), without feature selection:

        W_q^k(x, y) = P_t^k(x, y)                              (4)
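As a concrete sketch of (2)-(4), the following code builds the normalized RGB histogram q_t from a face rectangle, backprojects it into a per-pixel weight image, and performs one mean shift step toward the weighted centroid of the window. The bin count per channel and all function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

NBINS = 8  # bins per channel (illustrative; the paper's system uses 10x10x10)

def bin_indices(img):
    """Map 8-bit RGB pixels to flat histogram bin indices."""
    q = (img.astype(np.int64) * NBINS) // 256           # per-channel bin
    return (q[..., 0] * NBINS + q[..., 1]) * NBINS + q[..., 2]

def color_model(img, box):
    """Normalized RGB histogram q_t of the region box = (x, y, w, h); eq. (2)."""
    x, y, w, h = box
    idx = bin_indices(img[y:y + h, x:x + w])
    hist = np.bincount(idx.ravel(), minlength=NBINS ** 3).astype(float)
    return hist / hist.sum()                            # sum_n q_t(n) = 1

def weight_image(img, q):
    """Eq. (4): each pixel's weight is its probability under the color model."""
    return q[bin_indices(img)]

def mean_shift_step(weights, box):
    """Move the window to the weighted centroid of the weights inside it."""
    x, y, w, h = box
    win = weights[y:y + h, x:x + w]
    total = win.sum()                                   # M(k) of eq. (3) over the window
    if total == 0:
        return box
    ys, xs = np.mgrid[0:h, 0:w]
    cx = int(round((xs * win).sum() / total)) + x - w // 2
    cy = int(round((ys * win).sum() / total)) + y - h // 2
    return (cx, cy, w, h)
```

Iterating `mean_shift_step` until the window stops moving gives the mode-seeking behavior described in the text.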
   Because the mean shift iteration is based on the observation model M, under the condition of equation (4) the candidate moves, when the iteration stops, to the nearest high mode of the observation density, where the pixels have high probabilities under the object color model.
   Compared with computing the Bhattacharyya coefficient between the color histogram of the object model and the current mean shift window, an advantage of (4) is that it is simple and fast to implement. However, because (4) considers only the probability of each pixel under the object color model, it may fail in complex environments whose colors are similar to the object model. To solve this problem, we develop an efficient feature selection method that continually uses the color distribution of the surrounding area to select the dominant color features of the object color model; the selected features are then used to produce the weight image for mean shift tracking.
   Considering the nonlinear motion of the object, we choose a large circular area around the current object position to estimate the color distribution of the background. Let q_t^k and q_b^k denote the color distributions of the object and its surrounding background, respectively. The distribution q_d^k of the dominant color features is given by

        q_d(n) = (1/c) ⋅ T,                                    if max(q_t^k(n), ε) / max(q_b^k(n), ε) > T
        q_d(n) = (1/c) ⋅ max(q_t^k(n), ε) / max(q_b^k(n), ε),  if T > max(q_t^k(n), ε) / max(q_b^k(n), ε) > 1    (5)
        q_d(n) = 0,                                            otherwise

where c is a normalization coefficient and ε is a small value that discards small probabilities and prevents division by zero. T is a threshold that prevents too high a contrast between the target and the background model. Equation (4) can then be modified as (6):

        W_q^k(x, y) = P_d^k(x, y)                              (6)

where P_d^k(x, y) denotes the pixel probability under the dominant color distribution q_d.
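Under these definitions, (5) scores each color bin by the ratio of its probability under the face model to its probability under the background model, clipped at T, and drops bins the background dominates. A minimal sketch, with illustrative values for T and ε and hypothetical function names:

```python
import numpy as np

def dominant_model(qt, qb, T=10.0, eps=1e-4):
    """Eq. (5): dominant color distribution q_d from the object model qt
    and the surrounding-background model qb (both normalized histograms)."""
    ratio = np.maximum(qt, eps) / np.maximum(qb, eps)
    qd = np.where(ratio > T, T, ratio)   # cap overly confident bins at T
    qd[ratio <= 1.0] = 0.0               # bins dominated by background are dropped
    s = qd.sum()
    return qd / s if s > 0 else qd       # the 1/c normalization

def dominant_weight_image(bin_idx, qd):
    """Eq. (6): pixel weight is its probability under q_d, looked up by bin index."""
    return qd[bin_idx]
```

Colors shared by face and background (ratio near 1) are zeroed out, so only colors particular to the face contribute to the weight image, which is the "hair punished, skin encouraged" behavior described for Figure 2.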
[Figure 2: nine tracked frames, #11, #29, #61, #187, #189, #192, #196, #467, #477]
Figure 2. Dynamic weight map and face tracking result under large rotation and jumping. The green rectangle (frame #11) shows a newly detected face. The red cross shows the tracking result. Pixels with high weight are displayed as high-luminance green. Note that after feature selection with equation (5), colors shared by the face and the surroundings (hair) are punished, while colors particular to the face (skin) are encouraged.

   Figure 2 shows a sequence of dynamic weight maps and face tracking results under large rotation and jumping. A new face is confirmed (frame #11, green rectangle) and a tracker is initialized according to the face pattern. The following frames include difficult conditions such as out-of-plane rotation (frames #29 and #61), highly nonlinear motion like jumping (frames #61, #187, #192, and #196), and up to 90 degree tilting (frames #467 and #477). The online learning based tracker successfully handled these difficult conditions in real-time.
   Note that although the person's hair is inside the search window of the mean-shift tracker (frame #61), after feature selection with equation (5) the colors shared by the face and its surroundings are punished and the colors particular to the face are encouraged. As a result, pixels whose colors discriminate the face from its surroundings receive high weight, while the hair inside the search window is assigned low weight, making the mean shift tracker more robust on such weight maps.

[Figure 3: tracked frames #79, #349, #372, #394, #214, #249, #353, #417]
Figure 3. Real-time multiple face tracking in an indoor environment. The first and second rows contain a tracking sequence with two persons. The third row contains a tracking sequence with four persons. The cross shows the position of the person's face. A green cross indicates that the face is detected by the face detector; a red cross indicates the output of the online learning based tracker.

[Figure 4: tracked frames #99, #128, #166, #170, #242, #290, #314, #357]
Figure 4. A sequence of two interacting persons tracked in an indoor environment with an active camera. Note that serious problems such as heavy rotation (frames #166, #170, #314, and #357), scale changes (frames #99 and #290), and a changing background are handled correctly.

   During the tracking process, if a tracked face is not detected in a certain frame, the size of its rectangular window is kept constant for simplicity in the online learning based mean-shift analysis. Once a tracker receives a detection, its parameters, such as scale, position, and the color distributions of the face and its surroundings, are updated from the detection result. The mean shift iteration is based on the dominant feature density; when the iteration stops, the Kullback-Leibler (KL) distance D^k is used to estimate the

similarity of the color distribution between the face model q_t and the current iteration result q_m:

        D^k = Σ_{i=1}^{N} q_m^k(i) ⋅ log(q_m^k(i) / q_t(i))    (7)

The iteration result is accepted only when D^k is larger than a threshold.
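The KL distance of (7) can be sketched as follows; the ε guard against empty histogram bins and the function name are illustrative assumptions:

```python
import numpy as np

def kl_distance(qm, qt, eps=1e-10):
    """Eq. (7): D^k = sum_i q_m(i) * log(q_m(i) / q_t(i)).
    qm is the histogram at the converged mean shift window, qt the face
    model; eps guards empty bins against log(0) and division by zero."""
    qm = np.maximum(qm, eps)
    qt = np.maximum(qt, eps)
    return float(np.sum(qm * np.log(qm / qt)))
```

D^k is zero when the converged window's histogram matches the face model exactly and grows as the two distributions diverge, which is what makes it usable as an acceptance test on the iteration result.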
4. Experimental Results

   The system is implemented on a standard PC (Pentium IV at 3.0 GHz). The video images, 320x240 at 24 bits per pixel, are captured by a Sony DCR9E at 25 fps. The system is tested in typical indoor and outdoor environments, with large head rotations in plane and out of plane, partial occlusion, large scale changes, multiple persons, and fast nonlinear motion in complex backgrounds. It works at 10~12 fps. We use a color histogram in RGB color space with 10x10x10 bins to build the color distributions of the object and the surrounding area. We deliberately selected clips taken under difficult conditions, especially those with rotation and occlusion on which well-known face detection systems fail. The following presents the results.
   Figure 3 shows an example of tracking multiple faces in an indoor environment. The cross shows the position of the person's face. A green cross on a face represents a detection of the face, whereas a red cross represents a face under tracking. Note that our system successfully tracks multiple faces in real-time under various difficult conditions, such as out-of-plane rotations in the range of [-90, 90] degrees (first row, frame #79), up-and-down nodding rotations approximately in the range of [-90, 90] degrees (first row, frames #349 and #372), partial occlusion (first row, frames #349, #372, and #394), and large scale changes (third row, frame #249). Many well-known detectors [2] may fail in these situations.
   Figure 4 gives an example of face tracking with an active camera. This video clip includes 410 frames, of which only 44 contain a frontal face that can be detected. Serious problems such as heavy rotation (frames #166, #170, #314, and #357), scale changes (frames #99 and #290), and a changing background are handled correctly. It is hard for systems that only detect faces in still frames to achieve the same tracking result.

5. Conclusion

   We have presented a reliable real-time system that is able to track multiple faces under large tilts and rotations in fast motion with high accuracy. The main contributions of the work are the following. First, we presented a novel system architecture that dynamically switches between the face detection and tracking modules and overcomes the weaknesses of the two individual modules. Second, we described an online learning based face tracking algorithm that improves the system's performance in difficult conditions such as out-of-plane rotation, large tilting, partial occlusion, large scale changes, camera motion, multiple persons, and fast nonlinear motion in complex backgrounds. Future work will focus on integrating more cues and features as evidence for face tracking.

Acknowledgements

   The work presented in this paper was sponsored by the National Natural Science Foundation of China (#60172037 and #60518002) and the Foundation of the National Laboratory of Pattern Recognition (#1M99G50).

REFERENCES

[1] K. Toyama, "Prolegomena for Robust Face Tracking", MSR Technical Report MSR-TR-98-65, November 1998.
[2] P. Viola and M. Jones, "Rapid Object Detection Using a Boosted Cascade of Simple Features", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Dec. 2001.
[3] K. Tieu and P. Viola, "Boosting Image Retrieval", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pages 228-235, 2000.
[4] P. Wang and Q. Ji, "Multi-View Face Detection under Complex Scene based on Combined SVMs", In Proceedings of the International Conference on Pattern Recognition, 2004.
[5] R. Feraud, O. Bernier, and M. Collobert, "A Fast and Accurate Face Detector for Indexation of Face Images", In Proceedings of the Fourth IEEE Conference on Automatic Face and Gesture Recognition, 2000.
[6] L. Wiskott, J. Fellous, N. Kruger, and C. von der Malsburg, "Face Recognition By Elastic Bunch Graph Matching", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, no. 7, pages 775-779, July 1997.
[7] S. Gong, S. McKenna, and J. Collins, "An Investigation into Face Pose Distribution", In Proceedings of the IEEE Conference on Face and Gesture Recognition, 1996.
[8] Y.M. Li, S.G. Gong, and H. Liddell, "Support Vector Regression And Classification Based Multi-View Face Detection and Recognition", In Proceedings of the IEEE Conference on Face and Gesture Recognition, pages 300-305, Mar. 2000.
[9] M. Jones and P. Viola, "Fast Multi-view Face Detection", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2003.
[10] S.Z. Li, L. Zhu, Z.Q. Zhang, A. Blake, H.J. Zhang, and H. Shum, "Statistical Learning of Multi-View Face Detection", In Proceedings of the 7th European Conference on Computer Vision (ECCV), Copenhagen, Denmark, May 2002.
[11] C. Huang, H.Z. Ai, and B. Wu, "Boosting Nested Cascade Detector for Multi-View Face Detection", In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), Vol. 2, August 23-26, 2004.
[12] Y. Wu, T. Yu, and G. Hua, "Tracking Appearances with Occlusions", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. I, pages 789-795, Madison, WI, June 2003.
[13] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking", International Journal of Computer Vision, 29(1):5-28, 1998.
[14] G.R. Bradski, "Computer Vision Face Tracking as a Component of a Perceptual User Interface", Intel Technology Journal, 1998, pages 1-15.
[15] D. Comaniciu, V. Ramesh, and P. Meer, "Real-time tracking of non-rigid objects using mean shift", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2000.
[16] S. Birchfield, "Elliptical Head Tracking Using Intensity Gradients and Color Histograms", In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Santa Barbara, California, pages 232-237, June 1998.
[17] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, May 2003.
[18] G. Jaffré and A. Crouzil, "Non-Rigid Object Localization from Color Model using Mean Shift", In Proceedings of the IEEE International Conference on Image Processing, Vol. 3, Barcelona, Spain, September 2003, pages 317-320.