                    Face detection for video summaries

                        Jean Emmanuel Viallet and Olivier Bernier

                   France Télécom Recherche & Développement
                 Technopole Anticipa, 2, avenue Pierre Marzin
                            22307 Lannion, France
     jeanemmanuel.viallet, olivier.bernier@rd.francetelecom.com

       Abstract. In an image, the faces of persons are the first information a viewer
       looks for. Efficient face detection in videos featuring persons (excluding
       cartoons and nature videos) makes it possible to classify shots and to obtain
       face summaries automatically. Sampling frames within shots greatly reduces
       processing time. Scene layout (same number of persons, similar face positions
       and sizes) provides a criterion for a similarity measure between shots. Similar
       shots are gathered into shot clusters, and all but one shot of each cluster are
       discarded from the summary.

1 Introduction

Persons are the principal operators of shooting, the major subject of shooting and
the primary concern of the audience. This interest in persons, and specifically in
faces, is illustrated every day in television magazines (paper or electronic) where,
by well-established convention, summaries of programs are almost systematically
illustrated with images of persons, and most of the time with (cropped) close-up
shots of faces.

An alternative to summarizing a whole video with a single image consists in
segmenting the video into shots. Segmentation consists in finding the location and
the nature of the transition between two adjacent shots and has led to numerous
techniques, tailored to the nature of the transition, both in the compressed [1] and
uncompressed domains [2]. Each identified shot is summarized by a key frame [3].
Shot detection and key frame extraction rely on low-level information (colour,
movement), but nothing is known about the content of the shot or key frame
(presence or absence of persons or of specific objects). We present a technique to
summarize videos using face information obtained by face detection. This technique
is adequate for videos with persons but unsuitable for videos such as cartoons or
nature documentaries (without faces of persons).

Sequences or scenes are narrative units at a level of abstraction greater than shots,
and scene segmentation may thus vary subjectively according to the director, editor
or audience. Some key frames or shots can be viewed as carrying little information
(intermediate shots) or redundant information (alternate shots in a dialogue scene).
We remove such key frames and thereby decrease the size of the summary [4].
2 Face detection

Since faces represent high-level information to which humans are very sensitive,
face/non-face shot classification [5] and face-based summaries are relevant. Early
work on face detection, by Rowley, Pentland and others, dealt with frontal face
detection. Most of the work on face-based video indexing deals with news videos
and the detection of anchors' faces [6, 7]. Such videos are of particular interest
since overlaid text and audio recognition contribute efficiently to indexing. As far
as face detection is concerned, these videos are characterized by a typical frontal
face, a waist-high shot and a central face position when there is one anchor. The
anchor usually looks at the camera, which eases the face detection process.
Unfortunately, in most non-news videos, such as those dealt with in this paper,
frontal views are not always available and side-view face detection is needed [8, 9];
our own face detector achieves detection up to angles of 60° [10].

The performance of a face detector on a video can be evaluated by processing every
image. Apart from being tedious, such an evaluation is biased, since many images of
a shot are highly similar. Performance is thus estimated on a shot basis (Table 1).
Since this work does not focus on automatic shot detection, shot boundaries (cuts,
for the videos processed) are determined manually. A shot is manually labelled a
face shot if at least one image of the shot exhibits at least one face; otherwise it is
labelled a non-face shot. A face shot is correctly classified if at least one face is
detected in at least one frame of the shot, and incorrectly classified if no face is
detected in any frame. A non-face shot is correctly classified if no detection occurs
in any frame, and incorrectly classified if a detection occurs in at least one frame.
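These shot-level rules can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the paper):

```python
def classify_shot(frame_detections, is_face_shot):
    """Apply the shot-level classification rules above.

    frame_detections: number of detected faces in each frame of the shot.
    is_face_shot: manual label (True if at least one frame shows a face).
    Returns True when the shot is correctly classified.
    """
    detected = any(n > 0 for n in frame_detections)
    # A face shot is correct if a face is found in at least one frame;
    # a non-face shot is correct only if no frame triggers a detection.
    return detected if is_face_shot else not detected
```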
The face detection rate, estimated on shots, is 56% (Table 1), below the 75% rate
obtained on still images [10]. Face detection typically fails because of face
orientation, size, occlusion and colorimetry (when skin-tone pre-filtering is
implemented in order to accelerate the face detection process).
Only five non-face shots (less than 4%) are misclassified (Table 1). The false alarm
rate on video is equivalent to the one obtained with still images [10] (one false
alarm for every 250 full images processed). Although the false alarms obtained are
highly correlated (they look alike), their temporal stability is low, and no false
alarm occurs on more than two consecutive frames. This low temporal stability
could be used to filter out false alarms automatically [11].
The overall rate of correctly classified shots is 85%.

2.1 Frame sampling and face detection

An alternative to using a fast face detector such as the one described in [12] is to
process a limited number of frames per shot. This is of particular interest for non-
face shots (the most numerous in the tested videos), which otherwise must be
entirely scanned before being classified as non-face. Over the 185 shots of the seven
videos we processed, the rate is 737 shots/hour, corresponding to an average shot
length of 109 frames. Once the limits of a shot are known (obtained with automatic
shot segmentation, for example), face detection is performed on frames sampled
along the shot until a detection occurs; the corresponding frame becomes the key
frame of the shot. When no detection is found, the shot is classified as non-face and
is discarded from the video summary.
On the processed videos, the face shot detection yield increases only slightly for
sampling rates greater than 3 to 4 samples per shot. A face shot is here a shot where
detection occurs (face or false alarm). The detection yield is the ratio of the number
of face shots obtained with s samples to the number of face shots obtained when all
frames are processed. Sampling is equivalent to Group Of Pictures processing for
compressed video [11]. On average, only 3.65 frames are processed when a
maximum of four samples per shot is selected.
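The sampling scheme can be sketched as follows (an illustrative reading of the text, assuming evenly spaced samples; `detect` stands for any face detector returning a truth value per frame):

```python
def sampled_indices(n_frames, max_samples=4):
    """Up to max_samples frame indices spread evenly over a shot."""
    k = min(max_samples, n_frames)
    if k <= 1:
        return [0]
    return [round(i * (n_frames - 1) / (k - 1)) for i in range(k)]

def key_frame_for_shot(frames, detect, max_samples=4):
    """The first sampled frame with a detection becomes the key frame;
    None means the shot is classified non-face and discarded."""
    for idx in sampled_indices(len(frames), max_samples):
        if detect(frames[idx]):
            return idx
    return None
```

For an average 109-frame shot, four samples fall at frames 0, 36, 72 and 108, so most non-face shots are rejected after four detector calls instead of a full scan.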

Table 1. Face and non-face shot classification.

                                      Correctly classified   Incorrectly classified
        50 face shots                        56%                     44%
        135 non face shots                  96.3%                   3.7%
        Total 185 shots                     85.4%                   14.6%

3 Video summaries

The shot summary, obtained from shot segmentation, has a number of key frames
equal to the number of detected shots (key frames are manually selected as the
middle images of the shots) (Fig. 1 top). Each shot (and corresponding key frame)
has a priori the same importance. The face shot summary (Fig. 1 bottom left), far
smaller than the shot summary, keeps only the key frames where a detection has
occurred. The face summary collects the (cropped) images of the detected faces and
discards similar faces (Fig. 1 bottom right).
A video could be summarized with the (cropped) image of the first face detected.
Such a one-face-image summary limits processing time and provides a summary
more interesting than the first image of a video (often a dark image or a credits
image), with which, until recently, video search engines used to summarize videos
before selecting a frame from within the video as summary [13].
From this face information, different selection processes may be thought of. For
example, retaining the face key frame corresponding to the longest face shot is
presumably preferable to selecting the key frame of the longest shot.
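As a small illustration of that heuristic (with a data layout of our own choosing, not from the paper):

```python
def longest_face_shot_key_frame(face_shots):
    """face_shots: list of (key_frame_index, shot_length_in_frames)
    for the shots where a face was detected.  Returns the key frame
    of the longest face shot."""
    return max(face_shots, key=lambda shot: shot[1])[0]
```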
A more difficult process consists in selecting an image corresponding to the face
detected in the greatest number of shots. Ascertaining that the faces belong to the
same person [14] can be straightforward when the images differ little (for example,
top images in Fig. 1 bottom left) but is usually difficult (bottom images in Fig. 1
bottom left).
4 Scene layout similarity and shot clusters

A same person may be found in different scenes, corresponding to a change of
location, time or characters (for example, the bottom images in Fig. 1 bottom left).
A same person may also be encountered in different shots of a scene, owing to the
editing technique of alternate shots or to the insertion of shots.
From one scene to another, the changes of pose, facial expression, lighting
conditions and background are among the major reasons that make face
identification difficult, not to mention the fact that face recognition only succeeds
with frontal faces [15].
Within a scene without camera change, on the contrary, the position of the face and
the background do not change much.
We consider that two shots i, j have a similar scene layout and belong to the same
shot cluster if the number of detected faces is the same in both shots and if the
positions and scales of the detected faces have changed by less than a predefined
value between the two shots. We consider the relative variation of the horizontal
position (Equation 1) and of the vertical position (Equation 2) with respect to the
width w and the height h of the face, together with the relative variation of the size
z of the face (Equation 3); x and y are the image coordinates of the position of the
face.
Equation (4) measures the scene layout similarity, according to our criteria, when
only one person is found in shots i and j. These shots are said to be similar when
their mutual similarity S_i,j is greater than a given threshold. If similar, the shots
are merged within a same shot cluster. Otherwise, if a shot cannot be merged into
any cluster, a new cluster is initiated from this shot. The threshold used in the
following experiment is set to 0.5 and corresponds to relative variations of the
lateral position, vertical position and size of the face of about 25%.

        X_i,j = 2 |x_i - x_j| / (w_i + w_j)                                (1)

        Y_i,j = 2 |y_i - y_j| / (h_i + h_j)                                (2)

        Z_i,j = 2 |z_i - z_j| / (z_i + z_j)                                (3)

        S_i,j = 1 / [(1 + X_i,j) (1 + Y_i,j) (1 + Z_i,j)]                  (4)
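Equations (1) to (4) and the clustering rule translate directly into code. The sketch below assumes one face per shot, described by a (x, y, w, h, z) tuple, and uses a greedy first-fit merge against the first shot of each cluster, a policy the paper does not specify in detail:

```python
def layout_similarity(a, b):
    """Scene layout similarity S between two one-face shots
    (Equations 1-4); a and b are (x, y, w, h, z) face tuples."""
    X = 2 * abs(a[0] - b[0]) / (a[2] + b[2])    # Eq. 1: horizontal shift
    Y = 2 * abs(a[1] - b[1]) / (a[3] + b[3])    # Eq. 2: vertical shift
    Z = 2 * abs(a[4] - b[4]) / (a[4] + b[4])    # Eq. 3: size change
    return 1.0 / ((1 + X) * (1 + Y) * (1 + Z))  # Eq. 4: similarity

def cluster_shots(shots, threshold=0.5):
    """Merge each shot into the first sufficiently similar cluster,
    else start a new cluster (greedy assignment, our assumption)."""
    clusters = []
    for shot in shots:
        for cluster in clusters:
            if layout_similarity(cluster[0], shot) > threshold:
                cluster.append(shot)
                break
        else:
            clusters.append([shot])
    return clusters
```

Identical layouts give S = 1, and S decreases toward 0 as the face moves or is rescaled, so the 0.5 threshold bounds the combined relative variation.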

With the 28 shots for which faces have been detected and the threshold value of 0.5,
the shot clusters obtained are given in Fig. 3, presented on a video-per-video basis.
The criteria (Equations 5 to 8) used to compare the effectiveness of shot
segmentation techniques [16] are also used to measure the quality of the shot
clusters obtained.
        Accuracy = (NC - NI) / NT = 0.77                                   (5)

        Recall = NC / (NT + ND) = 0.70                                     (6)

        Error rate = (ND + NI) / (NT + NI) = 0.11                          (7)

        Precision = NC / (NC + NI) = 1                                     (8)
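With the counts reported for this experiment (NT = 18, NC = 14, NI = 0, ND = 2), the four measures can be computed as a direct transcription of Equations (5) to (8):

```python
def cluster_quality(NT, NC, NI, ND):
    """Quality measures for shot clustering (Equations 5-8):
    NT ground-truth clusters, NC correctly identified, NI incorrectly
    inserted, ND incorrectly deleted."""
    accuracy = (NC - NI) / NT
    recall = NC / (NT + ND)
    error_rate = (ND + NI) / (NT + NI)
    precision = NC / (NC + NI)
    return accuracy, recall, error_rate, precision
```

For NT = 18, NC = 14, NI = 0, ND = 2 this reproduces the quoted values up to rounding (accuracy 14/18 ≈ 0.78, recall 0.70, error rate ≈ 0.11, precision 1).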

The total number of clusters is estimated by the authors to be NT = 18. This
estimation is subjective, and a different, although close, number of clusters could
have been found by someone else. The situation is similar for shot segmentation, for
which there is no absolute ground truth. For instance, in one of the videos, the first
cluster is obtained in the same room as the second cluster, but with a greater field of
view and a slightly different camera angle (Fig. 2 top left and Fig. 3). In another
video, the first cluster corresponds to a location estimated to be different from the
location of the last cluster (Fig. 2 top right and Fig. 3).
The number of correctly identified clusters is NC = 14, the number of incorrectly
inserted clusters is NI = 0, and ND = 2 is the number of incorrectly deleted clusters.
The first deleted cluster corresponds to an obtained cluster which incorrectly merges
a same person in a similar position but in two different places (Fig. 2 bottom left
and Fig. 3). The second deleted cluster (Fig. 2 bottom right and Fig. 3) corresponds
to a cluster that incorrectly merges a man and a woman. Had the colorimetry of the
images been taken into account, these two errors would probably have been
avoided, as shown by key frame clustering based on compressed chromaticity
signatures [17].
Keeping only one sample per cluster yields smaller video summaries. Summaries
assembling (cropped) images of faces (Fig. 1 bottom right) focus on the person to
the detriment of contextual information.

5 Conclusion

Face detection is a means to obtain video summaries of a kind people are familiar
with, that is to say, summaries that focus on face information. The size of the
obtained summaries is far smaller than that of the standard shot summary; it even
benefits from the undetected faces, while the false alarm rate remains low. Many of
the face images are similar; they can be gathered into shot clusters and all but one
discarded from the summary.
Fig. 1. Top: the standard key frame "shot" summary. Bottom left: the "shot-face" summary
obtained by selecting shots where faces are detected. Bottom right: the "face" summary,
keeping only facial parts of the images and discarding similar redundant faces.

Fig. 2. Top left: same person and location, different frames. Top right: same person, different
location and face position. Bottom left: same person, different locations and similar positions.
Bottom right: different persons and locations, similar face position.
Fig. 3. Face shot clusters. Each of the seven black frames corresponds to a video. Columns
show the different clusters of a video and rows show the shots of a cluster, according to the
chronology of the video. Only the information on the number, size and position of faces is
used; image colorimetry is not taken into account. Two clusters are incorrect: one merges a
woman and a man and, in the second one, a man is first in front of bookshelves and then in
front of a window. For the top video, two images of the second shot cluster, enclosed with
dashed lines, are positioned on the first row for convenience.

References

1. Wang, H.L., Chang, S.F.: A Highly Efficient System for Automatic Face Region Detection
   in MPEG Video. IEEE Trans. Circuits and Systems for Video Technology, 7(4) (1997)
   615-628
2. Demarty, C.H., Beucher, S.: Efficient morphological algorithms for video indexing. Content-
   Based and Multimedia Indexing, CBMI'99 (1999)
3. Chen, J.-Y., Taskiran C., Albiol, A., Delp, E. J., Bouman, C. A.: ViBE: A Video Indexing
   and Browsing Environment. Proceedings of the SPIE Conference on Multimedia Storage
   and Archiving Systems IV, September 20-22, Boston, vol. 3846 (1999) 148-164
4. Aoki, H., Shimotsuji, S., Hori, O.: A shot classification method of selecting effective
   key frames for video browsing. Proc. of ACM Int'l Conf. on Multimedia, Boston, MA
   (1996) 1-10
5. Chan, Y., Lin, S.H., Tan, Y.P., Kung, S.Y.: Video Shot Classification Using Human Faces.
   ICIP (1996) 843-846
6. Eickeler, S., Muller, S.: Content-Based Indexing of TV Broadcast News Using Hidden
   Markov Models. IEEE Int. Conference on Acoustics, Speech, and Signal Processing
   (ICASSP), Phoenix, Arizona (1999)
7. Liu, Z., Wang, Y.: Face Detection and Tracking in Video Using Dynamic Programming,
   ICIP00 (2000) MA02.08
8. Schneiderman, H., Kanade, T.: Probabilistic Modeling of Local Appearance and Spatial
   Relationships for Object Recognition. IEEE Computer Vision and Pattern Recognition,
   Santa Barbara (1998) 45-51
9. Wei, G., Li, D., Sethi, I. K.: Detection of Side View Faces in Color Images. WACV00
   (2000) 79-84
10. Féraud, R., Bernier, O., Viallet, J. E., Collobert, M.: A fast and accurate face detector based
   on neural networks. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 23 (2001)
11. Wang, H., Stone, H. S., Chang, S.-F.: FaceTrack: Tracking and Summarizing Faces from
   Compressed Video. SPIE Multimedia Storage and Archiving Systems IV, Boston (1999)
12. Viola, P., Jones, M.: Robust Real-Time Face Detection. International Conference on
   Computer Vision 01 (2001) II:747
13. Altavista video search engine: http://www.altavista.com
14. Eickeler, S., Wallhoff, F., Iurgel, U., Rigoll, G.: Content-Based Indexing of Images and
   Video Using Face Detection and Recognition Methods. IEEE Int. Conference on Acoustics,
   Speech, and Signal Processing (ICASSP), Salt Lake City, Utah (2001)
15. Satoh, S.: Comparative Evaluation of Face Sequence Matching for Content-based Video
   Access. Proc. of Int'l Conf. on Automatic Face and Gesture Recognition (FG2000) (2000)
16. Ruiloba, R., Joly, P., Marchand-Millet, S., Quenot, G.: Towards a standard protocol for the
   evaluation of video-to-shots segmentation algorithms. CMBI 1999 Proceedings of the
   European workshop on content-based-multimedia indexing, Toulouse, France (1999)
17. Drew, M. S., Au, J.: Video keyframe production by efficient clustering of compressed
   chromaticity signatures. ACM Multimedia '00 (2000) 365-368
