A Survey of Methods for Face Detection

Andrew King 992 550 627 March 3, 2003


Contents

1 Introduction
  1.1 The Problem of Face Detection
  1.2 Current Work

2 Mathematical Models and Approaches

3 Classifiers

4 Results and Conclusions
  4.1 Results
  4.2 Conclusions and Future Work


Chapter 1 Introduction


1.1 The Problem of Face Detection

In this paper we focus specifically on the problem of face detection in still images. The most straightforward variety of this problem is the detection of a single face at a known scale and orientation; even this, it turns out, is a nontrivial problem. The most immediate application that comes to mind for face detection is as the first step in an automated face recognizer [12]. Thought of in this sense, face detection can be applied to systems for tasks such as automated surveillance and human traffic census.

In and of itself, however, face detection is a fascinating problem. Efficient face detection at frame rate is an impressive goal; it is an analogue of face tracking (on which the literature, owing to the subject's obvious application to human-computer interaction [6], is extensive) that requires no knowledge of previous frames. As such, it is a more challenging problem, particularly since many face tracking approaches are tailored to specific human-computer interaction schemes. Furthermore, fast face detection has a clear application to practical face tracking, in that it can be used to initialize tracking, e.g. when an interaction subject enters the frame or emerges from an occluded position.

Another reason that face detection is an important research problem is its role as a challenging case of a more general problem, namely object detection, for which the applications, once not restricted to faces, are manifold. Face detection is a good paradigm for the general problem for several reasons. A face is naturally recognizable to a human being despite its many points of variation (e.g. skin tone, hairstyle, facial hair, glasses). A human being is of course able to detect a face in the context of an entire person, but we want a simple, context-free approach to detection. Another source of difficulty is the complex three-dimensional shape of the face, and the resulting difference in the appearance of a given face under different lighting conditions, even in an otherwise identical environment [12]. There may be object detection methods that work well for more easily identifiable objects such as blocks, but a method that works well for faces can generally be trusted with the task of detecting a wide range of complex object structures.

The generality of detecting faces in a single greyscale image is a major challenge. We have no standard method for determining illumination data, scene structure, or the context of sub-images without performing extensive operations on the image before even considering faces. Hence a successful strategy for face detection must be able to dodge environmental tricks and traps, but can never be expected to perform perfectly.


1.2 Current Work

There are various solutions to this problem, most of which deal with faces at arbitrary scales (at least within a reasonable range), though most assume an upright face (the method to be used for rotated faces is an obvious exhaustive analogue of any detection method for upright faces). Most of the methods discussed in this paper are concerned only with detecting forward-facing faces. Of these methods, only Schneiderman and Kanade's statistical method considers profile detection [11]. However, their method considers only three face orientations, and practically speaking, each orientation is treated as a different object. The effects of this approach on detecting faces at various orientations are discussed in Chapter 4.

Schneiderman and Kanade apply statistical likelihood tests, using feature output histograms to create their detector scheme in [11]. Rowley et al. use neural network-based filters in [10], obtaining good early results in what has apparently become a benchmark of sorts for face detection schemes. In another early work, Papageorgiou et al. propose a general object detection scheme which uses a wavelet representation and statistical learning techniques [8]. Osuna et al. apply Vapnik's support vector machine technique to face detection in [7], and Romdhani et al. improve on that work by creating reduced training vector sets for their classifier in [9]. Fleuret and Geman attempt a coarse-to-fine approach to face detection, focusing on minimizing computation [3]. In perhaps the most impressive paper, Viola and Jones use the concept of an "integral image", along with a rectangular feature representation and a boosting algorithm as the learning method, to detect faces at 15 frames per second [13]. This represents an improvement in computation time of an order of magnitude over previous implementations of face detection algorithms.

In Chapter 2, we describe the various mathematical models used for these methods. In Chapter 3, we specifically discuss the classifier for each approach. In Chapter 4, the results of these approaches are analyzed and compared.


Chapter 2 Mathematical Models and Approaches
Every method addressed in this paper uses a learning algorithm on a training set to begin the detection process. The training stage is extensive for some methods and relatively small for others. This common training gives us an advantage when we consider the problem for the first time: we can assume that we have available to us data about a general face, and we can infer certain information regarding faces in general.

The most intuitive solution to the problem of modeling faces is a geometric formulation which allows the detector to project a tested image onto a learned subspace and determine whether or not it is close to that subspace. The natural thing to do with a training set, then, is to compute a manifold in R^n (from training images containing n pixels) from the most significant components of the general face. This is a very basic scheme, and it is computationally burdensome.

Sung and Poggio use an adaptation of this scheme to create a detection scheme using Gaussian clusters in R^n. The basic idea of their detection model is to use a multiple-mean Gaussian mixture model for both objects (the general case, as opposed to faces) and non-objects. Since the space with low object probability is the non-object space, it is more accurate to say that among the Gaussian object clusters, negatively weighted clusters are placed so as to improve the definition of the object space. In terms of the detection problem, these negatively weighted clusters are centred at images which can be mistaken for faces, but are not faces. Their implementation uses six clusters each for faces and non-faces. Their learning method is appropriate for their means: a large focus of the detector is on discerning between faces and face-like non-faces. They use a "bootstrapping" strategy for creating a non-face training set consisting of only the most meaningful non-faces (as opposed to a general non-face training set, which would contain many images that are so obviously not faces that they hold little weight in the detector).

Sung and Poggio construct their face set in a very straightforward manner (enlarging their data set with rotations and reflections). The bootstrapping scheme for non-face generation begins with a small set of non-face samples. Their detector is then run, and false positives are added to the non-face set. This method can be iterated until a satisfactory data set has been reached. This makes for a very time-consuming construction, but the resulting set, using their negatively weighted cluster scheme for non-faces, is well suited to their demands. If necessary, the face data set can be bootstrapped in a similar manner. Sung and Poggio claim that their system can be made arbitrarily robust in this manner: "Both false positive and false negative detection errors can be easily corrected by further training with the wrongly classified patterns" [12]. This is in reference to error rates on training sets, and does not necessarily suggest that, given time, the scheme can be improved arbitrarily for unseen data.
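The bootstrapping loop just described can be sketched in a few lines. Here `train` and `detect` are hypothetical stand-ins (toy functions, not Sung and Poggio's actual detector), chosen only to show the control flow:

```python
def bootstrap_nonfaces(train, detect, scenery, nonfaces, rounds=3):
    """Grow the non-face training set from the detector's own false positives."""
    for _ in range(rounds):
        model = train(nonfaces)
        for img in scenery:
            # Scenery images contain no faces, so every detection
            # in them is a false positive worth keeping.
            nonfaces = nonfaces + detect(model, img)
    return nonfaces

# Toy stand-ins: "windows" are integers, and a window is mistaken for
# a face unless it already appears in the non-face training set.
train = lambda nonfaces: set(nonfaces)
detect = lambda model, img: [w for w in img if w not in model]

grown = bootstrap_nonfaces(train, detect, [[1, 2, 3]], [1], rounds=2)
```

In this toy run the non-face set grows until the detector stops producing false positives on the scenery images, which is exactly the fixed point the iterated construction aims for.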

A different approach to separating faces and non-faces in image space is used by Osuna et al., and followed up with work by Romdhani et al. in [7] and [9], respectively. Both are based on support vector machines, a classification method developed by V. Vapnik and others at AT&T Bell Labs, notably presented in [2]. The key to the model for a support vector machine is the choice of a manifold that separates the face set from the non-face set. In [7], a hyperplane is chosen, specifically the hyperplane which maximizes the minimum distance to training examples on either side. A support vector set is, roughly speaking, a set of vectors (images) which are close to this hyperplane, and which can therefore be used on their own to reacquire the hyperplane. In [9], an attempt to improve the performance of a similar system is made via reduced set vector machines. The methods in both papers use quadratic programming heavily, and exploit properties of their models' kernel functions. This is more closely related to their classifiers, and will therefore be left mostly to Chapter 3.

In terms of training sets, Osuna et al. exploit their model and the fact that most vectors will be ignored, or at least meaningless, in their quadratic programming formulation. Because the hardware requirements for training a support vector machine in a natural way are prohibitive, training data must be chosen in a nontrivial manner. First, a set of optimality conditions, specifically the Kuhn-Tucker conditions, is considered. Only those vectors which are relevant to the training, i.e. support vectors, are used. Memory requirements are quadratic in the size of this working vector set, so minimizing it is key. The proposed solution is to decompose the problem into smaller sub-problems, a standard solution when such a decomposition is possible.

Romdhani et al. [9] work further on reducing this vector set in order to improve performance. In [9], it is argued that the support vector set in detectors like that of [7] forms a proportion of the entire training set that stands to be reduced significantly. A fair amount of research has been done on improving the performance of support vector machines since their development less than 10 years ago. Romdhani et al. apply one such method to improve the performance of an SVM-based face detector. Basically speaking, given a vector Φ in the model's feature space (expressible, thanks to the model, as a sum over the support vector set), there is a good approximation Φ′ to Φ which is expressible as a sum over a reduced vector set much smaller than the support vector set. Given a reduced set, the problem remains to minimize the norm of Φ − Φ′. This can be done in terms of the model's associated kernels.
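The run-time side of such a machine is simple: the decision is the sign of a kernel sum over the support vector set. The sketch below uses toy values of our own for the support vectors, the weights α_i, and the labels y_i, with a Gaussian kernel; it is meant only to show why run-time cost is proportional to the size of the support vector set:

```python
import math

def gaussian_kernel(a, b, gamma=0.5):
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def svm_decide(x, support_vectors, alphas, labels, b=0.0):
    # One kernel evaluation per support vector: run-time cost is
    # linear in the size of the support vector set.
    s = sum(a * y * gaussian_kernel(sv, x)
            for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if s + b > 0 else -1

# Toy model: one "face" support vector at 0 and one "non-face" at 4.
svs = [(0.0,), (4.0,)]
alphas, labels = [1.0, 1.0], [1, -1]
```

Shrinking the support vector set, as in [9], directly shortens the sum in `svm_decide`, which is the source of the reported speedup.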


Akin to Sung and Poggio’s bootstrap approach is the retraining performed in [9]. The positive results of such retraining are demonstrated in the context of a neural network-based detector in [10].

Schneiderman and Kanade propose a statistical model in [11]. To apply statistical methods to the problem, they represent visual attributes with wavelet coefficients. This representation suits their needs because, with wavelets, an image can be perfectly reconstructed from its transform with a coefficient set the same size as the image itself. Specifically, their method uses three filter levels, giving 10 image sub-bands. This representation allows them to jointly model image data which is localized in space, frequency, and orientation. From this information, then, they are able to construct a histogram-based face detector.

This method requires that initial histograms be constructed. Schneiderman and Kanade's approach to this is similar to that of Sung and Poggio, and is in fact, loosely speaking, a statistical analogue of the bootstrapping method described previously [11, 12]. Rather than giving every training example acquired through bootstrapping equal weight, they use an approach for faces that explicitly minimizes error on the training data. This is done using AdaBoost, an algorithm for converting a weak learning method into one with high accuracy [4] (Viola and Jones' detector uses a boosting algorithm which is based on AdaBoost). In their training method for faces, Schneiderman and Kanade begin with a bootstrapping basis which is evenly weighted, then give more weight to training images which are identified as false positives. Just like Sung and Poggio's bootstrapping, this training can be iterated to improve robustness.
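The reweighting at the heart of this scheme can be made concrete. The sketch below follows a standard AdaBoost round in the style of Freund and Schapire [4] with toy predictions; it is an illustration of the algorithm, not Schneiderman and Kanade's implementation:

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round: upweight the examples the weak learner got wrong."""
    eps = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1 - eps) / eps)  # weight of this weak hypothesis
    new = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(new)  # normalize so the weights remain a distribution
    return [w / z for w in new], alpha

# Four equally weighted examples; the weak learner misclassifies the last.
weights, alpha = adaboost_round([0.25] * 4, [1, 1, 1, -1], [1, 1, 1, 1])
```

After one round the single misclassified example carries half of the total weight, so the next weak learner is forced to concentrate on it, exactly the behaviour described above for false positives.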

In [10], Rowley et al. present a face detection system based on artificial neural networks. This paper seems to have become an early standard in face detection, against which many researchers compare their results. Of course, this may be due in part to the fact that Rowley provided several authors with test data [9, 13]. The neural network is the most novel part of the paper, as the general method for detection is fairly standard, scanning over every pixel at various scales. The neural network contains three types of hidden units: one set of units for quadrants of the 20 × 20 image, one set for quadrants of the quadrants, and one set for looking at overlapping horizontal strips of the image. The idea is clear: certain hidden units will help detect certain facial characteristics. For example, since an oval binary mask is applied to the image in preprocessing, dark corner pixels will likely be removed in the case of a face. In this situation the quadrant hidden units are likely to sense the presence of eyes in the upper two quadrants.

In order to train the neural network on a face data set, a large number of face images were used, in which feature points were labeled manually [10]. The locations of these feature points are averaged over the training set, then warped to coincide with predetermined points. Each face training image can then be aligned to the mean as the optimal solution to an overdetermined system. Iterating this method results in a suitably warped data set. This set is artificially enlarged, as in other methods, through rescaling, rotation, reflection, and translation. The result of this enlargement is that the neural network, as a filter, becomes invariant to these transformations within a range. Sung's bootstrapping method is used to determine a non-face data set. Rowley et al. provide interesting classification methods, which will be discussed in Chapter 3.

Papageorgiou, Oren, and Poggio, in what can be considered a conceptual precursor to the work of Viola and Jones, use Haar wavelets to create an overcomplete representation of the face class [8]. The focus of their paper is on the development of their wavelet model. They provide a simple application of this model (for objects in general) to face and pedestrian detection. They use an extension of two-dimensional Haar wavelets called the quadruple density transform to create their redundant representative set. This initial set consists of 1734 coefficients for vertical, horizontal, and diagonal wavelets at scales of 2 × 2 pixels and 4 × 4 pixels. To avoid prohibitive computational costs in training the classifier, this set is reduced through statistical analysis to a set of 37 significant coefficients. Again, training is done using bootstrapping methods, as in [12, 10, 7, 9, 11]. In this case, Papageorgiou et al. train their system using a variety of penalties for misclassification [8]. Their results show a marginal improvement when the penalty for missed positives is an order of magnitude greater than the penalty for false detections. In practice, however, there seems to be very little difference between the system under the various training schemes.

Following on the work of Papageorgiou et al., Viola and Jones present a much faster detector than any of their contemporaries [13]. The performance can be attributed to the use of an attentional cascade, using detectors with small feature counts based on a natural extension of Haar wavelets [5]. The cascade itself has more to do with their classifier than with their model, so it will be discussed in the next chapter. Each detector in the cascade fits objects to simple rectangular masks, basically speaking. In order to avoid repeating computations when moving through the cascade, Viola and Jones introduce a new image representation which they call an integral image, which is just what it sounds like: for each pixel in the original image, there is exactly one pixel in the integral image, whose value is the sum of the original image values above and to the left. The integral image can be computed quickly, and it drastically improves computation costs under the rectangular feature model. As explained in [13], the integral image allows rectangular sums to be computed in four array references. The advantage is easy to see: under the conventional representation of an image, the computation time needed would be proportional to the area of the rectangle. At the highest levels of the attentional cascade, where most of the comparisons are made, the rectangular features are very large. As the computation progresses down the cascade, the features get smaller and smaller, but fewer locations are tested for faces. Thus the advantage of the integral image representation is clear.

The remaining difficulty lies in creating and training the attentional cascade, which also contributes heavily to the detector's efficiency. Training the attentional cascade is similar to the other training methods seen, obviously adapted to suit the situation. Because of the cascade's nature, a very high detection rate is needed at each level, but the false detection rate at each level can also be very high, as the overall figures decrease exponentially with the depth of the cascade [13]. Each level of the cascade needs to reject examples that are closer to faces than those rejected by the previous level (as each level inherits the previous level's accepted images). Viola and Jones therefore pass a large number of non-face examples to train the first cascade level, then pass those accepted by the first level on to the next level, and so on. For face training, each level is trained on the same face set. This method is similar in spirit to the more basic bootstrapping methods adapted from Sung's method, but is geared toward the progressive nature of the attentional cascade.
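The four-reference computation is easy to make concrete. The following NumPy sketch (variable names are ours, not from [13]) builds an integral image with cumulative sums and recovers an arbitrary rectangular sum from four entries:

```python
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img values above and to the left, inclusive.
    return img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, top, left, bottom, right):
    # Sum over img[top:bottom+1, left:right+1] from at most four
    # references to the integral image (guarding the image border).
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total

img = np.arange(16, dtype=np.int64).reshape(4, 4)
ii = integral_image(img)
```

Here `rect_sum(ii, 1, 1, 2, 2)` gives the same value as `img[1:3, 1:3].sum()`, but in constant time regardless of the rectangle's area, which is the point of the representation.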


Chapter 3 Classifiers
Each model requires a classifier to determine whether given data are faces or non-faces. The classifier is, in general, some threshold applied to the data, usually via some sort of goodness-of-fit measure. The classifiers for the models in Chapter 2 are discussed in this chapter.

Recall that Sung and Poggio model their face likelihood with Gaussian clusters and anti-clusters in R^n (n, in their case, happens to be 283) [12]. In their implementation, six clusters and six anti-clusters are used. Obviously the number of clusters, and to a lesser extent the number of anti-clusters, will have a great effect on the receiver operating characteristic (ROC) curve. The numbers of clusters (i.e. the classifier architecture) were determined empirically: the detector was tested with a number of different architectures, and the "six and six" architecture provided the best results. The Gaussian clusters used are non-isotropic; that is, the axes of a given cluster are not of equal length. Sung and Poggio justify this under the belief "that the actual face distribution can be locally more elongated along certain vector space directions than others" [12]. This seems like a reasonable generalization, but it leaves us with the problem of choosing a suitable distance function. A natural choice for a model based on these non-isotropic clusters is the normalized Mahalanobis distance. The normalized Mahalanobis distance between an image under consideration x and the centre µ of a Gaussian cluster is

    Mn(x, µ) = (1/2) (n ln 2π + ln |Σ| + (x − µ)^T Σ^{-1} (x − µ)),    (3.1)

where Σ is the covariance matrix of the Gaussian cluster [12]. We can see that if the model contains a single Gaussian cluster, then thresholding at a fixed Mahalanobis distance from µ selects all vectors (images) which are within a fixed probability density in the model. Of several distance metrics tested for this model, a two-value combination yielded the best results. For a given vector, these two values are obtained for each cluster. The first, D1, is the Mahalanobis distance between the vector and the cluster centroid after both have been projected to the space of the cluster's 75 most significant eigenvectors. The second, D2, is the Euclidean distance between the vector and its projection to this 75-dimensional space, i.e. its out-of-subspace error. For each cluster, then, the vector has a two-value distance. These values are given a weighted sum and checked against a threshold to determine whether the vector is a face or not. Relative results for this particular classifier and its variants are discussed in Chapter 4.

In terms of preprocessing work, Sung and Poggio perform the standard operations: image resizing, illumination gradient correction, and histogram equalization. Further, they mask the 19 × 19 pixel images, removing some border and especially corner pixels from consideration. Osuna et al. perform identical preprocessing for their support vector machine detector [12, 7].

For the support vector machine detectors, the obvious desire is to have faces on one side of the selected hyperplane and non-faces on the other side. This is the ideal classifier for the model. After training, the system is very similar to that of Sung and Poggio [7]. The simplicity of the classifier's criterion for support vector machines makes the run-time computation of the methods in [7] and [9] extremely simple. Preprocessing of tested images must be performed, but in general, these two methods give impressive run-time computational savings (run-time complexity for these machines is proportional to the size of the support vector set [9]). As training is the crux of both support vector methods, the classifier is of relatively little interest in contrast to the training algorithms themselves.
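Equation (3.1) translates directly into code. The sketch below uses NumPy, with a log-determinant for numerical stability; the dimensions in the check are toy values, not the 283-pixel windows of [12]:

```python
import numpy as np

def normalized_mahalanobis(x, mu, sigma):
    # Eq. (3.1): Mn(x, mu) = 1/2 (n ln 2*pi + ln|Sigma|
    #                             + (x - mu)^T Sigma^{-1} (x - mu))
    n = len(x)
    d = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    return 0.5 * (n * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(sigma, d))
```

For a single cluster this quantity is exactly the negative log of the Gaussian density, so thresholding it selects the vectors inside a fixed-density contour, consistent with the observation above.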

Schneiderman and Kanade, for their classifier, use 17 statistical image attributes, some of which relate to only one sub-band, and some of which relate to several. Recall that the sub-bands represent different frequencies, orientations, and spaces. This means that sub-bands, when sampled to form an attribute, can interact in a number of different ways. The detector samples each of these 17 attributes over the object. Obviously some attributes will contribute to detecting a face more than others, e.g. the eyes and nose are more significant than the chin [11]. Their classifier thresholds a pattern’s likelihood ratio, i.e. a threshold λ is chosen such that faces are exactly those regions for which
      ∏_{x,y ∈ region} ∏_{k=1}^{17} P_k(pattern_k(x, y), x, y | object)
    ──────────────────────────────────────────────────────────────────────  >  λ,
    ∏_{x,y ∈ region} ∏_{k=1}^{17} P_k(pattern_k(x, y), x, y | non-object)


which is a very natural value to threshold. The run-time calculations needed for this scheme are extensive, so a heuristic coarse-to-fine strategy is used: first thresholding values for level 1 wavelet coefficients, then further thresholding values for level 1 and 2 coefficients for areas not rejected, then applying the final classifier to the remaining regions.
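In log space, the thresholded likelihood ratio above is just a sum of per-attribute, per-position log ratios, which is how such tests are usually computed. The probability tables below are toy values, not Schneiderman and Kanade's learned histograms:

```python
import math

def is_face(p_obj, p_non, log_lambda=0.0):
    # p_obj[k][(x, y)] plays the role of Pk(pattern_k(x, y), x, y | object),
    # and p_non the corresponding non-object probability.
    log_ratio = sum(math.log(p_obj[k][pos] / p_non[k][pos])
                    for k in p_obj for pos in p_obj[k])
    return log_ratio > log_lambda

# Toy tables for a single attribute sampled at two positions.
p_obj = {0: {(0, 0): 0.9, (0, 1): 0.8}}
p_non = {0: {(0, 0): 0.1, (0, 1): 0.5}}
```

Working in log space turns the large product over positions and attributes into a sum, avoiding underflow when many small probabilities are multiplied.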

Rowley et al. use arbitration between multiple neural networks to eliminate many of their false positives. However, it is first important to understand the classification criteria for a single neural network as implemented in [10]. Back-propagation with momentum is used as the networks' training algorithm, and the training is done iteratively. This results in networks that are self-programmed to classify faces versus non-faces [1]. This is the beautiful part of this detector.

What Rowley et al. do to reduce error after passing images through the neural networks is twofold. The first heuristic used is based on the observation that false positives have overlapping multiple detections less frequently than do true faces. The merging step of the classifier demands that true faces have a certain number of overlapping detections. These detections are projected over various image scales in the image pyramid and a weighted centroid is computed. The result is a single detection where there once were many (and, consequently, fewer false positives). The second step of the classifier involves arbitration between multiple networks. Because the networks are trained with random initial weights, there is nondeterminism among networks trained in the same manner. Several methods were tested for successful arbitration: heuristics involving logical operations, and a separate neural network, itself designed to arbitrate among several networks. All of these arbitration methods work well; extensive result tables are given in [10], and will be discussed in Chapter 4.

Preprocessing is explained thoroughly in [10]. Once an oval mask has been applied to the image, a linear best-fit function is calculated and subtracted from the image to correct lighting conditions. Histogram equalization is then performed; this step sets contrast and compensates for camera variations. This detector, like others, implements a coarse-to-fine formulation to improve performance. The combination of the detection and error prevention methods in [10] makes for an impressive detector, as will be outlined more clearly later on. It is no wonder that this paper is regarded as a standard against which new detectors are measured.
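The merging heuristic can be sketched as follows. The grouping rule (a fixed pixel radius) and the overlap threshold are our own simplifications for illustration, not the exact scheme of [10], which works across the image pyramid:

```python
def merge_detections(detections, min_overlap=2, radius=8):
    """Collapse nearby detections into centroids; drop isolated ones."""
    merged, used = [], [False] * len(detections)
    for i, (x, y) in enumerate(detections):
        if used[i]:
            continue
        group = []
        for j, (x2, y2) in enumerate(detections):
            if not used[j] and abs(x2 - x) <= radius and abs(y2 - y) <= radius:
                group.append((x2, y2))
                used[j] = True
        # Detections with too few overlaps are likely false positives.
        if len(group) >= min_overlap:
            cx = sum(p[0] for p in group) / len(group)
            cy = sum(p[1] for p in group) / len(group)
            merged.append((cx, cy))
    return merged

hits = [(10, 10), (12, 11), (100, 100)]
```

On the toy input, the two overlapping detections near (10, 10) collapse to their centroid, while the isolated detection at (100, 100) is discarded as a probable false positive.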

Papageorgiou et al. use a support vector machine, as in [7, 9], for their classifier, because such machines allow a small parameter count and minimize generalization error, a concern which arises in [12] in the context of model architecture (this is a global concern, assuming the absence of a complete data set). In fact, Papageorgiou et al. create a reduced coefficient set which is akin to the support vectors in [7], in the sense that a small subset of training data can be used to accurately represent the relationship between faces and non-faces. In this paper, however, the focus is not on the implementation (i.e. the classifier), but rather on the value of the overcomplete set of wavelet coefficients in representing complex object classes, specifically the classes of faces and pedestrians [8].

Viola and Jones use a classifier that, largely for the sake of computational efficiency, is based on an attentional cascade. The individual classifiers are built with a variant of the AdaBoost algorithm, which converts weak classifiers into a strong classifier via boosting. To be detected, an image must be accepted by each level of a series of basic classifiers, each more discriminating than the last. The computational advantage lies in the fact that the initial levels of the cascade can use very simple features for their classifiers, and can therefore reject the vast majority of locations in an image quickly. By the time the cascade levels become more meaningful, they are operating on only a small proportion of the initial image locations. In their implementation, Viola and Jones use a 10-layer cascade, each layer of which contains 20 rectangular features. They compare this against their initial, less efficient detector, which uses 200 rectangular features. One effect of the cascade strategy is that each classifier must have an extremely high detection rate, but can get away with false positive rates that would in other circumstances be thought abysmal. The reason for this is not hard to see: the false positive rate F of the entire K-layer cascade is

    F = ∏_{i=1}^{K} f_i,

where f_i is the false positive rate of the i-th classifier. Similarly, the cascade's detection rate is

    D = ∏_{i=1}^{K} d_i,


so the unusual restraints on detection and rejection rates are obviously justified [13].
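A quick numerical example makes the point; the per-level figures here are illustrative choices of our own, not numbers reported in [13]:

```python
# Suppose each of K = 10 cascade levels detects 99% of faces but also
# passes 30% of non-faces. The overall rates are products over levels.
K = 10
f_i, d_i = 0.30, 0.99

F = f_i ** K  # overall false positive rate
D = d_i ** K  # overall detection rate

# A 30% per-level false positive rate collapses to roughly 6 in a
# million overall, while the detection rate stays above 90%.
```

This is why each level may tolerate a false positive rate that would be abysmal for a standalone classifier, while its detection rate must be kept extremely high.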

More detailed analysis of sequential testing is carried out by Fleuret and Geman [3]. Their model is a statistical approach, like Schneiderman and Kanade's in [11], but their focus is on theoretical work regarding cascades.


Chapter 4 Results and Conclusions



4.1 Results

Viola and Jones easily present the best results in terms of computation time. In terms of error rates, they provide impressive ROC curves and numerical figures that rival those of Rowley et al. [13]. Rowley et al. provide the most extensive test data of all the papers addressed, and reach very impressive results through their merging and arbitration methods [10]. With their six-cluster and six-anti-cluster architecture, Sung and Poggio reach fairly good results, while numerically, Schneiderman and Kanade boast what seem to be the best results [11]. Osuna et al. attain slightly better rates than Sung and Poggio [7], and Romdhani et al. manage to vastly improve the speed of a support vector machine with only marginal loss in classification accuracy [9].

An interesting point is that while Schneiderman and Kanade only train their detector on frontal and profile face poses, the structure of faces aids the detection of partially averted faces: in some of the detected profiles in [11], this effect is noticeable. The reason for it can be seen if only the selected image window (a pentagon) is viewed. In some cases where the face is almost front-facing, half of the face, when viewed alone, looks very much like a profile. In this light, it doesn't seem that more than three poses are necessary.


4.2 Conclusions and Future Work

Despite the broad range of general approaches, several aspects seem to be particularly effective in face (and, in general, object) detection schemes. Sung's bootstrapping method for training detectors is very effective, and the reasons are clear: it is important to define as sharply as possible the border region between faces and non-faces when classifying a window. Also evident is the fact that, due to the nature of the problem, classifier cascades are necessary in order to attain computationally cheap detection. Since the vast majority of pixels in a given image will represent non-face windows, it is very important that these pixels be rejected with as little computation as possible. Viola and Jones seem to have the best application of this principle, aided by the fact that the wavelet representation of their classifiers is flexible enough to perform meaningful rejection at very little cost. In this sense, the integral image is absolutely key to a fast application of Haar wavelets. While Rowley, Baluja, and Kanade's neural network approach performs well, and is beautiful in the sense that neural networks give such a conceptually natural simulation of human detection, the speed of such a scheme must be improved dramatically if it is to compete with the wavelet formulation. It seems unlikely that a detection scheme that does not use an easily decomposable model will move forward in the same way that wavelet applications seem able to. One obvious avenue for future research is model combination, since weak classifiers can be used to quickly reject large portions of an image with an effective false negative rate of 0. After such rejection, a more powerful classifier could be used to scrutinize the remaining areas.


It would be interesting to see the improvement in performance upon applying the integral image formulation to other Haar wavelet-based detectors, such as those of Papageorgiou et al. [8] and Schneiderman and Kanade [11]. In the same vein, it would be interesting to see a more powerful cascading scheme applied to the methods of Viola and Jones [13]. The work of Viola and Jones opens the way for further practical applications of face detection. One immediate such application would be buttressing face tracking methods: running at 15 frames per second, a detector could add robust backup to a tracker. Furthermore, the integral image is a breakthrough in wavelet classification that can easily be seen to generalize well to other object classes.


References

[1] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.

[2] Corinna Cortes and Vladimir Vapnik, Support-vector networks, Machine Learning 20 (1995), no. 3, 273–297.

[3] François Fleuret and Donald Geman, Coarse-to-fine face detection, International Journal of Computer Vision 41 (2001), no. 1/2, 85–107.

[4] Yoav Freund and Robert E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, European Conference on Computational Learning Theory, 1995, pp. 23–37.

[5] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, San Diego, 1998.


[6] Yoshio Matsumoto and Alexander Zelinsky, Real-time stereo face tracking system for visual human interfaces, Proc. Int'l Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time (1999), 77–82.

[7] E. Osuna, R. Freund, and F. Girosi, Training support vector machines: An application to face detection, 1997.

[8] C. P. Papageorgiou, M. Oren, and T. Poggio, A general framework for object detection, Proceedings of International Conference on Computer Vision (1998), 555–562.

[9] S. Romdhani, P. Torr, B. Schölkopf, and A. Blake, Computationally efficient face detection, Proc. Int. Conf. on Computer Vision (2001), II:695–700.

[10] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 1, 23–38.

[11] H. Schneiderman and T. Kanade, A statistical approach to 3D object detection applied to faces and cars, IEEE Conference on Computer Vision and Pattern Recognition, to appear (2000).

[12] Kah Kay Sung and Tomaso Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 1, 39–51.

[13] Paul Viola and Michael Jones, Robust real-time object detection, International Journal of Computer Vision, to appear (2002).

