A Survey of Methods for Face Detection

Andrew King
992 550 627

March 3, 2003

Contents

1 Introduction
  1.1 The Problem of Face Detection
  1.2 Current Work
2 Mathematical Models and Approaches
3 Classifiers
4 Results and Conclusions
  4.1 Results
  4.2 Conclusions and Future Work

Chapter 1: Introduction

1.1 The Problem of Face Detection

In this paper we focus specifically on the problem of face detection in still images. The most straightforward variety of this problem is the detection of a single face at a known scale and orientation. Even this, it turns out, is a nontrivial problem.

The most immediate application that comes to mind for face detection is as the first step in an automated face recognizer [12]. Thought of in this sense, face detection can be applied to systems for things such as automated surveillance and human traffic census. In and of itself, however, face detection is a fascinating problem. Efficient face detection at frame rate is an impressive goal; it is an analogue of face tracking (on which the literature, owing to the subject's obvious application to human-computer interaction [6], is extensive) that requires no knowledge of previous frames. As such, it is a more challenging problem, particularly since many face tracking approaches are designed specifically for human-computer interaction schemes. Furthermore, fast face detection has an apparent application to practical face tracking, in that it can be used to initialize tracking, e.g. when an interaction subject enters the frame or appears from an occluded position.

Another reason that face detection is an important research problem is its role as a challenging case of a more general problem, i.e.
object detection, for which the applications, once not restricted to faces, are manifold. Face detection is a beautiful paradigm for the general problem for several reasons. A face is naturally recognizable to a human being despite its many points of variation (e.g. skin tone, hairstyle, facial hair, glasses). A human being is able to detect a face in the context of an entire person, but we want a simple, context-free approach to detection. Another source of difficulty is the complex three-dimensional shape of a face, and the resulting difference in the appearance of a given face under different lighting conditions, even in an otherwise identical environment [12]. There may be object detection methods that work well for more easily identifiable objects such as blocks, but a method that works well for faces can generally be trusted with the task of detection for a wide range of complex object structures.

The generality of detecting faces in a single greyscale image is a major challenge. We have no standard method for determining illumination data, scene structure, or context of sub-images without performing extensive operations on the image before even considering faces. Hence a successful strategy for face detection must be able to dodge environmental tricks and traps, but cannot ever be expected to perform perfectly.

1.2 Current Work

There are various solutions to this problem, most of which deal with faces at arbitrary scales (at least within a reasonable range), though most assume an upright face (the method for rotated faces is an obvious exhaustive analogue of any detection method for upright faces). Most of the methods discussed in this paper are concerned only with detecting forward-facing faces. Of these methods, only Schneiderman and Kanade's statistical method considers profile detection [11].
However, their method considers only three face orientations, and practically speaking, each orientation is treated as a different object. The effects of this approach on detecting faces at various orientations are discussed in Chapter 4.

Schneiderman and Kanade apply statistical likelihood tests, using feature output histograms to create their detector scheme in [11]. Rowley et al. use neural network-based filters in [10], obtaining good early results in what has apparently become a benchmark of sorts for face detection schemes. In another early work, Papageorgiou et al. propose a general object detection scheme which uses a wavelet representation and statistical learning techniques [8]. Osuna et al. apply Vapnik's support vector machine technique to face detection in [7], and Romdhani et al. improve on that work by creating reduced training vector sets for their classifier in [9]. Fleuret and Geman take a coarse-to-fine approach to face detection, focusing on minimizing computation [3]. In perhaps the most impressive paper, Viola and Jones use the concept of an "integral image", along with a rectangular feature representation and a boosting algorithm as the learning method, to detect faces at 15 frames per second [13]. This represents an improvement in computation time of an order of magnitude over previous implementations of face detection algorithms.

In Chapter 2, we describe the various mathematical models used for these methods. In Chapter 3, we specifically discuss the classifier for each approach. In Chapter 4, the results of these approaches are analyzed and compared.

Chapter 2: Mathematical Models and Approaches

Every method addressed in this paper uses a learning algorithm on a training set to begin the detection process. The training stage is extensive for some methods, and relatively small for others.
This common training gives us an advantage when we consider the problem for the first time: we can assume that we have available to us data about a general face, and we can infer certain information regarding faces in general.

The most intuitive solution to the problem of modeling faces is a geometric formulation which allows the detector to project a tested image onto a learned subspace and determine whether or not it is close to that subspace. The natural thing to do with a training set, then, is to compute a manifold in R^n (from training images containing n pixels) from the most significant components of the general face. This is a very basic scheme, and is computationally burdensome. Sung and Poggio use an adaptation of this scheme to create a detector using Gaussian clusters in R^n. The basic idea of their detection model is to use a multiple-mean Gaussian mixture model for both objects (the general case, as opposed to faces) and non-objects. The space with low object probability is the non-object space, so it is more accurate to say that among the Gaussian object clusters, negatively weighted clusters are placed so as to improve the definition of the object space. In terms of the detection problem, these negatively weighted clusters are centred at images which can be mistaken for faces, but are not faces. Their implementation uses six clusters each for faces and non-faces. Their learning method suits this end: a large focus of the detector is on discerning between faces and face-like non-faces. They use a "bootstrapping" strategy for creating a non-face training set consisting of only the most meaningful non-faces (as opposed to a general non-face training set, which would contain many images that are so obviously not faces that they carry little weight in the detector). Sung and Poggio construct their face set in a very straightforward manner (enlarging their data set with rotations and reflections).
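This bootstrapping strategy can be sketched as a simple retraining loop. The `train`/`detect` interface below is a hypothetical stand-in for whatever learning machinery is actually used, and the toy midpoint-threshold classifier in the usage example is purely illustrative:

```python
def bootstrap_nonfaces(train, detect, faces, scenery, rounds=10):
    """Sung-and-Poggio-style bootstrapping: start from a small non-face
    set, retrain, and fold each round's false positives back in.

    `train(faces, nonfaces)` returns a classifier and `detect(clf, w)`
    returns True if window w is classified as a face; both names are
    hypothetical stand-ins.  `scenery` holds windows known to contain
    no faces, so any detection among them is a false positive.
    """
    nonfaces = scenery[:1]                      # seed non-face set
    for _ in range(rounds):
        clf = train(faces, nonfaces)
        false_pos = [w for w in scenery if detect(clf, w)]
        if not false_pos:                       # nothing left to learn from
            break
        nonfaces = nonfaces + false_pos         # keep the "meaningful" non-faces
    return train(faces, nonfaces)

# Toy illustration: windows are mean brightnesses, "faces" are bright,
# and the classifier is a midpoint threshold between the class means.
def train(faces, nonfaces):
    return (sum(faces) / len(faces) + sum(nonfaces) / len(nonfaces)) / 2

def detect(threshold, window):
    return window > threshold

clf = bootstrap_nonfaces(train, detect, faces=[200.0],
                         scenery=[50.0, 120.0, 180.0])
```

Note how the near-miss window (brightness 180) is repeatedly fed back into the non-face set until the threshold separates it from the faces, while the obvious non-faces contribute little.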
The bootstrapping scheme for non-face generation begins with a small set of non-face samples. The detector is then run, and false positives are added to the non-face set. This method can be iterated until a satisfactory data set has been reached. This makes for a very time-consuming construction, but the resulting set, used with their negatively weighted cluster scheme for non-faces, is well suited to their demands. If necessary, the face data set can be bootstrapped in a similar manner. Sung and Poggio claim that their system can be made arbitrarily robust in this way: "Both false positive and false negative detection errors can be easily corrected by further training with the wrongly classified patterns" [12]. This is in reference to error rates on training sets, and does not necessarily suggest that, given time, the scheme can be improved arbitrarily for unseen data.

A different approach to separating faces and non-faces in image space is taken by Osuna et al., and followed up by Romdhani et al., in [7] and [9] respectively. Both are based on support vector machines, a classification method developed by V. Vapnik and others at AT&T Bell Labs, notably presented in [2]. The key to the model for a support vector machine is the choice of a manifold that separates the face set from the non-face set. In [7], a hyperplane is chosen, specifically the hyperplane which maximizes the minimum distance to the training examples on either side. A support vector set is, roughly speaking, a set of vectors (images) which are close to this hyperplane, and which can therefore be used on their own to reacquire the hyperplane. In [9], an attempt to improve the performance of a similar system is made via reduced set vector machines. The methods in both papers rely heavily on quadratic programming, and exploit properties of their models' kernel functions. This is more closely related to their classifiers, and will therefore be left mostly to Chapter 3. In terms of training sets, Osuna et al.
exploit their model and the fact that most vectors will be ignored, or at least meaningless, in their quadratic programming formulation. Because the hardware requirements for training a support vector machine in the natural way are prohibitive, training data must be chosen in a nontrivial manner. First, a set of optimality conditions, specifically the Kuhn-Tucker conditions, is considered. Only those vectors which are relevant to the training, i.e. support vectors, are used. Memory requirements are quadratic in the size of this working vector set, so minimizing it is key. The proposed solution is to decompose the problem into smaller sub-problems, a standard approach when such a decomposition is possible. Romdhani et al. [9] work further on reducing this vector set in order to improve performance. In [9], it is argued that the support vector set in detectors like that of [7] forms a proportion of the entire training set that stands to be reduced significantly. There has been a fair amount of research on improving the performance of support vector machines since their development less than 10 years ago. Romdhani et al. apply one such method to improve the performance of an SVM-based face detector. Roughly speaking, given a vector Φ in the model's feature space (expressible, thanks to the model, as a sum over the support vector set), there is a good approximation Φ' to Φ which is expressible as a sum over a reduced vector set much smaller than the support vector set. Given a reduced set, the problem remains to minimize the norm of Φ − Φ'. This can be done in terms of the model's associated kernels.

Akin to Sung and Poggio's bootstrap approach is the retraining performed in [9]. The positive results of such retraining are demonstrated in the context of a neural network-based detector in [10].

Schneiderman and Kanade propose a statistical model in [11].
To apply statistical methods to the problem, they represent visual attributes with wavelet coefficients. This representation suits their needs because, unlike with other representations, an image can be perfectly reconstructed from its wavelet transform with a coefficient set the same size as the image itself. Specifically, their method uses three filter levels, giving 10 image sub-bands. This representation allows them to jointly model image data which is localized in space, frequency, and orientation. From this information, then, they are able to construct a histogram-based face detector. This method requires that initial histograms be constructed. Schneiderman and Kanade's approach to this is similar to that of Sung and Poggio, and is in fact, loosely speaking, a statistical analogue of the bootstrapping method described previously [11, 12]. Rather than giving every training example acquired through bootstrapping equal weight, they use an approach for faces that explicitly minimizes error on the training data. This is done using AdaBoost, an algorithm for converting a weak learning method into one with high accuracy [4] (Viola and Jones' detector uses a boosting algorithm based on AdaBoost). In their training method for faces, Schneiderman and Kanade begin with a bootstrapping basis which is evenly weighted, then give more weight to training images which are identified as false positives. As with Sung and Poggio's bootstrapping, this training can be iterated to improve robustness.

In [10], Rowley et al. present a face detection system based on artificial neural networks. This paper seems to have become an early standard in face detection, against which many researchers compare their results. Of course, this may be due in part to the fact that Rowley provided several authors with test data [9, 13].
The neural network is the most novel part of the paper, as the general method for detection is fairly standard in terms of scanning over every pixel at various scales. The neural network contains three types of hidden units: one set of units for quadrants of the 20 × 20 image, one set for quadrants of the quadrants, and one set for looking at overlapping horizontal strips of the image. The idea is clear: certain hidden units will help detect certain facial characteristics. For example, since an oval binary mask is applied to the image in preprocessing, dark corner pixels will likely be removed in the case of a face. In this situation the quadrant hidden units are likely to sense the presence of eyes in the upper two quadrants. In order to train the neural network on a face data set, a large number of face images were used, in which feature points were labeled manually [10]. The locations of these feature points are averaged over the training set, then warped to coincide with predetermined points. Each face training image can then be aligned to the mean as the optimal solution to an overdetermined system. Iterating this method results in a suitably warped data set. This set is artificially enlarged, as in other methods, through rescaling, rotation, reflection, and translation. The result of this enlargement is that the neural network, as a filter, becomes invariant to these transformations within a range. Sung's bootstrapping method is used to determine a non-face data set. Rowley et al. provide interesting classification methods, which will be discussed in Chapter 3.

Papageorgiou, Oren, and Poggio, in what can be considered a conceptual precursor to the work of Viola and Jones, use Haar wavelets to create an overcomplete representation of the face class [8]. The focus of their paper is on the development of their wavelet model. They provide a simple application of this model (for objects in general) to face and pedestrian detection.
They use an extension of two-dimensional Haar wavelets called the quadruple density transform to create their redundant representative set. This initial set consists of 1734 coefficients for vertical, horizontal, and diagonal wavelets at scales of 2 × 2 pixels and 4 × 4 pixels. To avoid prohibitive computational costs in training the classifier, this set is reduced to a set of 37 significant coefficients through statistical analysis. Again, training is done using bootstrapping methods, as in [12, 10, 7, 9, 11]. In this case, Papageorgiou et al. train their system using a variety of penalties for misclassification [8]. Their results show a marginal improvement when the penalty for missed positives is an order of magnitude greater than the penalty for false detections. In practice, however, there seems to be very little difference between the system under the various training schemes.

Following on the work of Papageorgiou et al., Viola and Jones present a much faster detector than any of their contemporaries [13]. The performance can be attributed to the use of an attentional cascade, using low-feature-number detectors based on a natural extension of Haar wavelets [5]. The cascade itself has more to do with their classifier than with their model, so it will be discussed in the next chapter. Each detector in their cascade fits objects to simple rectangular masks, basically speaking. In order to avoid making many computations when moving through their cascade, Viola and Jones introduce a new image representation which they call an integral image, which is just what it sounds like. For each pixel in the original image, there is exactly one pixel in the integral image, whose value is the sum of the original image values above and to the left. The integral image can be computed quickly, and drastically improves computation costs under the rectangular feature model. As explained in [13], the integral image allows rectangular sums to be computed in four array references.
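A minimal sketch of this representation (using NumPy; the zero-padded leading row and column are a common implementation convenience, not something specified in [13]):

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img over all rows < y and columns < x;
    the extra leading row/column of zeros removes boundary special cases."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum over img[y:y+h, x:x+w] using exactly four array references."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.arange(16).reshape(4, 4)      # toy 4x4 "image"
ii = integral_image(img)
total = rect_sum(ii, 1, 1, 2, 2)       # equals img[1:3, 1:3].sum()
```

Once `ii` is built in a single pass, every rectangular feature costs the same four lookups regardless of the rectangle's size.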
This is easy to see when the model is considered; under the conventional representation of an image, the computation time needed would be proportional to the size of the rectangle. At the highest levels of the attentional cascade, where most of the comparisons are made, the rectangular features are very large. As the computation progresses down the cascade, the features get smaller and smaller, but fewer locations are tested for faces. Thus the advantage of the integral image representation is clear. The remaining difficulty lies in creating and training the attentional cascade, which also contributes heavily to the detector's efficiency.

Training the attentional cascade is similar to the other training methods seen, obviously adapted to suit the situation. Because of the cascade's nature, a very high detection rate is needed at each level, but the false detection rate at each level can also be very high, as the overall figures decrease exponentially with the per-level figures, based on the depth of the cascade [13]. Each level of the cascade needs to reject examples that are closer to faces than those rejected by the previous level (as each level inherits the previous level's accepted images). Viola and Jones therefore pass a large number of non-face examples to train the first cascade level, then pass those detected by the first level on to the next level, and so on. For face training, each level is trained on the same face set. This method is similar in spirit to the more basic bootstrapping methods adapted from Sung's method, but is geared toward the progressive nature of the attentional cascade.

Chapter 3: Classifiers

Each model requires a classifier to determine whether given data are faces or non-faces. The classifier is, in general, some threshold applied to the data, usually via some sort of goodness-of-fit measure. The classifiers for the models in Chapter 2 are discussed in this chapter.
Recall that Sung and Poggio model their face likelihood with Gaussian clusters and anti-clusters in R^n (n, in their case, happens to be 283) [12]. In their implementation, six clusters and six anti-clusters are used. Obviously the number of clusters, and to a lesser extent the number of anti-clusters, has a great effect on the receiver operating characteristic (ROC) curve. The numbers of clusters (i.e. the classifier architecture) were determined empirically: the detector was tested with a number of different architectures, and the "six and six" architecture provided the best results. The Gaussian clusters used are non-isotropic; that is, the axes of each cluster are not of equal length. Sung and Poggio justify this under the belief "that the actual face distribution can be locally more elongated along certain vector space directions than others" [12]. This seems like a reasonable generalization, but it leaves us with the problem of choosing a suitable distance function. A natural choice for a model based on these non-isotropic clusters is the normalized Mahalanobis distance. The normalized Mahalanobis distance between an image under consideration x and the centre µ of a Gaussian cluster is

    M_n(x, µ) = (1/2) (n ln 2π + ln |Σ| + (x − µ)^T Σ^{−1} (x − µ)),    (3.1)

where Σ is the covariance matrix of the Gaussian cluster [12]. We can see that if the model contains a single Gaussian cluster, then thresholding at a fixed Mahalanobis distance from µ selects all vectors (images) which are within a fixed probability density in the model. Of several distance metrics tested for this model, a two-value combination yielded the best results. For a given vector, these two values are obtained for each cluster. The first, D1, is the Mahalanobis distance between the vector and the cluster centroid after both have been projected to the space of the cluster's 75 most significant eigenvectors.
The second, D2, is the Euclidean distance between the vector and its projection to this 75-dimensional space, i.e. its out-of-subspace error. For each cluster, then, the vector has a two-value distance. These values are combined in a weighted sum and checked against a threshold to determine whether the vector is a face or not. Relative results for this classifier and its variants are discussed in Chapter 4.

In terms of preprocessing, Sung and Poggio perform the standard operations: image resizing, illumination gradient correction, and histogram equalization. Further, they mask the 19 × 19 pixel images, removing some border and especially corner pixels from consideration. Osuna et al. perform identical preprocessing for their support vector machine detector [12, 7].

For the support vector machine detectors, the obvious desire is to have faces on one side of the selected hyperplane and non-faces on the other. This is the ideal classifier for the model. After training, the system is very similar to that of Sung and Poggio [7]. The simplicity of the classifier's criterion makes the run-time computation of the methods in [7] and [9] extremely simple. Preprocessing of tested images must still be performed, but in general, these two methods give impressive run-time computational savings (run-time complexity for these machines is proportional to the size of the support vector set [9]). As training is the crux of both support vector methods, the classifier is of relatively little interest in contrast to the training algorithms themselves.

Schneiderman and Kanade, for their classifier, use 17 statistical image attributes, some of which relate to only one sub-band, and some of which relate to several. Recall that the sub-bands represent different frequencies, orientations, and spaces. This means that sub-bands, when sampled to form an attribute, can interact in a number of different ways.
The detector samples each of these 17 attributes over the object. Obviously some attributes will contribute more than others to detecting a face; e.g. the eyes and nose are more significant than the chin [11]. Their classifier thresholds a pattern's likelihood ratio, i.e. a threshold λ is chosen such that faces are exactly those regions for which

    ∏_{x,y ∈ region} ∏_{k=1}^{17} P_k(pattern_k(x, y), x, y | object) / ∏_{x,y ∈ region} ∏_{k=1}^{17} P_k(pattern_k(x, y), x, y | non-object) > λ,    (3.2)

which is a very natural value to threshold. The run-time calculations needed for this scheme are extensive, so a heuristic coarse-to-fine strategy is used: first thresholding values for level 1 wavelet coefficients, then further thresholding values for level 1 and 2 coefficients in areas not rejected, then applying the final classifier to the remaining regions.

Rowley et al. use arbitration between multiple neural networks to eliminate many of their false positives. However, it is first important to understand the classification criteria for a single neural network as implemented in [10]. Back-propagation with momentum is used as the networks' training algorithm, and the training is done iteratively. This results in networks that are self-programmed to classify faces versus non-faces [1]. This is the beautiful part of this detector. What Rowley et al. do to reduce error after passing images through neural networks is twofold. The first heuristic is based on the observation that false positives have overlapping multiple detections less frequently than do true faces. The merging step of the classifier demands that true faces have a certain number of overlapping detections. These detections are projected over various image scales in the image pyramid and a weighted centroid is computed. The result is a single detection where there once were many (and, consequently, fewer false positives). The second step of the classifier involves arbitration between multiple networks.
Because the networks are trained with random initial weights, there is nondeterminism among networks trained in the same manner. Several methods were tested for arbitration: heuristics involving logical operations, and a separate neural network, itself trained to arbitrate between several networks. All of these arbitration methods work well; extensive result tables are given in [10], and will be discussed in Chapter 4.

Preprocessing is explained thoroughly in [10]. Once an oval mask has been applied to the image, a linear best-fit function is calculated and subtracted from the image to correct lighting conditions. Histogram equalization is then performed; this step sets contrast and compensates for camera variation. This detector, like others, uses a coarse-to-fine formulation to improve performance. The combination of the detection and error prevention methods in [10] makes for an impressive detector, as will be outlined more clearly later on. It is no wonder that this paper is regarded as a standard against which new detectors are measured.

Papageorgiou et al. use a support vector machine for their classifier, as in [7, 9], because such machines allow a small parameter count and minimize generalization error, a concern which arises in [12] in the context of model architecture (this is a global concern, assuming the absence of a complete data set). In fact, Papageorgiou et al. create a reduced coefficient set which is akin to the support vectors in [7], in the sense that a small subset of the training data can be used to accurately represent the relationship between faces and non-faces. In this paper, however, the focus is not on the implementation (i.e. the classifier), but rather on the value of the overcomplete set of wavelet coefficients in representing complex object classes, specifically the classes of faces and pedestrians [8].
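At run time, the support vector machine classifiers of [7, 9] and of Papageorgiou et al. all reduce to the sign of a kernel expansion over the support vector set, which is why their cost scales with that set's size. A sketch of this decision rule; the Gaussian kernel and all parameter values here are illustrative choices, not those of the cited papers:

```python
import math

def svm_decide(x, support_vecs, alphas, labels, b, kernel):
    """Run-time SVM rule: sign of a kernel expansion over the support
    vectors, so each classification costs one kernel evaluation per
    support vector (hence the value of a reduced vector set)."""
    score = sum(a * y * kernel(s, x)
                for a, y, s in zip(alphas, labels, support_vecs)) + b
    return 1 if score > 0 else -1

def rbf(u, v, gamma=0.5):
    """Gaussian (RBF) kernel on 1-D points; gamma is an arbitrary choice."""
    return math.exp(-gamma * (u - v) ** 2)

# Two 1-D "training images" acting as support vectors: 0.0 (non-face, -1)
# and 2.0 (face, +1), with equal weights and zero bias.
sv, alphas, labels, b = [0.0, 2.0], [1.0, 1.0], [-1, 1], 0.0
label = svm_decide(1.9, sv, alphas, labels, b, rbf)
```

In a real detector x would be a preprocessed image window rather than a scalar, but the shape of the computation is the same.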
Viola and Jones use a classifier that, largely for the sake of computational efficiency, is based on an attentional cascade. The individual weak classifiers are combined using a variant of the AdaBoost algorithm, which converts weak classifiers into a strong classifier via boosting. To be detected, an image must be accepted by each level of a series of basic classifiers, each more discriminating than the last. The computational advantage lies in the fact that the initial levels of the cascade can use very simple features for their classifiers, and can therefore reject the vast majority of locations in an image quickly. By the time the cascade levels become more meaningful, they are operating on only a small proportion of the initial image locations. In their implementation, Viola and Jones use a 10-layer cascade, each layer of which contains 20 rectangular features. They compare this against their initial, less efficient detector, which uses 200 rectangular features. One effect of the cascade strategy is that each classifier must have an extremely high detection rate, but can get away with false positive rates that would in other circumstances be thought abysmal. The reason for this is not hard to see: the false positive rate F of the entire K-layer cascade is

    F = ∏_{i=1}^{K} f_i,    (3.3)

where f_i is the false positive rate of the ith classifier. Similarly, the cascade's detection rate is

    D = ∏_{i=1}^{K} d_i,    (3.4)

so the unusual constraints on detection and rejection rates are clearly justified [13]. More detailed analysis and research regarding sequential testing is done by Fleuret and Geman [3]. Their model is a statistical approach, like Schneiderman and Kanade's in [11], but their focus is on theoretical work regarding cascades.

Chapter 4: Results and Conclusions

4.1 Results

Viola and Jones easily present the best results in terms of computation time. In terms of error rates, they provide impressive ROC curves and numerical figures that rival those of Rowley et al. [13]. Rowley et al.
provide the most extensive test data of all the papers addressed, and reach very impressive results through their merging and arbitration methods [10]. With their six-cluster and six-anti-cluster architecture, Sung and Poggio reach fairly good results, while Schneiderman and Kanade boast what seem to be the best numerical results [11]. Osuna et al. attain slightly better rates than Sung and Poggio [7], and Romdhani et al. manage to vastly improve the speed of a support vector machine with only marginal loss in classification accuracy [9].

An interesting point is that while Schneiderman and Kanade only train their detector on front and profile face poses, the structure of faces aids the detection of partially averted faces: in some of the detected profiles in [11], this effect is noticeable. The reason for it can be seen if only the selected image window (a pentagon) is viewed. In some cases where the face is almost front-facing, half of the face, when viewed alone, looks very much like a profile. In this light, it doesn't seem that more than three poses are necessary.

4.2 Conclusions and Future Work

Despite the broad range of general approaches, several aspects seem to be particularly effective in face (and, in general, object) detection schemes. Sung's bootstrapping method for training detectors is very effective, and the reasons are clear: when classifying a window, it is important to define as well as possible the border regions between faces and non-faces. Also evident is the fact that, due to the nature of the problem, classifier cascades are necessary in order to attain computationally cheap detection. Since the vast majority of pixels in a given image will represent non-face windows, it is very important that these pixels be rejected with as little computation as possible.
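A quick back-of-the-envelope calculation using the cascade relations F = f_1 ··· f_K and D = d_1 ··· d_K from Chapter 3 makes this trade-off concrete; the per-level rates here are invented for illustration, not Viola and Jones' actual figures:

```python
# Ten cascade levels, each detecting 99% of faces while passing a
# seemingly abysmal 30% of non-face windows (hypothetical rates).
K, d_i, f_i = 10, 0.99, 0.30

D = d_i ** K    # overall detection rate of the full cascade
F = f_i ** K    # overall false positive rate of the full cascade

# D stays around 0.90, while F collapses to roughly 6 windows per
# million: high per-level false positive rates are tolerable because
# the rates multiply across levels.
```

This is exactly why each level can be trained for a near-perfect detection rate while only loosely rejecting non-faces.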
Viola and Jones seem to have the best application of this principle, aided by the fact that their wavelet representation of their classifiers is flexible enough to perform meaningful rejection at very little cost. In this sense, the integral image is absolutely key to a fast application of Haar wavelets. While the neural network approach of Rowley, Baluja, and Kanade performs well, and is beautiful in the sense that neural networks give such a conceptually natural simulation of human detection, the speed of such a scheme must be improved dramatically if it is to compete with the wavelet formulation. It seems unlikely that a detection scheme that does not use an easily decomposable model will move forward in the same way that wavelet applications seem able to.

One obvious avenue for future research is model combination, since weak classifiers can be used to quickly reject large portions of an image with an effective false negative rate of 0. After such rejection, a more powerful classifier could be used to scrutinize the remaining areas. It would be interesting to see the improvement in performance upon applying the integral image formulation to other Haar wavelet-based detectors, such as those of Papageorgiou et al. [8] and Schneiderman and Kanade [11]. In the same vein, it would be interesting to see a more powerful cascading scheme applied to the method of Viola and Jones [13].

The work of Viola and Jones opens the way for further practical applications of face detection. One immediate such application would be buttressing face tracking methods: running at 15 frames per second, a detector could add robust backup to a tracker. Furthermore, the integral image is a breakthrough in wavelet classification that can easily be seen to generalize well to other object classes.

Bibliography

[1] C.M. Bishop, Neural networks for pattern recognition, Oxford University Press, Oxford, 1995.
[2] Corinna Cortes and Vladimir Vapnik, Support-vector networks, Machine Learning 20 (1995), no. 3, 273–297.

[3] François Fleuret and Donald Geman, Coarse-to-fine face detection, International Journal of Computer Vision 41 (2001), no. 1/2, 85–107.

[4] Yoav Freund and Robert E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, European Conference on Computational Learning Theory, 1995, pp. 23–37.

[5] S. Mallat, A wavelet tour of signal processing, Academic Press, San Diego, 1998.

[6] Yoshio Matsumoto and Alexander Zelinsky, Real-time stereo face tracking system for visual human interfaces, Proc. Int'l Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time (1999), 77–82.

[7] E. Osuna, R. Freund, and F. Girosi, Training support vector machines: An application to face detection, 1997.

[8] C.P. Papageorgiou, M. Oren, and T. Poggio, A general framework for object detection, Proceedings of the International Conference on Computer Vision (1998), 555–562.

[9] S. Romdhani, P. Torr, B. Schölkopf, and A. Blake, Computationally efficient face detection, Proc. Int. Conf. on Computer Vision (2001), II:695–700.

[10] Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 1, 23–38.

[11] H. Schneiderman and T. Kanade, A statistical approach to 3D object detection applied to faces and cars, IEEE Conference on Computer Vision and Pattern Recognition (2000), to appear.

[12] Kah Kay Sung and Tomaso Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998), no. 1, 39–51.

[13] Paul Viola and Michael Jones, Robust real-time object detection, International Journal of Computer Vision (2002), to appear.