
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, November 1999, pp. 1241-1247

Tracking Human Motion in Structured Environments Using a Distributed-Camera System

Q. Cai and J.K. Aggarwal, Fellow, IEEE

Q. Cai is with the Consulting Group, RealNetworks, Inc., 2601 Elliott Ave., Seattle, WA 98121. J.K. Aggarwal is with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712-1084.

Abstract: This paper presents a comprehensive framework for tracking coarse human models from sequences of synchronized monocular grayscale images in multiple camera coordinates. It demonstrates the feasibility of an end-to-end person tracking system using a unique combination of motion analysis on 3D geometry in different camera coordinates and other existing techniques in motion detection, segmentation, and pattern recognition. The system starts with tracking from a single camera view. When the system predicts that the active camera will no longer have a good view of the subject of interest, tracking is switched to another camera that provides a better view and requires the least switching to continue tracking. The nonrigidity of the human body is addressed by matching points of the middle line of the human image, spatially and temporally, using Bayesian classification schemes. Multivariate normal distributions are employed to model the class-conditional densities of the features used for tracking, such as location, intensity, and geometric features. Limited degrees of occlusion are tolerated within the system. Experimental results using a prototype system are presented and the performance of the algorithm is evaluated to demonstrate its feasibility for real-time applications.

Index Terms: Tracking, human modeling, motion estimation, multiple perspectives, Bayesian classification, end-to-end vision systems.

1 INTRODUCTION

Tracking human motion is of interest in numerous applications such as surveillance, analysis of athletic performance, and content-based management of digital image databases. Recently, growing interest has concentrated on tracking humans using distributed monocular camera systems to extend the limited viewing angle of a single fixed camera [1], [2], [3]. In such a setup, the cameras are arranged to cover a monitored area with overlapping fields of view so that tracking can be switched smoothly among cameras. We present a comprehensive framework for automatically tracking coarse human models across multiple camera coordinates and demonstrate the feasibility of an end-to-end person tracking system using a unique combination of motion analysis on 3D geometry in different camera coordinates with existing techniques in motion detection, segmentation, and pattern recognition. The nonrigidity of the human body is addressed by matching points of the middle line of the human image, spatially and temporally, using Bayesian classification schemes. The key to successful tracking in the proposed work relies on our method of 3D motion prediction and estimation from different perspectives. Experimental studies using a three-camera prototype system show its efficiency in computation and its potential for real-time applications.
The earliest work in this area is, perhaps, by Sato et al. [1]. They considered the moving human image as a combination of various blobs. All distributed cameras were calibrated in the world coordinate system, which corresponds to a CAD model of the indoor environment. The blobs of body parts were matched through image sequences using the area, average brightness, and rough 3D position in world coordinates. Kelly et al. [2] adopted a similar strategy [1] to construct a 3D environmental model using the voxel feature. The depth information contained in the voxel is obtained using height estimation. Moving humans were tracked as a group of these voxels from the "best" angle of the viewing system. Neither of these methods considered the particular body structure and shape characteristics of a human being. In addition, both need to model the environment in 3D and establish a world coordinate system; they are computationally expensive and do not adapt to changes in dynamic environments. In our work, only neighboring cameras are calibrated to their relative coordinates, and background images are updated periodically to capture changes in the environment. Based on studies of human geometric structure, we distinguish moving human figures from other nonhuman objects by modeling the human body. Matching the subject image between consecutive frames involves motion estimation in a spatial-temporal domain under a Bayesian classification scheme.

Tracking is done from a single camera view until the system predicts that the active camera will soon no longer have a good view of the subject of interest. Tracking then switches to the camera that will provide a better view and require the least switching to continue tracking. Thus, the tracking paradigm consists of three basic modules: Single View Tracking (SVT), Multiple View Transition Tracking (MVTT), and Automatic Camera Switching (ACS).
2 SINGLE VIEW TRACKING

Tracking from a single view includes two major components: preprocessing and feature correspondence between consecutive frames. Three stages of preprocessing are performed:

1. segmenting the moving objects from the still background,
2. distinguishing human subjects from other segmented nonbackground objects, and
3. extracting features from the segmented human subjects.

Feature correspondence is established by applying a Bayesian classifier to locate the most likely match of the subject image in the next frame. The feature vector consists of location, intensity, and geometric information. Multivariate Gaussian models are formulated to parameterize the class-conditional probability density of the feature vector. Thus, tracking is reduced to finding the minimum sum of the corresponding Mahalanobis distances of the features given the estimated feature parameters.

2.1 Preprocessing

Preprocessing is critical to the success of the high-level processing stages. If a moving object is missed at the preprocessing stage, the system will be unable to track this particular object at later stages. The major task of preprocessing is to segment human images from the rest of the image objects. To the best of our knowledge, there are still no satisfying and robust general solutions. Here, we apply efficient standard motion detection and segmentation techniques to take advantage of the fact that the viewing system is still. More robust and complicated segmentation schemes could be applied if computational cost were not a consideration. The key to the proposed motion segmentation is to dynamically recover the background by grouping regions of still pixels over time. We then detect moving blobs by differencing and focus on the upper half of the body of each blob using a coarse 2D human model. This procedure is followed by human segmentation, where moment invariants are used as the shape feature for distinguishing between human and nonhuman moving objects based on Principal Component Analysis (PCA). More details are found in [4], [5].

Due to their robustness for matching in different views, points belonging to the middle line of the upper body are selected and aggregated as the feature to track. The line segment is extracted by finding the middle points of the blobs. Using multiple feature points instead of a single point [6] makes matching the subject image more reliable. We have elected to use six points, based on the trade-off between the need to use fewer points to reduce computation cost and the need to use more points due to the nonrigidity of moving human figures. To ensure the robustness of the feature matching, we incorporate three types of features: location, intensity, and geometry. The location feature is defined as the horizontal and vertical positions of the feature points, X_t = [(u_1t, v_1t), (u_2t, v_2t), ..., (u_Mt, v_Mt)]^T, where t is the time index. We define the intensity feature as Y_t = [y_1t, y_2t, ..., y_Mt]^T, in which y_mt is the average intensity of the neighborhood of the mth feature point. Another type of feature is the image height ratio between consecutive frames (the height of a candidate image in the current frame divided by the subject height in the previous frame), which serves as the geometric feature g_t; the image height is computed as the height of the upper body using a coarse 2D geometric human model at the segmentation stage. This feature is essential for tracking in narrow corridor scenes, where the location and intensity features most likely fail.
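As an illustration of this feature extraction step, the following Python sketch samples six rows of a segmented upper-body mask, takes the horizontal midpoint of the foreground run on each row as a middle-line point, records the mean intensity of a small neighborhood around each point, and forms the height ratio used as the geometric feature. This is a minimal sketch under the assumptions stated in the comments, not the authors' implementation; the mask interface, neighborhood size, and function names are illustrative.

import numpy as np

def extract_features(mask, gray, prev_height, num_points=6, nbhd=2):
    """Illustrative sketch: middle-line feature points of an upper-body blob.

    mask        -- 2D boolean array, True for upper-body foreground pixels
    gray        -- 2D grayscale image aligned with mask
    prev_height -- upper-body image height in the previous frame
    Returns (X, Y, g): point locations, neighborhood mean intensities,
    and the image-height ratio used as the geometric feature.
    """
    rows = np.where(mask.any(axis=1))[0]
    top, bottom = rows.min(), rows.max()
    height = bottom - top + 1

    X, Y = [], []
    # Sample num_points rows evenly over the upper body and take the
    # horizontal midpoint of the foreground pixels on each sampled row.
    for r in np.linspace(top, bottom, num_points).astype(int):
        cols = np.where(mask[r])[0]
        if cols.size == 0:
            continue                               # skip rows with no foreground
        u = int(cols.mean())                       # midpoint column
        X.append((u, r))
        patch = gray[max(r - nbhd, 0):r + nbhd + 1,
                     max(u - nbhd, 0):u + nbhd + 1]
        Y.append(float(patch.mean()))              # average neighborhood intensity

    g = height / prev_height                       # height ratio between frames
    return np.array(X), np.array(Y), g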
2.2 Feature Correspondence

Tracking a subject between adjacent frames can be achieved by finding the closest match of the features in the next frame based on constraints such as continuous position, instantaneous velocity, similar intensity, etc. We apply a Bayesian classifier to locate the most likely match of the subject image in the next frame. For simplicity of computation, and without loss of generality, we assume that the prior probability function P(Θ) is uniformly distributed, where Θ denotes the feature parameters of the subject to track. In a multivariate Gaussian model, Θ represents the mean and covariance of the feature vector Z_t = [X_t, Y_t, g_t], where X_t, Y_t, and g_t are assumed to be independent of each other since they are different types of features. So, we define

p(Z_t | Θ) = p_x(X_t | Θ_x)^w_x p_y(Y_t | Θ_y)^w_y p_g(g_t | Θ_g)^w_g,    (1)

where w_x, w_y, and w_g are the weights associated with p_x(.), p_y(.), and p_g(.). Based on Bayes theory, the closest match is found by searching for the minimum of D_t = -log p(Z_t | Θ). The weights for each feature are computed during training based on 1/w_x : 1/w_y : 1/w_g = [-log p_x(X_t | Θ_x)] : [-log p_y(Y_t | Θ_y)] : [-log p_g(g_t | Θ_g)] with w_x + w_y + w_g = 1. We assume that p_x(.), p_y(.), and p_g(.) are normally distributed to reduce the computational cost.

Since the subject of interest has a nonrigid form, we assume that one feature point is independent of another. Under this assumption, the mean vector of p_x(.) is U_x = X̂_t and the covariance Σ_x is a diagonal matrix whose mth component is σ²_x,m. Therefore, the location term reduces to a sum of squared prediction errors over the feature points, each scaled by its variance:

D_x,t = Σ_m [(u_mt - û_mt)² + (v_mt - v̂_mt)²] / σ²_x,m.    (2)

The estimate (û_mt, v̂_mt) is computed using perspective projection and the assumption that the velocity direction is unchanged over three consecutive frames [5]. We define σ_x,m = λ_x ĥ_t, so that σ_x,m is proportional to the scale of the image height ĥ_t, where λ_x is a scaling factor chosen to obtain a universal scaling of the Mahalanobis distances across the different types of features. The definitions and estimates for the intensity p_y(.) and geometric features are similar and are given in [5]. Finally, we have

D_t = w_x D_x,t + w_y D_y,t + w_g D_g,t,    (3)

where D_x,t, D_y,t, and D_g,t are the Mahalanobis distances for the individual features. The most likely match must satisfy two conditions: D_t must be 1) less than a certain threshold T and 2) the minimum value among the candidates. Although this threshold is currently preset, its value could be adapted to different tracking environments.

In the above paradigm, if the subject of interest is occluded by another subject, the system might select the occluding subject as the best match, even though the intensity or geometric features might not agree. If we do not "memorize" the correct features from the previous frame, the target might be switched after the occlusion. In such cases, we use estimated features instead of the ones computed directly from the current frame. Details of the computation are addressed in [5].
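To make the matching rule concrete, the sketch below combines the per-feature distances with the trained weights and applies the two acceptance conditions: the combined distance must be below the threshold T and minimal over the candidates. The simplified distance forms and the data layout are assumptions made for illustration; the exact estimators for the predicted features are those of [5].

import numpy as np

def location_distance(X, X_pred, sigma_x):
    # Sum of squared point-prediction errors scaled by per-point variances
    d = np.asarray(X, float) - np.asarray(X_pred, float)
    return float(np.sum((d ** 2).sum(axis=1) / sigma_x ** 2))

def intensity_distance(Y, Y_pred, sigma_y):
    d = np.asarray(Y, float) - np.asarray(Y_pred, float)
    return float(np.sum(d ** 2 / sigma_y ** 2))

def geometric_distance(g, g_pred, sigma_g):
    return float((g - g_pred) ** 2 / sigma_g ** 2)

def best_match(candidates, pred, weights, T=2.0):
    """Pick the candidate with the minimum combined distance D_t, if below T.

    candidates -- list of (X, Y, g) feature tuples in the next frame
    pred       -- (X_pred, sigma_x, Y_pred, sigma_y, g_pred, sigma_g)
    weights    -- (w_x, w_y, w_g) with w_x + w_y + w_g = 1
    """
    X_p, s_x, Y_p, s_y, g_p, s_g = pred
    w_x, w_y, w_g = weights
    D = [w_x * location_distance(X, X_p, s_x)
         + w_y * intensity_distance(Y, Y_p, s_y)
         + w_g * geometric_distance(g, g_p, s_g)
         for (X, Y, g) in candidates]
    k = int(np.argmin(D))
    return (k, D[k]) if D[k] < T else (None, D[k])   # reject if above threshold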
3 MULTIPLE VIEW TRANSITION TRACKING

In our system, tracking continues in the single-view (SVT) mode until the active camera no longer has a good view of the subject of interest, at which point tracking switches to the video stream captured by another nearby camera. The system then enters the Multiple View Transition Tracking (MVTT) mode. Fig. 1 shows the overall diagram of the module; the double-framed rectangular boxes represent the processes that differ from SVT.

Fig. 1. The basic procedure of transition tracking (motion detection and segmentation, coarse human segmentation, feature extraction, followed by the transition-specific steps).

In MVTT, the tracking features in consecutive frames must be adjusted to the same spatial coordinates. Preprocessing starts with camera calibration, which measures the intrinsic and extrinsic parameters of the system cameras using the methods in [7] and [8], respectively. These parameters are used to establish the relationships between the various camera coordinates. We then go through the same procedure as in the preprocessing of the SVT module. The last step before feature correspondence is to project the location feature into the same camera coordinates.

We again apply multivariate Gaussian models to represent the class-conditional distributions of the features p(Z_t | Θ), but now including only location and intensity information, since there is no longer a valid criterion for estimating geometric features from different camera views without knowing the relative distances between the subject and the viewing cameras. Feature correspondence using the location feature, however, differs significantly.

3.1 Tracking Based on the Location Feature

Tracking across different perspectives in time is equivalent to matching feature points from I_t and J_t+1, where I_t is the frame imaged by camera C_i at time t and J_t+1 is the frame imaged by camera C_j at time t+1. It involves both spatial and temporal motion estimation. Typical methods, like Kalman filtering, could be used in this case. To reduce computational cost, we apply a simpler prediction and estimation method instead. Two basic models are addressed: the class-conditional distribution for spatial matching, p_x1(X_t | Θ_x1), and that for spatial-temporal matching, p_x2(X_t | Θ_x2).

3.1.1 Spatial Matching

Spatial matching is based on the correspondence between a 2D point and its epipolar line. To establish correspondence between frames imaged by camera C_i and camera C_j at time t, the multivariate Gaussian model for position is modified from (2) to

D_x1,t = Σ_m d²_mt / σ²_x1,m,

where d_mt is the distance between the mth feature point (u_mt, v_mt) and its expected 2D epipolar line a_mt x + b_mt y + c_mt = 0, all in the view of C_j. The 2D epipolar line is projected from the point (û_mt, v̂_mt) in the view of C_i, and t is the time index. Since the distance between an image point (x_0/z_0, y_0/z_0) and a 2D line ax + by + c = 0 is

d = |a x_0/z_0 + b y_0/z_0 + c| / sqrt(a² + b²),

we define σ_x1,m = λ_x1 ĥ_t, with λ_x1 again a scaling factor.
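The following sketch illustrates the spatial-matching term: the distance from each feature point in the view of C_j to the epipolar line of the corresponding point in the view of C_i, scaled by σ_x1,m = λ_x1 ĥ_t. Representing the epipolar geometry with a 3x3 fundamental matrix F is an assumption for illustration; in the paper the line is obtained directly from the calibrated relative pose of the two cameras.

import numpy as np

def point_to_epipolar_distance(F, p_i, p_j):
    """Distance from point p_j (view C_j) to the epipolar line of p_i (view C_i).

    F   -- 3x3 fundamental matrix mapping points in C_i to lines in C_j
    p_i -- (u, v) feature point in the view of C_i
    p_j -- (u, v) candidate point in the view of C_j
    """
    a, b, c = F @ np.array([p_i[0], p_i[1], 1.0])   # line a*x + b*y + c = 0 in C_j
    return abs(a * p_j[0] + b * p_j[1] + c) / np.hypot(a, b)

def spatial_location_distance(F, pts_i, pts_j, h_hat, lam=1.0):
    """Sum of squared epipolar distances, scaled like sigma_x1,m = lam * h_hat."""
    sigma = lam * h_hat
    return sum(point_to_epipolar_distance(F, pi, pj) ** 2
               for pi, pj in zip(pts_i, pts_j)) / sigma ** 2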
3.1.2 Spatial-Temporal Matching

Spatial-temporal matching involves estimating the projection of a 3D point in the view of camera i at time t, denoted (û_it, v̂_it), given (u_i(t-1), v_i(t-1)), (u_j(t-1), v_j(t-1)), and (u_jt, v_jt). Using the pinhole projection model, the estimate is expressed in terms of the rotation matrix R_ij and translation vector T_ij between the camera coordinates i and j, the scaling factors α_1 and α_2, and the depth ratios β_1 and β_2 of the point at times t-1 and t in C_i and C_j, which can be calculated from the height ratio of the subject images between adjacent frames; r_kl denotes the kth-row, lth-column element of R_ij. The full expressions are derived in [5]. When occlusion is detected by thresholding, similar to the SVT module, only v̂_mt is modified. More details can be found in [5].
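As a simplified illustration of the spatial-temporal transfer, the sketch below back-projects a point observed by C_j with an estimated depth, expresses it in the coordinates of C_i using the relative pose (R_ij, T_ij), and reprojects it with the pinhole model. Treating the depth as a given input (for instance, derived from the image-height ratio) is a simplification of the formulation in [5]; the intrinsic matrices and the interface are assumptions for illustration.

import numpy as np

def transfer_point(K_i, K_j, R_ij, T_ij, p_j, depth_j):
    """Project a point seen by camera C_j into the image of camera C_i.

    K_i, K_j   -- 3x3 intrinsic matrices of the two cameras
    R_ij, T_ij -- rotation and translation taking C_j coordinates to C_i
    p_j        -- (u, v) pixel in the view of C_j
    depth_j    -- estimated depth of the point in C_j (e.g., from the height ratio)
    """
    ray = np.linalg.inv(K_j) @ np.array([p_j[0], p_j[1], 1.0])  # back-project to a ray
    X_j = depth_j * ray                                          # 3D point in C_j
    X_i = R_ij @ X_j + T_ij                                      # express in C_i
    u, v, w = K_i @ X_i                                          # pinhole projection
    return u / w, v / w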
4 AUTOMATIC CAMERA SWITCHING

We choose to track the subject of interest in one video stream at a time to reduce the computational cost, and we automatically switch among cameras to keep the subject in view. Automatic camera switching (ACS) consists of two steps: prediction and optimal camera selection. The prediction process reports when camera switching is necessary, which may happen in three cases:

1. when the subject image appears to be moving out of the viewing boundaries of the current camera,
2. when the subject moves too far away, and
3. when the subject becomes occluded by another subject for more than two frames.

In these situations, switching to another camera may result in a more continuous or better view of the subject. The selection of the "optimal" camera is considered in terms of three aspects:

1. the candidate camera must be able to image the subject in the future,
2. spatial matching between the different cameras must be robust, and
3. the candidate camera should contain the subject image over the longest number of frames, given the subject's current position and velocity.

The third requirement minimizes the amount of camera switching during tracking.

4.1 Prediction

We address three types of prediction for the subject image: location prediction, image height prediction, and tracking confidence measurement. Location prediction estimates the location of the subject image in the next frame and judges whether it will be within the field of view of the current camera. Image height prediction is, in a sense, an estimation of the subject's depth using the image positions in previous frames. Tracking confidence is a measure of how robust the matching between consecutive frames is; it can be lowered by poor segmentation, occlusion, and ambiguity in the clothing and size of the subject images.

In each process, we assume constant velocity of the subject over three consecutive frames. This assumption is reasonable given the small time period needed to capture three frames. The velocity information is refined at each step once the uncertainty of the matching is resolved.

4.1.1 Location Prediction

Location prediction is based on perspective projection and the constant-velocity assumption [5]; the predicted position (u_t+1, v_t+1) is obtained from the positions in the preceding frames and the height ratio r_t. To initialize the prediction process, we assume that r_t = 1 and Δu = Δv = 0. If (u_t+1, v_t+1) is outside the viewing boundaries of the current camera, camera switching is immediate.

4.1.2 Image Height Prediction

Image height prediction uses the height of the upper-body image as a coarse reflection of the subject's depth in the camera coordinates. Compared to the width, the height of the subject image more truthfully reflects the distance between the subject and the active camera; for example, a person facing the viewing camera remains the same height as he turns 90 degrees away, but becomes a different width. Using the definition of r_t along with the constant-velocity assumption, the height of the subject's upper body in the tth frame is predicted from the heights in the preceding frames [5]. If h_t becomes too small, indicating that the subject is moving too far from the viewing camera, immediate camera switching is necessary.
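A minimal sketch of the location and height prediction follows. It uses the simplest reading of the constant-velocity rule, carrying the last displacement forward and scaling the height by the frame-to-frame ratio r_t, rather than the exact expressions of [5]; the frame-size test and the minimum-height threshold correspond to switching cases 1 and 2 above, and all names are illustrative.

def predict_next(u, v, du, dv, h, r, frame_size, h_min):
    """Predict the next location and upper-body height; report whether a switch is needed.

    (u, v)     -- current feature location; (du, dv) -- displacement over the last frame
    h          -- current upper-body image height; r -- height ratio between frames
    frame_size -- (width, height) of the image; h_min -- minimum acceptable height
    """
    u_next, v_next = u + du, v + dv           # constant-velocity location prediction
    h_next = h * r                            # height carried forward by the same ratio
    width, height = frame_size
    out_of_view = not (0 <= u_next < width and 0 <= v_next < height)
    too_far = h_next < h_min
    switch_needed = out_of_view or too_far    # cases 1 and 2 of the ACS prediction
    return (u_next, v_next), h_next, switch_needed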
4.1.3 Tracking Confidence

Tracking confidence is derived from D_t, since D_t is the key to finding the most likely match between two consecutive frames. Two types of confidence are considered: the absolute confidence, ACF_t, and the relative confidence, RCF_t, where t is the time index. ACF_t is defined as ACF_t = T / D_t, with T the threshold addressed before. As D_t decreases, ACF_t increases proportionally, which agrees with the decision criterion that the smaller the Mahalanobis distance, the more robust the match. RCF_t is a measure of the relative confidence among multiple candidates for matching. It is defined as RCF_t = D_t(1) / D_t(0), with

D_t(0) ≤ D_t(1) ≤ ... ≤ D_t(k) ≤ ... ≤ D_t(K),

where k is the index of the subject candidates and K is the number of subject candidates. A confident match should have both a high ACF_t and a high RCF_t. If only one subject exists, ACF_t is the only quantitative measure of tracking quality. The overall confidence is defined as CF_t = min[ACF_t, RCF_t]. A small CF_t may be caused by occlusion of subject images, poor segmentation, ambiguity between the sizes and intensity values of subject images, etc. In such situations, changing the viewing angle of the camera may help to solve some of these problems. The definition of tracking confidence also applies to each individual feature, except that the threshold T is replaced by T_x, T_y, and T_g, respectively.

4.2 Optimal Camera Selection

We select the optimal camera based on matching robustness and on prediction of the subject image position given its current position and velocity. The process of selecting the optimal camera involves two steps: matching evaluation and frame number calculation.

4.2.1 Matching Evaluation

Matching evaluation selects an optimal camera with high tracking confidence, i.e., with CF_t above the corresponding threshold.

4.2.2 Frame Number Calculation

Frame number calculation is used to minimize the amount of camera switching during tracking and thus reduce the computational cost. If more than one camera has a robust match, we use the current position and velocity of the subject to estimate the number of frames until the subject will move out of the view of each candidate camera or will move too far from the camera to be viewed well. We choose the camera that will image the subject over the most frames as the optimal camera. The detailed derivation of the frame number calculation is addressed in [5].
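The confidence measures and the two-step camera selection can be sketched together. The code below follows the definitions above (ACF_t = T/D_t, RCF_t as the ratio of the runner-up distance to the best, CF_t their minimum); the frames_in_view estimate is assumed to come from the prediction step, and the tuple-based interface is purely illustrative.

def confidences(distances, T=2.0):
    """Absolute, relative, and overall tracking confidence from candidate distances.

    distances -- combined distances D_t for all subject candidates
    """
    D = sorted(distances)
    acf = T / D[0]                              # absolute confidence ACF_t = T / D_t
    rcf = D[1] / D[0] if len(D) > 1 else acf    # relative confidence vs. runner-up
    return acf, rcf, min(acf, rcf)              # overall CF_t = min(ACF_t, RCF_t)

def select_camera(candidates, cf_threshold=2.0):
    """Pick a camera with a robust match that keeps the subject in view longest.

    candidates -- list of (camera_id, CF_t, frames_in_view) tuples, where
                  frames_in_view is estimated from current position and velocity
    """
    robust = [c for c in candidates if c[1] >= cf_threshold]   # matching evaluation
    if not robust:
        return None
    return max(robust, key=lambda c: c[2])[0]                  # frame number calculation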
Second dence and the vertical axis is the corresponding tracking CAD-Based Vision Workshop, pp. 291-297, Champion, Pa., Feb. 1994, confidence CF,s. To have a better view of the low C ~ Swe clip, I'.H Kelly, A. Katkere, O.Y. Kuramura, S. Moezei, S. Chattcrjce, and R. Jain, any CFt 2 10 to be 10. The solid lines in each figure are the "An Architecture for Multiple Perspective Interactive Video," Proc. ACM CO$ Multiwedio, pp. 201-212, 1995. threshold of 2. Both figures show that using three types of features Q. C d and J.K. Aggsrwal, "Tracking Human Motion Using Multiple achieves a much higher tracking confidence than using any .. Cameras,'' Proc. In17 Cotif. Patterti Rccwnition, DD. 68-72. Vicnna, Austria, individual feature, and the intensity feature is the least robust. Aug. 1996. Q.Cai, A. Mitiche, and J.K. Aggarwal, "Tracking Huinan Motion in an Thus, its weight is set smaller to achieve better tracking. More Indoor linvironmenl," i'ioc. Second In17 Cot$ I m p Proccming, pp. 215-218, robust features could be substituted by simply following the Washington, U.C., Oct. 1995. defined framework. MVTT tracking confidences are lower than Q. Cui, "Tracking Human Motion in Indoor nnvironincnts Using a IIistributedCamera System," PhD thesis, Thc Univ. of Tcxm at Austin, SVT due to the increased complexity of the algorithm and 1997 ~~.~ matching ambiguities between a 2D point and its estimated R. Polam and R. Nclson, "Lnw Lcvcl Recognition of I-lumun Motion," l'roc. epipolar line from multiple perspectives. IECE CS Workshop Motion of Non-Rixid end Aiticicliiled Objects, pp. 77-82, Next, we evaluate the tracking algorithm by the tracking rate, defined as the percentage that the system tracks the right subject image. In SVT, we achieved a 98 percent rate of tracking using all the features. The rates of single feature tracking for location, intensity, and geometric features individually were 93.5 percent, 80.0 percent, and 84.5percent, respectively. In MVTT, we obtained C m j Computer Vision, Bombay, India, Jan. 1998. S. Pingali end J. Segen, "Performsnce Evaluation of People Tracking a rate of 96 percent when using both features, and 95 pcrccnt and System,'' Pmc. IEEE CS Workshop Applications in Computer Vision, pp. 33-38, 68 percent when using the location and intensity features Sarasola, Fla., 1996. individually. A match with high CF,, usually results in a correct match; wrong matches occur when the CF,, is below the threshold. Failure of the proposed tracking algorithm is usually due to occlusion, which not only makes the low-level processing more difficult in the first stage, but also increases the matching ambiguity of the feature correspondence. Although we have developed techniques to deal with the problem of occlusion at a certain level, it still remains a major obstacle to the tracking problem. Other factors that degrade pcrformance include reflec- tion on glass and metal surfaces and dramatic changes in scenes viewed through glass doors. All of these factors prevent the system from accurately segmenting the subject image from a still back- ground. MVTT tracking performance is less robust than SVT due to the uncertainty of the depth at the time of matching. Other factors which may deteriorate the algorithm performance are similarities in clothing to the background or in the distance between the subject and the viewing camera, which dcgrade thc contribution of the intensity and geometric features during matching. 
6 CONCLUSION

We have developed a comprehensive framework for tracking coarse human models from sequences of synchronized monocular grayscale images in multiple camera coordinates. Our framework demonstrates the feasibility of an end-to-end person tracking system that uses a unique combination of motion analysis on 3D geometry in multiple perspectives and existing techniques in motion detection, segmentation, and pattern recognition. Bayesian classification schemes associated with a general framework of motion analysis in a spatial-temporal domain are used for feature correspondence between consecutive frames under the same or different spatial coordinates. The performance of the algorithm has been evaluated on a prototype system in various types of indoor scenes and demonstrates its feasibility for real-time applications.

ACKNOWLEDGMENTS

This work was supported in part by the Texas Higher Education Coordinating Board under projects 95-ATP-442 and 97-ATP-275 and by the U.S. Army Research Office under contracts DAAH-04-94-G-0417 and DAAH-04-95-1-0494.

REFERENCES

[1] K. Sato, T. Maeda, H. Kato, and S. Inokuchi, "CAD-Based Object Tracking with Distributed Monocular Camera for Security Monitoring," Proc. Second CAD-Based Vision Workshop, pp. 291-297, Champion, Pa., Feb. 1994.
[2] P.H. Kelly, A. Katkere, D.Y. Kuramura, S. Moezzi, S. Chatterjee, and R. Jain, "An Architecture for Multiple Perspective Interactive Video," Proc. ACM Conf. Multimedia, pp. 201-212, 1995.
[3] Q. Cai and J.K. Aggarwal, "Tracking Human Motion Using Multiple Cameras," Proc. Int'l Conf. Pattern Recognition, pp. 68-72, Vienna, Austria, Aug. 1996.
[4] Q. Cai, A. Mitiche, and J.K. Aggarwal, "Tracking Human Motion in an Indoor Environment," Proc. Second Int'l Conf. Image Processing, pp. 215-218, Washington, D.C., Oct. 1995.
[5] Q. Cai, "Tracking Human Motion in Indoor Environments Using a Distributed-Camera System," PhD thesis, The Univ. of Texas at Austin, 1997.
[6] R. Polana and R. Nelson, "Low Level Recognition of Human Motion," Proc. IEEE CS Workshop Motion of Non-Rigid and Articulated Objects, pp. 77-82, Austin, Texas, Nov. 1994.
Proc. Int'l Conf. Computer Vision, Bombay, India, Jan. 1998.
[10] S. Pingali and J. Segen, "Performance Evaluation of People Tracking System," Proc. IEEE CS Workshop Applications of Computer Vision, pp. 33-38, Sarasota, Fla., 1996.
