PERSON LOCALIZATION IN A WEARABLE CAMERA PLATFORM
TOWARDS ASSISTIVE TECHNOLOGY FOR SOCIAL INTERACTIONS
Lakshmi Gade, Sreekar Krishna and Sethuraman Panchanathan
Center for Cognitive Ubiquitous Computing (CUbiC)
Arizona State University, Tempe AZ 85281
Lakshmi.Gade@asu.edu, Sreekar.Krishna@asu.edu & Panch@asu.edu
Social interactions are a vital aspect of everyone’s daily living. Individuals with
visual impairments are at a loss when it comes to social interactions, as the majority
(nearly 65%) of these interactions happen through visual non-verbal cues.
Recently, efforts have been made towards the development of an assistive
technology, called the Social Interaction Assistant (SIA), which enables access
to non-verbal cues for individuals who are blind or visually impaired. Along with
self-report feedback about their own social interactions, behavioral psychology
studies indicate that individuals with visual impairment will benefit in their social
learning and social feedback by gaining access to non-verbal cues of their
interaction partners. As part of this larger SIA project, in this paper, we discuss the
importance of person localization while building a human-centric assistive
technology which addresses the essential needs of the visually impaired users. We
describe the challenges that arise when a wearable camera platform is used as a
sensor for picking up non-verbal social cues, especially the problem of person
localization in a real-world application. Finally, we present a computer-vision-based
algorithm adapted to handle the various challenges associated with the
problem of person localization in videos and demonstrate its performance on three
exemplar video sequences.
Keywords: Social Interactions, Wearable Camera, Person Tracking, Particle
Filtering, Chamfer Matching, Person Localization
1 INTRODUCTION

Human-Centered Multimedia Computing (HCMC) , an emerging area under Human-Centered
Computing (HCC), focuses on the creation of multimedia solutions that enrich the
everyday lifestyles of individuals through the effective use of multimedia technologies.
As explained in , HCMC focuses on deriving inspiration from human disabilities and
deficits towards developing novel multimedia computing solutions. An important example
of the same, discussed in detail in , is the concept of a Social Interaction Assistant
(SIA), which aims at developing an assistive technology aid for enhancing social
interactions between individuals, especially those who are visually impaired or blind.
Developed primarily with an assistive technology focus, the SIA uses state-of-the-art
pervasive and ubiquitous computing elements ranging from miniature on-body sensors to
high-fidelity haptic actuators. A detailed evolution of this project can be traced
through the publications [1-4] and [28-30], in chronological order. This paper attempts
to provide a solution to one persistent problem: tracking people through the primary
sensing element of the SIA, a wearable camera. Following this section, we provide a
brief overview of the SIA before getting into the particular issue of person
localization, which is the primary focus of this article.

1.1 Social Interaction Assistant (SIA)

Social interactions are highly influenced by non-verbal communication cues such as eye
contact, facial expressions, hand gestures, body posture, etc., which are mostly visual
in nature. The lack of access to such informative visual cues often inhibits individuals
with visual impairments and blindness from effectively participating in day-to-day
social interactions. The unique purpose of the SIA is to bridge this communication gap
between users who are visually impaired and their sighted counterparts .

As shown in Figure 1, the SIA makes use of an inconspicuous camera mounted on the nose
bridge of a pair of glasses as the primary visual sensor, while an accelerometer mounted
on a cap acts as an egocentric motion sensor. The camera captures the scene in front of
the user, allowing various levels of computer vision processing. Information is
delivered through a single behind-the-ear speaker and a novel vibrotactile interface
called the Haptic Belt. The video stream captured from the camera is processed for
important social cues using a portable computing element. Any social information that is
extracted from the video is delivered to the user through audio and haptic cues. Since
social cues (such as facial expressions, body mannerisms, proxemics, etc.) are very high
bandwidth data, care is taken to encode these signals in such a way that the user is not
cognitively overloaded with information.

Figure 1. The Social Interaction Assistant

In  we introduce a systematic requirements analysis for an effective SIA. Through an
online survey (with inputs from 27 people, of whom 16 were blind, 9 had low vision, and
2 were sighted specialists in the area of visual impairment) we rank-ordered a list of
visual cues related to social interaction that the target population considers
important. Most of the needs identified through this survey point to the importance of
extracting the following characteristics of individuals in the scene: a) Number and
location of the interaction partners, b) Facial expressions, c) Identity, d) Appearance,
e) Eye gaze direction, f) Pose and g) Gestures. A brief glance through this list reveals
the commonality of these issues with some of the important research questions being
tackled by the face research group of the computer vision and pattern recognition
community. In this regard, many advances have been made in extracting information
related to humans from images and videos. But when the mobile setup of the SIA is
considered, with real-world data captured in unconstrained settings, a new dimension of
complexity is added to these problems. As most of these cues are related to people in
the surroundings of the user, it is essential to localize the individuals in the input
video stream prior to processing for social interaction cues.

The problem of person localization in general is very broad in its scope, and a wide
variety of challenges such as variations in articulation, scale, clothing, partial
appearances, occlusions, etc. make this a complex problem. Narrowing the focus, this
paper targets person localization in real-world video sequences captured from the
wearable camera of the SIA. Specifically, we focus on the task of localizing a person
who is approaching the user to initiate a social interaction or just a conversation. In
this context, the problem of person localization can be constrained to the cases where
the person of interest is facing the user.

Figure 2. Person of interest at a short distance
Figure 3. Person of interest at a large distance

When such a person of interest is in close proximity, his/her presence can be detected
by analyzing the incoming video stream for facial features (Figure 2). But when such a
person is approaching the user from a distance, the facial region in the video appears
extremely small. In this case, relying on facial features alone does not suffice and
there is a need to analyze the data for full-body features (Figure 3). In this work, we
have concentrated on improving the effectiveness of the SIA by applying computer vision
techniques to robustly localize people using full-body features. The following section
discusses some of the critical issues that are evident when performing person
localization from the wearable camera setup of the SIA.

1.2 Challenges in Person Localization from a wearable camera platform

A number of factors associated with the background, object, camera/object motion, etc.
determine the complexity of the problem of person localization from a wearable camera
platform. Following is a descriptive discussion of the challenges that we encountered
while processing the data using the SIA.
1.2.1 Background Properties

When the Social Interaction Assistant is used in natural settings, it is highly likely
that there are objects in the background which move, thus causing the background to be
dynamic. Also, there are bound to be regions in the background whose image features are
highly similar to those of the person, thus leading to a cluttered background. Due to
these factors, the problem of distinguishing the person of interest from the background
becomes highly challenging in this context. Figure 4 (a) and (b) illustrate the contrast
in the data due to the nature of the background.

Figure 4. Background Properties: (a) Simple Background, (b) Complex Background

1.2.2 Object Properties

Figure 5. Object Properties: (a) Rigid, Homogeneous Object, (b) Non-Rigid, Deformable,
Non-Homogeneous Object

As we are interested in person localization, it can be clearly seen that the object is
non-rigid in nature, as appearance changes occur throughout the sequence of images.
Further, significant scale changes and deformations in the structure can also be
observed. Also, when analyzing video frames of persons approaching the user, the basic
image features in various sub-regions of the object vary vastly. For example, the image
features from the facial region are considerably different from those of the torso
region. Tracking detected persons from one frame to another will require individualized
tracking of each region to maintain confidence. This non-homogeneity of the object poses
a major hurdle while applying localization algorithms and has not been studied much in
the literature. Figure 5(a) shows the simplicity of the data when these problems are not
present, while Figure 5(b) highlights complex data formulations in a typical interaction
scenario.

1.2.3 Object/Camera Motion

Figure 6. Object/Camera Motion: (a) Static Camera, (b) Mobile Camera

Traditionally, most computer vision applications use a static camera, where strong
assumptions of motion continuity and temporal redundancy can be made. But in our
problem, as it is very natural for users to move their head continuously, the mobile
nature of the platform causes abrupt motion in the image space (Figure 6). This is
similar to the problem of working with low frame rate videos or the cases where the
object exhibits abrupt movements. Recently, there has been an increase of interest in
dealing with this issue in computer vision research  [6-8]. Some important applications
which are required to meet real-time constraints, such as teleconferencing over low
bandwidth networks and cameras on low-power embedded systems, along with those which
deal with abrupt object and camera motion, like sports applications, are becoming
commonplace . Though solutions have been suggested, person localization through low
frame rate moving cameras still remains an active research topic.

1.2.4 Other Important Factors Affecting Effective Person Localization

Figure 7. Changing Illumination, Pose Change

As the SIA is intended to be used in uncontrolled environments, changing illumination
conditions need to be taken into account. Further, partial occlusions, self occlusions,
in-plane and out-of-plane rotations, pose changes, blur and various
other factors can complicate the nature of the data. See Figure 7 for example situations
where various factors can affect the video quality.

Given the nature of this problem, in this paper we focus on the robust localization of a
single person approaching a user of the SIA using full-body features. Issues arising due
to cluttered background along with object and camera motion have been handled towards
providing robustness. In the following section we discuss some of the important related
work in the computer vision literature. The conceptual framework used in person
localization is presented in Section 3. The details of the proposed algorithm are
discussed in Section 4. Section 5 presents the results and discussions of the
performance of our algorithm on videos collected from the wearable SIA. Finally, some
possible directions of future work are outlined, followed by the conclusion.

2 RELATED WORK

Historically, two distinct approaches have been used for searching and localizing
objects in videos. On one hand, there are detection algorithms which focus on locating
an object in every frame using specific spatial features which are fine-tuned for the
object of interest. For example, Haar-based rectangular features  and histograms of
oriented gradients  can be used to develop detectors that are very specific to objects
in videos. On the other hand, there are tracking algorithms which trail an object using
generic image features, once it is located, by exploiting the temporal redundancy in
videos. Examples of features used by tracking algorithms include color histograms  and
edge orientation histograms .

2.1 Detection Algorithms

As mentioned previously, detection algorithms exploit the specific, distinctive features
of an object and apply learning algorithms to detect a general class of objects. They
use information related to the relative feature positions, invariant structural
features, characteristic patterns and appearances to locate objects within the gallery
image. But when the object is complex, like a person, it becomes difficult for these
algorithms to achieve generality, and they fail even under minute non-rigidity. A number
of human factors such as variations in articulation, pose, clothing, scale and partial
occlusions make this problem very challenging.

When assumptions about the background cannot be made, learning algorithms which take
advantage of the relative positions of body parts are used to build classifiers. The
low-level features generally used in this context are gradient strengths and gradient
orientations [13,10], , entropy and Haar-like features. Some of the well-known higher
level descriptors are histograms of oriented gradients  and covariance features .
Efforts have been made to make these descriptors scale invariant as well.

In order to make these algorithms real-time, researchers have popularly resorted to two
kinds of approaches. One category includes part-based approaches such as Implicit Shape
Models  and constellation models , which place emphasis on detecting parts of the object
before integrating them, while the other category of algorithms tries to search for
relevant descriptors for the whole object in a cascaded manner. Shape-based Chamfer
matching  is a popular technique used in multiple ways for person detection, as the
silhouette gives a strong indication of the presence of a person. In recent times,
Chamfer matching has been used extensively by the person detection and localization
community. It has been applied with hierarchically arranged templates to obtain the
initial candidate detection blocks so that they can be analyzed further by techniques
such as segmentation, neural networks, etc. It has also been used as a validation tool
to overcome ambiguities in detection results obtained by the Implicit Shape Model
technique .

2.2 Tracking Algorithms

Assuming that there is temporal object redundancy in the incoming videos, many
algorithms have been proposed to track objects over frames and build confidence as they
go. Generally they make the simplifying assumption that the properties of the object
depend only on its properties in the previous frame, i.e. the evolution of the object is
a first-order Markov process. Based on these assumptions, a number of deterministic as
well as stochastic algorithms have been developed.

Deterministic algorithms usually apply iterative approaches to find the best estimate of
the object in a particular image in the video sequence . Optimal solutions based on
various similarity measures between the object template and regions in the current
image, such as sum of squared differences (SSD), histogram-based distances, distances in
eigenspace and other low dimensional projected spaces, and conformity to particular
object models, have been explored . Mean Shift is a popular, efficient
optimization-based tracking algorithm which has been widely used.

Stochastic algorithms use the state space approach of modeling dynamic systems and
formulate tracking as a problem of probabilistic state estimation using noisy
measurements . In the context of visual object tracking, it is the problem
of probabilistically estimating the object’s properties such as its location, scale and
orientation by efficiently looking for appropriate image features of the object. Most of
these stochastic algorithms perform Bayesian filtering at each step for tracking, i.e.
they predict the probable state distribution based on all the available information and
then update their estimate according to the new observations. Kalman filtering is one
such algorithm, which fixes the type of the underlying system to be linear with Gaussian
noise distributions and analytically gives an optimal estimate based on this assumption.
As most tracking scenarios do not fit into this linear-Gaussian model and as analytic
solutions for non-linear, non-Gaussian systems are not feasible, approximations to the
underlying distribution are widely used from both parametric and non-parametric
perspectives.

Sequential Monte Carlo based Particle Filtering techniques have gained a lot of
attention recently. These techniques approximate the state distribution of the tracked
object using a finite set of weighted samples using various features of the system. For
visual object tracking, a number of features have been used to build different kinds of
observation models, each of which has its own advantages and disadvantages. These
include color histograms, contours, appearance models, intensity gradients, region
covariance, texture, edge-orientation histograms and Haar-like rectangular features  ,
to name a few. Apart from the kind of observation models used, this technique allows for
variations in the filtering process itself. A lot of work has gone into adapting this
algorithm to perform better in the context of visual object tracking.

While both the areas of detection and tracking have been explored extensively, there is
a pressing need to address some of the issues faced by low frame rate visual tracking of
objects. Especially in the case of the SIA, person localization in low frame rate video
is of utmost importance. In this paper, we have attempted to modify the color histogram
comparison based particle filtering algorithm to handle the complexities that arise with
the mobile camera on the Social Interaction Assistant.

3 CONCEPTUAL FRAMEWORK

As discussed in the previous section, detection and tracking offer distinctive
advantages and disadvantages when it comes to localizing objects. In the case of the
SIA, thorough object detection is not possible in every frame due to the lack of
computational power (on a wearable computing platform), and tracking is not always
efficient due to the movement of the camera and the object’s (interaction partner’s)
independent motion. Though there are clear advantages in applying these techniques
individually, the strengths of both these approaches need to be combined in order to
tackle the challenges posed by the complex setting of the SIA. In the past, a few
researchers have approached the problem of tracking in low frame rate or abrupt videos
by interjecting a standard particle filtering algorithm with independent object
detectors . In our experience, the Social Interaction Assistant offers a weak temporal
redundancy in most cases. We exploit this information trickle between frames to get an
approximate estimate of the object location by incorporating a deterministic object
search while avoiding the explicit use of pre-trained detectors. Due to the flexibility
in their design, particle filtering algorithms provide a good platform to address the
issues arising due to complex data. These algorithms give an estimate of an object’s
position by discretely building the underlying distribution which determines the
object’s properties. But real-time constraints impose limits on the number of particles
and the strength of the observation models that can be used. This generally causes the
final estimate to be noisy when conventional particle filtering approaches are applied.
Unless the choice of the particles and the observation models fit the underlying data
well, the estimate is likely to drift away as the tracking progresses. To mitigate these
problems faced in the use of the SIA, we propose a new particle filtering framework that
gets an initial estimate of the person’s location by spreading particles over a
reasonably large area and then successively corrects the position through a
deterministic search in a reduced search space. Termed the Structured Mode Searching
Particle Filter (SMSPF), the algorithm uses color histogram comparison in the particle
filtering framework at each step to get an initial estimate, which is then corrected by
applying a structured search based on gradient features and chamfer matching.

4 STRUCTURED MODE SEARCHING PARTICLE FILTER

Assuming that an independent person detection algorithm can initialize this tracking
algorithm with the initial estimate of the person location, this particle filtering
framework focuses on tracking a single person under the following circumstances, namely
• Image region with the person is non-rigid and non-homogeneous.
• Image region with the person exhibits significant scale changes.
• Image region with the person exhibits abrupt motions of small magnitude in the image
  space due to the movement of the camera.
• Background is cluttered.
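
As a rough illustration of the particle filtering stage summarized in Section 3, the
following Python sketch spreads particles with a wide Gaussian around the previous
estimate, weights each particle by a Bhattacharyya comparison of color histograms, and
returns the weighted mean as the approximate location. It is a minimal sketch under
stated assumptions, not the authors' implementation: the toy two-bin "image" of integer
labels stands in for real HSV histograms, and the function names, the exponential
likelihood shape, and all parameter values are hypothetical.

```python
import math
import random


def patch_histogram(image, cx, cy, half, n_bins):
    """Normalized histogram of bin labels in a square patch centred at (cx, cy)."""
    hist = [0.0] * n_bins
    h, w = len(image), len(image[0])
    for y in range(max(0, cy - half), min(h, cy + half + 1)):
        for x in range(max(0, cx - half), min(w, cx + half + 1)):
            hist[image[y][x]] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist


def bhattacharyya(p, q):
    """Bhattacharyya coefficient: 1.0 for identical histograms, 0.0 for disjoint ones."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def particle_step(image, ref_hist, prev, n_particles=500, sigma=6.0,
                  half=2, n_bins=2, rng=None):
    """One step: wide Gaussian proposal around `prev`, histogram weights, weighted mean."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    particles, weights = [], []
    for _ in range(n_particles):
        # Sample a candidate location and clamp it to the image support.
        x = min(w - 1, max(0, round(rng.gauss(prev[0], sigma))))
        y = min(h - 1, max(0, round(rng.gauss(prev[1], sigma))))
        rho = bhattacharyya(ref_hist, patch_histogram(image, x, y, half, n_bins))
        particles.append((x, y))
        # Assumed likelihood shape: higher similarity gives exponentially more weight.
        weights.append(math.exp(-20.0 * (1.0 - rho)))
    total = sum(weights)
    ex = sum(wt * p[0] for wt, p in zip(weights, particles)) / total
    ey = sum(wt * p[1] for wt, p in zip(weights, particles)) / total
    return ex, ey


# Toy scene: background label 0, a 5x5 "person" block of label 1 centred at (25, 20).
image = [[0] * 40 for _ in range(40)]
for yy in range(18, 23):
    for xx in range(23, 28):
        image[yy][xx] = 1

reference = [0.0, 1.0]   # histogram of the person template (all pixels in bin 1)
previous = (20, 20)      # estimate from the previous frame
estimate = particle_step(image, reference, previous)
```

In this toy scene the estimate is pulled from the previous location towards the block of
matching color even though the proposal is centred five pixels away, which is the
behavior the wide proposal is meant to provide under abrupt motion.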
The algorithm progresses by implementing two steps on each frame of the incoming video
stream. In the first step (Figure 8), an approximate estimate of the person region is
obtained by applying a color histogram based particle filtering step over a large search
space. This is followed by a refining second step (Figure 9) where the estimate is
corrected by applying a structured search based on gradient features and Chamfer
matching. These two steps are described in detail below.

Figure 8. SMSPF – Step 1
Figure 9. SMSPF – Step 2

4.1 Step 1: Particle filtering step

In the context of the SIA, as the person of interest can exhibit abrupt motion changes
in the image space, it is extremely difficult to model the placement of the person in
the current image based on the previous frame’s information alone. When such data is
modeled in the Bayesian filtering based particle filtering framework, the state of each
particle’s position becomes independent of its state in the previous step. Thus, the
prior distribution can be considered to be a uniform random distribution over the
support region of the image.

    p(x_t^i \mid x_{t-1}^i) = p(x_t^i)    (1)

As it is essential for a particle filtering algorithm to choose a good set of particles,
it would be useful to pick a good portion of them near the estimate in the previous
step. By approximating this previous estimate to be equivalent to a measurement of the
image region with the person in the current step, the proposal distribution of each
particle can be chosen to be dependent only on the current measurement

    q(x_t^i \mid x_{t-1}^i, z_t) = q(x_t^i \mid z_t)    (2)

Though the propagation of information through particles is lost by making such an
assumption, it gives a better sampling of the underlying system. We employ a large
variance Gaussian with its mean centered at the previous estimate for successive frame
particle propagation. By using such a set of particles, a larger area is covered, thus
accounting for abrupt motion changes, and a good portion of them are picked near the
previous estimate, thus exploiting the weak temporal redundancy. As in , we have
employed this technique using HSV color histogram comparison to get likelihoods at each
of the particle locations. Since intensity is separated from chrominance in this color
space, it is reasonably insensitive to illumination changes. We use an 8x8x4 HSV
binning, thereby allowing lesser sensitivity to changes in V when compared to
chrominance. The histograms are compared using the well-known Bhattacharyya Similarity
Coefficient which guarantees near optimality and [...]

Figure 10. Structured Search

With the above step alone, due to the small number of particles which are spread widely
across the image, we can get an approximate location of the
person. When such an estimate partially overlaps with the desired person region, the
best match occurs between the intersection of the estimate and the actual person region
as shown in Figure 10. But it is not trivial to detect this partial presence due to the
existence of background clutter. To handle this problem, we introduce a second step
which uses efficient image feature representations of the desired person object and
employs an efficient search around the estimate to accurately localize the person.

4.2 Step 2: Structured Search

As the estimate obtained using widely spread particles gives the approximate location of
the object, the search for the image block with a person in it can be restricted to a
region around it. We have employed a grid-based approach to discretely search for the
object of interest (a person) instead of checking at every pixel. By dividing the
estimate into an m x n grid and sliding a window along the bins of the grid as shown in
Figure 11, the search space can be restricted to a region close to the estimate. By
finding the location which gives the best match with the person template, we can
localize the person in the video sequence with better accuracy.

Figure 11. Sliding window of the Structured Search (Green: Estimate; Red: Sliding
window).

If this search is performed based on scale-invariant features, then it can be extended
to identify scale changes as well. In order to achieve search over scale, the estimate
and the sliding window need to be divided into different numbers of bins. If the search
is performed using a smaller number of bins as compared to the estimate, then shrinking
of the object can be identified, while searching with a higher number of bins can
account for dilation of the object. For example, if an (m-1) x (n-1) grid is used with
the sliding window while an m x n grid is used with the estimate, then the best match
will find a shrink in the object size. Similarly, if an m x n grid sliding window is
used with an (m-1) x (n-1) estimate grid, then dilations can be detected. It can be seen
that this search is characterized by the number of bins m x n into which the sliding
window and the estimate are divided. Based on the nature of the problem, the number of
bins and the amount of sweep across scale and space can be adjusted. Currently, these
parameters are set manually, but the structured search framework can be extended to
include online algorithms which adapt the number of grid bins based on the evolution of
the object.

If the object of interest were simple, then the best match across space and scale could
be obtained by using simple feature matching techniques. But due to the complex nature
of the data, strong confidence is required while searching for the person region across
scale. To this end, we propose to perform the structured search by analyzing the
internal features of the person region as well as the external boundary/silhouette
features and aggregating the confidence obtained from these two measures to refine the
person location estimate in the image (Figure 12).

Figure 12. Structured Search Matching Technique

In the literature, gradient based features have been widely used for person detection
and tracking problems and their applicability has been strongly established by various
algorithms like Histogram of Oriented Gradients (HoGs) . Following this principle, we
have used the Edge Orientation Histogram (EOH) features  in order to obtain the internal
content information measure. For this
purpose, a gradient histogram template (GHT) is transformed image of the window is then obtained
initially built using a generic template image of a using the masked edges.
walking/standing person. This GHT is then
compared with the gradient histogram of each
structured search block using the Bhattacharyya
histogram comparison as in  in order to find the
block with the best internal confidence. In our
implementation, orientations are computed using
the Sobel operator and the gradients are then binned
into 9 discrete bins. These features were extracted
using the integral histogram concept  to
facilitate computationally efficient searching.
Similarly, in order to obtain the boundary
confidence measure, a generic person silhouette
template (GPT) (as shown in Figure 13) is used to
perform a modified Chamfer match on each of the Figure 13. Incorporating Chamfer Matching
search blocks. In general, Chamfer matching is into Structured Search
used to search for a particular contour model in an
edge map by building a distance transformed image By applying the modified chamfer matching (with a
of the edge map. Each pixel value in a distance generic person contour resized to the current
transformed image is proportional to the distance to particle filter estimate), a confidence number in
its nearest edge pixel. In order to compare the edge locating the desired object within the image region
map to the contour map, we convolve the edge can be obtained. Similar to the Chamfer matching
image with the contour map. If the contour as before, a value close to 0 indicates a strong
completely overlaps with the matching edge region, confidence of the presence of a person and vice
we get a chamfer match value of zero. Based on versa. As 1 is the maximum value that can be
how different the edge map is to the template obtained by the chamfer match, this measure can be
contour, the chamfer match score will increase and incorporated into the match score of the structured
move towards 1. A chamfer match score of 1 search using the following equation.
implies a very bad match.
BoundaryCo nf = (1 − ChamferMat ch) (3)
While the theory of chamfer matching offers
elegant search score, in reality, especially with clutter within the object’s silhouette, it is very difficult to get an exact match score. In SIA, since the data is very noisy and complex, certain modifications need to be made to the Chamfer matching algorithm in order to achieve good performance. The following section details a modified Chamfer match algorithm introduced in this work.

4.3 Chamfer Matching in Structured Search
As discussed above, Chamfer matching gives a measure of confidence in the presence of the person within an image based on silhouette information. We have incorporated this confidence into the structured search in order to detect the precise location of the person around the particle filter estimate. An edge map of the image under consideration is first obtained and then divided into (m x n) windows in accordance with the structured search, and an elliptical ring mask is applied to each of these windows as shown in Figure 13. This mask eliminates the edges that arise due to clothing and background, thereby emphasizing the silhouette edges, which are likely to appear in the ring region if a window is precisely placed on the object perimeter. A distance

The standard form of Chamfer Matching gives a continuous measure of confidence in locating an object in an edge map. In our case, however, when the elliptical ring mask is used to filter out the noisy edges in each search block, this continuous nature of the Chamfer match is lost. Since the primary goal of the structured search is to find a single best matching location of the person, it is more advantageous to use the filter mask at the cost of losing this continuity. Further, as it is very likely that the person region is close to the approximate estimate obtained from the first step, one of the search windows of the structured search is bound to capture the entire person object, thus resulting in a good match score.

From the above discussion, it can be seen that combining knowledge about the internal structure of the person region with the silhouette information results in greater confidence in the SMSPF algorithm. Further, using such complementary features in the structured search robustly corrects the approximate estimate obtained from the particle filtering step while handling various problems associated with search across scale.
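The masked Chamfer score described in Section 4.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the brute-force distance transform, and the `inner`/`outer` ring radii are all assumptions made for illustration, and the sketch assumes the ring mask is applied to the window's edge map before the distance transform is computed.

```python
import math

def distance_transform(edges):
    """Brute-force distance transform: Euclidean distance from every pixel
    to the nearest edge pixel (infinity if the map has no edges)."""
    h, w = len(edges), len(edges[0])
    pts = [(r, c) for r in range(h) for c in range(w) if edges[r][c]]
    if not pts:
        return [[math.inf] * w for _ in range(h)]
    return [[min(math.hypot(r - pr, c - pc) for pr, pc in pts)
             for c in range(w)] for r in range(h)]

def elliptical_ring_mask(h, w, inner=0.6, outer=1.0):
    """True for pixels lying between two concentric ellipses inscribed in
    an h x w window. The inner/outer ratios are hypothetical values."""
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ry, rx = h / 2.0, w / 2.0
    return [[inner <= math.hypot((r - cy) / ry, (c - cx) / rx) <= outer
             for c in range(w)] for r in range(h)]

def masked_chamfer_score(window_edges, template_pts, mask):
    """Mean distance from each template silhouette point to the nearest
    edge that survives the ring mask. Lower values mean a better match."""
    kept = [[bool(e and m) for e, m in zip(edge_row, mask_row)]
            for edge_row, mask_row in zip(window_edges, mask)]
    dt = distance_transform(kept)
    return sum(dt[r][c] for r, c in template_pts) / len(template_pts)
```

In a structured search, a score of this kind would be evaluated for every candidate window and the window with the lowest score retained. Edges inside the inner ellipse (for example, clothing texture) are discarded by the mask and therefore cannot lower the score of a misplaced window.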
5 EXPERIMENTS AND RESULTS

The performance of the structured mode searching particle filter (SMSPF) has been tested using three datasets where a single person faces the camera while approaching it. There are significant scale changes in each of these sequences. Further, non-rigidity and deformability of the person region can also be clearly observed. Different scenarios with varying degrees of complexity of the background and camera movement have been considered. Following is a brief description of these datasets.
(a) DataSet 1 (Collected at CUbiC¹): Plain Background; Static Camera; 320x240 resolution
(b) DataSet 2 (CASIA² Gait Dataset B with subject approaching the camera): Slightly cluttered Background; Static Camera; 320x240 resolution
(c) DataSet 3 (Collected at CUbiC³): Cluttered Background; Mobile Camera; 320x240 resolution
Figure 14 shows the sample results on each of the datasets used.

Figure 14. SMSPF Results: (a) a sequence from Dataset 1; (b) a sequence from Dataset 2; (c) a sequence from Dataset 3

5.2 Evaluation Metrics
In order to test the robustness of this algorithm and its applicability in complex situations, its performance has been compared with the popular Color Particle Filtering algorithm. The following two criteria have been used to evaluate their performance.
• Area Overlap (AO)
• Distance between Centroids (DC)

Manually labeled rectangular regions around the person in the image have been used as the ground truth. Suppose gTruth_i is the ground truth in the i-th frame and track_i is the rectangular region output by a tracking algorithm; then the area overlap criterion is defined as follows:

$$AO(gTruth_i, track_i) = \frac{Area(gTruth_i \cap track_i)}{Area(gTruth_i \cup track_i)} \tag{4}$$

The average area overlap can be computed for each data sequence as

$$AvgAOR = \frac{1}{N} \sum_{i=1}^{N} AO(gTruth_i, track_i) \tag{5}$$

An AvgAOR value closer to 1 indicates a better match, while a value of 0 implies no overlap. Similar to, we use Object Tracking Error (OTE), which is the average distance between the centroid of the ground truth bounding box and the centroid of the result given by a tracking algorithm.

$$OTE = \frac{1}{N} \sum_{i=1}^{N} \left\| Centroid_{gTruth_i} - Centroid_{track_i} \right\| \tag{7}$$

An OTE value closer to 0 implies better tracking, while a value away from 0 implies a larger distance between the prediction and the ground truth.

In order to evaluate the performance of these algorithms using a single metric which encodes information from both the area overlap and the distance between centroids, we have used a measure termed the Tracking Evaluation Measure (TEM), which is the harmonic mean of the average area overlap fraction (AvgAOR) and an exponential mapping of the Object Tracking Error (OTE).

$$TEM = 2 \cdot \frac{AvgAOR \cdot e^{-k \cdot OTE}}{AvgAOR + e^{-k \cdot OTE}} \tag{8}$$

where k is a constant which exponentially penalizes the cases where the distance between centroids is large.
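The evaluation metrics of this section (area overlap, Object Tracking Error and the Tracking Evaluation Measure) can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the evaluation code used in the experiments: boxes are assumed to be `(x, y, width, height)` tuples, the centroid distance is taken to be Euclidean, and the default `k = 0.05` is a hypothetical value, since the constant used in the paper is not stated.

```python
import math

def area_overlap(gt, tr):
    """AO: intersection area over union area of two (x, y, w, h) boxes."""
    ix = max(0.0, min(gt[0] + gt[2], tr[0] + tr[2]) - max(gt[0], tr[0]))
    iy = max(0.0, min(gt[1] + gt[3], tr[1] + tr[3]) - max(gt[1], tr[1]))
    inter = ix * iy
    union = gt[2] * gt[3] + tr[2] * tr[3] - inter
    return inter / union if union > 0 else 0.0

def centroid_distance(gt, tr):
    """Euclidean distance between the centroids of two (x, y, w, h) boxes."""
    gx, gy = gt[0] + gt[2] / 2.0, gt[1] + gt[3] / 2.0
    tx, ty = tr[0] + tr[2] / 2.0, tr[1] + tr[3] / 2.0
    return math.hypot(gx - tx, gy - ty)

def tracking_evaluation_measure(gts, trs, k=0.05):
    """Return (AvgAOR, OTE, TEM) for paired per-frame boxes.
    k is an assumed penalty constant, not a value taken from the paper."""
    n = len(gts)
    avg_aor = sum(area_overlap(g, t) for g, t in zip(gts, trs)) / n
    ote = sum(centroid_distance(g, t) for g, t in zip(gts, trs)) / n
    mapped = math.exp(-k * ote)  # maps OTE in [0, inf) onto (0, 1]
    tem = 2.0 * avg_aor * mapped / (avg_aor + mapped)
    return avg_aor, ote, tem
```

Because the exponential maps a zero centroid error to 1, a tracker whose boxes coincide with the ground truth in every frame obtains a TEM of 1, while a tracker with no overlap in any frame obtains a TEM of 0 regardless of k.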
¹ Center for Cognitive Ubiquitous Computing, ASU.
² Portions of the research in this paper use the CASIA Gait Database collected by Institute of Automation, Chinese Academy of Sciences.
³ Center for Cognitive Ubiquitous Computing, ASU.

As mentioned in , in order to handle abrupt motion changes, it is essential that the particles are widely spread while tracking. Following this principle, we have compared the performance of the color particle filter (PF) and the structured mode searching particle filter (SMSPF) by using a 2-D Gaussian with large variance as the system model. The position of the person and its scale have been included in the state vector. In order to compensate for the computational cost of the structured search, only 50 particles were used for the SMSPF algorithm while 100 particles were used for the PF algorithm. A 10x10 grid with a sweep of 8 steps along the spatial dimension and 3 steps along the scale dimension was incorporated in the structured search.

Figure 15. AO (Dotted Line: Color PF; Solid Line: SMSPF)

Figure 15 and Figure 16 illustrate the comparison of the area overlap ratio and the distance between centroids at each frame of an example sequence from Dataset 3. The sample frames are shown beside the tracking results. From Figure 15(a), it is evident that the SMSPF algorithm (red) shows a significant improvement over the color particle filter algorithm (green). Here, the area overlap ratio using SMSPF is much closer to 1 in most of the frames, while the color particle filter drifts away, causing this measure to be closer to 0. The distance between centroids measure also indicates a greater precision of the SMSPF algorithm, as seen in Figure 16(a), where the distance between centroids using the color particle filter is much higher than that with SMSPF (≈0).

Figure 16. DC (Dotted Line: Color PF; Solid Line: SMSPF)

Figure 17, Figure 18 and Figure 19 show the Tracking Evaluation Measure (TEM) for Datasets 1, 2 and 3. In the majority of the cases, the SMSPF algorithm outperforms the color based particle filtering algorithm with a higher TEM score.

Figure 17. Evaluation Measure for DataSet 1
Figure 18. Evaluation Measure for DataSet 2
Figure 19. Evaluation Measure for DataSet 3

The results presented as a comparison between Color PF and SMSPF show that incorporating a deterministic structured search into the stochastic particle filtering framework improves the person tracking performance in complex scenarios. The SMSPF algorithm strikes a balance between the specificity and generality offered by detection and tracking algorithms, as discussed in Section 2. It uses specific structure-aware features in the search in order to handle non-homogeneity of the object and the cluttered nature of the background. On the other hand, generality is maintained by using simple, global features in the particle filtering framework so as to handle non-rigidity and deformability of the object. The clear advantage of using the structured search can be observed on the complex Dataset 3, which encompasses most of the challenges generally encountered while using the Social Interaction Assistant.
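The wide-spread system model used in the experimental setup above can be illustrated with a short Python sketch of a Gaussian random-walk propagation step over the state (x, y, scale). It is a sketch under assumptions, not the configuration used in the experiments: the standard deviations `sigma_xy` and `sigma_scale`, the lower bound `min_scale`, and the choice to perturb scale with its own Gaussian are all hypothetical, since the text only specifies a 2-D Gaussian with large variance.

```python
import random

def propagate(particles, sigma_xy=20.0, sigma_scale=0.1,
              min_scale=0.1, rng=None):
    """Random-walk system model: add zero-mean Gaussian noise to the
    position and scale of every particle (x, y, scale).
    All default parameter values are assumptions for illustration."""
    rng = rng or random.Random()
    moved = []
    for x, y, s in particles:
        moved.append((
            x + rng.gauss(0.0, sigma_xy),   # large spatial spread
            y + rng.gauss(0.0, sigma_xy),
            max(min_scale, s + rng.gauss(0.0, sigma_scale)),  # keep scale positive
        ))
    return moved

# Example: 50 particles, as used for SMSPF, all starting at the centre
# of a 320x240 frame with unit scale.
particles = propagate([(160.0, 120.0, 1.0)] * 50, rng=random.Random(0))
```

A larger `sigma_xy` spreads the particles more widely, which helps with abrupt motion but leaves fewer particles near the true position; this is the gap the structured search around each estimate is meant to close.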
6 FUTURE WORK

As a first step towards achieving robust person localization in the Social Interaction Assistant platform, we have currently considered the cases where the movement of the camera is small. The generic structured search proposed in this work can be adapted to handle drastic abrupt motions of the camera as well. One way to handle such cases is to use a very small set of particles spread over a large region in conjunction with the structured search at each particle region. Also, improving the efficiency of the observation models would computationally ease such near-exhaustive searches. Further, in this work, we used a generic person silhouette in our chamfer matching step to validate the positions in the structured search. Better validation can be obtained by using person dependent silhouettes and better boundary masks which accurately capture the relevant structure of the person’s body. The current implementation has focused only on people facing the camera. This can be readily extended to handle other cases by selecting the relevant silhouettes based on the application.

7 CONCLUSION

Person localization in videos captured from a wearable camera involves tracking non-rigid, deformable, non-homogeneous image regions which exhibit random motion patterns in cluttered backgrounds. By incorporating ideas of specificity associated with deterministic detection algorithms along with the generality of stochastic tracking algorithms, we have presented a particle filtering technique which effectively localizes individuals across a range of space and scale once a person is detected. This technique is useful in achieving person localization in videos captured using any mobile camera platform where there is low temporal redundancy between frames. For our immediate application, the wearable Social Interaction Assistant, which aims to enhance the everyday social interaction experience of the visually impaired, we have been able to achieve near real-time person localization.

8 REFERENCES

S. Krishna, D. Colbry, J. Black, V. Balasubramanian, and S. Panchanathan, “A Systematic Requirements Analysis and Development of an Assistive Device to Enhance the Social Interaction of People Who are Blind or Visually Impaired,” Workshop on Computer Vision Applications for the Visually Impaired (CVAVI 08), European Conference on Computer Vision ECCV 2008, Marseille, France: 2008.
S. Panchanathan, N.C. Krishnan, S. Krishna, T. McDaniel, and V.N. Balasubramanian, “Enriched human-centered multimedia computing through inspirations from disabilities and deficit-centered computing solutions,” Proceeding of the 3rd ACM international workshop on Human-centered computing, Vancouver, British Columbia, Canada: ACM, 2008, pp. 35-42.
S. Panchanathan, S. Krishna, J. Black, and V. Balasubramanian, “Human Centered Multimedia Computing: A New Paradigm for the Design of Assistive and Rehabilitative Environments,” Signal Processing, Communications and Networking, 2008. ICSCN '08. International Conference on, 2008, pp. 1-7.
L. Gade, S. Krishna, and S. Panchanathan, “Person localization using a wearable camera towards enhancing social interactions for individuals with visual impairment,” Proceedings of the 1st ACM SIGMM international workshop on Media studies and implementations that help improving access to disabled users, Beijing, China: ACM, 2009, pp.
B. Leibe, A. Leonardis, and B. Schiele, “Combined Object Categorization and Segmentation With An Implicit Shape Model,” ECCV Workshop on Statistical Learning in Computer Vision, 2004, pp. 17-32.
F. Porikli and O. Tuzel, “Object Tracking in Low-Frame-Rate Video,” SPIE Image and Video Communications and Processing, vol. 5685, 2005, pp. 72-79.
Yuan Li, Haizhou Ai, T. Yamashita, Shihong Lao, and M. Kawade, “Tracking in Low Frame Rate Video: A Cascade Particle Filter with Discriminative Observers of Different Lifespans,” Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-8.
J. Kwon and K.M. Lee, “Tracking of Abrupt Motion Using Wang-Landau Monte Carlo Estimation,” Proceedings of the 10th European Conference on Computer Vision: Part I, Marseille, France: Springer-Verlag, 2008, pp. 387-400.
P. Viola and M.J. Jones, “Robust Real-Time Face Detection,” Int. J. Comput. Vision, vol. 57, 2004, pp. 137-154.
N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 1, IEEE Computer Society, 2005, pp. 886-893.
K. Nummiaro, E. Koller-Meier, and L. Van Gool, “An adaptive color-based particle filter,” Image and Vision Computing, vol. 21, Jan. 2003, pp. 99-110.
F. Porikli, “Integral histogram: A fast way to extract histograms in cartesian spaces,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 829-836.
Q. Zhu, M. Yeh, K. Cheng, and S. Avidan, “Fast Human Detection Using a Cascade of Histograms of Oriented Gradients,” Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, IEEE Computer Society, 2006, pp. 1491-1498.
O. Tuzel, F. Porikli, and P. Meer, “Human Detection via Classification on Riemannian Manifolds,” Computer Vision and Pattern Recognition, 2007. CVPR '07. IEEE Conference on, 2007, pp. 1-8.
R. Fergus, P. Perona, and A. Zisserman, “Object class recognition by unsupervised scale-invariant learning,” Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003, pp. 264-271.
Changjiang Yang, R. Duraiswami, and L. Davis, “Fast multiple object tracking via a hierarchical particle filter,” Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, 2005, pp. 212-219 Vol. 1.
V. Philomin, R. Duraiswami, and L.S. Davis, “Quasi-Random Sampling for Condensation,” Proceedings of the 6th European Conference on Computer Vision-Part II, Springer-Verlag, 2000, pp. 134-149.
B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, 2005, pp. 878-885 vol. 1.
M. Bertozzi, A. Broggi, R. Chapuis, F. Chausse, A. Fascioli, and A. Tibaldi, “Shape-based pedestrian detection and localization,” Intelligent Transportation Systems, 2003. Proceedings. 2003 IEEE, 2003, pp. 328-333 vol. 1.
M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp, “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking,” Signal Processing, IEEE Transactions on, vol. 50, 2002, pp. 174-188.
M. Isard and A. Blake, “CONDENSATION - conditional density propagation for visual tracking,” International Journal of Computer Vision, vol. 29, 1998, pp. 5-28.
S. Birchfield, “Elliptical Head Tracking Using Intensity Gradients and Color Histograms,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 1998, p.
K. Okuma, A. Taleghani, N. De Freitas, O. De Freitas, J.J. Little, and D.G. Lowe, “A Boosted Particle Filter: Multitarget Detection and Tracking,” ECCV, vol. 1, 2004, pp. 28-39.
V. Manohar, P. Soundararajan, H. Raju, D. Goldgof, R. Kasturi, and J. Garofolo, “Performance Evaluation of Object Detection and Tracking in Video,” Computer Vision – ACCV 2006, 2006, pp. 151-161.
H.G. Barrow, J.M. Tenenbaum, R.C. Bolles, and H.C. Wolf, “Parametric correspondence and chamfer matching: Two new techniques for image matching,” Proceedings of the 5th International Joint Conference on Artificial Intelligence, Cambridge, MA, 1977, pp. 659-663.
CASIA, CASIA Gait Database, http://www.sinobiometrics.com
F.C. Crow, “Summed-area tables for texture mapping,” Proceedings of the 11th annual conference on Computer graphics and interactive techniques, ACM, 1984, pp. 207-212.
T.L. McDaniel, S. Krishna, D. Colbry, and S. Panchanathan, “Using tactile rhythm to convey interpersonal distances to individuals who are blind,” Proceedings of the 27th international conference extended abstracts on Human factors in computing systems, Boston, MA, USA: ACM, 2009, pp. 4669-4674.
S. Krishna, T. McDaniel, and S. Panchanathan, “Haptic Belt for Delivering Nonverbal Communication Cues to People who are Blind or Visually Impaired,” 25th Annual International Technology & Persons with Disabilities, Los Angeles, CA: 25, 2009.
S. Krishna, N.C. Krishnan, and S. Panchanathan, “Detecting Stereotype Body Rocking Behavior through Embodied Motion Sensors,” Annual Conference of the Rehabilitation Engineering and Assistive Technology Society of North America, New Orleans, LA: 2009.