Image Segmentation by Data-Driven Markov Chain Monte Carlo by Flavio58


									IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,                      VOL. 24, NO. 5,   MAY 2002                                  657

                 Image Segmentation by Data-Driven
                     Markov Chain Monte Carlo
                                                    Zhuowen Tu and Song-Chun Zhu

       AbstractÐThis paper presents a computational paradigm called Data-Driven Markov Chain Monte Carlo (DDMCMC) for image
       segmentation in the Bayesian statistical framework. The paper contributes to image segmentation in four aspects. First, it designs
       efficient and well-balanced Markov Chain dynamics to explore the complex solution space and, thus, achieves a nearly global optimal
       solution independent of initial segmentations. Second, it presents a mathematical principle and a K-adventurers algorithm for
       computing multiple distinct solutions from the Markov chain sequence and, thus, it incorporates intrinsic ambiguities in image
       segmentation. Third, it utilizes data-driven (bottom-up) techniques, such as clustering and edge detection, to compute importance
       proposal probabilities, which drive the Markov chain dynamics and achieve tremendous speedup in comparison to the traditional jump-
       diffusion methods [12], [11]. Fourth, the DDMCMC paradigm provides a unifying framework in which the role of many existing
       segmentation algorithms, such as, edge detection, clustering, region growing, split-merge, snake/balloon, and region competition, are
       revealed as either realizing Markov chain dynamics or computing importance proposal probabilities. Thus, the DDMCMC paradigm
       combines and generalizes these segmentation methods in a principled way. The DDMCMC paradigm adopts seven parametric and
       nonparametric image models for intensity and color at various regions. We test the DDMCMC paradigm extensively on both color and
       gray-level images and some results are reported in this paper.

       Index TermsÐImage segmentation, Markov Chain Monte Carlo, region competition, data clustering, edge detection, Markov random



I   MAGE segmentation is a long standing problem in computer
   vision and it is found difficult and challenging for two main
                                                                                       First, we formulate the problem in a Bayesian/MDL
                                                                                    framework [15], [14], [29] with seven families of image
                                                                                    models which compete to explain various visual patterns in
    The first challenge is the fundamental complexity of                            an image, for example, flat regions, clutter, texture, smooth
modeling a vast amount of visual patterns that appear in                            shading, etc.
generic images. The objective of image segmentation is to                              Second, we decompose the solution space into a union of
parse an image into its constituent components. The latter                          many subspaces of varying dimensions and each subspace
are various stochastic processes, such as attributed points,                        is a product of a number of subspaces for the image
lines, curves, textures, lighting variations, and deformable                        partition and image models (see Fig. 3 for a space structure).
objects. Thus, a segmentation algorithm must incorporate                            The Bayesian posterior probability is distributed over such
many families of image models and its performance is                                a heterogeneously structured space.
upper bounded by the accuracy of its image models.                                     Third, we design ergodic Markov chains to explore the
    The second challenge is the intrinsic ambiguities in
                                                                                    solution space and sample the posterior probability. The
image perception, especially when there is no specific task
                                                                                    Markov chains consist of two types of dynamics: jumps and
to guide the attention. Real world images are fundamentally
                                                                                    diffusion. The jump dynamics simulate reversible split-and-
ambiguous and our perception of an image changes over
time. Furthermore, an image often demonstrates details at                           merge and model switching. The diffusion dynamics
multiple scales. Thus, the more one looks at an image, the                          simulate boundary deformation, region growing, region
more one sees. Therefore, it must be wrong to think that a                          competition [29], and model adaptation. We make the split
segmentation algorithm outputs only one result. In our                              and merge processes reversible and the ergodicity and
opinion, image segmentation should be considered a                                  reversibility enable the algorithm to achieve nearly global
computing process not a vision task. It should output multiple                      optimal solution independent of initial segmentation condi-
distinct solutions dynamically and endlessly so that these                          tions. Thus, this demonstrates major progress over the
solutions ªbest preserveº the intrinsic ambiguity.                                  previous region competition algorithm (Zhu and Yuille) [29].
    Motivated by the above two observations, we present a                              Fourth, we utilize data-driven techniques to guide the
stochastic computing paradigm called data-driven Markov                             Markov chain search and, thus, achieves tremendous speed-
chain Monte Carlo (DDMCMC) for image segmentation. We                               up in comparison to previous MCMC algorithms [10], [12],
proceed in five steps.                                                              [11]. In the literature, there are various techniques for
                                                                                    improving the Markov chain speed, such as multiresolution
                                                                                    approaches [28], [3], causal Markov models [3], [22], and
. The authors are with the Department of Computer and Information                   clustering [28], [27], [2], [9]. In our DDMCMC paradigm, data-
  Science, Ohio State University, 2015 Neil Ave., Columbus, OH 43210.
  E-mail: {ztu, szhu}
                                                                                    driven methods, such as edge detection [4] and tracing, data
                                                                                    clustering [5], [6] are used. The results of these algorithms are
Manuscript received 7 June 2001; accepted 26 Nov. 2001.
Recommended for acceptance by D. Forsyth.
                                                                                    expressed as weighted samples (or particles), which
For information on obtaining reprints of this article, please send e-mail to:       encode nonparametric probabilities in various subspaces., and reference IEEECS Log Number 114320.                         These probabilities, respectively, approximate the marginal
                                                                0162-8828/02/$17.00 ß 2002 IEEE
658                                             IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,                         VOL. 24, NO. 5,   MAY 2002

probabilities of the Bayesian posterior probability and they                            Thus, a segmentation is denoted by a vector of hidden
are used to design importance proposal probabilities to drive                        variables W , which describes the world state for generating
the Markov chains.                                                                   the image I.
    Fifth, we propose a mathematical principle and a
ªK-adventurersº algorithm for selecting and pruning a set                                         W ˆ …K; f…Ri ; `i ; Âi †Y i ˆ I; P; F F F ; Kg†:
of important and distinct solutions from the Markov chain                               In a Bayesian framework, we make inference about W
sequence and at multiple scales of details. The set of                               from I over a solution space .
solutions encode an approximation to the Bayesian poster-
ior probability. The multiple solutions are computed to                                             W $ p…W jI† G p…IjW †p…W †; W P :
minimize a Kullback-Leibler divergence from the approx-
                                                                                        As we mentioned before, the first challenge in segmenta-
imative posterior to the true posterior and they preserve the
                                                                                     tion is to obtain realistic image models. In the following, we
ambiguities in image segmentation.
                                                                                     briefly discuss the prior and likelihood models selected in
    In summary, the DDMCMC paradigm is about effec-
                                                                                     our experiments.
tively creating particles (by bottom-up clustering/edge
detection), composing particles (by importance proposals),                           2.2 The Prior Probability p…W †
and pruning particles (by a K-adventurers algorithm) and                             The prior probability p…W † is a product of the following
these particles represent hypotheses of various grainula-                            four probabilities.
rities in the solution space.
    Conceptually, the DDMCMC paradigm also reveals the                                  1.    An exponential model for the number of regions
roles of some well-known segmentation algorithms. Algo-                                       p…K† G eÀH K .
rithms such as split-and-merge, region growing, Snake [13]                              2.                              H
                                                                                              A general smoothness Gibbs prior for the region
and balloon/bubble [25], region competition [29], and                                         boundaries p…À† G e
variational methods [14], and PDEs [24] can be viewed as                                3.    A model for the size of each region. Recently, both
various MCMC jump-diffusion dynamics with minor mod-                                          empirical and theoretical studies [1], [16] on the
ifications. Other algorithms, such as edge detection [4] and                                  statistics of natural images indicate that the size of a
clustering [6], [8] compute importance proposal probabilities.                                region A ˆ jRj in natural images follows a distribu-
    We test the algorithm on a wide variety of gray-level and                                                I
                                                                                              tion, p…A† G A ;  $ P:H. Such a prior encourages large
color images and some results are shown in the paper. We                                      regions to form. In our experiments, we found this
also demonstrate multiple solutions and verify the segmen-                                    prior is not strong enough to enforce large regions;
tation results by synthesizing (reconstructing) images                                        instead we take a distribution
through sampling the likelihood models.
                                                                                                                         p…A† G eÀ
A ;                         …P†
2     PROBLEM FORMULATION                    AND IMAGE          MODELS                      where c ˆ H:W is a constant. 
 is a scale factor which
In this section, we formulate the problem in a Bayesian                                     controls the scale of the segmentation. This scale
framework and discuss the prior and likelihood models that                                  factor is in spirit similar to the ªclutter factorº found
                                                                                            by Mumford and Gidas [20] in studying natural
are selected in our experiments.
                                                                                            images. It is an indicator for how ªbusyº an image is.
2.1 The Bayesian Formulation for Segmentation                                               In our experiments, it is typically set to 
 ˆ P:H and is
Let à ˆ f…i; j† X I i L; I j Hg be an image lattice                                         the only free parameter in this paper.
and Ià an image defined on Ã. For any point v P Ã, Iv P                                 4. The prior for the type of model p…`† is assumed to be
fH; F F F ; Gg is the pixel intensity for a gray-level image, or                            uniform and the prior for the parameters  of an
Iv ˆ …Lv ; Uv ; Vv † for a color image.1 The problem of image                               image model penalizes model complexity in terms of
segmentation refers to partitioning the lattice into an                                     the number of parameters Â, p…Âj`† G eÀjÂj .
unknown number of K disjoint regions                                                    In summary, we have the following prior model

                à ˆ ‘K Ri ;         Ri ’ Rj ˆ Y; Vi Tˆ j:                     …I†                            Y
                                                                                             p…W † G p…K†          p…Ri †p…`i †p…Âi j`i †
Each region R & Ã needs not to be connected for reason of                                           (
occlusion. We denote by Ài ˆ @Ri the boundary of Ri . As a                                                 X I
                                                                                                           K                                          )
slight complication, two notations are used interchangeably                                  G exp ÀH K À                       ds ‡ 
jRi j ‡ jÂi j :
                                                                                                                   iˆI      @Ri
in the literature. One treats a region R & Ã as a discrete
label map and the other treats a region boundary À…s† ˆ @R
as a continuous contour parameterized by s. The contin-                              2.3 The Likelihood p…IjW † for Gray-Level Images
uous representation is convenient for diffusions while the                           Visual patterns in different regions are assumed to
label map representation is better for maintaining the                               be independent stochastic processes specified by
topology. The level set method [23], [24] provides a good                            …Âi ; `i †; i ˆ I; P; F F F ; K. Thus, the likelihood is,2
transform between the two.
   Each image region IR is supposed to be coherent in the                                                                Y

sense that IR is a realization from a probabilistic model p…IR Y †.                                      p…IjW † ˆ           p…IRi Y Âi ; `i †:
represents a stochastic process whose type is indexed by `.

   1. We transfer the (R, G, B) representation to (Là ; uà ; và ) for better color      2. As a slight notation complication, Â; ` could be viewed as parameters
distance measure.                                                                    or hidden variables in W We use p…I; Â; `† in both situtations for simplicity.
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                                  659

Fig. 1. Four types of regions in the windows are typical in real world images: (a) uniform, (b) clutter, (c) texture, and (d) shading.
The choice of models needs to balance model sufficiency                                  p…IR Y Â; gQ † ˆ          p…Iv jI@v Y †
and computational efficiency. In studying a large image set,                                                 vPR
                                                                                                             Y I                                        …S†
we found that four types of regions appear most frequently                                               ˆ         expfÀ < Â; h…Iv jI@v † >g;
in real world images. Fig. 1 shows examples for the four                                                        Z
                                                                                                             vPR v
types of regions in windows: Fig. 1a shows the flat regions
with no distinct image structures, Fig. 1b shows the                                  where < Á; Á > is the inner product between two
cluttered regions, Fig. 1c shows the regions with homo-                               vectors and the model is considered nonparametric.
                                                                                      The reason for choosing the pseudolikelihood
geneous textures, and Fig. 1d shows the inhomogeneous
                                                                                      expression is obvious: Its normalizing constant can
regions with globally smooth shading variations.
                                                                                      be computed exactly and  can be estimated easily
   We adopt the following four families of models for the
                                                                                      from images. We refer to a recent paper [31] for
four types of regions. The algorithm can switch between                               discussions on the computation of this model and its
them by Markov chain jumps. The four families are indexed                             variations, such as patch likelihood, etc.
by ` P fgI ; gP ; gQ ; gR g and denoted by $gI , $gP , $gQ , and $gR ,          4.    Gray image model family ` ˆ gR : $gR . The first three
respectively. Let G…HY P † be a Gaussian density centered at                         families of models are homogeneous, which fail in
H with variance P .                                                                  characterizing regions with shading effects, such as
                                                                                      sky, lake, wall, perspective texture, etc. In the
   1.   Gray image model family ` ˆ gI : $gI . This assumes                           literature, such smooth regions are often modeled
        that pixel intensities in a region R are subject to                           by low order Markov random fields, which again do
        independently and identically distributed (iid)                               not model the inhomogeneous pattern over space
        Gaussian distribution,                                                        and often lead to over-segmentation. In our experi-
                           Y                                                          ments, we adopt a 2D Bezier-spline model with
          p…IR Y Â; gI † ˆ   G…Iv À Y P †;  ˆ …; † P $gI :
                                                                                      sixteen equally spaced control points on à (i.e., we
                                                                                      fix the knots). This is a generative type model. Let
                                                                        …Q†           B…x; y† be the Bezier surface, for any v ˆ …x; y† P Ã,
                                                                                                       B…x; y† ˆ U…x†  M  U…y† ;                      …T†
   2.   Gray image model family ` ˆ gP : $gP . This is a
        nonparametric intensity histogram h…†. In practice                            where
        h…† is discretized as a step function expressed by a
        vector …hH ; hI ; F F F ; hG †. Let nj be the number of pixels                      U…x† ˆ ……I À x†Q ; Qx…I À x†P ; QxP …I À x†; xQ ††T
        in R with intensity level j.
                                        Y                Y
                     p…IR Y Â; gP † ˆ         h…Iv † ˆ         hj j ;                       M ˆ …mII ; mIP ; mIQ ; mIR Y F F F Y mRI ; F F F ; mRR †:
                                        vPR              jˆH            …R†
                                                                                      Therefore, the image model for a region R is,
                      ˆ …hH ; hI ; F F F ; hG † P $gP :
   3.   Gray image model family ` ˆ gQ : $gQ . This is a texture                      p…IR Y Â; gR † ˆ         G…Iv À Bv Y P †;     ˆ …M; † P $gR :
        model FRAME [30] with pixel interactions captured
        by a set of Gabor filters. This family of models was                                                                                            …U†
        demonstrated to be sufficient in realizing a wide
        variety of texture patterns. To facilitate the computa-                  In summary, four types of models compete to explain a
        tion, we choose a set of eight filters and formulate the              gray intensity region. Whoever fits the region better will have
        model in pseudolikelihood form [31]. The model is                     a higher likelihood. We denote by $g the gray-level model
        specified by a long vector  ˆ …I ; P ; F F F ; m † P $gQ ,        space,
        m is the total number of bins in the histograms of the
                                                                                               P $g ˆ $gI ‘ $gP ‘ $gQ ‘ $gR :
        eight Gabor filtered images. Let @v denote the Markov                                      Â
        neighborhood of v P R and h…Iv jI@v † the vector
        including eight local histograms of filter responses                  2.4 Model Calibration
        in the neighborhood of pixel v. Each of the filter                    The four image models should be calibrated for two reasons.
        histograms counts the filter responses at pixels whose                First, for computational efficiency, we prefer simple models
        filter windows cover v. Thus, we have                                 with less parameters. However, penalizing the number of
660                                               IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,                  VOL. 24, NO. 5,   MAY 2002

parameters is not enough in practice. When a region is of size
over $ IHH pixels, the data term dominates the prior and
demands more complex models. Second, the pseudolikeli-
hood models in family $gQ are not a true likelihood as they
depend on a rather big neighborhood; thus they are not
directly comparable to the other three types of models.
   To calibrate the likelihood probabilities, we did an
empirical study. We collected a set of typical regions from
natural images and manually divided them into four
categories. For example, Fig. 2 shows four typical images
in the first column, which are cropped from the images in
Fig. 1. We denote the four images by Io˜s ; i ˆ I; P; Q; R on a
lattice Ão . For each image Io˜s , we compute its per pixel
coding length (minus log-likelihood) according to an
optimal model within family $gj computed by a maximum
likelihood estimation for j ˆ I; P; Q; R.

                            log p…Io˜s Y Â; gj †
           Lij ˆ min À                           ;    for I    i; j   R:   …V†
                 $gj QÂ           jÃo j
We denote by ÂÃ P $gj optimal fit within each family and
draw a typical sample (synthesis) from each fitted model,

                 Isyn $ p…IY Âà ; gj †;
                  ij          ij              for I     i; j   R:

   We show         Io˜s , Isyn ,
                    i   and Lij in Fig. 2 for I i; j R.
                           ij                                                    Fig. 2. Comparison study of four families of models. The first column is
   The results in Fig. 2 show that the spline model has                          the original image regions cropped from four real world images shown in
                                                                                 Fig. 1. The images in the 2-5 columns are synthesized images Isyn $
obviously the shortest coding length for the shading region,                                                                                        ij
                                                                                 p…IR Y Âà † sampled from the four families, respectively, each after an
while the texture model fits the best for the three other                        MLE fitting. The number below each synthesized image shows the per-
regions. Then, we choose to rectify these models by a                            pixel coding bits Lij using each family of model.
constant factor eÀcj for each pixel v,
                                                                                     3.   Color image model family cQ : $cQ . We use three
           p…Iv Y Â; gj † ˆ p…Iv Y Â; gj †eÀcj
           ”                                         for j ˆ I; P; Q; R:
                                                                                          Bezier spline surfaces (see (6)) for Là , uà , and và ,
cj ; j ˆ I; P; Q; R are chosen so that the rectified coding                               respectively, to characterize regions with gradually
length Lij reaches minimum when i ˆ j. That is, uniform                                   changing colors such as sky, wall, etc. Let B…x; y† be
regions, clutter regions, texture regions, and shading                                    the color value in …Là ; uà ; và † space for any
regions are best fitted by the models in $I , $P , $Q , and                               v ˆ …x; y† P Ã,
$R , respectively.                                                                                         T                  T
                                                                                                B…x; y† ˆ…U…x†  ML  U…y† ; U…x†  Mu  U…y† ;
2.5 Image Models for color
                                                                                                            U…x†  Mv  U…y† †T :
In experiments, we work on both gray-level and color
images. For color images, we adopt a …Là ; uà ; và † color                                Thus, the model is
space and adopted three families of models indexed by                                                                      Y
` P fcI ; cP ; cQ g. Let G…HY Ɔ denote a 3D Gaussian density.                                          p…IR Y Â; cQ † ˆ         G…Iv À Bv Y Ɔ;
      1.   Color image model family cI : $cI . This is an iid
           Gaussian model in …Là ; uà ; và † space.                                      where  ˆ …ML ; Mu ; Mv ; Ɔ are the parameters.
                             Y                                                      In summary, three types of models compete to explain a
            p…IR Y Â; cI † ˆ                        
                               G…Iv À Y Ɔ;  ˆ …; Ɔ P $cI :                  color region. Whoever fits the region better will have higher
                                   vPR                                           likelihood. We denote by $c the color model space, then
                                                                                                          $c ˆ $cI ‘ $cP ‘ $cQ :

      2.   Color image model family cP : $cP . This is a mixture of
           two Gaussians and is used for modeling textured color                 3    ANATOMY         OF    SOLUTION SPACE
           regions,                                                              Before we design an algorithm, we need to study the
                                     Y                                           structures of the solution space  in which the posterior
                    p…IR Y Â; cP † ˆ   ‰I G…Iv À  I Y ÆI †                     probability p…W jI† is distributed.
                                                                                    We start with the partition space for all possible partitions
                                         ‡ P G…Iv À  P Y ÆP †Š:                of a lattice Ã. When a lattice à is segmented into k disjoint
           Thus,                                                                 regions, we call it a k-partition denoted by k ,

                           ˆ …I ; I ; ÆI ; P ;  P ; ÆP † P $cP               k ˆ …RI ; RP ; F F F ; Rk †;   ‘k Ri ˆ Ã;
                                                                                                                   iˆI               Ri ’ Rj ˆ Y; Vi Tˆ j:
           are the parameters.
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                      661

Fig. 3. The anatomy of the solution space. The arrows represent Markov chain jumps, and the reversible jumps between two subspace V and W
realize a split-and-merge of a region.

If all pixels in each region are connected, then k is a connected        4.1 Three Basic Criteria for Markov Chain Design
component partition [28]. The set of all k-partitions, denoted            There are three basic requirements for Markov chain design.
by $k , is a quotient space of the set of all possible k-colorings          First, the Markov chain should be ergodic. That is, from
divided by a permutation group €q for the labels.                         an arbitrary initial segmentation Wo P , the Markov chain
                                                                          can visit any other states W P  in finite time. This
                  $k ˆ f…RI ; RP ; F F F ; Rk † ˆ k Y                   disqualifies all greedy algorithms. Ergodicity is ensured
                  jRi j > H; Vi ˆ I; P; F F F ; kg=€q:                    by the jump-diffusion dynamics [12]. Diffusion realizes
                                                                          random moves within a subspace of fixed dimensions.
Thus, we have a general partition space $ with the number
                                                                          Jumps realize reversible random walks between subspaces
of regions I k jÃj,
                                                                          of different dimensions, as shown by the arrows in Fig. 3.
                           $ ˆ ‘kˆI $k :                                   Second, the Markov chain should be aperiodic. This is
                                                                          ensured by using the dynamics at random.
   Then, the solution space for W is a union of subspaces k                 Third, the Markov chain has stationary probability
and each k is a product of one k-partition space $k and                 p…W jI†. This is replaced by a stronger condition of detailed
k spaces for the image models                                             balance equations which demands that every move should be
                         2                       3                        reversible [12], [11]. The jumps in this paper all satisfy
                                                                          detailed balance and reversibility.
       ˆ ‘kˆI k ˆ ‘kˆI 4 $k  $  Á Á Á  $ 5;
           jÃj       jÃj
                                                 k                        4.2 Five Markov Chain Dynamics
where $ ˆ      ‘R $gi        for gray-level images and $ ˆ              We adopt five types of Markov chain dynamics which are
‘Q $ci for color images.                                                  used at random with probabilities q…I†; F F F ; q…S†, respectively.
    Fig. 3 illustrates the structures of the solution space. In           The dynamics 1-2 are diffusion and dynamics 3-5 are
Fig. 3, the four image families $` ; ` ˆ gI ; gP ; gQ ; gR are            reversible jumps.
represented by the triangles, squares, diamonds, and                         Dynamics 1: Boundary Diffusion/Competition. For mathe-
circles, respectively. $ ˆ $g is represented by a hexagon                matical convenience, we switch to a continuous boundary
                                                                          representation for regions Ri ; i ˆ I; F F F ; K. These curves
containing the four shapes. The partition space $k is
                                                                          evolve to maximize the posterior probability through a
represented by a rectangle. Each subspace k consists of a
                                                                          region competition equation [29]. Let Àij be the boundary
rectangle and k hexagons, and each point W P k represents
                                                                          between Ri ; Rj , Vi; j, and Âi ; Âj the models for the two
a k-partition plus k image models for k regions.
    We call k the scene spaces. $k and $` ; ` ˆ gI ; gP ; gQ ; gR (or   regions, respectively. The motion of points Àij …s† ˆ
` ˆ cI ; cP ; cQ ) are the basic components for constructing  and,       …x…s†; y…s†† follows the steepest ascent equation of the
thus, are called the atomic spaces. Sometimes, we call $ a               log p…W jI† plus a Brownian motion dB along the curve
partition space and $` ; ` ˆ gI ; gP ; gQ ; gR ; cI ; cP ; cQ the cue                        n
                                                                          normal direction ~…s†. By variational calculus, this is [29],
spaces.                                                                     dÀij …s†
4    EXPLORING THE SOLUTION SPACE                         BY                                     p…I…x…s†; y…s††Y Âi ; `i † p
                                                                              fprior …s† ‡ log                              ‡ PT …t†dB ~…s†:n
     ERGODIC MARKOV CHAINS                                                                       p…I…x…s†; y…s††Y Âj ; `j †
The solution space in Fig. 3 is typical for vision problems.              The first two terms are derived from the prior and
The posterior probability p…W jI† not only has an enormous                likelihood, respectively. The Brownian motion is a normal
number of local maxima but is distributed over subspaces                  distribution whose magnitude is controlled by a tempera-
of varying dimensions. To search for globally optimal                     ture T …t† which decreases with time t. The Brownian motion
solutions in such spaces, we adopt the Markov chain Monte                 helps to avoid local small pitfalls. The log-likelihood ratio
Carlo (MCMC) techniques.                                                  requires that the image models are comparable. Dynamics 1
662                                         IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,              VOL. 24, NO. 5,    MAY 2002

realizes diffusion within the atomic (or partition) space $k               images) for a region Ri . For example, from texture
(i.e., moving within a rectangle of Fig. 3).                                description to a spline surface, etc.
    Dynamics 2: Model Adaptation. This is simply to fit the
parameters of a region by steepest ascent. One can add a                                W ˆ …`i ; Âi ; WÀ †23…`Hi ; ÂHi ; WÀ † ˆ W H :
Brownian motion, but it does not make much of a difference in
                                                                            The proposal probabilities are
                                                                                     G…W 3 dW H † ˆ q…S†q…Ri †q…`Hi †q…ÂHi jRi ; `Hi †dW H ;     …IT†
                        dÂi @ log p…IRi Y Âi ; `i †
                            ˆ                                                              H
                                                                                     G…W 3 dW † ˆ q…S†q…Ri †q…`i †q…Âi jRi ; `i †dW :            …IU†
                         dt        @Âi :
This realizes diffusion in the atomic (or cue) spaces $l ; ` P
fgI ; gP ; gQ ; gR ; cI ; cP ; cQ g (move within a triangle, square,        4.3 The Bottlenecks
diamond, or circle of Fig. 3).                                              The speed of a Markov chain depends critically on the design
   Dynamics 3-4: Split and Merge. Suppose at a certain time                 of its proposal probabilities in the jumps. In our experiments,
step, a region Rk with model Âk is split into two regions Ri and            the proposal probabilities, such as q…I†; F F F ; q…S†, q…Rk †,
Rj with models Âi ; Âj , or vice verse, and this realizes a jump            q…Ri ; Rj †, q…`† are easy to specify and do not influence the
between two states W to W H as shown by the arrows in Fig. 3.               convergence significantly. The real bottlenecks are caused by
                                                                            two proposal probabilities in the jump dynamics.
         W ˆ …K; …Rk ; `k ; Âk †; WÀ †2    3
            …K ‡ I; …Ri ; `i ; Âi †; …Rj ; `j ; Âj †; WÀ † ˆ W H ;             1.    q…ÀjR† in (13): Where is a good À for splitting a given
                                                                                     region R? q…ÀjR† is a probability in the atomic (or
where WÀ denotes the remaining variables that are un-                                partition) space $ .
changed during the move. By the classic Metropolis-                            2. q…ÂjR; `† in (13), (15), and (17): For a given region R
Hastings method [19], we need two proposal probabilities                             and a model family ` P fgI ; F F F ; gR ; cI ; cP ; cQ g, what is
G…W 3 dW H † and G…W H 3 dW †. G…W 3 dW H † is a condi-                              a good Â? q…ÂjR; `† is a probability in the atomic
tional probability for how likely the Markov chain proposes                          (cue) space $` .
to move to W H at state W and G…W H 3 dW † is the proposal
probability for coming back. The proposed split is then                        It is worth mentioning that both probabilities q…ÀjR† and
accepted with probability                                                   q…ÂjR; `† cannot be replaced by deterministic decisions
                                                                          which were used in region competition [29] and others [15].
                 H           G…W H 3 dW †p…W H jI†dW H                      Otherwise, the Markov chain will not be reversible and,
     …W 3 dW † ˆ min I;                                 :
                             G…W 3 dW H †p…W jI†dW                          thus, reduce to a greedy algorithm. On the other hand, if we
                                                                            choose uniform distributions, it is equivalent to blind search
    There are two routes (or ªpathwaysº in a psychology                     and the Markov chain will experience exponential ªwait-
language) for computing the split proposal G…W 3 dW H †.                    ingº time before each jump. In fact, the length of the waiting
    In route 1, it first chooses a split move with probability
                                                                            time is proportional to the volume of the cue spaces. The
q…Q†, then chooses region Rk from a total of K regions at
                                                                            design of these probabilities need to strike a balance
random; we denote this probability by q…Rk †. Given Rk , it
                                                                            between speed and robustness (nongreediness).
chooses a candidate splitting boundary Àij within Rk with                      While it is hard to analytically derive a convergence
probability q…Àij jRk †. Then, for the two new regions Ri ; Rj ,
                                                                            rate for complicated algorithms that we are dealing with, it
it chooses two new model types `i and `j with probabilities
                                                                            is revealing to observe the following theorem in a simple
q…`i † and q…`j †, respectively. Then, it chooses Âi P $`i with
                                                                            case [18]:
probability q…Âi jRi ; `i † and chooses Âj with probability
q…Âj jRj ; `j †. Thus,                                                      Theorem 1. Sampling a target density p…x† by independence
                                                                              Metropolis-Hastings algorithm with proposal probability q…x†.
       G…W 3 dW H † ˆ q…Q†q…Rk †q…Àij jRk †q…`i †                             Let P n …xo ; y† be the probability of a random walk to reach
                      q…Âi jRi ; `i †q…`j †q…Âj jRj ; `j †dW H :              point y at n steps. If there exists  > H such that,

   In route 2, it first chooses two new region models Âi and                                            q…x†
                                                                                                             ! ;      Vx;
Âj and, then, decides the boundary Àij . Thus,                                                          p…x†

  G…W 3 dW H † ˆ q…Q†q…Rk †q…`i †q…`j †q…Âi ; Âj jRk ; `i ; `j †               then the convergence measured by a LI norm distance
                 q…Àij jRk ; Âi ; Âj †dW H :                                                    jjP n …xo ; Á† À pjj   …I À †n :
   We shall discuss in later a section that either of the two
                                                                               This theorem states that the proposal probability q…x†
routes can be more effective than the other depending on
the region Rk .                                                             should be very close to p…x† for fast convergence. In our
   Similarly, we have the merge proposal probability,                       case, q…ÀjR† and q…ÂjR; `† should be equal to the
                                                                            conditional probabilities of some marginal probabilities
      G…W H 3 dW † ˆ q…R†q…Ri ; Rj †q…`k †q…Âk jRk ; `k †dW :        …IS†   of the posterior p…W jI† within the atomic spaces $ and
                                                                            $` , respectively. That is,
q…Ri ; Rj † is the probability of choosing to merge two regions
Ri and Rj at random.                                                           Ã                                 Ã
                                                                              qÀ …Àij jRk † ˆ p…Àij jI; Rk †;   q …ÂjR; `† ˆ p…ÂjI; R; `†; V`:
   Dynamics 5: Switching Image Models. This switches the                                                                                         …IV†
image model within the four families (three for color
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                               663

Fig. 4. A color image and its clusters in …Là ; uà ; và † space for $cI , the second row are six of the saliency maps associated with the color clusters.

                    Ã        Ã
   Unfortunately, qÀ and q have to integrate information                     strategy with a limit m ˆ j… ` j. This is well discussed in the
from the entire image I and, thus, are intractable. We must                   literature [5], [6].
seek approximations and this is where the data-driven                             In the clustering algorithms, each feature Fv and, thus, its
                                                                                                                      `                   `
methods step in.                                                              location v is classified to a cluster Âi with probability Si;v ,
   In the next section, we discuss data clustering for each
atomic space $` ; ` P fcI ; cP ; cQ g and ` P fgI ; gP ; gQ ; gR g and             `        `
                                                                                  Si;v ˆ p…Fv Y Â` †;       with           `
                                                                                                                          Si;v ˆ I;   Vv P Ã; V`:
edge detection in $ . The results of clustering and edge                                                           iˆI
detection are expressed as nonparametric probabilities for
                                                         Ã         Ã          This is a soft assignment and can be computed by the
approximating the ideal marginal probabilities qÀ and q in
                                                                              distance from Fv to the cluster centers. We call
these atomic spaces, respectively.
                                                                                        Si` ˆ fSi;v X v P Ãg;      for i ˆ I; P; F F F ; m; V`      …PH†
5     DATA-DRIVEN METHODS                                                     a saliency map associated with cue particle Â` .
5.1   Method I: Clustering in Atomic Spaces $`                                   In the following, we discuss each model family with
      for ` P fcI ; cP ; cQ ; gI ; gP ; gQ ; gR g                             experiments.
Given an image I (gray or color) on lattice Ã, we extract a
feature vector Fv at each pixel v P Ã. The dimension of Fv`                   5.1.1 Computing Cue Particles in $cI
depends on the image model indexed by `. Then, we have a                      For color images, we take Fv ˆ …Lv ; Uv ; Vv † and apply a
collection of vectors                                                         mean-shift algorithm [5], [6] to compute color clusters in
                                                                              $cI . For example, Fig. 4 shows a few color clusters (balls) in
                         … ` ˆ fFv X v P Ãg:                                  a cubic (…Là ; uà ; và †-space) for a simple color image (left), the
                                                                              size of the balls represents the weights !cI . Each cluster is
In practice, v can be subsampled for computational ease.
                                                                              associated with a saliency map SicI for i ˆ I; P; F F F ; T in the
The set of vectors are clustered by either an EM method [7]                   second row and the bright areas mean high probabilities.
or a mean-shift clustering [5], [6] algorithm to … ` . The EM-                From left to right are, respectively, background, skin, shirt,
clustering approximates the points density in … ` by a                        shadowed skin, pant and hair, highlighted skin.
mixture of m Gaussians and it extends from the m-mean
clustering by a soft cluster assignment to each vector Fv .                   5.1.2 Computing Cue Particles in $cQ
The mean-shift algorithm assumes a nonparametric dis-                         Each point v contributes its color Iv ˆ …Lv ; Uv ; Vv † as ªsur-
tribution for … ` and seeks the modes (local maxima) in its                   face heightsº and we apply an EM-clustering to find the
density (after some Gaussian window smoothing). Both                          spline surface models. Fig. 5 shows the clustering result for
algorithms return a list of m weighted clusters Â` ; Â` ; F F F ; Â`
                                                     I P           m          the woman image. Fig. 5a, Fig. 5b, Fig. 5c, and Fig. 5d are
with weights !` ; i ˆ I; P; F F F ; m and we denote by
                i                                                             saliency maps SicQ for i ˆ I; P; Q; R. Fig. 5e, Fig. 5f, Fig. 5g,
                                                                              and Fig. 5h are the four reconstructed images according to
               € ` ˆ f …!` ; Â` † X i ˆ I; P; F F F ; m: g:
                         i    i                                      …IW†     fitted spline surfaces which recover some global illumina-
We call    …!` ; Â` †   a weighted atomic (or cue) particle in $`             tion variations.
             i    i
for ` P fcI ; cQ ; gI ; gP ; gQ ; gR g.3 The size m is chosen to be
                                                                              5.1.3 Computing Cue Particles in $gI
conservative or it can be computed in a coarse-to-fine
                                                                              In this model, the feature space Fv ˆ Iv is simply the
  3. The atomic space $cP is a composition of two $cI and, thus, is           intensity and … gI is the image intensity histogram. We
computed from $cI .                                                           simply apply a mean-shift algorithm to get the modes
664                                          IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,                     VOL. 24, NO. 5,   MAY 2002

Fig. 5. (a)-(d) are saliency maps associated with four clusters in $cQ . (e)-(h) are the color spline surfaces for the four clusters.

Fig. 6. A clustering map (left) for $gI and six saliency maps SigI ; i ˆ I::T of a zebra image (input is in Fig. 14a).

Fig. 7. A clustering map (left) for $gP and six saliency maps SigP ; i ˆ I; F F F ; T of a zebra image (input is in Fig. 14a).

(peaks) of the histogram and the breadth of each peak                                                     g
                                                                                 nine bins. Then, Fv Q ˆ …hv;I;I ; F F F ; hv;V;W † is the feature. An EM
decides its variance.                                                            clustering is applied to find the m mean histograms
   Fig. 6 shows six saliency maps SigI ; i ˆ I; P; F F F ; T for a zebra         "
                                                                                 hi ; i ˆ I; P; F F F ; m. We can compute the cue particles for
image (the original image is shown in Fig. 14a. In the                                                            "
clustering map on the left in Fig. 6, each pixel is assigned to its              texture models ÂgQ from hi for i ˆ I; P; F F F ; m. A detailed
most likely particle. We show the clustering by pseudocolors.                    account of this transform is referred to a previous paper [31].
                                                                                     Fig. 8 shows the texture clustering results on the zebra
5.1.4 Computing the Cue Particles in $gP                                         image with one clustering map on the left and five saliency
For clustering in $gP , at each subsampled pixel v P Ã, we                       maps for five particles ÂgQ ; i ˆ I; P; F F F ; S.
compute Fv as a local intensity histogram Fv ˆ …hvH ; F F F ; hvG †
accumulated over a local window centered at v. Then, an                          5.1.6 Computing Cue Particles in $gR
EM clustering is applied to compute the cue particles and                        Each point v contributes its intensity Iv ˆ Fv as a ªsurface
each particle ÂgP ; i ˆ I; F F F ; m is a histogram. This model is
                i                                                                heightº and we apply an EM-clustering to find the spline
used for clutter regions.                                                        surface models. Fig. 9 shows a clustering result for the zebra
   Fig. 7 shows the clustering results on the same zebra image.                  image with four surfaces. The second row shows the four
                                                                                 surfaces which recover some global illumination variations.
5.1.5 Computing Cue Particles $gQ
                                                                                 Unlike the texture clustering results which capture the zebra
At each subsampled pixel v P Ã, we compute a set of eight
local histograms for eight filters over a local window of                        strips as a whole region, the surface models separate the black
IP Â IP pixels. We choose eight filters for computational                        and white stripes as two regionsÐanother valid perception.
convenience: one  filter, two gradient filters, one Laplacian of                Interestingly, the black and white strips in the zebra skin both
Gaussian filter, and four Gabor filters. Each histogram has                      have shading changes which are fitted by the spline models.
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                                      665

Fig. 8. Texture clustering. A clustering map (left) and five saliency maps for five particles ÂgQ ; i ˆ I; P; F F F ; S.

Fig. 9. Clustering result on the zebra image under Bezier surface model. The left image is the clustering map. The first row of images on the right side
are the saliency maps. The second row shows the fitted surfaces using the surface height as intensity.

                                                                                                        m                                X

5.2 Method II: Edge Detection                                                           q…ÂjÃ; `† ˆ            !` G…Â À Â` †;
                                                                                                                i        i        with           !` ˆ I;
                                                                                                                                                  i        …PI†
                                                                                                        i ˆI                              i ˆI
We detect intensity edges using Canny edge detector [4]
and color edges using a method in [17] and trace edges to                         where G…x† is a Parzen window centered at H. As a
form a partition of the image lattice. We choose edges at                         matter of fact, q…ÂjÃ; `† ˆ q…ÂjI†† is an approximation to a
three scales according to edge strength and, thus, compute                        marginal probability of the posterior p…W jI† on cue space
                                                                                  $` ; ` P fgI ; gP ; gQ ; gR ; cI ; cQ g since the partition  is inte-
the partition maps in three coarse-to-fine scales. We choose
                                                                                  grated out in EM-clustering.
not to discuss the details, but show some results using the                          q…ÂjÃ; `† is computed once for the whole image and
two running examples: the woman and zebra images.                                 q…ÂjR; `† is computed from q…ÂjÃ; `† for each R at runtime.
   Fig. 10a shows a color image and three scales of partitions.                   It proceeds in the following. Each cluster Â` ; i ˆ I; P; F F F ; m
Since this image has strong color cue, the edge maps are very                     receives a real-valued vote from the pixel v P R in region R
informative about where the region boundaries are. In                             and the accumulative vote is the summation of the saliency
contrast, the edge maps for the zebra image are very messy,                       map Si` associated with Â` , i.e.,     i
as Fig. 11 shows.
                                                                                                         I X `
                                                                                                 pi ˆ          S ;         i ˆ I; P; F F F ; m; V`:
                                                                                                        jRj vPR i;v
      PROBABILITIES                                                               Obviously, the clusters which receive high votes should
It is generally acknowledged in the community that cluster-                       have high chance to be chosen. Thus, we sample a new
ing and edge detection algorithms can sometimes produce                           image model  for region R,
good segmentations or even perfect results for some images,                                                                X
but very often they are far from being reliable for generic                                          $ q…ÂjR; `† ˆ              pi G…Â À Â` †:
                                                                                                                                           i               …PP†
images, as the experiments in Figs. 4, 5, 6, 7, 8, 9, 10, and 11                                                           iˆI
demonstrate. It is also true that sometimes one of the image                      Equation (22) explains how we choose (or propose) an
models and edge detection scales could do a better job in                         image model for a region R. We first draw a cluster i at
segmenting some regions than other models and scales, but
we do not know a priori what types of regions present in a
generic image. Thus, we compute all models and edge
detection at multiple scales and, then, utilize the clustering
and edge detection results probabilistically. MCMC theory
provides a framework for integrating this probabilistic
information in a principled way under the guidance of a
globally defined Bayesian posterior probability.
    We explain how the importance proposal probabilities
q…ÂjR; `† and q…Àij jRk † in Section 4.3 are computed from the
data-driven results.

6.1   Computing Importance Proposal
      Probability q…ÂjR; `†
The clustering method in an atomic (cue) space $` outputs a                       Fig. 10. Partition maps at three scales of details for a color image.
set of weighted cue particles € ` . € ` encodes a nonpara-                        (a) Input image. (b) Partition map at scale 1. (c) Partition map at scale 2.
metric probability in $` ,                                                        (d) Partition map at scale 3.
666                                                IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,                   VOL. 24, NO. 5,   MAY 2002

Fig. 11. A gray-level image and three partition maps at three scales. (a) Input image. (b) Partition map at scale 1. (c) Partition map at scale 2.
(d) Partition map at scale 3.

Fig. 12. A candidate region Rk is superimposed on the partition maps at three scales for computing a candidate boundary Àij for the pending split.

random according to probability p ˆ …pI ; pP ; F F F ; pm † and,                 So each partition map Á…s† encodes a probability in the
then, do a random perturbation at Â` . Thus, any  P $` has
                                             i                                atomic (partition) space $k .
a nonzero probability to be chosen for robustness and                                                   …s†
                                                                                                    jÅk j
ergodicity. Intuitively, the clustering results with local votes                              I     X                    …s†
propose the ªhottestº portions of the space in a probabilistic                  q …k † ˆ     …s†
                                                                                                              G…k À k;j †;     for s ˆ I; P; Q: Vk: …PR†
                                                                                            jÅk j jˆI
way to guide the jump dynamics.
   In practice, one could implement a multiresolution (on a                      G…† is a smooth window centered at H and its smoothness
pyramid) clustering algorithm over smaller local windows,                     accounts for boundary deformations and forms a cluster
thus the clusters Â` ; i ˆ I; P; F F F ; m will be more effective at the
                                                                              around each partition particle and k À k;j measures the
expense of some overhead computing.                                           difference between two partition maps k and k;j . Martin et
                                                                              al. [21] recently proposed a method of measuring such
6.2    Computing Importance Proposal                                          difference and we use a simplified version. In the finest
       Probability q…ÀjR†                                                                                                           …s†
                                                                              resolution, all metaregions reduce to pixels and Åk is then
By edge detection and tracing, we obtain partition maps                       equal to the atomic space $k . We adopt equal weights for
denoted by Á…s† at multiple scales s ˆ I; P; Q. In fact, each                 all partitions k and one may add other geometric
partition map Á…s† consists of a set of ªmetaregionsº                         preferences to some partitions.
ri ; i ˆ I; P; F F F ; n,                                                        In summary, the partition maps at all scales encode a
                                                                              nonparametric probability in $k ,
                               …s†                          …s†
             Á…s† …Æ ˆ fri X i ˆ I; P; F F F ; n; ‘n ri ˆ Ãg;                                          X
                                                    iˆI                                        q…k † ˆ   q…s†q…s† …k †; Vk:
             for s ˆ I; P; Q:                                                                                     s

These metaregions are then used in combination to form                        This q…k † can be considered as an approximation to the
               …s† …s†         …s†                                            marginal posterior probability p…k jI†.
K n regions RI ; RP ; F F F ; RK ,
                                                                                  The partition maps Á…s† ; Vs (or q…k †; Vk implicitly) are
             …s†         …s†               …s†
            Ri ˆ ‘j rj ;         with rj P Á…s† ; Vi ˆ I; P; F F F ; K:       computed once for the whole image, then the importance
                                                                              proposal probability q…ÀjR† is computed from q…k † for each
One could put a constraint that all metaregions in a region                   region as a conditional probability at run time, like in the cue
Ri are connected.                                                             spaces.
         …s†     …s†    …s†     …s†
   Let k ˆ …RI ; RP ; F F F ; Rk † denote a k-partition based                    Fig. 12 illustrates an example. We show partition maps
on Á…s† . k is different from the general k-partition k                     Á…s† …Æ at three scales and the edges are shown at width
                     …s†                  …s†
because regions Ri ; i ˆ I; F F F ; K in k are limited to the                Q; P; I, respectively, for s ˆ I; P; Q. A candidate region R is
metaregions. We denote by Åk the set of all k-partitions                      proposed to split. q…ÀjR† is the probability for proposing a
based on a partition map Á…s† .                                               splitting boundary À.
                                                                                  We superimpose R on the three partition maps. The
      …s†          …s†     …s†        …s†        …s†         …s†
  Åk ˆ f…RI ; RP ; F F F ; Rk † ˆ k X                 ‘k Ri ˆ Ãg: …PQ†
                                                                              intersections between R and the metaregions generate
                         …s†         …s†                                      three sets
We call each k in Åk a k-partition particle in atomic
                                                           …s†                                            …s†     …s†
(partition) space $k . Like the clusters in a cue space, Åk is                        Á…s† …R† ˆ frj X rj ˆ R ’ rj for rj P Á…s† …Æ;
a sparse subset of $k and it narrows the search in $k to                                        …s†
                                                                                       —nd ‘i ri ˆ Rg;                s ˆ I; P; Q:
the most promising portions.
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                                    667

For example, in Fig. 12,                                                                   of many existing image segmentation algorithms. First,
                                                                                           edge detection and tracing methods [4], [17] compute
                    …I†    …I†                              …P†   …P†   …P†   …P†
    Á…I† …R† ˆ frI ; rP g; Á…P† …R† ˆ frI ; rP ; rQ ; rR g;                                implicitly a marginal probability q…jI† on the partition
                                                                                           space $ . Second, clustering algorithms [5], [6] compute a
etc.                                                                                       marginal probability on the model space $` for various
                                    …s†   …s†
   Thus, we can define …s† …R† ˆ …RI ; RP ; F F F ; R…s† † as
                           c                          c                                    models `. Third, the split-and-merge and model switching
a c-partition of region R based on Á…s† …R† and define a                                   [2] realize jump dynamics. Fourth, region growing and
c-partition space of R as                                                                  competition methods [29], [24] realize diffusion dynamics
                                                                                           for evolving the region boundaries.
                                  …s†      …s†
              Ņs† …R† ˆ f…RI ; RP ; F F F ; R…s† †
               c                              c
                          ˆ …s† …R† X ‘c Ri ˆ Rg; Vs:
                             c          iˆI                                                7   COMPUTING MULTIPLE DISTINCT SOLUTIONS
We can define distributions on Ņs† …R†.                                                   7.1 Motivation and a Mathematical Principle
                                                                                           The DDMCMC paradigm samples solutions from the
                                          c …R†j                                           posterior W $ p…W jI† endlessly, as we argued in the
        …s†                   I                                   …s†
      q …c …R†† ˆ                                  G…c À c;j …R††;                      introduction that segmentation is a computing process not
                          jÅc …R†j        jˆI
                                                                                    …PT†   a task. To extract an optimal result, one can take an
      for s ˆ I; P; Q; Vc:                                                                 annealing strategy and use the conventional maximum a
                                                                                           posteriori (MAP) estimator
Thus, one can propose to split R into c pieces, in a general case,
                                X                                                                                W à ˆ —rg m—x p…W jI†:
                                                                                                                              W P
           c …R† $ q…c …R†† ˆ     q…s†q…s† …c …R††:
                                                s                                          In this paper, we argue that it is desirable and often critical
                                                                                           to have the ability of computing multiple distinct solutions
That is, we first select a scale s with probability q…s†. q…s†                             for the following reasons.
depends on R. For example, for a large region R, we can                                       First, natural scenes are intrinsically ambiguous and for
choose coarse scale with higher probability and choose a                                   an image I, many competing organizations and interpreta-
fine scale for small regions. Then, we choose a c-partition                                tions exist in visual perception.
from the set Ņs† …R†. In our implementation, c ˆ P is chosen
               c                                                                              Second, for robustness, decisions should be left to the last
as a special case for easy implementation. It is trivial to                                stage of computation when a segmentation process is
show that an arbitrary c-partition of region R, c …R†, can be                             integrated with a specific task. Therefore, it is best to
generated through composing P …R† in multiple steps.                                      maintain a set of typical solutions.
Obviously, there is a big overhead for choosing large c.                                      Third, preserving multiple solutions is necessary when
6.3 Computing q…Âi ; Âj jR; `i ; `j † and q…Àij jR; Âi ; Âj †                              the prior and likelihood models are not perfect. Because the
                                                                                           globally optimal solution may not be semantically more
In some cases, we find the second route useful for splitting a
                                                                                           meaningful than some other inferior local maxima.
region which we discussed in designing MCMC dynamics
                                                                                              However, simply keeping a set of samples from the
3-4 (see (14)).
                                                                                           Markov chain sequence is not enough because it often collects
    For example, there are two ways to perceive the zebra in
                                                                                           a set of segmentations which are trivially different from each
Fig. 14. One perceives the zebra as one textured region (by a
                                                                                           other. Here, we present a mathematical principle for
model in $gQ ). The other sees it as one region of black stripes
                                                                                           computing important and distinctive solutions in space .
plus one region of white strips and, thus, uses two models in
                                                                                           (Our result was presented earlier in a CVPR2000 paper.)
$gI or $gR . The Markov chain should be able to switch between
                                                                                              Let S ˆ f…!i ; Wi † X i ˆ I; F F F ; Kg be a set of K weighted
the two perceptions effectively (see results in Fig. 14b, Fig. 14c,
                                                                                           solutions which we call ªscene particles,º the weight is its
and Fig. 14d. This is necessary and typical for the transitions
                                                                                           posterior probability !i ˆ p…W jI†; i ˆ I; P; F F F ; K. (Note that
between any texture regions and intensity regions.
                                                                                           there is a slight abuse of notation, we use K for the number
    Because the number of strips in such textures is large, the
                                                                                           of regions in W before. Here, it is a different K). S encodes a
first split procedure (route 1) is very ineffective and it works
                                                                                           nonparametric probability in ,
on one strip at a time. This motivates the second pathway
for split dynamics.                                                                                              X !i
                                                                                                                 K                       X
    For a candidate region R, we first propose two new                                               ”
                                                                                                     p…W jI† ˆ             G…W À Wi †;         !i ˆ !:
region models (we always assume the same labels `i ˆ `j ),                                                       iˆI
                                                                                                                       !                 iˆI
which can be done by twice sampling the importance
                                                                                           G is a Gaussian window in .
proposal probabilities q…ÂjR; `†, so
                                                                                              As all image ambiguities are captured in the Bayesian
     …Âi ; Âj † $ q…Âi ; Âj jR; `i ; `j † ˆ q…Âi jR; `i †q…Âj jR; `j †:                    posterior probability to reflect the intrinsic ambiguities, we
                                                                                           should compute the set of solutions S which best preserves
Obviously, we exclude Âi from the candidate set when we                                                                                 ”
                                                                                           the posterior probability. Thus, we let p…W jI† approach
select Âj . Then, we decide on the boundary À q…Àij jR; Âi ; Âj †                          p…W jI† by minimizing a Kullback-Leibler divergence D…pjj”†   p
by randomly labeling the pixels in R according to probabil-                                under a complexity constraint jSj ˆ K,
ities of the saliency maps.                                                                                                    Z
                                                                                                                                              p…W jI†
6.4 A Unifying Framework                                                                     S à ˆ —rg min D…pjj”† ˆ —rg min
                                                                                                                p                 p…W jI† log         dW :
                                                                                                      jSjˆK              jSjˆK                ”
                                                                                                                                              p…W jI†
To summarize this section, the DDMCMC paradigm                                                                                                           …PU†
provides a unifying framework for understanding the roles
668                                        IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,          VOL. 24, NO. 5,   MAY 2002

Fig. 13. Segmenting a color image by DDMCMC with two solutions. See text for explanation.

This criterion extends the conventional MAP estimator (see                      In practice, we run multiple Markov chains and add new
Appendix for further discussion).                                            particles to the set S in a batch fashion. From our
                                                                             experiments, the greedy algorithm did a satisfactory job
7.2    A KEedventurers Algorithm for Multiple                                and it is shown to be optimal in two low dimensional
       Solutions                                                             examples in the Appendix.
Fortunately, the KL-divergence D…pjj”† can be estimated
                                              ” p
fairly accurately by a distance measure D…pjj”† which is
computable, thanks to two observations of the posterior                      8   EXPERIMENTS
probability p…W jI† which has many separable modes. The                      The DDMCMC paradigm was tested extensively on many
details of the calculation and some experiments are given in                 gray-level, color and textured images. This section shows
the appendix. The idea is simple. We can always represent                    some examples and more are available on our website.5 It
p…W jI† by a mixture of Gaussian, i.e., a set of N particles                 was also tested in a benchmark data set of 50 natural images
with N large enough. By ergodicity, the Markov chain is                      in both color and gray-level [21] by the Berkeley group,6
supposed to visit these significant modes over time! Thus,                   where the results by DDMCMC and other methods such as
our goal is to extract K distinct solutions from the Markov                  [26] are displayed in comparison to those by a number of
                                                                             human subjects. Each tested algorithm uses the same
chain sampling process. Here, we present a greedy
                                                                             parameter setting for all the benchmark images and, thus,
algorithm for computing S Ã approximately. We call the
                                                                             the results were obtained purely automatically.
algorithmЪKE—dventurersº algorithm.4                                           We first show our working example on the color woman
   Suppose we have a set of K particles S at step t. At time                 image. Following the importance proposal probabilities for
t ‡ I, we obtain a new particle (or a number of particles) by                the edges in Fig. 10 and for color clustering in Fig. 4, we
MCMC, usually following a successful jump. We augment                        simulated three Markov chains with three different initial
the set S to S‡ by adding the new particle(s). Then, we                      segmentations shown in Fig. 13 (top row). The energy
eliminate one particle (or a number of particles) from S‡ to get             changes (À log p…W jI†) of the three MCMCs are plotted in
a new Snew by minimizing the approximative KL divergence                     Fig. 13 against time steps. Fig. 13 shows two different
D…p‡ jjpnew †.                                                               solutions WI ; WP obtained by a Markov chain using
                                                                             KE—dventurers algorithm. To verify the computed solution
The k-adventurers algorithm                                                  Wi , we synthesized an image by sampling from the
1. Initializing S and p by repeating one initial solution                    likelihood Isyn $ p…IjWi †; i ˆ I; P. The synthesis is a good
         K times.                                                            way to examine the sufficiency of models in segmentation.
2. Repeat                                                                       Fig. 14 shows three segmentations on a gray-level zebra
3.     Compute a new particle …!K‡I ; xK‡I † by DDMCMC                       image. As we discussed before, the DDMCMC algorithm in
             after a successful jump.                                        this paper has only one free parameter 
 which is a ªclutter
4.     S‡ 2 S f…!K‡I ; xK‡I †g.                                              factorº in the prior model (see (2)). It controls the extents of
5.     ”
       p 2 S‡ .                                                              segmentations. A big 
 encourages coarse segmentation
6.     For i ˆ I; P; F F F ; K ‡ I do                                        with large regions. We normally extract results at three
7.        SÀi 2 S‡ =f…!i ; xi †g.                                            scales by setting 
 ˆ I:H; P:H; Q:H, respectively. In our
                                                                             experiments, the KE—dventurers algorithm is effective only
8.         ”
          pÀi 2 SÀi .
                                                                             for computing distinct solutions in a certain scale. We
9.                     p
          di ˆ D…pjj”Ài †.                                                   expect the parameter 
 can be fixed to a constant if we form
10. ià ˆ —rg miniPfI;P;FFF;K‡Ig di .                                         an image pyramid with multiple scales and conduct
11. S 2 SÀià ; p 2 pÀià      ”
                                                                                5. See
   4. The name follows a statistics metaphor told by Mumford to one of the   DDMCMC.htm.
authors Zhu. A team of K adventurers want to occupy K largest islands in        6. See
an ocean while keeping apart from each other's territories.                  benchmark.
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                            669

Fig. 14. Experiments on the gray-level zebra image with three solutions: (a) input image, (b)-(d) are three solutions, Wi ; i ˆ I; P; Q, for the zebra
image, (e)-(g) are synthesized images Isyn $ p…IjWià † for verifying the results.

segmentation with KE—dventurers algorithm at each scale                     9    DISCUSSION
and, then, propagate and refine the results to the next finer               In future work, we shall extend the DDMCMC paradigm in
scale sequentially. This will be done in future research.                   three directions:
   For the zebra image, WI segments out the black and
white stripes while WP and WQ treat the zebra as a texture                      1.   Integrating other image models, such as point, curve
region. The synthesized images Isyn $ p…IjWi †; i ˆ I; P; Q                          processes for perceptual organization, and object
                                                                                     models such as face, for object recognition.
show that the texture model is not sufficient because we
                                                                                2.   Incorporating specific vision tasks with this sto-
choose only eight small filters for computational ease. Also                         chastic inference engine. When there is a special
the spline surface model plays an important role in                                  purpose, the computing process is tuned (attended
segmenting the ground and background grass and this is                               in a psychology term) to minimize some criterion,
verified by the global shading changes in Isyn and Isyn .
                                               P        Q
                                                                                     such as a risk function, which guides the selection
   Figs. 15 and 16 display some other gray-level and color                           and pruning of results.
images using the same algorithm. We show the input (left)                       3.   Analyzing the DDMCMC convergence rate and
                                                                                     linking it to the goodness of the set of heuristics qs.
and a segmentation (middle) starting with arbitrary initial
conditions and a synthesized image (right) drawn from the
likelihood Isyn $ p…IjW †. The 
 values for these images are
mostly set up as I:S with a few obtained at 1.0-3.5. It took                APPROXIMATING           THE   KL-DIVERGENCE
about 10-50 minutes, depending upon the complexity of                       For simplicity of notation, we denote by p…x† an arbitrary
image contents, on a Pentium III PC to segment an image                     distribution in space . In our case, p…x† represents the
                                                                            posterior p…W jI†.
with medium size, such as QSH Â PSH pixels, after learning
                                                                               For segmentation problems, we observe that p…x† has
the pseudolikelihood texture models at the beginning.                       two important properties.
   The synthesis images show that we need to engage more
stochastic models such as point, curve process, and object like                 1.   p…x† has an enormous number of local maxima
faces, etc. For example, in the first row of Fig. 16. The music                      (called modes in statistics). A significant mode
band in a football stadium forms a point process which is not                        corresponds to a distinct interpretation of the image
                                                                                     and the cloud surrounding a mode is local small
captured. The face is also missing in the synthesis.
   Fig. 17 shows three gray-level images out of the                                  perturbation of the region boundaries or model
                                                                                     parameters. These significant modes of p…x†, de-
50 natural images in both color and gray-level for the
                                                                                     noted by xi ; i ˆ I; P; F F F , are well separated from
benchmark study. The input (left), the segmentation results                          each other due to the high dimensions.
by DDMCMC (middle), and the manual segmentation by a                            2.   Each mode xi has a weight !i ˆ p…xi † and its energy
human subject (right) are displayed.                                                 is defined as E…xi † ˆ À log p…xi †. The energies of
670                                      IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,               VOL. 24, NO. 5,         MAY 2002

Fig. 15. Gray-level image segmentation by DDMCMC. Left: input images, middle: segmentation results W , right: synthesized images Isyn $ p…IjW †
with the segmentation results W .

        these modes are uniformly distributed in a broad                 masses could differ in many orders of magnitudes. The
        range ‰Emin ; Em—x Š, say, ‰I; HHH; IH; HHHŠ. For example,       above metaphor leads us to a mixture of Gaussian
                                                                         representation of p…x†. For a large enough N, we have,
        it is normal to have solutions (or local maxima)
        whose energies differ in the order of SHH or more.                                  IX N                                  X

        Thus, their probability (weights) differ in the order                      p…x† ˆ         !j G…x À xj ; P †;
                                                                                                                 j          !ˆ          !j :
                                                                                            ! jˆI                                 jˆI
        of expÀSHH and our perception is interested in those
        ªtrivialº local modes.                                           We denote by
Intuitively, it helps to imagine that p…x† in  is distributed
like the mass of the universe. Each star is a mode as local                                                                 X
                                                                                  So ˆ f…!j ; xj †; j ˆ I; P; F F F ; Ng;         !j ˆ ! ;
maximum of the mass density. The significant and devel-                                                                     jˆI
oped stars are well separated from each other and their
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                     671

Fig. 16. Color image segmentation by DDMCMC. Left: input images, middle: segmentation results W , right: synthesized images Isyn $ p…IjW † with
the segmentation results W .

Fig. 17. Some segmentation results by DDMCMC for the benchmark test by Martin. The errors for the above results by DDMCMC (middle) compared
with the results by a human subject (right) are H:IHVQ, H:QHVP, and H:SPWH, respectively, according to their metrics.

the set of weighted particles (or modes). Thus, our task is to              By analogy, all ªstarsº have the same volume, but differ in
select K << N particles S from So . We define a mapping                     weight. With the two properties of p…x†, we can approximately
from the index in S to the index in So ,                                                   p
                                                                            compute D…pjj”† in the following. We start with an observa-
                                                                            tion for the KL-divergence for Gaussian distributions.
                X fI; P; F F F ; KgÀ3fI; P; F F F :; Ng:                      Let pI …x† ˆ G…x À I Y P † and pP …x† ˆ G…x À P Y P † be
                                                                            two Gaussian distributions, then it is easy to check that

               S ˆ f…!…i† ; x…i† †Y i ˆ I; P; F F F ; Kg                                                     …I À P †P
                                                                                               D…pI jjpP † ˆ               :
S encodes a nonparametric model for approximating p…x† by
                                                                               We partition the solution space  into disjoint domains
               I   X
                   K                                        X
      p…x† ˆ             !…i† G…x À x…i† ; P †;
                                              …i†   ˆ           !…i† :                 ˆ ‘N Di ;
                                                                                              iˆI        Di ’ Dj ˆ Y; Vi Tˆ j:
                  iˆI                                      iˆI
                                                                            Di is the domain where p…x† is decided by a particle …!i ; xi †.
Our goal is to compute                                                      The reason for this partition is that the particles in S are far
                                                                            apart from each other in high dimensional space and their
                          S à ˆ —rg min D…pjj”†:
                                             p                              energy varies significantly as the two properties state. Within
                                                                            each domain Di , it is reasonable to assume that p…x† is
  For notational simplicity, we assume all Gaussians have the               dominated by one term in the mixture and the other N À I
same variance in approximating p…x†, j ˆ ; j ˆ I; P; F F F ; N.           terms are neglectable.
672                                              IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE,               VOL. 24, NO. 5,    MAY 2002

Fig. 18. (a) A 1D dist. p…x† with four particles xi ; i ˆ I; P; Q; R. (b) A 2D dist p…x† with SH particles we show log p…x† in the image for visualization.
    ”                                             p             p        ”                                          ”
(c) pI …x† with six particles that minimizes D…pjj”† or D…pjj”†. (d) pP …x† with six particles that minimizes jp À pj.
                                                                      TABLE 1
                                              Distances between p…x† and p…x† for Different Particle Set SQ

        p…x† %       G…x À xi Y P †;      x P Di ; i ˆ I; P; F F F ; N:            Equation (28) has some intuitive meanings. The second
                                                                                 term suggests that each selected x…c…i†† should have large
The size of Di is much larger than P . After removing N À                       weight !…c…i†† . The third term is the attraction forces from
K particles in the space, a domain Di is dominated by a                          particles in So to particles in S. Thus, it helps to pull apart
                                                                                 the particles in Sk and also plays the role of encouraging to
nearby particle that is selected in S.
                                                                                 choose particles with big weight like the second term. It can
  We define a second mapping function
                                                                                 be shown that (27) generalizes the traditional (MAP)
                   c X fI; P; F F F ; Ng 3 fI; P; F F F ; Kg;                    estimator when K ˆ I.
                                                                                    To demonstrate the goodness of approximation D…pjj”†       ” p
so that p…x† in Di is dominated by a particle x…c…i†† P SK ,                             p
                                                                                 to D…pjj”†, we show two experiments below. Fig. 18a
               !c…i†                                                             displays a 1D distribution p…x†, which is a mixture of N ˆ R
      p…x† %         G…x À x…c…i†† Y P †;      x P Di ; i ˆ I; P; F F F ; N:   Gaussians (particles). We index the centers from left to right
                                                                                 xI < xP < xQ < xR . Suppose we want to choose K ˆ Q
Intuitively, the N domains are partitioned into K groups,                                                 ”
                                                                                 particles for SK and p…x†. Table 1 displays the distances
each of which is dominated by one particle in SK . Thus, we                                           ”
                                                                                 between p…x† and p…x† over the four possible combinations.
can approximate D…pjj”†.                                                         The second row shows the KL-divergence D…pjj”† and the  p
                                                                                                  ” p
                                                                                 third row is D…pjj”†. The approximation is very accurate
                                       p…x†                                      given the particles are well separable.
 D…pjj”† ˆ                  p…x† log        dx                                      Both measures choose …xI ; xQ ; xR † as the best S. Particle xP is
                nˆI    Dn              ”
                                                                                 not favored by the KL-divergence because it is near xI ,
                      IX  N
                                                                                 although it has much higher weight than xQ and xR . The fourth
           ˆ                 !i G…x À xi Y P † log                              row shows the absolute value of the difference between p…x†
             nˆI Dn
                      ! iˆI
                    PN                                                                ”
                                                                                 and p…x†. This distance favors …xI ; xP ; xQ † and …xI ; xP ; xR †. In
                  I                        P
                  !    iˆI !i G…x À xi Y  †                                     comparison, the KL-divergence favors particles that are apart
                 Pk                              dx
               I                              P                                  from each other and picks up significant peaks in the tails.
                   jˆI !…j† G…x À  …j† Y  †
                                                                                    This idea is better demonstrated in Fig. 18. Fig. 18a
             X Z !n
                                                                                 shows log p…x† ˆ ÀE…x† which is renormalized for display-
           %              G…x À xn Y P †
             nˆI Dn
                       !                                                         ing. p…x† consists of N ˆ SH particles whose centers are
                                                                               shown by the black spots. The energies E…xi †; i ˆ I; P; F F F ; N
                                  !n G…x À xn Y P †
                log ‡ log                                      dx                are uniformly distributed in an interval ‰H; IHHŠ. Thus, their
                    !         !…c…n†† G…x À x…c…n†† Y P †                     weights differ in exponential order. Fig. 18b shows log p…x†     ”
                     "                                            #
             X !n
                                                …xn À x…c…n†† †P                                                                   p
                                                                                 with k ˆ T particles that minimize both D…pjj”† and D…pjj”†. ” p
           ˆ          log ‡ log              ‡                                   Fig. 18c shows the six particles that minimize the absolute
                  !        !        !…c…n††          PP
                                                                                 difference jp À pj. Fig. 18b has more disperse particles.
                       X !n
                         N                                                          For segmentation, a remaining question is: How do we
           ˆ log       ‡                                                         measure the distance between two solutions WI and WP ?
                      ! nˆI !
          "                                             #                        This distance measure is to some extent subjective. We
                                      …xn À x…c…n†† †P                          adopt a distance measure which simply accumulates the
            …E…x…c…n†† † À E…xn †† ‡                       ” p
                                                          ˆ D…pjj”†:
                                           PP                                   differences for the number of regions in WI ; WP and the
                                                                                 types ` of image models used at each pixel by WI and WP .
TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO                                                                                     673

ACKNOWLEDGMENTS                                                                    [22] S. Oe, ªTexture Segmentation Method by Using Two-Dimensional
                                                                                        AR Model and Kullback Information,º Pattern Recognition, vol. 26,
This work is partially supported by two US National                                     pp. 237-243, 1993.
                                                                                   [23] S. Osher and J.A. Sethian, ªFront Propagating with Curvature
Science Foundation grants IIS 98-77-127 and IIS-00-92-664, a                            Dependent Speed: Algorithms Based on Hamilton-Jacobi For-
National Aeronautics and Space Agency grant NAG-13-                                     mulation,º J. Computational Physics, vol. 79, pp. 12-49, 1988.
                                                                                   [24] N. Paragios and R. Deriche, ªCoupled Geodesic Active Regions
00039, an Office of Naval Research grant N000140-110-535,
                                                                                        for Image Segmentation: A Level Set Approach,º Proc. European
and a Microsoft gift. MS student Rong Zhang helped in the                               Conf. Computer Vision, June 2000.
experiments for computing multiple solution experiments                            [25] S. Sclaroff and J. Isidoro, ªActive Blobs,º Proc. Int'l Conf. Computer
                                                                                        Vision, pp. 1146-1153, 1998.
in 1D and 2D used in Section 7. The authors thank David                            [26] J. Shi and J. Malik, ªNormalized Cuts and Image Segmentation,º
Mumford for discussions.                                                                IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8,
                                                                                        Aug. 2000.
                                                                                   [27] R.H. Swendsen and J.S. Wang, ªNonuniversal Critical Dynamics
                                                                                        in Monte Carlo Simulation,º Physical Rev. Letters, vol. 58, no. 2,
REFERENCES                                                                              pp. 86-88, 1987.
[1]    L. Alvarez, Y. Gousseau, and J.M. Morel, ªThe Size of Objects in            [28] J.P. Wang, ªStochastic Relaxation on Partitions with Connected
       Natural Images,º Preprint of Centre de Math. and Applications, 2000.             Components and Its Applications to Image Segmentation,º IEEE
[2]    S.A. Barker, A.C. Kokaram, and P.J.W. Rayner, ªUnsupervised                      Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 6,
       Segmentation of Images,º SPIEs 43rd Ann. Meeting, pp. 19-24, July                pp. 619-636, June 1998.
       1998.                                                                       [29] S.C. Zhu and A.L. Yuille, ªRegion Competition: Unifying Snakes,
[3]    C. Bouman and B. Liu, ªMultiple Resolution Segmentation of                       Region Growing, and Bayes/MDL for Multi-Band Image Seg-
       Textured Images,º IEEE Trans. Pattern Analysis and Machine                       mentation,º IEEE Trans. Pattern Analysis and Machine Intelligence,
       Intelligence, vol. 13, no. 2, pp. 99-113, Feb. 1991                              vol. 18, no. 9, pp. 884-900, Sept. 1996.
[4]    J.F. Canny, ªA Computational Approach to Edge Detection,º IEEE              [30] S.C. Zhu, Y.N. Wu, and D. Mumford, ªFilters, Random Field and
       Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6,                 Maximum Entropy: Towards a Unified Theory for Texture
       pp. 679-698, Nov. 1986.                                                          Modeling,º Int'l J. Computer Vision, vol. 27, no. 2, pp. 107-126, 1998.
[5]    Y. Cheng, ªMean Shift, Mode Seeking, and Clustering,º IEEE                  [31] S.C. Zhu and X. Liu, ªLearning in Gibbsian Fields: How Accurate
       Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 8,                and How Fast Can It Be?º Proc. IEEE Conf. Computer Vision and
       pp. 790-799, Aug. 1995.                                                          Pattern Recognition, vol. 2, pp. 2-9, 2000.
[6]    D. Comaniciu and P. Meer, ªMean Shift Analysis and Applica-
       tions,º Proc. Int'l Conf. Computer Vision, pp. 1197-1203, 1999.                                     Zhuowen Tu received the BE degree in
[7]    A.P. Dempster, N.M. Laird, and D.B. Rubin, ªMaximum Like-                                           electronic engineering from Beijing Information
       lihood from Incomplete Data via the EM Algorithm,º J. Royal                                         Technology Institute in 1993. He received the
       Statistics Soc. Series B, vol. 39, pp. 1-38, 1977.                                                  ME degree in civil engineering from Tsinghua
[8]    Y.N. Deng, B.S. Manjunath, and H. Shin, ªColor Image                                                University in 1996. He received the MS degree
       Segmentation,º Proc. IEEE Conf. Computer Vision and Pattern                                         in geodetic science and surveying in 1999 from
       Recognition, pp. 446-451, 1999.                                                                     The Ohio State University. He is currently a
[9]    D.A. Forsyth, ªSampling, Resampling and Colour Constancy,º                                          PhD candidate in the Department of Computer
       Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 300-                                  and Information Science at The Ohio State
       305, 1999                                                                                           University. He was with the Institute of Compu-
[10]   S. Geman and D. Geman, ªStochastic Relaxation, Gibbs Distribu-                                      ter Science and Technology at Peking Univer-
       tions, and the Bayesian Restoration of Images,º IEEE Trans. Pattern         sity from 1996 to 1997. His research interests include computer vision,
       Analysis and Machine Intelligence, vol. 6, pp. 721-741, Nov. 1984.          image processing, and geographic information system and mapping.
[11]   P.J. Green, ªReversible Jump Markov Chain Monte Carlo
       Computation and Bayesian Model Determination,º Biometrika,                                         Song-Chun Zhu received the BS degree in
       vol. 82, no. 4, pp. 711-732, 1995.                                                                 computer science from the University of Science
[12]   U. Grenander and M.I. Millter, ªRepresentations of Knowledge in                                    and Technology of China in 1991. He received
       Complex Systems,º J. Royal Statistics Soc. Series B, vol. 56, no. 4,                               the MS and PhD degrees in computer science
       pp. 549-603, 1994.                                                                                 from Harvard University in 1994 and 1996,
[13]   M. Kass, A. Witkin, and D. Terzopoulos, ªSnakes: Active Contour                                    respectively. He was a research associate in
       Models,º Int'l J. Computer Vision, vol. 1, no. 4, pp. 321-332, Jan.                                the Division of Applied Mathematics at Brown
       1988.                                                                                              University during 1996-1997 and he was a
[14]   G. Koepfler, C. Lopez, and J.M. Morel, ªA Multiscale Algorithm                                     lecturer in the Computer Science Department
       for Image Segmentation by Variational Approach,º SIAM J.                                           at Stanford University during 1997-1998. Since
       Numerical Analysis, vol. 31, no. 1, pp. 282-299, 1994.                      1998, he has been a faculty member in the Department of Computer and
[15]   Y.G. Leclerc, ªConstructing Simple Stable Descriptions for Image            Information Sciences at Ohio State University. His research is focused
       Partitioning,º Int'l J. Computer Vision, vol. 3, no. 1, pp. 73-102, 1989.   on computer vision and learning, statistical modeling, and stochastic
[16]   A.B. Lee, D.B. Mumford, and J.G. Huang, ªOcclusion Models for               computing. He has published 50 articles and received honors, including
       Natural Images,º Int'l J. Computer Vision, vol. 41, pp. 35-59, Jan.         a David Marr prize honorary mention, a US National Science Foundation
       2001.                                                                       Career award, research fellow of Sloan foundation, and a Navy young
[17]   H.C. Lee and D.R. Cok, ªDetecting Boundaries in a Vector Field,º            investigator award.
       IEEE Trans. Signal Processing, vol. 39, no. 5, pp. 1181-1194, 1991.
[18]   K.L. Mengersen and R.L. Tweedie, ªRates of Convergence of the
       Hastings and Metropolis Algorithms,º Annals of Statistics, vol. 24,
       pp. 101-121, 1994.
[19]   N. Metropolis, M.N. Rosenbluth, A.W. Rosenbluth, A.H. Teller,
       and E. Teller, ªEquations of State Calculations by Fast Computing           . For more information on this or any other computing topic,
       Machinesº J. Chemical Physics, vol. 21, pp. 1087-1092, 1953.                please visit our Digital Library at
[20]   D.B. Mumford and B. Gidas, ªStochastic Models for Generic
       Images,º Quarterly of Applied Math., vol. LIX, no. 1, pp. 85-111,
       Mar. 2001.
[21]   D. Martin, C. Fowlkes, D. Tal, and J. Malik, ªA Database of
       Human Segmented Natural Images and Its Application to
       Evaluating Segmentation Algorithms and Measuring Ecological
       Statistics,º Proc. Int'l Conf. Computer Vision, vol. 2, pp. 416-423,

To top