VIEWS: 24 PAGES: 17 POSTED ON: 3/9/2010 Public Domain
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 657 Image Segmentation by Data-Driven Markov Chain Monte Carlo Zhuowen Tu and Song-Chun Zhu AbstractÐThis paper presents a computational paradigm called Data-Driven Markov Chain Monte Carlo (DDMCMC) for image segmentation in the Bayesian statistical framework. The paper contributes to image segmentation in four aspects. First, it designs efficient and well-balanced Markov Chain dynamics to explore the complex solution space and, thus, achieves a nearly global optimal solution independent of initial segmentations. Second, it presents a mathematical principle and a K-adventurers algorithm for computing multiple distinct solutions from the Markov chain sequence and, thus, it incorporates intrinsic ambiguities in image segmentation. Third, it utilizes data-driven (bottom-up) techniques, such as clustering and edge detection, to compute importance proposal probabilities, which drive the Markov chain dynamics and achieve tremendous speedup in comparison to the traditional jump- diffusion methods [12], [11]. Fourth, the DDMCMC paradigm provides a unifying framework in which the role of many existing segmentation algorithms, such as, edge detection, clustering, region growing, split-merge, snake/balloon, and region competition, are revealed as either realizing Markov chain dynamics or computing importance proposal probabilities. Thus, the DDMCMC paradigm combines and generalizes these segmentation methods in a principled way. The DDMCMC paradigm adopts seven parametric and nonparametric image models for intensity and color at various regions. We test the DDMCMC paradigm extensively on both color and gray-level images and some results are reported in this paper. Index TermsÐImage segmentation, Markov Chain Monte Carlo, region competition, data clustering, edge detection, Markov random field. æ 1 INTRODUCTION I MAGE segmentation is a long standing problem in computer vision and it is found difficult and challenging for two main reasons. First, we formulate the problem in a Bayesian/MDL framework [15], [14], [29] with seven families of image models which compete to explain various visual patterns in The first challenge is the fundamental complexity of an image, for example, flat regions, clutter, texture, smooth modeling a vast amount of visual patterns that appear in shading, etc. generic images. The objective of image segmentation is to Second, we decompose the solution space into a union of parse an image into its constituent components. The latter many subspaces of varying dimensions and each subspace are various stochastic processes, such as attributed points, is a product of a number of subspaces for the image lines, curves, textures, lighting variations, and deformable partition and image models (see Fig. 3 for a space structure). objects. Thus, a segmentation algorithm must incorporate The Bayesian posterior probability is distributed over such many families of image models and its performance is a heterogeneously structured space. upper bounded by the accuracy of its image models. Third, we design ergodic Markov chains to explore the The second challenge is the intrinsic ambiguities in solution space and sample the posterior probability. The image perception, especially when there is no specific task Markov chains consist of two types of dynamics: jumps and to guide the attention. Real world images are fundamentally diffusion. The jump dynamics simulate reversible split-and- ambiguous and our perception of an image changes over time. Furthermore, an image often demonstrates details at merge and model switching. The diffusion dynamics multiple scales. Thus, the more one looks at an image, the simulate boundary deformation, region growing, region more one sees. Therefore, it must be wrong to think that a competition [29], and model adaptation. We make the split segmentation algorithm outputs only one result. In our and merge processes reversible and the ergodicity and opinion, image segmentation should be considered a reversibility enable the algorithm to achieve nearly global computing process not a vision task. It should output multiple optimal solution independent of initial segmentation condi- distinct solutions dynamically and endlessly so that these tions. Thus, this demonstrates major progress over the solutions ªbest preserveº the intrinsic ambiguity. previous region competition algorithm (Zhu and Yuille) [29]. Motivated by the above two observations, we present a Fourth, we utilize data-driven techniques to guide the stochastic computing paradigm called data-driven Markov Markov chain search and, thus, achieves tremendous speed- chain Monte Carlo (DDMCMC) for image segmentation. We up in comparison to previous MCMC algorithms [10], [12], proceed in five steps. [11]. In the literature, there are various techniques for improving the Markov chain speed, such as multiresolution approaches [28], [3], causal Markov models [3], [22], and . The authors are with the Department of Computer and Information clustering [28], [27], [2], [9]. In our DDMCMC paradigm, data- Science, Ohio State University, 2015 Neil Ave., Columbus, OH 43210. E-mail: {ztu, szhu}@cis.ohio-state.edu. driven methods, such as edge detection [4] and tracing, data clustering [5], [6] are used. The results of these algorithms are Manuscript received 7 June 2001; accepted 26 Nov. 2001. Recommended for acceptance by D. Forsyth. expressed as weighted samples (or particles), which For information on obtaining reprints of this article, please send e-mail to: encode nonparametric probabilities in various subspaces. tpami@computer.org, and reference IEEECS Log Number 114320. These probabilities, respectively, approximate the marginal 0162-8828/02/$17.00 ß 2002 IEEE 658 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 probabilities of the Bayesian posterior probability and they Thus, a segmentation is denoted by a vector of hidden are used to design importance proposal probabilities to drive variables W , which describes the world state for generating the Markov chains. the image I. Fifth, we propose a mathematical principle and a ªK-adventurersº algorithm for selecting and pruning a set W K; f Ri ; `i ; Âi Y i I; P; F F F ; Kg: of important and distinct solutions from the Markov chain In a Bayesian framework, we make inference about W sequence and at multiple scales of details. The set of from I over a solution space . solutions encode an approximation to the Bayesian poster- ior probability. The multiple solutions are computed to W $ p W jI G p IjW p W ; W P : minimize a Kullback-Leibler divergence from the approx- As we mentioned before, the first challenge in segmenta- imative posterior to the true posterior and they preserve the tion is to obtain realistic image models. In the following, we ambiguities in image segmentation. briefly discuss the prior and likelihood models selected in In summary, the DDMCMC paradigm is about effec- our experiments. tively creating particles (by bottom-up clustering/edge detection), composing particles (by importance proposals), 2.2 The Prior Probability p W and pruning particles (by a K-adventurers algorithm) and The prior probability p W is a product of the following these particles represent hypotheses of various grainula- four probabilities. rities in the solution space. Conceptually, the DDMCMC paradigm also reveals the 1. An exponential model for the number of regions roles of some well-known segmentation algorithms. Algo- p K G eÀH K . rithms such as split-and-merge, region growing, Snake [13] 2. H A general smoothness Gibbs prior for the region and balloon/bubble [25], region competition [29], and boundaries p À G e À À ds . variational methods [14], and PDEs [24] can be viewed as 3. A model for the size of each region. Recently, both various MCMC jump-diffusion dynamics with minor mod- empirical and theoretical studies [1], [16] on the ifications. Other algorithms, such as edge detection [4] and statistics of natural images indicate that the size of a clustering [6], [8] compute importance proposal probabilities. region A jRj in natural images follows a distribu- We test the algorithm on a wide variety of gray-level and I tion, p A G A ; $ P:H. Such a prior encourages large color images and some results are shown in the paper. We regions to form. In our experiments, we found this also demonstrate multiple solutions and verify the segmen- prior is not strong enough to enforce large regions; tation results by synthesizing (reconstructing) images instead we take a distribution through sampling the likelihood models. c p A G eÀ A ; P 2 PROBLEM FORMULATION AND IMAGE MODELS where c H:W is a constant. is a scale factor which In this section, we formulate the problem in a Bayesian controls the scale of the segmentation. This scale framework and discuss the prior and likelihood models that factor is in spirit similar to the ªclutter factorº found by Mumford and Gidas [20] in studying natural are selected in our experiments. images. It is an indicator for how ªbusyº an image is. 2.1 The Bayesian Formulation for Segmentation In our experiments, it is typically set to P:H and is Let Ã f i; j X I i L; I j Hg be an image lattice the only free parameter in this paper. and IÃ an image defined on Ã. For any point v P Ã, Iv P 4. The prior for the type of model p ` is assumed to be fH; F F F ; Gg is the pixel intensity for a gray-level image, or uniform and the prior for the parameters Â of an Iv Lv ; Uv ; Vv for a color image.1 The problem of image image model penalizes model complexity in terms of segmentation refers to partitioning the lattice into an the number of parameters Â, p Âj` G eÀjÂj . unknown number of K disjoint regions In summary, we have the following prior model Ã K Ri ; Ri Rj Y; Vi T j: I Y K iI p W G p K p Ri p `i p Âi j`i Each region R & Ã needs not to be connected for reason of ( iI occlusion. We denote by Ài @Ri the boundary of Ri . As a X I K ) c slight complication, two notations are used interchangeably G exp ÀH K À ds jRi j jÂi j : iI @Ri in the literature. One treats a region R & Ã as a discrete label map and the other treats a region boundary À s @R as a continuous contour parameterized by s. The contin- 2.3 The Likelihood p IjW for Gray-Level Images uous representation is convenient for diffusions while the Visual patterns in different regions are assumed to label map representation is better for maintaining the be independent stochastic processes specified by topology. The level set method [23], [24] provides a good Âi ; `i ; i I; P; F F F ; K. Thus, the likelihood is,2 transform between the two. Each image region IR is supposed to be coherent in the Y K sense that IR is a realization from a probabilistic model p IR Y Â. Â p IjW p IRi Y Âi ; `i : iI represents a stochastic process whose type is indexed by `. 1. We transfer the (R, G, B) representation to (LÃ ; uÃ ; vÃ ) for better color 2. As a slight notation complication, Â; ` could be viewed as parameters distance measure. or hidden variables in W We use p I; Â; ` in both situtations for simplicity. TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 659 Fig. 1. Four types of regions in the windows are typical in real world images: (a) uniform, (b) clutter, (c) texture, and (d) shading. Y The choice of models needs to balance model sufficiency p IR Y Â; gQ p Iv jI@v Y Â and computational efficiency. In studying a large image set, vPR Y I S we found that four types of regions appear most frequently expfÀ < Â; h Iv jI@v >g; in real world images. Fig. 1 shows examples for the four Z vPR v types of regions in windows: Fig. 1a shows the flat regions with no distinct image structures, Fig. 1b shows the where < Á; Á > is the inner product between two cluttered regions, Fig. 1c shows the regions with homo- vectors and the model is considered nonparametric. The reason for choosing the pseudolikelihood geneous textures, and Fig. 1d shows the inhomogeneous expression is obvious: Its normalizing constant can regions with globally smooth shading variations. be computed exactly and Â can be estimated easily We adopt the following four families of models for the from images. We refer to a recent paper [31] for four types of regions. The algorithm can switch between discussions on the computation of this model and its them by Markov chain jumps. The four families are indexed variations, such as patch likelihood, etc. by ` P fgI ; gP ; gQ ; gR g and denoted by $gI , $gP , $gQ , and $gR , 4. Gray image model family ` gR : $gR . The first three respectively. Let G HY P be a Gaussian density centered at families of models are homogeneous, which fail in H with variance P . characterizing regions with shading effects, such as sky, lake, wall, perspective texture, etc. In the 1. Gray image model family ` gI : $gI . This assumes literature, such smooth regions are often modeled that pixel intensities in a region R are subject to by low order Markov random fields, which again do independently and identically distributed (iid) not model the inhomogeneous pattern over space Gaussian distribution, and often lead to over-segmentation. In our experi- Y ments, we adopt a 2D Bezier-spline model with p IR Y Â; gI G Iv À Y P ; Â ; P $gI : sixteen equally spaced control points on Ã (i.e., we vPR fix the knots). This is a generative type model. Let Q B x; y be the Bezier surface, for any v x; y P Ã, T B x; y U x Â M Â U y ; T 2. Gray image model family ` gP : $gP . This is a nonparametric intensity histogram h . In practice where h is discretized as a step function expressed by a vector hH ; hI ; F F F ; hG . Let nj be the number of pixels U x I À xQ ; Qx I À xP ; QxP I À x; xQ T in R with intensity level j. and Y Y G n p IR Y Â; gP h Iv hj j ; M mII ; mIP ; mIQ ; mIR Y F F F Y mRI ; F F F ; mRR : vPR jH R Therefore, the image model for a region R is, Â hH ; hI ; F F F ; hG P $gP : Y 3. Gray image model family ` gQ : $gQ . This is a texture p IR Y Â; gR G Iv À Bv Y P ; Â M; P $gR : vPR model FRAME [30] with pixel interactions captured by a set of Gabor filters. This family of models was U demonstrated to be sufficient in realizing a wide variety of texture patterns. To facilitate the computa- In summary, four types of models compete to explain a tion, we choose a set of eight filters and formulate the gray intensity region. Whoever fits the region better will have model in pseudolikelihood form [31]. The model is a higher likelihood. We denote by $g the gray-level model Â specified by a long vector Â I ; P ; F F F ; m P $gQ , space, m is the total number of bins in the histograms of the Â P $g $gI $gP $gQ $gR : eight Gabor filtered images. Let @v denote the Markov Â neighborhood of v P R and h Iv jI@v the vector including eight local histograms of filter responses 2.4 Model Calibration in the neighborhood of pixel v. Each of the filter The four image models should be calibrated for two reasons. histograms counts the filter responses at pixels whose First, for computational efficiency, we prefer simple models filter windows cover v. Thus, we have with less parameters. However, penalizing the number of 660 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 parameters is not enough in practice. When a region is of size over $ IHH pixels, the data term dominates the prior and demands more complex models. Second, the pseudolikeli- hood models in family $gQ are not a true likelihood as they depend on a rather big neighborhood; thus they are not directly comparable to the other three types of models. To calibrate the likelihood probabilities, we did an empirical study. We collected a set of typical regions from natural images and manually divided them into four categories. For example, Fig. 2 shows four typical images in the first column, which are cropped from the images in Fig. 1. We denote the four images by Ios ; i I; P; Q; R on a i lattice Ão . For each image Ios , we compute its per pixel i coding length (minus log-likelihood) according to an optimal model within family $gj computed by a maximum likelihood estimation for j I; P; Q; R. log p Ios Y Â; gj i Lij min À ; for I i; j R: V $gj QÂ jÃo j We denote by ÂÃ P $gj optimal fit within each family and ij draw a typical sample (synthesis) from each fitted model, Isyn $ p IY ÂÃ ; gj ; ij ij for I i; j R: We show Ios , Isyn , i and Lij in Fig. 2 for I i; j R. ij Fig. 2. Comparison study of four families of models. The first column is The results in Fig. 2 show that the spline model has the original image regions cropped from four real world images shown in Fig. 1. The images in the 2-5 columns are synthesized images Isyn $ obviously the shortest coding length for the shading region, ij p IR Y ÂÃ sampled from the four families, respectively, each after an ij while the texture model fits the best for the three other MLE fitting. The number below each synthesized image shows the per- regions. Then, we choose to rectify these models by a pixel coding bits Lij using each family of model. constant factor eÀcj for each pixel v, 3. Color image model family cQ : $cQ . We use three p Iv Y Â; gj p Iv Y Â; gj eÀcj for j I; P; Q; R: Bezier spline surfaces (see (6)) for LÃ , uÃ , and vÃ , cj ; j I; P; Q; R are chosen so that the rectified coding respectively, to characterize regions with gradually length Lij reaches minimum when i j. That is, uniform changing colors such as sky, wall, etc. Let B x; y be regions, clutter regions, texture regions, and shading the color value in LÃ ; uÃ ; vÃ space for any regions are best fitted by the models in $I , $P , $Q , and v x; y P Ã, $R , respectively. T T B x; y U x Â ML Â U y ; U x Â Mu Â U y ; 2.5 Image Models for color U x Â Mv Â U y T : T In experiments, we work on both gray-level and color images. For color images, we adopt a LÃ ; uÃ ; vÃ color Thus, the model is space and adopted three families of models indexed by Y ` P fcI ; cP ; cQ g. Let G HY Æ denote a 3D Gaussian density. p IR Y Â; cQ G Iv À Bv Y Æ; vPR 1. Color image model family cI : $cI . This is an iid Gaussian model in LÃ ; uÃ ; vÃ space. where Â ML ; Mu ; Mv ; Æ are the parameters. Y In summary, three types of models compete to explain a p IR Y Â; cI G Iv À Y Æ; Â ; Æ P $cI : color region. Whoever fits the region better will have higher vPR likelihood. We denote by $c the color model space, then Â W $c $cI $cP $cQ : Â 2. Color image model family cP : $cP . This is a mixture of two Gaussians and is used for modeling textured color 3 ANATOMY OF SOLUTION SPACE regions, Before we design an algorithm, we need to study the Y structures of the solution space in which the posterior p IR Y Â; cP I G Iv À I Y ÆI probability p W jI is distributed. vPR We start with the partition space for all possible partitions P G Iv À P Y ÆP : of a lattice Ã. When a lattice Ã is segmented into k disjoint Thus, regions, we call it a k-partition denoted by k , Â I ; I ; ÆI ; P ; P ; ÆP P $cP k RI ; RP ; F F F ; Rk ; k Ri Ã; iI Ri Rj Y; Vi T j: IH are the parameters. TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 661 Fig. 3. The anatomy of the solution space. The arrows represent Markov chain jumps, and the reversible jumps between two subspace V and W realize a split-and-merge of a region. If all pixels in each region are connected, then k is a connected 4.1 Three Basic Criteria for Markov Chain Design component partition [28]. The set of all k-partitions, denoted There are three basic requirements for Markov chain design. by $k , is a quotient space of the set of all possible k-colorings First, the Markov chain should be ergodic. That is, from divided by a permutation group q for the labels. an arbitrary initial segmentation Wo P , the Markov chain can visit any other states W P in finite time. This $k f RI ; RP ; F F F ; Rk k Y disqualifies all greedy algorithms. Ergodicity is ensured II jRi j > H; Vi I; P; F F F ; kg=q: by the jump-diffusion dynamics [12]. Diffusion realizes random moves within a subspace of fixed dimensions. Thus, we have a general partition space $ with the number Jumps realize reversible random walks between subspaces of regions I k jÃj, of different dimensions, as shown by the arrows in Fig. 3. jÃj $ kI $k : Second, the Markov chain should be aperiodic. This is ensured by using the dynamics at random. Then, the solution space for W is a union of subspaces k Third, the Markov chain has stationary probability and each k is a product of one k-partition space $k and p W jI. This is replaced by a stronger condition of detailed k spaces for the image models balance equations which demands that every move should be 2 3 reversible [12], [11]. The jumps in this paper all satisfy detailed balance and reversibility. kI k kI 4 $k Â $Â Â Á Á Á Â $Â 5; jÃj jÃj IP |{z} k 4.2 Five Markov Chain Dynamics where $Â R $gi for gray-level images and $Â We adopt five types of Markov chain dynamics which are iI Q $ci for color images. used at random with probabilities q I; F F F ; q S, respectively. iI Fig. 3 illustrates the structures of the solution space. In The dynamics 1-2 are diffusion and dynamics 3-5 are Fig. 3, the four image families $` ; ` gI ; gP ; gQ ; gR are reversible jumps. represented by the triangles, squares, diamonds, and Dynamics 1: Boundary Diffusion/Competition. For mathe- circles, respectively. $Â $g is represented by a hexagon matical convenience, we switch to a continuous boundary Â representation for regions Ri ; i I; F F F ; K. These curves containing the four shapes. The partition space $k is evolve to maximize the posterior probability through a represented by a rectangle. Each subspace k consists of a region competition equation [29]. Let Àij be the boundary rectangle and k hexagons, and each point W P k represents between Ri ; Rj , Vi; j, and Âi ; Âj the models for the two a k-partition plus k image models for k regions. We call k the scene spaces. $k and $` ; ` gI ; gP ; gQ ; gR (or regions, respectively. The motion of points Àij s ` cI ; cP ; cQ ) are the basic components for constructing and, x s; y s follows the steepest ascent equation of the thus, are called the atomic spaces. Sometimes, we call $ a log p W jI plus a Brownian motion dB along the curve partition space and $` ; ` gI ; gP ; gQ ; gR ; cI ; cP ; cQ the cue n normal direction ~ s. By variational calculus, this is [29], spaces. dÀij s dt 4 EXPLORING THE SOLUTION SPACE BY p I x s; y sY Âi ; `i p fprior s log PT tdB ~ s:n ERGODIC MARKOV CHAINS p I x s; y sY Âj ; `j The solution space in Fig. 3 is typical for vision problems. The first two terms are derived from the prior and The posterior probability p W jI not only has an enormous likelihood, respectively. The Brownian motion is a normal number of local maxima but is distributed over subspaces distribution whose magnitude is controlled by a tempera- of varying dimensions. To search for globally optimal ture T t which decreases with time t. The Brownian motion solutions in such spaces, we adopt the Markov chain Monte helps to avoid local small pitfalls. The log-likelihood ratio Carlo (MCMC) techniques. requires that the image models are comparable. Dynamics 1 662 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 realizes diffusion within the atomic (or partition) space $k images) for a region Ri . For example, from texture (i.e., moving within a rectangle of Fig. 3). description to a spline surface, etc. Dynamics 2: Model Adaptation. This is simply to fit the parameters of a region by steepest ascent. One can add a W `i ; Âi ; WÀ 23 `Hi ; ÂHi ; WÀ W H : Brownian motion, but it does not make much of a difference in The proposal probabilities are practice. G W 3 dW H q Sq Ri q `Hi q ÂHi jRi ; `Hi dW H ; IT dÂi @ log p IRi Y Âi ; `i H G W 3 dW q Sq Ri q `i q Âi jRi ; `i dW : IU dt @Âi : This realizes diffusion in the atomic (or cue) spaces $l ; ` P fgI ; gP ; gQ ; gR ; cI ; cP ; cQ g (move within a triangle, square, 4.3 The Bottlenecks diamond, or circle of Fig. 3). The speed of a Markov chain depends critically on the design Dynamics 3-4: Split and Merge. Suppose at a certain time of its proposal probabilities in the jumps. In our experiments, step, a region Rk with model Âk is split into two regions Ri and the proposal probabilities, such as q I; F F F ; q S, q Rk , Rj with models Âi ; Âj , or vice verse, and this realizes a jump q Ri ; Rj , q ` are easy to specify and do not influence the between two states W to W H as shown by the arrows in Fig. 3. convergence significantly. The real bottlenecks are caused by two proposal probabilities in the jump dynamics. W K; Rk ; `k ; Âk ; WÀ 2 3 K I; Ri ; `i ; Âi ; Rj ; `j ; Âj ; WÀ W H ; 1. q ÀjR in (13): Where is a good À for splitting a given region R? q ÀjR is a probability in the atomic (or where WÀ denotes the remaining variables that are un- partition) space $ . changed during the move. By the classic Metropolis- 2. q ÂjR; ` in (13), (15), and (17): For a given region R Hastings method [19], we need two proposal probabilities and a model family ` P fgI ; F F F ; gR ; cI ; cP ; cQ g, what is G W 3 dW H and G W H 3 dW . G W 3 dW H is a condi- a good Â? q ÂjR; ` is a probability in the atomic tional probability for how likely the Markov chain proposes (cue) space $` . to move to W H at state W and G W H 3 dW is the proposal probability for coming back. The proposed split is then It is worth mentioning that both probabilities q ÀjR and accepted with probability q ÂjR; ` cannot be replaced by deterministic decisions which were used in region competition [29] and others [15]. H G W H 3 dW p W H jIdW H Otherwise, the Markov chain will not be reversible and, W 3 dW min I; : G W 3 dW H p W jIdW thus, reduce to a greedy algorithm. On the other hand, if we choose uniform distributions, it is equivalent to blind search There are two routes (or ªpathwaysº in a psychology and the Markov chain will experience exponential ªwait- language) for computing the split proposal G W 3 dW H . ingº time before each jump. In fact, the length of the waiting In route 1, it first chooses a split move with probability time is proportional to the volume of the cue spaces. The q Q, then chooses region Rk from a total of K regions at design of these probabilities need to strike a balance random; we denote this probability by q Rk . Given Rk , it between speed and robustness (nongreediness). chooses a candidate splitting boundary Àij within Rk with While it is hard to analytically derive a convergence probability q Àij jRk . Then, for the two new regions Ri ; Rj , rate for complicated algorithms that we are dealing with, it it chooses two new model types `i and `j with probabilities is revealing to observe the following theorem in a simple q `i and q `j , respectively. Then, it chooses Âi P $`i with case [18]: probability q Âi jRi ; `i and chooses Âj with probability q Âj jRj ; `j . Thus, Theorem 1. Sampling a target density p x by independence Metropolis-Hastings algorithm with proposal probability q x. G W 3 dW H q Qq Rk q Àij jRk q `i Let P n xo ; y be the probability of a random walk to reach IQ q Âi jRi ; `i q `j q Âj jRj ; `j dW H : point y at n steps. If there exists > H such that, In route 2, it first chooses two new region models Âi and q x ! ; Vx; Âj and, then, decides the boundary Àij . Thus, p x G W 3 dW H q Qq Rk q `i q `j q Âi ; Âj jRk ; `i ; `j then the convergence measured by a LI norm distance IR q Àij jRk ; Âi ; Âj dW H : jjP n xo ; Á À pjj I À n : We shall discuss in later a section that either of the two This theorem states that the proposal probability q x routes can be more effective than the other depending on the region Rk . should be very close to p x for fast convergence. In our Similarly, we have the merge proposal probability, case, q ÀjR and q ÂjR; ` should be equal to the conditional probabilities of some marginal probabilities G W H 3 dW q Rq Ri ; Rj q `k q Âk jRk ; `k dW : IS of the posterior p W jI within the atomic spaces $ and $` , respectively. That is, q Ri ; Rj is the probability of choosing to merge two regions Ri and Rj at random. Ã Ã qÀ Àij jRk p Àij jI; Rk ; qÂ ÂjR; ` p ÂjI; R; `; V`: Dynamics 5: Switching Image Models. This switches the IV image model within the four families (three for color TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 663 Fig. 4. A color image and its clusters in LÃ ; uÃ ; vÃ space for $cI , the second row are six of the saliency maps associated with the color clusters. Ã Ã Unfortunately, qÀ and qÂ have to integrate information strategy with a limit m j ` j. This is well discussed in the from the entire image I and, thus, are intractable. We must literature [5], [6]. ` seek approximations and this is where the data-driven In the clustering algorithms, each feature Fv and, thus, its ` ` methods step in. location v is classified to a cluster Âi with probability Si;v , In the next section, we discuss data clustering for each X m atomic space $` ; ` P fcI ; cP ; cQ g and ` P fgI ; gP ; gQ ; gR g and ` ` Si;v p Fv Y Â` ; with ` Si;v I; Vv P Ã; V`: i edge detection in $ . The results of clustering and edge iI detection are expressed as nonparametric probabilities for Ã Ã This is a soft assignment and can be computed by the approximating the ideal marginal probabilities qÀ and qÂ in distance from Fv to the cluster centers. We call these atomic spaces, respectively. ` Si` fSi;v X v P Ãg; for i I; P; F F F ; m; V` PH 5 DATA-DRIVEN METHODS a saliency map associated with cue particle Â` . i 5.1 Method I: Clustering in Atomic Spaces $` In the following, we discuss each model family with for ` P fcI ; cP ; cQ ; gI ; gP ; gQ ; gR g experiments. Given an image I (gray or color) on lattice Ã, we extract a ` feature vector Fv at each pixel v P Ã. The dimension of Fv` 5.1.1 Computing Cue Particles in $cI depends on the image model indexed by `. Then, we have a For color images, we take Fv Lv ; Uv ; Vv and apply a collection of vectors mean-shift algorithm [5], [6] to compute color clusters in $cI . For example, Fig. 4 shows a few color clusters (balls) in ` ` fFv X v P Ãg: a cubic ( LÃ ; uÃ ; vÃ -space) for a simple color image (left), the size of the balls represents the weights !cI . Each cluster is i In practice, v can be subsampled for computational ease. associated with a saliency map SicI for i I; P; F F F ; T in the The set of vectors are clustered by either an EM method [7] second row and the bright areas mean high probabilities. or a mean-shift clustering [5], [6] algorithm to ` . The EM- From left to right are, respectively, background, skin, shirt, clustering approximates the points density in ` by a shadowed skin, pant and hair, highlighted skin. mixture of m Gaussians and it extends from the m-mean clustering by a soft cluster assignment to each vector Fv . 5.1.2 Computing Cue Particles in $cQ The mean-shift algorithm assumes a nonparametric dis- Each point v contributes its color Iv Lv ; Uv ; Vv as ªsur- tribution for ` and seeks the modes (local maxima) in its face heightsº and we apply an EM-clustering to find the density (after some Gaussian window smoothing). Both spline surface models. Fig. 5 shows the clustering result for algorithms return a list of m weighted clusters Â` ; Â` ; F F F ; Â` I P m the woman image. Fig. 5a, Fig. 5b, Fig. 5c, and Fig. 5d are with weights !` ; i I; P; F F F ; m and we denote by i saliency maps SicQ for i I; P; Q; R. Fig. 5e, Fig. 5f, Fig. 5g, and Fig. 5h are the four reconstructed images according to ` f !` ; Â` X i I; P; F F F ; m: g: i i IW fitted spline surfaces which recover some global illumina- We call !` ; Â` a weighted atomic (or cue) particle in $` tion variations. i i for ` P fcI ; cQ ; gI ; gP ; gQ ; gR g.3 The size m is chosen to be 5.1.3 Computing Cue Particles in $gI conservative or it can be computed in a coarse-to-fine In this model, the feature space Fv Iv is simply the 3. The atomic space $cP is a composition of two $cI and, thus, is intensity and gI is the image intensity histogram. We computed from $cI . simply apply a mean-shift algorithm to get the modes 664 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 Fig. 5. (a)-(d) are saliency maps associated with four clusters in $cQ . (e)-(h) are the color spline surfaces for the four clusters. Fig. 6. A clustering map (left) for $gI and six saliency maps SigI ; i I::T of a zebra image (input is in Fig. 14a). Fig. 7. A clustering map (left) for $gP and six saliency maps SigP ; i I; F F F ; T of a zebra image (input is in Fig. 14a). (peaks) of the histogram and the breadth of each peak g nine bins. Then, Fv Q hv;I;I ; F F F ; hv;V;W is the feature. An EM decides its variance. clustering is applied to find the m mean histograms Fig. 6 shows six saliency maps SigI ; i I; P; F F F ; T for a zebra " hi ; i I; P; F F F ; m. We can compute the cue particles for image (the original image is shown in Fig. 14a. In the " clustering map on the left in Fig. 6, each pixel is assigned to its texture models ÂgQ from hi for i I; P; F F F ; m. A detailed i most likely particle. We show the clustering by pseudocolors. account of this transform is referred to a previous paper [31]. Fig. 8 shows the texture clustering results on the zebra 5.1.4 Computing the Cue Particles in $gP image with one clustering map on the left and five saliency For clustering in $gP , at each subsampled pixel v P Ã, we maps for five particles ÂgQ ; i I; P; F F F ; S. i compute Fv as a local intensity histogram Fv hvH ; F F F ; hvG accumulated over a local window centered at v. Then, an 5.1.6 Computing Cue Particles in $gR EM clustering is applied to compute the cue particles and Each point v contributes its intensity Iv Fv as a ªsurface each particle ÂgP ; i I; F F F ; m is a histogram. This model is i heightº and we apply an EM-clustering to find the spline used for clutter regions. surface models. Fig. 9 shows a clustering result for the zebra Fig. 7 shows the clustering results on the same zebra image. image with four surfaces. The second row shows the four surfaces which recover some global illumination variations. 5.1.5 Computing Cue Particles $gQ Unlike the texture clustering results which capture the zebra At each subsampled pixel v P Ã, we compute a set of eight local histograms for eight filters over a local window of strips as a whole region, the surface models separate the black IP Â IP pixels. We choose eight filters for computational and white stripes as two regionsÐanother valid perception. convenience: one filter, two gradient filters, one Laplacian of Interestingly, the black and white strips in the zebra skin both Gaussian filter, and four Gabor filters. Each histogram has have shading changes which are fitted by the spline models. TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 665 Fig. 8. Texture clustering. A clustering map (left) and five saliency maps for five particles ÂgQ ; i I; P; F F F ; S. i Fig. 9. Clustering result on the zebra image under Bezier surface model. The left image is the clustering map. The first row of images on the right side are the saliency maps. The second row shows the fitted surfaces using the surface height as intensity. X m X m 5.2 Method II: Edge Detection q ÂjÃ; ` !` G Â À Â` ; i i with !` I; i PI i I i I We detect intensity edges using Canny edge detector [4] and color edges using a method in [17] and trace edges to where G x is a Parzen window centered at H. As a form a partition of the image lattice. We choose edges at matter of fact, q ÂjÃ; ` q ÂjI is an approximation to a three scales according to edge strength and, thus, compute marginal probability of the posterior p W jI on cue space $` ; ` P fgI ; gP ; gQ ; gR ; cI ; cQ g since the partition is inte- the partition maps in three coarse-to-fine scales. We choose grated out in EM-clustering. not to discuss the details, but show some results using the q ÂjÃ; ` is computed once for the whole image and two running examples: the woman and zebra images. q ÂjR; ` is computed from q ÂjÃ; ` for each R at runtime. Fig. 10a shows a color image and three scales of partitions. It proceeds in the following. Each cluster Â` ; i I; P; F F F ; m i Since this image has strong color cue, the edge maps are very receives a real-valued vote from the pixel v P R in region R informative about where the region boundaries are. In and the accumulative vote is the summation of the saliency contrast, the edge maps for the zebra image are very messy, map Si` associated with Â` , i.e., i as Fig. 11 shows. I X ` pi S ; i I; P; F F F ; m; V`: jRj vPR i;v 6 COMPUTING IMPORTANCE PROPOSAL PROBABILITIES Obviously, the clusters which receive high votes should It is generally acknowledged in the community that cluster- have high chance to be chosen. Thus, we sample a new ing and edge detection algorithms can sometimes produce image model Â for region R, good segmentations or even perfect results for some images, X m but very often they are far from being reliable for generic Â $ q ÂjR; ` pi G Â À Â` : i PP images, as the experiments in Figs. 4, 5, 6, 7, 8, 9, 10, and 11 iI demonstrate. It is also true that sometimes one of the image Equation (22) explains how we choose (or propose) an models and edge detection scales could do a better job in image model for a region R. We first draw a cluster i at segmenting some regions than other models and scales, but we do not know a priori what types of regions present in a generic image. Thus, we compute all models and edge detection at multiple scales and, then, utilize the clustering and edge detection results probabilistically. MCMC theory provides a framework for integrating this probabilistic information in a principled way under the guidance of a globally defined Bayesian posterior probability. We explain how the importance proposal probabilities q ÂjR; ` and q Àij jRk in Section 4.3 are computed from the data-driven results. 6.1 Computing Importance Proposal Probability q ÂjR; ` The clustering method in an atomic (cue) space $` outputs a Fig. 10. Partition maps at three scales of details for a color image. set of weighted cue particles ` . ` encodes a nonpara- (a) Input image. (b) Partition map at scale 1. (c) Partition map at scale 2. metric probability in $` , (d) Partition map at scale 3. 666 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 Fig. 11. A gray-level image and three partition maps at three scales. (a) Input image. (b) Partition map at scale 1. (c) Partition map at scale 2. (d) Partition map at scale 3. Fig. 12. A candidate region Rk is superimposed on the partition maps at three scales for computing a candidate boundary Àij for the pending split. random according to probability p pI ; pP ; F F F ; pm and, So each partition map Á s encodes a probability in the then, do a random perturbation at Â` . Thus, any Â P $` has i atomic (partition) space $k . a nonzero probability to be chosen for robustness and s jÅk j ergodicity. Intuitively, the clustering results with local votes I X s s propose the ªhottestº portions of the space in a probabilistic q k s G k À k;j ; for s I; P; Q: Vk: PR jÅk j jI way to guide the jump dynamics. In practice, one could implement a multiresolution (on a G is a smooth window centered at H and its smoothness pyramid) clustering algorithm over smaller local windows, accounts for boundary deformations and forms a cluster s thus the clusters Â` ; i I; P; F F F ; m will be more effective at the i around each partition particle and k À k;j measures the s expense of some overhead computing. difference between two partition maps k and k;j . Martin et al. [21] recently proposed a method of measuring such 6.2 Computing Importance Proposal difference and we use a simplified version. In the finest Probability q ÀjR s resolution, all metaregions reduce to pixels and Åk is then By edge detection and tracing, we obtain partition maps equal to the atomic space $k . We adopt equal weights for s denoted by Á s at multiple scales s I; P; Q. In fact, each all partitions k and one may add other geometric partition map Á s consists of a set of ªmetaregionsº preferences to some partitions. s ri ; i I; P; F F F ; n, In summary, the partition maps at all scales encode a nonparametric probability in $k , s s Á s Ã fri X i I; P; F F F ; n; n ri Ãg; X iI q k q sq s k ; Vk: for s I; P; Q: s These metaregions are then used in combination to form This q k can be considered as an approximation to the s s s marginal posterior probability p k jI. K n regions RI ; RP ; F F F ; RK , The partition maps Á s ; Vs (or q k ; Vk implicitly) are s s s Ri j rj ; with rj P Á s ; Vi I; P; F F F ; K: computed once for the whole image, then the importance proposal probability q ÀjR is computed from q k for each One could put a constraint that all metaregions in a region region as a conditional probability at run time, like in the cue s Ri are connected. spaces. s s s s Let k RI ; RP ; F F F ; Rk denote a k-partition based Fig. 12 illustrates an example. We show partition maps s on Á s . k is different from the general k-partition k Á s Ã at three scales and the edges are shown at width s s because regions Ri ; i I; F F F ; K in k are limited to the Q; P; I, respectively, for s I; P; Q. A candidate region R is s metaregions. We denote by Åk the set of all k-partitions proposed to split. q ÀjR is the probability for proposing a based on a partition map Á s . splitting boundary À. We superimpose R on the three partition maps. The s s s s s s Åk f RI ; RP ; F F F ; Rk k X k Ri Ãg: PQ iI intersections between R and the metaregions generate s s three sets We call each k in Åk a k-partition particle in atomic s s s (partition) space $k . Like the clusters in a cue space, Åk is Á s R frj X rj R rj for rj P Á s Ã; a sparse subset of $k and it narrows the search in $k to s nd i ri Rg; s I; P; Q: the most promising portions. TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 667 For example, in Fig. 12, of many existing image segmentation algorithms. First, edge detection and tracing methods [4], [17] compute I I P P P P Á I R frI ; rP g; Á P R frI ; rP ; rQ ; rR g; implicitly a marginal probability q jI on the partition space $ . Second, clustering algorithms [5], [6] compute a etc. marginal probability on the model space $` for various s s Thus, we can define s R RI ; RP ; F F F ; R s as c c models `. Third, the split-and-merge and model switching a c-partition of region R based on Á s R and define a [2] realize jump dynamics. Fourth, region growing and c-partition space of R as competition methods [29], [24] realize diffusion dynamics for evolving the region boundaries. s s Å s R f RI ; RP ; F F F ; R s c c s PS s R X c Ri Rg; Vs: c iI 7 COMPUTING MULTIPLE DISTINCT SOLUTIONS We can define distributions on Å s R. 7.1 Motivation and a Mathematical Principle c The DDMCMC paradigm samples solutions from the s jÅX c Rj posterior W $ p W jI endlessly, as we argued in the s I s q c R G c À c;j R; introduction that segmentation is a computing process not s jÅc Rj jI PT a task. To extract an optimal result, one can take an for s I; P; Q; Vc: annealing strategy and use the conventional maximum a posteriori (MAP) estimator Thus, one can propose to split R into c pieces, in a general case, X W Ã rg mx p W jI: W P c R $ q c R q sq s c R: s In this paper, we argue that it is desirable and often critical to have the ability of computing multiple distinct solutions That is, we first select a scale s with probability q s. q s for the following reasons. depends on R. For example, for a large region R, we can First, natural scenes are intrinsically ambiguous and for choose coarse scale with higher probability and choose a an image I, many competing organizations and interpreta- fine scale for small regions. Then, we choose a c-partition tions exist in visual perception. from the set Å s R. In our implementation, c P is chosen c Second, for robustness, decisions should be left to the last as a special case for easy implementation. It is trivial to stage of computation when a segmentation process is show that an arbitrary c-partition of region R, c R, can be integrated with a specific task. Therefore, it is best to generated through composing P R in multiple steps. maintain a set of typical solutions. Obviously, there is a big overhead for choosing large c. Third, preserving multiple solutions is necessary when 6.3 Computing q Âi ; Âj jR; `i ; `j and q Àij jR; Âi ; Âj the prior and likelihood models are not perfect. Because the globally optimal solution may not be semantically more In some cases, we find the second route useful for splitting a meaningful than some other inferior local maxima. region which we discussed in designing MCMC dynamics However, simply keeping a set of samples from the 3-4 (see (14)). Markov chain sequence is not enough because it often collects For example, there are two ways to perceive the zebra in a set of segmentations which are trivially different from each Fig. 14. One perceives the zebra as one textured region (by a other. Here, we present a mathematical principle for model in $gQ ). The other sees it as one region of black stripes computing important and distinctive solutions in space . plus one region of white strips and, thus, uses two models in (Our result was presented earlier in a CVPR2000 paper.) $gI or $gR . The Markov chain should be able to switch between Let S f !i ; Wi X i I; F F F ; Kg be a set of K weighted the two perceptions effectively (see results in Fig. 14b, Fig. 14c, solutions which we call ªscene particles,º the weight is its and Fig. 14d. This is necessary and typical for the transitions posterior probability !i p W jI; i I; P; F F F ; K. (Note that between any texture regions and intensity regions. there is a slight abuse of notation, we use K for the number Because the number of strips in such textures is large, the of regions in W before. Here, it is a different K). S encodes a first split procedure (route 1) is very ineffective and it works nonparametric probability in , on one strip at a time. This motivates the second pathway for split dynamics. X !i K X K For a candidate region R, we first propose two new p W jI G W À Wi ; !i !: region models (we always assume the same labels `i `j ), iI ! iI which can be done by twice sampling the importance G is a Gaussian window in . proposal probabilities q ÂjR; `, so As all image ambiguities are captured in the Bayesian Âi ; Âj $ q Âi ; Âj jR; `i ; `j q Âi jR; `i q Âj jR; `j : posterior probability to reflect the intrinsic ambiguities, we should compute the set of solutions S which best preserves Obviously, we exclude Âi from the candidate set when we the posterior probability. Thus, we let p W jI approach select Âj . Then, we decide on the boundary À q Àij jR; Âi ; Âj p W jI by minimizing a Kullback-Leibler divergence D pjj p by randomly labeling the pixels in R according to probabil- under a complexity constraint jSj K, ities of the saliency maps. Z p W jI 6.4 A Unifying Framework S Ã rg min D pjj rg min p p W jI log dW : jSjK jSjK p W jI To summarize this section, the DDMCMC paradigm PU provides a unifying framework for understanding the roles 668 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 Fig. 13. Segmenting a color image by DDMCMC with two solutions. See text for explanation. This criterion extends the conventional MAP estimator (see In practice, we run multiple Markov chains and add new Appendix for further discussion). particles to the set S in a batch fashion. From our experiments, the greedy algorithm did a satisfactory job 7.2 A KEedventurers Algorithm for Multiple and it is shown to be optimal in two low dimensional Solutions examples in the Appendix. p Fortunately, the KL-divergence D pjj can be estimated p fairly accurately by a distance measure D pjj which is computable, thanks to two observations of the posterior 8 EXPERIMENTS probability p W jI which has many separable modes. The The DDMCMC paradigm was tested extensively on many details of the calculation and some experiments are given in gray-level, color and textured images. This section shows the appendix. The idea is simple. We can always represent some examples and more are available on our website.5 It p W jI by a mixture of Gaussian, i.e., a set of N particles was also tested in a benchmark data set of 50 natural images with N large enough. By ergodicity, the Markov chain is in both color and gray-level [21] by the Berkeley group,6 supposed to visit these significant modes over time! Thus, where the results by DDMCMC and other methods such as our goal is to extract K distinct solutions from the Markov [26] are displayed in comparison to those by a number of human subjects. Each tested algorithm uses the same chain sampling process. Here, we present a greedy parameter setting for all the benchmark images and, thus, algorithm for computing S Ã approximately. We call the the results were obtained purely automatically. algorithmÐªKEdventurersº algorithm.4 We first show our working example on the color woman Suppose we have a set of K particles S at step t. At time image. Following the importance proposal probabilities for t I, we obtain a new particle (or a number of particles) by the edges in Fig. 10 and for color clustering in Fig. 4, we MCMC, usually following a successful jump. We augment simulated three Markov chains with three different initial the set S to S by adding the new particle(s). Then, we segmentations shown in Fig. 13 (top row). The energy eliminate one particle (or a number of particles) from S to get changes (À log p W jI) of the three MCMCs are plotted in a new Snew by minimizing the approximative KL divergence Fig. 13 against time steps. Fig. 13 shows two different D p jjpnew . solutions WI ; WP obtained by a Markov chain using KEdventurers algorithm. To verify the computed solution The k-adventurers algorithm Wi , we synthesized an image by sampling from the 1. Initializing S and p by repeating one initial solution likelihood Isyn $ p IjWi ; i I; P. The synthesis is a good i K times. way to examine the sufficiency of models in segmentation. 2. Repeat Fig. 14 shows three segmentations on a gray-level zebra 3. Compute a new particle !KI ; xKI by DDMCMC image. As we discussed before, the DDMCMC algorithm in after a successful jump. this paper has only one free parameter which is a ªclutter S 4. S 2 S f !KI ; xKI g. factorº in the prior model (see (2)). It controls the extents of 5. p 2 S . segmentations. A big encourages coarse segmentation 6. For i I; P; F F F ; K I do with large regions. We normally extract results at three 7. SÀi 2 S =f !i ; xi g. scales by setting I:H; P:H; Q:H, respectively. In our experiments, the KEdventurers algorithm is effective only 8. pÀi 2 SÀi . for computing distinct solutions in a certain scale. We 9. p di D pjjÀi . expect the parameter can be fixed to a constant if we form 10. iÃ rg miniPfI;P;FFF;KIg di . an image pyramid with multiple scales and conduct 11. S 2 SÀiÃ ; p 2 pÀiÃ 5. See www.cis.ohio-state.edu/oval/Segmentation/DDMCMC/ 4. The name follows a statistics metaphor told by Mumford to one of the DDMCMC.htm. authors Zhu. A team of K adventurers want to occupy K largest islands in 6. See www.cs.berkeley.edu/~dmartin/segbench/BSDS100/html/ an ocean while keeping apart from each other's territories. benchmark. TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 669 Fig. 14. Experiments on the gray-level zebra image with three solutions: (a) input image, (b)-(d) are three solutions, Wi ; i I; P; Q, for the zebra image, (e)-(g) are synthesized images Isyn $ p IjWiÃ for verifying the results. i segmentation with KEdventurers algorithm at each scale 9 DISCUSSION and, then, propagate and refine the results to the next finer In future work, we shall extend the DDMCMC paradigm in scale sequentially. This will be done in future research. three directions: For the zebra image, WI segments out the black and white stripes while WP and WQ treat the zebra as a texture 1. Integrating other image models, such as point, curve region. The synthesized images Isyn $ p IjWi ; i I; P; Q processes for perceptual organization, and object i models such as face, for object recognition. show that the texture model is not sufficient because we 2. Incorporating specific vision tasks with this sto- choose only eight small filters for computational ease. Also chastic inference engine. When there is a special the spline surface model plays an important role in purpose, the computing process is tuned (attended segmenting the ground and background grass and this is in a psychology term) to minimize some criterion, verified by the global shading changes in Isyn and Isyn . P Q such as a risk function, which guides the selection Figs. 15 and 16 display some other gray-level and color and pruning of results. images using the same algorithm. We show the input (left) 3. Analyzing the DDMCMC convergence rate and linking it to the goodness of the set of heuristics qs. and a segmentation (middle) starting with arbitrary initial conditions and a synthesized image (right) drawn from the APPENDIX likelihood Isyn $ p IjW . The values for these images are mostly set up as I:S with a few obtained at 1.0-3.5. It took APPROXIMATING THE KL-DIVERGENCE about 10-50 minutes, depending upon the complexity of For simplicity of notation, we denote by p x an arbitrary image contents, on a Pentium III PC to segment an image distribution in space . In our case, p x represents the posterior p W jI. with medium size, such as QSH Â PSH pixels, after learning For segmentation problems, we observe that p x has the pseudolikelihood texture models at the beginning. two important properties. The synthesis images show that we need to engage more stochastic models such as point, curve process, and object like 1. p x has an enormous number of local maxima faces, etc. For example, in the first row of Fig. 16. The music (called modes in statistics). A significant mode band in a football stadium forms a point process which is not corresponds to a distinct interpretation of the image and the cloud surrounding a mode is local small captured. The face is also missing in the synthesis. Fig. 17 shows three gray-level images out of the perturbation of the region boundaries or model parameters. These significant modes of p x, de- 50 natural images in both color and gray-level for the noted by xi ; i I; P; F F F , are well separated from benchmark study. The input (left), the segmentation results each other due to the high dimensions. by DDMCMC (middle), and the manual segmentation by a 2. Each mode xi has a weight !i p xi and its energy human subject (right) are displayed. is defined as E xi À log p xi . The energies of 670 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 Fig. 15. Gray-level image segmentation by DDMCMC. Left: input images, middle: segmentation results W , right: synthesized images Isyn $ p IjW with the segmentation results W . these modes are uniformly distributed in a broad masses could differ in many orders of magnitudes. The range Emin ; Emx , say, I; HHH; IH; HHH. For example, above metaphor leads us to a mixture of Gaussian representation of p x. For a large enough N, we have, it is normal to have solutions (or local maxima) whose energies differ in the order of SHH or more. IX N X N Thus, their probability (weights) differ in the order p x !j G x À xj ; P ; j ! !j : ! jI jI of expÀSHH and our perception is interested in those ªtrivialº local modes. We denote by Intuitively, it helps to imagine that p x in is distributed like the mass of the universe. Each star is a mode as local X N So f !j ; xj ; j I; P; F F F ; Ng; !j ! ; maximum of the mass density. The significant and devel- jI oped stars are well separated from each other and their TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 671 Fig. 16. Color image segmentation by DDMCMC. Left: input images, middle: segmentation results W , right: synthesized images Isyn $ p IjW with the segmentation results W . Fig. 17. Some segmentation results by DDMCMC for the benchmark test by Martin. The errors for the above results by DDMCMC (middle) compared with the results by a human subject (right) are H:IHVQ, H:QHVP, and H:SPWH, respectively, according to their metrics. the set of weighted particles (or modes). Thus, our task is to By analogy, all ªstarsº have the same volume, but differ in select K << N particles S from So . We define a mapping weight. With the two properties of p x, we can approximately from the index in S to the index in So , p compute D pjj in the following. We start with an observa- tion for the KL-divergence for Gaussian distributions. X fI; P; F F F ; KgÀ3fI; P; F F F :; Ng: Let pI x G x À I Y P and pP x G x À P Y P be two Gaussian distributions, then it is easy to check that Therefore, S f ! i ; x i Y i I; P; F F F ; Kg I À P P D pI jjpP : PP S encodes a nonparametric model for approximating p x by We partition the solution space into disjoint domains I X K X K p x ! i G x À x i ; P ; i ! i : N Di ; iI Di Dj Y; Vi T j: iI iI Di is the domain where p x is decided by a particle !i ; xi . Our goal is to compute The reason for this partition is that the particles in S are far apart from each other in high dimensional space and their S Ã rg min D pjj: p energy varies significantly as the two properties state. Within jSjK each domain Di , it is reasonable to assume that p x is For notational simplicity, we assume all Gaussians have the dominated by one term in the mixture and the other N À I same variance in approximating p x, j ; j I; P; F F F ; N. terms are neglectable. 672 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 5, MAY 2002 Fig. 18. (a) A 1D dist. p x with four particles xi ; i I; P; Q; R. (b) A 2D dist p x with SH particles we show log p x in the image for visualization. p p (c) pI x with six particles that minimizes D pjj or D pjj. (d) pP x with six particles that minimizes jp À pj. TABLE 1 Distances between p x and p x for Different Particle Set SQ !i p x % G x À xi Y P ; x P Di ; i I; P; F F F ; N: Equation (28) has some intuitive meanings. The second ! term suggests that each selected x c i should have large The size of Di is much larger than P . After removing N À weight ! c i . The third term is the attraction forces from K particles in the space, a domain Di is dominated by a particles in So to particles in S. Thus, it helps to pull apart the particles in Sk and also plays the role of encouraging to nearby particle that is selected in S. choose particles with big weight like the second term. It can We define a second mapping function be shown that (27) generalizes the traditional (MAP) c X fI; P; F F F ; Ng 3 fI; P; F F F ; Kg; estimator when K I. To demonstrate the goodness of approximation D pjj p so that p x in Di is dominated by a particle x c i P SK , p to D pjj, we show two experiments below. Fig. 18a !c i displays a 1D distribution p x, which is a mixture of N R p x % G x À x c i Y P ; x P Di ; i I; P; F F F ; N: Gaussians (particles). We index the centers from left to right xI < xP < xQ < xR . Suppose we want to choose K Q Intuitively, the N domains are partitioned into K groups, particles for SK and p x. Table 1 displays the distances each of which is dominated by one particle in SK . Thus, we between p x and p x over the four possible combinations. p can approximate D pjj. The second row shows the KL-divergence D pjj and the p p third row is D pjj. The approximation is very accurate XZ N p x given the particles are well separable. p D pjj p x log dx Both measures choose xI ; xQ ; xR as the best S. Particle xP is nI Dn p x not favored by the KL-divergence because it is near xI , XZ N IX N although it has much higher weight than xQ and xR . The fourth !i G x À xi Y P log row shows the absolute value of the difference between p x nI Dn ! iI PN and p x. This distance favors xI ; xP ; xQ and xI ; xP ; xR . In I P ! iI !i G x À xi Y comparison, the KL-divergence favors particles that are apart Pk dx I P from each other and picks up significant peaks in the tails. jI ! j G x À j Y This idea is better demonstrated in Fig. 18. Fig. 18a X Z !n N shows log p x ÀE x which is renormalized for display- % G x À xn Y P nI Dn ! ing. p x consists of N SH particles whose centers are shown by the black spots. The energies E xi ; i I; P; F F F ; N !n G x À xn Y P log log dx are uniformly distributed in an interval H; IHH. Thus, their ! ! c n G x À x c n Y P weights differ in exponential order. Fig. 18b shows log p x " # X !n N xn À x c n P p with k T particles that minimize both D pjj and D pjj. p !n log log Fig. 18c shows the six particles that minimize the absolute nI ! ! ! c n PP difference jp À pj. Fig. 18b has more disperse particles. X !n N For segmentation, a remaining question is: How do we log measure the distance between two solutions WI and WP ? ! nI ! " # This distance measure is to some extent subjective. We xn À x c n P adopt a distance measure which simply accumulates the E x c n À E xn p D pjj: PP differences for the number of regions in WI ; WP and the types ` of image models used at each pixel by WI and WP . TU AND ZHU: IMAGE SEGMENTATION BY DATA-DRIVEN MARKOV CHAIN MONTE CARLO 673 ACKNOWLEDGMENTS [22] S. Oe, ªTexture Segmentation Method by Using Two-Dimensional AR Model and Kullback Information,º Pattern Recognition, vol. 26, This work is partially supported by two US National pp. 237-243, 1993. [23] S. Osher and J.A. Sethian, ªFront Propagating with Curvature Science Foundation grants IIS 98-77-127 and IIS-00-92-664, a Dependent Speed: Algorithms Based on Hamilton-Jacobi For- National Aeronautics and Space Agency grant NAG-13- mulation,º J. Computational Physics, vol. 79, pp. 12-49, 1988. [24] N. Paragios and R. Deriche, ªCoupled Geodesic Active Regions 00039, an Office of Naval Research grant N000140-110-535, for Image Segmentation: A Level Set Approach,º Proc. European and a Microsoft gift. MS student Rong Zhang helped in the Conf. Computer Vision, June 2000. experiments for computing multiple solution experiments [25] S. Sclaroff and J. Isidoro, ªActive Blobs,º Proc. Int'l Conf. Computer Vision, pp. 1146-1153, 1998. in 1D and 2D used in Section 7. The authors thank David [26] J. Shi and J. Malik, ªNormalized Cuts and Image Segmentation,º Mumford for discussions. IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, Aug. 2000. [27] R.H. Swendsen and J.S. Wang, ªNonuniversal Critical Dynamics in Monte Carlo Simulation,º Physical Rev. Letters, vol. 58, no. 2, REFERENCES pp. 86-88, 1987. [1] L. Alvarez, Y. Gousseau, and J.M. Morel, ªThe Size of Objects in [28] J.P. Wang, ªStochastic Relaxation on Partitions with Connected Natural Images,º Preprint of Centre de Math. and Applications, 2000. Components and Its Applications to Image Segmentation,º IEEE [2] S.A. Barker, A.C. Kokaram, and P.J.W. Rayner, ªUnsupervised Trans. Pattern Analysis and Machine Intelligence, vol. 20, no. 6, Segmentation of Images,º SPIEs 43rd Ann. Meeting, pp. 19-24, July pp. 619-636, June 1998. 1998. [29] S.C. Zhu and A.L. Yuille, ªRegion Competition: Unifying Snakes, [3] C. Bouman and B. Liu, ªMultiple Resolution Segmentation of Region Growing, and Bayes/MDL for Multi-Band Image Seg- Textured Images,º IEEE Trans. Pattern Analysis and Machine mentation,º IEEE Trans. Pattern Analysis and Machine Intelligence, Intelligence, vol. 13, no. 2, pp. 99-113, Feb. 1991 vol. 18, no. 9, pp. 884-900, Sept. 1996. [4] J.F. Canny, ªA Computational Approach to Edge Detection,º IEEE [30] S.C. Zhu, Y.N. Wu, and D. Mumford, ªFilters, Random Field and Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 6, Maximum Entropy: Towards a Unified Theory for Texture pp. 679-698, Nov. 1986. Modeling,º Int'l J. Computer Vision, vol. 27, no. 2, pp. 107-126, 1998. [5] Y. Cheng, ªMean Shift, Mode Seeking, and Clustering,º IEEE [31] S.C. Zhu and X. Liu, ªLearning in Gibbsian Fields: How Accurate Trans. Pattern Analysis and Machine Intelligence, vol. 17, no. 8, and How Fast Can It Be?º Proc. IEEE Conf. Computer Vision and pp. 790-799, Aug. 1995. Pattern Recognition, vol. 2, pp. 2-9, 2000. [6] D. Comaniciu and P. Meer, ªMean Shift Analysis and Applica- tions,º Proc. Int'l Conf. Computer Vision, pp. 1197-1203, 1999. Zhuowen Tu received the BE degree in [7] A.P. Dempster, N.M. Laird, and D.B. Rubin, ªMaximum Like- electronic engineering from Beijing Information lihood from Incomplete Data via the EM Algorithm,º J. Royal Technology Institute in 1993. He received the Statistics Soc. Series B, vol. 39, pp. 1-38, 1977. ME degree in civil engineering from Tsinghua [8] Y.N. Deng, B.S. Manjunath, and H. Shin, ªColor Image University in 1996. He received the MS degree Segmentation,º Proc. IEEE Conf. Computer Vision and Pattern in geodetic science and surveying in 1999 from Recognition, pp. 446-451, 1999. The Ohio State University. He is currently a [9] D.A. Forsyth, ªSampling, Resampling and Colour Constancy,º PhD candidate in the Department of Computer Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 300- and Information Science at The Ohio State 305, 1999 University. He was with the Institute of Compu- [10] S. Geman and D. Geman, ªStochastic Relaxation, Gibbs Distribu- ter Science and Technology at Peking Univer- tions, and the Bayesian Restoration of Images,º IEEE Trans. Pattern sity from 1996 to 1997. His research interests include computer vision, Analysis and Machine Intelligence, vol. 6, pp. 721-741, Nov. 1984. image processing, and geographic information system and mapping. [11] P.J. Green, ªReversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination,º Biometrika, Song-Chun Zhu received the BS degree in vol. 82, no. 4, pp. 711-732, 1995. computer science from the University of Science [12] U. Grenander and M.I. Millter, ªRepresentations of Knowledge in and Technology of China in 1991. He received Complex Systems,º J. Royal Statistics Soc. Series B, vol. 56, no. 4, the MS and PhD degrees in computer science pp. 549-603, 1994. from Harvard University in 1994 and 1996, [13] M. Kass, A. Witkin, and D. Terzopoulos, ªSnakes: Active Contour respectively. He was a research associate in Models,º Int'l J. Computer Vision, vol. 1, no. 4, pp. 321-332, Jan. the Division of Applied Mathematics at Brown 1988. University during 1996-1997 and he was a [14] G. Koepfler, C. Lopez, and J.M. Morel, ªA Multiscale Algorithm lecturer in the Computer Science Department for Image Segmentation by Variational Approach,º SIAM J. at Stanford University during 1997-1998. Since Numerical Analysis, vol. 31, no. 1, pp. 282-299, 1994. 1998, he has been a faculty member in the Department of Computer and [15] Y.G. Leclerc, ªConstructing Simple Stable Descriptions for Image Information Sciences at Ohio State University. His research is focused Partitioning,º Int'l J. Computer Vision, vol. 3, no. 1, pp. 73-102, 1989. on computer vision and learning, statistical modeling, and stochastic [16] A.B. Lee, D.B. Mumford, and J.G. Huang, ªOcclusion Models for computing. He has published 50 articles and received honors, including Natural Images,º Int'l J. Computer Vision, vol. 41, pp. 35-59, Jan. a David Marr prize honorary mention, a US National Science Foundation 2001. Career award, research fellow of Sloan foundation, and a Navy young [17] H.C. Lee and D.R. Cok, ªDetecting Boundaries in a Vector Field,º investigator award. IEEE Trans. Signal Processing, vol. 39, no. 5, pp. 1181-1194, 1991. [18] K.L. Mengersen and R.L. Tweedie, ªRates of Convergence of the Hastings and Metropolis Algorithms,º Annals of Statistics, vol. 24, pp. 101-121, 1994. [19] N. Metropolis, M.N. Rosenbluth, A.W. Rosenbluth, A.H. Teller, and E. Teller, ªEquations of State Calculations by Fast Computing . For more information on this or any other computing topic, Machinesº J. Chemical Physics, vol. 21, pp. 1087-1092, 1953. please visit our Digital Library at http://computer.org/publications/dilb. [20] D.B. Mumford and B. Gidas, ªStochastic Models for Generic Images,º Quarterly of Applied Math., vol. LIX, no. 1, pp. 85-111, Mar. 2001. [21] D. Martin, C. Fowlkes, D. Tal, and J. Malik, ªA Database of Human Segmented Natural Images and Its Application to Evaluating Segmentation Algorithms and Measuring Ecological Statistics,º Proc. Int'l Conf. Computer Vision, vol. 2, pp. 416-423, 2001.