Multi-Image Graph Cut Clothing Segmentation for Recognizing People

                              CVPR 2008 Submission #2670. CONFIDENTIAL REVIEW COPY. DO NOT DISTRIBUTE.

Anonymous CVPR submission

Paper ID 2670

Abstract
Researchers have verified that clothing provides information about the identity of the individual. To extract features from the clothing, the clothing region first must be localized or segmented in the image. At the same time, given multiple images of the same person wearing the same clothing, we expect to improve the effectiveness of clothing segmentation. Therefore, the identity recognition and clothing segmentation problems are intertwined; a good solution for one aids in the solution for the other.

We build on this idea by analyzing the mutual information between pixel locations near the face and the identity of the person to learn a global clothing mask. We segment the clothing region in each image using graph cuts based on a clothing model learned from one or multiple images believed to be the same person wearing the same clothing. We use facial features and clothing features to recognize individuals in other images. The results show that clothing segmentation provides a significant improvement in recognition accuracy for large image collections, and useful clothing masks are simultaneously produced.

A further significant contribution is that we introduce a publicly available consumer image collection where each individual is identified. We hope this dataset allows the vision community to more easily compare results for tasks related to recognizing people in consumer image collections.

Figure 1. It is extremely difficult even for humans to determine how many different individuals are shown and which images are of the same individuals from only the faces (top). However, when the faces are embedded in the context of clothing, it is much easier to distinguish the three individuals (bottom).

1. Introduction

Figure 1 illustrates the limitations of using only facial features for recognizing people. When only six faces (cropped and scaled in the same fashion as images from the PIE [24] database often are) from an image collection are shown, it is difficult to determine how many different individuals are present. Even if it is known that there are only three different individuals, the problem is not much easier. In fact, the three are sisters of similar age. When the faces are shown in context with their clothing, it becomes almost trivial to recognize which images are of the same person. To quantify the role clothing plays when humans recognize people, the following experiment was performed: 7 subjects were given a page showing 54 labeled faces of 10 individuals from the image collection and asked to identify a set of faces from the same collection. The experiment was repeated using images that included a portion of the clothing (as shown in Figure 1). The average correct recognition rate (on this admittedly difficult family album) jumped from 58% when only faces were used, to 88% when faces and clothing were visible. This demonstrates the potential of person recognition using features in addition to the face for distinguishing individuals in family albums.

When extracting clothing features from the image, it is important to know where the clothing is located. In this paper, we describe the use of graph cuts for segmenting clothing in a person image. We show that using multiple images of the same person from the same event allows a better model of the clothing to be constructed, resulting in superior clothing segmentation. We also describe the benefits of accurate clothing segmentation for recognizing people in a consumer image collection.

2. Related Work

Clothing for identification has received much recent research attention. When attempting to identify a person from the same day as the training data for applications such as teleconferencing and surveillance, clothing is an important cue [9, 11, 18]. In these video-based applications, good figure segmentation is achieved from the static environment. The video quality is low enough that facial features are not [...]

                                Set 1   Set 2   Set 3   Set 4
    Total images                  401    1065    2099     227
    Images with faces             180     589     962     161
    No. faces                     278     931    1364     436
    Detected faces                152     709     969     294
    Images with multiple people    77     220     282     110
    Time span (days)               28     233     385      10
    No. days images captured       21      50      82       9
    Unique individuals             12      32      40      10

Table 1. A summary of the four image collections.

In applications related to consumer image collections [1, 27, 29, 31, 32], clothing color features have been characterized by the correlogram of the colors in a rectangular region surrounding a detected face. For assisted tagging of all faces in the collection, combining face with body features provides a 3-5% improvement over using just body features. However, segmenting the clothing region continues to be a challenge; all of the methods above simply extract clothing features from a box located beneath the face. (Although Song and Leung [27] adjust the box position based on other recognized faces and attempt to exclude flesh.)
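As a concrete illustration of this common baseline, the following is a minimal sketch of extracting clothing color features from a fixed box beneath a detected face. The box geometry and the joint RGB histogram (a simple stand-in for the correlogram features used in the cited work) are assumptions for illustration, not the cited methods' exact parameters:

```python
import numpy as np

def clothing_box_features(image, face_box, n_bins=8):
    """Crop a rectangle below a detected face and summarize its color.

    image: H x W x 3 uint8 RGB array.
    face_box: (x, y, w, h) of the detected face.
    Returns a normalized joint RGB histogram of the box (a stand-in for
    the correlogram features used in the cited work).
    """
    x, y, w, h = face_box
    # Heuristic geometry (an assumption, not any paper's exact layout):
    # a box twice the face width, starting half a face height below the chin.
    top = min(image.shape[0], y + int(1.5 * h))
    bottom = min(image.shape[0], top + 2 * h)
    left = max(0, x - w // 2)
    right = min(image.shape[1], left + 2 * w)
    patch = image[top:bottom, left:right].reshape(-1, 3)
    hist, _ = np.histogramdd(patch, bins=(n_bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)
```

Two such histograms can then be compared with, e.g., a chi-squared or histogram-intersection distance; the weakness the text points out is that the box is fixed relative to the face, regardless of occlusion or pose.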
Some researchers have trained models to essentially learn the characteristics of the human form [8, 16, 19, 28]. Broadly speaking, these methods search for different body parts (e.g., legs, arms, or trunk), and use a pre-defined human model to find the most sensible human body amongst the detected body parts. While this model-based approach is certainly justified for the problem, we wonder what can be learned from the data itself. In essence, given many images of people, is it possible for the computer to learn what a person looks like without imposing any physical human model on its interpretation of the images?

Regarding segmenting an object of interest, researchers have attempted to combine the recognition of component object parts with segmentation [30], and to recognize objects among many images by first computing multiple segmentations for each image [22]. Further, Rother et al. extend their GrabCut [20] graph-cutting object extraction algorithm to operate simultaneously on pairs of images [21], and along the same lines, Liu and Chen [15] use PLSA to initialize the GrabCut, replacing the manual interface. We extend this problem into the domain of recognizing people from clothing and faces. We apply graph cuts simultaneously to a group of images of the same person to produce improved clothing segmentation.

Our contributions are the following: We analyze the information content in pixels surrounding the face to discover a global clothing mask (Section 4). Then, on each image, we use graph-cutting techniques to refine the clothing mask, where our clothing model is developed from one or multiple images believed to contain the same individual (Section 5). In contrast to some previous work, we do not use any model of the human body. We build a texture and color visual word library from features extracted in putative clothing regions of people images and use both facial and clothing features to recognize people. We show these improved clothing masks lead to better recognition (Section 7).

Figure 2. Person images at resolution 81×49 and the corresponding superpixel segmentations.

3. Images and Features for Clothing Analysis

Four consumer image collections are used in this work. Each collection owner labeled the detected faces in each image, and could add faces missed by the face detector [10]. The four collections, summarized in Table 1, contain a total of 3009 person images of 94 unique individuals. We experiment on each collection separately (rather than merging the collections), to simulate working with a single person's image collection.

Features are extracted from the faces and clothing of people. Our implementation of a face detection algorithm [10] detects faces, and also estimates the eye positions. Each face is normalized in scale (49×61 pixels) and projected onto a set of Fisherfaces [4], representing each face as a 37-dimensional vector. These features are not the state-of-the-art features for recognizing faces, but are sufficient to demonstrate our approach.

For extracting features to represent the clothing region, the body of the person is resampled to 81×49 pixels, such that the distance between the eyes (from the face detector) is 8 pixels. The crop window is always axis-aligned with the image. Clothing comes in many patterns and a vast palette of colors, so both texture and color features are extracted. A 5-dimensional feature vector of low-level features is found at each pixel location in the resized person image. This dense description of the clothing region is used based on the work of [13, 14] as it is necessary to capture the information present even in uniform color areas of clothing. The three color features are a linear transformation of RGB color values of each pixel to a luminance-chrominance space (LCC).
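A minimal sketch of this dense per-pixel feature extraction (three color features plus two texture features) follows. The particular luminance-chrominance matrix and the simple difference filters are illustrative assumptions, since the exact transform and edge detectors are not specified in this excerpt:

```python
import numpy as np

# A generic luminance-chrominance (LCC) transform. The exact matrix is
# not given in the text, so this one (luminance plus two chrominance
# axes) is an assumption used for illustration only.
LCC = np.array([[ 1/3,  1/3,  1/3],
                [ 1/2,  0.0, -1/2],
                [-1/4,  1/2, -1/4]])

def pixel_features(image):
    """Compute the 5-dimensional per-pixel features: 3 color + 2 texture.

    image: H x W x 3 float RGB array with values in [0, 1].
    Returns an H x W x 5 array.
    """
    color = image @ LCC.T            # per-pixel luminance-chrominance
    lum = color[..., 0]
    # Texture: responses to simple horizontal and vertical difference
    # filters on luminance (stand-ins for the edge detectors in the text).
    dh = np.zeros_like(lum)
    dv = np.zeros_like(lum)
    dh[:, :-1] = lum[:, 1:] - lum[:, :-1]
    dv[:-1, :] = lum[1:, :] - lum[:-1, :]
    return np.dstack([color, dh, dv])
```

Note that in a uniform-color region the two texture responses are zero, which is why the dense color channels are needed to describe plain clothing.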


 216                                                                                    Same day mutual information maps   Different day mutual information maps     270
        The two texture features are the responses to a horizontal
 217                                                                                                                                                                 271
        and vertical edge detector.
 218                                                                                                                                                                 272
            To provide some robustness to translation and movement
 219                                                                                                                                                                 273
        of the person, the feature values are accumulated across re-
 220                                                                                                                                                                 274
        gions in one of two ways. In the first (superpixel) represen-
 221                                                                                                                                                                 275
        tation, the person image is segmented into superpixels using                         Global Clothing Masks                      Mean GT
 222                                                                                                                                                                 276
        normalized cuts [23], shown for example in Figure 2. For
 223                                                                                                                                                                 277
        each superpixel, the histogram over each of the five features
 224                                                                                                                                                                 278
        is computed. In turn, each pixel’s features are the five his-
 225                                                                                                                                                                 279
        tograms associated with its corresponding superpixel. This
 226                                                                                                                                                                 280
        representation provides localization (over each superpixel)
 227                                                                                 Figure 3. Top Left: The clothing region carries information                     281
        and maintains some robustness to translation and scaling.
        The notation s p refers to the feature histograms associated                 about identity. Maps of mutual information between Sij and                      282
 229                                                                                  si(x,y) , sj(x,y) s for four image sets all yield a map with the same          283
        with the pth superpixel. Likewise, the notation s (x,y) refers
 230                                                                                 qualitative appearance. In Set 3, the mutual information reaches                284
        to the feature histograms associated with the superpixel that                0.17, while the entropy of Sij is only 0.19. Top Right: The mutual
 231                                                                                                                                                                 285
        corresponds to position (x, y).                                              information maps for person images captured on different days.
 232                                                                                                                                                                 286
            In the second (visual word) representation, the low-level                The overall magnitude is only about 7% the same-day mutual in-
 233                                                                                                                                                                 287
        feature vector at each pixel is quantized to the index of the                formation maps, but the clothing region (and the hair region) still
 234                                                                                                                                                                 288
        closest visual word [25], where there is a separate visual                   carry information about the identity of the person. Bottom Left:
 235                                                                                                                                                                 289
        word dictionary for color features and for texture features                  The clothing masks created from the mutual information masks all
 236                                                                                 have the same general appearance, though Set 1’s mask is noisy                  290
        (each with 350 visual words). The clothing region is rep-
 237                                                                                 probably due to the relatively small number of people in this set.              291
        resented by the histogram of the color visual words and
 238                                                                                 Bottom Right: The average of 714 hand-labeled clothing masks                    292
        the histogram of the texture visual words within the cloth-
 239                                                                                 appears similar to the mutual information masks.                                293
        ing mask region (described in Section 4). Of course, this
 240                                                                                                                                                                 294
        clothing mask is the putative region of clothing for the face;
 241                                                                                                                                                                 295
        the actual clothing in a particular person image may be oc-                  where u is an index over each of the five feature types (three
 242                                                                                                                                                                 296
        cluded by another object. The visual word clothing features
 243                                                                                 for color and two for texture).                                                 297
        are represented as v.
 244                                                                                    In the region surrounding the face, we compute the mu-                       298
 245                                                                                 tual information I(S ij , si(x,y) , sj(x,y) s ) between the dis-                299
        4. Finding the Global Clothing Mask                                          tance between corresponding superpixels, and S ij at each                       300
 247       In previous recognition work using clothing, either a                     (x, y) position in the person image. Maps of the mutual in-                     301
 248    rectangular region below the face is assumed to be clothing,                 formation are shown in Figure 3. For each image collection,                     302
 249    or the clothing region is modeled using operator-labeled                     two mutual information maps are found, one where p i and                        303
 250    clothing from many images [26]. We take the approach of                      pj are captured on the same day, and one otherwise.                             304
 251    learning the clothing region automatically, using only the                      Areas of the image associated with clothing contain a                        305
 252    identity of faces (from labeled ground-truth) and no other                   great deal of information regarding whether two people are                      306
 253    input from the user. Intuitively, the region associated with                 the same, given the images are captured on the same day.                        307
 254    clothing carries information about the identity of the face.                 Even for images captured on different days, the clothing                        308
 255    For example, in a sporting event, athletes wear numbers on                   region carries some information about identity similarity,                      309
 256    their uniforms so the referees can easily distinguish them.                  due to the fact that clothes are re-worn, or that a particular                  310
 257    Similarly, in a consumer image collection, when two peo-                     individual prefers a specific clothing style or color.                           311
 258    ple in different images wear the same clothing, the proba-                      In three image Sets (1, 2, and 4), the features of the face                  312
 259    bility increases that they might be the same individual. We                  region itself carry little information about identity. (Re-                     313
 260    discover the clothing region by finding pixel locations that                  member, these features are local histograms of color and                        314
 261    carry information about facial identity. Let p i = pj be the                 texture features not meant for recognizing faces). These                        315
 262    event Sij that the pair of person images p i and pj share an                 collections have little ethnic diversity so the tone of the fa-                 316
 263    identity, and s i(x,y) , sj(x,y) s be the distance between cor-              cial skin is not an indicator of identity. However, Set 3 is                    317
 264    responding superpixel features s i(x,y) and sj(x,y) at pixel                 ethnically more diverse, and the skin tone of the facial re-                    318
 265    position (x, y). The distance is the sum of χ 2 distances be-                gion carries some information related to identity.                              319
 266    tween the five feature histograms:                                               This mutual information analysis allows us to create a                       320
 267                                                                                 mask of the most informative pixels associated with a face                      321
 268             si(x,y) , sj(x,y)   s   =       χ2 (su         u
                                                      i(x,y) , sj(x,y) )   (1)       that we call the global clothing mask. The same-day mu-                         322
 269                                         u                                       tual information maps are reflected (symmetry is assumed),                       323

summed, and thresholded (by a value constant across the image collections) to yield clothing masks that appear remarkably similar across collections. We emphasize again that our global clothing mask is learned without using any manually labeled clothing regions; simply examining the image data and the person labels reveals that the region corresponding roughly to the torso contains information relevant to identity.

5. Graph Cuts for Clothing Segmentation

Single Image: The global clothing mask shows the location of clothing on average, but on any given image, the pose of the body or occlusion can make the clothing in that image difficult to localize. We use graph cuts to extract an image-specific clothing mask. Using the idea of GrabCut [20], we define a graph over the superpixels that comprise the image, where each edge in the graph corresponds to the cost of cutting the edge. We seek the binary labeling f over the superpixels that minimizes the energy of the cut. We use the standard graph cutting algorithms [3, 6, 7, 12] for solving for the minimum energy cut. Using the notation in [12], the energy is:

    E(f) = Σ_{p∈P} D_p(f_p) + Σ_{(p,q)∈N} V_{p,q}(f_p, f_q)    (2)

where E(f) is the energy of a particular labeling f, p and q are indexes over the superpixels, D_p(f_p) is the data cost of assigning the p-th superpixel to label f_p, and V_{p,q}(f_p, f_q) represents the smoothness cost of assigning superpixels p and q in a neighborhood N to respective labels f_p and f_q.

Possible labels for each superpixel are f_p ∈ {0, 1}, where the index 0 corresponds to foreground (i.e. the clothing region that is useful for recognition) and 1 corresponds to background. The clothing model M_0 is formed by computing the histogram over each of the five features over the region of the person image corresponding to clothing in the global clothing mask. In a similar manner, the background model M_1 is formed using the feature values of pixels from regions corresponding to the inverse of the clothing mask. Then, the data cost term in Eq. (2) is defined as:

    D_p(f_p) = exp(−α⟨s_p, M_{f_p}⟩)    (3)

where again the distance is the sum of the χ² distances for each of the corresponding five feature histograms. The smoothness cost term is defined as:

    V_{p,q}(f_p, f_q) = |f_p − f_q| exp(−β⟨s_p, s_q⟩)    (4)

Experimentally, we found that parameter values of α = 1 and β = 0.01 work well, though the results are not particularly sensitive to the chosen parameter values. The lower value of β is explained by considering that clothing is often occluded by other image objects, and is often not contiguous in the image. Figure 4 illustrates the graph cutting process for segmenting the clothing region. Except for the selection of a few constants, the algorithm essentially learns to segment clothing: first it finds a global clothing mask describing regions of the image with high mutual information with identity, then it performs a segmentation to refine the clothing mask on any particular image.

Multiple Images: When multiple images of the same person with the same clothing are available, there is an opportunity to learn a better model for the clothing. We use the idea from [21] that the background model for each image is independent, but the foreground model is constant across the multiple images. Then, the clothing model is computed with a contribution from each of the images:

    M_0 = Σ_i M_{0i}    (5)

This global clothing model M_0 is the sum, for each feature type, of the corresponding feature histograms from each image's individual clothing model. However, each image i has its own individual background model M_{1i}, formed from the feature values of the inverse global clothing mask. Conceptually, the clothing is expected to remain the same across many images, but the background can change drastically.

When applying graph cuts, a graph is created for each person image. The smoothness cost is defined as before in Eq. (4), but the data cost for person image i becomes:

    D_{pi}(f_{pi}) = exp(−α⟨s_{pi}, M_0⟩)     if f_{pi} = 0
                     exp(−α⟨s_{pi}, M_{1i}⟩)   if f_{pi} = 1    (6)

Figure 5 shows several examples of graph cuts for clothing segmentation, either treating each image independently or exploiting the consistency of the clothing appearance across multiple images to segment each image in the group.

6. Recognizing people

For searching and browsing images in a consumer image collection, we describe the following scenario. At first, none of the people in the image collection are labeled, though we do make the simplifying assumption that the number of individuals is known. A user provides the labels for a randomly selected subset of the person images in the collection. The task is to recognize all the remaining people, and the performance measure is the number of correctly recognized people. This measure corresponds to the usefulness of the algorithm in allowing a user to search and browse the image collection after investing the time to label a portion of the people. We use an example-based nearest neighbor classifier for recognizing people in this scenario.
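The superpixel distance of Eq. (1) and the segmentation energy of Eqs. (2)–(4) can be made concrete at toy scale. The sketch below is illustrative rather than the paper's implementation: the function names are invented, the half-weighted χ² convention is an assumption, and exhaustive search over binary labelings stands in for the max-flow/min-cut solvers of [3, 6, 7, 12], which is feasible only for a handful of superpixels.

```python
import itertools
import numpy as np

ALPHA, BETA = 1.0, 0.01  # parameter values reported above

def chi2(h1, h2, eps=1e-10):
    # Chi-squared distance between two histograms; eps guards
    # against empty bins. The 1/2 weighting is one common convention.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def feature_distance(feats_a, feats_b):
    # Eq. (1): sum of chi-squared distances over the corresponding
    # feature histograms (five per superpixel in the paper).
    return sum(chi2(ha, hb) for ha, hb in zip(feats_a, feats_b))

def labeling_energy(f, d_cloth, d_bg, edges, d_edge):
    # Eq. (2) = data term (Eq. 3) + smoothness term (Eq. 4).
    # d_cloth[p], d_bg[p]: distance of superpixel p to models M0, M1;
    # d_edge[k]: feature distance across neighboring pair edges[k].
    data = sum(np.exp(-ALPHA * (d_cloth[p] if f[p] == 0 else d_bg[p]))
               for p in range(len(f)))
    smooth = sum(abs(f[p] - f[q]) * np.exp(-BETA * d)
                 for (p, q), d in zip(edges, d_edge))
    return data + smooth

def best_labeling(d_cloth, d_bg, edges, d_edge):
    # Exhaustive minimization over binary labelings; only feasible
    # at toy scale, where it returns a minimum-energy cut.
    n = len(d_cloth)
    return min(itertools.product((0, 1), repeat=n),
               key=lambda f: labeling_energy(f, d_cloth, d_bg, edges, d_edge))
```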

[Figure 4: panels (A)–(F).]

Figure 4. Using graph cuts to segment the clothing from a person image. The automatically learned global clothing mask (B) is used to create a clothing model (C, top) and a background model (C, bottom) that each describe the five feature types from the person image (A). Each superpixel is a node in a graph, and the data costs of assigning each superpixel to the clothing and to the background are shown in (D, top) and (D, bottom), respectively, with light shades indicating high cost. The smoothness cost is shown in (E), with thicker, yellower edges indicating higher cost. The graph cut solution for the clothing is shown in (F).

[Figure 5: panels (A)–(H).]

Figure 5. See Section 5. For each group of person images, the top row shows the resized person images, the middle row shows the result of applying graph cuts to segment clothing on each person image individually, and the bottom row shows the result of segmenting the clothing using the entire group of images. Oftentimes, the group graph cut learns a better model for the clothing, and is able to segment out occlusions (A, C, F, H) and adapt to difficult poses (E, G). We do not explicitly exclude flesh, so some flesh remains in the clothing masks (B, G, H).

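The group model behind the bottom rows of Figure 5 (Eqs. (5) and (6)) is straightforward to express. The following is a minimal sketch with invented function names, assuming each image's clothing model is stored as an array of per-feature histograms:

```python
import numpy as np

def group_clothing_model(per_image_models):
    # Eq. (5): the shared foreground model M0 is the per-feature sum
    # of each image's individual clothing histograms.
    # Input shape: (n_images, n_features, n_bins).
    return np.sum(per_image_models, axis=0)

def multi_image_data_cost(d_to_M0, d_to_M1i, f, alpha=1.0):
    # Eq. (6): a superpixel of image i is scored against the shared
    # clothing model M0 when f = 0, and against that image's own
    # background model M1i when f = 1.
    return np.exp(-alpha * (d_to_M0 if f == 0 else d_to_M1i))
```

The asymmetry mirrors the text: one clothing model is pooled across the group, while each image keeps its own background model.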
[Figure 6: two plots. Left: "Clothing function", with curves for Same Day and Different Day; x-axis: Distance in Clothing Feature Space; y-axis: P(S_ij | ⟨v_i, v_j⟩). Right: "Face Similarity Function"; x-axis: Distance Between Faces; y-axis: P(p_i = p_j | D(f_i, f_j)).]

Figure 6. Left: The probability that two person images share a common identity, given the distance between the clothing features and the time interval between the images. Right: In a similar fashion, the probability of two person images sharing a common identity, given the distance between faces f^f_i and f^f_j.

Given an unlabeled person p with features f = {f^f, v}, comprising the facial features f^f and the clothing features v, the probability P(p = n|f) that the name assigned to person p is n can be estimated using the nearest neighbor algorithm. In our notation, the name set N comprises the names of the U unique individuals in the image collection. An element n_k ∈ N is a particular name in the set.

The K nearest labeled neighbors of a person p_i are selected from the collection using facial similarity and clothing similarity. When finding the nearest neighbors to a query person with features f, both the facial and clothing features are considered using the measure P_ij, the posterior probability that two person images p_i and p_j are the same individual. We propose the measure of similarity P_ij between two person images, where:

    P_ij = P(S_ij | f_i, f_j, t_i, t_j)    (7)

We justify the similarity metric P_ij based on our observations of how humans perform recognition by combining multi-modal features to judge the similarity between faces. If we see two person images with identical clothing from the same day, we think they are likely the same person, even if the images have such different facial expressions that a judgement on the faces is difficult. Likewise, if we have high confidence that the faces are similar, we are not dissuaded by seeing that the clothing is different (the person may have put on a sweater, we reason).

Using the metric P_ij, a nearest neighbor is one that is similar in either facial appearance or in clothing appearance. These K nearest neighbors are used to estimate P(p = n|f) using a weighted density estimate, which can in turn be used to recognize the face according to:

    p_MAP = arg max_{n∈N} P(p = n|f)    (9)

When multiple people are in an image, there is an additional constraint, called the unique object constraint: no person can appear more than once in an image [5, 26]. We seek the assignment of names to people that maximizes P(p = n|F), the posterior of the names for all people in the image, assuming that any group of persons is equally likely. The set of M people in the image is denoted p, F is the set of all the features f for all people in the image, and n is a subset of N with M elements and is a particular assignment of a name to each person in p. Although there are (U choose M) combinations of names to people, this problem is solved in O(M³) time using the Munkres algorithm [17].
                                                                                                                                            7. Experiments
 572                                                                v     f                                                                                                                                    626
                                                           ≈ max [Pij , Pij ]                                                     (8)
 573                                                                                                                                           Better Recognition Improves Clothing Segmentation:              627
 574    The posterior probability P ij = P (Sij | vi , vj v , |ti − tj |)                                                                   The following experiment was performed to evaluate the             628
 575    that two person images p i and pj are the same individual is                                                                        performance of the graph-cut clothing segmentation. In our         629
 576    dependent both on the distance between the clothing fea-                                                                            Sets 1 and 4, every superpixel of every person image was           630
 577    tures vi , vj v using the visual word representation, and                                                                           manually labeled as either clothing or not clothing. This          631
 578    also on the time difference |t i − tj | between the image cap-                                                                      task was difficult, not just due to the sheer number of su-         632
 579    tures. The distance between the clothing features v i , vj v                                                                        perpixels (35700 superpixels), but because of the inherent         633
 580    for two person images p i and pj is simply the sum of the                                                                           ambiguity of the problem. For our person images, we la-            634
 581    χ2 distances between the texture and the color visual word                                                                          beled as clothing any covering of the torso and legs. Un-          635
 582    histograms, similar to the superpixel distance in Eq. (1).                                                                          covered arms were not considered to be clothing, and head          636
 583    The probability P ij is approximated as a function of the                                                                           coverings such as hats and glasses were also excluded.             637
 584    distance vi , vj v , learned from a non-test image collection                                                                          We apply our clothing segmentation to each person im-           638
 585    for same-day and different-day pairs of person images with                                                                          age in both collections. Table 2 reports the accuracy of           639
 586    the same identity, and pairs with different identities. Fig-                                                                        the clothing segmentation. We compare the graph cut seg-           640
 587    ure 6 shows the maximum likelihood estimate of P ij . The                                                                           mentation against the prior (roughly 70% of the superpix-          641
 588    posterior is fit with a decaying exponential, one model for                                                                                                       ı
                                                                                                                                            els are not clothing). A na¨ve segmentation is to find the          642
 589    person images captured on the same day, and one model                                                                               mean value of the clothing mask corresponding to the re-           643
 590    for person images captured on different days. Similarly, the                                                                        gion covered by each superpixel, then classify as clothing if      644
 591    probability Pij , the probability that faces i and j are the                                                                        this value surpasses a threshold. The threshold was selected       645
 592    same person, is modeled using a decaying exponential as                                                                             by minimizing the equal error rate. This method consid-            646
 593    well.                                                                                                                               ers only the position of each superpixel and not its feature       647
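The name-assignment step under the unique object constraint can be illustrated with a small sketch (the posterior values and names below are hypothetical, and the brute-force enumeration is for clarity only; the Munkres algorithm finds the same maximizer of the joint posterior in O(M³) time):

```python
from itertools import permutations

def assign_names(posteriors, names):
    """Find the one-to-one assignment of names to the M people in an
    image that maximizes the joint posterior, under the unique object
    constraint that no name is used twice.

    posteriors[i][j] = P(person i has name names[j])."""
    M = len(posteriors)
    best_score, best = -1.0, None
    # Enumerate every ordered choice of M distinct names.
    for perm in permutations(range(len(names)), M):
        score = 1.0
        for i, j in enumerate(perm):
            score *= posteriors[i][j]
        if score > best_score:
            best_score, best = score, [names[j] for j in perm]
    return best

# Hypothetical posteriors for 2 people over 3 candidate names.
P = [[0.9, 0.5, 0.1],
     [0.8, 0.3, 0.2]]
print(assign_names(P, ["Ann", "Bob", "Cal"]))  # -> ['Bob', 'Ann']
```

Note that both people individually favor "Ann" (posteriors 0.9 and 0.8), but the unique object constraint forces the jointly optimal assignment ['Bob', 'Ann']; a greedy per-person labeling would use "Ann" twice.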


                 Set 1    Set 4
  Prior          70.7%    68.2%
  Naïve          77.2%    84.2%
  GC Individual  87.6%    88.5%
  GC Group       88.5%    90.3%

Table 2. Graph cuts provides effective clothing recognition. For each of two image collections, the accuracy of classifying superpixels as either clothing or non-clothing with four different algorithms is shown. Using graph cuts for groups of images proves to be the most effective method.

[Figure 7: recognition accuracy vs. number of labeled examples for Sets 1-4, comparing Face, GC Clothing, and Face + GC Clothing features.]

Figure 7. Combining facial and clothing features results in better recognition accuracy than using either feature independently.

In both collections, using the graph cut clothing segmentation provides a substantial improvement over the naïve approach.

Further improvement is achieved when the person images are considered in groups. For this experiment, we assume the ground truth for identity is known, and a group includes all appearances of an individual within a 20-minute time window, nearly ensuring that the clothing has not been changed for each individual.

Better Clothing Segmentation Improves Recognition: The following experiment is performed to simulate the effect on recognition of labeling faces in an image collection. Person images are labeled in a random order, and the identity of all remaining unlabeled faces is inferred by the nearest-neighbor classifier from Section 6. Each classification is compared against the true label to determine the recognition accuracy. We use nine nearest neighbors and repeat the random labeling procedure 50 times to find the average performance. The goal of these experiments is to show the influence of clothing segmentation on recognition.

Figure 7 shows the results of the person recognition experiments. The combination of face and clothing features improves recognition in all of our test sets. If only a single feature type is to be used, the preferred feature depends on the image collection. For this experiment, the clothing features are extracted from the clothing mask determined by graph cuts on each image individually.

Figure 8 compares the performance of recognizing people using only clothing features. For all of our collections, the graph cut clothing masks outperform using only a box (shown in Figure 9). Also, for each collection, the clothing masks generated by group segmentation consistently lead to better recognition performance. Finally, we show for Sets 1 and 4, where ground-truth labeled clothing masks exist, that the best performance is achieved using the ground truth clothing masks. This represents the maximum possible recognition accuracy that our system could achieve if the clothing segmentation were perfect.

To summarize, these experiments show that:

• Multiple images of the same person can be used to improve clothing segmentation.

• Person recognition improves with improvements to the clothing segmentation.

Ongoing work includes merging the recognition and clothing segmentation into a single framework where each assists the other in the following fashion: based on a labeled subset of people, the other people in the collection are recognized. Then, based on these putative identities, new clothing masks are found using multiple images of the same person within a given time window.

8. Publicly Available Dataset

One persistent problem for researchers dealing with personal image collections is the lack of standard datasets. As a result, each research group uses its own datasets, and results are difficult to compare. We have made our image Set 2 of 931 labeled people available to the research community [2]. The dataset is described in Table 1, and contains the original JPEG captures with all associated EXIF information, as well as text files containing the identity of all labeled individuals. We hope this dataset provides a valuable common ground for the research community.

9. Conclusion

In this paper, we describe the advantages of performing clothing segmentation with graph cuts in a consumer image collection. We showed a data-driven (rather than human-model-driven) approach for finding a global clothing mask that shows the typical location of clothing in person images. Using this global clothing mask, a clothing mask for each person image is found using graph cuts. Further clothing segmentation improvement is attained by using multiple images of the same person, which allows us to construct a better clothing model.
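As a concrete illustration of the evaluation protocol of Section 7 (random labeling order, nine nearest neighbors, accuracy averaged over repeated trials), the following is a minimal sketch; a one-dimensional scalar stands in for the combined face and clothing features, and the identities and feature values are invented:

```python
import random

def avg_recognition_accuracy(features, labels, num_labeled, k=9, trials=50):
    """Label person images in a random order, classify the remaining
    images by a k-nearest-neighbor vote among the labeled ones, and
    average the accuracy over several random orders."""
    n = len(labels)
    total = 0.0
    for _ in range(trials):
        order = list(range(n))
        random.shuffle(order)
        labeled, queries = order[:num_labeled], order[num_labeled:]
        correct = 0
        for q in queries:
            # k nearest labeled neighbors by feature distance
            nearest = sorted(labeled,
                             key=lambda i: abs(features[i] - features[q]))[:k]
            votes = {}
            for i in nearest:
                votes[labels[i]] = votes.get(labels[i], 0) + 1
            correct += max(votes, key=votes.get) == labels[q]
        total += correct / len(queries)
    return total / trials

# Two well-separated hypothetical identities: accuracy should approach 1.0.
random.seed(0)
feats = [0.01 * i for i in range(50)] + [10 + 0.01 * i for i in range(50)]
ids = ["alice"] * 50 + ["bob"] * 50
print(avg_recognition_accuracy(feats, ids, num_labeled=20, k=9))
```

In the real system the scalar distance would be replaced by the similarity measure Pij of Eqs. (7)-(8), combining face and clothing posteriors.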


 756                                               Set 1                                                                 Set 2                                                                                                         810
                                   0.9                                                                0.7                                              [5] T. Berg, A. Berg, J. Edwards, M. Maire, R. White, Y.-W. Teh,
 757                                                                                                 0.65                                                  E. Learned-Miller, and D. Forsyth. Names and faces in the news.             811
            Recognition Accuracy


                                                                            Recognition Accuracy
 758                                                                                                  0.6                                                  In Proc. CVPR, 2004.                                                        812
                                   0.7                                                               0.55
 759                                                                                                  0.5
                                                                                                                                                       [6] Y. Boykov and V. Kolmogorov. An experimental comparison of min-             813
                                   0.6                                                                                                                     cut/max-flow algorithms for energy minimization in vision. PAMI,
 760                                                        Box
                                   0.5                      Graph Cut                                 0.4                           Box
 761                                                        Group GC                                 0.35                           Graph Cut
                                                                                                                                                       [7] Y. Boykov, O. Veksler, and R. Zabih. Efficient approximate energy            815
                                                            Ground Truth                                                            Group GC
 762                                  0      100          200         300                                  0   200   400      600    800   1000            minimization via graph cuts. PAMI, 2001.                                    816
                                          Number Labeled Examples                                               Number Labeled Examples
 763                                               Set 3                                                                 Set 4                         [8] H. Chen, Z. J. Xu, Z. Q. Liu, and S. C. Zhu. Composite templates            817
                                    1                                                                  1
 764                                                                                                                                                       for cloth modeling and sketching. In Proc. CVPR, 2006.                      818
                                                                              Recognition Accuracy   0.9
                                                                                                                                                       [9] I. Cohen, A. Garg, and T. Huang. Vision-based overhead view person
            Recognition Accuracy

[Figure 8: two plots of recognition accuracy versus number of labeled examples, comparing the Box, Graph Cut, Group GC, and Ground Truth segmentation methods]
Figure 8. Using graph cuts for the extraction of clothing features improves the accuracy of recognizing people over using a simple box region. Further improvement is attained by using multiple person images when performing clothing segmentation. Sets 1 and 4 demonstrate even more room for improvement when ground-truth clothing segmentation is used for feature extraction.
Figure 9. Given an image (left), using the clothing features from a graph cut clothing mask (right) results in superior recognition to using a box (middle).
    This work can be viewed as a case study for the merits of combining segmentation and recognition. Improvements in clothing segmentation improve person recognition in consumer image collections. Likewise, using multiple images of the same person improves the results of clothing segmentation. We are working on the next steps by merging the clothing segmentation and person recognition into a framework where each assists the other, and anticipate applying these concepts to other computer vision problems as well.
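The group-segmentation idea can be sketched as a single energy minimization: unary costs come from clothing and background appearance models pooled over multiple images of the same person, plus a pairwise smoothness term over neighboring pixels. The sketch below is a toy illustration under stated assumptions: all function names are our own, intensities stand in for color/texture features, and exhaustive search on a tiny strip replaces the max-flow/min-cut solver a real implementation would use to minimize the same energy.

```python
from itertools import product
import math

def pooled_histogram(images, bins=4, vmax=256):
    """Pool intensity counts over several images of the same person
    (the 'Group GC' idea), with Laplace smoothing."""
    counts = [1.0] * bins
    for img in images:
        for v in img:
            counts[v * bins // vmax] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def unary_cost(v, hist, bins=4, vmax=256):
    """Negative log-likelihood of pixel value v under a histogram model."""
    return -math.log(hist[v * bins // vmax])

def segment(pixels, edges, fg_hist, bg_hist, lam=1.0):
    """Minimize sum of unary costs plus lam * label disagreements on edges.
    Exhaustive search stands in for a graph-cut solver on this toy grid."""
    best_labels, best_energy = None, float("inf")
    for labels in product((0, 1), repeat=len(pixels)):
        e = sum(unary_cost(v, fg_hist if l else bg_hist)
                for v, l in zip(pixels, labels))
        e += lam * sum(labels[p] != labels[q] for p, q in edges)
        if e < best_energy:
            best_labels, best_energy = labels, e
    return list(best_labels)

# Clothing samples pooled from two images of the same person (dark clothing);
# background samples are bright.
fg_hist = pooled_histogram([[10, 20, 30], [15, 25, 35]])
bg_hist = pooled_histogram([[220, 230, 240], [210, 250, 225]])

# A 1x6 "image": dark clothing pixels on the left, bright background on the right.
pixels = [20, 30, 25, 230, 240, 235]
edges = [(i, i + 1) for i in range(5)]  # neighbor links on a 1-D strip
labels = segment(pixels, edges, fg_hist, bg_hist, lam=0.5)
print(labels)  # clothing pixels labeled 1, background labeled 0
```

Because this energy is submodular, the same formulation scales to full images when the exhaustive search is replaced by a max-flow solver.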