A Sparse Texture Representation Using Affine-Invariant Regions

Learning Local Affine Representations for Texture and Object Recognition

                   Svetlana Lazebnik
Beckman Institute, University of Illinois at Urbana-Champaign
      (joint work with Cordelia Schmid, Jean Ponce)
• Goal:
  – Recognition of 3D textured surfaces, object classes
• Our contribution:
  – Texture and object representations based on
    local affine regions
• Advantages of proposed approach:
  – Distinctive, repeatable primitives
  – Robustness to clutter and occlusion
  – Ability to approximate 3D geometric transformations
                        The Scope
1. Recognition of single-texture images (CVPR 2003)

2. Recognition of individual texture regions in multi-texture
   images (ICCV 2003)

3. Recognition of object classes (BMVC 2004, work in progress)
1. Recognition of Single-Texture Images
        Affine Region Detectors
                   Harris detector (H)   Laplacian detector (L)

Mikolajczyk & Schmid (2002), Gårding & Lindeberg (1996)
          Affine Rectification Process
Patch 1                                           Patch 2

            Rectified patches (rotational ambiguity)
     Rotation-Invariant Descriptors 1:
               Spin Images
• Based on range spin images (Johnson & Hebert 1998)
• Two-dimensional histogram:
     distance from center × intensity value
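
To make the two histogram axes concrete, here is a minimal sketch of an intensity-domain spin image. It assumes a square, already rectified patch stored as a list of rows with intensities in [0, 1], and it uses hard binning; the function name `spin_image` and the bin counts are illustrative choices, not values taken from the talk.

```python
import math

def spin_image(patch, n_dist=5, n_int=5):
    """2D histogram over (distance from center, intensity) for a square patch.

    patch: list of rows with intensities in [0, 1] (an assumption of this sketch).
    Returns an n_dist x n_int histogram normalized to sum to 1.
    """
    size = len(patch)
    center = (size - 1) / 2.0
    radius = size / 2.0
    hist = [[0.0] * n_int for _ in range(n_dist)]
    for y in range(size):
        for x in range(size):
            d = math.hypot(x - center, y - center)
            if d >= radius:
                continue  # skip corners outside the inscribed circle
            d_bin = min(int(d / radius * n_dist), n_dist - 1)
            i_bin = min(int(patch[y][x] * n_int), n_int - 1)
            hist[d_bin][i_bin] += 1.0
    total = sum(map(sum, hist))
    return [[v / total for v in row] for row in hist]
```

Because each pixel contributes only through its distance from the center and its intensity, rotating the patch about its center leaves the histogram unchanged, which is what resolves the rotational ambiguity left by affine rectification.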
 Rotation-Invariant Descriptors 2: RIFT
• Based on SIFT (Lowe 1999)
• Two-dimensional histogram:
      distance from center × gradient orientation
• Gradient orientation is measured relative to the direction
  pointing outward from the center of the patch
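
The following sketch illustrates the idea of binning gradient orientation relative to the outward radial direction. It is an illustration under stated assumptions, not the talk's implementation: gradients come from central differences, votes are weighted by gradient magnitude, binning is hard, and the name `rift` and the default of 4 rings by 8 orientations are choices made for this example.

```python
import math

def rift(patch, n_rings=4, n_orient=8):
    """Histogram over (ring index, gradient orientation relative to outward direction).

    patch: square list of rows of intensities (an assumption of this sketch).
    """
    size = len(patch)
    center = (size - 1) / 2.0
    radius = size / 2.0
    hist = [[0.0] * n_orient for _ in range(n_rings)]
    for y in range(1, size - 1):
        for x in range(1, size - 1):
            dx, dy = x - center, y - center
            d = math.hypot(dx, dy)
            if d >= radius or d == 0.0:
                continue
            # Central-difference gradient at (x, y).
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Angle of the gradient measured from the outward radial direction.
            rel = (math.atan2(gy, gx) - math.atan2(dy, dx)) % (2 * math.pi)
            ring = min(int(d / radius * n_rings), n_rings - 1)
            o_bin = min(int(rel / (2 * math.pi) * n_orient), n_orient - 1)
            hist[ring][o_bin] += mag  # magnitude-weighted vote
    total = sum(map(sum, hist))
    return [[v / total for v in row] for row in hist] if total > 0 else hist
```

Measuring the angle against the radial direction, rather than against the image axes, is what makes the descriptor independent of the patch's in-plane rotation.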
             Signatures and EMD
• Signatures
     S = {(m_1, w_1), …, (m_k, w_k)}

         m_i: cluster center
         w_i: relative weight

• Earth Mover’s Distance (Rubner et al. 1998)
  – Computed from ground distances d(m_i, m'_j)
  – Can compare signatures of different sizes
  – Insensitive to the number of clusters
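
For reference, the standard definition of the Earth Mover's Distance between two signatures S = {(m_i, w_i)} with k clusters and S' = {(m'_j, w'_j)} with k' clusters can be written as below. The flow variables f_ij are notation introduced here; they do not appear on the slide.

```latex
\mathrm{EMD}(S, S') =
  \frac{\sum_{i=1}^{k} \sum_{j=1}^{k'} f_{ij}\, d(m_i, m'_j)}
       {\sum_{i=1}^{k} \sum_{j=1}^{k'} f_{ij}}
\quad \text{where } (f_{ij}) \text{ minimizes } \sum_{i,j} f_{ij}\, d(m_i, m'_j)
\text{ subject to}

f_{ij} \ge 0, \qquad
\sum_{j=1}^{k'} f_{ij} \le w_i, \qquad
\sum_{i=1}^{k} f_{ij} \le w'_j, \qquad
\sum_{i,j} f_{ij} = \min\Bigl(\sum_{i} w_i,\ \sum_{j} w'_j\Bigr)
```

Since the constraints involve only the weights and the ground distances, the two signatures may have different numbers of clusters.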
         Database: Textured Surfaces

25 textures, 40 sample images each (640x480)
• Channels: HS, HR, LS, LR
  – Combined through addition of EMD matrices
• Classification results
  – 10 training images per class, rates averaged over
    200 random training subsets
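
As a small illustration of combining channels by adding EMD matrices, the sketch below sums per-channel distance matrices (rows are test images, columns are training images) and labels each test image by its closest training image. The nearest-neighbor rule and the function names are assumptions of this example; the slide states only that the matrices are added.

```python
def combine_channels(matrices):
    """Element-wise sum of equally sized distance matrices, one per channel."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) for c in range(cols)]
            for r in range(rows)]

def nearest_neighbor_labels(dist, train_labels):
    """Label each test row with the class of its closest training column."""
    return [train_labels[min(range(len(row)), key=row.__getitem__)]
            for row in dist]

# Toy distances for two channels (values are made up for illustration).
hs = [[0.2, 0.9], [0.8, 0.3]]
lr = [[0.5, 0.4], [0.6, 0.1]]
combined = combine_channels([hs, lr])
predicted = nearest_neighbor_labels(combined, ["brick", "wood"])
```

In this toy case the LR channel alone would assign the first test image to the second training image, while the summed matrix assigns it to the first, which shows how one channel can correct another.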
                Comparative Evaluation
                         Our method                 Varma & Zisserman

Spatial selection        Harris and Laplacian       None (every pixel
                         detectors                  location is used)

Neighborhood shape       Affine adaptation          None (support of
selection                                           descriptors is fixed)

Descriptors              Spin images, RIFT          Raw pixel values

Textons                  Separate set of textons    Universal texton
                         for each image             dictionary

Representing/comparing   Signatures/EMD             Histograms/
texton distributions                                chi-squared distance
              Results of Evaluation:
Classification rate vs. number of training samples

[Plots, one per method: (H+L)(S+R), VZ-Joint, VZ-MRF]

• Conclusion: an intrinsically invariant representation is
  necessary to deal with intra-class variations when they are
  not adequately represented in the training set
• A sparse texture representation based on local
  affine regions
• Two novel descriptors (spin images, RIFT)
• Successful recognition in the presence of viewpoint
  changes, non-rigidity, non-homogeneity
• A flexible approach to invariance
 2. Recognition of Individual Regions in
         Multi-Texture Images
• A two-layer architecture:
  – Local appearance + neighborhood relations
• Learning:
  – Represent the local appearance of each texture class
    using a mixture-of-Gaussians model
  – Compute co-occurrence statistics of sub-class labels over
    affinely adapted neighborhoods
• Recognition:
  – Obtain initial class membership probabilities from the
    generative model
  – Use relaxation to refine these probabilities
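
A minimal sketch of how initial membership probabilities could be read off a mixture-of-Gaussians model via Bayes' rule. It assumes spherical covariances, one uniquely labeled mixture component per sub-class, and a hypothetical `(label, prior, mean, variance)` tuple layout; none of these details are given on the slide.

```python
import math

def subclass_posteriors(x, components):
    """Posterior probability of each mixture component given descriptor x.

    components: list of (label, prior, mean, variance) tuples, each describing
    a spherical Gaussian (a simplifying assumption of this sketch).
    """
    log_scores = []
    for label, prior, mean, var in components:
        sq = sum((a - b) ** 2 for a, b in zip(x, mean))
        log_pdf = -0.5 * (sq / var + len(x) * math.log(2 * math.pi * var))
        log_scores.append((label, math.log(prior) + log_pdf))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(s for _, s in log_scores)
    weights = [(label, math.exp(s - top)) for label, s in log_scores]
    total = sum(w for _, w in weights)
    return {label: w / total for label, w in weights}

posteriors = subclass_posteriors(
    [0.5, 0.2],
    [("brick_0", 0.5, [0.0, 0.0], 1.0), ("wood_0", 0.5, [4.0, 4.0], 1.0)],
)
```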
          Two Learning Scenarios
• Fully supervised: every region in the training image
  is labeled with its texture class


• Weakly supervised: each training image is labeled
  with the classes occurring in it

                    brick, marble, carpet
   Neighborhood Statistics

                          • probability p(c,c')
                          • correlation r(c,c')
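
The sketch below shows one way to estimate both statistics from labeled neighbor pairs. It assumes that r(c,c') is the usual correlation coefficient between the indicator variables "region has label c" and "neighbor has label c'", with marginals taken from the same pair counts; the function name and input layout are hypothetical.

```python
import math
from collections import Counter

def neighborhood_statistics(pairs):
    """pairs: list of (c, c_prime), the labels of a region and one of its neighbors.

    Returns (joint, corr): dicts keyed by (c, c_prime).
    """
    n = len(pairs)
    joint = {k: v / n for k, v in Counter(pairs).items()}
    first = {k: v / n for k, v in Counter(c for c, _ in pairs).items()}
    second = {k: v / n for k, v in Counter(c2 for _, c2 in pairs).items()}
    corr = {}
    for c, pc in first.items():
        for c2, pc2 in second.items():
            # Product of the variances of the two indicator variables.
            denom = math.sqrt((pc - pc * pc) * (pc2 - pc2 * pc2))
            cov = joint.get((c, c2), 0.0) - pc * pc2
            corr[(c, c2)] = cov / denom if denom > 0 else 0.0
    return joint, corr

joint, corr = neighborhood_statistics([("a", "a")] * 4 + [("b", "b")] * 4)
```

Labels that always co-occur get a correlation of 1, and labels that never co-occur get a negative value, which is what later lets neighbors reinforce or suppress each other.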

Neighborhood definition
    Relaxation (Rosenfeld et al. 1976)
• Iterative process:
  – Initialized with posterior probabilities p(c|x_i) obtained from
    the generative model
  – For each region i and each sub-class label c, update the
    probability p_i(c) based on neighbor probabilities p_j(c') and
    correlations r(c,c')

• Shortcomings:
  – No formal guarantee of convergence
  – After the initialization, the updates to the probability values
    do not depend on the image data
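
The iterative process can be sketched with the classic nonlinear relaxation-labeling update, in which each probability is scaled by one plus the average support from its neighbors and then renormalized. Uniform neighbor weights, synchronous updates, and the data layout are assumptions of this example; the slide does not specify them.

```python
def relax(probs, neighbors, corr, n_iter=10):
    """Refine per-region label probabilities using neighbor support.

    probs: list of dicts label -> probability, one per region.
    neighbors: list of lists of neighbor indices, one per region.
    corr: dict (c, c_prime) -> correlation in [-1, 1].
    """
    for _ in range(n_iter):
        updated = []
        for i, p_i in enumerate(probs):
            nbrs = neighbors[i]
            new_p = {}
            for c, p in p_i.items():
                # Average support for label c from the neighbors' current beliefs.
                q = 0.0
                if nbrs:
                    q = sum(corr.get((c, c2), 0.0) * p2
                            for j in nbrs
                            for c2, p2 in probs[j].items()) / len(nbrs)
                new_p[c] = p * (1.0 + q)
            total = sum(new_p.values())
            updated.append({c: v / total for c, v in new_p.items()}
                           if total > 0 else dict(p_i))
        probs = updated  # synchronous update: all regions use the old values
    return probs

corr = {("a", "a"): 0.5, ("b", "b"): 0.5, ("a", "b"): -0.5, ("b", "a"): -0.5}
start = [{"a": 0.9, "b": 0.1}, {"a": 0.4, "b": 0.6}, {"a": 0.9, "b": 0.1}]
result = relax(start, [[1], [0, 2], [1]], corr)
```

In this toy chain the middle region starts out favoring label "b" but is pulled toward "a" by its two confident neighbors. The loop never consults the image data after initialization, which is the second shortcoming listed above.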
       Experiment 1: 3D Textured Surfaces
                                       Single-texture images
   T1 (brick)      T2 (carpet)       T3 (chair)     T4 (floor 1)     T5 (floor 2)    T6 (marble)      T7 (wood)

                                       Multi-texture images

10 single-texture training images per class, 13 two-texture training images, 45 multi-texture test images
Effect of Relaxation on Labeling
                    Original image

     Top: before relaxation, bottom: after relaxation
                      (single-texture training images)

T1 (brick)                  T2 (carpet)                 T3 (chair)               T4 (floor 1)

             T5 (floor 2)                 T6 (marble)                T7 (wood)
Successful Segmentation Examples
Unsuccessful Segmentation Examples
               Experiment 2: Animals

   cheetah, background   zebra, background   giraffe, background

• No manual segmentation
• Training data: 10 sample images per class
• Test data: 20 samples per class + 20 negative
Cheetah Results
Zebra Results
Giraffe Results
• A two-level representation (local appearance +
  neighborhood relations)
• Weakly supervised learning of texture models

                 Future Work
• Design an improved representation using a random
  field framework, e.g., conditional random fields
  (Lafferty et al. 2001, Kumar & Hebert 2003)
• Develop a procedure for weakly supervised
  learning of random field parameters
• Apply method to recognition of natural
  texture categories
    3. Recognition of Object Classes

The approach:
• Represent objects using multiple composite
  semi-local affine parts
  – More expressive than individual regions
  – Not globally rigid
• Correspondence search is key to learning and
              Correspondence Search
• Basic operation: a two-image matching procedure for finding
  collections of affine regions that can be mapped onto each
  other using a single affine transformation


• Implementation: greedy search based on geometric and
  photometric consistency constraints
   – Returns multiple correspondence hypotheses
   – Automatically determines number of regions in correspondence
   – Works on unsegmented, cluttered images (weakly supervised learning)
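
To illustrate the geometric side of the consistency check, the sketch below fits an affine transformation to three seed correspondences and keeps every candidate match that the transformation maps to within a tolerance. It is a simplification under stated assumptions: it works on region centers only, ignores photometric constraints, and does not perform the greedy search itself. The function names and the tolerance are hypothetical.

```python
import math

def affine_from_three(src, dst):
    """Fit x' = a*x + b*y + tx, y' = c*x + d*y + ty to three correspondences."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)
    if abs(det) < 1e-12:
        raise ValueError("source points are collinear")

    def solve(v1, v2, v3):
        # Cramer's rule for one output coordinate.
        a = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        b = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        t = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return a, b, t

    return solve(*(p[0] for p in dst)), solve(*(p[1] for p in dst))

def consistent_matches(matches, seed, tol=1.0):
    """Indices of matches ((x, y), (u, v)) explained by the seed's transform."""
    (a, b, tx), (c, d, ty) = affine_from_three(
        [matches[i][0] for i in seed], [matches[i][1] for i in seed])
    keep = []
    for idx, ((x, y), (u, v)) in enumerate(matches):
        err = math.hypot(a * x + b * y + tx - u, c * x + d * y + ty - v)
        if err <= tol:
            keep.append(idx)
    return keep

matches = [((0, 0), (3, 1)), ((1, 0), (5, 0)), ((0, 1), (4, 2)),
           ((2, 3), (10, 2)), ((5, 5), (0, 0))]
inliers = consistent_matches(matches, (0, 1, 2))
```

Here the fourth match agrees with the transformation defined by the first three and is kept, while the fifth is rejected. The size of the returned group is not fixed in advance.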
Matching: 3D Objects

     closeup           closeup
Matching: Faces

           spurious match ???
Finding Symmetries
Finding Repeated Patterns and
Learning Object Models for Recognition
• Match multiple pairs of training images to produce a
  set of candidate parts
• Use additional validation images to evaluate
  repeatability of parts and individual regions
• Retain a fixed number of parts having the best
  repeatability score
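
The selection step can be sketched as a simple ranking. This example assumes each candidate part records how many of its regions were re-detected during validation and its total size, and that the score is the ratio of the two; the dictionary keys and the numbers are hypothetical.

```python
def select_parts(parts, keep=10):
    """Return the names of the `keep` parts with the highest repeatability.

    parts: list of dicts with hypothetical keys 'name', 'detected', 'size'.
    """
    scored = [(p["detected"] / p["size"], p["name"])
              for p in parts if p["size"] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scored[:keep]]

candidates = [
    {"name": "wing", "detected": 45, "size": 50},
    {"name": "spot", "detected": 12, "size": 30},
    {"name": "edge", "detected": 20, "size": 25},
]
best = select_parts(candidates, keep=2)
```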
        Recognition Experiment: Butterflies
    Admiral   Swallowtail   Machaon   Monarch 1   Monarch 2   Peacock   Zebra

• 16 training images (8 pairs) per class
• 10 validation images per class
• 437 test images
• 619 images total
Butterfly Parts
• Top 10 parts per class used for recognition
• Relative repeatability score:
    (total number of regions detected) / (total part size)
• Classification results:

               Total part size (smallest/largest)
Classification Rate vs. Number of Parts
         Detection Results (ROC Curves)

Circles: reference relative repeatability rates. Red square: ROC equal error rate (in parentheses)
Successful Detection Examples
             Training images

   Test images (blue: occluded regions)

    All ellipses found in the test images
Unsuccessful Detection Examples
               Training images

    Test images (blue: occluded regions)

      All ellipses found in the test image
• Semi-local affine parts for describing structure
  of 3D objects
• Finding a part vocabulary:
   – Correspondence search between pairs of images
   – Validation
• Additional application:
   – Finding symmetry and repetition

                     Future Work
• Find a better affine region detector
• Represent, learn inter-part relations
• Evaluation: CalTech database, harder classes, etc.
                     Birds
Egret   Puffin   Snowy Owl   Mandarin Duck   Wood Duck
Birds: Candidate Parts
      Mandarin Duck

Objects without Characteristic Texture

                 Summary of Talk
1. Recognition of single-texture images
  •   Distribution of local appearance descriptors

2. Recognition of individual regions in
   multi-texture images
  •   Local appearance + loose statistical neighborhood

3. Recognition of object categories
  •   Local appearance + strong geometric relations

For more information:
                 Issues, Extensions
• Weakly supervised learning
    – Evaluation methods?
    – Learning from contaminated data?
• Probabilistic vs. geometric approaches to invariance
• EM vs. direct correspondence search
• Training set size
• Background modeling
• Strengthening the representation
    – Heterogeneous local features
    – Automatic feature selection
    – Inter-part relations