					ICCV 2009 Paper Reading

        Ren Haoyu
        2009.10.30
Selected Paper

• Paper 1
   – 187 LabelMe Video: Building a Video Database with Human
     Annotations
   – J. Yuen, B. Russell, C. Liu, and A. Torralba
• Paper 2
   – 005 An HOG-LBP Human Detector with Partial Occlusion
     Handling
   – X. Wang, T. Han, and S. Yan
Paper 1 Overview

• Author information
• Abstract
• Paper content
  – To build a database
  – Database information
  – Database function
• Conclusion
Author information(1/4)

• Jenny Yuen
   – Education
      • BS in Computer Science from the University of Washington
      • Now a third-year PhD student at MIT-CSAIL
      • Advised by Professor Antonio Torralba
   – Research Interest
      • Computer vision (object/action recognition, scene understanding,
        image/video databases)
   – Papers
      • 2 ECCV’08, 1 CVPR’09
Author information(2/4)

• Bryan Russell
   – Education
      • ...
      • Postdoctoral Fellow at INRIA WILLOW Team
   – Research Interest
      • 3D object and scene modeling, analysis, and retrieval
      • Human activity capture and classification
      • Category-level object and scene recognition
   – Papers
      • 1 NIPS’09, 1 CVPR’09, 1 CVPR’08
Author information(3/4)

• Ce Liu
   – Education
      • BS at the Department of Automation, Tsinghua University
      • MS at the Department of Automation, Tsinghua University
      • PhD at the Department of Electrical Engineering and Computer
        Science, MIT
   – Research Interest
      • Computer Vision, Computer Graphics, Computational Photography,
        applications of Machine Learning in Vision and Graphics
   – Publications
      • ECCV’08, CVPR’08, PAMI’08…
Author information(4/4)

• Antonio Torralba
   – Education
      • …
      • Associate Professor at MIT-CSAIL
   – Research Interest
      • scene and object recognition
   – Papers
      • 3 CVPR’09, 2 ICCV’09…
Abstract (1/1)

• Problem
   – Currently, video analysis algorithms suffer from lack of information
     regarding the objects present, their interactions, as well as from
     missing comprehensive annotated video databases for
     benchmarking.
• Main contribution
   – We designed an online and openly accessible video annotation
     system that allows anyone with a browser and internet access to
     efficiently annotate object category, shape, motion, and activity
     information in real-world videos.
   – The annotations are also complemented with knowledge from
     static image databases to infer occlusion and depth
     information.
   – Using this system, we have built a scalable video database
     composed of diverse video samples and paired with human-guided
     annotations.
• We complement this paper by demonstrating potential uses
  of this database, studying motion statistics as well as
  cause-effect motion relationships between objects.
To build a database (1/1)

• In existing databases, little account has been taken of prior
  knowledge of motion, location, and appearance at the object and
  object-interaction levels in real-world videos.
• The goal is to build a video database that scales in quantity, variety,
  and quality like the benchmark databases currently available for both
  static images and videos
• Diversity, accuracy and openness
   – We want to collect a large and diverse database of videos that span
     many different scene, object, and action categories, and to
     accurately label the identity and location of objects and actions.
   – Furthermore, we wish to allow open and easy access to the data
     without copyright restrictions.
Database Information (1/1)

• 238 object classes, 70 action classes and 1,903 video
  sequences
Database function (1/1)

• Object Annotation
• Event Annotation
• Annotation interpolation
   – To fill the missing polygons in between key frames
   – 2D/3D interpolation
• Occlusion handling and depth ordering
• Cause-effect relations within moving objects
Annotation interpolation (1/3)

• 2D interpolation
   – Given the object's 2D positions p0 and p1 at two key frames t = 0 and
     t = 1, assume that the points outlining the object are transformed by a
     2D projection plus a residual term,

                          p1 = S R p0 + T + r

     where S, R, and T are the scaling, rotation, and translation matrices
     encoding the projection from p0 to p1 that minimizes the residual term r
   – Any polygon at a frame t ∈ [0, 1] can then be linearly interpolated as

                          pt = [S R]^t p0 + t (T + r)
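A minimal NumPy sketch of this interpolation step, assuming the similarity transform (scale s, rotation R, translation T) is estimated by least squares from corresponding control points; function and variable names are illustrative, not from the paper's code.

```python
import numpy as np

def estimate_similarity(p0, p1):
    """Least-squares similarity transform (scale * rotation + translation)
    mapping polygon p0 to p1; both are (N, 2) arrays of control points."""
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)
    q0, q1 = p0 - c0, p1 - c1
    # complex-number form of a 2D similarity: z1 ~ a * z0, a = s * e^{i*theta}
    z0 = q0[:, 0] + 1j * q0[:, 1]
    z1 = q1[:, 0] + 1j * q1[:, 1]
    a = np.vdot(z0, z1) / np.vdot(z0, z0)       # least-squares complex gain
    s, theta = np.abs(a), np.angle(a)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    T = c1 - s * (R @ c0)                       # translation so that p1 ~ s*R*p0 + T
    r = p1 - (s * (p0 @ R.T) + T)               # per-point residual term
    return s, theta, T, r

def interpolate_polygon(p0, p1, t):
    """Interpolated polygon p_t = [SR]^t p0 + t (T + r), for t in [0, 1]."""
    s, theta, T, r = estimate_similarity(p0, p1)
    st, tht = s ** t, theta * t                 # fractional power of the similarity
    Rt = np.array([[np.cos(tht), -np.sin(tht)],
                   [np.sin(tht),  np.cos(tht)]])
    return st * (p0 @ Rt.T) + t * (T + r)
```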
Annotation interpolation (2/3)

• 3D interpolation
   – Assuming that a given point on the object moves in a straight line in the
     3D world, the motion of the point X(t) at time t in 3D can be written as

                          X(t) = X0 + α(t) D

     where X0 is the initial point, D is the 3D direction, and α(t) is the
     displacement along the direction vector
   – Assuming perspective projection and a stationary camera, the intrinsic
     and extrinsic parameters of the camera can be expressed as a 3×4 matrix
     P, and the points projected onto the image plane are

                  P X(t) = x0 + α(t) xv,   x0 = P X0,   xv = P D

   – The image coordinates for points on the object can be written as

          (x(t), y(t)) = ( (x0 + α(t) xv) / (α(t) + 1),  (y0 + α(t) yv) / (α(t) + 1) )
Annotation interpolation (3/3)

• 3D interpolation
   – Assuming that the point moves with constant velocity, we have α(t) = v t
   – Given a corresponding second point x(t) = (x, y, 1) along the path,
     projected into another frame, we can recover the velocity as

                          v = (x − x0) / ( t (xv − x) )

   – In summary, to find the image coordinates for points on the object at
     any time, we simply need to know the coordinates of a point at two
     different times (a small sketch of these relations follows below)
• Comparison of 2D and 3D interpolation
   – Pixel error per object class.
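A small sketch of the 3D-interpolation relations above (constant-velocity case), assuming the projected quantities x0, y0, xv, yv are already known from x0 = P X0 and xv = P D; the function and variable names are illustrative.

```python
def recover_velocity(x_t, x0, xv, t):
    """Velocity v from the relation v = (x - x0) / (t (xv - x)),
    given the image x-coordinate x_t of the point observed at time t."""
    return (x_t - x0) / (t * (xv - x_t))

def interpolate_3d(x0, y0, xv, yv, v, t):
    """Image coordinates at time t under the linear 3D motion model,
    with alpha(t) = v * t:
    (x(t), y(t)) = ((x0 + a*xv)/(a + 1), (y0 + a*yv)/(a + 1))."""
    a = v * t
    return (x0 + a * xv) / (a + 1.0), (y0 + a * yv) / (a + 1.0)

# Example: observe the point at t = 0 and t = 1, then place it at t = 0.5.
# v = recover_velocity(x_t=120.0, x0=100.0, xv=300.0, t=1.0)
# x_half, y_half = interpolate_3d(100.0, 80.0, 300.0, 200.0, v, 0.5)
```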
Occlusion handling and depth ordering (1/1)

• Occlusion handling and depth ordering
   – Method 1: does not work
      • Model the appearance of the object; wherever there is overlap with
        another object, infer which object owns the visible part based on
        matching appearance
   – Method 2: does not work
      • When two objects overlap, the polygon with more control points in the
        intersection region is in front
   – Method 3: works (a counting sketch follows below)
      • Extract accurate depth information using the object labels and infer
        support relationships from a large database of annotated images
      • Define a subset of objects as ground objects (e.g., road, sidewalk,
        etc.) and infer the support relationship by counting how many times
        the bottom part of a polygon overlaps with the supporting object
        (e.g., a person on a road)
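A rough sketch of the counting idea in Method 3, not the authors' exact procedure: for each annotated image, check whether the bottom control points of an object polygon fall inside a ground-object polygon and accumulate counts per (object class, ground class) pair. The class list, threshold, and helper names are assumptions.

```python
import numpy as np
from matplotlib.path import Path

GROUND_CLASSES = {"road", "sidewalk", "grass"}      # illustrative ground objects

def supported_by(obj_poly, ground_poly, n_bottom=5):
    """True if most of the lowest control points of obj_poly (image y grows
    downward) fall inside ground_poly; a crude stand-in for the overlap test."""
    pts = np.asarray(obj_poly, dtype=float)
    bottom = pts[np.argsort(pts[:, 1])[-n_bottom:]]
    return Path(np.asarray(ground_poly)).contains_points(bottom).mean() > 0.5

def support_counts(images):
    """images: list of annotated images, each a list of (class_name, polygon).
    Counts how often each object class rests on each ground class."""
    counts = {}
    for objects in images:
        grounds = [(c, p) for c, p in objects if c in GROUND_CLASSES]
        for cls, poly in objects:
            if cls in GROUND_CLASSES:
                continue
            for g_cls, g_poly in grounds:
                if supported_by(poly, g_poly):
                    counts[(cls, g_cls)] = counts.get((cls, g_cls), 0) + 1
    return counts
```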
Cause-effect relations within moving objects (1/1)

• Cause-effect relations
   – Define a measure of causality, which is the degree to which an
     object class C causes the motion in an object of class E:
        Causality(C, E) = p(E moves | C moves and C causes E) /
                          p(E moves | C moves and C does not cause E)
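A tiny sketch of estimating this ratio from annotation counts; the count variables are illustrative placeholders for statistics gathered from the database.

```python
def causality(n_c_causes, n_e_moves_given_cause,
              n_c_not_cause, n_e_moves_given_not_cause, eps=1e-9):
    """Causality(C, E) = p(E moves | C moves and C causes E)
                       / p(E moves | C moves and C does not cause E),
    with both conditionals estimated as simple count ratios."""
    p_num = n_e_moves_given_cause / max(n_c_causes, 1)
    p_den = n_e_moves_given_not_cause / max(n_c_not_cause, 1)
    return p_num / max(p_den, eps)
```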
Examples (1/1)
Conclusion (1/1)

• We designed an open, easily accessible, and scalable
  annotation system to allow online users to label a database
  of real-world videos.
• Using our labeling tool, we created a video database that is
  diverse in samples and accurate, with human-guided
  annotations.
• Based on this database, we studied motion statistics and
  cause-effect relationships between moving objects to
  demonstrate examples of the wide array of applications for
  our database.
• Furthermore, we enriched our annotations by propagating
  depth information from a static and densely annotated
  image database.
Paper 2 Overview

• Author information
• Abstract
• Paper content
  – HOG feature
  – LBP feature
  – Occlusion handling
• Experimental result
• Conclusion
Author information(1/2)

• Wang Xiaoyu: homepage could not be found
• Tony Xu Han
   – Education
      • PhD at University of Illinois at Urbana-Champaign.
        (Advisor: Prof. Thomas Huang)
      • MS at the University of Rhode Island
      • MS at Beijing Jiaotong University
      • BS at Beijing Jiaotong University
   – Research interest
      • Computer Vision, Machine Learning, Human Computer Interaction,
        Elder Care Technology
   – Papers
      • 1 CSVT’08, 1 CSVT’09, 1 CVPR’09
Author information(2/2)

• Yan Shuicheng
   – Education
      • BS, MS, PhD at the Mathematics Department, PKU
      • Assistant Professor in the Department of Electrical and Computer
        Engineering at National University of Singapore
      • The founding lead of the Learning and Vision Research Group
   – Research Interest
      • Activity and event detection in images and videos, Subspace learning
        and manifold learning, Transfer Learning…
   – Papers
      • 2 CVPR’09, 1 IP’09, 1 PAMI’09, 2 ACM’09…
Abstract (1/1)

 • Problem
    – Performance, occlusion handling
 • Main contribution
    – By combining Histograms of Oriented Gradients (HOG) and Local
      Binary Pattern (LBP) as the feature set, we propose a novel human
      detection approach capable of handling partial occlusion.
    – For each ambiguous scanning window, we construct an occlusion
      likelihood map by using the response of each block of the HOG
      feature to the global detector.
    – If partial occlusion is indicated with high likelihood in a certain
      scanning window, part detectors are applied on the unoccluded
      regions to achieve the final classification on the current scanning
      window.
 • We achieve a detection rate of 91.3% at FPPW = 10^-6,
   94.7% at FPPW = 10^-5, and 97.9% at FPPW = 10^-4 on
   the INRIA dataset, which, to our best knowledge, is the
   best human detection performance on the INRIA
   dataset.
HOG feature(1/4)

• HOG feature
   – Histogram of Oriented Gradient, the most famous and most
     successful feature in human detection
   – Calculate the gradient orientation histogram voted by the
     gradient magnitude
   – Performs well with linear SVM, kernel SVM (RBF, IK, quadratic…),
     LDA + AdaBoost, SVM + AdaBoost, Logistic Boost…
• HOG feature extraction
HOG feature(2/4)

• HOG feature extraction
   – For a 64x128 patch, the minimum cell size is 8x8
   – Quantize the gradient orientation into 9 bins; use tri-linear
     interpolation + Gaussian weighting to vote the gradient magnitude
   – A block consists of 2x2 cells, overlapping 50%, for a total of 105 blocks
     with 3,780 bins
   – We can use an integral image to speed up HOG feature extraction without
     tri-linear interpolation and Gaussian weighting (a simplified sketch
     follows the figure below)
     [Figure: the HOG histogram of a block; each pixel's gradient magnitude is
      distributed with bilinear weights (dx, dy) into the four cells C0–C3 of
      the block, with 9 orientation bins per cell.]
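A simplified HOG sketch for a 64x128 grayscale window under the layout above (8x8 cells, 9 unsigned orientation bins, 2x2-cell blocks with 50% overlap, so 105 blocks x 36 = 3,780 dimensions). It uses nearest-bin voting rather than tri-linear interpolation and Gaussian weighting, matching the simplified integral-image variant mentioned above; it is a sketch, not the authors' implementation.

```python
import numpy as np

def hog(patch, cell=8, bins=9):
    """Simplified HOG for a 64x128 window, i.e. an array of shape (128, 64)."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0            # unsigned orientation
    b = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    H, W = patch.shape
    cy, cx = H // cell, W // cell                            # 16 x 8 cells
    hist = np.zeros((cy, cx, bins))
    for i in range(cy):
        for j in range(cx):
            sl = (slice(i * cell, (i + 1) * cell), slice(j * cell, (j + 1) * cell))
            hist[i, j] = np.bincount(b[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)
    feats = []
    for i in range(cy - 1):                                  # 15 x 7 = 105 blocks
        for j in range(cx - 1):
            block = hist[i:i + 2, j:j + 2].ravel()
            feats.append(block / (np.linalg.norm(block) + 1e-6))  # L2 block norm
    return np.concatenate(feats)                             # 3,780-dim descriptor
```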
HOG feature(3/4)

• Convoluted Tri-linear Interpolation (CTI)
   – Use CTI instead of Tri-linear Interpolation to fit integral image
   – Vote the gradient with a real-valued direction between 0 and pi
     into the 9 discrete bins according to its direction and magnitude
   – Use bilinear interpolation to distribute the magnitude of the gradient
     into the two adjacent bins
   – Design a 7x7 convolution kernel whose weights are distributed over the
     neighborhood linearly according to distance; convolving this kernel over
     each orientation bin image achieves the tri-linear interpolation
   – Build the integral images on the convolved bin images
HOG feature(4/4)

• Convoluted Tri-linear Interpolation (CTI)
   – Using the FFT for the convolution is efficient, so CTI does not increase
     the space complexity of the integral image approach (a sketch follows
     below)
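A sketch of the CTI idea under two stated assumptions: the gradient magnitude is split bilinearly between the two adjacent orientation bins, and the 7x7 kernel weights decay linearly with distance from the center (the paper designs a specific kernel; the exact weights here are an assumption). The final cumulative sums give the per-bin integral images.

```python
import numpy as np
from scipy.ndimage import convolve

def cti_integral_images(patch, bins=9, ksize=7):
    """Per-orientation bin images -> 7x7 linear-decay convolution -> integral images."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    pos = ang / (180.0 / bins)                      # real-valued bin position in [0, 9)
    lo = np.floor(pos).astype(int) % bins
    hi = (lo + 1) % bins                            # circular neighbor bin
    w_hi = pos - np.floor(pos)                      # bilinear split between the 2 bins
    bin_imgs = np.zeros((bins,) + patch.shape)
    rows, cols = np.indices(patch.shape)
    bin_imgs[lo, rows, cols] += mag * (1.0 - w_hi)
    bin_imgs[hi, rows, cols] += mag * w_hi
    # 7x7 kernel with weights decreasing linearly with distance from the center
    r = (ksize - 1) / 2.0
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    kernel = np.maximum(0.0, 1.0 - np.hypot(yy, xx) / (r + 1))
    kernel /= kernel.sum()
    conv = np.stack([convolve(b, kernel, mode='constant') for b in bin_imgs])
    return conv.cumsum(axis=1).cumsum(axis=2)       # per-bin integral images
```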
LBP feature(1/1)

• LBP feature
   – Local Binary Pattern, an exceptional texture descriptor that is widely
     used in various applications and has achieved very good results in face
     recognition
   – Build pattern histograms in cells
   – Use LBP_{n,r}^u to denote the LBP feature that takes n sample points at
     radius r, where the number of 0-1 transitions is no more than u
   – Use bilinear interpolation to locate the sample points; use Euclidean
     distance to measure the distance instead of l
   – Integral image for fast extraction (a cell-histogram sketch follows below)
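A sketch of an LBP_{8,1}^2 cell histogram with u = 2: it uses the integer 3x3 neighborhood rather than bilinearly interpolated sample points, so it is a simplification of the feature described above; the 59-bin layout (58 uniform patterns + 1 catch-all bin) is the standard one for 8 sample points.

```python
import numpy as np

def uniform_lbp_hist(cell):
    """Histogram of LBP codes for one cell over the 8 pixels of the integer
    3x3 neighborhood (radius 1).  Patterns with more than 2 bitwise 0/1
    transitions share a single 'non-uniform' bin: 58 + 1 = 59 bins."""
    c = np.asarray(cell, dtype=float)
    center = c[1:-1, 1:-1]
    # 8 neighbors in circular order around the center pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=int)
    H, W = c.shape
    for k, (dy, dx) in enumerate(offsets):
        neighbor = c[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= (neighbor >= center).astype(int) << k
    # lookup table: each 8-bit code -> uniform-pattern bin, or the last bin
    lut = np.full(256, 58, dtype=int)
    uid = 0
    for code in range(256):
        bits = [(code >> k) & 1 for k in range(8)]
        if sum(bits[k] != bits[(k + 1) % 8] for k in range(8)) <= 2:
            lut[code] = uid
            uid += 1
    return np.bincount(lut[codes].ravel(), minlength=59)
```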
Occlusion handling (1/5)

• Basic idea
   – If a portion of the pedestrian is occluded, the densely extracted
     blocks of features in that area uniformly respond to the linear SVM
     classifier with negative inner products
   – Use the classification score of each block to infer whether occlusion
     occurs and where it occurs
   – When occlusion occurs, the part-based detector is triggered to examine
     the unoccluded portion
Occlusion handling (2/5)

 • The decision function of the linear SVM is

            f(x) = Σ_{k=1..l} α_k ⟨x, x_k⟩ + b = w^T x + b

   where w is the weighting vector of the linear SVM:

            w = Σ_{k=1..l} α_k x_k = [w_1, ..., w_105]^T

   We distribute the constant bias b over the blocks B_i as β_i:

            f(x) = w^T x + b = Σ_{i=1..105} ( w_i^T x_i + β_i ) = Σ_{i=1..105} f(B_i)

   Then the real contribution of a block can be obtained by subtracting the
   corresponding bias from the sum of the feature inner products over this
   block. So the key problem is how to learn β_i (a per-block scoring sketch
   follows below).
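A minimal sketch of the per-block decomposition: given a trained linear SVM split into per-block weight vectors w_i and distributed biases β_i, the window score is the sum of the block scores f(B_i). Array shapes and names are illustrative.

```python
import numpy as np

def block_scores(block_feats, w_blocks, betas):
    """Per-block responses f(B_i) = w_i^T x_i + beta_i of the trained linear
    SVM, where the bias b has been distributed over the 105 blocks so that
    sum(betas) == b.  block_feats, w_blocks: lists of 105 vectors."""
    return np.array([w.dot(x) + beta
                     for x, w, beta in zip(block_feats, w_blocks, betas)])

# The full-window decision value is recovered by summation:
#   f(x) = w^T x + b = sum_i f(B_i) = block_scores(...).sum()
```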
Occlusion handling (3/5)

 • Learn β_i, i.e. the per-block share of the constant bias, from the
   training part of the INRIA dataset by collecting the relative ratio of the
   bias constant in each block to the total bias constant.

   Denote the sets of HOG features of the positive/negative training samples
   as {x_p^+}, p = 1, ..., N^+ and {x_q^-}, q = 1, ..., N^-
   (N^+/N^- is the number of positive/negative samples).

   Denote the i-th block of these samples as B^+_{p,i}, B^-_{q,i}. We have

       Σ_{p=1..N+} f(x_p^+) = S^+ = N^+ b + Σ_{p=1..N+} Σ_{i=1..105} w_i^T B^+_{p,i}

       Σ_{q=1..N-} f(x_q^-) = S^- = N^- b + Σ_{q=1..N-} Σ_{i=1..105} w_i^T B^-_{q,i}
Occlusion handling (4/5)
 • Denoting A = −S^- / S^+, we have

       0 = (A N^+ + N^-) b + Σ_{i=1..105} w_i^T ( A Σ_{p=1..N+} B^+_{p,i} + Σ_{q=1..N-} B^-_{q,i} )

   i.e.

       b = Σ_{i=1..105} B · w_i^T ( A Σ_{p=1..N+} B^+_{p,i} + Σ_{q=1..N-} B^-_{q,i} )

   where

       B = −1 / (A N^+ + N^-)

   Then we have

       β_i = B · w_i^T ( A Σ_{p=1..N+} B^+_{p,i} + Σ_{q=1..N-} B^-_{q,i} )
Occlusion handling (5/5)

• Implementation
   – Construct the binary occlusion likelihood image from the response of
     each block of the HOG feature; the intensity of the occlusion likelihood
     image is the sign of f(B_i)
   – Use mean shift to segment out the possible occlusion regions on the
     binary occlusion likelihood image, with |f(B_i)| as the weight
   – A segmented region of the window with an overall negative response is
     inferred to be an occluded region, and the part detector is applied; if
     all the segmented regions are consistently negative, we tend to treat
     the window as a negative image (a sketch of this pipeline follows the
     figure below)




            [Figure: original patches and the corresponding binary occlusion
             likelihood images.]
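A sketch of the occlusion-handling decision for one scanning window, with a simple sign-based split standing in for the mean-shift segmentation of the likelihood map; `part_detector` is a hypothetical callable for the part-based classifier, and the threshold and grid shape are illustrative.

```python
import numpy as np

def classify_with_occlusion(block_feats, w_blocks, betas, part_detector,
                            grid_shape=(15, 7), thresh=0.0):
    """Occlusion-handling decision for one scanning window (sketch).
    block_feats/w_blocks: 105 per-block HOG features and SVM weight slices;
    betas: distributed biases; part_detector(block_feats, visible_mask)
    returns the part-based classifier score."""
    scores = np.array([w.dot(x) + b
                       for x, w, b in zip(block_feats, w_blocks, betas)])
    total = scores.sum()                        # global linear-SVM decision value
    if total > thresh:
        return True                             # confidently human: accept directly
    lmap = scores.reshape(grid_shape)           # occlusion likelihood map (15 x 7 blocks)
    neg = lmap < 0
    if neg.all() or not neg.any():
        return False                            # no partial-occlusion pattern: reject
    # some blocks respond negatively while others are positive: treat the
    # negative region as occluded and run the part detector on visible blocks
    visible = ~neg
    return part_detector(block_feats, visible.ravel()) > thresh
```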
Experimental result (1/3)

 • Experiment 1: using cell-LBP feature on INRIA dataset
     – LBP_{8,1}^2 with the L1 norm and 16x16 cell size shows the best result,
       about 94.0% at FPPW = 10^-4
Experimental result (2/3)

 • Experiment 2: using HOG-LBP feature on INRIA dataset
     – HOG-LBP outperforms all state-of-the-art algorithms under both
       FPPI and FPPW criteria
Experimental result (3/3)

 • Experiment 3: occlusion handling
     – Using the occlusion handling strategy yields a clear improvement in
       detection performance
     – Occlusion is simulated by overlaying PASCAL segmented objects onto the
       testing images of the INRIA dataset
Conclusion (1/1)

• We propose a human detection approach capable of
  handling partial occlusion, and a feature set that
  combines the tri-linearly interpolated HOG with LBP
  within the integral image framework.
• It has been shown in our experiments that the HOG-LBP
  feature outperforms other state-of-the-art detectors on
  the INRIA dataset.
• However, our detector cannot handle the articulated
  deformation of people, which is the next problem to be
  tackled.
Thanks

				