Human Object Recognition


           Part 3 of the Biomimetic

           Bruce Draper
A Divided Vision System

The human vision system has three major components:
1. The early vision system
        Retinogeniculate pathway
            Retina → LGNd → V1 (→ V2 → V3)
             and  channels
        Retinotectal pathway
            Retina S.C. Pulvinar Nucleus V1(V2 V3)
            Retina S.C. Pulvinar Nucleus MT(dorsal)
            Retina S.C. LGNd(interlaminar) V1(V2 V3)
2.   The dorsal (“where”) pathway
3.   The ventral (“what”) pathway

                                       D. Milner & M. Goodale, The Visual Brain in Action, p. 22
The Early Vision System

   Retinotopically mapped
    –   Small receptive fields in LGNd, V1
    –   Receptive fields grow with processing depth
            Bigger in V2, Bigger still in V3…
   Spatially organized into feature maps
    –   Edge maps (Gabor filters, quadrature pairs)
    –   Color maps
    –   Disparity maps
    –   Motion maps (in MT, if not before)
   Afferent & efferent connections
   Measurable neural correlates of spatial attention
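
The edge maps above mention Gabor filters in quadrature pairs. A minimal 1-D sketch of the idea follows: an even (cosine) and an odd (sine) Gabor kernel are applied to the same window, and the sum of their squared responses gives an edge energy that does not depend on the phase of the pattern. The function names and the parameter values (size, wavelength, sigma) are illustrative assumptions, not taken from the slides.

```python
import math

def gabor_pair(size=21, wavelength=6.0, sigma=3.0):
    """Return (even, odd) 1-D Gabor kernels that form a quadrature pair."""
    half = size // 2
    even, odd = [], []
    for x in range(-half, half + 1):
        envelope = math.exp(-(x * x) / (2.0 * sigma * sigma))
        phase = 2.0 * math.pi * x / wavelength
        even.append(envelope * math.cos(phase))
        odd.append(envelope * math.sin(phase))
    # Subtract the mean of the even kernel so uniform regions give no response.
    mean = sum(even) / len(even)
    even = [e - mean for e in even]
    return even, odd

def local_energy(signal, center, even, odd):
    """Phase-invariant energy at `center`: squared even plus squared odd response."""
    half = len(even) // 2
    window = signal[center - half:center + half + 1]
    e = sum(s * k for s, k in zip(window, even))
    o = sum(s * k for s, k in zip(window, odd))
    return e * e + o * o
```

Applied to a grating at the tuned wavelength, the energy stays nearly constant as the grating is shifted, which is the property that makes quadrature pairs useful for edge maps.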
An Early Vision Hypothesis

                 The primary role of the early
               vision system is spatial attention

   Logic: why compute any feature across the entire image when it would
    be cheaper to compute it later across only the attention window?
    Because you need the feature to select the attention window.
   Neural evidence: Neural correlates of spatial attention (e.g.
    anticipatory firing, enhanced firing) are measurable in V1 and even
   Psychological evidence: ventral and dorsal streams appear to process
    the same attention windows, suggesting that attention is selected prior
    to the ventral/dorsal split.
   Caveat: Some dorsal vision tasks (e.g. ego-motion estimation) benefit
    from a broad field of view, and may be non-attentional
The Dorsal/Ventral Split

                     Color Codes:
                        –   Red: early vision
                        –   Orange/Yellow: dorsal
                                Leads to
                                 somatosensory and
                                 motor cortex
                        –   Blue/Green: ventral
                                Leads more to
                                 memories, frontal
                                More developed in
                                 humans than
A Dorsal Vision Hypothesis

    Milner & Goodale: the dorsal vision system supports
     immediate actions, and not cognition or memory

   Anatomical evidence:
    1.   strongly connected to motion and stereo processing in V1;
    2.   dorsal areas (e.g. LIP, 7a) inactive under anaesthesia
    3.   neurons conjointly tuned for perception and action
    4.   saccade-responsive neurons and gaze-responsive neurons
   Behavioral evidence:
    1.   monkeys with dorsal lesions recognize objects but can’t
         grab them;
    2.   blindsight (see next slide)
   Patients with severe damage to V1 are
    “cortically blind”
     –   Report no sensation of vision
     –   MRI confirms no activity in V1
     –   Saccadic eye movements continue
   Nonetheless, they can point at targets
     –   Much better than random (see chart)
     –   Once they relax & let it happen
   Why?
     –   Retina S.C. Pulvinar Nucleus
     –   MRI confirms some dorsal vision activity
   So?
     –   Confirms that dorsal vision has no contact with
A Ventral Vision Hypothesis

Milner & Goodale: the ventral pathway supports vision
for cognition, including (categorical & sub-categorical)
  object recognition and landmark-based navigation

    Anatomical evidence:
    1.   Visual pathways connect early vision to areas associated with
         memory (e.g. right inferior frontal lobe (RIFL))
    2.   MRI centers of activity in ventral stream during (a) expert object
         recognition and (b) landmark recognition
    Behavioral evidence:
    1.   Ventral lesions in monkeys prevent object recognition
    2.   Lesions in fusiform gyrus in humans lead to prosopagnosia
    3.   Stimulation of RIFL during surgery creates mental images
                           This may seem like a tangent, but it's not…

Repetition Suppression

   What happens when the same stimulus is presented
    repeatedly to the vision system?
    –   In fMRI studies, the total response of a voxel drops with
        each presentation
    –   In single-cell recording studies, neural responses become
            Most cells stop firing at all
            A few cells start responding at their maximal firing rate
    –   This can be observed in the ventral stream
            But not the early vision system
    –   This can be observed at both short and long time scales
            Short-time scale repetition suppression is interrupted by novel
Decomposing the Ventral Stream
The ventral stream has 4 major parts, as revealed by MRI:
1.   The early vision system
        Both the ventral & dorsal streams start here
        Selects spatial attention windows (our hypothesis)
2.   The lateral occipital cortex
        Large area, diffusely active in MRI studies
        Including (at least) V4 & V8
        Kosslyn: hypothesizes feature extraction
3.   The inferotemporal cortex
        Large area, diffusely active in MRI studies
        Sharp focus of activity in fusiform gyrus during expert recognition
        Sharp focus of activity in parahippocampal gyrus during landmark
4.   The right inferior frontal cortex
        Associated with visual memories
        Efferently stimulates V1 when active
        Strongly lateralized
Area V8 (Lateral Occipital Cortex)

    Short-term repetition studies suggest V8 computes edge-
     based features
    –    Equal amounts of suppression for image/image, image/edge,
         edge/image or edge/edge pairs
    Psychological studies suggest the recognition is sensitive to
     the disruption of “non-accidental” features
    1.   Collinearity
    2.   Parallelism (translational symmetry)
    3.   Reflection (anti-symmetry)
    4.   Co-termination (end-point near)
    5.   Constant curvature
    Diffuse response suggests population coding
An LOC Hypothesis

Area V8 detects non-accidental edge relations through parameter-
          space voting schemes (e.g. Hough spaces)

Other LOC areas use voting schemes to summarize other features,
e.g. color histograms in area V4/V7. Together, LOC areas create a
      high-dimensional but distributed feature representation

   Evidence:
    –   Diffuse responses consistent with population codes
    –   Fit psychology models of LOC as feature extraction
    –   Explains repetition suppression effects in V8
    –   Explains non-classical receptive field responses in V1
        (assuming efferent feedback to early vision)
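
A color histogram is the simplest instance of the voting idea in this hypothesis: every pixel in the attention window casts one vote for a color bin, and the normalized vote counts form a fixed-length vector. The sketch below assumes 8-bit RGB pixels and 4 bins per channel; both choices, and the function name, are illustrative rather than part of the model described here.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Each (r, g, b) pixel (0..255 per channel) casts one vote for its color bin."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    count = 0
    for r, g, b in pixels:
        ri, gi, bi = r * n // 256, g * n // 256, b * n // 256
        hist[(ri * n + gi) * n + bi] += 1.0
        count += 1
    if count:
        # Normalize so windows of different sizes are comparable.
        hist = [v / count for v in hist]
    return hist
```

Because only vote counts are kept, the vector is unchanged when pixels are rearranged inside the window, which is one reason such features tolerate small shifts of the attention window.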
Infero-temporal Cortex (IT)

   Diffusely active in fMRI during all types of object
   Last visual processing stage before memories
   Distributed responses to objects (Tsunoda, et al):

   [Figure: test stimuli and hot spots (versus control, shown for different levels of statistical significance)]
 Inferotemporal Cortex (continued)

   Hot spots overlap, and contiguous (pop. code)
   Always some response
   stimuli yield greater total responses; responses
   Minimal effect of stimulus intensity
IT (III): when stimulus is simplified
Significant results

   Figure A is a control: hot spots from 3 different
   Figure B: red spots respond to the whole cat; a
    subset of spots (blue) respond to just the head; a
    subset of that responds to a silhouette of the head
    • Implication: part-based features.
   Figure C: Blue spot responds to whole object, but
    not to simplification. Some red spots respond only to
    simplified version
    –Implication: More complex scenario: some feature responses
    are turned off by the whole object (competition?)
An IT Hypothesis

     Repetition suppression in infero-temporal cortex
        implements unsupervised feature space
    segmentation, thus categorizing attention windows

   Repetition suppression effects are strongest in IT
   Single cell recording studies show that IT cells
    respond to multiple features (e.g. color + shape)
   Simpler organizations (e.g. part/subpart hierarchies,
    “view maps”) are not supported by single-cell
    recording data
Expert Object Recognition

Expert object recognition applies when:
    –   The viewer is very familiar with the target object
    –   The illumination and viewpoint are familiar
    –   The target is recognized at both a categorical & sub-categorical
    –   Example: human faces
            Sub-categories: expression, age, gender
Expert recognition properties include:
    –   Fine sub-categorical discrimination, increased recognition speed
    –   Equal response times for category/sub-category
    –   Inability to dissociate categorical & sub-categorical recognition
    –   Trainable
            Everyone is expert at recognizing faces*, chairs; dog show judges are
             expert at dogs; subjects can be trained to be expert with Greebles.
Expert Object Recognition (II)

   Anatomically, expert object recognition is
    distinguished by:
    1.   (fMRI) Activation of early vision, LOC & IT
         –   All forms of recognition do this
    2.   (fMRI) Sharp centers of activation in fusiform
         gyrus (in IT) and right inferior frontal lobe
    3.   (ERP) The N170 signal (170 ms post stimulus)
An Expert Recognition Hypothesis

     Expert Object Recognition is appearance-based,
    matching the current stimulus to previous memories.
     When a category becomes familiar, the fusiform
    gyrus is recruited to build a manifold representation
      of the samples. Sub-categorical properties are
            encoded in the manifold dimensions

     Evidence:
      1.   Expert recognition is illumination & viewpoint dependent
      2.   It activates RIFL, which creates mental images & can
           activate the image buffers in V1.
An End-to-end computational model

(1) Bottom-up spatial selective attention
     Multi-scale maps for intensity, colors, edges (V1)
     Difference of Gaussian (on-center/off-surround)
      filtering to find impulses
     Select peaks in x, y, scale as attention windows
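
Step 1 can be sketched for a single intensity channel as follows. Each scale is a difference of Gaussians (a narrow "center" blur minus a wider "surround" blur), and attention windows are the strict local maxima over x, y and scale. The scales, the 1.6 surround ratio, the threshold and all function names are assumptions made for illustration; responses are not scale-normalized, so this is a simplification rather than the implemented system.

```python
import math

def gaussian_kernel(sigma):
    radius = max(1, int(math.ceil(3 * sigma)))
    k = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]

def blur(image, sigma):
    """Separable Gaussian blur; pixels beyond the border repeat the edge value."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    h, w = len(image), len(image[0])

    def clamp(v, limit):
        return min(max(v, 0), limit - 1)

    rows = [[sum(k[j + r] * image[y][clamp(x + j, w)] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]
    return [[sum(k[j + r] * rows[clamp(y + j, h)][x] for j in range(-r, r + 1))
             for x in range(w)] for y in range(h)]

def dog_stack(image, sigmas, ratio=1.6):
    """One on-center/off-surround map per scale."""
    stack = []
    for s in sigmas:
        center, surround = blur(image, s), blur(image, s * ratio)
        stack.append([[c - u for c, u in zip(cr, ur)]
                      for cr, ur in zip(center, surround)])
    return stack

def attention_windows(image, sigmas=(1.0, 2.0, 4.0), threshold=0.02):
    """Return (x, y, sigma, response) peaks, strongest first."""
    stack = dog_stack(image, sigmas)
    h, w, n = len(image), len(image[0]), len(sigmas)
    peaks = []
    for s in range(n):
        for y in range(h):
            for x in range(w):
                v = stack[s][y][x]
                if v < threshold:
                    continue
                neighbours = [stack[ss][yy][xx]
                              for ss in range(max(0, s - 1), min(n, s + 2))
                              for yy in range(max(0, y - 1), min(h, y + 2))
                              for xx in range(max(0, x - 1), min(w, x + 2))
                              if (ss, yy, xx) != (s, y, x)]
                if all(v > nb for nb in neighbours):
                    peaks.append((x, y, sigmas[s], v))
    return sorted(peaks, key=lambda p: -p[3])
```

Each returned tuple names a window center and a scale; a window size proportional to the scale would then be cropped for feature extraction.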
Step 1 Issues

(1) Issues with step #1:
  –   More information channels
         Motion
          –   Trent Williams found this is hard
         Disparity
  –   Inhibition of return
  –   Top-down control
         Integration of predictions (predictive attention)
         Split attention?
Note: attention windows do not correspond to objects.
They are just interesting parts of the image (but
repeatability is key)
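
Inhibition of return, listed above as an open issue, is commonly sketched as follows: after a location is attended, saliency in its neighborhood is suppressed so that the next fixation moves elsewhere. The square suppression region, its radius and the function name are assumptions for illustration only.

```python
def scan_with_inhibition(saliency, n_fixations, radius=2):
    """Pick up to n_fixations (x, y) locations, suppressing each after it is attended."""
    sal = [row[:] for row in saliency]  # work on a copy
    h, w = len(sal), len(sal[0])
    fixations = []
    for _ in range(n_fixations):
        value, x, y = max((sal[yy][xx], xx, yy)
                          for yy in range(h) for xx in range(w))
        if value <= 0.0:
            break  # nothing salient is left
        fixations.append((x, y))
        # Inhibit a square neighborhood around the attended location.
        for yy in range(max(0, y - radius), min(h, y + radius + 1)):
            for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                sal[yy][xx] = 0.0
    return fixations
```

Without the suppression step, the second fixation would land on the pixel next to the first peak instead of on a new part of the image.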
Step 2: Feature Extraction

(2) Attention windows are converted into fixed-
  length sparse feature vectors by parameter
  space voting techniques.
  –   V8 is modeled with multiple non-accidental
          Hough space for collinearity
          Hough space of axes of reflection for anti-symmetry and
  –   V4 is modeled as a color histogram
  –   Simplest feature: low-resolution pixels
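
As a rough sketch of the collinearity feature, edge points in an attention window can vote in a coarse (theta, rho) Hough space for every line rho = x cos(theta) + y sin(theta) that passes through them, and the flattened accumulator becomes a fixed-length feature vector. The bin counts, the use of unoriented edge points and the function name are assumptions; the slides do not specify these details.

```python
import math

def hough_collinearity(edge_points, size, n_theta=8, n_rho=8):
    """Accumulate line votes for (x, y) edge points in a size x size window."""
    max_rho = math.hypot(size, size)
    acc = [[0] * n_rho for _ in range(n_theta)]
    for x, y in edge_points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            # Map rho from [-max_rho, max_rho] onto a bin index.
            r = int((rho + max_rho) / (2 * max_rho) * n_rho)
            acc[t][min(r, n_rho - 1)] += 1
    # Flatten to a vector whose length does not depend on the number of edges.
    return [v for row in acc for v in row]
```

Collinear edges pile their votes into one bin, so a long straight contour shows up as a single large entry while clutter spreads thinly across the vector.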
Step 2 Examples

[Figure: source attention window with its collinearity and reflection (symmetry & vertices) features, shown in image space and Hough space]
   Collinearity: edges vote in Hough space for positions and orientations of lines
   Reflection: pairs of edges vote for axes of reflection that map one onto the other (if any)
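
The reflection voting can be sketched in the same style. For a pair of points, the only axis that maps one onto the other is their perpendicular bisector, so each pair votes for the (theta, rho) bin of that line. This sketch uses bare points and ignores edge orientation, which the full scheme would also have to check; that simplification, the bin counts and the function name are assumptions.

```python
import math
from itertools import combinations

def reflection_votes(points, size, n_theta=8, n_rho=16):
    """Each pair of points votes for its perpendicular bisector as a mirror axis."""
    max_rho = math.hypot(size, size)
    acc = [[0] * n_rho for _ in range(n_theta)]
    for (x1, y1), (x2, y2) in combinations(points, 2):
        dx, dy = x2 - x1, y2 - y1
        if dx == 0 and dy == 0:
            continue  # identical points define no axis
        # The axis normal points from one point to the other; fold into [0, pi).
        theta = math.atan2(dy, dx) % math.pi
        mx, my = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        rho = mx * math.cos(theta) + my * math.sin(theta)
        t = min(int(theta / math.pi * n_theta), n_theta - 1)
        r = min(int((rho + max_rho) / (2 * max_rho) * n_rho), n_rho - 1)
        acc[t][r] += 1
    return acc
```

A mirror-symmetric arrangement concentrates the votes of all its mirrored pairs in one bin, while unrelated pairs scatter across the accumulator.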
Step 2 Issues

   Missing features
    –   Constant curvature (V8)
    –   Apparent-color-corrected histograms (V4)
    –   Disparity features
   Huge parameter space
    –   How to evaluate features without supervision
Step 3: Feature Space Segmentation

(3) IT is modeled as O(1) unsupervised segmentation:
   –   The features extracted in step #2 are concatenated to form a
       single, high-dimensional representation
   –   A 1-level neural net is trained to segment the samples by:
           If a neuron responds < 0.5 to a sample, give it a training signal of
            0 for that sample
           If a neuron responds > 0.5, give a training signal of 1.0
           Note that every neuron is trained independently, and there is no
            communication among them
   –   The response of IT to a sample is the vector of binarized
       neural responses
           Each pattern of responses is a region in feature space
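
The training rule above can be sketched with independent sigmoid neurons, each of which uses its own binarized response as its training target. The sigmoid activation, the delta-rule update, the learning rate and the class name are assumptions chosen to make the rule concrete; the slides only fix the 0.5 threshold and the independence of the neurons.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class SelfTrainedLayer:
    """One layer of neurons; each is trained toward its own thresholded output."""

    def __init__(self, n_inputs, n_neurons, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                  for _ in range(n_neurons)]
        self.b = [0.0] * n_neurons

    def respond(self, x):
        return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def code(self, x):
        """Binarized response vector: the region of feature space x falls in."""
        return tuple(1 if r > 0.5 else 0 for r in self.respond(x))

    def train(self, samples, epochs=20, rate=0.5):
        for _ in range(epochs):
            for x in samples:
                for j, r in enumerate(self.respond(x)):
                    target = 1.0 if r > 0.5 else 0.0
                    err = target - r
                    # Each neuron is updated on its own; no signal is shared.
                    for i, xi in enumerate(x):
                        self.w[j][i] += rate * err * xi
                    self.b[j] += rate * err
```

For a single sample, every update pushes each response further from 0.5 on the side it already occupies, so the sample's code is preserved while the boundary moves away from it.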
Step 3 issues

   Stability
    –   If neurons keep adapting, then region codes change
    –   Linear neurons imply non-local interactions
            Radial basis neurons should perform better
   Evaluation: what makes one categorization better than
    –   No supervised training data
    –   Number and size of categories vary
         Gabe Salazar is cutting his teeth on this one…
   Top-down predictions
    –   Can we predict a category, and use it to influence steps 1 & 2?
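
The stability point about linear versus radial basis neurons can be illustrated with a toy comparison. A linear (sigmoid) neuron divides the whole feature space with a hyperplane, so it responds strongly to samples arbitrarily far from the ones it was trained on; a radial basis neuron responds only near its center. The weights, center and width below are made-up values for illustration.

```python
import math

def linear_neuron(x, w, b):
    """Sigmoid of a weighted sum: the response depends on the side of a hyperplane."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def rbf_neuron(x, center, width):
    """Gaussian radial basis response: falls off with distance from the center."""
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-d2 / (2.0 * width * width))

near, far = (1.0, 1.0), (10.0, 10.0)
linear_near = linear_neuron(near, (1.0, 1.0), -1.0)
linear_far = linear_neuron(far, (1.0, 1.0), -1.0)   # still fires strongly
rbf_near = rbf_neuron(near, (1.0, 1.0), 1.0)
rbf_far = rbf_neuron(far, (1.0, 1.0), 1.0)          # essentially silent
```

Adapting the linear neuron to a new cluster near `far` would also change how it treats samples near `near`, which is the non-local interaction noted above.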
Steps 4 & 5 (unimplemented)

(4) Create sub-space manifold to describe
  samples in crowded regions.
  –   PCA subspaces are a first approximation
  –   Local linear embedding manifolds are better
          Sub-categories should correspond to manifold dimensions
(5) Associative Memory
  –   Associate attention windows with:
          Other attention windows (generate predictions)
          With other modalities (e.g. language)
            Adele Howe and I have a joint interest in this last point
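
Step 4 names PCA subspaces as a first approximation. A minimal sketch, assuming the samples in one crowded region are given as equal-length feature vectors, finds the first principal axis by power iteration and uses the projection onto it as a one-dimensional manifold coordinate. The function names, the iteration count and the random starting vector are illustrative assumptions; locally linear embedding is not shown.

```python
import math
import random

def principal_axis(samples, iterations=100, seed=0):
    """Return (mean, axis): the sample mean and the first principal direction."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    centered = [[s[i] - mean[i] for i in range(d)] for s in samples]
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        # Multiply v by the covariance matrix without forming it explicitly.
        new = [0.0] * d
        for c in centered:
            dot = sum(ci * vi for ci, vi in zip(c, v))
            for i in range(d):
                new[i] += dot * c[i] / n
        norm = math.sqrt(sum(x * x for x in new))
        if norm == 0.0:
            break  # all samples identical: no direction to find
        v = [x / norm for x in new]
    return mean, v

def manifold_coordinate(sample, mean, axis):
    """Position of a sample along the principal axis."""
    return sum((s - m) * a for s, m, a in zip(sample, mean, axis))
```

If a sub-category varies smoothly across the samples, the hope expressed in step 4 is that it lines up with a coordinate like this one.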

   We have a biologically plausible model that
    –   Learns to extract and categorize image windows
        from larger scenes
    –   Without any human supervision or intervention
   We need help improving, evaluating, and
    extending it
    –   Interested parties should let me know!
