
Bayesian models of human learning and inference
(http://web.mit.edu/cocosci/Talks/nips06-tutorial.ppt)

Josh Tenenbaum
MIT
Department of Brain and Cognitive Sciences
Computer Science and AI Lab (CSAIL)
Thanks to Tom Griffiths, Charles Kemp, Vikash Mansinghka
The probabilistic revolution in AI
• Principled and effective solutions for
  inductive inference from ambiguous data:
  –   Vision
  –   Robotics
  –   Machine learning
  –   Expert systems / reasoning
  –   Natural language processing

• Standard view: no necessary connection to
  how the human brain solves these problems.
      Probabilistic inference in
         human cognition?
• “People aren't Bayesian”
  – Kahneman and Tversky (1970s-present): “heuristics and
    biases” research program. 2002 Nobel Prize in
    Economics.
  – Slovic, Fischhoff, and Lichtenstein (1976): “It appears
    that people lack the correct programs for many important
    judgmental tasks.... it may be argued that we have not had
    the opportunity to evolve an intellect capable of dealing
    conceptually with uncertainty.”
  – Stephen Jay Gould (1992): “Our minds are not built (for
    whatever reason) to work by the rules of probability.”
The probability of breast cancer is 1% for a woman at 40 who
participates in a routine screening. If a woman has breast cancer,
the probability is 80% that she will have a positive mammography.
If a woman does not have breast cancer, the probability is 9.6%
that she will also have a positive mammography.

A woman in this age group had a positive mammography in a
routine screening. What is the probability that she actually has
breast cancer?
A. greater than 90%
B. between 70% and 90%
C. between 50% and 70%
D. between 30% and 50%
E. between 10% and 30%
F. less than 10%

Correct answer: F. Yet about 95 out of 100 doctors choose a far higher
option ("base rate neglect").
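A minimal Python sketch (an addition to these notes, not part of the
original slides) that checks the correct answer by plugging the stated
numbers into Bayes' rule:

    prior = 0.01          # P(cancer): the 1% base rate
    hit = 0.80            # P(positive | cancer)
    false_alarm = 0.096   # P(positive | no cancer)

    # Bayes' rule: P(cancer | positive)
    posterior = (hit * prior) / (hit * prior + false_alarm * (1 - prior))
    print(f"P(cancer | positive) = {posterior:.3f}")   # ~0.078, i.e. answer F

The base rate dominates: even with an 80% hit rate, most positive tests
come from the much larger cancer-free group.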
        Availability biases in
        probability judgment
• How likely is it that a randomly chosen word
  – ends in “g”?
  – ends in “ing”?


• When buying a car, how much do you
  weigh your friend's experience relative to
  consumer satisfaction surveys?
      Probabilistic inference in
         human cognition?
• “People aren't Bayesian”
  – Kahneman and Tversky (1970s-present): “heuristics
    and biases” research program. 2002 Nobel Prize in
    Economics.
• Psychology is often drawn towards the
  mind's errors and apparent irrationalities.
• But the computationally interesting question
  remains: How does the mind work so well?
       Bayesian models of cognition
Visual perception [Weiss, Simoncelli, Adelson, Richards, Freeman, Feldman,
   Kersten, Knill, Maloney, Olshausen, Jacobs, Pouget, ...]
Language acquisition and processing [Brent, de Marcken, Niyogi, Klein,
   Manning, Jurafsky, Keller, Levy, Hale, Johnson, Griffiths, Perfors, Tenenbaum, …]
Motor learning and motor control [Ghahramani, Jordan, Wolpert, Kording,
   Kawato, Doya, Todorov, Shadmehr, …]
Associative learning [Dayan, Daw, Kakade, Courville, Touretzky, Kruschke, …]
Memory [Anderson, Schooler, Shiffrin, Steyvers, Griffiths, McClelland, …]
Attention [Mozer, Huber, Torralba, Oliva, Geisler, Yu, Itti, Baldi, …]
Categorization and concept learning [Anderson, Nosofsky, Rehder, Navarro,
   Griffiths, Feldman, Tenenbaum, Rosseel, Goodman, Kemp, Mansinghka, …]
Reasoning [Chater, Oaksford, Sloman, McKenzie, Heit, Tenenbaum, Kemp, …]
Causal inference [Waldmann, Sloman, Steyvers, Griffiths, Tenenbaum, Yuille, …]
Decision making and theory of mind [Lee, Stankiewicz, Rao, Baker,
   Goodman, Tenenbaum, …]
Learning concepts from examples

• Word learning

  [Images: three different horses, each labeled “horse”]

Learning concepts from examples

  [Images: many novel objects; three of them labeled “tufa”]
       Everyday inductive leaps
How can people learn so much about the
 world . . .
  –   Kinds of objects and their properties
  –   The meanings of words, phrases, and sentences
  –   Cause-effect relations
  –   The beliefs, goals and plans of other people
  –   Social structures, conventions, and rules

. . . from such limited evidence?
Contributions of Bayesian models
• Principled quantitative models of human behavior,
  with broad coverage and a minimum of free
  parameters and ad hoc assumptions.
• Explain how and why human learning and
  reasoning works, in terms of (approximations to)
  optimal statistical inference in natural
  environments.
• A framework for studying people's implicit
  knowledge about the structure of the world: how it
  is structured, used, and acquired.
• A two-way bridge to state-of-the-art AI and
  machine learning.
Marr's Three Levels of Analysis
• Computation:
   “What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?”

• Algorithm:
    Cognitive psychology

• Implementation:
    Neurobiology
       What about those errors?
• The human mind is not a universal Bayesian engine.
• But, the mind does appear adapted to solve
  important real-world inference problems in
  approximately Bayesian ways, e.g.
   – Predicting everyday events
   – Causal learning and reasoning
   – Learning concepts from examples
• As with perceptual tasks, adults and even young
  children solve these problems mostly
  unconsciously, effortlessly, and successfully.
           Technical themes
• Inference in probabilistic models
  – Role of priors, explaining away.
• Learning in graphical models
  – Parameter learning, structure learning.
• Bayesian model averaging
  – Being Bayesian over network structures.
• Bayesian Occam's razor
  – Trade off model complexity against data fit.
           Technical themes
• Structured probabilistic models
  – Grammars, first-order logic, relational schemas.
• Hierarchical Bayesian models
  – Acquiring abstract knowledge that supports transfer.
• Nonparametric Bayes
  – Flexible models that grow in complexity as new
    data warrant.
• Tractable approximate inference
  – Markov chain Monte Carlo (MCMC),
    Sequential Monte Carlo (particle filtering).
              Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
    Basics of Bayesian inference
• Bayes' rule:

      P(h | d) = P(d | h) P(h) / Σ_{hi ∈ H} P(d | hi) P(hi)

• An example
   – Data: John is coughing
   – Some hypotheses:
      1. John has a cold
      2. John has lung cancer
      3. John has a stomach flu
   – Likelihood P(d|h) favors 1 and 2 over 3
   – Prior probability P(h) favors 1 and 3 over 2
   – Posterior probability P(h|d) favors 1 over 2 and 3
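A small Python sketch of this example, with invented (purely illustrative)
priors and likelihoods chosen only to match the qualitative pattern above:

    priors = {"cold": 0.50, "lung cancer": 0.01, "stomach flu": 0.49}       # P(h)
    likelihoods = {"cold": 0.80, "lung cancer": 0.70, "stomach flu": 0.10}  # P(d=coughing | h)

    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    Z = sum(unnormalized.values())                 # denominator of Bayes' rule
    posterior = {h: p / Z for h, p in unnormalized.items()}
    for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
        print(f"P({h} | coughing) = {p:.3f}")

Only "cold" scores high on both the likelihood and the prior, so it
dominates the posterior.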
 Bayesian inference in perception and
      sensorimotor integration




(Weiss, Simoncelli & Adelson 2002)   (Kording & Wolpert 2004)
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)

[Plots of log memory strength: the power law of forgetting (vs. log delay
in hours); additive effects of practice & delay (vs. log delay in seconds);
spacing effects in forgetting (mean # recalled vs. retention interval in days)]
Memory retrieval as Bayesian inference
             (Anderson & Schooler, 1991)

For each item in
 memory, estimate
 the probability that it
 will be useful in the
 present context.

Use priors based on
 the statistics of
 natural information
 sources.
Memory retrieval as Bayesian inference
(Anderson & Schooler, 1991)

[Plots of log need odds vs. log # days since last occurrence, mirroring the
memory data: power law of forgetting, additive effects of practice & delay,
spacing effects in forgetting]

[New York Times data; c.f. email sources, child-directed speech]
     Everyday prediction problems
             (Griffiths & Tenenbaum, 2006)

• You read about a movie that has made $60 million
  to date. How much money will it make in total?
• You see that something has been baking in the
  oven for 34 minutes. How long until it's ready?
• You meet someone who is 78 years old. How long
  will they live?
• Your friend quotes to you from line 17 of his
  favorite poem. How long is the poem?
• You see taxicab #107 pull up to the curb in front of
  the train station. How many cabs are in this city?
          Making predictions
• You encounter a phenomenon that has
  existed for tpast units of time. How long will
  it continue into the future? (i.e., what's ttotal?)

• We could replace “time” with any other
  quantity that ranges from 0 to some
  unknown upper limit.
Bayesian inference

P(ttotal|tpast) ∝ P(tpast|ttotal) × P(ttotal)
   posterior        likelihood       prior

Likelihood P(tpast|ttotal) = 1/ttotal: assume tpast is a random sample
from (0, ttotal).
Prior P(ttotal) ∝ 1/ttotal: an "uninformative" prior (e.g., Jeffreys, Jaynes).
Bayesian inference

P(ttotal|tpast) ∝ 1/ttotal × 1/ttotal
   posterior ∝ random-sampling likelihood × "uninformative" prior

[Figure: the posterior P(ttotal|tpast) is zero below tpast and falls off
as 1/ttotal² above it]

Best guess for ttotal: the t such that P(ttotal > t | tpast) = 0.5.

Yields Gott's Rule: P(ttotal > t | tpast) = 0.5 when t = 2 tpast,
i.e., the best guess for ttotal is 2 tpast.
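A numerical sketch (added for illustration) that recovers Gott's Rule by
discretizing the posterior and finding its median:

    import numpy as np

    t_past = 10.0
    t_total = np.linspace(t_past, 1000 * t_past, 200_000)  # grid over ttotal >= tpast
    dt = t_total[1] - t_total[0]

    posterior = (1 / t_total) * (1 / t_total)   # likelihood x prior, unnormalized
    posterior /= posterior.sum() * dt           # normalize on the grid

    cdf = np.cumsum(posterior) * dt
    median = t_total[np.searchsorted(cdf, 0.5)]
    print(f"posterior median = {median:.1f}; Gott's rule predicts {2 * t_past}")

The printed median is about 20, i.e. 2 × tpast, up to grid and truncation
error.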
Evaluating Gott's Rule
• You read about a movie that has made $78 million
  to date. How much money will it make in total?
   – "$156 million" seems reasonable.
• You meet someone who is 35 years old. How
  long will they live?
   – "70 years" seems reasonable.
• Not so simple:
   – You meet someone who is 78 years old. How long will
     they live?
   – You meet someone who is 6 years old. How long will
     they live?
The effects of priors
• Different kinds of priors P(ttotal) are
  appropriate in different domains.

  [Plots: a power-law prior (e.g., wealth, contacts) vs. a peaked prior
  (e.g., height, lifespan)]

  [Gott: P(ttotal) ∝ ttotal⁻¹]

The effects of priors
[Figure: posterior predictions under each kind of prior]
 Evaluating human predictions
• Different domains with different priors:
  –   A movie has made $60 million
  –   Your friend quotes from line 17 of a poem
  –   You meet a 78 year old man
  –   A movie has been running for 55 minutes
  –   A U.S. congressman has served for 11 years
  –   A cake has been in the oven for 34 minutes
• Use 5 values of tpast for each.
• People predict ttotal .
You learn that in ancient Egypt, there was a great flood in the 11th year
of a pharaoh's reign. How long did he reign?

How long did the typical pharaoh reign in ancient Egypt?
          Summary: prediction
• Predictions about the extent or magnitude of
  everyday events follow Bayesian principles.
• Contrast with Bayesian inference in perception,
  motor control, memory: no "universal priors" here.
• Predictions depend rationally on priors that are
  appropriately calibrated for different domains.
   – Form of the prior (e.g., power-law or exponential)
   – Specific distribution given that form (parameters)
   – Non-parametric distribution when necessary.

• In the absence of concrete experience, priors may
  be generated by qualitative background knowledge.
              Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
Bayesian networks

Nodes: variables
Links: direct dependencies
Each node has a conditional probability distribution
Data: observations of X1, ..., X4

Four random variables:
   X1  coughing
   X2  high body temperature
   X3  flu
   X4  lung cancer

[Network: X4 → X1 ← X3 → X2, with distributions P(x4), P(x3),
P(x1|x3, x4), P(x2|x3)]
Causal Bayesian networks

Nodes: variables
Links: causal mechanisms
Each node has a conditional probability distribution
Data: observations of and interventions on X1, ..., X4

Four random variables:
   X1  coughing
   X2  high body temperature
   X3  flu
   X4  lung cancer

[Network: X4 → X1 ← X3 → X2, with P(x4), P(x3), P(x1|x3, x4), P(x2|x3)]

(Pearl; Glymour & Cooper)
Inference in causal graphical models

• Explaining away or "discounting" in
  social reasoning (Kelley; Morris & Larrick)
  [Common-effect structure: A → C ← B]

• "Screening off" in intuitive causal reasoning
  (Waldmann, Rehder & Burnett, Blok & Sloman, Gopnik & Sobel)
  [Chain A → B → C vs. common cause A ← B → C:
   compare P(c|b) with P(c|b, a) and P(c|b, not a)]
   – Better in chains than common-cause structures; common-cause
     better if mechanisms clearly independent
• Understanding and predicting the effects of
  interventions (Sloman & Lagnado; Gopnik & Schulz)
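A toy Python demonstration of explaining away and screening off in the
flu/cancer network from the preceding slides. The CPT values are invented
for illustration; only the graph structure comes from the slides:

    from itertools import product

    # Network: X4 (lung cancer) -> X1 (cough) <- X3 (flu) -> X2 (fever)
    P_x3 = {1: 0.10, 0: 0.90}                    # P(flu)
    P_x4 = {1: 0.01, 0: 0.99}                    # P(lung cancer)
    P_x2_given_x3 = {1: 0.80, 0: 0.05}           # P(fever | flu)
    P_x1_given = {(0, 0): 0.05, (1, 0): 0.80,    # P(cough | flu, cancer)
                  (0, 1): 0.70, (1, 1): 0.95}

    def joint(x1, x2, x3, x4):
        return (P_x3[x3] * P_x4[x4]
                * (P_x2_given_x3[x3] if x2 else 1 - P_x2_given_x3[x3])
                * (P_x1_given[(x3, x4)] if x1 else 1 - P_x1_given[(x3, x4)]))

    def prob(query, given):
        """P(query | given), by brute-force enumeration of the joint."""
        names = ("x1", "x2", "x3", "x4")
        num = den = 0.0
        for vals in product((0, 1), repeat=4):
            w = dict(zip(names, vals))
            p = joint(*vals)
            if all(w[k] == v for k, v in given.items()):
                den += p
                if all(w[k] == v for k, v in query.items()):
                    num += p
        return num / den

    # Explaining away: learning the patient has flu "discounts" cancer.
    print(prob({"x4": 1}, {"x1": 1}))           # P(cancer | cough)       ~0.055
    print(prob({"x4": 1}, {"x1": 1, "x3": 1}))  # P(cancer | cough, flu)  ~0.012

    # Screening off: given flu, fever carries no information about cancer.
    print(prob({"x2": 1}, {"x3": 1}))           # these two values
    print(prob({"x2": 1}, {"x3": 1, "x4": 1}))  # are exactly equal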
Learning graphical models

• Structure learning: what causes what?
  [Which links exist among X1, ..., X4?]

• Parameter learning: how do causes work?
  [Given the links, what are P(x4), P(x3), P(x1|x3, x4), P(x2|x3)?]
Bayesian learning of causal structure

Data d:                                 Causal hypotheses h:
  d1 = (X1=1, X2=1, X3=1, X4=1)         [candidate networks over X1, ..., X4]
  d2 = (X1=1, X2=0, X3=0, X4=1)
  d3 = (X1=0, X2=1, X3=0, X4=1)
  d4 = (X1=1, X2=0, X3=1, X4=1)

1. What is the most likely network h given observed data d?

      P(hi | d) = P(d | hi) P(hi) / Σ_j P(d | hj) P(hj)

2. How likely is there to be a link X4 → X2?

      P(X4 → X2 | d) = Σ_{hj ∈ H} P(X4 → X2 | hj) P(hj | d)
                                       (Bayesian model averaging)
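A toy sketch of both questions for a two-variable problem. To keep it
short, each hypothesis has fixed, invented parameters rather than the
integrated-out parameters the tutorial assumes:

    from math import prod

    data = [(1, 1), (1, 1), (0, 0), (1, 0), (0, 0), (1, 1)]  # invented (x, y) pairs

    def lik_h0(x, y):   # h0: X and Y independent, both Bernoulli(0.5)
        return 0.5 * 0.5

    def lik_h1(x, y):   # h1: X -> Y with P(y=1|x=1) = 0.9, P(y=1|x=0) = 0.1
        p_y = 0.9 if x == 1 else 0.1
        return 0.5 * (p_y if y == 1 else 1 - p_y)

    post = {"h0": 0.5 * prod(lik_h0(x, y) for x, y in data),
            "h1": 0.5 * prod(lik_h1(x, y) for x, y in data)}
    Z = sum(post.values())
    post = {h: p / Z for h, p in post.items()}

    # Model averaging: P(X -> Y | d) sums posterior mass over every
    # hypothesis containing the link; here that is just h1.
    print(f"P(h1 | d) = P(X -> Y | d) = {post['h1']:.3f}")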
Bayesian Occam's Razor

[Figure: p(D = d | M) plotted over all possible data sets d. A simple
model M1 concentrates its probability on few data sets; a more flexible
model M2 spreads its probability thinly. (MacKay, 2003; Ghahramani
tutorials)]

For any model M:   Σ_{all d ∈ D} p(D = d | M) = 1
Law of “conservation of belief”: A model that can predict many
possible data sets must assign each of them low probability.
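A sketch of the conservation law with two toy models of five coin flips
(my example, not the tutorial's): M1 says the coin is fair; M2 allows any
bias, averaged over a uniform prior:

    from itertools import product

    def p_data_m1(flips):
        return 0.5 ** len(flips)

    def p_data_m2(flips, grid=1000):
        # marginal likelihood under theta ~ Uniform(0, 1), by grid average
        k, n = sum(flips), len(flips)
        thetas = [(i + 0.5) / grid for i in range(grid)]
        return sum(t**k * (1 - t)**(n - k) for t in thetas) / grid

    datasets = list(product((0, 1), repeat=5))
    print(sum(p_data_m1(d) for d in datasets))   # 1.0
    print(sum(p_data_m2(d) for d in datasets))   # 1.0, up to grid error

    # The flexible M2 pays for covering extreme data sets ...
    print(p_data_m1((1, 1, 1, 1, 1)), p_data_m2((1, 1, 1, 1, 1)))  # M2 higher
    # ... by assigning less to the "typical" ones M1 predicts well.
    print(p_data_m1((1, 0, 1, 0, 1)), p_data_m2((1, 0, 1, 0, 1)))  # M1 higher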
Learning causation from contingencies

                   C present (c+)   C absent (c-)
E present (e+)           a                c
E absent  (e-)           b                d

e.g., "Does injecting this chemical cause mice to express a certain gene?"

Subjects judge the extent to which C causes E
(rated on a scale from 0 to 100)
Two models of causal judgment

• Delta-P (Jenkins & Ward, 1965):

      ΔP = P(e+ | c+) − P(e+ | c−)

• Power PC (Cheng, 1997):

      power = ΔP / (1 − P(e+ | c−))
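Both models computed from a contingency table, with invented counts
(a, b, c, d as defined in the table above):

    a, b = 18, 2    # C present: E present (a), E absent (b)
    c, d = 8, 12    # C absent:  E present (c), E absent (d)

    p_e_c = a / (a + b)        # P(e+ | c+)
    p_e_notc = c / (c + d)     # P(e+ | c-)

    delta_p = p_e_c - p_e_notc
    power = delta_p / (1 - p_e_notc)
    print(f"Delta-P = {delta_p:.2f}, causal power = {power:.2f}")

As the parameter-learning slide below notes, these are the maximum
likelihood strength estimates under a linear vs. a noisy-OR
parameterization.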
Judging the probability that C → E
(Buehner & Cheng, 1997; 2003)

[Plots: human judgments ("People") alongside the predictions of ΔP and
Power, across conditions with ΔP = 0.00, 0.25, 0.50, 0.75, 1.00]

• Independent effects of both ΔP and causal power.
• At ΔP = 0, judgments decrease with base rate ("frequency illusion").
Learning causal strength (parameter learning)

Assume this causal structure:

   [B → E ← C, with strength w0 for the background cause B and
    strength w1 for the candidate cause C]

ΔP and causal power are maximum likelihood estimates of the strength
parameter w1, under different parameterizations for P(E|B,C):
      linear → ΔP;   noisy-OR → causal power
Learning causal structure
(Griffiths & Tenenbaum, 2005)

• Hypotheses:   h1: B → E ← C (strengths w0, w1)
                h0: B → E (strength w0 only)

• Bayesian causal support: the log likelihood ratio (Bayes factor)

      support = log [ P(d | h1) / P(d | h0) ]

  gives the evidence in favor of h1, where

      P(d | h0) = ∫₀¹ P(d | w0) p(w0 | h0) dw0

      P(d | h1) = ∫₀¹ ∫₀¹ P(d | w0, w1) p(w0, w1 | h1) dw0 dw1

  with a noisy-OR parameterization for h1.

(assume uniform parameter priors, but see Yuille et al., Danks et al.)
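A numerical sketch of causal support, approximating both integrals by grid
averages (counts invented for illustration):

    import numpy as np

    n_c1, k_c1 = 20, 18   # cause present: trials, effects
    n_c0, k_c0 = 20, 8    # cause absent:  trials, effects

    w = np.linspace(0.005, 0.995, 100)   # grid over strength parameters
    W0, W1 = np.meshgrid(w, w)

    # h0: background only; the effect rate is w0 with or without C.
    lik_h0 = w**(k_c1 + k_c0) * (1 - w)**(n_c1 - k_c1 + n_c0 - k_c0)
    p_d_h0 = lik_h0.mean()               # uniform prior on w0

    # h1: noisy-OR, P(e+|c+) = w0 + w1 - w0*w1 and P(e+|c-) = w0.
    p_c1 = W0 + W1 - W0 * W1
    lik_h1 = (p_c1**k_c1 * (1 - p_c1)**(n_c1 - k_c1)
              * W0**k_c0 * (1 - W0)**(n_c0 - k_c0))
    p_d_h1 = lik_h1.mean()               # uniform prior on (w0, w1)

    print(f"causal support = {np.log(p_d_h1 / p_d_h0):.2f}")   # > 0 favors h1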
Buehner and Cheng (1997)

[Plots: human judgments ("People") against each model's predictions:
ΔP (r = 0.89), Power (r = 0.88), Support (r = 0.97)]
       Implicit background theory
• Injections may or may not cause gene expression,
  but gene expression does not cause injections.
   – No hypotheses with E → C
• Other naturally occurring processes may also
  cause gene expression.
   – All hypotheses include an always-present background cause B → E
• Causes are generative, probabilistically sufficient
  and independent, i.e. each cause independently
  produces the effect in some proportion of cases.
   – Noisy-OR parameterization
Sensitivity analysis

[Plots: human judgments ("People") against Support with the noisy-OR
parameterization, χ², and Support with a generic parameterization]
Generativity is essential

P(e+|c+)    8/8    6/8    4/8    2/8    0/8
P(e+|c-)    8/8    6/8    4/8    2/8    0/8

[Plot: Support (0-100) across these conditions]

• Predictions result from "ceiling effect"
   – ceiling effects only matter if you believe a cause
     increases the probability of an effect
Blicket detector
(Sobel, Gopnik, and colleagues)

“See this? It's a blicket machine. Blickets make it go.”
“Let's put this one on the machine.”
“Oooh, it's a blicket!”

Procedure used in Sobel et al. (2002), Experiment 2

One-Cause Condition: Both objects activate the detector; object A does not
activate the detector by itself. Children are asked if each is a blicket;
then they are asked to make the machine go.

Backward Blocking Condition: Both objects activate the detector; object A
activates the detector by itself. Children are asked if each is a blicket;
then they are asked to make the machine go.
"Backwards blocking"
(Sobel, Tenenbaum & Gopnik, 2004)

– Initially: nothing on detector – detector silent (A=0, B=0, E=0)
– Trial 1: A and B on detector – detector active (A=1, B=1, E=1)
– Trial 2: A on detector – detector active (A=1, B=0, E=1)
– 4-year-olds judge whether each object is a blicket:
    A: a blicket (100% say yes)
    B: probably not a blicket (34% say yes)

(cf. "explaining away in weight space", Dayan & Kakade)
Possible hypotheses?

[Figure: the candidate causal networks over A, B, and E]
Bayesian causal learning

With a uniform prior on hypotheses and a generic parameterization, the
probability of being a blicket stays nearly equal for the two objects:

       A       B
     0.32    0.32
     0.34    0.34

(so no backward blocking emerges without stronger prior knowledge)
          A stronger hypothesis space
• Links can only exist from blocks to detectors.
• Blocks are blickets with prior probability q.
• Blickets always activate detectors, detectors never activate on their
  own (i.e., deterministic OR parameterization, no hidden causes).

P(h00) = (1 − q)²    P(h01) = (1 − q)q    P(h10) = q(1 − q)    P(h11) = q²

   [Graphs: h00: no links;  h01: B → E;  h10: A → E;  h11: A → E ← B]

P(E=1 | A=0, B=0):     0        0        0        0
P(E=1 | A=1, B=0):     0        0        1        1
P(E=1 | A=0, B=1):     0        1        0        1
P(E=1 | A=1, B=1):     0        1        1        1
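A minimal check (added here) that this hypothesis space produces backward
blocking, using the priors and deterministic-OR likelihood from the table
above:

    q = 0.2   # illustrative prior probability of being a blicket

    # Hypotheses (A is a blicket?, B is a blicket?) with their priors.
    hyps = {(0, 0): (1-q)**2, (0, 1): (1-q)*q, (1, 0): q*(1-q), (1, 1): q**2}

    def predicts(h, a, b):
        """Deterministic OR: the detector fires iff some blicket is on it."""
        return int((h[0] and a) or (h[1] and b))

    for (a, b, e) in [(1, 1, 1), (1, 0, 1)]:   # the AB trial, then A alone
        hyps = {h: p for h, p in hyps.items() if predicts(h, a, b) == e}
        Z = sum(hyps.values())
        hyps = {h: p / Z for h, p in hyps.items()}
        p_a = sum(p for h, p in hyps.items() if h[0])
        p_b = sum(p for h, p in hyps.items() if h[1])
        print(f"A={a} B={b} E={e}:  P(A blicket)={p_a:.2f}  P(B blicket)={p_b:.2f}")

After the A-alone trial, P(A is a blicket) = 1 while P(B is a blicket)
falls back to the prior q, mirroring the children's judgments.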
Manipulating prior probability
(Tenenbaum, Sobel, Griffiths, & Gopnik)

[Figures: the one-cause and backward-blocking procedures of Sobel et al.
(2002), Experiment 2, repeated while the prior probability that a block
is a blicket is manipulated]
Learning more complex structures
• Tenenbaum et al., Griffiths & Sobel: detectors with
  more than two objects and noisy mechanisms
• Steyvers et al., Sobel & Kushnir: active learning
  with interventions (c.f. Tong & Koller, Murphy)




• Lagnado & Sloman: learning
  from interventions on
  continuous dynamical
  systems
Inferring hidden causes

The "stick ball" machine (Kushnir, Schulz, Gopnik, & Danks, 2003)

Common unobserved cause:         4x   2x   2x
Independent unobserved causes:   1x   2x   2x   2x   2x
One observed cause:              2x   4x
Bayesian learning with unknown number of hidden variables
(Griffiths et al. 2006)

[Plots: model vs. human judgments for the common-unobserved-cause,
independent-unobserved-causes, and one-observed-cause conditions;
model parameters a = 0.3, w = 0.8; correlation r = 0.94]
Inferring latent causes in classical conditioning
(Courville, Daw, Gordon, Touretzky 2003)

e.g., A = noise, X = tone, B = click, US = shock

Training:  A–US,  A–X,  B–US
Test:      X,  X–B
         Inferring latent causes in
            perceptual learning
         (Orban, Fiser, Aslin, Lengyel 2006)

Learning to recognize objects and segment scenes:
Inferring latent causes in
   sensory integration
 (Kording et al. 2006, NIPS 06)
               Coincidences
        (Griffiths & Tenenbaum, in press)

• The birthday problem
  – How many people do you need to have in the
    room before the probability exceeds 50% that
    two of them have the same birthday? Answer: 23
    (checked in the sketch below).

• The bombing of London
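A quick check of the birthday answer above (added for illustration),
assuming 365 equally likely birthdays:

    def p_shared(n):
        p_all_distinct = 1.0
        for i in range(n):
            p_all_distinct *= (365 - i) / 365
        return 1 - p_all_distinct

    print(p_shared(22), p_shared(23))   # 0.476 -> 0.507: crosses 50% at n = 23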
How much of a coincidence?

Bayesian coincidence factor:  log [ P(d | latent) / P(d | random) ]

Chance: birthdays scattered uniformly across the year.
Latent common cause C: birthdays clustered (e.g., in August).

Alternative hypotheses: proximity in date, matching days of the month,
matching month, ...

[Figure: the chance model generates dates uniformly; the latent-cause
model generates a mixture of uniform + regularity]
 Summary: causal inference & learning
• Human causal induction can be explained using
  core principles of graphical models.
  – Bayesian inference (explaining away, screening off)
  – Bayesian structure learning (Occam's razor,
     model averaging)
  – Active learning with interventions
  – Identifying latent causes
 Summary: causal inference & learning
• Crucial constraints on hypothesis spaces come
  from abstract prior knowledge, or “intuitive
  theories”.
  – What are the variables?
  – How can they be connected?
  – How are their effects parameterized?

• Big open questions…
  – How can these theories be described formally?
  – How can these theories be learned?
    Hierarchical Bayesian framework
Abstract
Principles



Structure




  Data


(Griffiths, Tenenbaum, Kemp et al.)
A theory for blickets
(c.f. PRMs, BLOG, FOPL)

Learning with a uniform prior on network structures:

[Figure: a true network over 12 attributes; 75 observations are sampled
(patients × attributes) and a structure is learned from the observed data]
Learning a block-structured prior on network structures:
(Mansinghka et al. 2006)

[Figure: attributes 1-12 are assigned to classes z, with a matrix h of
class-to-class connection probabilities, e.g.:
    0.0  0.8  0.01
    0.0  0.0  0.75
    0.0  0.0  0.0
A true network over the 12 attributes; 75 observations are sampled
(patients × observed data)]
The "blessing of abstraction"

[Figure: the true structure of graphical model G spans 16 variables. Edge
recovery is compared after 20, 80, and 1000 samples for two learners: one
that learns Graph G from Data D directly (tracking edges of G), and one
that also learns an Abstract theory Z of node classes (tracking edges of
G and classes of z); learning the abstract theory speeds up structure
learning.]
              Outline
• Predicting everyday events
• Causal learning and reasoning
• Learning concepts from examples
[Images: many novel objects; three of them labeled “tufa”]

Learning from just one or a few examples, and mostly
unlabeled examples (“semi-supervised learning”).
Simple model of concept learning

   “This is a blicket.”    “Can you show me the other blickets?”

   [Figure: one labeled example among many unlabeled objects; the other
   blickets are the objects falling in the same cluster]

Learning from just one positive example is possible if:
   – Assume concepts refer to clusters in the world.
   – Observe enough unlabeled data to identify clear clusters.
   (c.f. Learning with mixture models and EM, Ghahramani &
      Jordan, 1994; Nigam et al. 2000)
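A small sketch of that clustering intuition (the data and model details
here are invented, not the tutorial's): fit a two-component Gaussian
mixture to unlabeled 1-D data by EM, then let the single labeled example
pick out the "blicket" cluster:

    import numpy as np

    rng = np.random.default_rng(0)
    unlabeled = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(3, 0.7, 100)])
    x_new = 3.1   # the one example labeled "blicket"

    mu, sigma, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
    for _ in range(50):
        # E-step: responsibility of each component for each point
        dens = pi * np.exp(-0.5 * ((unlabeled[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, standard deviations
        nk = resp.sum(axis=0)
        pi = nk / len(unlabeled)
        mu = (resp * unlabeled[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (unlabeled[:, None] - mu) ** 2).sum(axis=0) / nk)

    blicket = np.argmax(pi * np.exp(-0.5 * ((x_new - mu) / sigma) ** 2) / sigma)
    print(f"cluster means {mu.round(2)}; the 'blicket' cluster is {blicket}")

Every other point assigned to that cluster is then one of the "other
blickets".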
  Concept learning with mixture
   models in cognitive science
• Fried & Holyoak (1984)
  – Modeled unsupervised and
    semi-supervised categorization
    as EM in a Gaussian mixture.


• Anderson (1990)
  – Modeled unsupervised and semi-supervised
    categorization as greedy sequential search in an
    infinite (Chinese restaurant process) mixture.
   Infinite (CRP) mixture models
• Construct from k-component mixtures by integrating
  out mixing weights, collapsing equivalent partitions,
  and taking the limit as k → ∞.
• Does not require that we commit to a fixed – or even
  finite – number of classes.
• Effective number of classes can grow with number
  of data points, balancing complexity with data fit.
• Computationally much simpler than applying
  Bayesian Occam's razor or cross-validation.
• Easy to learn with standard Monte Carlo
  approximations (MCMC, particle filtering),
  hopefully avoiding local minima.
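A minimal sketch of sampling a partition from the CRP prior (an addition
for illustration): each point joins an existing class with probability
proportional to its size, or starts a new class with probability
proportional to alpha:

    import random

    random.seed(1)

    def crp(n_points, alpha):
        sizes = []          # current class sizes ("tables")
        assignments = []
        for _ in range(n_points):
            weights = sizes + [alpha]   # last slot = open a new class
            k = random.choices(range(len(weights)), weights=weights)[0]
            if k == len(sizes):
                sizes.append(1)
            else:
                sizes[k] += 1
            assignments.append(k)
        return assignments, sizes

    _, sizes = crp(50, alpha=1.0)
    print(f"{len(sizes)} classes from 50 points; sizes: {sizes}")

The expected number of classes grows roughly as alpha × log(n), so model
complexity grows as new data warrant.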
High school lunch room analogy

Sampling from the CRP:
[Figure: students clustered into groups: "jocks", "nerds", "preppies",
"punks"]

Gibbs sampler (Neal): reassign each student in turn, trading off two
pressures: assign to larger groups, and group with similar objects.
[Figure: the groups after Gibbs sampling updates]
    A typical cognitive experiment
                    F1   F2   F3   F4   Label

Training stimuli:   1    1    1    1      1
                    1    0    1    0      1
                    0    1    0    1      1
                    0    0    0    0      0
                    0    1    0    0      0
                    1    0    1    1      0


    Test stimuli:   0    1    1    1     ?
                    1    1    0    1     ?
                    1    1    1    0     ?
                    1    0    0    0     ?
                    0    0    1    0     ?
                    0    0    0    1     ?
Anderson (1990), “Rational
model of categorization”:
  Greedy sequential search
  in an infinite mixture
  model.


Sanborn, Griffiths, Navarro
(2006), “More rational
model of categorization”:
  Particle filter with a
  small # of particles
Towards more natural concepts
CrossCat: Discovering multiple structures
 that capture different subsets of features
 (Shafto, Kemp, Mansinghka, Gordon & Tenenbaum, 2006)
              Infinite relational models
     (Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)

                                                         (c.f. Xu,
                                                         Tresp, et al.
                                                         SRL 06)
[Figure: the relational data as a concept × concept × predicate array,
sorted into the classes the model discovers]
Biomedical predicate data from UMLS (McCray et al.):
   – 134 concepts: enzyme, hormone, organ, disease, cell function ...
   – 49 predicates: affects(hormone, organ), complicates(enzyme, cell
     function), treats(drug, disease), diagnoses(procedure, disease) …
            Infinite relational models
(Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06)




e.g., Diseases affect Organisms; Chemicals interact with Chemicals;
Chemicals cause Diseases
Learning from very few examples

• Word learning

  [Images: novel objects; three labeled “tufa”]

• Property induction
 Cows have T9 hormones.          Cows have T9 hormones.
 Seals have T9 hormones.         Sheep have T9 hormones.
 Squirrels have T9 hormones.     Goats have T9 hormones.

 All mammals have T9 hormones.   All mammals have T9 hormones.
The computational problem
(c.f., semi-supervised learning)

[Matrix: 10 species (Horse, Cow, Chimp, Gorilla, Mouse, Squirrel, Dolphin,
Seal, Rhino, Elephant) × 85 observed features, plus a column of "?"s for a
new property]

(85 features from Osherson et al., e.g., for Elephant: 'gray', 'hairless',
'toughskin', 'big', 'bulbous', 'longleg', 'tail', 'chewteeth', 'tusks',
'smelly', 'walks', 'slow', 'strong', 'muscle', 'quadrapedal', …)
            Many sources of priors
Chimps have T9 hormones.                  Taxonomic
Gorillas have T9 hormones.                similarity


Poodles can bite through wire.            Jaw strength
Dobermans can bite through wire.



Salmon carry E. Spirus bacteria.
                                          Food web
Grizzly bears carry E. Spirus bacteria.   relations
Hierarchical Bayesian Framework
(Kemp & Tenenbaum)

P(form):             F: form (e.g., a tree)

P(structure | form): S: structure (e.g., a particular tree over mouse,
                     squirrel, chimp, gorilla)

P(data | structure): D: data, the species × feature matrix (F1, F2, F3,
                     F4, …), including a column for "Has T9 hormones" with
                     "?" entries to be filled in
 P(D|S): How the structure constrains the
           data of experience
• Define a stochastic process over structure S that
  generates hypotheses h.
   – For generic properties, prior should favor hypotheses that
     vary smoothly over structure.
   – Many properties of biological species were actually
     generated by such a process (i.e., mutation + selection).

   Smooth: P(h) high                      Not smooth: P(h) low
P(D|S): How the structure constrains the data of experience

Structure S
   ↓  Gaussian process (~ random walk, diffusion), with covariance
      defined over the graph of S (roughly, an inverse graph Laplacian)
      [Zhu, Ghahramani & Lafferty 2003]
   y  (a continuous function over the structure)
   ↓  Threshold
   h  (a binary property)
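A sketch of drawing one such property (my own toy structure: a 10-species
chain), using a regularized inverse graph Laplacian as the covariance, in
the spirit of Zhu, Ghahramani & Lafferty (2003):

    import numpy as np

    rng = np.random.default_rng(0)

    n = 10
    A = np.zeros((n, n))
    for i in range(n - 1):          # chain: species i -- species i+1
        A[i, i + 1] = A[i + 1, i] = 1

    L = np.diag(A.sum(axis=1)) - A              # graph Laplacian of S
    K = np.linalg.inv(L + 0.1 * np.eye(n))      # smoothness-favoring covariance

    y = rng.multivariate_normal(np.zeros(n), K) # latent continuous function
    h = (y > 0).astype(int)                     # threshold -> binary property
    print(h)   # neighboring species tend to share the property

Sampling h many times traces out the prior P(h): hypotheses that vary
smoothly over the structure come out far more probable than patchy ones.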
[Figure: Structure S (a tree over Species 1-10) generating Data D, a
species × feature matrix]

(85 features from Osherson et al., e.g., for Elephant: 'gray', 'hairless',
'toughskin', 'big', 'bulbous', 'longleg', 'tail', 'chewteeth', 'tusks',
'smelly', 'walks', 'slow', 'strong', 'muscle', 'quadrapedal', …)
[Figure: the same Structure S and Data D, now with a new-property column
of "?"s for Species 1-10]

(85 features from Osherson et al., as above)
Cows have property P.            Gorillas have property P.
Elephants have property P.       Mice have property P.
Horses have property P.          Seals have property P.
                                 All mammals have property P.

[Plots: model predictions vs. human judgments for such arguments, using a
tree-structured ("Tree") vs. a two-dimensional ("2D") representation of
the species]
       Reasoning about spatially
          varying properties
“Native American artifacts” task
Property type:
  "has T9 hormones"       "can bite through wire"   "carry E. Spirus bacteria"

Theory / structure:
  taxonomic tree           directed chain             directed network
  + diffusion process      + drift process            + noisy transmission

[Figures: for each theory, a structure over Classes A-G, and the
hypotheses (property extensions) each generates]
[Figure: the species Herring, Tuna, Mako shark, Sand shark, Dolphin,
Human, and Kelp, arranged both in a taxonomic structure and in a food-web
structure]
Hierarchical Bayesian Framework

F: form — Chain, Tree, or Space

S: structure — the species (mouse, squirrel, chimp, gorilla) arranged in
   the chosen form

D: data — the species × feature matrix (F1, F2, F3, F4, …)
Discovering structural forms

[Figures: structures over Ostrich, Robin, Crocodile, Snake, Turtle, Bat,
and Orangutan — e.g., a linear "great chain of being" vs. Linnaeus'
taxonomic tree]
People can discover structural forms
• Scientists
  – Tree structure for living kinds (Linnaeus)
  – Periodic structure for chemical elements
    (Mendeleev)

• Children
  –   Hierarchical structure of category labels
  –   Clique structure of social groups
  –   Cyclical structure of seasons or days of the week
  –   Transitive structure for value
Typical structure learning algorithms assume a fixed structural form

Flat clusters:    K-Means, mixture models, competitive learning
Line:             Guttman scaling, ideal point models
Circle:           circumplex models
Tree:             hierarchical clustering, Bayesian phylogenetics
Grid:             self-organizing map, generative topographic mapping
Euclidean space:  MDS, PCA, factor analysis
Hierarchical Bayesian Framework

F: form — the prior favors simplicity

S: structure — the prior favors smoothness [Zhu et al., 2003]
   (e.g., a tree over mouse, squirrel, chimp, gorilla)

D: data — the species × feature matrix (F1, F2, F3, F4, …)
Structural forms as graph grammars

[Figure: each form paired with the generative process (graph grammar)
that produces it]
Development of structural forms
  as more data are observed
Beyond “Nativism” versus “Empiricism”
 • “Nativism”: Explicit knowledge of structural
   forms for core domains is innate.
    – Atran (1998): The tendency to group living kinds into
      hierarchies reflects an “innately determined cognitive
      structure”.
    – Chomsky (1980): “The belief that various systems of mind are
      organized along quite different principles leads to the natural
      conclusion that these systems are intrinsically determined, not
      simply the result of common mechanisms of learning or
      growth.”
 • “Empiricism”: General-purpose learning systems
   without explicit knowledge of structural form.
    – Connectionist networks (e.g., Rogers and McClelland, 2004).
    – Traditional structure learning in probabilistic graphical models.
Summary: concept learning

Models based on Bayesian inference over hierarchies of structured
representations. [Diagram: F: form → S: structure (a tree over mouse,
squirrel, chimp, gorilla) → D: data (the species × feature matrix)]

– How does abstract domain knowledge guide learning of new concepts?
– How can this knowledge be represented, and how might it be learned?
– How can probabilistic inference work together with flexibly structured
  representations to model complex, real-world learning and reasoning?
Contributions of Bayesian models
• Principled quantitative models of human behavior,
  with broad coverage and a minimum of free
  parameters and ad hoc assumptions.
• Explain how and why human learning and
  reasoning works, in terms of (approximations to)
  optimal statistical inference in natural
  environments.
• A framework for studying people's implicit
  knowledge about the structure of the world: how it
  is structured, used, and acquired.
• A two-way bridge to state-of-the-art AI and
  machine learning.
                  Looking forward
• What we need to understand: the mind's ability to build
  rich models of the world from sparse data.
   –   Learning about objects, categories, and their properties.
   –   Causal inference
   –   Language comprehension and production
   –   Scene understanding
   –   Understanding other people's actions, plans, thoughts, goals
• What do we need to understand these abilities?
   –   Bayesian inference in probabilistic generative models
   –   Hierarchical models, with inference at all levels of abstraction
   –   Structured representations: graphs, grammars, logic
   –   Flexible representations, growing in response to observed data
Learning word meanings

Abstract principles:
   Whole-object principle
   Shape bias
   Taxonomic principle
   Contrast principle
   Basic-level bias

Structure

Data

(Tenenbaum & Xu)
         Causal learning and reasoning
Abstract
Principles



Structure




  Data


(Griffiths, Tenenbaum, Kemp et al.)
"Universal Grammar": hierarchical phrase structure grammars (e.g., CFG, HPSG, TAG)

Universal Grammar
      | P(grammar | UG)
Grammar — e.g.:
      S → NP VP
      NP → Det [Adj] Noun [RelClause]
      RelClause → [Rel] NP V
      VP → VP NP
      VP → Verb
      | P(phrase structure | grammar)
Phrase structure
      | P(utterance | phrase structure)
Utterance
      | P(speech | utterance)
Speech signal

(c.f. Chater and Manning, 2006)
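To illustrate the generative direction of this hierarchy, here is a minimal sketch of sampling an utterance top-down from a small PCFG. The rule probabilities and lexical entries are invented for illustration, and the left-recursive VP → VP NP rule from the slide is replaced with VP → Verb NP so that top-down sampling terminates.

import random

# Invented rule probabilities and lexicon, for illustration only.
pcfg = {
    "S":    [(["NP", "VP"], 1.0)],
    "NP":   [(["Det", "Noun"], 0.8), (["Det", "Adj", "Noun"], 0.2)],
    "VP":   [(["Verb", "NP"], 0.5), (["Verb"], 0.5)],
    "Det":  [(["the"], 1.0)],
    "Adj":  [(["small"], 1.0)],
    "Noun": [(["mouse"], 0.5), (["squirrel"], 0.5)],
    "Verb": [(["runs"], 0.5), (["sees"], 0.5)],
}

def sample(symbol):
    if symbol not in pcfg:              # terminal: an actual word
        return [symbol]
    expansions, weights = zip(*pcfg[symbol])
    rhs = random.choices(expansions, weights=weights)[0]
    words = []
    for s in rhs:
        words.extend(sample(s))
    return words

print(" ".join(sample("S")))            # e.g., "the small mouse sees the squirrel"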
Vision as probabilistic parsing

[Figure: an image parsed into a hierarchy of scene elements (Han & Zhu, 2006; c.f. Zhu, Yuanhao & Yuille NIPS 06)]
Goal-directed action (production and comprehension)

[Figure: model of action production and comprehension (Wolpert et al., 2003)]
   Open directions and challenges
• Effective methods for learning structured knowledge
   – How to balance expressiveness/learnability tradeoff?
• More precise relation to psychological processes
   – To what extent do mental processes implement boundedly
     rational methods of approximate inference?
• Relation to neural computation
   – How to implement structured representations in brains?
• Modeling individual subjects and single trials
   – Is there a rational basis for probability matching?
• Understanding failure cases
   – Are these simply “not Bayesian”, or are people using a
     different model? How do we avoid circularity?
Want to learn more?
• Special issue of Trends in Cognitive Sciences (TiCS), July 2006 (Vol. 10, no. 7), on "Probabilistic models of cognition".
• Tom Griffiths' reading list, a/k/a http://bayesiancognition.com
• Summer school on probabilistic models of cognition, July 2007, Institute for Pure and Applied Mathematics (IPAM) at UCLA.
Extra slides
Bayesian prediction

P(t_total | t_past) ∝ (1/t_total) × P(t_total)
(posterior probability = random-sampling likelihood × domain-dependent prior)

What is the best guess for t_total?
Compute t such that P(t_total > t | t_past) = 0.5.

[Figure: the posterior P(t_total | t_past) plotted over t_total]

We compared the median of the Bayesian posterior with the median of subjects' judgments… but what about the distribution of subjects' judgments?
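As a concrete illustration, here is a minimal numerical sketch (ours, not the authors' code) of this prediction rule on a discrete grid: it combines the 1/t_total sampling likelihood with a caller-supplied prior and reads off the posterior median. The power-law prior and the numbers are placeholders.

import numpy as np

def median_prediction(t_past, prior, grid):
    """Posterior median of t_total given t_past, on a discrete grid of durations."""
    likelihood = np.where(grid >= t_past, 1.0 / grid, 0.0)   # random-sampling likelihood
    posterior = likelihood * prior
    posterior /= posterior.sum()
    cdf = np.cumsum(posterior)
    return grid[np.searchsorted(cdf, 0.5)]   # t with P(t_total > t | t_past) = 0.5

grid = np.arange(1.0, 1001.0)
powerlaw_prior = grid ** -1.5                # placeholder domain-dependent prior
print(median_prediction(30.0, powerlaw_prior, grid))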
Sources of individual differences
• Individuals' judgments could be noisy.
• Individuals' judgments could be optimal, but with different priors.
   – e.g., each individual has seen only a sparse sample of the relevant population of events.
• Individuals' inferences about the posterior could be optimal, but their judgments could be based on probability (or utility) matching rather than maximizing (see the sketch below).
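A minimal sketch of the matching-versus-maximizing distinction, with an invented three-response posterior: a maximizer always reports the most probable response, while a probability matcher reports each response with probability equal to its posterior, so a population of matchers reproduces the posterior distribution.

import random

posterior = {"A": 0.7, "B": 0.2, "C": 0.1}   # invented posterior over responses

def maximize(post):
    # always report the single most probable response
    return max(post, key=post.get)

def probability_match(post):
    # report each response with probability equal to its posterior probability
    responses, probs = zip(*post.items())
    return random.choices(responses, weights=probs)[0]

choices = [probability_match(posterior) for _ in range(1000)]
# choices is ~70% "A", ~20% "B", ~10% "C": the distribution of judgments
# mirrors the posterior instead of collapsing onto the maximizing response.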
Individual differences in prediction

[Figure: proportion of judgments below the predicted value, plotted against the quantile of the Bayesian posterior distribution; inset shows the posterior P(t_total | t_past) over t_total]
Individual differences in prediction

[Figure: posterior P(t_total | t_past) over t_total; quantile plot averaged over all prediction tasks:]
• movie run times
• movie grosses
• poem lengths
• life spans
• terms in congress
• cake baking times
Individual differences in concept learning
        Why probability matching?
• Optimal behavior under some
  (evolutionarily natural) circumstances.
  –   Optimal betting theory, portfolio theory
  –   Optimal foraging theory
  –   Competitive games
  –   Dynamic tasks (changing probabilities or utilities)

• Side-effect of algorithms for approximating
  complex Bayesian computations.
  – Markov chain Monte Carlo (MCMC): instead of integrating over
    complex hypothesis spaces, construct a sample of high-probability
    hypotheses.
  – Judgments from individual (independent) samples can on average
    be almost as good as using the full posterior distribution.
Markov chain Monte Carlo

[Figure: a Markov chain exploring a hypothesis space (Metropolis-Hastings algorithm)]
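For reference, a minimal Metropolis-Hastings sketch (illustrative; the target density and step size are invented): it random-walks through a one-dimensional hypothesis space, accepting each move with probability min(1, target ratio), so the visited states form a sample of high-probability hypotheses.

import math, random

def target(x):
    # invented unnormalized posterior: a two-bump density over hypotheses
    return math.exp(-0.5 * (x - 1.0) ** 2) + 0.5 * math.exp(-0.5 * ((x + 2.0) / 0.5) ** 2)

def metropolis_hastings(n_samples, x0=0.0, step=1.0):
    samples, x = [], x0
    for _ in range(n_samples):
        proposal = x + random.gauss(0.0, step)       # symmetric random-walk proposal
        if random.random() < min(1.0, target(proposal) / target(x)):
            x = proposal                             # accept; otherwise stay put
        samples.append(x)
    return samples

samples = metropolis_hastings(10000)   # a histogram of samples approximates the target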
The puzzle of coincidences

Discoveries of hidden causal structure are often driven by noticing coincidences. . .
• Science
   – Halley's comet (1705) [Figure: Halley, 1705]
   – John Snow and the cause of cholera (1854)
Rational analysis of cognition

• Often can show that apparently irrational behavior is actually rational.

[Figure: four cards from the Wason selection task]
Which cards do you have to turn over to test this rule?
"If there is an A on one side, then there is a 2 on the other side"
Rational analysis of cognition

• Often can show that apparently irrational behavior is actually rational.
• Oaksford & Chater's rational analysis:
   – Optimal data selection based on maximizing expected information gain.
   – Test the rule "If p, then q" against the null hypothesis that p and q are independent.
   – Assuming p and q are rare predicts people's choices (see the information-gain sketch below).
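Here is a minimal sketch of that computation (our simplified reconstruction with invented marginals P(p) = 0.1 and P(q) = 0.2, not Oaksford & Chater's exact parameterization): two hypotheses, a dependence model in which the rule holds and an independence model, and the expected reduction in hypothesis entropy from turning each card. Under these rarity assumptions the p and q cards come out most informative, matching the cards people actually choose.

import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

# Invented marginals under a rarity assumption: P(p) = a, P(q) = b.
a, b = 0.1, 0.2
models = {  # joint distribution over (p-side face, q-side face) under each hypothesis
    "dep": {("p", "q"): a, ("p", "nq"): 0.0,
            ("np", "q"): b - a, ("np", "nq"): 1 - b},
    "ind": {("p", "q"): a * b, ("p", "nq"): a * (1 - b),
            ("np", "q"): (1 - a) * b, ("np", "nq"): (1 - a) * (1 - b)},
}
prior = {"dep": 0.5, "ind": 0.5}

def expected_info_gain(visible):
    """Expected drop in hypothesis entropy from turning the card showing `visible`."""
    h_prior = entropy(prior.values())
    hidden_faces = ("q", "nq") if visible in ("p", "np") else ("p", "np")
    gain = 0.0
    for hidden in hidden_faces:
        def p_hidden(m):  # P(hidden | visible, model m)
            key = (visible, hidden) if visible in ("p", "np") else (hidden, visible)
            marg = sum(v for k, v in models[m].items() if visible in k)
            return models[m][key] / marg if marg > 0 else 0.0
        p_outcome = sum(prior[m] * p_hidden(m) for m in prior)
        if p_outcome == 0:
            continue
        post = [prior[m] * p_hidden(m) / p_outcome for m in prior]
        gain += p_outcome * (h_prior - entropy(post))
    return gain

for card in ("p", "np", "q", "nq"):
    print(card, round(expected_info_gain(card), 3))   # p and q score highest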
Integrating multiple forms of reasoning
(Kemp, Shafto, Berke & Tenenbaum NIPS 06)

[Figure: 1) taxonomic relations between categories; 2) causal relations between features; parameters of causal relations vary smoothly over the category hierarchy]

T9 hormones cause elevated heart rates.
Elevated heart rates cause faster metabolisms.
Mice have T9 hormones.
…?
Infinite relational models
(Kemp, Tenenbaum, Griffiths, Yamada & Ueda, AAAI 06; c.f. Xu, Tresp, et al. SRL 06)

[Figure: concept × predicate × concept array, co-clustered by the model]

Biomedical predicate data from UMLS (McCrae et al.; see the generative sketch below):
– 134 concepts: enzyme, hormone, organ, disease, cell function ...
– 49 predicates: affects(hormone, organ), complicates(enzyme, cell function), treats(drug, disease), diagnoses(procedure, disease) …
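A minimal generative sketch of the model's core idea (our illustration; names like eta are ours, not from the paper): concepts are partitioned by a Chinese restaurant process, and each predicate links a pair of clusters with its own Beta-distributed probability. Inference inverts this process to recover the clusters from observed relations.

import random

def crp_partition(n_items, alpha=1.0):
    """Partition items into clusters via the Chinese restaurant process."""
    assignments, counts = [], []
    for _ in range(n_items):
        weights = counts + [alpha]                # existing clusters + a new one
        r = random.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if k == len(counts):
            counts.append(1)                      # open a new cluster
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

def sample_relation(n_concepts, alpha=1.0, a=0.5, b=0.5):
    z = crp_partition(n_concepts, alpha)
    n_clusters = max(z) + 1
    # eta[k][l]: probability that the predicate links cluster k to cluster l
    eta = [[random.betavariate(a, b) for _ in range(n_clusters)]
           for _ in range(n_clusters)]
    return [[int(random.random() < eta[z[i]][z[j]]) for j in range(n_concepts)]
            for i in range(n_concepts)]

R = sample_relation(10)   # one sampled binary relation over 10 concepts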
Learning relational theories

[Figure: relational theory recovered by the model]
e.g., Diseases affect Organisms; Chemicals interact with Chemicals; Chemicals cause Diseases
  Learning annotated hierarchies
       from relational data
(Roy, Kemp, Mansinghka, Tenenbaum NIPS 06)
Learning abstract relational structures

Dominance hierarchy: primate troop ("x beats y")
Tree: Bush administration ("x told y")
Cliques: prison inmates ("x likes y")
Ring: Kula islands ("x trades with y")
Bayesian inference in neural networks

[Figure: (Rao, in press)]
 The big problem of intelligence
• The development of intuitive theories in
  childhood.
– Psychology: How do we learn to understand others' actions in terms of beliefs, desires, plans, intentions, values, morals?
  – Biology: How do we learn that people, dogs,
    bees, worms, trees, flowers, grass, coral, moss
    are alive, but chairs, cars, tricycles, computers,
    the sun, Roomba, robots, clocks, rocks are not?
 The big problem of intelligence
• Common sense reasoning.
Consider a man named Boris.
   – Is the mother of Boris's father his grandmother?
   – Is the mother of Boris's sister his mother?
   – Is the son of Boris's sister his son?
(Note: Boris and his family were stranded on a desert island when he was a young boy.)