        Bayesian models of inductive learning

     Josh Tenenbaum & Tom Griffiths
                   MIT
 Computational Cognitive Science Group
Department of Brain and Cognitive Sciences
  Computer Science and AI Lab (CSAIL)
              What to expect
• What you’ll get out of this tutorial:
  – Our view of what Bayesian models have to offer
    cognitive science.
  – In-depth examples of basic and advanced models:
    how the math works & what it buys you.
  – Some comparison to other approaches.
  – Opportunities to ask questions.
• What you won’t get:
  – Detailed, hands-on how-to.
  – Where you can learn more:
          http://bayesiancognition.com
                     Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
           Bayesian models in
            cognitive science
•   Vision
•   Motor control
•   Memory
•   Language
•   Inductive learning and reasoning….
     Everyday inductive leaps
• Learning concepts and words from examples

                          “horse”

                                “horse”



              “horse”
Learning concepts and words
            “tufa”

                                  “tufa”

                                  “tufa”




            Can you pick out the tufas?
              Inductive reasoning
Input:
 Cows can get Hick’s disease.
                                       (premises)
 Gorillas can get Hick’s disease.

 All mammals can get Hick’s disease.   (conclusion)


Task: Judge how likely conclusion is to be
      true, given that premises are true.
           Inferring causal relations
Input:
                Took vitamin B23       Headache
   Day 1        yes                    no
   Day 2        yes                    yes
   Day 3        no                     yes
   Day 4        yes                    no
    ...         ...                    ...
   Does vitamin B23 cause headaches?

Task: Judge probability of a causal link
      given several joint observations.
       Everyday inductive leaps
How can we learn so much about . . .
  –   Properties of natural kinds
  –   Meanings of words
  –   Future outcomes of a dynamic process
  –   Hidden causal properties of an object
  –   Causes of a person’s action (beliefs, goals)
  –   Causal laws governing a domain

. . . from such limited data?
              The Challenge
• How do we generalize successfully from very
  limited data?
  – Just one or a few examples
  – Often only positive examples
• Philosophy:
  – Induction is a “problem”, a “riddle”, a “paradox”,
    a “scandal”, or a “myth”.
• Machine learning and statistics:
  – Focus on generalization from many examples,
    both positive and negative.
Rational statistical inference
     (Bayes, Laplace)

Posterior          Likelihood      Prior
probability                        probability

     p(h | d) = p(d | h) p(h) / Σ_{h′∈H} p(d | h′) p(h′)

                           (the denominator sums over the space of hypotheses)
   Bayesian models of inductive
   learning: some recent history
• Shepard (1987)
  – Analysis of one-shot stimulus generalization, to explain
    the universal exponential law.
• Anderson (1990)
  – Models of categorization and causal induction.
• Oaksford & Chater (1994)
  – Model of conditional reasoning (Wason selection task).
• Heit (1998)
  – Framework for category-based inductive reasoning.
  Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
            p(h | d) = p(d | h) p(h) / Σ_{h′∈H} p(d | h′) p(h′)

• Learners’ domain theories generate their
  hypothesis space H and prior p(h).
  – Well-matched to structure of the natural world.
  – Learnable from limited data.
  – Computationally tractable inference.
            What is a theory?
• Working definition
  – An ontology and a system of abstract principles
    that generates a hypothesis space of candidate
    world structures along with their relative
    probabilities.

• Analogy to grammar in language.
• Example: Newton’s laws
           Structure and statistics
• A framework for understanding how structured
  knowledge and statistical inference interact.
   – How structured knowledge guides statistical inference, and is
     itself acquired through higher-order statistical learning.


   – How simplicity trades off with fit to the data in evaluating
     structural hypotheses.


   – How increasingly complex structures may grow as required
     by new data, rather than being pre-specified in advance.
           Structure and statistics
• A framework for understanding how structured
  knowledge and statistical inference interact.
   – How structured knowledge guides statistical inference, and is
     itself acquired through higher-order statistical learning.
                                     Hierarchical Bayes.
   – How simplicity trades off with fit to the data in evaluating
     structural hypotheses.
                                     Bayesian Occam’s Razor.
   – How increasingly complex structures may grow as required
     by new data, rather than being pre-specified in advance.
                                    Non-parametric Bayes.
       Alternative approaches to
       inductive generalization
•   Associative learning
•   Connectionist networks
•   Similarity to examples
•   Toolkit of simple heuristics
•   Constraint satisfaction
•   Analogical mapping
Marr’s Three Levels of Analysis
• Computation:
   “What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?”

• Representation and algorithm:
    Cognitive psychology

• Implementation:
    Neurobiology
                    Why Bayes?
• A framework for explaining cognition.
   – How people can learn so much from such limited data.
   – Why process-level models work the way that they do.
   – Strong quantitative models with minimal ad hoc assumptions.

• A framework for understanding how structured
  knowledge and statistical inference interact.
   – How structured knowledge guides statistical inference, and is
     itself acquired through higher-order statistical learning.
   – How simplicity trades off with fit to the data in evaluating
     structural hypotheses (Occam’s razor).
   – How increasingly complex structures may grow as required
     by new data, rather than being pre-specified in advance.
                     Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Coin flipping
         Coin flipping


         HHTHT
         HHHHH
What process produced these sequences?
               Bayes’ rule
For data D and a hypothesis H, we have:

            P(H | D) = P(H) P(D | H) / P(D)

• “Posterior probability”: P( H | D)
• “Prior probability”: P(H )
• “Likelihood”: P( D | H )
     The origin of Bayes’ rule
• A simple consequence of using probability
  to represent degrees of belief
• For any two random variables:
            p(A & B) = p(A) p(B | A)
            p(A & B) = p(B) p(A | B)

        so  p(B) p(A | B) = p(A) p(B | A)

        and therefore  p(A | B) = p(A) p(B | A) / p(B)
 Why represent degrees of belief
      with probabilities?
• Good statistics
  – consistency, and worst-case error bounds.
• Cox Axioms
  – necessary to cohere with common sense
• “Dutch Book” + Survival of the Fittest
  – if your beliefs do not accord with the laws of
    probability, then you can always be out-gambled by
    someone whose beliefs do so accord.
• Provides a theory of learning
  – a common currency for combining prior knowledge and
    the lessons of experience.
               Bayes’ rule
For data D and a hypothesis H, we have:

            P(H | D) = P(H) P(D | H) / P(D)

• “Posterior probability”: P( H | D)
• “Prior probability”: P(H )
• “Likelihood”: P( D | H )
Hypotheses in Bayesian inference
• Hypotheses H refer to processes that could
  have generated the data D
• Bayesian inference provides a distribution
  over these hypotheses, given D
• P(D|H) is the probability of D being
  generated by the process identified by H
• Hypotheses H are mutually exclusive: only
  one process could have generated D
      Hypotheses in coin flipping
Describe processes by which D could be generated
                 D = HHTHT
  • Fair coin, P(H) = 0.5
  • Coin with P(H) = p       statistical
                              models
  • Markov model
  • Hidden Markov model
  • ...
      Hypotheses in coin flipping
Describe processes by which D could be generated
                 D = HHTHT
  • Fair coin, P(H) = 0.5
  • Coin with P(H) = p       generative
                              models
  • Markov model
  • Hidden Markov model
  • ...
  Representing generative models
• Graphical model notation
  – Pearl (1988), Jordan (1998)   d1     d2   d3    d4
• Variables are nodes, edges      Fair coin, P(H) = 0.5
  indicate dependency
• Directed edges show causal
                                  d1    d2    d3    d4
  process of data generation
                                       Markov model

        HHTHT
         d1 d2 d3 d4 d5
      Models with latent structure
                                             p
• Not all nodes in a graphical
  model need to be observed
• Some variables reflect latent   d1   d2        d3   d4
  structure, used in generating        P(H) = p
  D but unobserved
                                  s1    s2       s3   s4


        HHTHT                     d1   d2        d3   d4
         d1 d2 d3 d4 d5           Hidden Markov model
               Coin flipping
• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
  – P(H) = p
• Psychology: Representativeness
Comparing two simple hypotheses
• Contrast simple hypotheses:
  – H1: “fair coin”, P(H) = 0.5
  – H2:“always heads”, P(H) = 1.0
• Bayes’ rule:
            P(H | D) = P(H) P(D | H) / P(D)
• With two hypotheses, use odds form
           Bayes’ rule in odds form
           P(H1|D)        P(D|H1)          P(H1)
                     =            x
           P(H2|D)        P(D|H2)          P(H2)

D:           data
H1, H2:      models
P(H1|D):     posterior probability H1 generated the data
P(D|H1):     likelihood of data under model H1
P(H1):       prior probability H1 generated the data
         Coin flipping


         HHTHT
         HHHHH
What process produced these sequences?
 Comparing two simple hypotheses
        P(H1|D)         P(D|H1)         P(H1)
                   =            x
        P(H2|D)         P(D|H2)         P(H2)

D:         HHTHT
H1, H2:   “fair coin”, “always heads”
P(D|H1) = 1/2^5          P(H1) =     999/1000
P(D|H2) = 0             P(H2) =     1/1000

             P(H1|D) / P(H2|D) = infinity
 Comparing two simple hypotheses
        P(H1|D)         P(D|H1)          P(H1)
                   =            x
        P(H2|D)         P(D|H2)          P(H2)

D:         HHHHH
H1, H2:   “fair coin”, “always heads”
P(D|H1) = 1/2^5          P(H1) =     999/1000
P(D|H2) = 1             P(H2) =     1/1000

                P(H1|D) / P(H2|D) ≈ 30
 Comparing two simple hypotheses
        P(H1|D)         P(D|H1)         P(H1)
                   =            x
        P(H2|D)         P(D|H2)         P(H2)

D:         HHHHHHHHHH
H1, H2:   “fair coin”, “always heads”
P(D|H1) = 1/2^10         P(H1) =     999/1000
P(D|H2) = 1             P(H2) =     1/1000

                P(H1|D) / P(H2|D) ≈ 1
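A minimal Python sketch (ours, not part of the tutorial) that reproduces the three calculations above, using the same priors, P(fair) = 999/1000 and P(always heads) = 1/1000:

def posterior_odds(sequence, prior_fair=0.999, prior_trick=0.001):
    # Odds form of Bayes' rule: likelihood ratio times prior odds.
    n = len(sequence)
    lik_fair = 0.5 ** n                              # each flip has probability 1/2
    lik_trick = 1.0 if all(c == 'H' for c in sequence) else 0.0
    if lik_trick == 0.0:
        return float('inf')                          # any tail rules out "always heads"
    return (lik_fair / lik_trick) * (prior_fair / prior_trick)

print(posterior_odds('HHTHT'))      # inf: "always heads" is ruled out
print(posterior_odds('HHHHH'))      # ~31: prior still favors the fair coin
print(posterior_odds('H' * 10))     # ~0.98: prior nearly overwhelmed by data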
Comparing two simple hypotheses

• Bayes’ rule tells us how to combine prior
  beliefs with new data
  – top-down and bottom-up influences
• As a model of human inference
  – predicts conclusions drawn from data
  – identifies point at which prior beliefs are
    overwhelmed by new experiences
• But… more complex cases?
               Coin flipping
• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
  – P(H) = p
• Psychology: Representativeness
Comparing simple and complex hypotheses
                                           p


   d1   d2    d3     d4    vs.   d1   d2       d3   d4
   Fair coin, P(H) = 0.5              P(H) = p



 • Which provides a better account of the data:
   the simple hypothesis of a fair coin, or the
   complex hypothesis that P(H) = p?
Comparing simple and complex hypotheses

 • P(H) = p is more complex than P(H) = 0.5 in
   two ways:
    – P(H) = 0.5 is a special case of P(H) = p
    – for any observed sequence X, we can choose p
      such that X is more probable than if P(H) = 0.5
     Comparing simple and complex hypotheses
[Plot: probability of the observed sequence as a function of p]

     Comparing simple and complex hypotheses
[Plot: P(HHHHH | p) peaks at p = 1.0]

     Comparing simple and complex hypotheses
[Plot: P(HHTHT | p) peaks at p = 0.6]
Comparing simple and complex hypotheses
 • P(H) = p is more complex than P(H) = 0.5 in
   two ways:
    – P(H) = 0.5 is a special case of P(H) = p
    – for any observed sequence X, we can choose p
      such that X is more probable than if P(H) = 0.5
 • How can we deal with this?
    – frequentist: hypothesis testing
    – information theorist: minimum description length
    – Bayesian: just use probability theory!
Comparing simple and complex hypotheses

       P(H1|D)      P(D|H1)      P(H1)
                  =            x
       P(H2|D)      P(D|H2)      P(H2)

Computing P(D|H1) is easy:
       P(D|H1) = 1/2^N
Compute P(D|H2) by averaging over p:
     Comparing simple and complex hypotheses
[Plot: P(D | H2) — the distribution is an average over all values of p]
Comparing simple and complex hypotheses

 • Simple and complex hypotheses can be
   compared directly using Bayes’ rule
    – requires summing over latent variables
 • Complex hypotheses are penalized for their
   greater flexibility: “Bayesian Occam’s razor”
 • This principle is used in model selection
   methods in psychology (e.g. Myung & Pitt, 1997)
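To see the Bayesian Occam's razor numerically, the sketch below (our own illustration, assuming a uniform prior over p) compares P(D|H1) = 1/2^N with P(D|H2) obtained by averaging the likelihood p^NH (1-p)^NT over all values of p:

import numpy as np

def marginal_likelihood_complex(n_heads, n_tails, grid=10001):
    # P(D | H2): average p^NH (1-p)^NT over a uniform prior on p.
    p = np.linspace(0.0, 1.0, grid)
    return np.trapz(p ** n_heads * (1 - p) ** n_tails, p)

def likelihood_fair(n_heads, n_tails):
    # P(D | H1): every sequence of N flips has probability 1/2^N.
    return 0.5 ** (n_heads + n_tails)

for n_heads, n_tails in [(3, 2), (5, 0), (10, 0)]:
    ratio = likelihood_fair(n_heads, n_tails) / marginal_likelihood_complex(n_heads, n_tails)
    print(n_heads, n_tails, round(ratio, 3))
# HHTHT (3H, 2T): ratio ~1.9, the fair coin wins despite its fixed p.
# HHHHH (5H, 0T): ratio ~0.19, the flexible hypothesis starts to win.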
               Coin flipping
• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
  – P(H) = p
• Psychology: Representativeness
Comparing infinitely many hypotheses
 • Assume data are generated from a model:
                            p


                 d1    d2       d3   d4
                       P(H) = p

 • What is the value of p?
   – each value of p is a hypothesis H
   – requires inference over infinitely many hypotheses
Comparing infinitely many hypotheses
 • Flip a coin 10 times and see 5 heads, 5 tails.
 • P(H) on next flip? 50%
 • Why? 50% = 5 / (5+5) = 5/10.
 • “Future will be like the past.”

 • Suppose we had seen 4 heads and 6 tails.
 • P(H) on next flip? Closer to 50% than to 40%.
 • Why? Prior knowledge.
Integrating prior knowledge and data
               P(H | D) = P(H) P(D | H) / P(D)

               P(p | D) ∝ P(D | p) P(p)

• Posterior distribution P(p | D) is a probability
  density over p = P(H)
• Need to work out likelihood P(D | p) and
  specify prior distribution P(p)
           Likelihood and prior
• Likelihood:
           P(D | p) = p^NH (1-p)^NT
  – NH: number of heads
  – NT: number of tails
• Prior:
             P(p) ∝ p^(FH-1) (1-p)^(FT-1)
                          ?
A simple method of specifying priors

 • Imagine some fictitious trials, reflecting a
   set of previous experiences
    – strategy often used with neural networks
 • e.g., F ={1000 heads, 1000 tails} ~ strong
   expectation that any new coin will be fair

 • In fact, this is a sensible statistical idea...
           Likelihood and prior
• Likelihood:
           P(D | p) = p^NH (1-p)^NT
  – NH: number of heads
  – NT: number of tails
• Prior:
             P(p) ∝ p^(FH-1) (1-p)^(FT-1)      [Beta(FH, FT)]
  – FH: fictitious observations of heads
  – FT: fictitious observations of tails
             Conjugate priors
• Exist for many standard distributions
  – formula for exponential family conjugacy
• Define prior in terms of fictitious observations
• Beta is conjugate to Bernoulli (coin-flipping)

                                FH = FT = 1
                                FH = FT = 3
                                FH = FT = 1000
           Likelihood and prior
• Likelihood:
           P(D | p) = p^NH (1-p)^NT
  – NH: number of heads
  – NT: number of tails
• Prior:
             P(p) ∝ p^(FH-1) (1-p)^(FT-1)
  – FH: fictitious observations of heads
  – FT: fictitious observations of tails
Comparing infinitely many hypotheses

    P(p | D) ∝ P(D | p) P(p) ∝ p^(NH+FH-1) (1-p)^(NT+FT-1)

  • Posterior is Beta(NH+FH, NT+FT)
    – same form as conjugate prior
  • Posterior mean: (NH+FH) / (NH+FH+NT+FT)

  • Posterior predictive distribution: P(heads on next flip | D)
    equals the posterior mean
            Some examples
• e.g., F ={1000 heads, 1000 tails} ~ strong
  expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
  flip = 1004 / (1004+1006) = 49.95%
• e.g., F ={3 heads, 3 tails} ~ weak
  expectation that any new coin will be fair
• After seeing 4 heads, 6 tails, P(H) on next
  flip = 7 / (7+9) = 43.75%
                      Prior knowledge too weak
     But… flipping thumbtacks
• e.g., F ={4 heads, 3 tails} ~ weak expectation
  that tacks are slightly biased towards heads
• After seeing 2 heads, 0 tails, P(H) on next flip
  = 6 / (6+3) = 67%

• Some prior knowledge is always necessary to
  avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 2 heads, 0 tails,
  P(H) on next flip = 2 / (2+0) = 100%
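The predictive probabilities on the last two slides follow from a single line of arithmetic; here is a minimal sketch of that rule (the posterior mean of the Beta), with the slides' numbers as test cases:

def predictive_prob_heads(n_heads, n_tails, f_heads, f_tails):
    # Posterior is Beta(NH + FH, NT + FT); P(heads on the next flip) is its mean.
    return (n_heads + f_heads) / (n_heads + f_heads + n_tails + f_tails)

print(predictive_prob_heads(4, 6, 1000, 1000))   # 0.4995  strong fair-coin prior
print(predictive_prob_heads(4, 6, 3, 3))         # 0.4375  weak fair-coin prior
print(predictive_prob_heads(2, 0, 4, 3))         # 0.667   thumbtack prior
print(predictive_prob_heads(2, 0, 0, 0))         # 1.0     no prior: hasty conclusion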
     Origin of prior knowledge
• Tempting answer: prior experience
• Suppose you have previously seen 2000
  coin flips: 1000 heads, 1000 tails

• By assuming all coins (and flips) are alike,
  these observations of other coins are as
  good as observations of the present coin
 Problems with simple empiricism
• Haven’t really seen 2000 coin flips, or any flips of a
  thumbtack
   – Prior knowledge is stronger than raw experience justifies
• Haven’t seen exactly equal number of heads and tails
   – Prior knowledge is smoother than raw experience justifies
• Should be a difference between observing 2000 flips
  of a single coin versus observing 10 flips each for 200
  coins, or 1 flip each for 2000 coins
   – Prior knowledge is more structured than raw experience
              A simple theory
• “Coins are manufactured by a standardized
  procedure that is effective but not perfect.”
  – Justifies generalizing from previous coins to the
    present coin.
  – Justifies smoother and stronger prior than raw
    experience alone.
  – Explains why seeing 10 flips each for 200 coins is
    more valuable than seeing 2000 flips of one coin.
• “Tacks are asymmetric, and manufactured to
  less exacting standards.”
               Limitations
• Can all domain knowledge be represented
  so simply, in terms of an equivalent number
  of fictional observations?
• Suppose you flip a coin 25 times and get all
  heads.         Something funny is going on…
• But with F ={1000 heads, 1000 tails}, P(H)
  on next flip = 1025 / (1025+1000) = 50.6%.
                    Looks like nothing unusual
                Hierarchical priors

• Higher-order hypothesis: is this
  coin fair or unfair?
                                                  fair
• Example probabilities:
   – P(fair) = 0.99
                                                   p
   – P(p|fair) is Beta(1000,1000)
   – P(p|unfair) is Beta(1,1)
                                        d1   d2          d3   d4
• 25 heads in a row propagates up,
  affecting p and then P(fair|D)
    P(fair | 25 heads) / P(unfair | 25 heads)
      = [P(25 heads | fair) P(fair)] / [P(25 heads | unfair) P(unfair)]  ≈  9 × 10^-5
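A short sketch of this hierarchical calculation, assuming the example probabilities on the slide (P(fair) = 0.99, p ~ Beta(1000,1000) if fair, p ~ Beta(1,1) if unfair) and the standard Beta-Bernoulli marginal likelihood:

from math import lgamma, exp

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_lik_heads(n, f_heads, f_tails):
    # P(n heads in a row | p ~ Beta(FH, FT)), integrating p out analytically.
    return exp(log_beta(f_heads + n, f_tails) - log_beta(f_heads, f_tails))

p_fair, p_unfair = 0.99, 0.01
lik_fair = marginal_lik_heads(25, 1000, 1000)    # p tightly concentrated near 0.5
lik_unfair = marginal_lik_heads(25, 1, 1)        # p uniform on [0, 1]
print((lik_fair * p_fair) / (lik_unfair * p_unfair))   # ~9e-5, as on the slide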
                  More hierarchical priors

 • Latent structure can capture coin variability
                                                      p ~ Beta(FH,FT)
                                     FH,FT

Coin 1        p             Coin 2     p        ...             p Coin 200


d1       d2       d3   d4   d1   d2        d3   d4    d1   d2     d3    d4


 • 10 flips from 200 coins is better than 2000 flips
   from a single coin: allows estimation of FH, FT
          Yet more hierarchical priors
                        physical knowledge


                             FH,FT

          p                       p                       p


d1   d2       d3   d4   d1   d2       d3   d4   d1   d2       d3   d4

• Discrete beliefs (e.g. symmetry) can influence
  estimation of continuous properties (e.g. FH, FT)
Comparing infinitely many hypotheses
 • Apply Bayes’ rule to obtain posterior
   probability density
 • Requires prior over all hypotheses
   – computation simplified by conjugate priors
   – richer structure with hierarchical priors
 • Hierarchical priors indicate how simple
   theories can inform statistical inferences
   – one step towards structure and statistics
               Coin flipping
• Comparing two simple hypotheses
  – P(H) = 0.5 vs. P(H) = 1.0
• Comparing simple and complex hypotheses
  – P(H) = 0.5 vs. P(H) = p
• Comparing infinitely many hypotheses
  – P(H) = p
• Psychology: Representativeness
 Psychology: Representativeness
Which sequence is more likely from a fair coin?



     HHTHT                  more representative
                               of a fair coin
                        (Kahneman & Tversky, 1972)


     HHHHH
 What might representativeness mean?

    Evidence for a random generating process
       P(H1|D)       P(D|H1)              P(H1)
                   =                    x
       P(H2|D)       P(D|H2)              P(H2)
                     likelihood ratio

H1: random process (fair coin)
H2: alternative processes
   A constrained hypothesis space
Four hypotheses:

 h1       fair coin             HHTHTTTH
 h2       “always alternates”   HTHTHTHT
 h3       “mostly heads”        HHTHTHHH
 h4       “always heads”        HHHHHHHH
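For illustration, here is one way (a rough sketch, not the exact model of Tenenbaum & Griffiths, 2001) to turn this hypothesis space into a representativeness score: the log likelihood ratio of the fair-coin process against the pooled alternatives, using the parameter values quoted on the Results slide below and an equal weighting of the alternatives that we assume for simplicity:

import numpy as np

def lik_bernoulli(seq, p_heads):
    return np.prod([p_heads if c == 'H' else 1 - p_heads for c in seq])

def lik_alternator(seq, p_alt=0.99):
    # First flip is 50/50; afterwards the coin alternates with probability p_alt.
    lik = 0.5
    for prev, cur in zip(seq, seq[1:]):
        lik *= p_alt if cur != prev else 1 - p_alt
    return lik

def representativeness(seq):
    # Log odds that the random process (fair coin), rather than one of the
    # alternative processes, generated seq.
    fair = lik_bernoulli(seq, 0.5)
    alternatives = np.mean([lik_alternator(seq),          # "always alternates"
                            lik_bernoulli(seq, 0.85),     # "mostly heads"
                            lik_bernoulli(seq, 0.99)])    # "always heads"
    return np.log(fair / alternatives)

print(representativeness('HHTHTTTH'))   # positive: looks like a fair coin
print(representativeness('HHHHHHHH'))   # negative: better explained by alternatives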
Representativeness judgments
                  Results
• Good account of representativeness data,
  with three pseudo-free parameters, r = 0.91
  – “always alternates” means 99% of the time
  – “mostly heads” means P(H) = 0.85
  – “always heads” means P(H) = 0.99


• With scaling parameter, r = 0.95
                           (Tenenbaum & Griffiths, 2001)
       The role of theories
The fact that HHTHT looks representative of
a fair coin and HHHHH does not reflects our
implicit theories of how the world works.
– Easy to imagine how a trick all-heads coin
  could work: high prior probability.
– Hard to imagine how a trick “HHTHT” coin
  could work: low prior probability.
                  Summary
• Three kinds of Bayesian inference
  – comparing two simple hypotheses
  – comparing simple and complex hypotheses
  – comparing an infinite number of hypotheses
• Critical notions:
  – generative models, graphical models
  – Bayesian Occam’s razor
  – priors: conjugate, hierarchical (theories)
                     Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Rules and similarity
          Structure versus statistics
Rules                               Statistics
Logic                               Similarity
Symbols                             Typicality
A better metaphor
Structure and statistics

Statistics
Similarity
Typicality



       Rules
       Logic
       Symbols
       Structure and statistics
• Basic case study #1: Flipping coins
  – Learning and reasoning with structured
    statistical models.
• Basic case study #2: Rules and similarity
  – Statistical learning with structured
    representations.
        The number game




• Program input: number between 1 and 100
• Program output: “yes” or “no”
         The number game




• Learning task:
  – Observe one or more positive (“yes”) examples.
  – Judge whether other numbers are “yes” or “no”.
                The number game
Examples of      Generalization
“yes” numbers    judgments (N = 20)

60
                                      Diffuse similarity
                The number game
Examples of      Generalization
“yes” numbers    judgments (N = 20)

60
                                      Diffuse similarity

60 80 10 30                           Rule:
                                      “multiples of 10”
                The number game
Examples of      Generalization
“yes” numbers    judgments (N = 20)

60
                                      Diffuse similarity

60 80 10 30                           Rule:
                                      “multiples of 10”

60 52 57 55                           Focused similarity:
                                       numbers near 50-60
                The number game
Examples of      Generalization
“yes” numbers    judgments (N = 20)

16
                                      Diffuse similarity

16 8 2 64                             Rule:
                                      “powers of 2”

16 23 19 20                           Focused similarity:
                                       numbers near 20
              The number game
60
                                        Diffuse similarity

60 80 10 30                             Rule:
                                        “multiples of 10”

60 52 57 55                             Focused similarity:
                                         numbers near 50-60

     Main phenomena to explain:
       – Generalization can appear either similarity-
         based (graded) or rule-based (all-or-none).
       – Learning from just a few positive examples.
  Rule/similarity hybrid models
• Category learning
  – Nosofsky, Palmeri et al.: RULEX
  – Erickson & Kruschke: ATRIUM
     Divisions into “rule” and
     “similarity” subsystems
• Category learning
  – Nosofsky, Palmeri et al.: RULEX
  – Erickson & Kruschke: ATRIUM
• Language processing
  – Pinker, Marcus et al.: Past tense morphology
• Reasoning
  – Sloman
  – Rips
  – Nisbett, Smith et al.
  Rule/similarity hybrid models




• Why two modules?
• Why do these modules work the way that they do,
  and interact as they do?
• How do people infer a rule or similarity metric
  from just a few positive examples?
                      Bayesian model
• H: Hypothesis space of possible concepts:
   –   h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”)
   –   h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”)
   –   h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”)
   –   h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”)
   –   ...


Representational interpretations for H:
   – Candidate rules
   – Features for similarity
   – “Consequential subsets” (Shepard, 1987)
        Inferring hypotheses from
            similarity judgment
Additive clustering (Shepard & Arabie, 1977):
                 s_ij = Σ_k w_k f_ik f_jk

     s_ij : similarity of stimuli i, j
     w_k  : weight of cluster k
     f_ik : membership of stimulus i in cluster k
            (1 if stimulus i in cluster k, 0 otherwise)

Equivalent to similarity as a weighted sum of
 common features (Tversky, 1977).
Additive clustering for the integers 0-9:

     s_ij = Σ_k w_k f_ik f_jk

    Rank   Weight   Stimuli in cluster   Interpretation
    1      .444     2 4 8                powers of two
    2      .345     0 1 2                small numbers
    3      .331     3 6 9                multiples of three
    4      .291     6 7 8 9              large numbers
    5      .255     2 3 4 5 6            middle numbers
    6      .216     1 3 5 7 9            odd numbers
    7      .214     1 2 3 4              smallish numbers
    8      .172     4 5 6 7 8            largish numbers
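A tiny sketch of how these recovered clusters act as features: the similarity of two numbers is just the summed weight of the clusters they share (weights and memberships taken from the table above):

clusters = [
    (0.444, {2, 4, 8}),           # powers of two
    (0.345, {0, 1, 2}),           # small numbers
    (0.331, {3, 6, 9}),           # multiples of three
    (0.291, {6, 7, 8, 9}),        # large numbers
    (0.255, {2, 3, 4, 5, 6}),     # middle numbers
    (0.216, {1, 3, 5, 7, 9}),     # odd numbers
    (0.214, {1, 2, 3, 4}),        # smallish numbers
    (0.172, {4, 5, 6, 7, 8}),     # largish numbers
]

def similarity(i, j):
    return sum(w for w, members in clusters if i in members and j in members)

print(similarity(2, 4))   # 0.913: share "powers of two", "middle", "smallish"
print(similarity(2, 9))   # 0.0:   share no cluster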
 Three hypothesis subspaces for
        number concepts
• Mathematical properties (24 hypotheses):
  – Odd, even, square, cube, prime numbers
  – Multiples of small integers
  – Powers of small integers
• Raw magnitude (5050 hypotheses):
  – All intervals of integers with endpoints between
    1 and 100.
• Approximate magnitude (10 hypotheses):
  – Decades (1-10, 10-20, 20-30, …)
  Hypothesis spaces and theories
• Why a hypothesis space is like a domain theory:
   – Represents one particular way of classifying entities in
     a domain.
   – Not just an arbitrary collection of hypotheses, but a
     principled system.
• What’s missing?
   – Explicit representation of the principles.
• Hypothesis spaces (and priors) are generated by
  theories. Some analogies:
   – Grammars generate languages (and priors over
     structural descriptions)
   – Hierarchical Bayesian modeling
                    Bayesian model
• H: Hypothesis space of possible concepts:
   – Mathematical properties: even, odd, square, prime, . . . .
   – Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
   – Raw magnitude: all intervals between 1 and 100.

• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:
                 p(h | X) = p(X | h) p(h) / p(X)

   – p(h) [“prior”]: domain knowledge, pre-existing biases
   – p(X|h) [“likelihood”]: statistical information in examples.
   – p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
                    Bayesian model
• H: Hypothesis space of possible concepts:
   – Mathematical properties: even, odd, square, prime, . . . .
   – Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . .
   – Raw magnitude: all intervals between 1 and 100.

• X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data:
                 p(h | X) = p(X | h) p(h) / Σ_{h′∈H} p(X | h′) p(h′)
   – p(h) [“prior”]: domain knowledge, pre-existing biases
   – p(X|h) [“likelihood”]: statistical information in examples.
   – p(h|X) [“posterior”]: degree of belief that h is the true extension of C.
Likelihood: p(X|h)
• Size principle: Smaller hypotheses receive greater
  likelihood, and exponentially more so as n increases.
                 p(X | h) = [1 / size(h)]^n    if x1, …, xn ∈ h
                          = 0                  if any xi ∉ h


• Follows from assumption of randomly sampled examples.
• Captures the intuition of a representative sample.
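A small numerical sketch (our own illustration) of how the size principle separates nested hypotheses such as "even numbers" (size 50) and "multiples of 10" (size 10): the likelihood ratio in favor of the smaller consistent hypothesis grows as 5^n:

def likelihood(examples, hypothesis):
    # Size principle: each consistent example contributes a factor 1/size(h).
    if all(x in hypothesis for x in examples):
        return (1.0 / len(hypothesis)) ** len(examples)
    return 0.0

even = set(range(2, 101, 2))         # size 50
mult10 = set(range(10, 101, 10))     # size 10

for examples in [[60], [60, 80, 10, 30]]:
    ratio = likelihood(examples, mult10) / likelihood(examples, even)
    print(len(examples), ratio)      # n = 1: 5;  n = 4: 625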
Illustrating the size principle
   h1   2    4    6    8 10     h2
        12   14   16   18 20
        22   24   26   28 30
        32   34   36   38 40
        42   44   46   48 50
        52   54   56   58 60
        62   64   66   68 70
        72   74   76   78 80
        82   84   86   88 90
        92   94   96   98 100
Illustrating the size principle
   h1       2    4    6    8 10         h2
            12   14   16   18 20
            22   24   26   28 30
            32   34   36   38 40
            42   44   46   48 50
            52   54   56   58 60
            62   64   66   68 70
            72   74   76   78 80
            82   84   86   88 90
            92   94   96   98 100


   Data slightly more of a coincidence under h1
Illustrating the size principle
   h1       2    4    6    8 10       h2
            12   14   16   18 20
            22   24   26   28 30
            32   34   36   38 40
            42   44   46   48 50
            52   54   56   58 60
            62   64   66   68 70
            72   74   76   78 80
            82   84   86   88 90
            92   94   96   98 100


   Data much more of a coincidence under h1
                  Bayesian Occam’s Razor

[Plot: p(D = d | M) against all possible data sets d, for two models M1 and M2
 — the “Law of Conservation of Belief”]

For any model M,        Σ_{all d ∈ D} p(D = d | M) = 1
     Comparing simple and complex hypotheses
[Plot: P(D | H2) — the distribution is an average over all values of p]
Prior: p(h)
• Choice of hypothesis space embodies a strong prior:
  effectively, p(h) ~ 0 for many logically possible but
  conceptually unnatural hypotheses.
• Prevents overfitting by highly specific but unnatural
  hypotheses, e.g. “multiples of 10 except 50 and 70”.
  Prior: p(h)
  • Choice of hypothesis space embodies a strong prior:
    effectively, p(h) ~ 0 for many logically possible but
    conceptually unnatural hypotheses.
  • Prevents overfitting by highly specific but unnatural
    hypotheses, e.g. “multiples of 10 except 50 and 70”.
  • p(h) encodes relative weights of alternative theories:
                    H: Total hypothesis space
    p(H1) = 1/5                                       p(H3) = 1/5
                            p(H2) = 3/5

H1: Math properties (24)   H2: Raw magnitude (5050)    H3: Approx. magnitude (10)
• even numbers             • 10-15                     • 10-20
• powers of two            • 20-32                     • 20-30
• multiples of three       • 37-54                     • 30-40
  …. p(h) = p(H1) / 24       …. p(h) = p(H2) / 5050      …. p(h) = p(H3) / 10
   A more complex approach to priors
• Start with a base set of regularities R and combination
  operators C.
• Hypothesis space = closure of R under C.
   – C = {and, or}: H = unions and intersections of regularities in R (e.g.,
     “multiples of 10 between 30 and 70”).
   – C = {and-not}: H = regularities in R with exceptions (e.g., “multiples
     of 10 except 50 and 70”).

• Two qualitatively similar priors:
   – Description length: number of combinations in C needed to generate
     hypothesis from R.
   – Bayesian Occam’s Razor, with model classes defined by number of
     combinations: more combinations → more hypotheses → lower prior
Posterior:   p(h | X) = p(X | h) p(h) / Σ_{h′∈H} p(X | h′) p(h′)


• X = {60, 80, 10, 30}
• Why prefer “multiples of 10” over “even
  numbers”? p(X|h).
• Why prefer “multiples of 10” over “multiples of
  10 except 50 and 20”? p(h).
• Why does a good generalization need both high
  prior and high likelihood? p(h|X) ~ p(X|h) p(h)
     Bayesian Occam’s Razor
Probabilities provide a common currency for
balancing model complexity with fit to the data.
   Generalizing to new objects
Given p(h|X), how do we compute p(y ∈ C | X),
the probability that C applies to some new
stimulus y?
     Generalizing to new objects
Hypothesis averaging:
 Compute the probability that C applies to some
 new object y by averaging the predictions of all
 hypotheses h, weighted by p(h|X):

       p(y ∈ C | X) = Σ_{h∈H} p(y ∈ C | h) p(h | X)

           where p(y ∈ C | h) = 1 if y ∈ h
                              = 0 if y ∉ h

       so  p(y ∈ C | X) = Σ_{h ⊇ {y, X}} p(h | X)
Examples:
 16
    Connection to feature-based
            similarity
• Additive clustering model of similarity:
                   s_ij = Σ_k w_k f_ik f_jk

• Bayesian hypothesis averaging:
        p(y ∈ C | X) = Σ_{h∈H} p(y ∈ C | h) p(h | X) = Σ_{h ⊇ {y, X}} p(h | X)

• Equivalent if we identify features fk with
  hypotheses h, and weights wk with p(h | X ) .
Examples:
 16
 8
 2
 64
Examples:
 16
 23
 19
 20
                   Model fits
Examples of     Generalization       Bayesian Model
“yes” numbers   judgments (N = 20)   (r = 0.96)

60


60 80 10 30


60 52 57 55
                   Model fits
Examples of     Generalization       Bayesian Model
“yes” numbers   judgments (N = 20)   (r = 0.93)

16


16 8 2 64


16 23 19 20
Summary of the Bayesian model
• How do the statistics of the examples interact with
  prior knowledge to guide generalization?

            posterior ∝ likelihood × prior

• Why does generalization appear rule-based or
  similarity-based?
           hypothesis averaging + size principle


       broad p(h|X): similarity gradient
      narrow p(h|X): all-or-none rule
Summary of the Bayesian model
• How do the statistics of the examples interact with
  prior knowledge to guide generalization?

            posterior ∝ likelihood × prior

• Why does generalization appear rule-based or
  similarity-based?
           hypothesis averaging + size principle


                       Many h of similar size: broad p(h|X)
                        One h much smaller: narrow p(h|X)
         Alternative models
• Neural networks




        even   multiple   multiple   power
                of 10      of 3       of 2
60
80
10
30
             Alternative models
• Neural networks
• Hypothesis ranking and elimination
Hypothesis
ranking:      1        2          3        4      ….
             even   multiple   multiple   power   ….
                     of 10      of 3       of 2
60
80
10
30
              Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
     – Average similarity:  p(y ∈ C | X) = (1 / |X|) Σ_{x_j ∈ X} sim(y, x_j)

60

60 80 10 30


60 52 57 55

                            Data                Model (r = 0.80)
              Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
     – Max similarity:  p(y ∈ C | X) = max_{x_j ∈ X} sim(y, x_j)
60

60 80 10 30


60 52 57 55

                          Data              Model (r = 0.64)
           Alternative models
• Neural networks
• Hypothesis ranking and elimination
• Similarity to exemplars
  – Average similarity
  – Max similarity
  – Flexible similarity? Bayes.
               Alternative models
•    Neural networks
•    Hypothesis ranking and elimination
•    Similarity to exemplars
•    Toolbox of simple heuristics
     – 60: “general” similarity
     – 60 80 10 30: most specific rule (“subset principle”).
     – 60 52 57 55: similarity in magnitude
    Why these heuristics? When to use which heuristic?
     Bayes.
                       Summary
• Generalization from limited data possible via the
  interaction of structured knowledge and statistics.
   – Structured knowledge: space of candidate rules, theories
     generate hypothesis space (c.f. hierarchical priors)
   – Statistics: Bayesian Occam’s razor.
• Better understand the interactions between
  traditionally opposing concepts:
   – Rules and statistics   – Rules and representativeness
   – Rules and similarity
• Explains why central but notoriously slippery
  processing-level concepts work the way they do.
   – Similarity
   – Representativeness
                    Why Bayes?
• A framework for explaining cognition.
   – How people can learn so much from such limited data.
   – Why process-level models work the way that they do.
   – Strong quantitative models with minimal ad hoc assumptions.

• A framework for understanding how structured
  knowledge and statistical inference interact.
   – How structured knowledge guides statistical inference, and is
     itself acquired through higher-order statistical learning.
   – How simplicity trades off with fit to the data in evaluating
     structural hypotheses (Occam’s razor).
   – How increasingly complex structures may grow as required
     by new data, rather than being pre-specified in advance.
  Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
            p(h | d) = p(d | h) p(h) / Σ_{h′∈H} p(d | h′) p(h′)

• Learners’ domain theories generate their
  hypothesis space H and prior p(h).
  – Well-matched to structure of the natural world.
  – Learnable from limited data.
  – Computationally tractable inference.
  Looking towards the afternoon
• How do we apply these ideas to more
  natural and complex aspects of cognition?
• Where do the hypothesis spaces come
  from?
• Can we formalize the contributions of
  domain theories?
                     Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Marr’s Three Levels of Analysis
• Computation:
   “What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?”

• Representation and algorithm:
    Cognitive psychology

• Implementation:
    Neurobiology
Working at the computational level
• What is the (statistical) computational problem?
  – input: data
  – output: solution
Working at the computational level
• What is the (statistical) computational problem?
  – input: data
  – output: solution
• What knowledge is available to the learner?

• Where does that knowledge come from?
  Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
            p(h | d) = p(d | h) p(h) / Σ_{h′∈H} p(d | h′) p(h′)

• Learners’ domain theories generate their
  hypothesis space H and prior p(h).
  – Well-matched to structure of the natural world.
  – Learnable from limited data.
  – Computationally tractable inference.
Causality
      Bayes nets and beyond...
• Increasingly popular approach to studying
  human causal inferences
           (e.g. Glymour, 2001; Gopnik et al., 2004)
• Three reactions:
  – Bayes nets are the solution!
  – Bayes nets are missing the point, not sure why…
  – what is a Bayes net?
     Bayes nets and beyond...
• What are Bayes nets?
  – graphical models
  – causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
  – other knowledge in causal induction
  – formalizing causal theories
     Bayes nets and beyond...
• What are Bayes nets?
  – graphical models
  – causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
  – other knowledge in causal induction
  – formalizing causal theories
            Graphical models
• Express the probabilistic dependency
  structure among a set of variables (Pearl, 1988)
• Consist of
   – a set of nodes, corresponding to variables
   – a set of edges, indicating dependency
    – a set of functions defined on the graph that
      together define a probability distribution
   Undirected graphical models
                                     X3               X4
                     X1
• Consist of
  – a set of nodes             X2            X5
  – a set of edges
  – a potential for each clique, multiplied together to
    yield the distribution over variables
• Examples
  – statistical physics: Ising model, spin glasses
  – early neural networks (e.g. Boltzmann machines)
     Directed graphical models
                                    X3           X4
                    X1
• Consist of
  – a set of nodes              X2           X5
  – a set of edges
  – a conditional probability distribution for each
    node, conditioned on its parents, multiplied
    together to yield the distribution over variables
• Constrained to directed acyclic graphs (DAG)
• AKA: Bayesian networks, Bayes nets
    Bayesian networks and Bayes
• Two different problems
  – Bayesian statistics is a method of inference
  – Bayesian networks are a form of representation
• There is no necessary connection
  – many users of Bayesian networks rely upon
    frequentist statistical methods (e.g. Glymour)
  – many Bayesian inferences cannot be easily
    represented using Bayesian networks
 Properties of Bayesian networks
• Efficient representation and inference
  – exploiting dependency structure makes it easier
    to represent and compute with probabilities


• Explaining away
  – pattern of probabilistic reasoning characteristic of
    Bayesian networks, especially early use in AI
Efficient representation and inference
 • Three binary variables: Cavity, Toothache, Catch
Efficient representation and inference
 • Three binary variables: Cavity, Toothache, Catch

 • Specifying P(Cavity, Toothache, Catch) requires 7
   parameters (1 for each set of values, minus 1
   because it’s a probability distribution)

 • With n variables, we need 2^n - 1 parameters
 • Here n=3. Realistically, many more: X-ray, diet,
   oral hygiene, personality, . . . .
      Conditional independence
• All three variables are dependent, but Toothache
  and Catch are independent given the presence or
  absence of Cavity
• In probabilistic terms:
      P(ache ∧ catch | cav)  = P(ache | cav) P(catch | cav)
      P(ache ∧ catch | ¬cav) = P(ache | ¬cav) P(catch | ¬cav)
• With n evidence variables, x1, …, xn, we need 2n conditional
  probabilities: P(xi | cav), P(xi | ¬cav)
        A simple Bayesian network
• Graphical representation of relations between a set of
  random variables:
                               Cavity

                     Toothache          Catch

• Probabilistic interpretation: factorizing complex terms
             P(A, B, C) = Π_{V ∈ {A, B, C}} P(V | parents[V])

     P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav)
                         = P(Ache | Cav) P(Catch | Cav) P(Cav)
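A small sketch of this factorization (the conditional probabilities below are made-up values for illustration): five numbers specify the whole joint over three binary variables, instead of the 2^3 - 1 = 7 an unstructured table would need:

p_cavity = 0.1
p_ache_given_cavity = {1: 0.7, 0: 0.05}    # P(Ache = 1 | Cavity)
p_catch_given_cavity = {1: 0.8, 0: 0.1}    # P(Catch = 1 | Cavity)

def joint(ache, catch, cavity):
    # P(Ache, Catch, Cavity) = P(Ache | Cavity) P(Catch | Cavity) P(Cavity)
    pa = p_ache_given_cavity[cavity] if ache else 1 - p_ache_given_cavity[cavity]
    pc = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
    pv = p_cavity if cavity else 1 - p_cavity
    return pa * pc * pv

total = sum(joint(a, c, v) for a in (0, 1) for c in (0, 1) for v in (0, 1))
print(total)    # 1.0: the factored pieces define a proper joint distribution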
                A more complex system
                      Battery

             Radio              Ignition            Gas

                                           Starts

                                     On time to work

      • Joint distribution sufficient for any inference:
    P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)

    P(O | G) = P(O, G) / P(G)
             = Σ_{B,R,I,S} P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S) / P(G)
                A more complex system
                      Battery

             Radio              Ignition            Gas

                                           Starts

                                     On time to work

       • Joint distribution sufficient for any inference:
    P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)

    P(O | G) = P(O, G) / P(G)
             = Σ_S [ Σ_{B,I} P(B) P(I|B) P(S|I,G) ] P(O|S)
             A more complex system
                   Battery

          Radio              Ignition            Gas

                                        Starts

                                  On time to work

  • Joint distribution sufficient for any inference:
P(B, R, I, G, S, O) = P(B) P(R|B) P(I|B) P(G) P(S|I,G) P(O|S)
  • General inference algorithm: local message passing
    (belief propagation; Pearl, 1988)
       – efficiency depends on sparseness of graph structure
              Explaining away
                  Rain               Sprinkler

                         Grass Wet

           P(R, S, W) = P(R) P(S) P(W | S, R)

• Assume grass will be wet if and only if it rained last
  night, or if the sprinklers were left on:
         P(W = w | S, R) = 1 if S = s or R = r
                         = 0 if R = ¬r and S = ¬s
                   Explaining away
                       Rain               Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet:

                 P(r | w) = P(w | r) P(r) / P(w)
                   Explaining away
                       Rain                Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet:

                 P(r | w) = P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′, s′)
                   Explaining away
                       Rain                Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet:

                 P(r | w) = P(r) / [ P(r, s) + P(r, ¬s) + P(¬r, s) ]
                   Explaining away
                       Rain                Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet:

                 P(r | w) = P(r) / [ P(r) + P(¬r, s) ]
                   Explaining away
                       Rain                Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet:

                 P(r | w) = P(r) / [ P(r) + P(¬r) P(s) ]  ≥  P(r)

                            (denominator is between 1 and P(s))
                   Explaining away
                       Rain                  Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet
and sprinklers were left on:

                 P(r | w, s) = P(w | r, s) P(r | s) / P(w | s)

                               (both terms = 1)
                   Explaining away
                       Rain               Sprinkler

                              Grass Wet

                P(R, S, W) = P(R) P(S) P(W | S, R)

             P(W = w | S, R) = 1 if S = s or R = r
                             = 0 if R = ¬r and S = ¬s

Compute probability it rained last night, given that the grass is wet
and sprinklers were left on:

                 P(r | w, s) = P(r | s) = P(r)
        Explaining away
             Rain                  Sprinkler

                     Grass Wet

    P(R, S, W) = P(R) P(S) P(W | S, R)

 P(W = w | S, R) = 1 if S = s or R = r
                 = 0 if R = ¬r and S = ¬s

   P(r | w) = P(r) / [ P(r) + P(¬r) P(s) ]  ≥  P(r)

   P(r | w, s) = P(r | s) = P(r)                “Discounting” to
                                                prior probability.
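The explaining-away pattern above can be checked by brute-force enumeration; this sketch assumes made-up priors P(r) = 0.3 and P(s) = 0.4 together with the deterministic wet-grass rule from the slides:

p_rain, p_sprinkler = 0.3, 0.4

def p_wet(rain, sprinkler):
    return 1.0 if (rain or sprinkler) else 0.0     # wet iff rain or sprinkler

def joint(rain, sprinkler, wet):
    pr = p_rain if rain else 1 - p_rain
    ps = p_sprinkler if sprinkler else 1 - p_sprinkler
    pw = p_wet(rain, sprinkler) if wet else 1 - p_wet(rain, sprinkler)
    return pr * ps * pw

# P(rain | wet): rain becomes more probable than its prior of 0.3.
num = sum(joint(1, s, 1) for s in (0, 1))
den = sum(joint(r, s, 1) for r in (0, 1) for s in (0, 1))
print(num / den)                                   # ~0.52

# P(rain | wet, sprinkler): the sprinkler explains the wet grass away.
print(joint(1, 1, 1) / sum(joint(r, 1, 1) for r in (0, 1)))   # 0.3, back to the prior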
  Contrast w/ production system
                Rain               Sprinkler

                       Grass Wet

• Formulate IF-THEN rules:
   – IF Rain THEN Wet
   – IF Wet THEN Rain  →  IF Wet AND NOT Sprinkler THEN Rain
• Rules do not distinguish directions of inference
• Requires combinatorial explosion of rules
 Contrast w/ spreading activation
               Rain               Sprinkler

                      Grass Wet

• Excitatory links: Rain → Wet, Sprinkler → Wet
• Observing rain, Wet becomes more active.
• Observing grass wet, Rain and Sprinkler become
  more active.
• Observing grass wet and sprinkler, Rain cannot
  become less active. No explaining away!
 Contrast w/ spreading activation
                Rain               Sprinkler

                       Grass Wet

• Excitatory links: Rain → Wet, Sprinkler → Wet
• Inhibitory link: Rain — Sprinkler
• Observing grass wet, Rain and Sprinkler become
  more active.
• Observing grass wet and sprinkler, Rain becomes
  less active: explaining away.
 Contrast w/ spreading activation
            Rain                       Burst pipe
                       Sprinkler

                      Grass Wet

• Each new variable requires more inhibitory
  connections.
• Interactions between variables are not causal.
• Not modular.
   – Whether a connection exists depends on what other
     connections exist, in non-transparent ways.
   – Big holism problem.
   – Combinatorial explosion.
           Graphical models
• Capture dependency structure in distributions
• Provide an efficient means of representing
  and reasoning with probabilities
• Allow kinds of inference that are problematic
  for other representations: explaining away
  – hard to capture in a production system
  – hard to capture with spreading activation
     Bayes nets and beyond...
• What are Bayes nets?
  – graphical models
  – causal graphical models
• An example: causal induction
• Beyond Bayes nets…
  – other knowledge in causal induction
  – formalizing causal theories
       Causal graphical models
• Graphical models represent statistical
  dependencies among variables (i.e., correlations)
  – can answer questions about observations
• Causal graphical models represent causal
  dependencies among variables
  – express underlying causal structure
  – can answer questions about both observations and
    interventions (actions upon a variable)
    Observation and intervention
              Battery

      Radio             Ignition            Gas

                                   Starts

                             On time to work

Graphical model:                   P(Radio|Ignition)
Causal graphical model:            P(Radio|do(Ignition))
    Observation and intervention
              Battery

      Radio             Ignition            Gas

                                   Starts

                             On time to work

Graphical model:                   P(Radio|Ignition)
Causal graphical model:            P(Radio|do(Ignition))


         “graph surgery” produces “mutilated graph”
        Assessing interventions
• To compute P(Y|do(X=x)), delete all edges
  coming into X and reason with the resulting
  Bayesian network (“do calculus”; Pearl, 2000)

• Allows a single structure to make predictions
  about both observations and interventions
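As a concrete sketch of graph surgery, consider only the Battery → Radio and Battery → Ignition fragment of the network above, with made-up conditional probabilities: observing Ignition = 1 is evidence about the Battery and so raises the probability that the Radio works, while do(Ignition = 1) cuts the edge into Ignition and leaves P(Radio) at its prior.

    # Minimal sketch of observation vs. intervention in a two-edge fragment:
    # Battery -> Radio, Battery -> Ignition.  All probabilities are hypothetical.
    P_BATTERY = 0.9                              # P(Battery works)
    P_RADIO_GIVEN_B = {1: 0.95, 0: 0.0}          # P(Radio on | Battery)
    P_IGNITION_GIVEN_B = {1: 0.97, 0: 0.0}       # P(Ignition | Battery)

    def p_radio(ignition=None, do_ignition=None):
        """P(Radio=1 | evidence), by enumeration.  If do_ignition is given, the
        Battery -> Ignition edge is cut (graph surgery), so Ignition carries no
        evidence about Battery."""
        num = den = 0.0
        for b in (0, 1):
            p_b = P_BATTERY if b else 1 - P_BATTERY
            for i in (0, 1):
                if do_ignition is not None:
                    p_i = 1.0 if i == do_ignition else 0.0
                else:
                    p_i = P_IGNITION_GIVEN_B[b] if i else 1 - P_IGNITION_GIVEN_B[b]
                if ignition is not None and i != ignition:
                    continue
                den += p_b * p_i
                num += p_b * p_i * P_RADIO_GIVEN_B[b]
        return num / den

    print("P(Radio=1)                  =", p_radio())
    print("P(Radio=1 | Ignition=1)     =", p_radio(ignition=1))      # raised: evidence about Battery
    print("P(Radio=1 | do(Ignition=1)) =", p_radio(do_ignition=1))   # unchanged from the prior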
    Causality simplifies inference
• Using a representation in which the direction of
  causality is correct produces sparser graphs
• Suppose we get the direction of causality wrong,
  thinking that “symptoms” cause “diseases”:

                Ache            Catch

                       Cavity
• Does not capture the correlation between symptoms:
  falsely believe P(Ache, Catch) = P(Ache) P(Catch).
    Causality simplifies inference
• Using a representation in which the direction of
  causality is correct produces sparser graphs
• Suppose we get the direction of causality wrong,
  thinking that “symptoms” cause “diseases”:

                  Ache              Catch

                           Cavity
• Inserting a new arrow allows us to capture this
  correlation.
• This model is too complex: we do not believe that
     P(Ache, Catch | Cavity) ≠ P(Ache | Cavity) P(Catch | Cavity),
  i.e., we still expect the symptoms to be independent given Cavity.
    Causality simplifies inference
• Using a representation in which the direction of
  causality is correct produces sparser graphs
• Suppose we get the direction of causality wrong,
  thinking that “symptoms” causes “diseases”:

           Ache                     X-ray
                      Catch

                      Cavity

• New symptoms require a combinatorial proliferation
  of new arrows. This reduces efficiency of inference.
 Learning causal graphical models


          B       C           B       C


              E                   E




• Strength: how strong is a relationship?
• Structure: does a relationship exist?
Causal structure vs. causal strength


          B       C           B       C


              E                   E




• Strength: how strong is a relationship?
Causal structure vs. causal strength


           B        C             B         C
           w0       w1             w0
                E                       E




• Strength: how strong is a relationship?
  – requires defining nature of relationship
             Parameterization
• Structures: h1 =      B           C     h0 =        B       C


                                E                         E




• Parameterization:      Generic
 C B      h1: P(E = 1 | C, B)           h0: P(E = 1| C, B)
 0   0           p00                             p0
 1   0           p10                             p0
 0   1           p01                             p1
 1   1           p11                             p1
             Parameterization
• Structures: h1 =      B           C      h0 =        B        C
                        w0          w1                 w0
                                E                           E
                         w0, w1: strength parameters for B, C


• Parameterization:      Linear
 C B      h1: P(E = 1 | C, B)            h0: P(E = 1| C, B)
 0   0           0                                0
 1   0           w1                               0
 0   1           w0                               w0
 1   1         w1+ w0                             w0
             Parameterization
• Structures: h1 =      B           C      h0 =        B        C
                        w0          w1                 w0
                                E                           E
                         w0, w1: strength parameters for B, C


• Parameterization:      “Noisy-OR”
 C B      h1: P(E = 1 | C, B)            h0: P(E = 1| C, B)
 0   0            0                               0
 1   0           w1                               0
 0   1           w0                               w0
 1   1      w1+ w0 – w1 w0                        w0
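To make the three parameterizations concrete, here is a minimal sketch of P(E = 1 | B, C) under structure h1, with placeholder parameter values (nothing here is fit to data).

    # Minimal sketch of the generic, linear, and noisy-OR parameterizations
    # of P(E = 1 | B, C) for structure h1.  Parameter values are placeholders.
    def generic(b, c, p00=0.1, p10=0.4, p01=0.5, p11=0.9):
        """A separate probability for each (C, B) configuration (p_cb as in the table)."""
        return {(0, 0): p00, (1, 0): p10, (0, 1): p01, (1, 1): p11}[(c, b)]

    def linear(b, c, w0=0.3, w1=0.4):
        """Strengths add; only sensible if w0 + w1 stays within [0, 1]."""
        return w0 * b + w1 * c

    def noisy_or(b, c, w0=0.3, w1=0.4):
        """Each present cause has an independent chance of producing E."""
        return 1 - (1 - w0 * b) * (1 - w1 * c)

    for c in (0, 1):
        for b in (0, 1):
            print(f"C={c} B={b}:  generic={generic(b, c):.2f}  "
                  f"linear={linear(b, c):.2f}  noisy-OR={noisy_or(b, c):.2f}")

For B = C = 1 the noisy-OR returns w1 + w0 – w1 w0, matching the table above.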
        Parameter estimation

• Maximum likelihood estimation:

       maximize  ∏i P(bi, ci, ei; w0, w1)

• Bayesian methods: as in the “Comparing
  infinitely many hypotheses” example…
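A minimal sketch of the maximum-likelihood computation, assuming the noisy-OR parameterization, a handful of hypothetical trials (b_i, c_i, e_i), and a crude grid search in place of a proper optimizer:

    # Minimal sketch: grid-search ML estimation of the noisy-OR strengths (w0, w1).
    # The trials below are hypothetical; B (the background) is present throughout.
    import itertools
    import math

    trials = [(1, 1, 1), (1, 1, 1), (1, 1, 0),    # C present: effect on 2 of 3 trials
              (1, 0, 1), (1, 0, 0), (1, 0, 0)]    # C absent:  effect on 1 of 3 trials

    def noisy_or(b, c, w0, w1):
        return 1 - (1 - w0 * b) * (1 - w1 * c)

    def log_likelihood(w0, w1):
        total = 0.0
        for b, c, e in trials:
            p = min(max(noisy_or(b, c, w0, w1), 1e-9), 1 - 1e-9)   # keep log finite
            total += math.log(p if e else 1 - p)
        return total

    grid = [i / 100 for i in range(101)]
    w0_hat, w1_hat = max(itertools.product(grid, grid),
                         key=lambda w: log_likelihood(*w))
    print("ML estimates:  w0 =", w0_hat, "  w1 =", w1_hat)

With these counts the estimates come out near w0 = 1/3 and w1 = 1/2, which is what the causal power formula gives for P(e|c) = 2/3 and P(e|¬c) = 1/3.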
Causal structure vs. causal strength


          B       C           B       C


              E                   E




• Structure: does a relationship exist?
    Approaches to structure learning
• Constraint-based:                               B         C
  – dependency from statistical tests (e.g., χ²)
  – deduce structure from dependencies                E
                                (Pearl, 2000; Spirtes et al., 1993)
    Approaches to structure learning
• Constraint-based:                              B         C
  – dependency from statistical tests (e.g., χ²)
  – deduce structure from dependencies                E
                                (Pearl, 2000; Spirtes et al., 1993)


Attempts to reduce inductive problem to deductive problem
     Approaches to structure learning
• Constraint-based:                                  B          C
   – dependency from statistical tests (e.g., χ²)
   – deduce structure from dependencies                    E
                                     (Pearl, 2000; Spirtes et al., 1993)

• Bayesian:                          B         C       B         C
   – compute posterior
     probability of structures,
                                          E                 E
     given observed data
                                      P(S1|data)         P(S0|data)
P(S|data) ∝ P(data|S) P(S)        (Heckerman, 1998; Friedman, 1999)
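As a sketch of the Bayesian score, assume the generic parameterization, uniform Beta(1,1) priors on each probability, and hypothetical counts of E with and without C (B, the background, is always present). The marginal likelihood then has a closed form, and the two structures can be compared directly.

    # Minimal sketch of Bayesian structure comparison with the generic
    # parameterization and uniform priors.  Counts are hypothetical.
    from math import factorial

    def marginal_likelihood(n1, n0):
        """P(a particular sequence with n1 ones and n0 zeros), uniform prior on p:
        the integral of p^n1 (1-p)^n0 dp equals n1! n0! / (n1 + n0 + 1)!"""
        return factorial(n1) * factorial(n0) / factorial(n1 + n0 + 1)

    counts = {"c_present": (7, 1),    # (number of E=1 trials, number of E=0 trials)
              "c_absent":  (2, 6)}

    # h1: C -> E exists, so E gets a separate parameter for each value of C
    ml_h1 = (marginal_likelihood(*counts["c_present"]) *
             marginal_likelihood(*counts["c_absent"]))

    # h0: no C -> E link, so one parameter governs E in both conditions
    pooled = (counts["c_present"][0] + counts["c_absent"][0],
              counts["c_present"][1] + counts["c_absent"][1])
    ml_h0 = marginal_likelihood(*pooled)

    # equal prior probability on the two structures
    print("P(data | h1) =", ml_h1)
    print("P(data | h0) =", ml_h0)
    print("P(h1 | data) =", ml_h1 / (ml_h1 + ml_h0))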
      Causal graphical models
• Extend graphical models to deal with
  interventions as well as observations
• Respecting the direction of causality results
  in efficient representation and inference
• Two steps in learning causal models
  – parameter estimation
  – structure learning
     Bayes nets and beyond...
• What are Bayes nets?
  – graphical models
  – causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
  – other knowledge in causal induction
  – formalizing causal theories
Elemental causal induction

            C present   C absent


E present       a          c

E absent        b          d



     “To what extent does C cause E?”
Causal structure vs. causal strength


          B        C          B        C
          w0       w1         w0
               E                   E




• Strength: how strong is a relationship?
• Structure: does a relationship exist?
                 Causal strength
• Assume structure:        B        C
                           w0       w1
                                E

• Leading models (ΔP and causal power) are maximum
  likelihood estimates of the strength parameter w1, under
  different parameterizations for P(E|B,C):
   – linear → ΔP,  noisy-OR → causal power
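Both estimates can be read straight off the contingency table (a, b, c, d) introduced earlier; this is just the pair of standard formulas, shown with hypothetical counts.

    # Minimal sketch: Delta-P and causal power from a 2x2 contingency table.
    # a: C present, E present   b: C present, E absent
    # c: C absent,  E present   d: C absent,  E absent
    def delta_p_and_power(a, b, c, d):
        p_e_given_c    = a / (a + b)
        p_e_given_notc = c / (c + d)
        delta_p = p_e_given_c - p_e_given_notc          # ML strength under the linear form
        power   = delta_p / (1 - p_e_given_notc)        # ML strength under the noisy-OR
        return delta_p, power

    print(delta_p_and_power(a=6, b=2, c=2, d=6))        # hypothetical counts -> (0.5, 0.667)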
               Causal structure
• Hypotheses: h1 =       B         C        h0 =    B         C


                              E                          E

• Bayesian causal inference:

     support  =  P(data | h1) / P(data | h0)

     where   P(data | h1)  =  ∫₀¹ ∫₀¹ P(data | w0, w1) p(w0, w1 | h1) dw0 dw1
             P(data | h0)  =  ∫₀¹ P(data | w0) p(w0 | h0) dw0
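A minimal sketch of the support computation, approximating both integrals on a grid with uniform priors over the noisy-OR strengths; the contingency counts are hypothetical, and the result is often reported on a log scale.

    # Minimal sketch of causal support: the ratio of marginal likelihoods for
    # h1 (C -> E exists) vs. h0, with uniform priors on the noisy-OR strengths.
    def noisy_or(c, w0, w1):
        return 1 - (1 - w0) * (1 - w1 * c)          # background B is always present

    def likelihood(w0, w1, a, b, c, d):
        p1 = noisy_or(1, w0, w1)                    # P(E=1 | C present)
        p0 = noisy_or(0, w0, w1)                    # P(E=1 | C absent)
        return (p1 ** a) * ((1 - p1) ** b) * (p0 ** c) * ((1 - p0) ** d)

    def support(a, b, c, d, n=100):
        grid = [(i + 0.5) / n for i in range(n)]    # midpoint rule on (0, 1)
        m1 = sum(likelihood(w0, w1, a, b, c, d) for w0 in grid for w1 in grid) / n ** 2
        m0 = sum(likelihood(w0, 0.0, a, b, c, d) for w0 in grid) / n
        return m1 / m0                              # P(data | h1) / P(data | h0)

    print("support =", support(a=6, b=2, c=2, d=6))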
Buehner and Cheng (1997)

  [Figure: human judgments vs. model predictions across contingency conditions]
    – People
    – ΔP (r = 0.89)
    – Power (r = 0.88)
    – Support (r = 0.97)
The importance of parameterization
• Noisy-OR incorporates mechanism assumptions:
  – generativity: causes increase probability of effects
  – each cause is sufficient to produce the effect
  – causes act via independent mechanisms
                                            (Cheng, 1997)
• Consider other models:
  – statistical dependence: χ² test
  – generic parameterization (Anderson, computer science)
  [Figure: human judgments vs. model predictions]
    – People
    – Support (noisy-OR)
    – χ²
    – Support (generic)
           Generativity is essential
 P(e+|c+)      8/8      6/8      4/8      2/8      0/8
 P(e+|c-)      8/8      6/8      4/8      2/8      0/8

   [Figure: Support model predictions for these zero-ΔP conditions]
 • Predictions result from “ceiling effect”
    – ceiling effects only matter if you believe a cause
      increases the probability of an effect
     Bayes nets and beyond...
• What are Bayes nets?
  – graphical models
  – causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
  – other knowledge in causal induction
  – formalizing causal theories
                                                chemicals
                                                  genes




Clofibrate   Wyeth 14,643    Gemfibrozil      Phenobarbital




      p450 2B1          Carnitine Palmitoyl Transferase 1


                  Hamadeh et al. (2002) Toxicological sciences.
                                                chemicals
                            X                     genes




Clofibrate   Wyeth 14,643       Gemfibrozil   Phenobarbital




      p450 2B1          Carnitine Palmitoyl Transferase 1


                  Hamadeh et al. (2002) Toxicological sciences.
                                                    chemicals
                         Chemical X                   genes

           peroxisome proliferators

Clofibrate     Wyeth 14,643      Gemfibrozil      Phenobarbital



       +          +        +



      p450 2B1              Carnitine Palmitoyl Transferase 1


                      Hamadeh et al. (2002) Toxicological sciences.
  Using causal graphical models
• Three questions (usually solved by researcher)
  – what are the variables?
  – what structures are plausible?
  – how do variables interact?


• How are these questions answered if causal
  graphical models are used in cognition?
     Bayes nets and beyond...
• What are Bayes nets?
  – graphical models
  – causal graphical models
• An example: elemental causal induction
• Beyond Bayes nets…
  – other knowledge in causal induction
  – formalizing causal theories
     Theory-based causal induction
 Causal theory                        P(h|data) ∝ P(data|h) P(h)
    – Ontology                        Evaluated by statistical inference
    – Plausible relations
    – Functional form

 Generates a hypothesis space of causal graphical models
 (e.g., structures h1 and h0 over variables X, Y, B, Z), with
 prior probabilities P(h1) and P(h0) = 1 – P(h1).
                                           Blicket detector

                                 (Gopnik, Sobel, and colleagues)

   [Figure: both objects activate the detector; object A does not activate it by itself]
        Procedure used in Sobel et al. (2002), Experiment 2

  [Figure: experimenter demonstrates the blicket machine]
  • “See this? It’s a blicket machine. Blickets make it go.”
  • “Let’s put this one on the machine.”  …  “Oooh, it’s a blicket!”

  [Figure 13: One-Cause and Backward Blocking conditions. In each,
   children see which objects activate the detector, are asked if
   each object is a blicket, and then are asked to make the machine go.]
               “Blocking” (Sobel et al., 2002, Experiment 2)

  [Figure: trial sequence]

     – Two objects: A and B
     – Trial 1: A on detector – detector active
     – Trial 2: B on detector – detector inactive
     – Trials 3, 4: A B on detector – detector active
     – 3- and 4-year-olds judge whether each object is a blicket

        • A: a blicket
        • B: not a blicket
       A deductive inference?
• Causal law: detector activates if and only if
  one or more objects on top of it are blickets.
• Premises:
  – Trial 1: A on detector – detector active
  – Trial 2: B on detector – detector inactive
  – Trials 3,4: A B on detector – detector active
• Conclusions deduced from premises and
  causal law:
  – A: a blicket
  – B: not a blicket
                     “Backwards blocking”
              (Sobel, Tenenbaum & Gopnik, 2004)

  [Figure 13: Procedure used in Sobel et al. (2002), Experiment 2]

     – Two objects: A and B
     – Trial 1: A B on detector – detector active
     – Trial 2: A on detector – detector active
     – 4-year-olds judge whether each object is a blicket

        • A: a blicket (100% of judgments)
        • B: probably not a blicket (66% of judgments)
                        Theory
• Ontology
  – Types: Block, Detector, Trial
  – Predicates:
     Contact(Block, Detector, Trial)
     Active(Detector, Trial)
• Constraints on causal relations
  – For any Block b and Detector d, with prior probability q :
      Cause(Contact(b,d,t), Active(d,t))
• Functional form of causal relations
  – Causes of Active(d,t) are independent mechanisms, with
    causal strengths wi. A background cause has strength w0.
    Assume a near-deterministic mechanism: wi ~ 1, w0 ~ 0.
                        Theory
• Ontology
  – Types: Block, Detector, Trial
  – Predicates:
                                       A       B
     Contact(Block, Detector, Trial)
     Active(Detector, Trial)
                                           E
                         Theory
• Ontology
  – Types: Block, Detector, Trial
  – Predicates:
                                                A       B
     Contact(Block, Detector, Trial)
     Active(Detector, Trial)
                                                    E


             A = 1 if Contact(block A, detector, trial), else 0
             B = 1 if Contact(block B, detector, trial), else 0
             E = 1 if Active(detector, trial), else 0
                               Theory
    • Constraints on causal relations
         – For any Block b and Detector d, with prior probability q :
             Cause(Contact(b,d,t), Active(d,t))
                                    P(h00) = (1 – q)²       P(h10) = q(1 – q)

No hypotheses with          h00 :      A       B         h10 :   A        B
E → B,  E → A,
A → B,  etc.                               E                          E

                                    P(h01) = (1 – q) q           P(h11) = q²

A → E  =  “A is a blicket”             A       B                  A       B
                            h01 :                        h11 :
                                           E                          E
                                    Theory
   • Functional form of causal relations
       – Causes of Active(d,t) are independent mechanisms, with
         causal strengths wb. A background cause has strength w0.
         Assume a near-deterministic mechanism: wb ~ 1, w0 ~ 0.
            P(h00) = (1 – q)²     P(h01) = (1 – q) q   P(h10) = q(1 – q)   P(h11) = q²

                     A       B     A       B            A       B          A       B


                         E             E                    E                  E
P(E=1 | A=0, B=0):       0             0                    0                  0
P(E=1 | A=1, B=0):       0             0                    1                  1
P(E=1 | A=0, B=1):       0             1                    0                  1
P(E=1 | A=1, B=1):       0             1                    1                  1

                     “Activation law”: E=1 if and only if A=1 or B=1.
           Bayesian inference
• Evaluating causal models in light of data:
          P(hi | d)  =  P(d | hi) P(hi)  /  Σ_{hj ∈ H} P(d | hj) P(hj)

• Inferring a particular causal relation:

          P(A → E | d)  =  Σ_{hj ∈ H} P(A → E | hj) P(hj | d)
      Modeling backwards blocking
            P(h00) = (1 – q)²     P(h01) = (1 – q) q   P(h10) = q(1 – q)   P(h11) = q²

                     A       B     A       B            A       B          A       B


                         E             E                    E                  E
P(E=1 | A=0, B=0):       0             0                    0                  0
P(E=1 | A=1, B=0):       0             0                    1                  1
P(E=1 | A=0, B=1):       0             1                    0                  1
P(E=1 | A=1, B=1):       0             1                    1                  1

                             P ( B  E | d ) P(h01 )  P(h11 )    q
                                                              
                             P( B    E | d ) P (h00 )  P(h10 ) 1  q
      Modeling backwards blocking
            P(h00) = (1 – q)²     P(h01) = (1 – q) q   P(h10) = q(1 – q)   P(h11) = q²

                     A       B     A       B            A       B          A       B


                         E             E                    E                  E




P(E=1 | A=1, B=1):       0             1                    1                  1

                             P ( B  E | d ) P (h01 )  P(h11 )    1
                                                               
                             P( B    E | d)       P (h10 )        1 q
      Modeling backwards blocking
                      P(h01) = (1 – q) q   P(h10) = q(1 – q)   P(h11) = q²

                       A       B            A       B          A       B


                           E                    E                  E


P(E=1 | A=1, B=0):         0                    1                  1

P(E=1 | A=1, B=1):         1                    1                  1

                               P( B  E | d ) P(h11 )   q
                                                     
                               P( B   E | d ) P(h10 ) 1  q
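These ratios can be reproduced directly from the theory above: four hypotheses, prior probability q that each block is a blicket, and the near-deterministic activation law. A minimal sketch, with q chosen arbitrarily for illustration:

    # Minimal sketch of the backwards-blocking computation.
    from itertools import product

    q = 0.2                                   # hypothetical prior P(block is a blicket)

    def prior(h):                             # h = (A is a blicket, B is a blicket)
        return (q if h[0] else 1 - q) * (q if h[1] else 1 - q)

    def likelihood(trial, h):
        """Activation law: the detector activates iff some blicket is on it."""
        a_on, b_on, e = trial
        predicted = 1 if (a_on and h[0]) or (b_on and h[1]) else 0
        return 1.0 if e == predicted else 0.0

    def posterior(trials):
        scores = {h: prior(h) for h in product([0, 1], repeat=2)}
        for t in trials:
            scores = {h: s * likelihood(t, h) for h, s in scores.items()}
        z = sum(scores.values())
        return {h: s / z for h, s in scores.items()}

    def p_blicket(post, i):
        return sum(p for h, p in post.items() if h[i])

    trials = [(1, 1, 1)]                      # Trial 1: A and B together activate the detector
    post = posterior(trials)
    print("after trial 1:  P(A) =", p_blicket(post, 0), "  P(B) =", p_blicket(post, 1))

    trials.append((1, 0, 1))                  # Trial 2: A alone activates the detector
    post = posterior(trials)
    print("after trial 2:  P(A) =", p_blicket(post, 0), "  P(B) =", p_blicket(post, 1))

After trial 2, P(A is a blicket) goes to 1 and P(B is a blicket) falls back to q, the “discounting to the prior” pattern derived above.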
                  Manipulating the prior

  I.  Pre-training phase:  Blickets are rare . . . .

      [Figure: 12 objects placed on the detector one at a time]

  II. Backwards blocking phase:

      – Trial 1: A B on detector – detector active
      – Trial 2: A on detector – detector active

      After each trial, adults judge the probability that each
      object is a blicket.
• “Rare” condition: First observe 12 objects
  on detector, of which 2 set it off.
• “Common” condition: First observe 12
  objects on detector, of which 10 set it off.
              Inferences from ambiguous data

  I.  Pre-training phase:  Blickets are rare . . . .

  II. Two trials:  A B → detector,  B C → detector
      (the detector activates on both trials)

      After each trial, adults judge the probability that each
      object is a blicket.
Same domain theory generates hypothesis
space for 3 objects:
• Hypotheses (which causal links A→E, B→E, C→E exist):
       h000 = no links             h100 = A→E
       h010 = B→E                  h001 = C→E
       h110 = A→E, B→E             h011 = B→E, C→E
       h101 = A→E, C→E             h111 = A→E, B→E, C→E
• Likelihoods: P(E=1 | A, B, C; h) = 1 if A = 1 and A→E exists,
                                       or B = 1 and B→E exists,
                                       or C = 1 and C→E exists;
                                     = 0 otherwise.
• “Rare” condition: First observe 12 objects
  on detector, of which 2 set it off.
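The computation on this slide can be made concrete with a small sketch (ours, not the tutorial's code): enumerate the causal hypotheses, set the prior on each link from the "rare" condition (2 of 12 objects set off the detector, so roughly 1/6), apply the deterministic-detector likelihood, and update on the backward-blocking trials. Restricting to the two objects A and B, and the exact prior value, are simplifying assumptions for illustration.

from itertools import product

P_BLICKET = 2.0 / 12.0   # "rare" condition: 2 of 12 objects set off the detector

def prior(h):
    # h = (a, b): whether the links A -> E and B -> E exist
    p = 1.0
    for link in h:
        p *= P_BLICKET if link else (1.0 - P_BLICKET)
    return p

def likelihood(trial, h):
    # deterministic detector: E = 1 iff some object on it has a link to E under h
    on, e = trial
    predicted = 1 if any(h[i] for i in on) else 0
    return 1.0 if predicted == e else 0.0

def posterior(trials):
    hyps = list(product([0, 1], repeat=2))
    scores = {h: prior(h) for h in hyps}
    for t in trials:
        scores = {h: s * likelihood(t, h) for h, s in scores.items()}
    z = sum(scores.values())
    return {h: s / z for h, s in scores.items()}

# backward blocking: A and B together activate the machine, then A alone activates it
post = posterior([((0, 1), 1), ((0,), 1)])
print("P(A is a blicket) =", sum(p for h, p in post.items() if h[0]))  # -> 1.0
print("P(B is a blicket) =", sum(p for h, p in post.items() if h[1]))  # -> falls back near the 1/6 prior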
  The role of causal mechanism
           knowledge
• Is mechanism knowledge necessary?
  – Constraint-based learning using 2 tests of
    conditional independence.

• How important is the deterministic functional
  form of causal relations?
  – Bayes with “noisy sufficient causes” theory (c.f.,
    Cheng’s causal power theory).
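For reference, one standard way to write a "noisy sufficient causes" likelihood is a noisy-OR, in the spirit of Cheng's causal power; whether this matches the exact parameterization used here is an assumption on our part. A sketch, reusing the trial format from the code above (the power and background values are placeholders):

def noisy_or_likelihood(trial, h, power=0.9, background=0.0):
    # P(E=1) = 1 - (1 - background) * (1 - power)^(number of active causes present under h)
    on, e = trial
    p_e1 = 1.0 - (1.0 - background) * (1.0 - power) ** sum(h[i] for i in on)
    return p_e1 if e == 1 else 1.0 - p_e1

# drop-in replacement for likelihood() in the sketch above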
Bayes with correct theory:                        (model fits figure)

Bayes with "noisy sufficient causes" theory:      (model fits figure)
   Theory-based causal induction
• Explains one-shot causal inferences about
  physical systems: blicket detectors
• Captures a spectrum of inferences:
  – unambiguous data: adults and children make all-
    or-none inferences
  – ambiguous data: adults and children make more
    graded inferences
• Extends to more complex cases with hidden
  variables, dynamic systems: come to my talk!
                 Summary
• Causal graphical models provide a language
  for asking questions about causality
• Key issues in modeling causal induction:
  – what do we mean by causal induction?
  – how do knowledge and statistics interact?
• Bayesian approach allows exploration of
  different answers to these questions
                     Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Property induction
             Collaborators
Charles Kemp       Neville Sanjana
Lauren Schmidt     Amy Perfors
Fei Xu             Liz Baraff
Pat Shafto
           The Big Question
• How can we generalize new concepts
  reliably from just one or a few examples?
  – Learning word meanings




 “horse”        “horse”        “horse”
            The Big Question
• How can we generalize new concepts
  reliably from just one or a few examples?
  – Learning word meanings, causal relations,
    social rules, ….
  – Property induction
        Gorillas have T4 cells.
        Squirrels have T4 cells.
        All mammals have T4 cells.


  How probable is the conclusion (target) given
   the premises (examples)?
            The Big Question
• How can we generalize new concepts
  reliably from just one or a few examples?
  – Learning word meanings, causal relations,
    social rules, ….
  – Property induction
        Gorillas have T4 cells.         Gorillas have T4 cells.
        Squirrels have T4 cells.        Chimps have T4 cells.
        All mammals have T4 cells.      All mammals have T4 cells.


 More diverse examples  →  stronger generalization
  Is rational inference the answer?
• Everyday induction often appears to follow
  principles of rational scientific inference.
  – Could that explain its success?

• Goal of this work: a rational computational
  model of human inductive generalization.
  – Explain people’s judgments as approximations to
    optimal inference in natural environments.
  – Close quantitative fits to people’s judgments with
    a minimum of free parameters or assumptions.
  Theory-Based Bayesian Models
• Rational statistical inference (Bayes):
            $p(h \mid d) = \dfrac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$

• Learners’ domain theories generate their
  hypothesis space H and prior p(h).
  – Well-matched to structure of the natural world.
  – Learnable from limited data.
  – Computationally tractable inference.
                 The plan
• Similarity-based models
• Theory-based model
• Bayesian models
  – “Empiricist” Bayes
  – Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
  – Learning with multiple domain theories
  – Learning domain theories
               An experiment
               (Osherson et al., 1990)

• 20 subjects rated the strength of 45 arguments:
      X1 have property P.
      X2 have property P.
      X3 have property P.

      All mammals have property P.


• 40 different subjects rated the similarity of all
  pairs of 10 mammals.
   Similarity-based models
              (Osherson et al.)

 $\mathrm{strength}(\text{``all mammals''} \mid X) \propto \sum_{i \in \mathrm{mammals}} \mathrm{sim}(i, X)$

 (figure: the mammals plotted as points, with the examples in X marked "x")
       Similarity-based models
                      (Osherson et al.)

 $\mathrm{strength}(\text{``all mammals''} \mid X) \propto \sum_{i \in \mathrm{mammals}} \mathrm{sim}(i, X)$

• Sum-Similarity:
   $\mathrm{sim}(i, X) = \sum_{j \in X} \mathrm{sim}(i, j)$

                                    (figure: mammals as points; examples marked "x")
       Similarity-based models
                      (Osherson et al.)

 $\mathrm{strength}(\text{``all mammals''} \mid X) \propto \sum_{i \in \mathrm{mammals}} \mathrm{sim}(i, X)$

• Max-Similarity:
   $\mathrm{sim}(i, X) = \max_{j \in X} \mathrm{sim}(i, j)$

                                    (figure: mammals as points; examples marked "x")
     Sum-sim versus Max-sim
• Two models appear functionally similar:
  – Both increase monotonically as new examples
    are observed.
• Reasons to prefer Sum-sim:
  – Standard form of exemplar models of
    categorization, memory, and object recognition.
  – Analogous to kernel density estimation
    techniques in statistical pattern recognition.
• Reasons to prefer Max-sim:
  – Fit to generalization judgments . . . .
                  Data vs. models
(scatter plot: human judgments ["Data"] plotted against model predictions ["Model"])

 Each point represents one argument:        X1 have property P.
                                            X2 have property P.
                                            X3 have property P.
                                            All mammals have property P.
                       Three data sets

Max-sim



Sum-sim



Conclusion
      kind:   “all mammals”   “horses”   “horses”

Number of
examples:          3             2       1, 2, or 3
          Feature rating data
            (Osherson and Wilkie)

• People were given 48 animals, 85 features,
  and asked to rate whether each animal had
  each feature.
• E.g., elephant: 'gray' 'hairless' 'toughskin'
                    'big' 'bulbous' 'longleg'
                    'tail' 'chewteeth' 'tusks'
                    'smelly' 'walks' 'slow'
                    'strong' 'muscle’ 'quadrapedal'
                    'inactive' 'vegetation' 'grazer'
                    'oldworld' 'bush' 'jungle'
                    'ground' 'timid' 'smart'
                    'group'
                              ?


  Species 1
  Species 2                           ?
  Species 3                           ?
  Species 4                           ?
  Species 5                           ?
  Species 6                           ?
  Species 7                           ?
  Species 8                           ?
  Species 9
  Species 10                          ?

               Features           New property


• Compute similarity between species' feature vectors using Hamming
  distance or cosine.
• Generalize based on Max-sim or Sum-sim (see the sketch below).
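A minimal sketch of these two scoring rules. The 4 x 5 feature matrix, the species names, and the cosine choice below are made-up placeholders for illustration; the real analyses use the 48-species, 85-feature ratings described above.

import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two binary feature vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def argument_strength(features, examples, category, rule="max"):
    """strength("all <category>" | examples) = sum over category members of sim(i, examples)."""
    total = 0.0
    for i in category:
        sims = [cosine_sim(features[i], features[j]) for j in examples]
        total += max(sims) if rule == "max" else sum(sims)
    return total

# toy data (placeholder, not the real 48 x 85 matrix)
features = {
    "horse":    np.array([1, 1, 0, 1, 0], float),
    "cow":      np.array([1, 1, 0, 1, 1], float),
    "dolphin":  np.array([0, 0, 1, 1, 0], float),
    "squirrel": np.array([1, 0, 0, 0, 1], float),
}
mammals = list(features)
print("Max-sim:", argument_strength(features, ["horse", "cow"], mammals, rule="max"))
print("Sum-sim:", argument_strength(features, ["horse", "cow"], mammals, rule="sum"))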
                       Three data sets
                r = 0.77       r = 0.75    r = 0.94

Max-Sim


               r = – 0.21      r = 0.63    r = 0.19

Sum-Sim



Conclusion
      kind:   “all mammals”   “horses”    “horses”

Number of
examples:          3             2        1, 2, or 3
Problems for sim-based approach
• No principled explanation for why Max-Sim works so
  well on this task, and Sum-Sim so poorly, when Sum-
  Sim is the standard in other similarity-based models.
• Free parameters mixing similarity and coverage terms,
  and possibly Max-Sim and Sum-Sim terms.
• Does not extend to induction with other kinds of
  properties, e.g., from Smith et al., 1993:
        Dobermanns can bite through wire.
        German shepherds can bite through wire.

        Poodles can bite through wire.
        German shepherds can bite through wire.
Marr’s Three Levels of Analysis
• Computation:
   “What is the goal of the computation, why is it
    appropriate, and what is the logic of the
    strategy by which it can be carried out?”

• Representation and algorithm:
    Max-sim, Sum-sim

• Implementation:
    Neurobiology
                 The plan
• Similarity-based models
• Theory-based model
• Bayesian models
  – “Empiricist” Bayes
  – Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
  – Learning with multiple domain theories
  – Learning domain theories
       Theory-based induction
• Scientific biology: species generated by an
  evolutionary branching process.
  – A tree-structured taxonomy of species.




• Taxonomy also central in folkbiology (Atran).
     Theory-based induction
Begin by reconstructing intuitive taxonomy
 from similarity judgments:




       clustering
How taxonomy constrains induction
 • Atran (1998): “Fundamental principle of
   systematic induction” (Warburton 1967,
   Bock 1973)
   – Given a property found among members of any
     two species, the best initial hypothesis is that
     the property is also present among all species
     that are included in the smallest higher-order
     taxon containing the original pair of species.
                               “all mammals”




Cows have property P.
Dolphins have property P.
Squirrels have property P.

All mammals have property P.

Strong (0.76 [max = 0.82])
                               “large herbivores”




Cows have property P.                Cows have property P.
Dolphins have property P.            Horses have property P.
Squirrels have property P.           Rhinos have property P.

All mammals have property P.         All mammals have property P.

Strong: 0.76 [max = 0.82]             Weak: 0.17 [min = 0.14]
                                    “all mammals”




Cows have property P.          Seals have property P.
Dolphins have property P.      Dolphins have property P.
Squirrels have property P.     Squirrels have property P.

All mammals have property P.   All mammals have property P.

Strong: 0.76 [max = 0.82]      Weak: 0.30 [min = 0.14]
Taxonomic
 distance



 Max-sim



 Sum-sim



Conclusion
      kind:   “all mammals”   “horses”   “horses”

 Number of
 examples:         3             2       1, 2, or 3
                The challenge
• Can we build models with the best of both
  traditional approaches?
  – Quantitatively accurate predictions.
  – Strong rational basis.

• Will require novel ways of integrating
  structured knowledge with statistical inference.
                 The plan
• Similarity-based models
• Theory-based model
• Bayesian models
  – “Empiricist” Bayes
  – Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
  – Learning with multiple domain theories
  – Learning domain theories
             The Bayesian approach
                              ?


Species 1
Species 2                             ?
Species 3                             ?
Species 4                             ?
Species 5                             ?
Species 6                             ?
Species 7                             ?
Species 8                             ?
Species 9
Species 10                            ?

                   Features       New property
             The Bayesian approach
                            ?


Species 1
Species 2                                           ?
Species 3                                           ?
Species 4                                           ?
Species 5                                           ?
Species 6                                           ?
Species 7                                           ?
Species 8                                           ?
Species 9
Species 10                                          ?

                 Features       Generalization   New property
                                 Hypothesis
             The Bayesian approach
                            p(h)
                                              p(d |h)
                                         h              d
Species 1
Species 2                                               ?
Species 3                                               ?
Species 4                                               ?
Species 5                                               ?
Species 6                                               ?
Species 7                                               ?
Species 8                                               ?
Species 9
Species 10                                              ?

                 Features          Generalization   New property
                                    Hypothesis
    Bayes' rule:   $p(h \mid d) = \dfrac{p(d \mid h)\, p(h)}{\sum_{h' \in H} p(d \mid h')\, p(h')}$

                               p(h)
                                                 p(d |h)
                                            h              d
Species 1
Species 2                                                  ?
Species 3                                                  ?
Species 4                                                  ?
Species 5                                                  ?
Species 6                                                  ?
Species 7                                                  ?
Species 8                                                  ?
Species 9
Species 10                                                 ?

                   Features           Generalization   New property
                                       Hypothesis
Probability that property Q holds for species x:
    $p(Q(x) \mid d) = \sum_{h\ \mathrm{consistent\ with}\ Q(x)} p(h \mid d)$
                              p(h)
                                                p(d |h)
                                           h              d
 Species 1
 Species 2                                                ?
 Species 3                                                ?
 Species 4                                                ?
 Species 5                                                ?
 Species 6                                                ?
 Species 7                                                ?
 Species 8                                                ?
 Species 9
 Species 10                                               ?

                  Features           Generalization    New property
                                      Hypothesis
             "Size principle":   $p(d \mid h) = \begin{cases} \dfrac{1}{|h|} & \text{if } d \text{ is consistent with } h \\ 0 & \text{otherwise} \end{cases}$

              |h| = # of positive instances of h
                                 p(h)
                                                   p(d |h)
                                              h              d
Species 1
Species 2                                                    ?
Species 3                                                    ?
Species 4                                                    ?
Species 5                                                    ?
Species 6                                                    ?
Species 7                                                    ?
Species 8                                                    ?
Species 9
Species 10                                                   ?

                 Features               Generalization   New property
                                         Hypothesis
                 The size principle
           h1      2    4    6    8 10     h2
“even numbers”     12   14   16   18 20    “multiples of 10”
                   22   24   26   28 30
                   32   34   36   38 40
                   42   44   46   48 50
                   52   54   56   58 60
                   62   64   66   68 70
                   72   74   76   78 80
                   82   84   86   88 90
                   92   94   96   98 100
                 The size principle
           h1      2    4    6    8 10     h2
“even numbers”     12   14   16   18 20    “multiples of 10”
                   22   24   26   28 30
                   32   34   36   38 40
                   42   44   46   48 50
                   52   54   56   58 60
                   62   64   66   68 70
                   72   74   76   78 80
                   82   84   86   88 90
                   92   94   96   98 100


      Data slightly more of a coincidence under h1
                 The size principle
           h1      2    4    6    8 10     h2
“even numbers”     12   14   16   18 20    “multiples of 10”
                   22   24   26   28 30
                   32   34   36   38 40
                   42   44   46   48 50
                   52   54   56   58 60
                   62   64   66   68 70
                   72   74   76   78 80
                   82   84   86   88 90
                   92   94   96   98 100


      Data much more of a coincidence under h1
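Numerically, the size principle makes this precise: with n examples sampled independently from the true concept, p(d | h) = (1/|h|)^n, so each additional example favors "multiples of 10" (|h| = 10) over "even numbers" (|h| = 50) by another factor of 5. A small sketch (equal priors assumed purely for illustration):

def size_principle_likelihood(examples, hypothesis):
    # p(d | h) = (1 / |h|)^n if every example lies in h, else 0
    if all(x in hypothesis for x in examples):
        return (1.0 / len(hypothesis)) ** len(examples)
    return 0.0

even_numbers    = set(range(2, 101, 2))    # |h1| = 50
multiples_of_10 = set(range(10, 101, 10))  # |h2| = 10

for examples in ([20], [20, 40], [20, 40, 60, 80]):
    l1 = size_principle_likelihood(examples, even_numbers)
    l2 = size_principle_likelihood(examples, multiples_of_10)
    # with equal priors, the posterior odds equal the likelihood ratio
    print(examples, "likelihood ratio h2 : h1 =", l2 / l1)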
  Illustrating the size principle
Which argument is stronger?

         Grizzly bears have property P.

         All mammals have property P.


                              “Non-monotonicity”
        Grizzly bears have property P.
        Brown bears have property P.
        Polar bears have property P.

        All mammals have property P.
Probability that property Q holds for species x:
    $p(Q(x) \mid d) = \dfrac{\sum_{h\ \mathrm{consistent\ with}\ Q(x),\ d}\ p(h)/|h|}{\sum_{h\ \mathrm{consistent\ with}\ d}\ p(h)/|h|}$


                                      p(h)
                                                         p(d |h)
                        p(Q(x)|d)                    h              d
           Species 1
           Species 2                                                ?
           Species 3                                                ?
           Species 4                                                ?
           Species 5
           Species 6            ...                                 ?
                                                                    ?
           Species 7                                                ?
           Species 8                                                ?
           Species 9
           Species 10                                               ?

                                        Generalization          New property
                                         Hypotheses
Probability that property Q holds for species x:
    $p(Q(x) \mid d) = \dfrac{\sum_{h\ \mathrm{consistent\ with}\ Q(x),\ d}\ p(h)/|h|}{\sum_{h\ \mathrm{consistent\ with}\ d}\ p(h)/|h|}$


                                    p(h)
                                                          p(d |h)
                                                     h               d
 Species 1
 Species 2                                                           ?
 Species 3                                                           ?
 Species 4                                                           ?
 Species 5                                                           ?
 Species 6                                                           ?
 Species 7                                                           ?
 Species 8                                                           ?
 Species 9
 Species 10                                                          ?

                  Features                     Generalization    New property
                                                Hypothesis
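A minimal sketch of this hypothesis-averaging computation, using a tiny hand-built hypothesis space with made-up prior weights (not the tree-generated prior introduced later) and the size-principle likelihood:

# hypotheses: candidate extensions of the novel property, with placeholder prior weights
hypotheses = {
    frozenset({"horse", "cow"}):                      0.3,
    frozenset({"horse", "cow", "rhino"}):             0.3,
    frozenset({"horse", "cow", "rhino", "dolphin"}):  0.2,
    frozenset({"dolphin", "seal"}):                   0.2,
}

def p_generalize(target, observed):
    """p(Q(target) | observed) with the size-principle likelihood p(d|h) = (1/|h|)^n."""
    def score(h, prior):
        if not observed <= h:          # hypothesis must contain all observed examples
            return 0.0
        return prior / len(h) ** len(observed)
    denom = sum(score(h, p) for h, p in hypotheses.items())
    numer = sum(score(h, p) for h, p in hypotheses.items() if target in h)
    return numer / denom

observed = {"horse", "cow"}
for target in ["rhino", "dolphin", "seal"]:
    print(target, round(p_generalize(target, observed), 3))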
       Specifying the prior p(h)
• A good prior must focus on a small subset of
  all 2n possible hypotheses, in order to:
  – Match the distribution of properties in the world.
  – Be learnable from limited data.
  – Be computationally efficient.
• We consider two approaches:
  – “Empiricist” Bayes: unstructured prior based
    directly on known features.
  – “Theory-based” Bayes: structured prior based on
    rational domain theory, tuned to known features.
“Empiricist”
                            h1 h2 h3 h4 h5 h6 h7 h8 h9 h10 h11 h12
               Species 1
Bayes:         Species 2
               Species 3
(Heit, 1998)   Species 4
               Species 5
               Species 6
               Species 7
               Species 8
               Species 9
               Species 10

               p(h) = 1/15, 1/15, 2/15, 1/15, 3/15, 1/15, 1/15, 1/15,
                      1/15, 1/15, 1/15, 1/15
                      (one value per hypothesis h1–h12, derived from the
                       observed feature columns)


                                                              h                d
  Species 1
  Species 2                                                                    ?
  Species 3                                                                    ?
  Species 4                                                                    ?
  Species 5                                                                    ?
  Species 6                                                                    ?
  Species 7                                                                    ?
  Species 8                                                                    ?
  Species 9
  Species 10                                                                   ?

                    Features                          Generalization        New property
                                                       Hypothesis
                          Results
               r = 0.38     r = 0.16   r = 0.79

“Empiricist”
  Bayes



               r = 0.77     r = 0.75   r = 0.94

 Max-Sim
     Why doesn’t “Empiricist”
          Bayes work?
• With no structural bias, requires too many
  features to estimate the prior reliably.
• An analogy: Estimating a smooth probability
  density function by local interpolation.




     N=5           N = 100         N = 500
     Why doesn’t “Empiricist”
          Bayes work?
• With no structural bias, requires too many
  features to estimate the prior reliably.
• An analogy: Estimating a smooth probability
  density function by local interpolation.

                             Assuming an appropriately
                             structured form for density
                             (e.g., Gaussian) leads
                             to better generalization
                             from sparse data.
     N=5           N=5
         “Theory-based” Bayes
Theory: Two principles based on the structure of
 species and properties in the natural world.
1. Species generated by an evolutionary
  branching process.
  – A tree-structured taxonomy of species (Atran,
    1998).
2. Features generated by stochastic mutation
  process and passed on to descendants.
  – Novel features can appear anywhere in tree, but
    some distributions are more likely than others.
Mutation process generates p(h|T):
   – Choose label for root.
   – Probability that label mutates along branch b:  $(1 - e^{-2\lambda |b|})\,/\,2$
   λ = mutation rate
   |b| = length of branch b
                                          (tree T over species s1 … s10 generates p(h|T))
                                                   h               d
 Species 1
 Species 2                                                         ?
 Species 3                                                         ?
 Species 4                                                         ?
 Species 5                                                         ?
 Species 6                                                         ?
 Species 7                                                         ?
 Species 8                                                         ?
 Species 9
 Species 10                                                        ?

                       Features              Generalization   New property
                                              Hypothesis
Mutation process generates p(h|T):
   – Choose label for root.
   – Probability that label mutates along branch b:  $(1 - e^{-2\lambda |b|})\,/\,2$
   λ = mutation rate
   |b| = length of branch b
                                          (tree T, with an example hypothesis h marked
                                           by "x"s, generates p(h|T))
                                                 h            d
 Species 1
 Species 2                                                    ?
 Species 3                                                    ?
 Species 4                                                    ?
 Species 5                                                    ?
 Species 6                                                    ?
 Species 7                                                    ?
 Species 8                                                    ?
 Species 9
 Species 10                                                   ?

                       Features           Generalization   New property
                                           Hypothesis
       Samples from the prior
• Labelings that cut the data along fewer
  branches are more probable:


                      >


    “monophyletic”            “polyphyletic”
        Samples from the prior
• Labelings that cut the data along longer
  branches are more probable:


                        >


   “more distinctive”         “less distinctive”
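A small simulation makes this concrete: sampling hypotheses h from the mutation process on a toy four-species tree shows that labelings which respect the tree's clusters come up far more often than labelings that cut across them. The tree topology, branch lengths, and mutation rate below are placeholders, not the trees inferred from the animal data.

import math, random
from collections import Counter

# toy tree: (node, branch length to parent, children); leaves are species names
TREE = ("root", 0.0, [
    ("i1", 1.0, [("horse", 0.5, []), ("cow", 0.5, [])]),
    ("i2", 1.0, [("dolphin", 0.5, []), ("seal", 0.5, [])]),
])
RATE = 0.3   # mutation rate lambda (placeholder)

def flip_prob(length):
    # probability the label mutates along a branch of this length
    return (1.0 - math.exp(-2.0 * RATE * length)) / 2.0

def sample_labeling(node, parent_label=None):
    name, length, children = node
    if parent_label is None:
        label = random.random() < 0.5                       # choose label for root
    else:
        mutate = random.random() < flip_prob(length)
        label = (not parent_label) if mutate else parent_label
    if not children:
        return {name: label}
    leaves = {}
    for child in children:
        leaves.update(sample_labeling(child, label))
    return leaves

random.seed(0)
counts = Counter(
    frozenset(s for s, v in sample_labeling(TREE).items() if v)
    for _ in range(20000)
)
for h, n in counts.most_common(6):
    print(sorted(h) or "{}", n / 20000.0)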
• Mutation process over tree T
  generates p(h|T).
• Message passing over tree T
  efficiently sums over all h.
• How do we know which tree T        s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
  to use?
                                 T       p(h|T)
                                              h               d
  Species 1
  Species 2                                                   ?
  Species 3                                                   ?
  Species 4                                                   ?
  Species 5                                                   ?
  Species 6                                                   ?
  Species 7                                                   ?
  Species 8                                                   ?
  Species 9
  Species 10                                                  ?

                   Features             Generalization   New property
                                         Hypothesis
The same mutation process
 generates p(Features|T):
 – Assume each feature generated
   independently over the tree.
 – Use MCMC to infer most likely
                                          s1 s2 s3 s4 s5 s6 s7 s8 s9 s10
   tree T and mutation rate l given
   observed features.                 T       p(h|T)
 – No free parameters!
                                                   h               d
  Species 1
  Species 2                                                        ?
  Species 3                                                        ?
  Species 4                                                        ?
  Species 5                                                        ?
  Species 6                                                        ?
  Species 7                                                        ?
  Species 8                                                        ?
  Species 9
  Species 10                                                       ?

                      Features               Generalization   New property
                                              Hypothesis
                            Results
                 r = 0.91     r = 0.95   r = 0.91

“Theory-based”
    Bayes

                 r = 0.38     r = 0.16   r = 0.79

“Empiricist”
  Bayes

                 r = 0.77     r = 0.75   r = 0.94

  Max-Sim
     Grounding in similarity
Reconstruct intuitive taxonomy from
 similarity judgments:




       clustering
Theory-based
   Bayes



  Max-sim



  Sum-sim



  Conclusion
        kind:   “all mammals”   “horses”   “horses”

  Number of
  examples:          3             2       1, 2, or 3
        Explaining similarity
• Why does Max-sim fit so well?
  – An efficient and accurate approximation to
     this Theory-Based Bayesian model.
  – Correlation with Bayes on three-premise general arguments,
    over 100 simulated trees: mean r = 0.94
    (figure: distribution of correlations r across the simulated trees)

  – Theorem. Nearest neighbor classification
    approximates evolutionary Bayes in the limit of
    high mutation rate, if domain is tree-structured.
Alternative feature-based models
• Taxonomic Bayes (strictly taxonomic
  hypotheses, with no mutation process)




                      >


    “monophyletic”           “polyphyletic”
    Alternative feature-based models
   • Taxonomic Bayes (strictly taxonomic
     hypotheses, with no mutation process)
   • PDP network (Rogers and McClelland)




Species


           Features
                          Results
               r = 0.91     r = 0.95   r = 0.91
                                                  Bias is
Theory-based                                        just
   Bayes                                           right!

               r = 0.51     r = 0.53   r = 0.85
                                                  Bias is
Taxonomic                                          too
  Bayes                                           strong

               r = 0.41     r = 0.62   r = 0.71
                                                  Bias is
PDP network                                        too
                                                   weak
     Mutation principle versus
      pure Occam’s Razor
• Mutation principle provides a version of
  Occam’s Razor, by favoring hypotheses that
  span fewer disjoint clusters.
• Could we use a more generic Bayesian
  Occam’s Razor, without the biological
  motivation of mutation?
Mutation process generates p(h|T):
   – Choose label for root.
   – Probability that label mutates along branch b:  $(1 - e^{-2\lambda |b|})\,/\,2$
   λ = mutation rate
   |b| = length of branch b
                                          (tree T over species s1 … s10 generates p(h|T))
                                                   h               d
 Species 1
 Species 2                                                         ?
 Species 3                                                         ?
 Species 4                                                         ?
 Species 5                                                         ?
 Species 6                                                         ?
 Species 7                                                         ?
 Species 8                                                         ?
 Species 9
 Species 10                                                        ?

                       Features              Generalization   New property
                                              Hypothesis
Mutation process generates p(h|T):
   – Choose label for root.
   – Probability that label mutates along branch b:  λ
     (constant per branch, independent of branch length)
   λ = mutation rate
   |b| = length of branch b
                                          (tree T over species s1 … s10 generates p(h|T))
                                                   h               d
 Species 1
 Species 2                                                         ?
 Species 3                                                         ?
 Species 4                                                         ?
 Species 5                                                         ?
 Species 6                                                         ?
 Species 7                                                         ?
 Species 8                                                         ?
 Species 9
 Species 10                                                        ?

                       Features              Generalization   New property
                                              Hypothesis
   Bayes
(taxonomy+
                               Premise typicality effect (Rips,
  mutation)
                                1975; Osherson et al., 1990):

   Bayes                       Strong:
(taxonomy+
                                 Horses have property P.
  Occam)
                                 All mammals have property P.


 Max-sim
                               Weak:
                                Seals have property P.
 Conclusion                     All mammals have property P.
       kind:   “all mammals”

 Number of
 examples:          1
    Typicality meets hierarchies
• Collins and Quillian: semantic memory structured
  hierarchically




• Traditional story: Simple hierarchical structure
  uncomfortable with typicality effects & exceptions.
• New story: Typicality & exceptions compatible with
  rational statistical inference over hierarchy.
      Intuitive versus scientific
         theories of biology
• Same structure for how species are related.
  – Tree-structured taxonomy.
• Same probabilistic model for traits
  – Small probability of occurring along any branch
    at any time, plus inheritance.
• Different features
  – Scientist: genes
  – People: coarse anatomy and behavior
  Induction in Biology: summary
• Theory-based Bayesian inference explains
  taxonomic inductive reasoning in folk biology.
• Insight into processing-level accounts.
  – Why Max-sim over Sum-sim in this domain?
  – How is hierarchical representation compatible
    with typicality effects & exceptions?
• Reveals essential principles of domain theory.
  – Category structure: taxonomic tree.
  – Feature distribution: stochastic mutation process +
                          inheritance.
                 The plan
• Similarity-based models
• Theory-based model
• Bayesian models
  – “Empiricist” Bayes
  – Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
  – Learning with multiple domain theories
  – Learning domain theories
Property type
  Generic “essence”

Theory Structure
  Taxonomic Tree

          Lion
          Cheetah
          Hyena
          Giraffe
          Gazelle
          Gorilla
          Monkey


Lion
Cheetah
Hyena
Giraffe             ...
Gazelle
Gorilla
Monkey
Property type
  Generic “essence”       Size-related                  Food-carried

Theory Structure
  Taxonomic Tree          Dimensional                   Directed Acyclic
                                                         Network
  (figures: taxonomic tree over the seven species; dimensional ordering by size —
   Giraffe, Lion, Gorilla, Hyena, Gazelle, Cheetah, Monkey; food-web network
   among the species)

Lion
Cheetah
Hyena
Giraffe             ...                  ...                            ...
Gazelle
Gorilla
Monkey
    One-dimensional predicates
• Q = “Have skins that are more resistant to
  penetration than most synthetic fibers”.
  – Unknown relevant property: skin toughness
  – Model influence of known properties via judged
    prior probability that each species has Q.

                          threshold for Q




                                       Skin toughness

 House cat   Camel Elephant   Rhino
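A rough sketch of such a one-dimensional threshold model: hypotheses are thresholds on the underlying dimension, species above the threshold have Q, and generalization averages over the thresholds consistent with the examples. The positions on the dimension and the uniform prior below are invented placeholders (the actual model uses judged prior probabilities), so treat this as an illustration of the idea only.

import numpy as np

species    = ["house cat", "camel", "elephant", "rhino"]
toughness  = np.array([0.1, 0.5, 0.8, 0.9])        # assumed positions on the dimension
thresholds = np.linspace(0.0, 1.0, 101)            # hypothesis space: possible thresholds for Q
prior      = np.ones_like(thresholds) / len(thresholds)   # uniform prior (placeholder)

def p_has_Q(target, observed):
    # a threshold hypothesis is consistent iff every observed Q-species exceeds it
    consistent = np.all(toughness[observed][:, None] >= thresholds[None, :], axis=0)
    post = prior * consistent
    post = post / post.sum()
    # the target has Q under exactly those thresholds it exceeds
    return post[thresholds <= toughness[target]].sum()

# observe that camels (index 1) have Q
print("P(elephant has Q)  =", round(p_has_Q(2, [1]), 3))
print("P(house cat has Q) =", round(p_has_Q(0, [1]), 3))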
           One-dimensional predicates

   Bayes
(taxonomy+
  mutation)



 Max-sim




  Bayes
(1D model)
Food web model fits (Shafto et al.)

                  Mammals        Island
   Disease        r = 0.77       r = 0.82
   Property       r = -0.35      r = -0.05

Taxonomic tree model fits (Shafto et al.)

                  Mammals        Island
   Disease        r = -0.12      r = 0.16
   Property       r = 0.81       r = 0.62
                 The plan
• Similarity-based models
• Theory-based model
• Bayesian models
  – “Empiricist” Bayes
  – Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
  – Learning with multiple domain theories
  – Learning domain theories
Theory      • Species organized in taxonomic tree structure
            • Feature i generated by mutation process with rate λi

                                    p(S|T)
Domain           (tree over species S3 S4 S1 S2 S9 S10 S5 S6 S7 S8,
Structure         with features F1–F14 placed on its branches)

                                    p(D|S)
               Species 1
               Species 2
               Species 3
               Species 4
               Species 5
               Species 6
Data           Species 7
               Species 8
               Species 9
               Species 10

                                             λ10 high ~ weight low (a feature with a high mutation rate carries little weight)
Theory      • Species organized in taxonomic tree structure
            • Feature i generated by mutation process with rate λi

                                    p(S|T)
Domain           (same tree over species S3 S4 S1 S2 S9 S10 S5 S6 S7 S8,
Structure         with features F1–F14 placed on its branches)

                                              p(D|S)
               Species 1
               Species 2
               Species 3
               Species 4
               Species 5
               Species 6
Data           Species 7
               Species 8
               Species 9
               Species 10

               Species X    ?   ?   ?   ? ?    ?    ?     ?     ?      ?     ?   ?     ?
Theory      • Species organized in taxonomic tree structure
            • Feature i generated by mutation process with rate λi

                                    p(S|T)
Domain           (same tree over species S3 S4 S1 S2 S9 S10 S5 S6 S7 S8,
Structure         with features F1–F14 placed on its branches)

                                    p(D|S)          SX
               Species 1
               Species 2
               Species 3
               Species 4
               Species 5
               Species 6
Data           Species 7
               Species 8
               Species 9
               Species 10

               Species X
  Where does the domain theory
          come from?
• Innate.
  – Atran (1998): The tendency to group living
    kinds into hierarchies reflects an “innately
    determined cognitive structure”.

• Emerges (only approximately) through
  learning in unstructured connectionist
  networks.
  – McClelland and Rogers (2003).
  Bayesian inference to theories
• Challenge to the nativist-empiricist
  dichotomy.
  – We really do have structured domain theories.
  – We really do learn them.

• Bayesian framework applies over multiple
  levels:
  – Given hypothesis space + data, infer concepts.
  – Given theory + data, infer hypothesis space.
  – Given X + data, infer theory.
    Bayesian inference to theories
• Candidate theories for biological species and
  their features:
  – T0: Features generated independently for each species. (c.f.
    naive Bayes, Anderson’s rational model.)
  – T1: Features generated by mutation in tree-structured
    taxonomy of species.
  – T2: Features generated by mutation in a one-dimensional
    chain of species.
• Score theories by likelihood on the object-feature matrix:
     $p(D \mid T) = \sum_{S} p(D \mid S, T)\, p(S \mid T) \;\approx\; \max_{S}\, p(D \mid S, T)\, p(S \mid T)$
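To make the comparison concrete, here is a toy sketch of scoring two theories on a tiny binary feature matrix: T0 treats every feature as independent across species, while the tree theory scores each feature column by marginalizing the mutation process over ancestral labels of one fixed toy tree (a real analysis would also search over trees S, e.g. by MCMC as above). Tree shape, branch lengths, and mutation rate are placeholders.

import math
from itertools import product

RATE = 0.3   # mutation rate (placeholder)

def branch_prob(length, same_label):
    flip = (1.0 - math.exp(-2.0 * RATE * length)) / 2.0
    return 1.0 - flip if same_label else flip

# toy tree: root -> {i1, i2}; i1 -> {s1, s2}; i2 -> {s3, s4}; branch lengths to each child
CHILDREN = {"root": [("i1", 1.0), ("i2", 1.0)],
            "i1":   [("s1", 0.5), ("s2", 0.5)],
            "i2":   [("s3", 0.5), ("s4", 0.5)]}
LEAVES = ["s1", "s2", "s3", "s4"]

def p_column_given_tree(column):
    # marginalize the mutation process over labels of the internal nodes (root, i1, i2)
    total = 0.0
    for root, a, b in product([0, 1], repeat=3):
        labels = dict(zip(LEAVES, column), root=root, i1=a, i2=b)
        p = 0.5                                   # uniform prior on the root label
        for parent, kids in CHILDREN.items():
            for kid, length in kids:
                p *= branch_prob(length, labels[parent] == labels[kid])
        total += p
    return total

def p_column_independent(column):
    # T0: each species has the feature independently with probability 1/2
    return 0.5 ** len(column)

# feature columns that respect the tree's two clusters vs. one that cuts across them
for col in [(1, 1, 0, 0), (0, 0, 1, 1), (1, 0, 1, 0)]:
    print(col, " tree:", round(p_column_given_tree(col), 4),
          " independent:", round(p_column_independent(col), 4))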
T0:
• No organizational structure
  for species.
• Features distributed
  independently over species.

  (figure: features F1–F14 listed under each species S1–S10, assigned
   independently with no cluster structure)


                   Species 1
                   Species 2
                   Species 3
                   Species 4
Data               Species 5
                   Species 6
                   Species 7
                   Species 8
                   Species 9
                   Species 10

                                              Features
T0:
• No organizational structure
  for species.
• Features distributed
  independently over species.


  (figure: features F1–F14 listed under each species S1–S10)


                   Species 1
                   Species 2
                   Species 3
                   Species 4
Data               Species 5
                   Species 6
                   Species 7
                   Species 8
                   Species 9
                   Species 10

                                                 Features
T0:                                                         T1:
• No organizational structure                               • Species organized in
  for species.                                                taxonomic tree structure.
• Features distributed                                      • Features distributed via
  independently over species.                                 stochastic mutation process.

  (figures — left: features F1–F14 listed under each species S1–S10;
   right: tree over species S3 S4 S1 S2 S9 S10 S5 S6 S7 S8 with
   features F1–F14 placed on its branches)


                   Species 1
                   Species 2
                   Species 3
                   Species 4
Data               Species 5
                   Species 6
                   Species 7
                   Species 8
                   Species 9
                   Species 10

                                                 Features
T0: p(Data|T0) ≈ 1.83 × 10^-41                               T1: p(Data|T1) ≈ 2.42 × 10^-32
• No organizational structure                               • Species organized in
  for species.                                                taxonomic tree structure.
• Features distributed                                      • Features distributed via
  independently over species.                                 stochastic mutation process.

  (figures — left: features F1–F14 listed under each species S1–S10;
   right: tree over species S3 S4 S1 S2 S9 S10 S5 S6 S7 S8 with
   features F1–F14 placed on its branches)


                   Species 1
                   Species 2
                   Species 3
                   Species 4
Data               Species 5
                   Species 6
                   Species 7
                   Species 8
                   Species 9
                   Species 10

                                                 Features
T0:                                                      T1:
• No organizational structure                            • Species organized in
  for species.                                             taxonomic tree structure.
• Features distributed                                   • Features distributed via
  independently over species.                              stochastic mutation process.

  (figures — left: features F1–F14 listed under each species S1–S10;
   right: best-fitting tree over species S2 S4 S7 S10 S8 S1 S9 S6 S3 S5
   with features F1–F14 placed on its branches)


                   Species 1
                   Species 2
                   Species 3
                   Species 4
Data               Species 5
                   Species 6
                   Species 7
                   Species 8
                   Species 9
                   Species 10

                                              Features
T0: p(Data|T1) ~ 2.29 x 10-42                            T1: p(Data|T2) ~ 4.38 x 10-53
• No organizational structure                            • Species organized in
  for species.                                             taxonomic tree structure.
• Features distributed                                   • Features distributed via
  independently over species.                              stochastic mutation process.
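To make the model comparison concrete, here is a minimal sketch (in Python) of how the marginal likelihood under T0 might be computed, assuming binary features and a Beta(1,1) prior on each feature's unknown base rate. The function name, prior, and example data are illustrative, not the authors' implementation; the tree model T1 would additionally require summing over mutation histories on the tree, which is not shown.

import numpy as np
from scipy.special import betaln

def log_marginal_T0(data, a=1.0, b=1.0):
    """Log marginal likelihood of a binary species-by-feature matrix under T0:
    each feature is an independent Bernoulli whose unknown rate has a Beta(a, b)
    prior, and species are exchangeable (no tree structure).
    For one feature column: p(column) = B(a + #ones, b + #zeros) / B(a, b)."""
    data = np.asarray(data)
    n_species = data.shape[0]
    ones = data.sum(axis=0)            # feature present: count per feature
    zeros = n_species - ones           # feature absent: count per feature
    return np.sum(betaln(a + ones, b + zeros) - betaln(a, b))

# Hypothetical 10-species x 14-feature matrix (values made up for illustration):
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(10, 14))
print(log_marginal_T0(data))           # compare against log p(Data|T1) from a tree model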

              Empirical tests
• Synthetic data: 32 objects, 120 features
  – tree-structured generative model
  – linear chain generative model
  – unconstrained (independent features).
• Real data
  – Animal feature judgments: 48 species, 85
    features.
  – US Supreme Court decisions, 1981-1985: 9
    people, 637 cases.
Results

  Dataset                                     Preferred model
  Synthetic: unconstrained (independent)      Null
  Synthetic: tree-structured                  Tree
  Synthetic: linear chain                     Linear
  Animal feature judgments                    Tree
  Supreme Court decisions                     Linear
  Theory acquisition: summary
• So far, just a computational proof of concept.
• Future work:
  – Experimental studies of theory acquisition in the
    lab, with adult and child subjects.
  – Modeling developmental or historical trajectories
    of theory change.
• Sources of hypotheses for candidate theories:
  – What is innate?
  – Role of analogy?
                     Outline
• Morning
  – Introduction (Josh)
  – Basic case study #1: Flipping coins (Tom)
  – Basic case study #2: Rules and similarity (Josh)
• Afternoon
  – Advanced case study #1: Causal induction (Tom)
  – Advanced case study #2: Property induction (Josh)
  – Quick tour of more advanced topics (Tom)
Advanced topics
       Structure and statistics
• Statistical language modeling
  – topic models


• Relational categorization
  – attributes and relations
      Statistical language modeling
• A variety of approaches to statistical language
  modeling are used in cognitive science
  – e.g. LSA                        (Landauer & Dumais, 1997)
  – distributional clustering (Redington, Chater, & Finch, 1998)
• Generative models have unique advantages
  – identify assumed causal structure of language
  – make use of standard tools of Bayesian statistics
  – easily extended to capture more complex structure
Generative models for language

          latent structure
                 ↓
          observed data
Generative models for language

              meaning
                 ↓
             sentences
                    Topic models
• Each document a mixture of topics
• Each word chosen from a single topic



• Introduced by Blei, Ng, and Jordan (2001) as a
  reinterpretation of PLSI (Hofmann, 1999)
• Idea of probabilistic topics widely used
  (e.g. Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003)
    Generating a document

        θ             distribution over topics

    z    z    z       topic assignments

    w    w    w       observed words
P(w|z = 1) = φ(1)              P(w|z = 2) = φ(2)
HEART              0.2     HEART             0.0
LOVE               0.2     LOVE              0.0
SOUL               0.2     SOUL              0.0
TEARS              0.2     TEARS             0.0
JOY                0.2     JOY               0.0
SCIENTIFIC         0.0     SCIENTIFIC        0.2
KNOWLEDGE          0.0     KNOWLEDGE         0.2
WORK               0.0     WORK              0.2
RESEARCH           0.0     RESEARCH          0.2
MATHEMATICS        0.0     MATHEMATICS       0.2
     topic 1                     topic 2
 Choose mixture weights for each document, generate “bag of words”
θ = {P(z = 1), P(z = 2)}
                           MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS
        {0, 1}                 RESEARCH WORK SCIENTIFIC MATHEMATICS WORK

                               SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC
     {0.25, 0.75}                   HEART LOVE TEARS KNOWLEDGE HEART


                               MATHEMATICS HEART RESEARCH LOVE MATHEMATICS
      {0.5, 0.5}                    WORK TEARS SOUL KNOWLEDGE HEART


     {0.75, 0.25}                    WORK JOY SOUL TEARS MATHEMATICS
                                        TEARS LOVE LOVE LOVE SOUL


        {1, 0}               TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
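The generative process just illustrated is short enough to write down directly. The sketch below (Python) uses the two topic distributions φ(1), φ(2) and the mixture weights θ from the slides; the function name, document length, and random seed are illustrative choices.

import numpy as np

# Topic-word distributions phi(1) and phi(2) from the example above
topics = [
    {"HEART": 0.2, "LOVE": 0.2, "SOUL": 0.2, "TEARS": 0.2, "JOY": 0.2},
    {"SCIENTIFIC": 0.2, "KNOWLEDGE": 0.2, "WORK": 0.2,
     "RESEARCH": 0.2, "MATHEMATICS": 0.2},
]
rng = np.random.default_rng(0)

def generate_document(theta, n_words=10):
    """theta = [P(z = 1), P(z = 2)] are this document's mixture weights.
    For each word: sample a topic z from theta, then a word from phi(z)."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choice(vocab, p=probs))
    return " ".join(words)

for theta in ([0.0, 1.0], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1.0, 0.0]):
    print(theta, generate_document(theta))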
       A selection of topics (from 500)
    THEORY          SPACE         ART     STUDENTS      BRAIN      CURRENT       NATURE       THIRD
  SCIENTISTS       EARTH         PAINT    TEACHER       NERVE    ELECTRICITY      WORLD        FIRST
 EXPERIMENT         MOON         ARTIST    STUDENT      SENSE      ELECTRIC       HUMAN      SECOND
OBSERVATIONS       PLANET      PAINTING   TEACHERS     SENSES       CIRCUIT    PHILOSOPHY     THREE
  SCIENTIFIC      ROCKET        PAINTED   TEACHING        ARE          IS         MORAL      FOURTH
EXPERIMENTS         MARS        ARTISTS      CLASS    NERVOUS     ELECTRICAL   KNOWLEDGE       FOUR
 HYPOTHESIS         ORBIT      MUSEUM    CLASSROOM     NERVES      VOLTAGE      THOUGHT       GRADE
    EXPLAIN    ASTRONAUTS        WORK       SCHOOL      BODY         FLOW        REASON        TWO
   SCIENTIST        FIRST     PAINTINGS   LEARNING      SMELL      BATTERY        SENSE       FIFTH
  OBSERVED     SPACECRAFT        STYLE       PUPILS     TASTE        WIRE           OUR     SEVENTH
EXPLANATION       JUPITER     PICTURES    CONTENT      TOUCH         WIRES        TRUTH       SIXTH
     BASED       SATELLITE       WORKS  INSTRUCTION   MESSAGES      SWITCH       NATURAL     EIGHTH
OBSERVATION     SATELLITES        OWN      TAUGHT     IMPULSES    CONNECTED     EXISTENCE      HALF
      IDEA     ATMOSPHERE    SCULPTURE       GROUP       CORD     ELECTRONS        BEING      SEVEN
   EVIDENCE     SPACESHIP       PAINTER     GRADE      ORGANS     RESISTANCE        LIFE        SIX
   THEORIES       SURFACE         ARTS     SHOULD      SPINAL       POWER          MIND      SINGLE
   BELIEVED     SCIENTISTS   BEAUTIFUL     GRADES       FIBERS   CONDUCTORS     ARISTOTLE     NINTH
 DISCOVERED    ASTRONAUT       DESIGNS     CLASSES    SENSORY      CIRCUITS     BELIEVED        END
   OBSERVE        SATURN      PORTRAIT        PUPIL      PAIN        TUBE      EXPERIENCE     TENTH
     FACTS          MILES      PAINTERS      GIVEN         IS      NEGATIVE      REALITY    ANOTHER
         A selection of topics (from 500)
DISEASE          WATER        MIND            STORY        FIELD       SCIENCE       BALL         JOB
BACTERIA         FISH         WORLD           STORIES      MAGNETIC    STUDY         GAME         WORK
DISEASES         SEA          DREAM           TELL         MAGNET      SCIENTISTS    TEAM         JOBS
GERMS            SWIM         DREAMS          CHARACTER    WIRE        SCIENTIFIC    FOOTBALL     CAREER
FEVER            SWIMMING     THOUGHT         CHARACTERS   NEEDLE      KNOWLEDGE     BASEBALL     EXPERIENCE
CAUSE            POOL         IMAGINATION     AUTHOR       CURRENT     WORK          PLAYERS      EMPLOYMENT
CAUSED           LIKE         MOMENT          READ         COIL        RESEARCH      PLAY         OPPORTUNITIES
SPREAD           SHELL        THOUGHTS        TOLD         POLES       CHEMISTRY     FIELD        WORKING
VIRUSES          SHARK        OWN             SETTING      IRON        TECHNOLOGY    PLAYER       TRAINING
INFECTION        TANK         REAL            TALES        COMPASS     MANY          BASKETBALL   SKILLS
VIRUS            SHELLS       LIFE            PLOT         LINES       MATHEMATICS   COACH        CAREERS
MICROORGANISMS   SHARKS       IMAGINE         TELLING      CORE        BIOLOGY       PLAYED       POSITIONS
PERSON           DIVING       SENSE           SHORT        ELECTRIC    FIELD         PLAYING      FIND
INFECTIOUS       DOLPHINS     CONSCIOUSNESS   FICTION      DIRECTION   PHYSICS       HIT          POSITION
COMMON           SWAM         STRANGE         ACTION       FORCE       LABORATORY    TENNIS       FIELD
CAUSING          LONG         FEELING         TRUE         MAGNETS     STUDIES       TEAMS        OCCUPATIONS
SMALLPOX         SEAL         WHOLE           EVENTS       BE          WORLD         GAMES        REQUIRE
BODY             DIVE         BEING           TELLS        MAGNETISM   SCIENTIST     SPORTS       OPPORTUNITY
INFECTIONS       DOLPHIN      MIGHT           TALE         POLE        STUDYING      BAT          EARN
CERTAIN          UNDERWATER   HOPE            NOVEL        INDUCED     SCIENCES      TERRY        ABLE
Learning topic hierarchies




       (Blei, Griffiths, Jordan, & Tenenbaum, 2004)
  Syntax and semantics from statistics
Factorization of language based on statistical dependency patterns:
• long-range, document-specific dependencies
  → semantics: probabilistic topics (θ and a topic assignment z for each word w)
• short-range dependencies, constant across all documents
  → syntax: probabilistic regular grammar (a chain of syntactic classes x)

[Figure: graphical model combining the two, with each word w depending on both
 its topic assignment z and its syntactic class x]

(Griffiths, Steyvers, Blei, & Tenenbaum, submitted)
[Figure, repeated over several slides as the example sentence grows: the combined
 model generating a sentence one word at a time.

 Syntactic class x = 1 emits a content word from one of the document's topics:
     z = 1 (weight 0.4): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2
     z = 2 (weight 0.6): SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
 Syntactic class x = 2 emits function words: OF 0.6, FOR 0.3, BETWEEN 0.1
 Syntactic class x = 3 emits determiners:    THE 0.6, A 0.3, MANY 0.1
 Arrows between the classes carry transition probabilities (0.8, 0.7, 0.9, 0.3, 0.2, 0.1 in the figure).

 Following the chain generates the example sentence word by word:
     THE → THE LOVE → THE LOVE OF → THE LOVE OF RESEARCH]
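A rough sketch of the kind of process the example steps through: an HMM over three syntactic classes, where one class emits a content word from the document's topics. The word distributions are those shown above; the transition matrix and starting class are illustrative guesses (the exact probabilities on the figure's arrows cannot be recovered from the text), so this is a sketch of the model family rather than the fitted model.

import numpy as np

rng = np.random.default_rng(1)

# Syntactic classes: 0 = content word (drawn from a topic), 1 = function word, 2 = determiner
trans = np.array([[0.10, 0.80, 0.10],     # illustrative transition probabilities,
                  [0.10, 0.10, 0.80],     # not the figure's exact values
                  [0.90, 0.05, 0.05]])
start = np.array([0.0, 0.0, 1.0])         # begin with a determiner, as in "THE ..."

function_words = (["OF", "FOR", "BETWEEN"], [0.6, 0.3, 0.1])
determiners    = (["THE", "A", "MANY"],     [0.6, 0.3, 0.1])
topics = [(["HEART", "LOVE", "SOUL", "TEARS", "JOY"], [0.2] * 5),
          (["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"], [0.2] * 5)]
theta = [0.4, 0.6]                        # this document's topic weights

def generate(n_words=4):
    words, x = [], rng.choice(3, p=start)
    for _ in range(n_words):
        if x == 0:                        # content word: choose topic z, then a word
            z = rng.choice(2, p=theta)
            vocab, probs = topics[z]
        elif x == 1:
            vocab, probs = function_words
        else:
            vocab, probs = determiners
        words.append(rng.choice(vocab, p=probs))
        x = rng.choice(3, p=trans[x])     # move to the next syntactic class
    return " ".join(words)

print(generate())                         # strings in the style of "THE LOVE OF RESEARCH"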
                     Semantic categories
      FOOD       MAP        DOCTOR       BOOK      GOLD      BEHAVIOR      CELLS     PLANTS
     FOODS     NORTH         PATIENT     BOOKS     IRON         SELF        CELL      PLANT
      BODY     EARTH         HEALTH    READING    SILVER    INDIVIDUAL ORGANISMS     LEAVES
   NUTRIENTS   SOUTH        HOSPITAL INFORMATION COPPER   PERSONALITY     ALGAE       SEEDS
      DIET      POLE        MEDICAL    LIBRARY    METAL      RESPONSE   BACTERIA       SOIL
       FAT      MAPS          CARE      REPORT   METALS        SOCIAL  MICROSCOPE     ROOTS
     SUGAR    EQUATOR       PATIENTS      PAGE    STEEL     EMOTIONAL   MEMBRANE    FLOWERS
    ENERGY      WEST          NURSE      TITLE     CLAY      LEARNING   ORGANISM      WATER
      MILK      LINES       DOCTORS    SUBJECT     LEAD       FEELINGS     FOOD       FOOD
     EATING     EAST       MEDICINE      PAGES    ADAM   PSYCHOLOGISTS    LIVING      GREEN
     FRUITS  AUSTRALIA      NURSING      GUIDE      ORE    INDIVIDUALS     FUNGI       SEED
  VEGETABLES   GLOBE      TREATMENT     WORDS   ALUMINUM PSYCHOLOGICAL     MOLD       STEMS
     WEIGHT    POLES         NURSES    MATERIAL  MINERAL   EXPERIENCES MATERIALS     FLOWER
      FATS   HEMISPHERE    PHYSICIAN    ARTICLE    MINE   ENVIRONMENT    NUCLEUS       STEM
     NEEDS    LATITUDE     HOSPITALS   ARTICLES   STONE        HUMAN      CELLED       LEAF
CARBOHYDRATES PLACES            DR       WORD   MINERALS     RESPONSES STRUCTURES   ANIMALS
    VITAMINS    LAND           SICK      FACTS      POT      BEHAVIORS  MATERIAL      ROOT
   CALORIES    WORLD       ASSISTANT    AUTHOR   MINING      ATTITUDES STRUCTURE     POLLEN
    PROTEIN   COMPASS     EMERGENCY   REFERENCE  MINERS   PSYCHOLOGY      GREEN     GROWING
   MINERALS  CONTINENTS    PRACTICE       NOTE      TIN        PERSON     MOLDS       GROW
                    Syntactic categories
    SAID     THE       MORE         ON        GOOD        ONE         HE         BE
   ASKED      HIS       SUCH        AT       SMALL      SOME         YOU       MAKE
 THOUGHT    THEIR       LESS       INTO       NEW       MANY        THEY        GET
   TOLD     YOUR       MUCH       FROM    IMPORTANT      TWO           I       HAVE
    SAYS     HER      KNOWN        WITH      GREAT       EACH        SHE         GO
  MEANS       ITS       JUST    THROUGH      LITTLE       ALL        WE        TAKE
  CALLED      MY      BETTER       OVER      LARGE      MOST          IT         DO
   CRIED     OUR      RATHER     AROUND         *        ANY       PEOPLE      FIND
  SHOWS      THIS    GREATER    AGAINST        BIG      THREE     EVERYONE      USE
ANSWERED    THESE     HIGHER     ACROSS       LONG       THIS      OTHERS       SEE
   TELLS       A      LARGER       UPON       HIGH      EVERY    SCIENTISTS    HELP
  REPLIED     AN      LONGER     TOWARD    DIFFERENT   SEVERAL    SOMEONE      KEEP
 SHOUTED     THAT     FASTER      UNDER     SPECIAL      FOUR       WHO        GIVE
EXPLAINED    NEW     EXACTLY      ALONG        OLD       FIVE      NOBODY      LOOK
 LAUGHED    THOSE    SMALLER       NEAR     STRONG       BOTH        ONE       COME
  MEANT     EACH    SOMETHING    BEHIND      YOUNG        TEN    SOMETHING     WORK
  WROTE       MR      BIGGER        OFF     COMMON        SIX      ANYONE      MOVE
 SHOWED      ANY       FEWER      ABOVE      WHITE      MUCH     EVERYBODY      LIVE
 BELIEVED    MRS      LOWER       DOWN       SINGLE    TWENTY       SOME        EAT
WHISPERED     ALL    ALMOST      BEFORE     CERTAIN     EIGHT       THEN      BECOME
  Statistical language modeling
• Generative models provide
  – transparent assumptions about causal process
  – opportunities to combine and extend models
• Richer generative models...
  – probabilistic context-free grammars
  – paragraph or sentence-level dependencies
  – more complex semantics
       Structure and statistics
• Statistical language modeling
  – topic models


• Relational categorization
  – attributes and relations
     Relational categorization
• Most approaches to categorization in
  psychology and machine learning focus on
  attributes - properties of objects
  – words in titles of CogSci posters
• But… a significant portion of knowledge is
  organized in terms of relations
  – co-authors on posters
  – who talks to whom
                      (Kemp, Griffiths, & Tenenbaum, 2004)
                   Attributes and relations
Data                                      Model

X: objects × attributes matrix            P(X) = Σ_z [ Π_{i,k} P(x_ik | z_i) ] [ Π_i P(z_i) ]
                                          mixture model (cf. Anderson, 1990)

Y: objects × objects relation matrix      P(Y) = Σ_z [ Π_{i,j} P(y_ij | z_i, z_j) ] [ Π_i P(z_i) ]
                                          stochastic blockmodel
          Stochastic blockmodels
• For any pair of objects, (i,j), probability of
  relation is determined by classes, (zi, zj)
  Each object i has a type z_i (collected in Z), and L is a matrix of link
  probabilities, with λ_ab the probability of a relation from an object of
  type a to an object of type b:

                   to type j
    from type i  [ λ11  λ12  λ13 ]
                 [ λ21  λ22  λ23 ]  =  L
                 [ λ31  λ32  λ33 ]

  P(Z, L | Y) ∝ P(Y | Z, L) P(Z) P(L)

• Allows types of objects and class
  probabilities to be learned from data
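A toy sketch of the generative story and of the unnormalized posterior score P(Y|Z,L) P(Z) P(L) described on this slide. The number of objects and types, the priors, and all names below are illustrative assumptions chosen to make the example self-contained, not the settings used for the word-association or actor data.

import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(2)
n_objects, n_types = 12, 3

# Generative story: sample a type for each object, link probabilities between types,
# then each directed relation y_ij as a Bernoulli draw with rate lambda_{z_i, z_j}.
z = rng.integers(n_types, size=n_objects)              # type of each object
L = rng.beta(1, 1, size=(n_types, n_types))            # lambda_ab = P(link | types a, b)
Y = (rng.random((n_objects, n_objects)) < L[z[:, None], z[None, :]]).astype(int)

def log_score(Y, z, L, n_types=3, a=1.0, b=1.0):
    """Log of the unnormalized posterior P(Y|Z,L) P(Z) P(L): Bernoulli likelihood
    over all ordered pairs, a uniform prior over types, and Beta(a, b) on each lambda."""
    p = L[z[:, None], z[None, :]]                       # link probability for every pair
    loglik = np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))
    logprior_z = len(z) * np.log(1.0 / n_types)
    logprior_L = np.sum(beta_dist.logpdf(L, a, b))
    return loglik + logprior_z + logprior_L

print(log_score(Y, z, L))   # a search or MCMC over z and L would optimize / sample this score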
    Stochastic blockmodels

  [Figure: two small example graphs over objects A, B, C, D, shown with their
   adjacency matrices, illustrating the kind of relational data a blockmodel
   describes]
         Categorizing words
• Relational data: word association norms
                  (Nelson, McEvoy, & Schreiber, 1998)


• 5018 x 5018 matrix of associations
  – symmetrized
  – all words with < 50 and > 10 associates
  – 2513 nodes, 34716 links
       Categorizing words

    BAND        TIE       SEW       WASH
INSTRUMENT     COAT    MATERIAL     LIQUID
    BLOW      SHOES      WOOL     BATHROOM
    HORN       ROPE      YARN        SINK
   FLUTE     LEATHER     WEAR      CLEANER
   BRASS       SHOE       TEAR      STAIN
   GUITAR       HAT       FRAY      DRAIN
   PIANO      PANTS      JEANS      DISHES
    TUBA     WEDDING    COTTON       TUB
  TRUMPET     STRING    CARPET      SCRUB
          Categorizing actors
• Internet Movie Database (IMDB) data, from
  the start of cinema to 1960 (Jeremy Kubica)
• Relational data: collaboration
• 5000 x 5000 matrix of most prolific actors
  – all actors with < 400 and > 1 collaborators
  – 2275 nodes, 204761 links
                   Categorizing actors

   Albert Lieven     Moore Marriott        Gino Cervi        Archie Ricks
  Karel Stepanek    Laurence Hanray        Nadia Gray        Helen Gibson
    Walter Rilla    Gus McNaughton        Enrico Glori       Oscar Gahan
  Anton Walbrook     Gordon Harker        Paolo Stoppa       Buck Moulton
                      Helen Haye         Bernardi Nerio      Buck Connors
                     Alfred Goddard     Amedeo Nazzari      Clyde McClary
                    Morland Graham     Gina Lollobrigida    Barney Beasley
                   Margaret Lockwood      Aldo Silvani       Buck Morgan
                      Hal Gordon       Franco Interlenghi     Tex Phelps
                   Bromley Davenport     Guido Celano       George Sowards


Germany  UK        British comedy          Italian          US Westerns
       Structure and statistics
• Bayesian approach allows us to specify
  structured probabilistic models
• Explore novel representations and domains
  – topics for semantic representation
  – relational categorization
• Use powerful methods for inference,
  developed in statistics and machine learning
        Other methods and tools...
• Inference algorithms
   –   belief propagation
   –   dynamic programming
   –   the EM algorithm and variational methods
   –   Markov chain Monte Carlo
• More complex models
   – Dirichlet processes and Bayesian non-parametrics
   – Gaussian processes and kernel methods

Reading list at http://www.bayesiancognition.com
Taking stock
 Bayesian models of inductive learning
  • Inductive leaps can be explained with
    hierarchical Theory-based Bayesian models:


                 Domain Theory
                       ↓
              Structural Hypotheses
                       ↓
                     Data

   (downward: probabilistic generative model;  upward: Bayesian inference)
Bayesian models of inductive learning
• Inductive leaps can be explained with
  hierarchical Theory-based Bayesian models:


                   T

        S          S          S    ...
     D D D      D D D       D D   D ...
Bayesian models of inductive learning
• Inductive leaps can be explained with
  hierarchical Theory-based Bayesian models.
• What the approach offers:
  – Strong quantitative models of generalization
    behavior.
  – Flexibility to model the different patterns of reasoning
    seen in different tasks and domains, using differently
    structured theories but the same general-purpose
    Bayesian engine.
  – Framework for explaining why inductive generalization
    works, and where knowledge comes from as well as how
    it is used.
Bayesian models of inductive learning
• Inductive leaps can be explained with
  hierarchical Theory-based Bayesian models.
• Challenges:
  – Theories are hard.
Bayesian models of inductive learning
• Inductive leaps can be explained with
  hierarchical Theory-based Bayesian models.
• The interaction between structure and
  statistics is crucial.
  – How structured knowledge supports statistical
    learning, by constraining hypothesis spaces.
  – How statistics supports reasoning with and
    learning structured knowledge.
  – How complex structures can grow from data,
    rather than being fully specified in advance.
