
Bayesian models of inductive learning Josh Tenenbaum & Tom Griffiths MIT Computational Cognitive Science Group Department of Brain and Cognitive Sciences Computer Science and AI Lab (CSAIL) What to expect • What you’ll get out of this tutorial: – Our view of what Bayesian models have to offer cognitive science. – In-depth examples of basic and advanced models: how the math works & what it buys you. – Some comparison to other approaches. – Opportunities to ask questions. • What you won’t get: – Detailed, hands-on how-to. – Where you can learn more: http://bayesiancognition.com Outline • Morning – Introduction (Josh) – Basic case study #1: Flipping coins (Tom) – Basic case study #2: Rules and similarity (Josh) • Afternoon – Advanced case study #1: Causal induction (Tom) – Advanced case study #2: Property induction (Josh) – Quick tour of more advanced topics (Tom) Bayesian models in cognitive science • Vision • Motor control • Memory • Language • Inductive learning and reasoning… Everyday inductive leaps • Learning concepts and words from examples “horse” “horse” “horse” Learning concepts and words “tufa” “tufa” “tufa” Can you pick out the tufas? Inductive reasoning Input (premises): Cows can get Hick’s disease. Gorillas can get Hick’s disease. Conclusion: All mammals can get Hick’s disease. Task: Judge how likely the conclusion is to be true, given that the premises are true. Inferring causal relations Input: daily records of taking vitamin B23 and having a headache: Day 1: yes, no; Day 2: yes, yes; Day 3: no, yes; Day 4: yes, no; … Does vitamin B23 cause headaches? Task: Judge the probability of a causal link given several joint observations. Everyday inductive leaps How can we learn so much about . . .
– Properties of natural kinds – Meanings of words – Future outcomes of a dynamic process – Hidden causal properties of an object – Causes of a person’s action (beliefs, goals) – Causal laws governing a domain . . . from such limited data? The Challenge • How do we generalize successfully from very limited data? – Just one or a few examples – Often only positive examples • Philosophy: – Induction is a “problem”, a “riddle”, a “paradox”, a “scandal”, or a “myth”. • Machine learning and statistics: – Focus on generalization from many examples, both positive and negative. Rational statistical inference (Bayes, Laplace) Posterior probability ∝ Likelihood × Prior probability: p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′), where the denominator sums over the space of hypotheses H. Bayesian models of inductive learning: some recent history • Shepard (1987) – Analysis of one-shot stimulus generalization, to explain the universal exponential law. • Anderson (1990) – Models of categorization and causal induction. • Oaksford & Chater (1994) – Model of conditional reasoning (Wason selection task). • Heit (1998) – Framework for category-based inductive reasoning. Theory-Based Bayesian Models • Rational statistical inference (Bayes): p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′) • Learners’ domain theories generate their hypothesis space H and prior p(h). – Well-matched to structure of the natural world. – Learnable from limited data. – Computationally tractable inference. What is a theory? • Working definition – An ontology and a system of abstract principles that generates a hypothesis space of candidate world structures along with their relative probabilities. • Analogy to grammar in language. • Example: Newton’s laws Structure and statistics • A framework for understanding how structured knowledge and statistical inference interact. – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning.
– How simplicity trades off with fit to the data in evaluating structural hypotheses. – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. Structure and statistics • A framework for understanding how structured knowledge and statistical inference interact. – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. Hierarchical Bayes. – How simplicity trades off with fit to the data in evaluating structural hypotheses. Bayesian Occam’s Razor. – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. Non-parametric Bayes. Alternative approaches to inductive generalization • Associative learning • Connectionist networks • Similarity to examples • Toolkit of simple heuristics • Constraint satisfaction • Analogical mapping Marr’s Three Levels of Analysis • Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?” • Representation and algorithm: Cognitive psychology • Implementation: Neurobiology Why Bayes? • A framework for explaining cognition. – How people can learn so much from such limited data. – Why process-level models work the way that they do. – Strong quantitative models with minimal ad hoc assumptions. • A framework for understanding how structured knowledge and statistical inference interact. – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. – How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam’s razor). – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. 
Coin flipping HHTHT HHHHH What process produced these sequences? Bayes’ rule For data D and a hypothesis H, we have: P(H|D) = P(D|H) P(H) / P(D) • “Posterior probability”: P(H|D) • “Prior probability”: P(H) • “Likelihood”: P(D|H) The origin of Bayes’ rule • A simple consequence of using probability to represent degrees of belief • For any two random variables: p(A & B) = p(A) p(B|A) and p(A & B) = p(B) p(A|B), so p(B) p(A|B) = p(A) p(B|A), giving p(A|B) = p(B|A) p(A) / p(B) Why represent degrees of belief with probabilities? • Good statistics – consistency, and worst-case error bounds. • Cox Axioms – necessary to cohere with common sense • “Dutch Book” + Survival of the Fittest – if your beliefs do not accord with the laws of probability, then you can always be out-gambled by someone whose beliefs do so accord. • Provides a theory of learning – a common currency for combining prior knowledge and the lessons of experience. Hypotheses in Bayesian inference • Hypotheses H refer to processes that could have generated the data D • Bayesian inference provides a distribution over these hypotheses, given D • P(D|H) is the probability of D being generated by the process identified by H • Hypotheses H are mutually exclusive: only one process could have generated D
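The derivation of Bayes’ rule from the definition of conditional probability can be checked numerically on any joint distribution; a minimal sketch (the 2×2 joint table here is made up purely for illustration):

```python
# Verify p(A|B) = p(B|A) p(A) / p(B) on an arbitrary joint distribution
# over two binary variables A and B (illustrative numbers).
p_joint = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

p_A = sum(v for (a, b), v in p_joint.items() if a == 1)   # p(A=1)
p_B = sum(v for (a, b), v in p_joint.items() if b == 1)   # p(B=1)
p_B_given_A = p_joint[(1, 1)] / p_A                       # p(B=1|A=1)
p_A_given_B = p_joint[(1, 1)] / p_B                       # p(A=1|B=1)

# Bayes' rule reproduces the directly computed conditional:
assert abs(p_A_given_B - p_B_given_A * p_A / p_B) < 1e-12
```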
Hypotheses in coin flipping Describe processes by which D could be generated D = HHTHT • Fair coin, P(H) = 0.5 • Coin with P(H) = p • Markov model • Hidden Markov model • ... (all of these are generative models) Representing generative models • Graphical model notation – Pearl (1988), Jordan (1998) • Variables are nodes; edges indicate dependency • Directed edges show the causal process of data generation (Figures: fair coin, P(H) = 0.5, with independent nodes d1 d2 d3 d4; Markov model with a chain d1 → d2 → d3 → d4 → d5 generating HHTHT) Models with latent structure • Not all nodes in a graphical model need to be observed • Some variables reflect latent structure, used in generating D but unobserved (Figures: P(H) = p, with latent p as parent of d1 d2 d3 d4; hidden Markov model with latent states s1 s2 s3 s4 generating the observations) Coin flipping • Comparing two simple hypotheses – P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses – P(H) = 0.5 vs. P(H) = p • Comparing infinitely many hypotheses – P(H) = p • Psychology: Representativeness Comparing two simple hypotheses • Contrast simple hypotheses: – H1: “fair coin”, P(H) = 0.5 – H2: “always heads”, P(H) = 1.0 • Bayes’ rule: P(H|D) = P(D|H) P(H) / P(D) • With two hypotheses, use odds Bayes’ rule in odds form P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] D: data H1, H2: models P(H1|D): posterior probability H1 generated the data P(D|H1): likelihood of data under model H1 P(H1): prior probability H1 generated the data Coin flipping HHTHT HHHHH What process produced these sequences?
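The odds form is easy to run by hand; a minimal sketch for the two sequences above, using the priors assumed on the following slides (999/1000 for the fair coin):

```python
# Posterior odds for H1 = "fair coin" vs. H2 = "always heads",
# with priors P(H1) = 999/1000 and P(H2) = 1/1000.
def likelihood(seq, p_heads):
    """P(seq | coin with the given probability of heads)."""
    out = 1.0
    for flip in seq:
        out *= p_heads if flip == "H" else (1.0 - p_heads)
    return out

def posterior_odds(seq, prior_h1=0.999, prior_h2=0.001):
    num = likelihood(seq, 0.5) * prior_h1
    den = likelihood(seq, 1.0) * prior_h2
    return float("inf") if den == 0 else num / den

print(posterior_odds("HHTHT"))    # infinite: one tail rules out "always heads"
print(posterior_odds("HHHHH"))    # ~31: the fair coin is still favored
print(posterior_odds("H" * 10))   # ~1: the prior is nearly overwhelmed
```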
Comparing two simple hypotheses P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] D: HHTHT H1, H2: “fair coin”, “always heads” P(D|H1) = 1/2^5 P(H1) = 999/1000 P(D|H2) = 0 P(H2) = 1/1000 P(H1|D) / P(H2|D) = infinity Comparing two simple hypotheses P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] D: HHHHH H1, H2: “fair coin”, “always heads” P(D|H1) = 1/2^5 P(H1) = 999/1000 P(D|H2) = 1 P(H2) = 1/1000 P(H1|D) / P(H2|D) ≈ 30 Comparing two simple hypotheses P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] D: HHHHHHHHHH H1, H2: “fair coin”, “always heads” P(D|H1) = 1/2^10 P(H1) = 999/1000 P(D|H2) = 1 P(H2) = 1/1000 P(H1|D) / P(H2|D) ≈ 1 Comparing two simple hypotheses • Bayes’ rule tells us how to combine prior beliefs with new data – top-down and bottom-up influences • As a model of human inference – predicts conclusions drawn from data – identifies point at which prior beliefs are overwhelmed by new experiences • But… more complex cases? Coin flipping • Comparing two simple hypotheses – P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses – P(H) = 0.5 vs. P(H) = p • Comparing infinitely many hypotheses – P(H) = p • Psychology: Representativeness Comparing simple and complex hypotheses (Figures: fair coin, P(H) = 0.5, vs. model with latent p as parent of the flips d1 d2 d3 d4) • Which provides a better account of the data: the simple hypothesis of a fair coin, or the complex hypothesis that P(H) = p?
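One way to answer the question just posed: under the complex hypothesis, average the likelihood over p with a uniform prior, which for a Bernoulli sequence has the closed form NH! NT! / (N+1)! (a standard beta-integral identity). A sketch of the comparison, not the tutorial's own code:

```python
from math import factorial

def marginal_likelihood_complex(n_heads, n_tails):
    """P(D | P(H)=p) = integral_0^1 p^NH (1-p)^NT dp = NH! NT! / (N+1)!"""
    n = n_heads + n_tails
    return factorial(n_heads) * factorial(n_tails) / factorial(n + 1)

def likelihood_fair(n_heads, n_tails):
    """P(D | fair coin) = 1/2^N"""
    return 0.5 ** (n_heads + n_tails)

# HHTHT: the fair coin wins, despite the complex model's free parameter.
print(likelihood_fair(3, 2), marginal_likelihood_complex(3, 2))   # 1/32 vs 1/60
# HHHHH: the extra flexibility now pays off.
print(likelihood_fair(5, 0), marginal_likelihood_complex(5, 0))   # 1/32 vs 1/6
```

Averaging over p penalizes the flexible model exactly when the data look unremarkable, which is the Bayesian Occam's razor discussed on the next slides.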
Comparing simple and complex hypotheses • P(H) = p is more complex than P(H) = 0.5 in two ways: – P(H) = 0.5 is a special case of P(H) = p – for any observed sequence X, we can choose p such that X is more probable than if P(H) = 0.5 (Figures: probability of the observed sequence as a function of p; HHHHH is maximized at p = 1.0, HHTHT at p = 0.6) • How can we deal with this? – frequentist: hypothesis testing – information theorist: minimum description length – Bayesian: just use probability theory! Comparing simple and complex hypotheses P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)] Computing P(D|H1) is easy: P(D|H1) = 1/2^N Compute P(D|H2) by averaging over p: P(D|H2) = ∫₀¹ P(D|p) p(p) dp (Figure: the resulting distribution over sequences is an average over all values of p) Comparing simple and complex hypotheses • Simple and complex hypotheses can be compared directly using Bayes’ rule – requires summing over latent variables • Complex hypotheses are penalized for their greater flexibility: “Bayesian Occam’s razor” • This principle is used in model selection methods in psychology (e.g. Myung & Pitt, 1997) Coin flipping • Comparing two simple hypotheses – P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses – P(H) = 0.5 vs. P(H) = p • Comparing infinitely many hypotheses – P(H) = p • Psychology: Representativeness Comparing infinitely many hypotheses • Assume data are generated from a model with P(H) = p (Figure: latent p as parent of the observed flips d1 d2 d3 d4) • What is the value of p?
– each value of p is a hypothesis H – requires inference over infinitely many hypotheses Comparing infinitely many hypotheses • Flip a coin 10 times and see 5 heads, 5 tails. • P(H) on next flip? 50% • Why? 50% = 5 / (5+5) = 5/10. • “Future will be like the past.” • Suppose we had seen 4 heads and 6 tails. • P(H) on next flip? Closer to 50% than to 40%. • Why? Prior knowledge. Integrating prior knowledge and data P(H|D) = P(D|H) P(H) / P(D), so here P(p|D) ∝ P(D|p) P(p) • Posterior distribution P(p|D) is a probability density over p = P(H) • Need to work out likelihood P(D|p) and specify prior distribution P(p) Likelihood and prior • Likelihood: P(D|p) = p^NH (1−p)^NT – NH: number of heads – NT: number of tails • Prior: P(p) ∝ p^(FH−1) (1−p)^(FT−1) ? A simple method of specifying priors • Imagine some fictitious trials, reflecting a set of previous experiences – strategy often used with neural networks • e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair • In fact, this is a sensible statistical idea...
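The fictitious-trials idea turns updating into simple counting; a minimal sketch (the Beta/conjugacy details that justify it appear on the next slides):

```python
def predict_next_heads(n_heads, n_tails, f_heads, f_tails):
    """Posterior predictive P(H on next flip) after observed counts (NH, NT),
    with FH, FT fictitious observations encoding the prior."""
    return (n_heads + f_heads) / (n_heads + f_heads + n_tails + f_tails)

# Strong fair-coin prior: 4 heads, 6 tails barely moves the prediction.
print(predict_next_heads(4, 6, 1000, 1000))   # ~0.4995
# Weak prior: the same data pull the prediction well below 0.5.
print(predict_next_heads(4, 6, 3, 3))         # 7/16 = 0.4375
# No fictitious observations at all: 2 heads, 0 tails jumps to certainty.
print(predict_next_heads(2, 0, 0, 0))         # 1.0
```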
Likelihood and prior • Likelihood: P(D|p) = p^NH (1−p)^NT – NH: number of heads – NT: number of tails • Prior: P(p) ∝ p^(FH−1) (1−p)^(FT−1), i.e. Beta(FH, FT) – FH: fictitious observations of heads – FT: fictitious observations of tails Conjugate priors • Exist for many standard distributions – formula for exponential family conjugacy • Define prior in terms of fictitious observations • Beta is conjugate to Bernoulli (coin-flipping) (Figure: Beta densities for FH = FT = 1, FH = FT = 3, and FH = FT = 1000) Comparing infinitely many hypotheses P(p|D) ∝ P(D|p) P(p) = p^(NH+FH−1) (1−p)^(NT+FT−1) • Posterior is Beta(NH+FH, NT+FT) – same form as conjugate prior • Posterior mean: (NH+FH) / (NH+FH+NT+FT) • Posterior predictive distribution: P(H on next flip) equals the posterior mean Some examples • e.g., F = {1000 heads, 1000 tails} ~ strong expectation that any new coin will be fair • After seeing 4 heads, 6 tails, P(H) on next flip = 1004 / (1004+1006) = 49.95% • e.g., F = {3 heads, 3 tails} ~ weak expectation that any new coin will be fair • After seeing 4 heads, 6 tails, P(H) on next flip = 7 / (7+9) = 43.75% Prior knowledge too weak But… flipping thumbtacks • e.g., F = {4 heads, 3 tails} ~ weak expectation that tacks are slightly biased towards heads • After seeing 2 heads, 0 tails, P(H) on next flip = 6 / (6+3) = 67% • Some prior knowledge is always necessary to avoid jumping to hasty conclusions...
• Suppose F = { }: After seeing 2 heads, 0 tails, P(H) on next flip = 2 / (2+0) = 100% Origin of prior knowledge • Tempting answer: prior experience • Suppose you have previously seen 2000 coin flips: 1000 heads, 1000 tails • By assuming all coins (and flips) are alike, these observations of other coins are as good as observations of the present coin Problems with simple empiricism • Haven’t really seen 2000 coin flips, or any flips of a thumbtack – Prior knowledge is stronger than raw experience justifies • Haven’t seen exactly equal number of heads and tails – Prior knowledge is smoother than raw experience justifies • Should be a difference between observing 2000 flips of a single coin versus observing 10 flips each for 200 coins, or 1 flip each for 2000 coins – Prior knowledge is more structured than raw experience A simple theory • “Coins are manufactured by a standardized procedure that is effective but not perfect.” – Justifies generalizing from previous coins to the present coin. – Justifies smoother and stronger prior than raw experience alone. – Explains why seeing 10 flips each for 200 coins is more valuable than seeing 2000 flips of one coin. • “Tacks are asymmetric, and manufactured to less exacting standards.” Limitations • Can all domain knowledge be represented so simply, in terms of an equivalent number of fictional observations? • Suppose you flip a coin 25 times and get all heads. Something funny is going on… • But with F ={1000 heads, 1000 tails}, P(H) on next flip = 1025 / (1025+1000) = 50.6%. Looks like nothing unusual Hierarchical priors • Higher-order hypothesis: is this coin fair or unfair? 
(Figure: binary “fair” node above latent p, which generates the flips d1 d2 d3 d4) • Example probabilities: – P(fair) = 0.99 – P(p|fair) is Beta(1000,1000) – P(p|unfair) is Beta(1,1) • 25 heads in a row propagates up, affecting p and then P(fair|D): P(fair|25 heads) / P(unfair|25 heads) = [P(25 heads|fair) / P(25 heads|unfair)] × [P(fair) / P(unfair)] ≈ 9 × 10⁻⁵ More hierarchical priors • Latent structure can capture coin variability (Figure: hyperparameters FH, FT at the top, with p ~ Beta(FH, FT) drawn separately for Coin 1, Coin 2, ..., Coin 200, each generating its own flips d1 d2 d3 d4) • 10 flips from 200 coins is better than 2000 flips from a single coin: allows estimation of FH, FT Yet more hierarchical priors (Figure: physical knowledge above FH, FT, above the per-coin p values and their flips) • Discrete beliefs (e.g. symmetry) can influence estimation of continuous properties (e.g. FH, FT) Comparing infinitely many hypotheses • Apply Bayes’ rule to obtain posterior probability density • Requires prior over all hypotheses – computation simplified by conjugate priors – richer structure with hierarchical priors • Hierarchical priors indicate how simple theories can inform statistical inferences – one step towards structure and statistics Coin flipping • Comparing two simple hypotheses – P(H) = 0.5 vs. P(H) = 1.0 • Comparing simple and complex hypotheses – P(H) = 0.5 vs. P(H) = p • Comparing infinitely many hypotheses – P(H) = p • Psychology: Representativeness Psychology: Representativeness Which sequence is more likely from a fair coin? HHTHT or HHHHH? HHTHT is judged more representative of a fair coin (Kahneman & Tversky, 1972). What might representativeness mean?
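The 9 × 10⁻⁵ figure above can be reproduced directly, since both marginal likelihoods are Beta integrals; a sketch using the Beta function via `lgamma`, assuming exactly the probabilities given on the slide:

```python
from math import exp, lgamma

def log_beta(a, b):
    """Log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a+b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def marginal_likelihood(n_heads, n_tails, f_heads, f_tails):
    """P(data | Beta(FH, FT) prior on p) for a Bernoulli sequence."""
    return exp(log_beta(n_heads + f_heads, n_tails + f_tails)
               - log_beta(f_heads, f_tails))

p_fair, p_unfair = 0.99, 0.01
like_fair = marginal_likelihood(25, 0, 1000, 1000)   # P(p|fair) = Beta(1000,1000)
like_unfair = marginal_likelihood(25, 0, 1, 1)       # P(p|unfair) = Beta(1,1)

odds = (like_fair * p_fair) / (like_unfair * p_unfair)
print(odds)   # ~9e-5: after 25 heads in a row, "unfair" is the better bet
```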
Evidence for a random generating process P(H1|D) / P(H2|D) = [P(D|H1) / P(D|H2)] × [P(H1) / P(H2)], where the likelihood ratio P(D|H1) / P(D|H2) measures the evidence H1: random process (fair coin) H2: alternative processes A constrained hypothesis space Four hypotheses: h1 fair coin HHTHTTTH h2 “always alternates” HTHTHTHT h3 “mostly heads” HHTHTHHH h4 “always heads” HHHHHHHH Representativeness judgments Results • Good account of representativeness data, with three pseudo-free parameters, achieving a fit of 0.91 – “always alternates” means 99% of the time – “mostly heads” means P(H) = 0.85 – “always heads” means P(H) = 0.99 • With scaling parameter, r = 0.95 (Tenenbaum & Griffiths, 2001) The role of theories The fact that HHTHT looks representative of a fair coin and HHHHH does not reflects our implicit theories of how the world works. – Easy to imagine how a trick all-heads coin could work: high prior probability. – Hard to imagine how a trick “HHTHT” coin could work: low prior probability. Summary • Three kinds of Bayesian inference – comparing two simple hypotheses – comparing simple and complex hypotheses – comparing an infinite number of hypotheses • Critical notions: – generative models, graphical models – Bayesian Occam’s razor – priors: conjugate, hierarchical (theories) Rules and similarity Structure versus statistics Rules Statistics Logic Similarity Symbols Typicality A better metaphor Structure and statistics Statistics Similarity Typicality Rules Logic Symbols Structure and statistics • Basic case study #1: Flipping coins – Learning and reasoning with structured statistical models. • Basic case study #2: Rules and similarity – Statistical learning with structured representations.
The number game • Program input: number between 1 and 100 • Program output: “yes” or “no” The number game • Learning task: – Observe one or more positive (“yes”) examples. – Judge whether other numbers are “yes” or “no”. The number game Examples of “yes” numbers and generalization judgments (N = 20): 60 → Diffuse similarity; 60 80 10 30 → Rule: “multiples of 10”; 60 52 57 55 → Focused similarity: numbers near 50-60; 16 → Diffuse similarity; 16 8 2 64 → Rule: “powers of 2”; 16 23 19 20 → Focused similarity: numbers near 20 Main phenomena to explain: – Generalization can appear either similarity-based (graded) or rule-based (all-or-none). – Learning from just a few positive examples. Rule/similarity hybrid models Divisions into “rule” and “similarity” subsystems: • Category learning – Nosofsky, Palmeri et al.: RULEX – Erickson & Kruschke: ATRIUM • Language processing – Pinker, Marcus et al.: Past tense morphology • Reasoning – Sloman – Rips – Nisbett, Smith et al. Rule/similarity hybrid models • Why two modules? • Why do these modules work the way that they do, and interact as they do? • How do people infer a rule or similarity metric from just a few positive examples? Bayesian model • H: Hypothesis space of possible concepts: – h1 = {2, 4, 6, 8, 10, 12, …, 96, 98, 100} (“even numbers”) – h2 = {10, 20, 30, 40, …, 90, 100} (“multiples of 10”) – h3 = {2, 4, 8, 16, 32, 64} (“powers of 2”) – h4 = {50, 51, 52, …, 59, 60} (“numbers between 50 and 60”) – ...
Representational interpretations for H: – Candidate rules – Features for similarity – “Consequential subsets” (Shepard, 1987) Inferring hypotheses from similarity judgment Additive clustering (Shepard & Arabie, 1977): s_ij = Σ_k w_k f_ik f_jk where s_ij : similarity of stimuli i, j; w_k : weight of cluster k; f_ik : membership of stimulus i in cluster k (1 if stimulus i is in cluster k, 0 otherwise) Equivalent to similarity as a weighted sum of common features (Tversky, 1977). Additive clustering for the integers 0-9 (clusters ranked by weight): 1. .444 powers of two; 2. .345 small numbers; 3. .331 multiples of three; 4. .291 large numbers; 5. .255 middle numbers; 6. .216 odd numbers; 7. .214 smallish numbers; 8. .172 largish numbers Three hypothesis subspaces for number concepts • Mathematical properties (24 hypotheses): – Odd, even, square, cube, prime numbers – Multiples of small integers – Powers of small integers • Raw magnitude (5050 hypotheses): – All intervals of integers with endpoints between 1 and 100. • Approximate magnitude (10 hypotheses): – Decades (1-10, 10-20, 20-30, …) Hypothesis spaces and theories • Why a hypothesis space is like a domain theory: – Represents one particular way of classifying entities in a domain. – Not just an arbitrary collection of hypotheses, but a principled system. • What’s missing? – Explicit representation of the principles. • Hypothesis spaces (and priors) are generated by theories. Some analogies: – Grammars generate languages (and priors over structural descriptions) – Hierarchical Bayesian modeling Bayesian model • H: Hypothesis space of possible concepts: – Mathematical properties: even, odd, square, prime, . . . . – Approximate magnitude: {1-10}, {10-20}, {20-30}, . . . . – Raw magnitude: all intervals between 1 and 100. • X = {x1, . . . , xn}: n examples of a concept C.
• Evaluate hypotheses given data: p(h|X) = p(X|h) p(h) / Σ_{h′∈H} p(X|h′) p(h′) – p(h) [“prior”]: domain knowledge, pre-existing biases – p(X|h) [“likelihood”]: statistical information in examples. – p(h|X) [“posterior”]: degree of belief that h is the true extension of C. Likelihood: p(X|h) • Size principle: Smaller hypotheses receive greater likelihood, and exponentially more so as n increases. p(X|h) = [1 / size(h)]^n if x1, …, xn ∈ h; 0 if any xi ∉ h • Follows from assumption of randomly sampled examples. • Captures the intuition of a representative sample.
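The size principle is one line of arithmetic; a sketch for two of the hypotheses used in this case study (the hypothesis sets follow the slides):

```python
def likelihood(examples, hypothesis):
    """Size principle: p(X|h) = (1/|h|)^n if all examples lie in h, else 0."""
    if not all(x in hypothesis for x in examples):
        return 0.0
    return (1.0 / len(hypothesis)) ** len(examples)

evens = set(range(2, 101, 2))       # "even numbers", 50 members
mult10 = set(range(10, 101, 10))    # "multiples of 10", 10 members

# One example is only weak evidence; four multiples of 10 make
# "even numbers" an exponentially bigger coincidence.
print(likelihood([60], evens), likelihood([60], mult10))   # 0.02 vs 0.1
print(likelihood([60, 80, 10, 30], evens),                 # (1/50)^4 = 1.6e-7
      likelihood([60, 80, 10, 30], mult10))                # (1/10)^4 = 1e-4
```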
Illustrating the size principle (Figures: hypotheses h1 and h2 laid over the even numbers 2–100; as more examples arrive, the data go from slightly to much more of a coincidence under h1) Bayesian Occam’s Razor: the “Law of Conservation of Belief”: for any model M, Σ_{d∈D} p(D = d|M) = 1 (Figure: p(D = d|M) across all possible data sets d, for a simple model M1 and a more flexible model M2 that must spread its probability more thinly) Comparing simple and complex hypotheses (Figure: the distribution is an average over all values of p) Prior: p(h) • Choice of hypothesis space embodies a strong prior: effectively, p(h) ~ 0 for many logically possible but conceptually unnatural hypotheses. • Prevents overfitting by highly specific but unnatural hypotheses, e.g. “multiples of 10 except 50 and 70”. • p(h) encodes relative weights of alternative theories: H: Total hypothesis space with p(H1) = 1/5, p(H2) = 3/5, p(H3) = 1/5 – H1: Math properties (24 hypotheses, e.g. even numbers, powers of two, multiples of three), p(h) = p(H1)/24 – H2: Raw magnitude (5050 hypotheses, e.g. 10-15, 20-32, 37-54), p(h) = p(H2)/5050 – H3: Approx. magnitude (10 hypotheses, e.g. 10-20, 20-30, 30-40), p(h) = p(H3)/10 A more complex approach to priors • Start with a base set of regularities R and combination operators C. • Hypothesis space = closure of R under C.
– C = {and, or}: H = unions and intersections of regularities in R (e.g., “multiples of 10 between 30 and 70”). – C = {and-not}: H = regularities in R with exceptions (e.g., “multiples of 10 except 50 and 70”). • Two qualitatively similar priors: – Description length: number of combinations in C needed to generate hypothesis from R. – Bayesian Occam’s Razor, with model classes defined by number of combinations: more combinations → more hypotheses → lower prior Posterior: p(h|X) = p(X|h) p(h) / Σ_{h′∈H} p(X|h′) p(h′) • X = {60, 80, 10, 30} • Why prefer “multiples of 10” over “even numbers”? p(X|h). • Why prefer “multiples of 10” over “multiples of 10 except 50 and 20”? p(h). • Why does a good generalization need both high prior and high likelihood? p(h|X) ~ p(X|h) p(h) Bayesian Occam’s Razor Probabilities provide a common currency for balancing model complexity with fit to the data. Generalizing to new objects Given p(h|X), how do we compute p(y ∈ C|X), the probability that C applies to some new stimulus y? Generalizing to new objects Hypothesis averaging: Compute the probability that C applies to some new object y by averaging the predictions of all hypotheses h, weighted by p(h|X): p(y ∈ C|X) = Σ_{h∈H} p(y ∈ C|h) p(h|X), where p(y ∈ C|h) = 1 if y ∈ h and 0 if y ∉ h, so p(y ∈ C|X) = Σ_{h ⊇ {y} ∪ X} p(h|X) Examples: 16 Connection to feature-based similarity • Additive clustering model of similarity: s_ij = Σ_k w_k f_ik f_jk • Bayesian hypothesis averaging: p(y ∈ C|X) = Σ_{h ⊇ {y} ∪ X} p(h|X) • Equivalent if we identify features f_k with hypotheses h, and weights w_k with p(h|X).
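Putting the pieces together (size-principle likelihood, prior over a hypothesis space, hypothesis averaging), a toy end-to-end sketch with a deliberately tiny hypothesis space and a uniform prior; the real model uses thousands of hypotheses and a theory-derived prior:

```python
# Toy number-game model: tiny hypothesis space, uniform prior (illustrative only).
hypotheses = {
    "even":    set(range(2, 101, 2)),
    "mult10":  set(range(10, 101, 10)),
    "powers2": {2, 4, 8, 16, 32, 64},
    "50s":     set(range(50, 61)),
}
prior = {name: 1.0 / len(hypotheses) for name in hypotheses}

def posterior(examples):
    """p(h|X) proportional to p(X|h) p(h), with the size-principle likelihood."""
    scores = {}
    for name, h in hypotheses.items():
        ok = all(x in h for x in examples)
        scores[name] = prior[name] * (1.0 / len(h)) ** len(examples) if ok else 0.0
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()}

def generalize(y, examples):
    """Hypothesis averaging: p(y in C | X) = sum over h containing y of p(h|X)."""
    post = posterior(examples)
    return sum(p for name, p in post.items() if y in hypotheses[name])

print(generalize(20, [60]))              # ~0.57: mixed evidence
print(generalize(20, [60, 80, 10, 30]))  # ~1.0: consistent with "multiples of 10"
print(generalize(52, [60, 80, 10, 30]))  # ~0.0016: all-or-none rule behavior
```

The graded similarity gradients in the real model come from averaging over many overlapping interval hypotheses; this toy space is too small to show them, but the all-or-none rule behavior already emerges.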
Examples: 16 8 2 64 Examples: 16 23 19 20 Model fits (Figures: human generalization judgments (N = 20) alongside Bayesian model predictions; r = 0.96 for the example sets 60; 60 80 10 30; 60 52 57 55, and r = 0.93 for 16; 16 8 2 64; 16 23 19 20) Summary of the Bayesian model • How do the statistics of the examples interact with prior knowledge to guide generalization? posterior ∝ likelihood × prior • Why does generalization appear rule-based or similarity-based? Hypothesis averaging plus the size principle: many hypotheses of similar size give a broad p(h|X) (a similarity gradient); one hypothesis much smaller than the rest gives a narrow p(h|X) (an all-or-none rule). Alternative models • Neural networks (e.g. inputs 60 80 10 30 feeding output units for properties such as even, multiple of 10, multiple of 3, power of 2) • Hypothesis ranking and elimination (a fixed ranking over such hypotheses: 1. even, 2. multiple of 10, 3. multiple of 3, 4. power of 2, …) • Similarity to exemplars – Average similarity: p(y ∈ C|X) = (1/|X|) Σ_{xj∈X} sim(y, xj) (data vs. model: r = 0.80) – Max similarity: p(y ∈ C|X) = max_{xj∈X} sim(y, xj) (data vs. model: r = 0.64) – Flexible similarity? Bayes. • Toolbox of simple heuristics – 60: “general” similarity – 60 80 10 30: most specific rule (“subset principle”).
– 60 52 57 55: similarity in magnitude Why these heuristics? When to use which heuristic? Bayes. Summary • Generalization from limited data is possible via the interaction of structured knowledge and statistics. – Structured knowledge: a space of candidate rules; theories generate the hypothesis space (cf. hierarchical priors) – Statistics: Bayesian Occam’s razor. • Better understand the interactions between traditionally opposing concepts: – Rules and statistics – Rules and representativeness – Rules and similarity • Explains why central but notoriously slippery processing-level concepts work the way they do. – Similarity – Representativeness Why Bayes? • A framework for explaining cognition. – How people can learn so much from such limited data. – Why process-level models work the way that they do. – Strong quantitative models with minimal ad hoc assumptions. • A framework for understanding how structured knowledge and statistical inference interact. – How structured knowledge guides statistical inference, and is itself acquired through higher-order statistical learning. – How simplicity trades off with fit to the data in evaluating structural hypotheses (Occam’s razor). – How increasingly complex structures may grow as required by new data, rather than being pre-specified in advance. Theory-Based Bayesian Models • Rational statistical inference (Bayes): p(h|d) = p(d|h) p(h) / Σ_{h′∈H} p(d|h′) p(h′) • Learners’ domain theories generate their hypothesis space H and prior p(h). – Well-matched to structure of the natural world. – Learnable from limited data. – Computationally tractable inference. Looking towards the afternoon • How do we apply these ideas to more natural and complex aspects of cognition? • Where do the hypothesis spaces come from? • Can we formalize the contributions of domain theories?
Outline • Morning – Introduction (Josh) – Basic case study #1: Flipping coins (Tom) – Basic case study #2: Rules and similarity (Josh) • Afternoon – Advanced case study #1: Causal induction (Tom) – Advanced case study #2: Property induction (Josh) – Quick tour of more advanced topics (Tom) Marr’s Three Levels of Analysis • Computation: “What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?” • Representation and algorithm: cognitive psychology • Implementation: neurobiology Working at the computational level • What is the computational problem? – input: data – output: solution • What knowledge is available to the learner? • Where does that knowledge come from? Theory-Based Bayesian Models • Rational statistical inference (Bayes): p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′) • Learners’ domain theories generate their hypothesis space H and prior p(h). – Well-matched to structure of the natural world. – Learnable from limited data. – Computationally tractable inference. Causality Bayes nets and beyond... • Increasingly popular approach to studying human causal inferences (e.g. Glymour, 2001; Gopnik et al., 2004) • Three reactions: – Bayes nets are the solution! – Bayes nets are missing the point, not sure why… – What is a Bayes net? Bayes nets and beyond... • What are Bayes nets? – graphical models – causal graphical models • An example: elemental causal induction • Beyond Bayes nets… – other knowledge in causal induction – formalizing causal theories Bayes nets and beyond...
• What are Bayes nets? – graphical models – causal graphical models • An example: elemental causal induction • Beyond Bayes nets… – other knowledge in causal induction – formalizing causal theories Graphical models • Express the probabilistic dependency structure among a set of variables (Pearl, 1988) • Consist of – a set of nodes, corresponding to variables – a set of edges, indicating dependency – a set of functions defined on the graph that together define a probability distribution Undirected graphical models • Consist of – a set of nodes – a set of edges – a potential for each clique, multiplied together to yield the distribution over variables • Examples – statistical physics: Ising model, spin glasses – early neural networks (e.g. Boltzmann machines) Directed graphical models • Consist of – a set of nodes – a set of edges – a conditional probability distribution for each node, conditioned on its parents, multiplied together to yield the distribution over variables • Constrained to directed acyclic graphs (DAGs) • AKA: Bayesian networks, Bayes nets Bayesian networks and Bayes • Two different things: – Bayesian statistics is a method of inference – Bayesian networks are a form of representation • There is no necessary connection – many users of Bayesian networks rely upon frequentist statistical methods (e.g.
Glymour) – many Bayesian inferences cannot be easily represented using Bayesian networks Properties of Bayesian networks • Efficient representation and inference – exploiting dependency structure makes it easier to represent and compute with probabilities • Explaining away – a pattern of probabilistic reasoning characteristic of Bayesian networks, especially in their early use in AI Efficient representation and inference • Three binary variables: Cavity, Toothache, Catch • Specifying P(Cavity, Toothache, Catch) requires 7 parameters (one for each assignment of values, minus one because the distribution must sum to one) • With n variables, we need 2^n − 1 parameters • Here n = 3. Realistically there are many more: X-ray, diet, oral hygiene, personality, …. Conditional independence • All three variables are dependent, but Toothache and Catch are independent given the presence or absence of Cavity • In probabilistic terms: P(ache, catch | cav) = P(ache | cav) P(catch | cav) and P(ache, catch | ¬cav) = P(ache | ¬cav) P(catch | ¬cav), but P(ache, catch) ≠ P(ache) P(catch) • With n evidence variables x1, …, xn, we need only 2n conditional probabilities: P(xi | cav), P(xi | ¬cav) A simple Bayesian network • Graphical representation of relations between a set of random variables: Cavity → Toothache, Cavity → Catch • Probabilistic interpretation: factorizing complex terms P(A, B, C) = Π_{V ∈ {A, B, C}} P(V | parents[V]) P(Ache, Catch, Cav) = P(Ache, Catch | Cav) P(Cav) = P(Ache | Cav) P(Catch | Cav) P(Cav) A more complex system Battery → Radio, Battery → Ignition, Ignition and Gas → Starts, Starts → On time to work • Joint distribution sufficient for any inference: P(B, R, I, G, S, O) = P(B) P(R | B) P(I | B) P(G) P(S | I, G) P(O | S) P(O | G) = P(O, G) / P(G) = [Σ_{B, R, I, S} P(B) P(R | B) P(I | B) P(G) P(S | I, G) P(O | S)] / P(G) • Summing out irrelevant variables simplifies the computation: P(O | G) = Σ_S P(O | S) Σ_{B, I} P(B) P(I | B) P(S | I, G) • General inference algorithm: local message passing (belief propagation; Pearl, 1988) – efficiency depends on sparseness of the graph structure Explaining away Rain → Grass Wet ← Sprinkler P(R, S, W) = P(R) P(S) P(W | S, R) • Assume grass will be wet if and only if it rained last night, or if the sprinklers were left on: P(W = w | S, R) = 1 if S = s or R = r; 0 if R = ¬r and S = ¬s. • Compute the probability it rained last night, given that the grass is wet: P(r | w) = P(w | r) P(r) / P(w) = P(w | r) P(r) / Σ_{r′, s′} P(w | r′, s′) P(r′, s′) = P(r) / [P(r, s) + P(r, ¬s) + P(¬r, s)] = P(r) / [P(r) + P(¬r) P(s)] The denominator lies between P(s) and 1, so P(r | w) lies between P(r) and 1: wet grass raises the probability of rain. • Compute the probability it rained last night, given that the grass is wet and the sprinklers were left on: P(r | w, s) = P(w | r, s) P(r | s) / P(w | s) Both P(w | r, s) and P(w | s) equal 1, so P(r | w, s) = P(r | s) = P(r): “discounting” to the prior probability. Contrast w/ production system Rain, Sprinkler, Grass Wet • Formulate IF-THEN rules: – IF Rain THEN Wet – IF Wet THEN Rain; IF Wet AND NOT Sprinkler THEN Rain • Rules do not distinguish directions of inference • Requires a combinatorial explosion of rules Contrast w/ spreading activation • Excitatory links: Rain → Wet, Sprinkler → Wet • Observing rain, Wet becomes more active. • Observing grass wet, Rain and Sprinkler become more active. • Observing grass wet and sprinkler, Rain cannot become less active. No explaining away! • Adding an inhibitory link Rain ⊣ Sprinkler fixes this case: observing grass wet and sprinkler, Rain becomes less active – explaining away. • But each new variable (e.g. Burst pipe) requires more inhibitory connections. • Interactions between variables are not causal. • Not modular.
– Whether a connection exists depends on what other connections exist, in non-transparent ways. – Big holism problem. – Combinatorial explosion. Graphical models • Capture dependency structure in distributions • Provide an efficient means of representing and reasoning with probabilities • Allow kinds of inference that are problematic for other representations: explaining away – hard to capture in a production system – hard to capture with spreading activation Bayes nets and beyond... • What are Bayes nets? – graphical models – causal graphical models • An example: causal induction • Beyond Bayes nets… – other knowledge in causal induction – formalizing causal theories Causal graphical models • Graphical models represent statistical dependencies among variables (i.e., correlations) – can answer questions about observations • Causal graphical models represent causal dependencies among variables – express underlying causal structure – can answer questions about both observations and interventions (actions upon a variable) Observation and intervention Battery → Radio, Battery → Ignition, Ignition and Gas → Starts, Starts → On time to work Graphical model: P(Radio | Ignition) Causal graphical model: P(Radio | do(Ignition)) – “graph surgery” produces a “mutilated graph” Assessing interventions • To compute P(Y | do(X = x)), delete all edges coming into X and reason with the resulting Bayesian network (“do-calculus”; Pearl, 2000) • Allows a single structure to make predictions about both observations and interventions Causality simplifies inference • Using a representation in which the direction of causality is correct produces sparser graphs • Suppose we get the direction of causality wrong, thinking that “symptoms” cause “diseases”: Ache → Cavity ← Catch • This model does not capture the correlation between symptoms: it falsely asserts P(Ache, Catch) = P(Ache) P(Catch).
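The explaining-away pattern summarized above can be verified numerically. Only the deterministic OR gate comes from the slides; the priors P(r) = 0.2 and P(s) = 0.4 below are illustrative assumptions.

```python
# Numerical check of explaining away in the rain/sprinkler network.
from itertools import product

P_r, P_s = 0.2, 0.4   # assumed priors, for illustration only

def joint(r, s, w):
    """P(R, S, W) = P(R) P(S) P(W | S, R), with W = (R or S) deterministically."""
    p = (P_r if r else 1 - P_r) * (P_s if s else 1 - P_s)
    return p if w == (r or s) else 0.0

def conditional(query, given):
    """P(query | given) by brute-force enumeration over all worlds (r, s, w)."""
    num = den = 0.0
    for r, s, w in product([0, 1], repeat=3):
        world = {'r': r, 's': s, 'w': w}
        if all(world[k] == v for k, v in given.items()):
            p = joint(r, s, w)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

print(conditional({'r': 1}, {'w': 1}))          # raised above the prior 0.2
print(conditional({'r': 1}, {'w': 1, 's': 1}))  # discounted back to the prior
```

The first query matches the closed form P(r) / [P(r) + P(¬r)P(s)] from the explaining-away slides; the second recovers P(r) exactly, which is the discounting effect that spreading activation without inhibitory links cannot produce.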
Causality simplifies inference • Using a representation in which the direction of causality is correct produces sparser graphs • With the wrong direction (“symptoms” cause “diseases”), inserting a new arrow Ache → Catch allows us to capture the correlation between symptoms. • But this model is too complex: it cannot express our belief that P(Ache, Catch | Cav) = P(Ache | Cav) P(Catch | Cav) • New symptoms (e.g. X-ray) require a combinatorial proliferation of new arrows, reducing the efficiency of inference. Learning causal graphical models • Strength: how strong is a relationship? • Structure: does a relationship exist? Causal structure vs. causal strength • Strength: how strong is a relationship? – requires defining the nature of the relationship Parameterization • Structures: h1 = B → E ← C; h0 = B → E • Generic parameterization: C B | h1: P(E=1 | C, B) | h0: P(E=1 | C, B) 0 0 | p00 | p0 1 0 | p10 | p0 0 1 | p01 | p1 1 1 | p11 | p1 • Linear parameterization (w0, w1: strength parameters for B, C): C B | h1: P(E=1 | C, B) | h0: P(E=1 | C, B) 0 0 | 0 | 0 1 0 | w1 | 0 0 1 | w0 | w0 1 1 | w1 + w0 | w0 • “Noisy-OR” parameterization: C B | h1: P(E=1 | C, B) | h0: P(E=1 | C, B) 0 0 | 0 | 0 1 0 | w1 | 0 0 1 | w0 | w0 1 1 | w1 + w0 − w1 w0 | w0 Parameter estimation • Maximum likelihood estimation: maximize Π_i P(bi, ci, ei; w0, w1) • Bayesian methods: as in the “Comparing infinitely many hypotheses” example… Causal structure vs. causal strength • Structure: does a relationship exist?
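The linear and noisy-OR parameterizations above can be written as code. Clipping the linear form at 1 is an added assumption (needed to keep it a probability); the strength values are arbitrary illustrations.

```python
# The two strength parameterizations of P(E = 1 | C, B) for structure h1,
# with w0 the strength of the background cause B and w1 the strength of C.

def linear(c, b, w0, w1):
    # Linear: the effects of present causes add (clipped to stay in [0, 1]).
    return min(1.0, w1 * c + w0 * b)

def noisy_or(c, b, w0, w1):
    # Noisy-OR: each present cause independently suffices to produce E,
    # so E fails only if every present cause independently fails.
    return 1 - (1 - w1) ** c * (1 - w0) ** b

# With both causes present, noisy-OR gives w1 + w0 - w1*w0 (here 0.65),
# while the linear form gives w1 + w0 (here 0.8).
w0, w1 = 0.3, 0.5
print(noisy_or(1, 1, w0, w1))
print(linear(1, 1, w0, w1))
```

As the next slides note, the maximum likelihood strength estimate under the linear form corresponds to ΔP, and under noisy-OR to Cheng's causal power.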
Approaches to structure learning • Constraint-based: B → E ← C – infer dependencies from statistical tests (e.g., χ2) – deduce structure from the pattern of dependencies (Pearl, 2000; Spirtes et al., 1993) – attempts to reduce an inductive problem to a deductive problem • Bayesian: – compute the posterior probability of structures given the observed data: P(S | data) ∝ P(data | S) P(S) (Heckerman, 1998; Friedman, 1999) Causal graphical models • Extend graphical models to deal with interventions as well as observations • Respecting the direction of causality results in efficient representation and inference • Two steps in learning causal models – parameter estimation – structure learning Bayes nets and beyond... • What are Bayes nets? – graphical models – causal graphical models • An example: elemental causal induction • Beyond Bayes nets… – other knowledge in causal induction – formalizing causal theories Elemental causal induction Contingency table: C present, E present: a; C present, E absent: b; C absent, E present: c; C absent, E absent: d. “To what extent does C cause E?” Causal structure vs. causal strength h1 = B → E ← C (strengths w0, w1); h0 = B → E (strength w0) • Strength: how strong is a relationship? • Structure: does a relationship exist?
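The Bayesian route to structure learning can be sketched by scoring the two elemental structures with marginal likelihoods, integrating out the strength parameters. The uniform priors, crude grid integration, noisy-OR likelihood, and toy contingency data below are all assumptions for illustration.

```python
# Sketch of Bayesian structure comparison for elemental causal induction:
# score h1 (C -> E, plus background B) against h0 (background only) by their
# marginal likelihoods under uniform priors on the strengths, via a grid.
import math

def noisy_or(c, w0, w1):
    # Background cause B is assumed always present.
    return 1 - (1 - w1) ** c * (1 - w0)

def lik(data, w0, w1):
    """Likelihood of (c, e) trials under noisy-OR with strengths w0, w1."""
    p = 1.0
    for c, e in data:
        pe = noisy_or(c, w0, w1)
        p *= pe if e else (1 - pe)
    return p

def marginal(data, h1, n=101):
    """Grid approximation to the marginal likelihood of a structure."""
    grid = [i / (n - 1) for i in range(n)]
    if h1:   # integrate over both w0 and w1
        return sum(lik(data, w0, w1) for w0 in grid for w1 in grid) / n ** 2
    return sum(lik(data, w0, 0.0) for w0 in grid) / n

# Toy data: P(e+|c+) = 8/8, P(e+|c-) = 1/8 -- a strong contingency.
data = [(1, 1)] * 8 + [(0, 1)] * 1 + [(0, 0)] * 7
support = math.log(marginal(data, True) / marginal(data, False))
print(support)   # positive: the data favor the causal link
```

This log marginal-likelihood ratio is the "causal support" quantity defined on the next slides; a more careful implementation would use proper numerical quadrature rather than an endpoint grid.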
Causal strength • Assume the structure B → E ← C, with strengths w0, w1 • The leading models (ΔP and causal power) are maximum likelihood estimates of the strength parameter w1, under different parameterizations of P(E | B, C): – linear → ΔP; noisy-OR → causal power Causal structure • Hypotheses: h1 = B → E ← C; h0 = B → E • Bayesian causal inference: support = log [ P(data | h1) / P(data | h0) ], where P(data | h1) = ∫0^1 ∫0^1 P(data | w0, w1) p(w0, w1 | h1) dw0 dw1 and P(data | h0) = ∫0^1 P(data | w0) p(w0 | h0) dw0 Buehner and Cheng (1997) (figure): model fits to human judgments – ΔP (r = 0.89), causal power (r = 0.88), support (r = 0.97). The importance of parameterization • Noisy-OR incorporates mechanism assumptions: – generativity: causes increase the probability of their effects – each cause is sufficient to produce the effect – causes act via independent mechanisms (Cheng, 1997) • Consider other models: – statistical dependence: χ2 test – generic parameterization (Anderson; standard in computer science) (figure): people vs. support (noisy-OR), χ2, and support (generic). Generativity is essential (figure): support as a function of P(e+|c+) and P(e+|c−), each taking values 8/8, 6/8, 4/8, 2/8, 0/8. • The predictions result from a “ceiling effect” – ceiling effects only matter if you believe a cause increases the probability of an effect Bayes nets and beyond... • What are Bayes nets? – graphical models – causal graphical models • An example: elemental causal induction • Beyond Bayes nets… – other knowledge in causal induction – formalizing causal theories (figure): chemicals × genes – peroxisome proliferators (Clofibrate, Wyeth 14,643, Gemfibrozil) vs. Phenobarbital, and expression of p450 2B1 and Carnitine Palmitoyl Transferase 1. Hamadeh et al. (2002), Toxicological Sciences. Using causal graphical models • Three questions (usually solved by the researcher) – what are the variables?
– what structures are plausible? – how do variables interact? • How are these questions answered if causal graphical models are used in cognition? Bayes nets and beyond... • What are Bayes nets? – graphical models – causal graphical models • An example: elemental causal induction • Beyond Bayes nets… – other knowledge in causal induction – formalizing causal theories Theory-based causal induction • A causal theory – ontology, plausible relations, functional form – generates a hypothesis space of candidate causal graphical models, with their prior probabilities; hypotheses are then evaluated by statistical inference, P(h | data) ∝ P(data | h) P(h). (Schematic: candidate graphs h1 and h0 over the variables, with equal priors.) Blicket detector (Gopnik, Sobel, and colleagues) • “See this? It’s a blicket machine. Blickets make it go.” • “Let’s put this one on the machine. Oooh, it’s a blicket!” “Blocking” experiment (procedure used in Sobel et al., 2002, Experiment 2; one-cause condition) – Two objects: A and B – Trial 1: A on detector – detector active – Trial 2: B on detector – detector inactive – Trials 3, 4: A and B on detector – detector active – 3- and 4-year-olds judge whether each object is a blicket, then are asked to make the machine go • A: a blicket • B: not a blicket A deductive inference? • Causal law: the detector activates if and only if one or more objects on top of it are blickets. • Premises: – Trial 1: A on detector – detector active – Trial 2: B on detector – detector inactive – Trials 3, 4: A and B on detector – detector active • Conclusions deduced from the premises and the causal law: – A: a blicket – B: not a blicket “Backwards blocking” (Sobel, Tenenbaum & Gopnik, 2004; backward blocking condition) – Two objects: A and B – Trial 1: A and B on detector – detector active – Trial 2: A on detector – detector active – 4-year-olds judge whether each object is a blicket • A: a blicket (100% of judgments) • B: probably not a blicket (66% of judgments) Theory • Ontology – Types: Block, Detector, Trial – Predicates: Contact(Block, Detector, Trial), Active(Detector, Trial) • Constraints on causal relations – For any Block b and Detector d, with prior probability q: Cause(Contact(b,d,t), Active(d,t)) • Functional form of causal relations – Causes of Active(d,t) are independent mechanisms, with causal strengths wi. A background cause has strength w0. Assume a near-deterministic mechanism: wi ≈ 1, w0 ≈ 0.
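Under this theory, the backwards-blocking inference can be modeled in a few lines: enumerate the four hypotheses about which blocks cause activation, score them with the deterministic activation law, and read off the posterior probability that B is a blicket. The value q = 0.3 below is an arbitrary illustrative prior.

```python
# Sketch of the theory-based Bayesian model for backwards blocking:
# hypotheses are pairs (A -> E?, B -> E?), each link present with prior q,
# and E = 1 on a trial iff some contacting block has a link to E.
from itertools import product

def posterior(trials, q=0.3):
    """trials: list of ((A, B), E). Returns P(h | data) over h = (a_link, b_link)."""
    hyps = list(product([0, 1], repeat=2))
    prior = {h: (q if h[0] else 1 - q) * (q if h[1] else 1 - q) for h in hyps}
    post = {}
    for h in hyps:
        lik = 1.0
        for (a, b), e in trials:
            pred = 1 if (a and h[0]) or (b and h[1]) else 0   # activation law
            lik *= 1.0 if pred == e else 0.0                  # deterministic
        post[h] = prior[h] * lik
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

def p_blicket(post, which):
    """Marginal probability that block `which` (0 = A, 1 = B) causes E."""
    return sum(p for h, p in post.items() if h[which])

# Trial 1: A and B together activate the detector -- B's probability rises.
post1 = posterior([((1, 1), 1)])
# Trial 2: A alone activates it -- B is explained away, back to the prior q.
post2 = posterior([((1, 1), 1), ((1, 0), 1)])
print(p_blicket(post1, 1))
print(p_blicket(post2, 1))
```

This reproduces the odds computed on the following slides: P(B → E)/P(¬B → E) goes from q/(1−q) to 1/(1−q) after the joint trial and back to q/(1−q) after A alone activates the detector.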
Theory • Variables for one trial: – A = 1 if Contact(block A, detector, trial), else 0 – B = 1 if Contact(block B, detector, trial), else 0 – E = 1 if Active(detector, trial), else 0 • The constraints on causal relations generate four hypotheses (no hypotheses with E → B, E → A, A → B, etc.): – h00 (no links): P(h00) = (1 − q)^2 – h01 (B → E): P(h01) = (1 − q) q – h10 (A → E): P(h10) = q (1 − q) – h11 (A → E and B → E): P(h11) = q^2 • The functional form gives the likelihoods, via the “activation law”: E = 1 if and only if a contacting block with a link to E is present. For (h00, h01, h10, h11): P(E=1 | A=0, B=0): 0, 0, 0, 0 P(E=1 | A=1, B=0): 0, 0, 1, 1 P(E=1 | A=0, B=1): 0, 1, 0, 1 P(E=1 | A=1, B=1): 0, 1, 1, 1 Bayesian inference • Evaluating causal models in light of data: P(hi | d) = P(d | hi) P(hi) / Σ_{hj ∈ H} P(d | hj) P(hj) • Inferring a particular causal relation: P(A → E | d) = Σ_{hj ∈ H} P(A → E | hj) P(hj | d) Modeling backwards blocking • Prior odds that B is a blicket: P(B → E) / P(¬(B → E)) = [P(h01) + P(h11)] / [P(h00) + P(h10)] = q / (1 − q) • After trial 1 (A and B on detector, detector active), h00 is eliminated: P(B → E | d) / P(¬(B → E) | d) = [P(h01) + P(h11)] / P(h10) = 1 / (1 − q) • After trial 2 (A alone activates the detector), h01 is also eliminated: P(B → E | d) / P(¬(B → E) | d) = P(h11) / P(h10) = q / (1 − q) – back to the prior odds: B has been explained away. Manipulating the prior I. Pre-training phase: blickets are rare…. Figure 13: Procedure used in Sobel et al.
(2002), Experiment 2. • “Rare” condition: first observe 12 objects on the detector, of which 2 set it off. • “Common” condition: first observe 12 objects on the detector, of which 10 set it off. II. Backwards blocking phase: – Trial 1: A and B on detector – detector active – Trial 2: A on detector – detector active – After each trial, adults judge the probability that each object is a blicket. Inferences from ambiguous data I. Pre-training phase: blickets are rare (first observe 12 objects on the detector, of which 2 set it off). II. Two trials: – Trial 1: A and B on detector – detector active – Trial 2: B and C on detector – detector active – After each trial, adults judge the probability that each object is a blicket. The same domain theory generates the hypothesis space for 3 objects: • Hypotheses: h000 (no links), h100 (A → E), h010 (B → E), h001 (C → E), h110, h011, h101, h111 – each subscript indicating whether A, B, and C cause E. • Likelihoods: P(E=1 | A, B, C; h) = 1 if A = 1 and A → E exists, or B = 1 and B → E exists, or C = 1 and C → E exists; else 0. The role of causal mechanism knowledge • Is mechanism knowledge necessary? – Constraint-based learning using χ2 tests of conditional independence. • How important is the deterministic functional form of causal relations? – Bayes with a “noisy sufficient causes” theory (c.f. Cheng’s causal power theory). (figures): model predictions with the correct theory vs. the “noisy sufficient causes” theory. Theory-based causal induction • Explains one-shot causal inferences about physical systems: blicket detectors • Captures a spectrum of inferences: – unambiguous data: adults and children make all-or-none inferences – ambiguous data: adults and children make more graded inferences • Extends to more complex cases with hidden variables and dynamic systems: come to my talk! Summary • Causal graphical models provide a language for asking questions about causality • Key issues in modeling causal induction: – what do we mean by causal induction? – how do knowledge and statistics interact?
• Bayesian approach allows exploration of different answers to these questions

Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)

Property induction

Collaborators: Charles Kemp, Neville Sanjana, Lauren Schmidt, Amy Perfors, Fei Xu, Liz Baraff, Pat Shafto

The Big Question
• How can we generalize new concepts reliably from just one or a few examples?
– Learning word meanings: "horse" "horse" "horse"
– Learning causal relations, social rules, ….
– Property induction:

Gorillas have T4 cells.
Squirrels have T4 cells.
All mammals have T4 cells.

How probable is the conclusion (target) given the premises (examples)? Compare:

Gorillas have T4 cells.
Squirrels have T4 cells.
All mammals have T4 cells.

Gorillas have T4 cells.
Chimps have T4 cells.
All mammals have T4 cells.

More diverse examples → stronger generalization.

Is rational inference the answer?
• Everyday induction often appears to follow principles of rational scientific inference.
– Could that explain its success?
• Goal of this work: a rational computational model of human inductive generalization.
– Explain people's judgments as approximations to optimal inference in natural environments.
– Close quantitative fits to people's judgments with a minimum of free parameters or assumptions.

Theory-Based Bayesian Models
• Rational statistical inference (Bayes):

p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)

• Learners' domain theories generate their hypothesis space H and prior p(h).
– Well-matched to structure of the natural world.
– Learnable from limited data.
– Computationally tractable inference.

The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– "Empiricist" Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories

An experiment (Osherson et al., 1990)
• 20 subjects rated the strength of 45 arguments:
X1 have property P.
X2 have property P.
X3 have property P.
All mammals have property P.
• 40 different subjects rated the similarity of all pairs of 10 mammals.

Similarity-based models (Osherson et al.)

strength("all mammals" | X) ∝ Σ_{i ∈ mammals} sim(i, X)

• Sum-Similarity: sim(i, X) = Σ_{j ∈ X} sim(i, j)
• Max-Similarity: sim(i, X) = max_{j ∈ X} sim(i, j)

(Figures: category members and example sets plotted in similarity space.)
Sum-sim versus Max-sim
• The two models appear functionally similar:
– Both increase monotonically as new examples are observed.
• Reasons to prefer Sum-sim:
– Standard form of exemplar models of categorization, memory, and object recognition.
– Analogous to kernel density estimation techniques in statistical pattern recognition.
• Reasons to prefer Max-sim:
– Fit to generalization judgments . . . .

Data vs. models
(Scatter plots of model predictions against human judgments; each point represents one argument of the form: X1 have property P. X2 have property P. X3 have property P. All mammals have property P.)

Three data sets:
Conclusion kind: "all mammals" / "horses" / "horses"
Number of examples: 3 / 2 / 1, 2, or 3

Feature rating data (Osherson and Wilkie)
• People were given 48 animals and 85 features, and asked to rate whether each animal had each feature.
• E.g., elephant: 'gray' 'hairless' 'toughskin' 'big' 'bulbous' 'longleg' 'tail' 'chewteeth' 'tusks' 'smelly' 'walks' 'slow' 'strong' 'muscle' 'quadrapedal' 'inactive' 'vegetation' 'grazer' 'oldworld' 'bush' 'jungle' 'ground' 'timid' 'smart' 'group'
• Compute similarity based on Hamming distance or cosine.
• Generalize based on Max-sim or Sum-sim.
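The two aggregation rules above can be written out directly. A minimal sketch with made-up binary features and a hypothetical similarity function (1 minus normalized Hamming distance), not the actual Osherson–Wilkie data:

```python
def sum_sim(i, examples, sim):
    """Sum-Similarity: total similarity of i to the example set."""
    return sum(sim(i, j) for j in examples)

def max_sim(i, examples, sim):
    """Max-Similarity: similarity of i to its nearest example."""
    return max(sim(i, j) for j in examples)

def strength(conclusion_kind, examples, sim, agg):
    """Argument strength: aggregate similarity of each member of the
    conclusion category to the examples (Osherson et al., 1990)."""
    return sum(agg(i, examples, sim) for i in conclusion_kind)

# Toy binary feature vectors (hypothetical, not the real 85-feature data).
features = {
    "horse":    (1, 1, 1, 0, 0),
    "cow":      (1, 1, 0, 0, 0),
    "dolphin":  (0, 0, 0, 1, 1),
    "squirrel": (0, 1, 0, 0, 1),
}

def sim(a, b):
    """Similarity = 1 - normalized Hamming distance between feature vectors."""
    fa, fb = features[a], features[b]
    return 1 - sum(x != y for x, y in zip(fa, fb)) / len(fa)

mammals = list(features)
s_one = strength(mammals, ["horse"], sim, max_sim)
s_two = strength(mammals, ["horse", "dolphin"], sim, max_sim)
```

Adding the dissimilar example "dolphin" raises the Max-sim strength (both rules are monotonic in the example set), one way of capturing the diversity effect noted earlier.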
Three data sets (correlations between model and human judgments; conclusion kinds "all mammals" / "horses" / "horses"; 3 / 2 / 1, 2, or 3 examples):
• Max-Sim: r = 0.77, r = 0.75, r = 0.94
• Sum-Sim: r = –0.21, r = 0.63, r = 0.19

Problems for the similarity-based approach
• No principled explanation for why Max-Sim works so well on this task, and Sum-Sim so poorly, when Sum-Sim is the standard in other similarity-based models.
• Free parameters mixing similarity and coverage terms, and possibly Max-Sim and Sum-Sim terms.
• Does not extend to induction with other kinds of properties, e.g., from Smith et al., 1993:

Dobermanns can bite through wire.
German shepherds can bite through wire.

versus:

Poodles can bite through wire.
German shepherds can bite through wire.

Marr's Three Levels of Analysis
• Computation: "What is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?"
• Representation and algorithm: Max-sim, Sum-sim
• Implementation: Neurobiology

The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– "Empiricist" Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories

Theory-based induction
• Scientific biology: species generated by an evolutionary branching process.
– A tree-structured taxonomy of species.
• Taxonomy also central in folkbiology (Atran).
• Begin by reconstructing the intuitive taxonomy from similarity judgments: clustering.

How taxonomy constrains induction
• Atran (1998): "Fundamental principle of systematic induction" (Warburton 1967, Bock 1973)
– Given a property found among members of any two species, the best initial hypothesis is that the property is also present among all species that are included in the smallest higher-order taxon containing the original pair of species.

"all mammals":
Cows have property P.
Dolphins have property P.
Squirrels have property P.
All mammals have property P.
Strong: 0.76 [max = 0.82]

"large herbivores" (smallest covering taxon is not "all mammals"):
Cows have property P.
Horses have property P.
Rhinos have property P.
All mammals have property P.
Weak: 0.17 [min = 0.14]

"all mammals":
Cows have property P.
Dolphins have property P.
Squirrels have property P.
All mammals have property P.
Strong: 0.76 [max = 0.82]

Seals have property P.
Dolphins have property P.
Squirrels have property P.
All mammals have property P.
Weak: 0.30 [min = 0.14]

(Figure: argument strength against taxonomic distance; Max-sim and Sum-sim compared across the three data sets — conclusion kinds "all mammals", "horses", "horses"; 3, 2, or 1, 2, or 3 examples.)

The challenge
• Can we build models with the best of both traditional approaches?
– Quantitatively accurate predictions.
– Strong rational basis.
• Will require novel ways of integrating structured knowledge with statistical inference.

The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– "Empiricist" Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories

The Bayesian approach
(Schematic, repeated through the following slides: a species × features matrix with one new, partially observed property column.)
• Data d: the observed features, plus examples of the new property.
• Hypothesis h: the set of species that have the new property.
• Generalization: infer how far the new property extends.

Bayes' rule:

p(h | d) = p(d | h) p(h) / Σ_{h′ ∈ H} p(d | h′) p(h′)

Probability that property Q holds for species x:

p(Q(x) | d) = Σ_{h consistent with Q(x)} p(h | d)

"Size principle":

p(d | h) = 1/|h| if d is consistent with h (|h| = number of positive instances of h), else 0.
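The size principle can be illustrated with the number-game hypotheses "even numbers" and "multiples of 10". An illustrative sketch (hypothetical function names; the n-example form p(d | h) = (1/|h|)^n assumes examples are sampled independently and uniformly from the hypothesis's extension):

```python
def size_likelihood(examples, hypothesis):
    """Size principle with n independent examples: p(d | h) = (1/|h|)^n
    if h contains every example, else 0."""
    if all(x in hypothesis for x in examples):
        return (1 / len(hypothesis)) ** len(examples)
    return 0.0

even = set(range(2, 101, 2))       # "even numbers": 50 members
mult10 = set(range(10, 101, 10))   # "multiples of 10": 10 members

# The likelihood ratio favoring the smaller hypothesis grows with more examples:
r1 = size_likelihood([10], mult10) / size_likelihood([10], even)                   # 5x
r3 = size_likelihood([10, 30, 60], mult10) / size_likelihood([10, 30, 60], even)   # 125x
```

One consistent example is "slightly more of a coincidence" under the larger hypothesis; three consistent examples are much more of a coincidence, which is exactly the pattern on the slides that follow.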
The size principle

h1 = "even numbers": 2, 4, 6, 8, 10, 12, …, 96, 98, 100
h2 = "multiples of 10": 10, 20, 30, 40, 50, 60, 70, 80, 90, 100

• With one example consistent with both hypotheses, the data are slightly more of a coincidence under h1.
• With several consistent examples, the data are much more of a coincidence under h1.

Illustrating the size principle
Which argument is stronger?

Grizzly bears have property P.
All mammals have property P.

Grizzly bears have property P.
Brown bears have property P.
Polar bears have property P.
All mammals have property P.

"Non-monotonicity": adding premises can make the argument weaker.

Probability that property Q holds for species x:

p(Q(x) | d) = [ Σ_{h consistent with Q(x), d} p(h)/|h| ] / [ Σ_{h consistent with d} p(h)/|h| ]

Specifying the prior p(h)
• A good prior must focus on a small subset of all 2^n possible hypotheses, in order to:
– Match the distribution of properties in the world.
– Be learnable from limited data.
– Be efficiently computable.
• We consider two approaches:
– "Empiricist" Bayes: unstructured prior based directly on known features.
– "Theory-based" Bayes: structured prior based on a rational domain theory, tuned to known features.

"Empiricist" Bayes (Heit, 1998)
• Hypotheses h1 … h12 are labelings drawn directly from the known feature columns, with prior p(h) set by how often each labeling occurs among the known features (values such as 1/15, 2/15, 3/15 in the schematic).

Results (correlations with human judgments, three data sets):
• "Empiricist" Bayes: r = 0.38, r = 0.16, r = 0.79
• Max-Sim: r = 0.77, r = 0.75, r = 0.94

Why doesn't "Empiricist" Bayes work?
• With no structural bias, it requires too many features to estimate the prior reliably.
• An analogy: estimating a smooth probability density function by local interpolation (N = 5, 100, 500 samples). Assuming an appropriately structured form for the density (e.g., Gaussian) leads to better generalization from sparse data.

"Theory-based" Bayes
Theory: two principles based on the structure of species and properties in the natural world.
1. Species generated by an evolutionary branching process.
– A tree-structured taxonomy of species (Atran, 1998).
2. Features generated by a stochastic mutation process and passed on to descendants.
– Novel features can appear anywhere in the tree, but some distributions are more likely than others.

Mutation process generates p(h | T):
– Choose a label for the root.
– Probability that the label mutates along branch b:

(1 − e^(−2λ|b|)) / 2

where λ = mutation rate and |b| = length of branch b.
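The branch-flip probability above can be sampled directly. An illustrative sketch (the four-leaf tree and function names are my own; binary labels propagate from the root, flipping along each branch with probability (1 − e^(−2λ|b|))/2):

```python
import math
import random

def flip_prob(branch_length, lam):
    """P(label mutates along branch b) = (1 - e^(-2*lam*|b|)) / 2."""
    return (1 - math.exp(-2 * lam * branch_length)) / 2

def sample_leaf_labels(tree, lam, rng):
    """tree: dict mapping node -> list of (child, branch_length) pairs.
    Returns {leaf: 0/1}, one sample of h from the mutation process p(h | T)."""
    labels = {}
    def descend(node, label):
        children = tree.get(node, [])
        if not children:                 # leaf: record its label
            labels[node] = int(label)
        for child, length in children:
            flipped = rng.random() < flip_prob(length, lam)
            descend(child, (not label) if flipped else label)
    descend("root", rng.random() < 0.5)  # uniform label at the root
    return labels

# Two tight clusters, (s1, s2) and (s3, s4), joined by long internal branches.
tree = {
    "root": [("x", 1.0), ("y", 1.0)],
    "x": [("s1", 0.1), ("s2", 0.1)],
    "y": [("s3", 0.1), ("s4", 0.1)],
}
```

Sampling many labelings from this prior shows the two properties stated on the slides: species within a cluster share labels far more often than species across clusters, so labelings that cut the tree along few branches ("monophyletic") are more probable than those that cut many.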
(The mutation-process slide is repeated with example labelings h marked on the tree.)

Samples from the prior
• Labelings that cut the data along fewer branches are more probable: "monophyletic" > "polyphyletic".
• Labelings that cut the data along longer branches are more probable: "more distinctive" > "less distinctive".

• The mutation process over tree T generates p(h | T).
• Message passing over tree T efficiently sums over all h.
• How do we know which tree T to use? The same mutation process generates p(Features | T):
– Assume each feature is generated independently over the tree.
– Use MCMC to infer the most likely tree T and mutation rate λ given the observed features.
– No free parameters!

Results (correlations with human judgments, three data sets):
• "Theory-based" Bayes: r = 0.91, r = 0.95, r = 0.91
• "Empiricist" Bayes: r = 0.38, r = 0.16, r = 0.79
• Max-Sim: r = 0.77, r = 0.75, r = 0.94

Grounding in similarity
• Reconstruct the intuitive taxonomy from similarity judgments: clustering.
• (Comparison across the three data sets: Theory-based Bayes, Max-sim, Sum-sim; conclusion kinds "all mammals", "horses", "horses"; 3, 2, or 1, 2, or 3 examples.)

Explaining similarity
• Why does Max-sim fit so well?
– It is an efficient and accurate approximation to this Theory-Based Bayesian model.
– Correlation with Bayes on three-premise general arguments, over 100 simulated trees: mean r = 0.94.
– Theorem: Nearest-neighbor classification approximates evolutionary Bayes in the limit of high mutation rate, if the domain is tree-structured.

Alternative feature-based models
• Taxonomic Bayes: strictly taxonomic ("monophyletic") hypotheses, with no mutation process.
• PDP network (Rogers and McClelland): species-to-features mapping learned by a connectionist network.

Results (correlations with human judgments, three data sets):
• Theory-based Bayes (bias is just right): r = 0.91, r = 0.95, r = 0.91
• Taxonomic Bayes (bias is too strong): r = 0.51, r = 0.53, r = 0.85
• PDP network (bias is too weak): r = 0.41, r = 0.62, r = 0.71

Mutation principle versus pure Occam's Razor
• The mutation principle provides a version of Occam's Razor, by favoring hypotheses that span fewer disjoint clusters.
• Could we use a more generic Bayesian Occam's Razor, without the biological motivation of mutation?
– Mutation version: probability of a label change along branch b is (1 − e^(−2λ|b|)) / 2, with λ = mutation rate and |b| = branch length.
– Generic version: probability of a label change along any branch is a constant λ, independent of branch length.

Premise typicality effect (Rips, 1975; Osherson et al., 1990):

Strong:
Horses have property P.
All mammals have property P.

Weak:
Seals have property P.
All mammals have property P.

(Conclusion kind: "all mammals"; number of examples: 1. Models compared: Bayes (taxonomy + mutation), Bayes (taxonomy + Occam), Max-sim.)

Typicality meets hierarchies
• Collins and Quillian: semantic memory structured hierarchically.
• Traditional story: simple hierarchical structure sits uncomfortably with typicality effects & exceptions.
• New story: typicality & exceptions are compatible with rational statistical inference over a hierarchy.

Intuitive versus scientific theories of biology
• Same structure for how species are related.
– Tree-structured taxonomy.
• Same probabilistic model for traits.
– Small probability of occurring along any branch at any time, plus inheritance.
• Different features.
– Scientist: genes.
– People: coarse anatomy and behavior.

Induction in Biology: summary
• Theory-based Bayesian inference explains taxonomic inductive reasoning in folk biology.
• Insight into processing-level accounts.
– Why Max-sim over Sum-sim in this domain?
– How is hierarchical representation compatible with typicality effects & exceptions?
• Reveals essential principles of the domain theory.
– Category structure: taxonomic tree.
– Feature distribution: stochastic mutation process + inheritance.

The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– "Empiricist" Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories

Property type / Theory / Structure:
• Generic "essence": taxonomic theory, tree structure.
• Size-related: dimensional theory, species ordered along a dimension.
• Food-carried: directed acyclic network (food web).
(Each illustrated with the same animals: lion, cheetah, hyena, giraffe, gazelle, gorilla, monkey.)
One-dimensional predicates
• Q = "Have skins that are more resistant to penetration than most synthetic fibers."
– Unknown relevant property: skin toughness.
– Model the influence of known properties via the judged prior probability that each species has Q.
– Generalization follows a threshold for Q along the skin-toughness dimension (house cat < camel < elephant < rhino).
• Models compared: Bayes (taxonomy + mutation), Max-sim, Bayes (1-D model).

Model fits (Shafto et al.; columns = mammals scenario, island scenario):
• Food web model: disease property r = 0.77, r = 0.82; generic property r = –0.35, r = –0.05.
• Taxonomic tree model: disease property r = –0.12, r = 0.16; generic property r = 0.81, r = 0.62.

The plan
• Similarity-based models
• Theory-based model
• Bayesian models
– "Empiricist" Bayes
– Theory-based Bayes, with different theories
• Connectionist (PDP) models
• Advanced Theory-based Bayes
– Learning with multiple domain theories
– Learning domain theories

Theory
• Species organized in a taxonomic tree structure.
• Feature i generated by a mutation process with rate λi.
(Schematic: theory → domain structure S, a tree over the species, via p(S | T); structure → data D, the observed species × feature matrix, via p(D | S). A feature's inferred rate λi acts like a weight: high λ ~ low weight.)
• A new species X with a few observed features can be placed in the structure, and its missing features predicted.
(The theory schematic is repeated: a new species X is placed in the tree, and its unobserved features are predicted via p(D | S) and p(S | T).)

Where does the domain theory come from?
• Innate.
– Atran (1998): the tendency to group living kinds into hierarchies reflects an "innately determined cognitive structure".
• Emerges (only approximately) through learning in unstructured connectionist networks.
– McClelland and Rogers (2003).

Bayesian inference to theories
• Challenge to the nativist-empiricist dichotomy.
– We really do have structured domain theories.
– We really do learn them.
• Bayesian framework applies over multiple levels:
– Given hypothesis space + data, infer concepts.
– Given theory + data, infer hypothesis space.
– Given X + data, infer theory.

• Candidate theories for biological species and their features:
– T0: Features generated independently for each species (c.f. naive Bayes, Anderson's rational model).
– T1: Features generated by mutation in a tree-structured taxonomy of species.
– T2: Features generated by mutation in a one-dimensional chain of species.
• Score theories by their likelihood on the object-feature matrix:

p(D | T) = Σ_S p(D | S, T) p(S | T) ≈ max_S p(D | S, T) p(S | T)

T0: no organizational structure for species; features distributed independently over species.
T1: species organized in a taxonomic tree structure; features distributed via a stochastic mutation process.

(Feature-matrix figures omitted. For a dataset generated over a tree, the tree theory wins: p(Data | T0) ≈ 1.83 × 10^-41 versus p(Data | T1) ≈ 2.42 × 10^-32. For a dataset with independently scrambled features, the ordering reverses: p(Data | T0) ≈ 2.29 × 10^-42 versus p(Data | T1) ≈ 4.38 × 10^-53.)

Empirical tests
• Synthetic data: 32 objects, 120 features, generated from
– a tree-structured generative model,
– a linear chain generative model,
– an unconstrained model (independent features).
• Real data:
– Animal feature judgments: 48 species, 85 features.
– US Supreme Court decisions, 1981-1985: 9 justices, 637 cases.

Results (preferred model): null for the unconstrained synthetic data, tree for the tree-generated data, linear for the chain-generated data; tree for the animal judgments, linear for the Supreme Court decisions.

Theory acquisition: summary
• So far, just a computational proof of concept.
• Future work:
– Experimental studies of theory acquisition in the lab, with adult and child subjects.
– Modeling developmental or historical trajectories of theory change.
• Sources of hypotheses for candidate theories:
– What is innate?
– Role of analogy?
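The theory score p(D | T) can be computed exactly for the null theory T0, since each feature column is an independent Bernoulli draw whose parameter can be integrated out. A hedged sketch (the Beta(1, 1) prior and function names are my assumptions, not the authors' implementation):

```python
from math import exp, lgamma

def log_beta(a, b):
    """log of the Beta function, computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marglik_column(column, a=1.0, b=1.0):
    """Beta-Bernoulli marginal likelihood of one binary feature column:
    p(column) = B(a + k, b + n - k) / B(a, b), with n entries and k ones."""
    n, k = len(column), sum(column)
    return log_beta(a + k, b + n - k) - log_beta(a, b)

def log_p_data_given_T0(matrix):
    """T0: no structure over species; feature columns are independent."""
    return sum(log_marglik_column(list(col)) for col in zip(*matrix))

# Evaluate the null-theory score for a tiny 4-species x 2-feature matrix.
D = [(1, 0), (1, 0), (0, 1), (0, 1)]
score = log_p_data_given_T0(D)
```

Scoring T1 or T2 the same way would additionally require summing (or maximizing) over candidate trees or chains S, e.g. by the MCMC mentioned earlier; that part is omitted here.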
Outline
• Morning
– Introduction (Josh)
– Basic case study #1: Flipping coins (Tom)
– Basic case study #2: Rules and similarity (Josh)
• Afternoon
– Advanced case study #1: Causal induction (Tom)
– Advanced case study #2: Property induction (Josh)
– Quick tour of more advanced topics (Tom)

Advanced topics: structure and statistics
• Statistical language modeling
– topic models
• Relational categorization
– attributes and relations

Statistical language modeling
• A variety of approaches to statistical language modeling are used in cognitive science
– e.g., LSA (Landauer & Dumais, 1997)
– distributional clustering (Redington, Chater, & Finch, 1998)
• Generative models have unique advantages
– identify the assumed causal structure of language
– make use of standard tools of Bayesian statistics
– easily extended to capture more complex structure

Generative models for language: latent structure → observed data (meaning → sentences)

Topic models
• Each document is a mixture of topics.
• Each word is chosen from a single topic.
• Introduced by Blei, Ng, and Jordan (2001); a reinterpretation of PLSI (Hofmann, 1999).
• The idea of probabilistic topics is widely used (e.g.,
Bigi et al., 1997; Iyer & Ostendorf, 1996; Ueda & Saito, 2003).

Generating a document
• θ: distribution over topics; z: topic assignment for each word; w: observed words.
• Example topic distributions:
– topic 1: HEART, LOVE, SOUL, TEARS, JOY, each with P(w | z = 1) = 0.2 (zero elsewhere)
– topic 2: SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS, each with P(w | z = 2) = 0.2 (zero elsewhere)
• Choose mixture weights θ = {P(z = 1), P(z = 2)} for each document, then generate a "bag of words":
– θ = {0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK …
– θ = {0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART MATHEMATICS …
– θ = {0.5, 0.5}: HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART WORK …
– θ = {0.75, 0.25}: JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL TEARS …
– θ = {1, 0}: LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY …

A selection of topics (from 500), each shown by its most probable words: a scientific-theory topic (THEORY, HYPOTHESIS, EXPERIMENT, OBSERVATIONS, SCIENTIST, EVIDENCE, …), a space topic (SPACE, EARTH, MOON, PLANET, ORBIT, ASTRONAUTS, …), an art topic (ART, PAINT, ARTIST, PAINTING, MUSEUM, …), a school topic (STUDENTS, TEACHER, CLASS, CLASSROOM, GRADE, …), a nervous-system topic (BRAIN, NERVE, SENSES, NERVOUS, SPINAL, …), an electricity topic (CURRENT, ELECTRICITY, CIRCUIT, VOLTAGE, BATTERY, …), a philosophy topic (NATURE, WORLD, PHILOSOPHY, KNOWLEDGE, TRUTH, …), and an ordinal-number topic (FIRST, SECOND, THIRD, FOURTH, …).

Further examples: a disease topic (DISEASE, BACTERIA, GERMS, FEVER, INFECTION, …), a mind/dream topic (MIND, DREAM, THOUGHT, IMAGINATION, CONSCIOUSNESS, …), a story topic (STORY, TELL, CHARACTER, AUTHOR, PLOT, …), a magnetism topic (MAGNETIC, MAGNET, WIRE, NEEDLE, COMPASS, …), a science topic (SCIENCE, STUDY, SCIENTISTS, RESEARCH, CHEMISTRY, …), a sports topic (BALL, GAME, TEAM, FOOTBALL, BASEBALL, …), a jobs topic (JOB, WORK, CAREER, TRAINING, SKILLS, …), and a sea-life topic (WATER, FISH, SEA, SWIM, SHARK, …).

Learning topic hierarchies (Blei, Griffiths, Jordan, & Tenenbaum, 2004)

Syntax and semantics from statistics
• Factorization of language based on statistical dependency patterns:
– semantics: probabilistic topics capture long-range, document-specific dependencies (θ → z → w)
– syntax: a probabilistic regular grammar captures short-range dependencies constant across all documents (x → x → x)
(Griffiths, Steyvers, Blei, & Tenenbaum, submitted)

Example composite model:
• syntactic class x = 1 emits a topic word: with probability 0.4 from topic z = 1 (HEART, LOVE, SOUL, TEARS, JOY, each 0.2) and probability 0.6 from topic z = 2 (SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, MATHEMATICS, each 0.2)
• class x = 2 emits a function word: OF 0.6, FOR 0.3, BETWEEN 0.1
• class x = 3 emits a function word: THE 0.6, A 0.3, MANY 0.1
• transitions between classes generate word sequences such as: THE LOVE OF RESEARCH …

Semantic categories (samples): {FOOD, FOODS, BODY, NUTRIENTS, DIET, …}, {MAP, NORTH, EARTH, SOUTH, POLE, …}, {DOCTOR, PATIENT, HEALTH, HOSPITAL, MEDICAL, …}, {BOOK, BOOKS, READING, INFORMATION, LIBRARY, …}, {GOLD, IRON, SILVER, COPPER, METAL, …}, {BEHAVIOR, SELF, INDIVIDUAL, PERSONALITY, RESPONSE, …}, {CELLS, CELL, ORGANISMS, ALGAE, BACTERIA, …}, {PLANTS, PLANT, LEAVES, SEEDS, SOIL, …}

Syntactic categories (samples): {SAID, ASKED, THOUGHT, TOLD, SAYS, …}, {THE, HIS, THEIR, YOUR, HER, …}, {MORE, SUCH, LESS, MUCH, KNOWN, …}, {ON, AT, INTO, FROM, WITH, …}, {GOOD, SMALL, NEW, IMPORTANT, GREAT, …}, {ONE, SOME, MANY, TWO, EACH, …}, {HE, YOU, THEY, I, SHE, …}, {BE, MAKE, GET, HAVE, GO, …}
NOBODY LOOK LAUGHED THOSE SMALLER NEAR STRONG BOTH ONE COME MEANT EACH SOMETHING BEHIND YOUNG TEN SOMETHING WORK WROTE MR BIGGER OFF COMMON SIX ANYONE MOVE SHOWED ANY FEWER ABOVE WHITE MUCH EVERYBODY LIVE BELIEVED MRS LOWER DOWN SINGLE TWENTY SOME EAT WHISPERED ALL ALMOST BEFORE CERTAIN EIGHT THEN BECOME Statistical language modeling • Generative models provide – transparent assumptions about causal process – opportunities to combine and extend models • Richer generative models... – probabilistic context-free grammars – paragraph or sentence-level dependencies – more complex semantics Structure and statistics • Statistical language modeling – topic models • Relational categorization – attributes and relations Relational categorization • Most approaches to categorization in psychology and machine learning focus on attributes - properties of objects – words in titles of CogSci posters • But… a significant portion of knowledge is organized in terms of relations – co-authors on posters – who talks to whom (Kemp, Griffiths, & Tenenbaum, 2004) Attributes and relations Data Model objects attributes P(X) = ik z P(xik|zi) i P(zi) X mixture model (c.f. 
Anderson, 1990).
• Relational data: an objects × objects matrix Y, modeled with a stochastic blockmodel:
P(Y) = Σ_Z Π_i P(z_i) Π_ij P(y_ij | z_i, z_j)

Stochastic blockmodels
• For any pair of objects (i, j), the probability of a relation is determined by their classes (z_i, z_j): each entity has a type, given by Z, and the matrix L = [l_kl] gives the probability of a link from type k to type l.
• Inference: P(Z, L | Y) ∝ P(Y | Z, L) P(Z) P(L)
• Allows types of objects and class probabilities to be learned from data.

Stochastic blockmodels
[Figure: relational matrices with rows and columns sorted by class (A, B, C, D), showing the block structure the model recovers.]

Categorizing words
• Relational data: word association norms (Nelson, McEvoy, & Schreiber, 1998)
• 5018 x 5018 matrix of associations
– symmetrized
– all words with < 50 and > 10 associates
– 2513 nodes, 34716 links

Categorizing words
[Figure: example word classes, e.g. BAND/INSTRUMENT/HORN/FLUTE, TIE/COAT/SHOES/HAT, SEW/MATERIAL/WOOL/YARN, WASH/LIQUID/BATHROOM/SINK.]

Categorizing actors
• Internet Movie Database (IMDB) data, from the start of cinema to 1960 (Jeremy Kubica)
• Relational data: collaboration
• 5000 x 5000 matrix of most prolific actors
– all actors with < 400 and > 1 collaborators
– 2275 nodes, 204761 links

Categorizing actors
[Figure: actor clusters recovered by the model, labeled Germany, UK, British comedy, Italian, and US Westerns.]

Structure and statistics
• The Bayesian approach allows us to specify structured probabilistic models.
• Explore novel representations and domains
– topics for semantic representation
– relational categorization
• Use powerful methods for inference, developed in
statistics and machine learning.

Other methods and tools...
• Inference algorithms
– belief propagation
– dynamic programming
– the EM algorithm and variational methods
– Markov chain Monte Carlo
• More complex models
– Dirichlet processes and Bayesian non-parametrics
– Gaussian processes and kernel methods
Reading list at http://www.bayesiancognition.com

Taking stock

Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models:
[Diagram: a domain theory generates structural hypotheses, which define a probabilistic generative model for data; Bayesian inference runs in the reverse direction, from data back to theory.]

Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models:
[Diagram: a theory T generates candidate structures S, each of which generates observed data sets D.]

Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models.
• What the approach offers:
– Strong quantitative models of generalization behavior.
– Flexibility to model the different patterns of reasoning that arise in different tasks and domains, using differently structured theories but the same general-purpose Bayesian engine.
– A framework for explaining why inductive generalization works: where knowledge comes from as well as how it is used.

Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models.
• Challenges:
– Theories are hard.

Bayesian models of inductive learning
• Inductive leaps can be explained with hierarchical Theory-based Bayesian models:
• The interaction between structure and statistics is crucial.
– How structured knowledge supports statistical learning, by constraining hypothesis spaces.
– How statistics supports reasoning with, and learning of, structured knowledge.
– How complex structures can grow from data, rather than being fully specified in advance.
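The "Generating a document" process from the topic-model slides is compact enough to sketch in code. This is a minimal sketch of the two-topic example: the topic distributions and mixture weights are the ones shown on the slide, and words are drawn independently given their topic assignments.

```python
import random

# Two topics over a tiny vocabulary, as on the "Generating a document" slide:
# each topic puts probability 0.2 on five words (uniform over its word list).
topics = {
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}

def generate_document(theta, n_words, rng=random):
    """Generate a bag of words: for each word, draw a topic z ~ theta,
    then a word w ~ P(w | z)."""
    doc = []
    for _ in range(n_words):
        z = 1 if rng.random() < theta[0] else 2  # theta = (P(z=1), P(z=2))
        doc.append(rng.choice(topics[z]))
    return doc

print(generate_document((1.0, 0.0), 5))   # only topic-1 words
print(generate_document((0.5, 0.5), 10))  # a mixture of both topics
```

With q = {1, 0} every word comes from topic 1; intermediate weights interpolate between the two vocabularies, exactly as in the example documents on the slide.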
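The composite model from "Syntax and semantics from statistics" can be sketched the same way: an HMM over syntactic classes, where one class hands off to the topic model. The emission tables and topic weights below are the ones on the slide; the transition structure among classes is an illustrative assumption (the slide's transition probabilities did not survive extraction), chosen so that four steps can produce "THE LOVE OF RESEARCH".

```python
import random

# HMM over syntactic classes: class 1 emits content words via the topic
# model; classes 2-3 emit function words. Emission tables are from the
# slide; the (deterministic) transitions here are an assumed illustration.
emissions = {
    2: {"OF": 0.6, "FOR": 0.3, "BETWEEN": 0.1},
    3: {"THE": 0.6, "A": 0.3, "MANY": 0.1},
}
topics = {
    1: ["HEART", "LOVE", "SOUL", "TEARS", "JOY"],
    2: ["SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"],
}
theta = {1: 0.4, 2: 0.6}          # P(z = 1), P(z = 2) from the slide
transitions = {3: 1, 1: 2, 2: 1}  # assumed: determiner -> content -> prep -> content

def sample(dist, rng=random):
    """Draw one item from a {item: probability} table."""
    r, acc = rng.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item

def generate_sentence(n_words, rng=random):
    words, x = [], 3  # start in the determiner class
    for _ in range(n_words):
        if x == 1:                        # content slot: consult the topic model
            words.append(rng.choice(topics[sample(theta, rng)]))
        else:                             # function-word slot: HMM emissions
            words.append(sample(emissions[x], rng))
        x = transitions[x]
    return " ".join(words)

print(generate_sentence(4))  # e.g. "THE LOVE OF RESEARCH"
```

The factorization is visible in the code: long-range, document-specific structure lives in theta and topics; short-range structure, shared by all documents, lives in transitions and emissions.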
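The stochastic blockmodel's generative step is equally compact: each relation y_ij is sampled with a probability that depends only on the classes of objects i and j. The class assignments and link probabilities below are illustrative toy values, not parameters from the word-association or IMDB analyses.

```python
import random

# Stochastic blockmodel sketch: P(y_ij = 1) = L[(z_i, z_j)], so the link
# probability depends only on the two objects' classes.
def generate_relations(z, L, rng=random):
    """z: list of class labels per object; L: dict mapping a class pair
    (z_i, z_j) to a link probability. Returns the binary matrix Y."""
    n = len(z)
    return [[1 if rng.random() < L[(z[i], z[j])] else 0 for j in range(n)]
            for i in range(n)]

z = ["A", "A", "B", "B"]                  # two classes, two objects each
L = {("A", "A"): 0.9, ("B", "B"): 0.9,    # dense links within classes
     ("A", "B"): 0.1, ("B", "A"): 0.1}    # sparse links between classes
for row in generate_relations(z, L):
    print(row)  # rows sorted by class tend to show a block structure
```

Inference inverts this process: given an observed Y, posterior samples of Z and L recover the types and link probabilities, which is how the word and actor categories on the slides were learned.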
