Reasoning Under Uncertainty


      Artificial Intelligence
          CSPP 56553
       February 18, 2004
                    Agenda
• Motivation
  – Reasoning with uncertainty
     • Medical Informatics
• Probability and Bayes’ Rule
  – Bayesian Networks
  – Noisy-Or
• Decision Trees and Rationality
• Conclusions
                       Uncertainty
• Search and Planning Agents
  – Assume fully observable, deterministic, static
• Real World:
  – Probabilities capture “Ignorance & Laziness”
     • Lack relevant facts, conditions
     • Failure to enumerate all conditions, exceptions
  – Partially observable, stochastic, extremely complex
  – Can't be sure of success, so the agent maximizes expected value
  – Bayesian (subjective) probabilities relate to knowledge
                   Motivation
• Uncertainty in medical diagnosis
  – Diseases produce symptoms
  – In diagnosis, observed symptoms => disease ID
  – Uncertainties
     • Symptoms may not occur
     • Symptoms may not be reported
     • Diagnostic tests not perfect
        – False positive, false negative

• How do we estimate confidence?
              Motivation II
• Uncertainty in medical decision-making
  – Physicians, patients must decide on treatments
  – Treatments may not be successful
  – Treatments may have unpleasant side effects
• Choosing treatments
  – Weigh risks of adverse outcomes
• People are BAD at reasoning intuitively
  about probabilities
  – Provide systematic analysis
           Probability Basics
• The sample space:
  – A set Ω ={ω1, ω2, ω3,… ωn}
     • E.g. the 6 possible rolls of a die
     • ωi is a sample point/atomic event
• Probability space/model is a sample space
  with an assignment P(ω) for every ω in Ω
  s.t. 0 <= P(ω) <= 1; Σ_ω P(ω) = 1
  – E.g. P(die roll < 4)=1/6+1/6+1/6=1/2
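
The die example can be checked with a few lines of Python. This is a minimal sketch that assumes a fair die; the names `omega`, `P`, and `prob` are illustrative and not from the slides.

```python
from fractions import Fraction

# Sample space for one roll of a fair die; each sample point gets 1/6.
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}

def prob(event):
    """Probability of an event given as a predicate over sample points."""
    return sum(P[w] for w in omega if event(w))

total = sum(P.values())                 # must equal 1 for a valid model
p_less_than_4 = prob(lambda w: w < 4)   # 1/6 + 1/6 + 1/6
print(total, p_less_than_4)             # 1 1/2
```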
              Random Variables
• A random variable is a function from sample
  points to a range (e.g. reals, bools)
     • E.g. Odd(1) = true
• P induces a probability distribution for any r.v X:
  – P(X=xi) = Σ_{ω: X(ω)=xi} P(ω)
        – E.g. P(Odd=true)=1/6+1/6+1/6=1/2

• A proposition is the event (set of sample points) where the
  proposition is true: e.g. event a = {ω : A(ω) = true}
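
A random variable and its induced distribution can be sketched as follows, again assuming a fair die. `odd` and `distribution` are hypothetical helper names used only for illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Probability model: fair die (an assumption for this sketch).
P = {w: Fraction(1, 6) for w in range(1, 7)}

def odd(w):
    """Random variable Odd: maps a sample point to a boolean."""
    return w % 2 == 1

def distribution(rv):
    """P(X = x) = sum of P(w) over all sample points w with rv(w) == x."""
    dist = defaultdict(Fraction)   # Fraction() is 0
    for w, p in P.items():
        dist[rv(w)] += p
    return dict(dist)

odd_dist = distribution(odd)
print(odd_dist[True])   # 1/2
```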
          Why probabilities?
• Definitions imply that logically related
  events have related probabilities
• In AI applications, sample points are
  defined by set of random variables
  – Random vars: boolean, discrete, continuous
             Prior Probabilities
• Prior probabilities: belief prior to evidence
   – E.g. P(cavity=t)=0.2; P(weather=sunny)=0.6
   – Distribution gives values for all assignments
• Joint distribution on set of r.v.s gives probability
  on every atomic event of r.v.s
   – E.g. P(weather,cavity)=4x2 matrix of values
• Every question about a domain can be answered
  with joint b/c every event is a sum of sample pts
      Conditional Probabilities
• Conditional (posterior) probabilities
  – E.g. P(cavity|toothache) = 0.8, given that toothache is all that is known
  – P(Cavity|Toothache) = 2-element vector of 2-element vectors
• Can add new evidence, possibly irrelevant
• P(a|b) = P(a,b)/P(b) where P(b) ≠0
• Also, P(a,b)=P(a|b)P(b)=P(b|a)P(a)
  – Product rule generalizes to chaining
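
The definition of conditional probability and the product rule can be verified on the fair-die model used earlier. This is a sketch under that assumption; the events `odd` and `low` (roll < 4) are chosen here for illustration.

```python
from fractions import Fraction

P = {w: Fraction(1, 6) for w in range(1, 7)}   # fair die (assumed)

def prob(event):
    return sum(P[w] for w in P if event(w))

def conditional(a, b):
    """P(a | b) = P(a, b) / P(b), defined only when P(b) != 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(b) must be non-zero")
    return prob(lambda w: a(w) and b(w)) / p_b

odd = lambda w: w % 2 == 1
low = lambda w: w < 4

p_joint = prob(lambda w: odd(w) and low(w))    # rolls 1 and 3
p_odd_given_low = conditional(odd, low)
print(p_odd_given_low)                         # 2/3

# Product rule: P(a,b) = P(a|b)P(b) = P(b|a)P(a)
lhs = conditional(odd, low) * prob(low)
rhs = conditional(low, odd) * prob(odd)
```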
Inference By Enumeration
Independence
Conditional Independence
Conditional Independence II
 Probabilities Model Uncertainty
• The World - Features
   – Random variables: $\{X_1, X_2, \ldots, X_n\}$
   – Feature values: $\{x_{i1}, x_{i2}, \ldots, x_{ik_i}\}$
• States of the world
   – Assignments of values to variables
   – Number of states: $\prod_{i=1}^{n} k_i$
   – Exponential in # of variables
   – $k_i = 2$: $2^n$ possible states
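
The growth in the number of world states is easy to see numerically. A minimal sketch, where `num_states` is a hypothetical helper taking the domain size of each variable:

```python
from math import prod

def num_states(domain_sizes):
    """Number of world states: the product of the variables' domain sizes."""
    return prod(domain_sizes)

print(num_states([2] * 10))   # 10 binary variables: 1024
print(num_states([4, 2]))     # e.g. a 4-valued and a 2-valued variable: 8
```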
    Probabilities of World States
• P( Si ): Joint probability of assignments
   – States are distinct and exhaustive
      $\sum_{j=1}^{\prod_{i=1}^{n} k_i} P(S_j) = 1$
• Typically care about SUBSET of assignments
   – aka “Circumstance”
      $P(X_2 = t, X_4 = f) = \sum_{u \in \{t,f\}} \sum_{v \in \{t,f\}} P(X_1 = u, X_2 = t, X_3 = v, X_4 = f)$
   – Exponential in # of don’t cares
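
Summing out the "don't care" variables can be sketched as below. The joint here is a hypothetical uniform distribution over four boolean variables, used only so the code runs; it is not data from the lecture.

```python
from itertools import product

values = [True, False]
# Hypothetical joint: uniform over the 2^4 assignments (illustration only).
joint = {x: 1 / 16 for x in product(values, repeat=4)}

def marginal(fixed):
    """Sum the joint over every variable not listed in `fixed`.

    `fixed` maps a 0-based variable index to its required value.
    """
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in fixed.items()))

# P(X2 = t, X4 = f): X1 and X3 are don't-cares, so 2^2 terms are summed.
p = marginal({1: True, 3: False})
n_terms = sum(1 for x in joint if x[1] is True and x[3] is False)
print(p, n_terms)   # 0.25 4
```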
           A Simpler World
• 2^n world states = Maximum entropy
  – Know nothing about the world
• Many variables independent
  – P(strep,ebola) = P(strep)P(ebola)
• Conditionally independent
  – Depend on same factors but not on each other
  – P(fever,cough|flu) = P(fever|flu)P(cough|flu)
         Probabilistic Diagnosis
• Question:
   – How likely is a patient to have a disease if they have the
     symptoms?
• Probabilistic Model: Bayes’ Rule
• P(D|S) = P(S|D)P(D)/P(S)
   – Where
      • P(S|D) : Probability of symptom given disease
      • P(D): Prior probability of having disease
      • P(S): Prior probability of having symptom
                   Diagnosis
• Consider Meningitis:
  –   Disease: Meningitis: m
  –   Symptom: Stiff neck: s
  –   P(s|m) = 0.5
  –   P(m) =0.0001
  –   P(s) = 0.1
  –   How likely is it that someone with a stiff neck
      actually has meningitis?
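
Plugging the meningitis numbers into Bayes' rule answers the question. A minimal sketch; `bayes` is a hypothetical helper name.

```python
def bayes(p_s_given_d, p_d, p_s):
    """P(D|S) = P(S|D) P(D) / P(S)."""
    return p_s_given_d * p_d / p_s

p_m_given_s = bayes(p_s_given_d=0.5, p_d=0.0001, p_s=0.1)
print(p_m_given_s)   # about 0.0005, i.e. roughly 1 in 2000
```

Even with a stiff neck, meningitis stays very unlikely because the prior P(m) is so small.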
      Modeling (In)dependence
• Simple, graphical notation for conditional
  independence; compact spec of joint
• Bayesian network
   – Nodes = Variables
   – Directed acyclic graph: link ~ directly influences
   – Arcs = Child depends on parent(s)
      • No arcs = independent (0 incoming: only a priori)
      • Parents of X = π(X)
      • For each X need P(X | π(X))
Example I
          Simple Bayesian Network
   • MCBN1
   • Graph: A → B, A → C, B → D, C → D, C → E

   Node                 Need:      Truth table
   A = only a priori    P(A)       2
   B depends on A       P(B|A)     2*2
   C depends on A       P(C|A)     2*2
   D depends on B,C     P(D|B,C)   2*2*2
   E depends on C       P(E|C)     2*2
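
The table sizes for this network can be tallied and compared with the full joint over five binary variables. This sketch assumes all variables are binary, as in the slide; `parents` and `table_size` are illustrative names.

```python
# Network structure from the example: each node maps to its parents.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
}

def table_size(node, k=2):
    """Entries in the truth table P(node | parents) for k-valued variables."""
    return k ** (len(parents[node]) + 1)

sizes = {node: table_size(node) for node in parents}
total = sum(sizes.values())
full_joint = 2 ** len(parents)
print(sizes)               # {'A': 2, 'B': 4, 'C': 4, 'D': 8, 'E': 4}
print(total, full_joint)   # 22 32
```

The saving is modest for five variables but grows quickly as more variables are added while each node keeps only a few parents.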
   Simplifying with Noisy-OR
• How many computations?
  – p = # parents; k = # values for variable
  – (k-1)k^p
  – Very expensive! 10 binary parents=2^10=1024
• Reduce computation by simplifying model
  – Treat each parent as possible independent cause
  – Only 11 computations
     • 10 causal probabilities + “leak” probability
        – “Some other cause”
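
The counts above, and the Noisy-OR combination rule itself, can be sketched as follows. The function names are hypothetical; the rule assumes each active parent independently fails to cause the effect with probability (1 - c), and the leak L covers "some other cause".

```python
from math import prod

def full_cpt_entries(k, p):
    """(k-1) * k^p independent entries for a k-valued node with p k-valued parents."""
    return (k - 1) * k ** p

def noisy_or_entries(p):
    """One causal probability per parent plus one leak probability."""
    return p + 1

def noisy_or(strengths, leak, active):
    """P(effect = true) given the set of active parents.

    strengths: dict parent -> causal probability c
    leak:      probability L that some other cause produces the effect
    """
    return 1 - (1 - leak) * prod(1 - strengths[c] for c in active)

print(full_cpt_entries(2, 10))   # 1024
print(noisy_or_entries(10))      # 11
```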
                   Noisy-OR Example
                       A → B

                           Pn(b|a) = 1 - (1 - ca)(1 - L)
                           Pn(¬b|a) = (1 - ca)(1 - L)
                           Pn(b|¬a) = 1 - (1 - L) = L = 0.5
                           Pn(¬b|¬a) = 1 - L
P(b|a)   b    ¬b

  a      0.6  0.4          Pn(b|a) = 1 - (1 - ca)(1 - L) = 0.6
                           (1 - ca)(1 - L) = 0.4
 ¬a      0.5  0.5          (1 - ca) = 0.4/(1 - L)
                           ca = 1 - 0.4/0.5 = 0.2
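
The same derivation in code: given the table entry P(b|a) and the leak L, solve for the causal probability. `causal_strength` is a hypothetical helper name.

```python
def causal_strength(p_effect_given_cause, leak):
    """Solve 1 - (1 - c)(1 - L) = P(effect | cause) for c."""
    return 1 - (1 - p_effect_given_cause) / (1 - leak)

leak = 0.5                       # P(b | not a) from the table
c_a = causal_strength(0.6, leak)
print(round(c_a, 3))             # 0.2

# Round trip: the Noisy-OR formula reproduces the table entry.
p_b_given_a = 1 - (1 - c_a) * (1 - leak)
```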
             Noisy-OR Example II
A → D ← B

Full model: P(d|ab), P(d|a¬b), P(d|¬ab), P(d|¬a¬b) & negations

Assume: P(a) = 0.1; P(b) = 0.05; P(d|¬a¬b) = 0.3; ca = 0.5; P(d|b) = 0.7

            Pn(d|ab) = 1 - (1 - ca)(1 - cb)(1 - L)
            Pn(d|¬ab) = 1 - (1 - cb)(1 - L)
            Pn(d|a¬b) = 1 - (1 - ca)(1 - L)
            Pn(d|¬a¬b) = 1 - (1 - L) = L = 0.3

            Pn(d|b) = Pn(d|ab) Pn(a) + Pn(d|¬ab) Pn(¬a)
            1 - 0.7 = (1 - ca)(1 - cb)(1 - L)0.1 + (1 - cb)(1 - L)0.9
            0.3 = 0.035(1 - cb) + 0.63(1 - cb)
            0.3 = 0.665(1 - cb)
            cb ≈ 0.55
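
The arithmetic for cb can be reproduced directly from the stated assumptions. A minimal sketch; variable names are illustrative.

```python
p_a = 0.1            # P(a)
c_a = 0.5            # causal probability of a
leak = 0.3           # L = P(d | not a, not b)
p_d_given_b = 0.7    # P(d | b)

# P(not d | b) = (1 - c_b) * [(1 - c_a)(1 - L) P(a) + (1 - L) P(not a)]
coefficient = (1 - c_a) * (1 - leak) * p_a + (1 - leak) * (1 - p_a)
c_b = 1 - (1 - p_d_given_b) / coefficient

print(round(coefficient, 3))   # 0.665
print(round(c_b, 2))           # 0.55
```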
             Graph Models
• Bipartite graphs
  – E.g. medical reasoning
  – Generally, diseases cause symptom (not reverse)
  [Figure: bipartite graph; disease nodes d1–d4 on the left, symptom nodes s1–s6 on the right]
                  Topologies
• Generally more complex
  – Polytree: at most one (undirected) path between any two nodes
• General Bayes Nets
  – Graphs with undirected cycles
     • No directed cycles - can’t be own cause
• Issue: Automatic net acquisition
  – Update probabilities by observing data
  – Learn topology: use statistical evidence of indep,
    heuristic search to find most probable structure
        Holmes Example (Pearl)
Holmes is worried that his house will be burgled. For
the time period of interest, there is a 10^-4 a priori chance
of this happening, and Holmes has installed a burglar alarm
to try to forestall this event. The alarm is 95% reliable in
sounding when a burglary happens, but also has a false
positive rate of 1%. Holmes’ neighbor, Watson, is 90% sure
to call Holmes at his office if the alarm sounds, but he is also
a bit of a practical joker and, knowing Holmes’ concern,
might (30%) call even if the alarm is silent. Holmes’ other
neighbor Mrs. Gibbons is a well-known lush and often
befuddled, but Holmes believes that she is four times more
likely to call him if there is an alarm than not.
       Holmes Example: Model

There are four binary random variables:
B: whether Holmes’ house has been burgled
A: whether his alarm sounded
W: whether Watson called
G: whether Gibbons called

      Network: B → A, A → W, A → G
          Holmes Example: Tables

P(B):     B=#t     B=#f
          0.0001   0.9999

P(A|B):   B    A=#t   A=#f
          #t   0.95   0.05
          #f   0.01   0.99

P(W|A):   A    W=#t   W=#f
          #t   0.90   0.10
          #f   0.30   0.70

P(G|A):   A    G=#t   G=#f
          #t   0.40   0.60
          #f   0.10   0.90
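
With these tables, queries such as "how likely is a burglary given that Watson called?" can be answered by enumerating the joint. This is a brute-force sketch, not an efficient inference algorithm; `bern`, `joint`, and `query_burglary` are hypothetical names.

```python
from itertools import product

# Conditional probability tables; each entry is P(child = true | parent value).
P_B = {True: 0.0001, False: 0.9999}       # P(B = b)
P_A_given_B = {True: 0.95, False: 0.01}
P_W_given_A = {True: 0.90, False: 0.30}
P_G_given_A = {True: 0.40, False: 0.10}

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, a, w, g):
    """P(B,A,W,G) = P(B) P(A|B) P(W|A) P(G|A), following the network."""
    return (P_B[b] * bern(P_A_given_B[b], a)
            * bern(P_W_given_A[a], w) * bern(P_G_given_A[a], g))

def query_burglary(**evidence):
    """P(B = true | evidence); evidence keys may be 'a', 'w', 'g'."""
    totals = {True: 0.0, False: 0.0}
    for b, a, w, g in product([True, False], repeat=4):
        world = {"a": a, "w": w, "g": g}
        if all(world[k] == v for k, v in evidence.items()):
            totals[b] += joint(b, a, w, g)
    return totals[True] / (totals[True] + totals[False])

p_b_given_w = query_burglary(w=True)
p_b_given_a = query_burglary(a=True)
print(f"{p_b_given_w:.6f} {p_b_given_a:.4f}")
```

Watson's call alone barely moves the belief in a burglary, because he calls 30% of the time even when the alarm is silent.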
           Decision Making
• Design model of rational decision making
  – Maximize expected value among alternatives
• Uncertainty from
  – Outcomes of actions
  – Choices taken
• To maximize outcome
  – Select maximum over choices
  – Weighted average value of chance outcomes
              Gangrene Example

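• Choice: Medicine or Amputate foot
  – Medicine
     • Die 0.05: value 0
     • Worse 0.25: second choice, Medicine or Amputate leg
        – Medicine: Die 0.4 (value 0); Live 0.6 (value 995)
        – Amputate leg: Die 0.02 (value 0); Live 0.98 (value 700)
     • Full Recovery 0.7: value 1000
  – Amputate foot
     • Die 0.01: value 0
     • Live 0.99: value 850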
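
The tree can be evaluated by taking the maximum at choice nodes and the probability-weighted average at chance nodes. This sketch assumes the "Worse" outcome leads to the second Medicine / Amputate leg choice; the tuple encoding and the `evaluate` function are illustrative.

```python
def evaluate(node):
    """Expected value of a decision tree node.

    ("value", v)                    leaf with utility v
    ("chance", [(p, child), ...])   weighted average over outcomes
    ("choice", [(name, child), ...]) maximum over alternatives
    """
    kind, body = node
    if kind == "value":
        return body
    if kind == "chance":
        return sum(p * evaluate(child) for p, child in body)
    if kind == "choice":
        return max(evaluate(child) for _, child in body)
    raise ValueError(f"unknown node kind: {kind}")

second = ("choice", [
    ("medicine", ("chance", [(0.4, ("value", 0)), (0.6, ("value", 995))])),
    ("amputate leg", ("chance", [(0.02, ("value", 0)), (0.98, ("value", 700))])),
])
medicine = ("chance", [(0.05, ("value", 0)), (0.25, second), (0.7, ("value", 1000))])
amputate_foot = ("chance", [(0.01, ("value", 0)), (0.99, ("value", 850))])
root = ("choice", [("medicine", medicine), ("amputate foot", amputate_foot)])

ev_second = evaluate(second)            # max(597, 686)
ev_medicine = evaluate(medicine)        # 0.25 * 686 + 0.7 * 1000
ev_amputate = evaluate(amputate_foot)   # 0.99 * 850
print(round(ev_medicine, 1), round(ev_amputate, 1))   # 871.5 841.5
```

Under these numbers, trying medicine first has the higher expected value.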
            Decision Tree Issues
• Problem 1: Tree size
  – k activities : 2^k orders
• Solution 1: Hill-climbing
  – Choose best apparent choice after one step
     • Use entropy reduction
• Problem 2: Utility values
  – Difficult to estimate, Sensitivity, Duration
     • Change value depending on phrasing of question
• Solution 2c: Model effect of outcome over lifetime
                Conclusion
• Reasoning with uncertainty
  – Many real systems uncertain - e.g. medical
    diagnosis
• Bayes’ Nets
  – Model (in)dependence relations in reasoning
  – Noisy-OR simplifies model/computation
     • Assumes causes independent
• Decision Trees
  – Model rational decision making
     • Maximize outcome: Max choice, average outcomes
     Bayesian Spam Filtering
• Automatic Text Categorization
• Probabilistic Classifier
  – Conditional Framework
  – Naïve Bayes Formulation
     • Independence assumptions galore
  – Feature Selection
  – Classification & Evaluation
         Spam Classification
• Text categorization problem
  – Given a message,M, is it Spam or NotSpam?
• Probabilistic framework
  – P(Spam|M)> P(NotSpam|M)
     • P(Spam|M) = P(Spam,M)/P(M)
     • P(NotSpam|M) = P(NotSpam,M)/P(M)
  – Which is more likely?
     Characterizing a Message
• Represent message M as set of features
  – Features: a1, a2, …, an
• What features?
  – Words! (again)
        – Alternatively (skip) n-gram sequences
     • Stemmed (?)
     • Term frequencies: N(W, Spam); N(W,NotSpam)
        – Also, N(Spam),N(NotSpam): # of words in each class
    Characterizing a Message II
• Estimating term conditional probabilities
      $P(W \mid C) = \dfrac{N(W, C) + \frac{1}{K}}{N(C) + 1}$
• Selecting good features:
   – Exclude terms s.t.
      • N(W,Spam) + N(W,NotSpam) < 4
      • 0.45 <= P(W|Spam) / (P(W|Spam) + P(W|NotSpam)) <= 0.55
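
The smoothed estimate and the two exclusion rules can be sketched as below. The slide does not define K, so it is left as a plain parameter here; the function names and any counts passed in are hypothetical.

```python
def term_probability(n_wc, n_c, k):
    """Smoothed estimate P(W|C) = (N(W,C) + 1/K) / (N(C) + 1)."""
    return (n_wc + 1.0 / k) / (n_c + 1)

def keep_term(n_w_spam, n_w_not, n_spam, n_not, k):
    """Apply the two exclusion rules; return True if the term is kept."""
    # Rule 1: drop rare terms.
    if n_w_spam + n_w_not < 4:
        return False
    # Rule 2: drop terms that barely discriminate between the classes.
    p_s = term_probability(n_w_spam, n_spam, k)
    p_n = term_probability(n_w_not, n_not, k)
    ratio = p_s / (p_s + p_n)
    return not (0.45 <= ratio <= 0.55)
```

For example, a term seen 30 times in spam and twice in non-spam (with equal class sizes) is kept, while one seen 10 times in each is dropped.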
        Naïve Bayes Formulation
• Naïve Bayes (aka “Idiot” Bayes)
  – Assumes all features independent
        • Not accurate but useful simplification
• So,
  – P(M,Spam)=P(a1,a2,..,an,Spam)
  –            = P(a1,a2,..,an|Spam)P(Spam)
  –            =P(a1|Spam)..P(an|Spam)P(Spam)
  – Likewise for NotSpam
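
A Naïve Bayes decision compares P(M,Spam) with P(M,NotSpam); since P(M) is the same in both posteriors it can be ignored. The sketch below works in log space to avoid underflow on long messages. The toy model values are invented for illustration and are not from Pantel & Lin.

```python
import math

def log_joint(features, prior, term_probs, default):
    """log P(a1..an, C) = log P(C) + sum of log P(ai|C) under Naive Bayes."""
    return math.log(prior) + sum(
        math.log(term_probs.get(a, default)) for a in features)

def classify(features, model):
    """Return the class with the larger joint probability."""
    scores = {c: log_joint(features, m["prior"], m["terms"], m["default"])
              for c, m in model.items()}
    return max(scores, key=scores.get)

# Hypothetical toy model (illustration only).
model = {
    "Spam":    {"prior": 0.3, "default": 0.0005,
                "terms": {"free": 0.05, "meeting": 0.001}},
    "NotSpam": {"prior": 0.7, "default": 0.0005,
                "terms": {"free": 0.005, "meeting": 0.02}},
}

print(classify(["free", "free"], model))   # Spam
print(classify(["meeting"], model))        # NotSpam
```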
 Experimentation (Pantel & Lin)
• Training: 160 spam, 466 non-spam
• Test: 277 spam, 346 non-spam

• 230,449 training words; 60434 spam
  – 12228 terms; filtering reduces to 3848
              Results (PL)
• False positives: 1.16%
• False negatives: 8.3%
• Overall error: 4.33%

• Simple approach, effective
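
The three reported rates are mutually consistent if one assumes the false positive rate is measured over the 346 non-spam test messages and the false negative rate over the 277 spam test messages. That reading is an assumption; the slides do not state the denominators.

```python
n_spam, n_not_spam = 277, 346       # test set sizes from the slides
fp_rate, fn_rate = 0.0116, 0.083    # reported error rates

# Assumed denominators: FP over non-spam, FN over spam.
errors = fp_rate * n_not_spam + fn_rate * n_spam
overall = errors / (n_spam + n_not_spam)
print(f"{overall:.2%}")             # 4.33%
```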
                   Variants
• Features?

• Model?
  – Explicit bias to certain error types


• Address lists
• Explicit rules

				