Introduction to Artificial Intelligence 236501 by ewghwehws


									 Intro to AI

Ruth Bergman
  Fall 2002
                     Why Not Use Logic?
• Suppose I want to write down rules about medical diagnosis:
       Diagnostic rules: ∀x has(x, sorethroat) ⇒ has(x, cold)
       Causal rules:     ∀x has(x, cold) ⇒ has(x, sorethroat)
• Clearly, this isn’t right:
       Diagnostic case:
       • we may not know exactly which collections of symptoms
         or tests allow us to infer a diagnosis (qualification problem)
       • even if we did we may not have that information
       • even if we do, how do we know it is correct?
       Causal rules:
       • Symptoms are not guaranteed to appear; note that the logical
         case would use the contrapositive
       • There are lots of causes for symptoms; if we miss one we
         might get an incorrect inference
       • How do we reason backwards?
• The problem with pure FOL is that it deals in black
  and white

• The world isn’t black and white because of uncertainty:
   1. Uncertainty due to imprecision or noise
   2. Uncertainty because we don’t know everything about the
   3. Uncertainty because in practice we often cannot acquire all
      the information we’d like.

– As a result, we’d like to assign a degree of belief (or
  plausibility or possibility) to any statement we make
   – note this is different than a degree of truth!
                     Ways of Handling Uncertainty
• MYCIN: operationalize uncertainty with the rules:
      • a  b with certainty 0.7
      • we know a with certainty 1
   – ergo, we know b with 0.7
   – but, we if we also know
      • a  c with certainty 0.6
      • b v c  d with certainty 1
   – do we know d with certainty .7, .6, .88, 1, ....?
      • suppose a ~e and ~e  ~d ....

   – In a rule-based system, such non-local
     dependencies are hard to catch
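
The candidate certainties in the question above can each be reproduced by a different combination rule. The sketch below is only one plausible reading: the slide does not say which rule produces which number, so the rule names and the pairing with 0.7, 0.6 and 0.88 are assumptions made for illustration.

```python
# Certainty of b and c after applying each rule to the fact a (certainty 1).
cf_b = 0.7 * 1.0   # a => b with certainty 0.7
cf_c = 0.6 * 1.0   # a => c with certainty 0.6

# Three hypothetical ways to score the disjunction (b v c), and hence d:
cf_d_max = max(cf_b, cf_c)                 # optimistic: strongest disjunct
cf_d_min = min(cf_b, cf_c)                 # pessimistic: weakest disjunct
cf_d_indep = cf_b + cf_c - cf_b * cf_c     # treats b and c as independent

print(cf_d_max, cf_d_min, round(cf_d_indep, 2))
```

The independence rule is the questionable one here: b and c both derive from a, so they are not independent, which is the kind of non-local dependency a rule-based system has trouble noticing.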

• Problems such as this have led people to invent lots
  of calculi for uncertainty; probability still dominates

• Basic idea:

   – I have some DoB (a prior probability) about some proposition p
   – I receive evidence about p; the evidence is related to p by a
     conditional probability
   – From these two quantities, I can compute an updated DoB
     about p --- a posterior probability
                        Probability Review
• Basic probability is on propositions or propositional
   – P(A) (A is a proposition)
      • P(Accident), P(phonecall), P(Cold)
   – P(X = v) (X is a random variable; v a value)
      • P(card = JackofClubs), P(weather=sunny), ....
   – P(A v B), P(A ^ B), P(~A) ...
   – Referred to as the prior or unconditional probability
• The conditional probability of A given B
   P(A | B) = P(A,B)/P(B)
   – the product rule P(A,B) = P(A | B) * P(B)
• Independence: P(A | B) = P(A)
   – A is independent of B
   – conditional independence given E is the analogous statement
     P(A | B, E) = P(A | E)
                     Probability Review

• The joint distribution of A and B
   – P(A,B) = x (equivalent to P(A ^ B) = x)

           A=1   A=2   A=3   Total
   B = T   .1    .1    .2    .4
   B = F   .2    .1    .3    .6
   Total   .3    .2    .5    1

   – P(A=1, B=T) = .1
   – P(A=1) = .1 + .2 = .3
   – P(A=1 | B=T) = .1/.4 = .25
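
The marginal and conditional values read off the joint table can be checked mechanically. This is a minimal sketch; the dictionary layout and the helper names `marginal_a`, `marginal_b` and `conditional_a_given_b` are illustrative choices, not part of the lecture.

```python
# Joint distribution P(A, B) from the table, keyed by (value of A, value of B).
joint = {
    (1, True): 0.1, (2, True): 0.1, (3, True): 0.2,
    (1, False): 0.2, (2, False): 0.1, (3, False): 0.3,
}

def marginal_a(a):
    """P(A = a): sum the column for a over both values of B."""
    return sum(p for (av, _), p in joint.items() if av == a)

def marginal_b(b):
    """P(B = b): sum the row for b over all values of A."""
    return sum(p for (_, bv), p in joint.items() if bv == b)

def conditional_a_given_b(a, b):
    """P(A = a | B = b) = P(A = a, B = b) / P(B = b)."""
    return joint[(a, b)] / marginal_b(b)

print(round(marginal_a(1), 3))                    # column total for A=1
print(round(marginal_b(True), 3))                 # row total for B=T
print(round(conditional_a_given_b(1, True), 3))   # P(A=1 | B=T)
```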
                         Bayes Theorem
• P(A,B) = P(A | B) P(B) = P(B | A) P(A)
   P(A|B) = P(B | A) P(A) / P(B)

• Example: what is the probability of meningitis when a patient
  has a stiff neck?
    P(S|M) = 0.5
    P(M) = 1/50000
    P(S) = 1/20
    P(M|S) = P(S|M) P(M) / P(S) = 0.5 * (1/50000) / (1/20) = 0.0002
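
The meningitis calculation can be reproduced with exact rational arithmetic. This is a small sketch; the function name `bayes` is a hypothetical helper, and `Fraction` is used only to avoid floating-point rounding.

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_s_given_m = Fraction(1, 2)      # P(S | M): stiff neck given meningitis
p_m = Fraction(1, 50000)          # P(M): prior probability of meningitis
p_s = Fraction(1, 20)             # P(S): prior probability of a stiff neck

p_m_given_s = bayes(p_s_given_m, p_m, p_s)
print(p_m_given_s, float(p_m_given_s))   # 1/5000, i.e. 0.0002
```

Even though half of meningitis patients have a stiff neck, the posterior stays small because the prior P(M) is so much smaller than P(S).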

• More general
  P(A | B , E) = P(B | A , E) P(A | E)/ P(B | E)
                 Alarm System Example

• A burglary alarm system is fairly reliable at detecting
  burglary
• It may also respond to minor earthquakes
• Neighbors John and Mary will call when they hear the
  alarm
• John always calls when he hears the alarm
• He sometimes confuses the telephone with the alarm
  and calls
• Mary sometimes misses the alarm
• Given the evidence of who has or has not called, we
  would like to estimate the probability of a burglary.
                    Alarm System Example

• P(Alarm|Burglary) A burglary alarm system is fairly reliable at
  detecting burglary
• P(Alarm|Earthquake) It may also respond to minor earthquakes
• P(JohnCalls|Alarm), P(MaryCalls|Alarm) Neighbors John and
  Mary will call when they hear the alarm
• John always calls when he hears the alarm
• P(JohnCalls|~Alarm) He sometimes confuses the telephone with
  the alarm and calls
• Mary sometimes misses the alarm
• Given the evidence of who has or has not called, we would like
  to estimate the probability of a burglary.
                    Influence Diagrams

• Another way to present this information is an
  influence diagram

       [Figure: influence diagram for the alarm example; surviving node labels: burglary, earthquake, John calls, Mary calls]
                       Influence Diagrams
1. A set of random variables.
2. A set of directed arcs
   An arc from X to Y means that X has influence on Y.
3. Each node has an associated conditional probability table.
4. The graph has no directed cycle.
       [Figure: influence diagram for the alarm example; surviving node labels: burglary, earthquake, John calls, Mary calls]
                        Conditional Probability Tables

• Each row contains the conditional probability for a possible
combination of values of the parent nodes
• Each row must sum to 1

    [Figure: burglary and earthquake point to alarm; alarm points to John calls and Mary calls]

     B  E   P(Alarm=T | B, E)   P(Alarm=F | B, E)
     T  T   0.95                0.05
     T  F   0.94                0.06
     F  T   0.29                0.71
     F  F   0.001               0.999
                Belief Network for the

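[Figure: belief network with arcs burglary → alarm, earthquake → alarm, alarm → John calls, alarm → Mary calls]

P(B) = 0.001        P(E) = 0.002

 B  E   P(A | B, E)
 T  T   0.95
 T  F   0.94
 F  T   0.29
 F  F   0.001

 A   P(J | A)       A   P(M | A)
 T   0.90           T   0.70
 F   0.05           F   0.01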
                        The Semantics of Belief Networks
• The probability that the alarm sounded but neither a burglary
  nor an earthquake has occurred and both John and Mary call
   – P(J ^ M ^ A ^ ~B ^ ~E) =
      P(J | A) P(M | A) P(A | ~B ^ ~E) P(~B) P(~E) =
      0.9 * 0.7 * 0.001 * 0.999 * 0.998 ≈ 0.000628

• More generally, we can write this as
  – P(x1, ..., xn) = Πi P(xi | Parents(Xi))
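
The product formula can be evaluated directly for the alarm network. This is a minimal sketch using the conditional probability tables from the slides; the names `bern` and `joint` are hypothetical helpers, and P(B) = 0.001 is taken from the factor P(~B) = 0.999 used in the worked example.

```python
from itertools import product

P_B, P_E = 0.001, 0.002
P_A_GIVEN_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}
P_J_GIVEN_A = {True: 0.90, False: 0.05}
P_M_GIVEN_A = {True: 0.70, False: 0.01}

def bern(p_true, value):
    """Probability of `value` for a Boolean variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as the product of each node's CPT entry given its parents."""
    return (bern(P_J_GIVEN_A[a], j) * bern(P_M_GIVEN_A[a], m)
            * bern(P_A_GIVEN_BE[(b, e)], a) * bern(P_B, b) * bern(P_E, e))

# P(J ^ M ^ A ^ ~B ^ ~E), the worked example above.
p = joint(b=False, e=False, a=True, j=True, m=True)
print(round(p, 6))

# Sanity check: the 32 joint entries sum to 1.
total = sum(joint(*world) for world in product((True, False), repeat=5))
```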
                       Constructing Belief Networks
1. Choose the set of variables Xi that describe the

2. Choose an ordering for the variables
   1. Ideally, work backward from observables to root causes

3. While there are variables left:
   1. Pick a variable Xi and add it to the network
   2. Set Parents(Xi) to the minimal set of nodes already in the
      network such that Xi is conditionally independent of its
      other predecessors given Parents(Xi)
   3. Define the conditional probability table for Xi

• Once you’re done, it’s likely you’ll realize you need to
  fiddle a little bit!
                        Node Ordering
• The correct order to add nodes is
   – Add the “root causes” first
   – Then the variables they influence
   – And so on…
• Alarm example: consider the orderings
   – MaryCalls, JohnCalls, Alarm,
     Burglary, Earthquake
   – MaryCalls, JohnCalls,
     Earthquake, Burglary, Alarm

   [Figure: network over the nodes Mary calls, John calls, earthquake, burglary, alarm]
                Probabilistic Inference
• Diagnostic inference (from effects to causes)
   – Given JohnCalls, infer P(B|J) = 0.016
• Causal inference (from causes to effects)
   – Given Burglary, P(J|B) = 0.86 and P(M|B) = 0.67
• Intercausal inference (between causes of a common effect)
   – Given Alarm, P(B|A) = 0.376
   – If Earthquake is also true, P(B|A^E) = 0.003
• Mixed inference (combining two or more of the above)
   – P(A|J ^ ~E) = 0.03
   – P(B|J ^ ~E) = 0.017
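
Queries of all four kinds can be answered by brute-force summation over the full joint distribution. The sketch below is not the polytree algorithm from the lecture, just inference by enumeration with the alarm network's tables; the function `probability` and the one-letter variable names are illustrative. With these table values the results come out within about 0.01 of the figures quoted above.

```python
from itertools import product

VARS = ("B", "E", "A", "J", "M")
P_B, P_E = 0.001, 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}
P_M = {True: 0.70, False: 0.01}

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(b, e, a, j, m):
    """Full joint probability as the product of the CPT entries."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))

def probability(query, given=None):
    """P(query | given); both arguments map variable names to True/False."""
    given = given or {}
    numerator = denominator = 0.0
    for values in product((True, False), repeat=len(VARS)):
        world = dict(zip(VARS, values))
        if all(world[k] == v for k, v in given.items()):
            p = joint(*values)
            denominator += p
            if all(world[k] == v for k, v in query.items()):
                numerator += p
    return numerator / denominator

diagnostic = probability({"B": True}, {"J": True})               # about 0.016
causal = probability({"J": True}, {"B": True})
intercausal = probability({"B": True}, {"A": True, "E": True})   # about 0.003
mixed = probability({"A": True}, {"J": True, "E": False})        # about 0.03
```

Enumeration touches every one of the 2^n joint entries, which is why the lecture goes on to exploit the network structure instead.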
                  Conditional Independence
• If every undirected path from a node in X to a node in Y is
  d-separated by E, then X and Y are conditionally
  independent given E
• A set of nodes E d-separates two sets of nodes X and Y if every
  undirected path from a node in X to a node in Y is blocked
  given E
   [Figure: node sets X and Y, with evidence set E between them]


                   Conditional Independence
•   An undirected path from X to Y is blocked given E if there is a
    node Z s.t.
     1. Z is in E and there is one arrow leading in and one arrow
        leading out
     2. Z is in E and Z has both arrows leading out
     3. Neither Z nor any descendant of Z is in E and both path
        arrows lead into Z
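
The three blocking conditions translate into a small path-based check. This is a sketch for small networks only, since it enumerates every undirected path; the alarm network encoding and the function names are assumptions for illustration, and it handles single nodes rather than sets X and Y.

```python
# Each node maps to the list of its parents (alarm network from the slides).
PARENTS = {
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}

def children(node):
    return [c for c, ps in PARENTS.items() if node in ps]

def descendants(node):
    found, stack = set(), children(node)
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(children(n))
    return found

def undirected_paths(x, y, path=None):
    """Yield every simple path from x to y, ignoring arc direction."""
    path = (path or []) + [x]
    if x == y:
        yield path
        return
    for n in PARENTS[x] + children(x):
        if n not in path:
            yield from undirected_paths(n, y, path)

def is_blocked(path, evidence):
    """True if some intermediate node Z on the path satisfies condition 1, 2 or 3."""
    for prev, z, nxt in zip(path, path[1:], path[2:]):
        both_arrows_into_z = prev in PARENTS[z] and nxt in PARENTS[z]
        if both_arrows_into_z:
            # Condition 3: neither Z nor any descendant of Z is in E.
            if z not in evidence and not (descendants(z) & evidence):
                return True
        elif z in evidence:
            # Condition 1 (one arrow in, one out) or 2 (both arrows out).
            return True
    return False

def d_separated(x, y, evidence=()):
    evidence = set(evidence)
    return all(is_blocked(p, evidence) for p in undirected_paths(x, y))

print(d_separated("Burglary", "Earthquake"))               # True
print(d_separated("Burglary", "Earthquake", ["Alarm"]))    # False
print(d_separated("JohnCalls", "MaryCalls", ["Alarm"]))    # True
```

The second call shows why observing the alarm couples its two causes, which is the basis of the intercausal inference pattern.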


                  An Inference Algorithm for
                       Belief Networks
• In order to develop an algorithm, we will assume our networks are
  singly connected
   – A network is singly connected if there is at most a single
      undirected path between nodes in the network
       • note this means that any two nodes can be d-separated by
          removing a single node
   – These are also known as polytrees.

• We will then consider a generic node X with parents U1...Um,
  and children Y1 ... Yn.
   – parents of Yi are Zi,j
   – Evidence above X is Ex+; below is Ex-
Singly Connected Network

[Figure: node X with parents U1 … Um above and children Y1 … Yn below; each child Yi has other parents Zij]
                            Inference in Belief Networks
• P(X | E) = P(X | Ex+, Ex-) = k P(Ex- | X, Ex+) P(X | Ex+)
                             = k P(Ex- | X) P(X | Ex+)
    – the last step follows by noting that X d-separates its parents
      from its children, so Ex- is independent of Ex+ given X

• Now, we note that we can apply the product rule to the second
  term, summing over the parent values u = (u1, ..., um)
       P(X | Ex+) = Σu P(X | u, Ex+) P(u | Ex+)
                  = Σu P(X | u) Πi P(ui | EUi\X)
       again, these last facts follow from conditional independence;
       EUi\X is the evidence connected to Ui except via X

• Note that we now have a recursive algorithm: the first term in the
  sum is just a table lookup; the second is what we started with on
  a smaller set of nodes.
                          Inference in Belief Networks
• P(X | E) = k P(Ex- | X) P(X | Ex+)
• The evaluation of the first factor, P(Ex- | X), is similar but
  more involved, yielding
  P(Ex- | X) = k2 Πi Σyi P(EYi- | yi) Σzi P(yi | X, zi) Πj P(zij | EZij\Yi)

• P(EYi- | yi) is a recursive instance of P(Ex- | X)
• P(yi | X, zi) is a conditional probability table entry for Yi
• P(zij | EZij\Yi) is a recursive instance of P(X | E)
                           The Algorithm
Support-Except(X, V) returns P(X | EX\V)

 if EVIDENCE(X) then return the point distribution for X
 else
    calculate P(EX\V- | X) = Evidence-Except(X, V)
    U ← Parents(X)
    if U is empty
       then return k P(EX\V- | X) P(X)
    else
       for each Ui in U
          calculate and store P(Ui | EUi\X) = Support-Except(Ui, X)
       return k P(EX\V- | X) Σu P(X | u) Πi P(ui | EUi\X)
                             The Algorithm
Evidence-Except(X, V) returns P(EX\V- | X)

 Y ← Children(X) – V
 if Y is empty
    then return a uniform distribution
 else
    for each Yi in Y do
        calculate P(EYi- | yi) = Evidence-Except(Yi, null)
        Zi ← Parents(Yi) – X
        for each Zij in Zi
            calculate P(Zij | EZij\Yi) = Support-Except(Zij, Yi)
    return k2 Πi Σyi P(EYi- | yi) Σzi P(yi | X, zi) Πj P(zij | EZij\Yi)
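
The two procedures Support-Except and Evidence-Except can be sketched in Python for Boolean variables. This is a hedged illustration rather than the lecture's implementation: the data layout (`PARENTS`, `CPT`), the helper `weighted_cpt`, and the passing of `evidence` as a dictionary are all assumptions. One further assumption is labeled in the code: an observed child contributes an indicator on its observed value, which the pseudocode leaves implicit.

```python
from itertools import product

VALUES = (True, False)
# Alarm network; CPT[X][parent values] = P(X = True | parents).
PARENTS = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
CPT = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}

def prob(node, value, parent_values):
    p_true = CPT[node][tuple(parent_values)]
    return p_true if value else 1.0 - p_true

def children(node):
    return [c for c, ps in PARENTS.items() if node in ps]

def point(value):
    """Point distribution that puts all mass on the observed value."""
    return {v: float(v == value) for v in VALUES}

def weighted_cpt(node, fixed, others, dists):
    """Sum P(node | parents) over the parents in `others`, weighted by `dists`,
    with the parents in `fixed` held at the given values."""
    out = {}
    for nv in VALUES:
        total = 0.0
        for vals in product(VALUES, repeat=len(others)):
            assign = dict(fixed, **dict(zip(others, vals)))
            weight = 1.0
            for o, v in zip(others, vals):
                weight *= dists[o][v]
            total += weight * prob(node, nv, [assign[p] for p in PARENTS[node]])
        out[nv] = total
    return out

def support_except(x, v, evidence):
    """P(X | evidence connected to X other than through child V)."""
    if x in evidence:
        return point(evidence[x])
    lam = evidence_except(x, v, evidence)
    pi = {u: support_except(u, x, evidence) for u in PARENTS[x]}
    from_above = weighted_cpt(x, {}, PARENTS[x], pi)  # the prior if X has no parents
    unnorm = {xv: lam[xv] * from_above[xv] for xv in VALUES}
    k = sum(unnorm.values())
    return {xv: p / k for xv, p in unnorm.items()}

def evidence_except(x, v, evidence):
    """Likelihood of the evidence below X (ignoring child V), per value of X."""
    lam = {xv: 1.0 for xv in VALUES}   # unnormalized uniform when X has no children
    for y in children(x):
        if y == v:
            continue
        # Assumption: an observed child contributes an indicator on its value.
        lam_y = point(evidence[y]) if y in evidence else evidence_except(y, None, evidence)
        zs = [z for z in PARENTS[y] if z != x]
        pi_z = {z: support_except(z, y, evidence) for z in zs}
        for xv in VALUES:
            p_y = weighted_cpt(y, {x: xv}, zs, pi_z)
            lam[xv] *= sum(lam_y[yv] * p_y[yv] for yv in VALUES)
    return lam

posterior = support_except("B", None, {"J": True, "M": True})
print(round(posterior[True], 3))   # P(Burglary | JohnCalls, MaryCalls), about 0.284
```

The recursion terminates only because the network is singly connected; on a network with undirected cycles the same evidence would be counted along more than one path.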
                       The Call
• For a node X, call Support-Except(X,null)
• Diagnostic system for lymph node disease
• Pathfinder IV: a Bayesian model
  –   8 hrs devising vocabulary
  –   35 hrs defining topology
  –   40 hrs to make 14000 probability assessments
  –   most recent version appears to outperform the
      experts who designed it!
                     Other Uncertainty
• Dempster-Shafer Theory
   – Ignorance: there are sets which have no probability
   – In this case, the best you can do, in some cases, is bound
     the probability
   – D-S theory is one way of doing this

• Fuzzy Logic
   – Suppose we introduce a fuzzy membership function (a
     degree of membership)
   – Logical semantics are based on set membership
   – Thus, we get a logic with degrees of truth
      • e.g. John is a big man  bigman(John) w. truth value 0.
