Uncertainty by dfhdhdhdhjr


Abduction, Uncertainty, and Probabilistic Reasoning
  Chapters 13, 14, and more



                         Introduction
• Abduction is a reasoning process that tries to form plausible
  explanations for abnormal observations
  – Abduction is distinct from deduction and induction
  – Abduction is inherently uncertain
• Uncertainty becomes an important issue in AI research
• Some major formalisms for representing and reasoning about
  uncertainty
  –   Mycin’s certainty factor (an early representative)
  –   Probability theory (esp. Bayesian networks)
  –   Dempster-Shafer theory
  –   Fuzzy logic
  –   Truth maintenance systems


                         Abduction
• Definition (Encyclopedia Britannica): reasoning that derives
  an explanatory hypothesis from a given set of facts
   – The inference result is a hypothesis, which if true, could
     explain the occurrence of the given facts
• Examples
   – Dendral, an expert system to construct the 3D structure of
     chemical compounds
     • Fact: mass spectrometer data of the compound and the
       chemical formula of the compound
     • KB: chemistry, esp. strength of different types of bonds
     • Reasoning: form a hypothetical 3D structure which meets the
       given chemical formula, and would most likely produce the
       given mass spectrum if subjected to electron beam
       bombardment

– Medical diagnosis
  • Facts: symptoms, lab test results, and other observed findings
    (called manifestations)
  • KB: causal associations between diseases and manifestations
  • Reasoning: one or more diseases whose presence would
    causally explain the occurrence of the given manifestations
– Many other reasoning processes (e.g., word sense
  disambiguation in natural language processing, image
  understanding, detective work, etc.) can also be seen as
  abductive reasoning.




   Comparing abduction, deduction and induction
Deduction
  major premise:      All balls in the box are black      A => B
  minor premise:      This ball is from the box           A
  conclusion:         This ball is black                  B
Abduction
  rule:               All balls in the box are black      A => B
  observation:        This ball is black                  B
  explanation:        This ball is from the box           Possibly A
Induction
  case:               These balls are from the box        Whenever A then B,
  observation:        These balls are black               but not vice versa
  hypothesized rule:  All balls in the box are black      Possibly A => B
  Induction: from specific cases to general rules
  Abduction and deduction:
             both reason from one part of a specific case to another
             part of that case using general rules (in different ways)

    Characteristics of abductive reasoning
1. Reasoning results are hypotheses, not theorems (may be
   false even if rules and facts are true),
  – e.g., misdiagnosis in medicine
2. There may be multiple plausible hypotheses
  – When given rules A => B and C => B, and fact B
    both A and C are plausible hypotheses
  – Abduction is inherently uncertain
  – Hypotheses can be ranked by their plausibility if that can be
    determined
3. Reasoning is often a Hypothesize-and-test cycle
  – hypothesize phase: postulate possible hypotheses, each of
    which could explain the given facts (or explain most of the
    important facts)
  – test phase: test the plausibility of all or some of these
    hypotheses

  – One way to test a hypothesis H is to test if something that is
    currently unknown but can be predicted from H is actually
    true.
    • If we also know A => D and C => E, then ask if D and E are
      true.
    • If it turns out D is true and E is false, then hypothesis A
      becomes more plausible (support for A increased, support for
      C decreased)
    • Alternative hypotheses compete with each other (Occam’s
      razor prefers simpler hypotheses)
4. Reasoning is non-monotonic
  – Plausibility of hypotheses can increase/decrease as new
    facts are collected (deductive inference determines if a
    sentence is true but would never change its truth value)
  – Some hypotheses may be discarded/defeated, and new ones
    may be formed when new observations are made
 Sources of Uncertainty in Intelligent Systems
• Uncertain data (noise)
• Uncertain knowledge (e.g., causal relations)
  – A disorder may cause any or all of its POSSIBLE manifestations
    in a specific case
  – A manifestation can be caused by more than one POSSIBLE
    disorder
• Uncertain reasoning results
  – Abduction and induction are inherently uncertain
  – Default reasoning, even in deductive fashion, is uncertain
  – Incomplete deductive inference may be uncertain




                Probabilistic Inference
• Based on probability theory (especially Bayes’ theorem)
  – Well established discipline about uncertain outcomes
  – Empirical science like physics/chemistry, can be verified by
    experiments
• Probability theory is too rigid to apply directly in many
  knowledge-based applications
  – Some assumptions have to be made to simplify the reality
  – Different formalisms have been developed in which some aspects
    of the probability theory are changed/modified.
• We will briefly review the basics of probability theory before
  discussing different approaches to uncertainty
• The presentation uses the diagnostic process (an abductive and
  evidential reasoning process) as an example

                   Probability of Events
• Sample space and events
  – Sample space S:     (e.g., all people in an area)
  – Events $E_1 \subseteq S$:   (e.g., all people having cough)
           $E_2 \subseteq S$:   (e.g., all people having cold)
• Prior (marginal) probabilities of events
  –   P(E) = |E| / |S| (frequency interpretation)
  –   P(E) = 0.1       (subjective probability)
  –   0 <= P(E) <= 1 for all events
  –   Two special events, $\emptyset$ and S: $P(\emptyset) = 0$ and P(S) = 1.0
• Boolean operators between events (to form compound events)
  – Conjunctive (intersection):   E1 ^ E2 ($E_1 \cap E_2$)
  – Disjunctive (union):          E1 v E2 ($E_1 \cup E_2$)
  – Negation (complement):        ~E ($E^C = S - E$)


• Probabilities of compound events
  – P(~E) = 1 – P(E) because P(~E) + P(E) =1
  – P(E1 v E2) = P(E1) + P(E2) – P(E1 ^ E2)
  – But how to compute the joint probability P(E1 ^ E2)?

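  [Figure: Venn diagrams. Left: an event E and its complement ~E.
   Right: two overlapping events E1 and E2, with the overlap labeled E1 ^ E2.]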
• Conditional probability (of E1, given E2)
  – How likely E1 occurs in the subspace of E2
$$P(E_1 \mid E_2) = \frac{|E_1 \cap E_2|}{|E_2|} = \frac{|E_1 \cap E_2| / |S|}{|E_2| / |S|} = \frac{P(E_1 \wedge E_2)}{P(E_2)}$$
$$P(E_1 \wedge E_2) = P(E_1 \mid E_2)\, P(E_2)$$

• Independence assumption
  – Two events E1 and E2 are said to be independent of each other if
    $P(E_1 \mid E_2) = P(E_1)$ (knowing E2 does not change the likelihood of
    E1)
  – Computation can be simplified with independent events
$$P(E_1 \wedge E_2) = P(E_1 \mid E_2)\, P(E_2) = P(E_1)\, P(E_2)$$
$$\begin{aligned} P(E_1 \vee E_2) &= P(E_1) + P(E_2) - P(E_1 \wedge E_2) \\ &= P(E_1) + P(E_2) - P(E_1)\, P(E_2) \\ &= 1 - (1 - P(E_1))(1 - P(E_2)) \end{aligned}$$

• Mutually exclusive (ME) and exhaustive (EXH) set of events
  – ME:  $E_i \cap E_j = \emptyset$ ($P(E_i \wedge E_j) = 0$), $i, j = 1, \dots, n,\ i \neq j$
  – EXH: $E_1 \cup \dots \cup E_n = S$ ($P(E_1 \vee \dots \vee E_n) = 1$)




                              Bayes’ Theorem
• In the setting of diagnostic/evidential reasoning
   [Figure: a hypothesis node H_i with prior P(H_i), linked by
    conditional probabilities P(E_j | H_i) to the evidence/manifestation
    nodes E_1, ..., E_j, ..., E_m]
   – Know the prior probability of each hypothesis, $P(H_i)$, and the
     conditional probability $P(E_j \mid H_i)$
   – Want to compute the posterior probability $P(H_i \mid E_j)$
• Bayes’ theorem (formula 1): $P(H_i \mid E_j) = P(H_i)\, P(E_j \mid H_i) / P(E_j)$
• If the purpose is to find which of the n hypotheses $H_1, \dots, H_n$
  is more plausible given $E_j$, then we can ignore the denominator
  and rank them using relative likelihood
$$rel(H_i \mid E_j) = P(E_j \mid H_i)\, P(H_i)$$
• $P(E_j)$ can be computed from $P(E_j \mid H_i)$ and $P(H_i)$, if we
  assume all hypotheses $H_1, \dots, H_n$ are ME and EXH
$$\begin{aligned} P(E_j) &= P(E_j \wedge (H_1 \vee \dots \vee H_n)) && \text{(by EXH)} \\ &= \sum_{i=1}^{n} P(E_j \wedge H_i) && \text{(by ME)} \\ &= \sum_{i=1}^{n} P(E_j \mid H_i)\, P(H_i) \end{aligned}$$

• Then we have another version of Bayes’ theorem:
$$P(H_i \mid E_j) = \frac{P(E_j \mid H_i)\, P(H_i)}{\sum_{k=1}^{n} P(E_j \mid H_k)\, P(H_k)} = \frac{rel(H_i \mid E_j)}{\sum_{k=1}^{n} rel(H_k \mid E_j)}$$
  where $\sum_{k=1}^{n} P(E_j \mid H_k)\, P(H_k)$, the sum of the relative likelihoods of all
  n hypotheses, is a normalization factor
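
The normalized form of Bayes' theorem can be sketched in a few lines of Python. This is only an illustration: the function name `posteriors` and the three-hypothesis numbers are hypothetical, and the hypotheses are assumed to be ME and EXH as stated above.

```python
def posteriors(priors, likelihoods):
    """Return P(H_i | E) for a set of ME and EXH hypotheses.

    priors[i]      = P(H_i)
    likelihoods[i] = P(E | H_i)
    """
    rel = [p * l for p, l in zip(priors, likelihoods)]  # relative likelihoods
    norm = sum(rel)                                      # equals P(E) under ME + EXH
    return [r / norm for r in rel]


# Hypothetical numbers: three disorders and one observed manifestation.
priors = [0.7, 0.2, 0.1]
likelihoods = [0.1, 0.5, 0.9]
post = posteriors(priors, likelihoods)
print(post)
```

Note that the hypothesis with the largest prior need not have the largest posterior; ranking by relative likelihood alone gives the same order as ranking by the normalized values.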

Probabilistic Inference for simple diagnostic problems
• Knowledge base:
  $E_1, \dots, E_m$: evidence/manifestations
  $H_1, \dots, H_n$: hypotheses/disorders
      $E_j$ and $H_i$ are binary, and the hypotheses form a ME & EXH set
  $P(H_i),\ i = 1, \dots, n$: prior probabilities
  $P(E_j \mid H_i),\ i = 1, \dots, n,\ j = 1, \dots, m$: conditional probabilities
• Case input: $E_1, \dots, E_l$
• Find the hypothesis $H_i$ with the highest posterior
  probability $P(H_i \mid E_1, \dots, E_l)$
• By Bayes’ theorem
$$P(H_i \mid E_1, \dots, E_l) = \frac{P(E_1, \dots, E_l \mid H_i)\, P(H_i)}{P(E_1, \dots, E_l)}$$
• Assume all pieces of evidence are conditionally
  independent, given any hypothesis
$$P(E_1, \dots, E_l \mid H_i) = \prod_{j=1}^{l} P(E_j \mid H_i)$$

• The relative likelihood
$$rel(H_i \mid E_1, \dots, E_l) = P(E_1, \dots, E_l \mid H_i)\, P(H_i) = P(H_i) \prod_{j=1}^{l} P(E_j \mid H_i)$$

• The absolute posterior probability
$$P(H_i \mid E_1, \dots, E_l) = \frac{rel(H_i \mid E_1, \dots, E_l)}{\sum_{k=1}^{n} rel(H_k \mid E_1, \dots, E_l)} = \frac{P(H_i) \prod_{j=1}^{l} P(E_j \mid H_i)}{\sum_{k=1}^{n} P(H_k) \prod_{j=1}^{l} P(E_j \mid H_k)}$$

• Evidence accumulation (when new evidence is discovered)
 If $E_{l+1}$ is present
$$rel(H_i \mid E_1, \dots, E_l, E_{l+1}) = P(E_{l+1} \mid H_i)\, rel(H_i \mid E_1, \dots, E_l)$$
 If $E_{l+1}$ is absent
$$rel(H_i \mid E_1, \dots, E_l, {\sim}E_{l+1}) = (1 - P(E_{l+1} \mid H_i))\, rel(H_i \mid E_1, \dots, E_l)$$
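
Evidence accumulation lends itself to an incremental update. The sketch below assumes the conditional independence of evidence given each hypothesis; the helper names `accumulate` and `normalize` and the two-disorder knowledge base are hypothetical, chosen only to illustrate the two update rules.

```python
def accumulate(rel, cond_probs, present):
    """Fold one more finding E_{l+1} into the relative likelihoods.

    rel[i]        = rel(H_i | E_1, ..., E_l)
    cond_probs[i] = P(E_{l+1} | H_i)
    present       = True if E_{l+1} was observed, False if it is absent
    """
    if present:
        return [r * p for r, p in zip(rel, cond_probs)]
    return [r * (1.0 - p) for r, p in zip(rel, cond_probs)]


def normalize(rel):
    """Turn relative likelihoods into posterior probabilities."""
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical knowledge base: two disorders, two manifestations.
priors = [0.6, 0.4]   # P(H_1), P(H_2)
p_e1 = [0.9, 0.2]     # P(E_1 | H_i)
p_e2 = [0.3, 0.8]     # P(E_2 | H_i)

rel = list(priors)                           # no evidence yet
rel = accumulate(rel, p_e1, present=True)    # E_1 observed
rel = accumulate(rel, p_e2, present=False)   # E_2 known to be absent
print(normalize(rel))
```

Normalization can be postponed until a posterior is actually needed, since each update only multiplies the relative likelihoods.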


                Assessing the Assumptions
• Assumption 1: hypotheses are mutually exclusive and
  exhaustive
  – Single fault assumption (one and only one hypothesis must be true)
  – Multi-faults do exist in individual cases
  – Can be viewed as an approximation of situations where
    hypotheses are independent of each other and their prior
    probabilities are very small
$$P(H_1 \wedge H_2) = P(H_1)\, P(H_2) \approx 0 \text{ if both } P(H_1) \text{ and } P(H_2) \text{ are very small}$$

• Assumption 2: pieces of evidence are conditionally
  independent of each other, given any hypothesis
  – Manifestations themselves are not independent of each other; they
    are correlated by their common causes
  – Reasonable under single fault assumption
  – Not so when multi-faults are to be considered
  Limitations of the simple Bayesian system
• Cannot handle hypotheses of multiple disorders well
  – Suppose $H_1, \dots, H_n$ are independent of each other
  – Consider a composite hypothesis $H_1 \wedge H_2$
  – How to compute the posterior probability (or relative likelihood)
    $P(H_1 \wedge H_2 \mid E_1, \dots, E_l)$?
  – Using Bayes’ theorem
$$P(H_1 \wedge H_2 \mid E_1, \dots, E_l) = \frac{P(E_1, \dots, E_l \mid H_1 \wedge H_2)\, P(H_1 \wedge H_2)}{P(E_1, \dots, E_l)}$$
    $P(H_1 \wedge H_2) = P(H_1)\, P(H_2)$ because they are independent
    $P(E_1, \dots, E_l \mid H_1 \wedge H_2) = \prod_{j=1}^{l} P(E_j \mid H_1 \wedge H_2)$,
         assuming the $E_j$ are independent, given $H_1 \wedge H_2$
    But how to compute $P(E_j \mid H_1 \wedge H_2)$?



  – Assuming $H_1, \dots, H_n$ are independent, given $E_1, \dots, E_l$:
$$P(H_1 \wedge H_2 \mid E_1, \dots, E_l) = P(H_1 \mid E_1, \dots, E_l)\, P(H_2 \mid E_1, \dots, E_l)$$
    but this is a very unreasonable assumption

    [Figure: E (earthquake) -> A (alarm set off) <- B (burglar)]
    E and B are independent. But when A is given, they are (adversely)
    dependent because they become competitors to explain A:
    P(B|A,E) << P(B|A)
• Cannot handle causal chaining
  – Ex. A: weather of the year
        B: cotton production of the year
        C: cotton price of next year
  – Observed: A influences C
  – The influence is not direct (A –> B –> C)
    P(C|B, A) = P(C|B): instantiation of B blocks influence of A on C
• Need a better representation and a better assumption
             Bayesian Networks (BNs)

• Definition: BN = (DAG, CPD)
  – DAG: directed acyclic graph (BN’s structure)
    • Nodes: random variables (typically binary or discrete, but
      methods also exist to handle continuous variables)
    • Arcs: indicate probabilistic dependencies between nodes
      (lack of link signifies conditional independence)
  – CPD: conditional probability distribution (BN’s parameters)
    • Conditional probabilities at each node, usually stored as a
      table (conditional probability table, or CPT)
      $P(x_i \mid \pi_i)$, where $\pi_i$ is the set of all parent nodes of $x_i$

  – Root nodes are a special case – no parents, so just use priors
    in CPD: $\pi_i = \emptyset$, so $P(x_i \mid \pi_i) = P(x_i)$

                           Example BN
[Figure: BN with arcs A -> B, A -> C, B -> D, C -> D, C -> E]

P(a0) = 0.001
P(b0|a0) = 0.3         P(c0|a0) = 0.2
P(b0|a1) = 0.001       P(c0|a1) = 0.005
P(e0|c0) = 0.4
P(e0|c1) = 0.002

 P(d0|…)     b0       b1
     c0      0.1      0.01
     c1      0.01     0.00001

that is, P(d0|b0, c0) = 0.1, P(d0|b0, c1) = 0.01,
P(d0|b1, c0) = 0.01, P(d0|b1, c1) = 0.00001



Uppercase: variables (A, B, …)
Lowercase: values/states of variables (A has two states a0 and a1)
Note that we only specify P(a0) etc., not P(a1), since they have
to add to one
                     Netica

• A commercial BN package by Norsys
• Download a limited version for free from
  http://www.norsys.com/
• APIs may also be downloaded




       Conditional independence and chaining
• Conditional independence assumption
  – $P(x_i \mid \pi_i, q) = P(x_i \mid \pi_i)$,
    where q is any set of variables
    (nodes) other than $x_i$ and its successors
    [Figure: node x_i with its parents π_i; q is some other node]
  – $\pi_i$ blocks the influence of other nodes on $x_i$
    and its successors (q influences $x_i$ only
    through variables in $\pi_i$)
  – With this assumption, the complete joint probability distribution
    of all variables in the network can be represented by (recovered
    from) local CPDs by chaining these CPDs:
$$P(x_1, \dots, x_n) = \prod_{i=1}^{n} P(x_i \mid \pi_i)$$




                 Chaining: Example
[Figure: the example BN, with arcs A -> B, A -> C, B -> D, C -> D, C -> E]
Computing the joint probability for all variables is easy.
The joint distribution of all variables:
P(A, B, C, D, E)
  = P(E | A, B, C, D) P(A, B, C, D)     by the product rule
  = P(E | C) P(A, B, C, D)              by cond. indep. assumption
  = P(E | C) P(D | A, B, C) P(A, B, C)
  = P(E | C) P(D | B, C) P(C | A, B) P(A, B)
  = P(E | C) P(D | B, C) P(C | A) P(B | A) P(A)
For a particular state:
P(a0, b0, c1, d1, e0) = P(a0)P(b0|a0)P(c1|a0)P(d1|b0, c1)P(e0| c1)
  = 0.001*0.3*0.8*0.99*0.002 = 4.752*10^(-7)
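
The chained product can be checked numerically. The sketch below encodes the CPTs of the example BN; the dictionary layout and the helper names `pr` and `joint` are illustrative choices, and P(c0|a1) = 0.005 is an assumed reading of the example slide, which lists two different values for C.

```python
from itertools import product

# CPTs of the example network; index 0 means state x0, index 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}   # P(b0 | a)
P_C0 = {0: 0.2, 1: 0.005}   # P(c0 | a); the a1 entry is an assumed reading
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}  # P(d0 | b, c)
P_E0 = {0: 0.4, 1: 0.002}   # P(e0 | c)


def pr(p0, state):
    """Probability of `state` for a binary variable whose P(state 0) is p0."""
    return p0 if state == 0 else 1.0 - p0


def joint(a, b, c, d, e):
    """P(a, b, c, d, e) = P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)."""
    return (pr(P_A0, a) * pr(P_B0[a], b) * pr(P_C0[a], c)
            * pr(P_D0[(b, c)], d) * pr(P_E0[c], e))


print(joint(0, 0, 1, 1, 0))                                # the state (a0, b0, c1, d1, e0)
print(sum(joint(*s) for s in product((0, 1), repeat=5)))   # all 32 states together
```

Summing the chained product over all 32 states is a quick sanity check that the local CPDs define a proper joint distribution.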


[Figure: E (earthquake) -> A (alarm set off) <- B (burglar)]
P(E) = 0.002                    P(B) = 0.01

    P(A|…)         B     ~B
         E         0.9   0.8
        ~E         0.8   0.0

P(E|A) = 0.167; P(B|A) = 0.835
P(E|A, E) = 1.0; P(B|A, E) = 0.0112

P(B|A, E) = P(B,A,E)/P(A,E)
          = P(B,A,E)/(P(B,A,E) + P(~B,A,E))
          = 0.01*0.002*0.9/(0.01*0.002*0.9 + 0.99*0.002*0.8)
          = 0.000018/(0.000018 + 0.001584)
          = 0.000018/0.001602
          = 0.01123
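
The "explaining away" effect in the alarm example can be reproduced by brute-force enumeration of the eight joint states. This is a minimal sketch using the numbers from the slide; the function names `joint` and `prob` and the dictionary-based query format are hypothetical.

```python
from itertools import product

P_E = 0.002   # prior of earthquake
P_B = 0.01    # prior of burglary
# P(alarm | b, e), keyed by (b, e)
P_A = {(True, True): 0.9, (False, True): 0.8,
       (True, False): 0.8, (False, False): 0.0}


def joint(b, e, a):
    """P(b, e, a) = P(b) P(e) P(a | b, e)."""
    pb = P_B if b else 1 - P_B
    pe = P_E if e else 1 - P_E
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    return pb * pe * pa


def prob(query, given):
    """P(query | given); both are dicts over the names 'b', 'e', 'a'."""
    num = den = 0.0
    for b, e, a in product((True, False), repeat=3):
        world = {"b": b, "e": e, "a": a}
        if all(world[k] == v for k, v in given.items()):
            p = joint(b, e, a)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


print(prob({"b": True}, {"a": True}))              # P(B | A)
print(prob({"e": True}, {"a": True}))              # P(E | A)
print(prob({"b": True}, {"a": True, "e": True}))   # P(B | A, E)
```

Once the earthquake is known, it accounts for the alarm, so the burglary hypothesis drops from likely to very unlikely.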


               Topological semantics
• A node is conditionally independent of its non-
  descendants given its parents
• A node is conditionally independent of all other nodes in
  the network given its parents, children, and children’s
  parents (also known as its Markov blanket)
• The method called d-separation can be applied to decide
  whether a set of nodes X is independent of another set Y,
  given a third set Z
  [Figure: three three-node connection patterns]
  Chain (A -> B -> C): A and C are independent, given B
  Diverging (B <- A -> C): B and C are independent, given A
  Converging (B -> A <- C): B and C are independent, but NOT when A is given
                       Inference tasks
• Simple queries: Compute posterior probability P(Xi | E=e)
  – E.g., P(NoGas | Gauge=empty, Lights=on, Starts=false)
  – Posteriors for ALL nonevidence nodes
  – Priors for all nodes ($E = \emptyset$)
• Conjunctive queries:
  – P(Xi, Xj | E=e) = P(Xi | E=e) P(Xj | Xi, E=e)
• Optimal decisions: Decision networks or influence diagrams
  – include utility information and actions;
  – inference is to find P(outcome | action, evidence)
• Value of information: Which evidence should we seek next?
• Sensitivity analysis: Which probability values are most
  critical?
• Explanation: Why do I need a new starter motor?
• MAP problems (explanation)
  – Let X denote the set of all variables in a BN, $V \subseteq X$ the set
    of instantiated variables, and $U = X - V$ the set of all un-instantiated
    variables. Then the MAP (maximum a posteriori probability) problem
    is to find the most probable instantiation of U, given V, i.e.,
$$\max_{u} P(U = u \mid V)$$
  – The solution provides a good explanation for your action

  – This is an optimization problem




          Approaches to inference
• Exact inference
  – Enumeration
  – Variable elimination
  – Belief propagation in polytrees (singly connected BNs)
  – Clustering / junction tree algorithms
• Approximate inference
  – Stochastic simulation / sampling methods
    • Markov chain Monte Carlo methods
  – Loopy propagation
  – Others
    • Mean field theory
    • Neural networks
        Inference by enumeration
• To compute P(X|E=e), where X is a single variable and E is
  evidence (instantiation of a set of variables)
• Add all of the terms (atomic event probabilities) from the full
  joint distribution that are consistent with E
• If Y are the other (unobserved) variables, excluding X, then
  the posterior distribution
      P(X|E=e) = α P(X, e) = α Σ_y P(X, e, y)
     • α is a normalization constant that makes the posterior sum to 1
     • Sum is over all possible instantiations y of the variables in Y
• Each P(X, e, Y) term can be computed using the chain rule
• Computationally expensive!

               Example: Enumeration
[Figure: the example BN, with arcs A -> B, A -> C, B -> D, C -> D, C -> E]
• P(xi) = Σ_πi P(xi | πi) P(πi)
• Suppose we want P(D | e), where only the value of E is given (as true)
• P(D|e) = α Σ_{A,B,C} P(a, b, c, d, e)
         = α Σ_{A,B,C} P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)
• With simple iteration to compute this expression, there’s going
  to be a lot of repetition (e.g., P(e|c) has to be recomputed every
  time we iterate over C for all possible assignments of A and B)
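
A direct, unoptimized implementation of this enumeration might look as follows. It reuses the CPT values of the example BN (with P(c0|a1) = 0.005 taken as an assumption, since the example slide lists two different values for C); the names `pr`, `joint`, and `posterior_d_given_e` are hypothetical.

```python
from itertools import product

# CPTs of the example network; index 0 means state x0, index 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}   # P(b0 | a)
P_C0 = {0: 0.2, 1: 0.005}   # P(c0 | a); the a1 entry is an assumed reading
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}  # P(d0 | b, c)
P_E0 = {0: 0.4, 1: 0.002}   # P(e0 | c)


def pr(p0, state):
    """Probability of `state` for a binary variable whose P(state 0) is p0."""
    return p0 if state == 0 else 1.0 - p0


def joint(a, b, c, d, e):
    """Chain-rule product of the five local CPDs."""
    return (pr(P_A0, a) * pr(P_B0[a], b) * pr(P_C0[a], c)
            * pr(P_D0[(b, c)], d) * pr(P_E0[c], e))


def posterior_d_given_e(e):
    """Return [P(d0 | e), P(d1 | e)] by summing the full joint over A, B, C."""
    unnorm = [sum(joint(a, b, c, d, e) for a, b, c in product((0, 1), repeat=3))
              for d in (0, 1)]
    alpha = 1.0 / sum(unnorm)   # normalization constant
    return [alpha * u for u in unnorm]


print(posterior_d_given_e(0))   # posterior over D given E = e0
```

Every call to `joint` recomputes all five factors, which is exactly the repetition that variable elimination avoids.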




                         Exercise: Enumeration

[Figure: network with nodes smart, study, prepared, fair, pass;
 prepared depends on smart and study; pass depends on smart, prepared, and fair]

p(smart) = .8      p(study) = .6      p(fair) = .9

p(prep|…)    smart   ~smart
study         .9      .7
~study        .5      .1

p(pass|…)       smart             ~smart
             prep   ~prep      prep   ~prep
fair          .9     .7         .7     .2
~fair         .1     .1         .1     .1

Query: What is the probability that a student
studied, given that they pass the exam?
              Variable elimination
• Basically just enumeration, but with caching of local
  calculations
• Linear for polytrees
• Potentially exponential for multiply connected BNs
   Exact inference in Bayesian networks is NP-hard!




              Variable elimination
General idea:
• Write query in the form

$$P(X_n, e) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_2} \prod_i P(x_i \mid pa_i)$$
• Iteratively
   – Move all irrelevant terms outside of innermost sum
   – Perform innermost sum, getting a new term
   – Insert the new term into the product




                 Variable elimination
Example:
  ΣAΣBΣC P(a) P(b|a) P(c|a) P(d|b,c) P(e|c)       8 x 4 = 32 multiplications
= ΣAΣB P(a) P(b|a) ΣC P(c|a) P(d|b,c) P(e|c)
= ΣA P(a) ΣB P(b|a) ΣC P(c|a) P(d|b,c) P(e|c)     8 x 2 + 4 + 2 = 22 multiplications

for each state of A = a
   for each state of B = b
       compute fC(a, b) = ΣC P(c|a) P(d|b,c) P(e|c)     (variable C is summed out)
   compute fB(a) = ΣB P(b|a) fC(a, b)                   (variable B is summed out)
compute result = ΣA P(a) fB(a)

Here fC(a, b) and fB(a) are called factors, which are vectors or matrices
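
The nested loops above translate almost line for line into Python. The sketch below computes P(d, e) for the example BN both by brute force and with the sums pushed inward, so the two can be compared. The CPT layout and function names are illustrative, and P(c0|a1) = 0.005 is an assumed reading of the example slide, which lists two different values for C.

```python
from itertools import product

# CPTs of the example network; index 0 means state x0, index 1 means x1.
P_A0 = 0.001
P_B0 = {0: 0.3, 1: 0.001}   # P(b0 | a)
P_C0 = {0: 0.2, 1: 0.005}   # P(c0 | a); the a1 entry is an assumed reading
P_D0 = {(0, 0): 0.1, (0, 1): 0.01, (1, 0): 0.01, (1, 1): 0.00001}  # P(d0 | b, c)
P_E0 = {0: 0.4, 1: 0.002}   # P(e0 | c)


def pr(p0, state):
    """Probability of `state` for a binary variable whose P(state 0) is p0."""
    return p0 if state == 0 else 1.0 - p0


def brute_force(d, e):
    """P(d, e) as a flat sum of full products over A, B, C."""
    return sum(pr(P_A0, a) * pr(P_B0[a], b) * pr(P_C0[a], c)
               * pr(P_D0[(b, c)], d) * pr(P_E0[c], e)
               for a, b, c in product((0, 1), repeat=3))


def eliminate(d, e):
    """P(d, e) with C summed out first, then B, then A."""
    result = 0.0
    for a in (0, 1):
        f_b = 0.0                                   # factor fB(a)
        for b in (0, 1):
            f_c = sum(pr(P_C0[a], c) * pr(P_D0[(b, c)], d) * pr(P_E0[c], e)
                      for c in (0, 1))              # factor fC(a, b)
            f_b += pr(P_B0[a], b) * f_c
        result += pr(P_A0, a) * f_b
    return result


print(brute_force(0, 0), eliminate(0, 0))
```

Dividing `eliminate(d, e)` by the sum over both values of d gives the normalized posterior P(D | e).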


            Exercise: Variable elimination

[Figure: network with nodes smart, study, prepared, fair, pass;
 prepared depends on smart and study; pass depends on smart, prepared, and fair]

p(smart) = .8      p(study) = .6      p(fair) = .9

p(prep|…)    smart   ~smart
study         .9      .7
~study        .5      .1

p(pass|…)       smart             ~smart
             prep   ~prep      prep   ~prep
fair          .9     .7         .7     .2
~fair         .1     .1         .1     .1

Query: What is the probability that a student is
smart, given that they pass the exam?
                  Belief Propagation
• Singly connected network, SCN (also known as polytree)
  – there is at most one undirected path between any two nodes
    (i.e., the network is a tree if the directions of arcs are ignored)
  – The influence of the instantiated variable (evidence) spreads
    to the rest of the network along the arcs
     • The instantiated variable influences its predecessors and
       successors differently (using CPT along opposite directions)
     • Computation is linear in the diameter of the network (the
       longest undirected path)
     • Update belief (posterior) of every non-evidence node in one pass
     [Figure: a polytree over nodes A, B, C, D, E, F with evidence E = e]
  – For multiply connected networks: use conditioning
                      Conditioning
[Figure: the example BN, with arcs A -> B, A -> C, B -> D, C -> D, C -> E]

• Conditioning: Find the network’s smallest cutset S (a set of
  nodes whose removal renders the network singly connected)
   – In this network, S = {A} or {B} or {C} or {D}
• For each instantiation of S, compute the belief update with the
  belief propagation algorithm
• Combine the results from all instantiations of S (each is weighted
  by P(S = s))
• Computationally expensive (finding the smallest cutset is in
  general NP-hard, and the total number of possible instantiations
  of S is $O(2^{|S|})$)
                     Junction Tree
• Convert a BN to a junction tree
  – Moralization: add an undirected edge between every pair of
    parents of a common child, then drop the directions of all arcs:
    Moralized Graph
  – Triangulation: add a chord to any cycle of length > 3:
    Triangulated Graph
  – A junction tree is a tree of cliques of the triangulated graph
  – Cliques are connected by links
    • A link stands for the set of all variables S shared by these two
      cliques
    • Each clique has a potential (similar to CPT), constructed
      from CPT of variables in the original BN



                               Junction Tree
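• Example
  [Figure: four panels]
  – A simple BN: arcs A -> B, A -> C, B -> D, C -> D, C -> E
  – Moralized graph
  – Triangulated graph
  – Junction tree of 3 nodes: cliques {A,B,C}, {B,C,D}, {C,D,E};
    {A,B,C} and {B,C,D} are linked by (B, C), and
    {B,C,D} and {C,D,E} are linked by (C, D)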
                     Junction Tree
• Reasoning
  – Since it is now a tree, polytree algorithm can be applied,
    but now two cliques exchange P(S), the distribution over
    S, their shared variables.
  – Complexity:
    • O(n) steps, where n is the number of cliques
    • Each step is expensive if cliques are large (CPT
      exponential to clique size)
    • Construction of CPT of JT is expensive as well, but it
      needs to be computed only once.




        Some comments on BN reasoning
– Let $X = (X_1, \dots, X_n)$ be the set of all variables in a BN. Any BN
  reasoning task can be expressed in the form of calculating
  $P(U \mid V)$ where $U, V \subseteq X$
– This can be done by marginalization of the joint distribution P(X)
  over Y = X \ U \ V:
$$\sum_{Y} P(U, V, Y)$$
   where each entry P(x) = P(u,v,y) can be calculated by chain rule
  from CPTs
– Computation can be done more efficiently using, say Junction tree,
  by utilizing variable interdependencies
– Computational complexity of BN reasoning is proved to be NP-
  hard by reducing 3SAT problems to BN reasoning (Cooper 1990)


Approximate inference: Direct sampling
• Suppose you are given values for some subset of the
  variables, E, and want to infer distributions for unknown
  variables, Z
• Randomly generate a very large number of instantiations
  from the BN according to the distribution
  – Generate instantiations for all variables – start at root variables and
    work your way “forward” in topological order
• Rejection sampling: Only keep those instantiations that are
  consistent with the values for E
• Use the frequency of values for Z to get estimated
  probabilities
• Accuracy of the results depends on the size of the sample
  (asymptotically approaches exact results)
• Very expensive and inefficient: when the evidence E is unlikely,
  most generated instantiations are rejected
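
A minimal sketch of rejection sampling, assuming a hypothetical three-node network (Rain -> WetGrass <- Sprinkler) with made-up probabilities; none of the names or numbers come from the slides. It estimates P(Rain = true | WetGrass = true), whose exact value for these numbers is about 0.645.

```python
import random

# Hypothetical network: Rain -> WetGrass <- Sprinkler (illustrative numbers).
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def sample_world(rng):
    """Sample every variable in topological order (roots first)."""
    rain = rng.random() < P_RAIN
    sprinkler = rng.random() < P_SPRINKLER
    wet = rng.random() < P_WET[(rain, sprinkler)]
    return rain, sprinkler, wet

def rejection_estimate(n_samples, seed=0):
    """Estimate P(Rain = true | WetGrass = true); also return the kept count."""
    rng = random.Random(seed)
    kept = rain_count = 0
    for _ in range(n_samples):
        rain, _sprinkler, wet = sample_world(rng)
        if not wet:              # inconsistent with the evidence: reject
            continue
        kept += 1
        rain_count += rain
    return rain_count / kept, kept

estimate, kept = rejection_estimate(100_000)
print(f"P(Rain | Wet) ~ {estimate:.3f} using {kept} of 100000 samples")
```

With these numbers only about 28% of the samples survive, since P(WetGrass = true) is about 0.28. The waste grows as the evidence becomes less likely.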
                                                                              43
            Likelihood weighting
• Idea: Don’t generate samples that need to be rejected in the
  first place!
• Sample only from the unknown variables Z and X (E are
  fixed)
• Weight each sample according to the likelihood that it
  would occur, given the evidence E
  – A weight w is associated with each sample (w initialized to 1)
  – When an evidence node (say E1 = e1-0) is selected for
    weighting, its parents are already instantiated (say parents A
    and B are assigned states a and b)
  – Update w = w * P(e1-0 | a, b) based on E1’s CPT
  – Repeat for the other evidence nodes
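
The weighting steps above can be sketched as follows, again on a hypothetical Rain -> WetGrass <- Sprinkler network with made-up numbers (an assumption for illustration only). The evidence is WetGrass = true, so only Rain and Sprinkler are sampled and every sample is kept.

```python
import random

# Hypothetical network: Rain -> WetGrass <- Sprinkler (illustrative numbers).
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def likelihood_weighting(n_samples, seed=0):
    """Estimate P(Rain = true | WetGrass = true) without rejecting samples."""
    rng = random.Random(seed)
    weighted_rain = total_weight = 0.0
    for _ in range(n_samples):
        # Sample only the non-evidence variables, in topological order.
        rain = rng.random() < P_RAIN
        sprinkler = rng.random() < P_SPRINKLER
        # The evidence node is not sampled; its CPT entry scales the weight.
        w = 1.0
        w *= P_WET[(rain, sprinkler)]     # P(WetGrass = true | parents)
        total_weight += w
        if rain:
            weighted_rain += w
    return weighted_rain / total_weight

print(f"P(Rain | Wet) ~ {likelihood_weighting(100_000):.3f}")
```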

                                                                     44
Markov chain Monte Carlo algorithm
• So called because
  – Markov chain – each instance generated in the sample is dependent
    on the previous instance
  – Monte Carlo – statistical sampling method
• Perform a random walk through variable assignment space,
  collecting statistics as you go
  – Start with a random instantiation, consistent with evidence variables
  – At each step, randomly select a non-evidence variable x, randomly
    sample its value by
    $P(x \mid mb(x)) \propto P(x \mid parents(X)) \prod_{Y \in children(X)} P(y \mid parents(Y))$
    (mb(x): the Markov blanket of x)

• Given enough samples, MCMC gives an accurate estimate of the
  true distribution of values
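
One common instance of this scheme is Gibbs sampling. The sketch below assumes a hypothetical Rain -> WetGrass <- Sprinkler network with made-up numbers and evidence WetGrass = true. The burn-in length and function names are also assumptions for illustration.

```python
import random

# Hypothetical network: Rain -> WetGrass <- Sprinkler (illustrative numbers).
P_PRIOR = {"rain": 0.2, "sprinkler": 0.1}
P_WET = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def conditional_true(var, state):
    """P(var = true | Markov blanket of var), with WetGrass fixed to true."""
    scores = {}
    for value in (True, False):
        trial = dict(state)
        trial[var] = value
        # P(x | parents(x)): both sampled variables are roots, so this is the prior.
        p_self = P_PRIOR[var] if value else 1.0 - P_PRIOR[var]
        # Product over children: the only child is the evidence node WetGrass.
        p_child = P_WET[(trial["rain"], trial["sprinkler"])]
        scores[value] = p_self * p_child
    return scores[True] / (scores[True] + scores[False])

def gibbs_estimate(n_steps, burn_in=1000, seed=0):
    """Estimate P(Rain = true | WetGrass = true) by a random walk over states."""
    rng = random.Random(seed)
    state = {"rain": rng.random() < 0.5, "sprinkler": rng.random() < 0.5}
    rain_count = 0
    for step in range(n_steps + burn_in):
        var = rng.choice(["rain", "sprinkler"])      # pick a non-evidence variable
        state[var] = rng.random() < conditional_true(var, state)
        if step >= burn_in:                          # collect statistics after burn-in
            rain_count += state["rain"]
    return rain_count / n_steps

print(f"P(Rain | Wet) ~ {gibbs_estimate(100_000):.3f}")
```

Successive states are correlated, so more steps are needed than with independent samples to reach the same accuracy.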

                                                                                45
                Loopy Propagation
• Belief propagation
  – Gives exact solutions, but only for polytrees
  – Each piece of evidence propagates once throughout the network
• Loopy propagation
  – Let propagation continue until the network stabilizes (one hopes)
• Experiments show
  – Many BNs stabilize with loopy propagation
  – When a network stabilizes, it often yields exact or very good
    approximate solutions
• Analysis
  – Conditions for convergence and for approximation quality are
    under intense investigation

                                                                      46
                            Noisy-Or BN
• A special BN of binary variables (Peng & Reggia, Cooper)
  – Each link $x_i \to x_j$ is associated with a probability value called
    causal strength $c_{ij}$ that measures how strongly $x_i$ alone may
    cause $x_j$, i.e., $c_{ij} = P(x_j \mid x_i$ is true and all other parents in $\pi_j$ are false$)$,
    where $\pi_j$ is the set of parents of $x_j$
  – Causation independence: parent nodes influence a child
    independently
• Advantages:
  – One-to-one correspondence between causal links and causal
    strengths
  – Easy for humans to understand (acquire and evaluate KB)
  – Fewer # of probabilities needed in KB
     Complete joint prob. distribution: $2^n$
                       General BN: $\sum_{i=1}^{n} 2^{|\pi_i|}$
                     Noisy-Or BN: $\sum_{i=1}^{n} |\pi_i|$
     ($\pi_i$: the set of parents of $x_i$)
  – Computation is less expensive
• Disadvantage: less expressive (less general)
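
The slides do not spell out how a child's CPT follows from the causal strengths. The sketch below assumes the standard noisy-or combination, P(child true | active parents) = 1 − Π (1 − c_i) over the active parents, with no leak term. The function name and the strengths are illustrative.

```python
from itertools import product

def noisy_or_cpt(strengths):
    """Map each parent assignment (tuple of bools) to P(child = true).

    strengths[i] is the causal strength of the i-th parent (assumed given).
    """
    cpt = {}
    for assignment in product([False, True], repeat=len(strengths)):
        p_no_cause = 1.0
        for active, c in zip(assignment, strengths):
            if active:
                p_no_cause *= 1.0 - c    # this active parent fails to cause the child
        cpt[assignment] = 1.0 - p_no_cause
    return cpt

cpt = noisy_or_cpt([0.8, 0.5, 0.3])
print(len(cpt))                             # 8 CPT entries generated from 3 strengths
print(round(cpt[(True, True, False)], 3))   # 1 - 0.2 * 0.5 = 0.9
```

This is the saving listed above: a child with k parents needs k numbers instead of 2^k, because each parent is assumed to act independently.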
                                                                                  47
         Learning BN (from case data)
• Needs for learning
  – Difficult to construct BN by humans (esp. CPT)
  – Experts’ opinions are often biased, inaccurate, and incomplete
  – Large databases of cases become available
• What to learn
  – Parameter learning: learning CPT when DAG is known (easy)
  – Structural learning: learning DAG (hard)
• Difficulties in learning DAG from case data
  – There are too many possible DAGs when the # of variables is large
    (the count grows faster than exponentially)
        n      # of possible DAGs
        3         25
       10         4*10^18
  – Missing values in database
  – Noisy data
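
The counts in the table can be reproduced with Robinson's recurrence for the number of labeled DAGs. The slides do not cite this formula, so it is brought in here as an assumption: a(n) = Σ_{k=1..n} (−1)^(k+1) C(n,k) 2^(k(n−k)) a(n−k), with a(0) = 1.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(3))             # 25
print(f"{num_dags(10):.1e}")   # on the order of 4 * 10^18, as in the table
```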
                                                                     48
           BN Learning Approaches

• Early effort: Based on variable dependencies (Pearl)
  – Find all pairs of variables that are dependent on each
    other (applying standard statistical methods to the
    database)
  – Eliminate (as much as possible) indirect dependencies
  – Determine directions of dependencies
  – Learning results are often incomplete (learned BN
    contains indirect dependencies and undirected links)




                                                             49
           BN Learning Approaches

• Bayesian approach (Cooper)
  – Find the most probable DAG, given database DB, i.e.,
       max(P(DAG|DB)) or max(P(DAG, DB))
  – Based on some assumptions, a formula is developed to
    compute P(DAG, DB) for a given pair of DAG and DB
  – A hill-climbing algorithm (K2) is developed to search for a
    (sub)optimal DAG
  – Compute CPTs after the DAG is determined
  – Extensions to handle some form of missing values
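
The formula for P(DAG, DB) is not given on the slide. The sketch below assumes it is the Cooper-Herskovits metric with uniform priors, in which each node i contributes Π_j [(r_i − 1)! / (N_ij + r_i − 1)!] Π_k N_ijk!, where j ranges over parent configurations and k over the node's values. Only this per-node term is sketched, in log form. The hill-climbing search is omitted, and the function name and toy cases are hypothetical.

```python
from collections import Counter
from math import exp, lgamma

def log_k2_family_score(cases, child, parents, arity=2):
    """Log of one node's factor in the assumed K2 metric.

    cases: list of dicts mapping a variable name to a value in range(arity).
    """
    joint_counts = Counter((tuple(c[p] for p in parents), c[child]) for c in cases)
    parent_counts = Counter(tuple(c[p] for p in parents) for c in cases)
    score = 0.0
    for config, n_ij in parent_counts.items():
        # log[(r - 1)! / (N_ij + r - 1)!], using lgamma(m + 1) = log(m!)
        score += lgamma(arity) - lgamma(n_ij + arity)
        for k in range(arity):
            score += lgamma(joint_counts[(config, k)] + 1)   # log(N_ijk!)
    return score

# Toy database: x2 always copies x1 (4 cases with value 0, 4 with value 1).
cases = [{"x1": v, "x2": v} for v in (0, 1) for _ in range(4)]
no_parent = log_k2_family_score(cases, "x2", [])
with_parent = log_k2_family_score(cases, "x2", ["x1"])
print(with_parent > no_parent)    # True: adding x1 as a parent raises the score
print(round(exp(with_parent), 4)) # 0.04
```

A K2-style search would use such a score greedily, adding to each node the parent that most increases it until no addition helps.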



                                                              50
            BN Learning Approaches
• Minimum description length (MDL) (Lam, etc.)
  – Sacrifices accuracy for simpler (less dense) structure
    • Case data not always accurate
    • Outliers are hard to model (needs more links)
    • Fewer links imply smaller CPD tables and less expensive
      inference
  – L = L1 + L2 where
    • L1: the length of the encoding of DAG (smaller for simpler
      DAG)
    • L2: the length of the encoding of the difference between DAG
      and DB (smaller for better match of DAG with DB)
    • Smaller L1 implies less accurate DAG, and thus larger L2
  – Find the DAG that minimizes L by heuristic best-first search

                                                                     51
           BN Learning Approaches
• Neural network approach (Neal, Peng)
  – For noisy-or BN
   Maximize $L = \ln \prod_{V^r \in D} P(\tilde{V} = V^r)$ where
        D: case database;
        $V^r$: a case in D;
        $\tilde{V}$: state vector of the learned network
    L measures the similarity of the two distributions: one in D,
    another in the learned network

  – Change inter-node link strengths locally, following a gradient
    ascent approach to maximize L.

                                                                      52
• Compare Neural network approach with Cooper’s K2
• Network: Alarm (37 nodes)



       # cases   missing links   extra links    time
          500        2/0            2/6         63.76/5.91
         1000        0/0            1/1         69.62/6.04
         2000        0/0            0/0         77.45/5.86
        10000        0/0            0/0        161.97/5.83




                                                             53
              Current research in BN
• Missing data
  – Missing value: EM (expectation maximization)
  – Missing (hidden) variables are harder to handle
• BN with time
  – Dynamic BN: assumes temporal relations obey a Markov
    chain
• Cyclic relations
  – Often found in socio-economic analysis
  – Using dynamic BN?
• Continuous variables
  – Some work on variables obeying Gaussian distributions
• Connecting to other fields
  – Databases; Statistics; Symbolic AI (FOL); Semantic web
• Reasoning with uncertain evidence
  – Virtual evidence
  – Soft evidence
                                                             54
         Other formalisms for Uncertainty
            Fuzzy sets and fuzzy logic
• Ordinary set theory
   – $f_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}$
           $f_A(x)$ is called the characteristic or membership function of set A
      Predicate $A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}$
     When it is uncertain if $x \in A$, use probability $P(x \in A)$
   – There are sets that are described by vague linguistic terms (sets
     without hard, clearly defined boundaries), e.g., tall-person, fast-
     car
      • Continuous
      • Subjective (context dependent)
      • Hard to define a clear-cut 0/1 membership function
                                                                                    55
• Fuzzy set theory
   – Relax $f_A(x)$ from binary {0,1} to continuous [0,1]; it then
     stands for the degree to which x is thought to belong to set A
       height(john) = 6’5”             Tall(john) = 0.9
       height(harry) = 5’8”            Tall(harry) = 0.5
       height(joe) = 5’1”              Tall(joe) = 0.1
  – Examples of membership functions
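        [Figure: three membership-function plots, vertical axis marked at 1:
         set of teenagers (horizontal axis marked 0, 12, 19);
         set of young people (horizontal axis marked 0, 12, 19);
         set of mid-age people (horizontal axis marked 20, 35, 50, 65, 80)]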
                                                                           56
• Fuzzy logic: many-value logic
   – Fuzzy predicates (degree of truth): $F_A(x) = y$ if $f_A(x) = y$
   – Connectors/Operators
             negation: $F_{\neg A}(x) = 1 - F_A(x)$
         conjunction: $F_{A \wedge B}(x) = \min\{F_A(x), F_B(x)\}$
          disjunction: $F_{A \vee B}(x) = \max\{F_A(x), F_B(x)\}$
• Compare with probability theory
   – Probability: uncertainty of outcome
      • Based on a large # of repetitions or instances
      • For each experiment (instance), the outcome is either true or false
        (without uncertainty or ambiguity):
        unsure before it happens but sure after it happens
   – Fuzzy: vagueness of conceptual/linguistic characteristics
      • Unsure even after it happens. Example: whether a child of a
        tall mother and a short father is tall is
        unsure before the child is born, and still
        unsure after the child has grown up (height = 5’6”)
                                                                              57
– Empirical vs subjective (testable vs agreeable)
– Fuzzy set connectors may lead to unreasonable results
  • Consider two events A and B with P(A) < P(B)
  • If A => B (or $A \subseteq B$) then
      P(A ^ B) = P(A) = min{P(A), P(B)}
      P(A v B) = P(B) = max{P(A), P(B)}
  • Not the case in general
      P(A ^ B) = P(A)P(B|A) $\le$ P(A)
      P(A v B) = P(A) + P(B) – P(A ^ B) $\ge$ P(B)
      (equality holds only if P(B|A) = 1, i.e., A => B)
– Something prob. theory cannot represent
  • Tall(john) = 0.9, ~Tall(john) = 0.1
    Tall(john) ^ ~Tall(john) = min{0.1, 0.9} = 0.1
    john’s degree of membership in the fuzzy set of “median-height
    people” (both Tall and not-Tall)
  • In prob. theory: P(john $\in$ Tall ^ john $\notin$ Tall) = 0
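
The contrast between the fuzzy connectors and probability can be checked numerically. The function names are illustrative. The two events with P(A) = 0.3 and P(B) = 0.6 are assumed independent purely for the sake of the example.

```python
def fuzzy_not(a):
    return 1.0 - a

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

# Fuzzy: john is Tall to degree 0.9, so "Tall and not Tall" has degree min(0.9, 0.1).
tall_john = 0.9
print(round(fuzzy_and(tall_john, fuzzy_not(tall_john)), 3))   # 0.1

# Probability: for independent events (assumed), P(A ^ B) = P(A) P(B),
# which is strictly below min{P(A), P(B)}.
p_a, p_b = 0.3, 0.6
p_and = p_a * p_b
p_or = p_a + p_b - p_and
print(round(p_and, 3), "<", fuzzy_and(p_a, p_b))   # 0.18 < 0.3
print(round(p_or, 3), ">", fuzzy_or(p_a, p_b))     # 0.72 > 0.6
```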
                                                               58
         Uncertainty in rule-based systems
• Elements in Working Memory (WM) may be uncertain because
  – Case input (initial elements in WM) may be uncertain
       Ex: the CD-Drive does not work 70% of the time
  – Decision from a rule application may be uncertain even if the
    rule’s conditions are met by WM with certainty
       Ex: flu => sore throat with high probability
• Combining symbolic rules with numeric uncertainty: Mycin’s
  Certainty Factor (CF)
  – An early attempt to incorporate uncertainty into KB systems
  – $CF \in [-1, 1]$
  – Each element in WM is associated with a CF: certainty of that
    assertion
  – Each rule C1,...,Cn => Conclusion is associated with a CF:
    certainty of the association (between C1,...Cn and Conclusion).

                                                                      59
– CF propagation:
   • Within a rule: if each Ci has certainty CFi, then the certainty of
     Conclusion is
              min{CF1,...,CFn} * CF-of-the-rule
   • When more than one rule applies to the current WM for the
     same Conclusion with different CFs, the largest of these CFs
     is assigned as the CF for Conclusion
   • Similar to the fuzzy rules for conjunctions and disjunctions
– Strengths of Mycin’s CF method
   • Easy to use
   • CF operations are reasonable in many applications
   • Probably the only method for uncertainty used in real-world
     rule-based systems
– Limitations
   • It is in essence an ad hoc method (it can be viewed as a
     probabilistic inference system with some strong, sometimes
     unreasonable assumptions)
   • May produce counter-intuitive results.
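
The two propagation rules stated above can be sketched as follows. This implements only the simplified scheme on this slide (min within a rule, max across rules), and the rule data are invented for illustration.

```python
def rule_cf(condition_cfs, cf_of_rule):
    """CF a single rule assigns to its conclusion: min of condition CFs times the rule's CF."""
    return min(condition_cfs) * cf_of_rule

def conclusion_cf(cfs_from_rules):
    """CF of a conclusion supported by several rules: the largest CF wins."""
    return max(cfs_from_rules)

# Hypothetical rules that both conclude "flu".
r1 = rule_cf([0.8, 0.6], cf_of_rule=0.9)   # conditions in WM with CF 0.8 and 0.6
r2 = rule_cf([0.7], cf_of_rule=0.5)
print(round(r1, 2), round(r2, 2))          # 0.54 0.35
print(round(conclusion_cf([r1, r2]), 2))   # 0.54
```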
                                                                       60
              Dempster-Shafer theory
• A variation of Bayes’ theorem to represent ignorance
• Uncertainty and ignorance
   – Suppose two events A and B are mutually exclusive (ME) and
     exhaustive (EXH), and E is a piece of evidence
     A: having cancer B: not having cancer    E: smoking
  – By Bayes’ theorem: our beliefs in A and B, given E, are measured by
    P(A|E) and P(B|E), and P(A|E) + P(B|E) = 1
  – In reality,
        I may have some belief in A, given E
        I may have some belief in B, given E
        I may have some belief not committed to either one
  – The uncommitted belief (ignorance) should not be given to
    either A or B, even though I know one of the two must be true,
    but rather it should be given to “A or B”, denoted {A, B}
  – Uncommitted belief may be moved to A or B when new
    evidence is discovered
                                                                      61
• Representing ignorance
   – Frame of discernment: $\theta = \{h_1, \ldots, h_n\}$, a set of ME and EXH
     hypotheses. The power set $2^\theta$ is organized as a lattice of
     superset/subset relations. Each node S is a subset of hypotheses ($S \subseteq \theta$)
   – Each node S is associated with a basic probability assignment m(S):
        $0 \le m(S) \le 1$;
        $m(\emptyset) = 0$;
        $\sum_{S \subseteq \theta} m(S) = 1$
   – Ex: $\theta$ = {A,B,C}, with m(S) attached to each node of the lattice:
                         {A,B,C} 0.15
           {A,B} 0.1      {A,C} 0.1      {B,C} 0.05
           {A} 0.1        {B} 0.2        {C} 0.3
                         {} 0

• Belief function
    $Bel(S) = \sum_{S' \subseteq S} m(S')$; $Bel(\emptyset) = 0$; $Bel(\theta) = 1$
    $Bel(\{A,B\}) = m(\{A,B\}) + m(\{A\}) + m(\{B\}) + m(\emptyset) = 0.1 + 0.1 + 0.2 + 0 = 0.4$
    $Bel(\{A,B\}^C) = Bel(\{C\}) = 0.3$

                                                                             62
– Plausibility (upper bound of belief of a node)
  All belief not committed to $S^C$ may be committed to S
     $Pls(S) = 1 - Bel(S^C)$
     $Pls(\{A,B\}) = 1 - Bel(\{C\}) = 1 - 0.3 = 0.7$
     $[Bel(S), Pls(S)]$: belief interval, where Bel(S) is the lower bound
     (known belief) and Pls(S) is the upper bound (maximally possible belief)

  Lattice used in the example:
                         {A,B,C} 0.15
           {A,B} 0.1      {A,C} 0.1      {B,C} 0.05
           {A} 0.1        {B} 0.2        {C} 0.3
                         {} 0
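
The belief and plausibility values of the {A,B,C} example can be reproduced with a few lines. Representing the basic probability assignment as a dict keyed by frozensets is an implementation choice made here, not something the slides prescribe.

```python
FRAME = frozenset("ABC")

# Basic probability assignment m(S) from the lattice in the example.
M = {
    frozenset("ABC"): 0.15,
    frozenset("AB"): 0.1, frozenset("AC"): 0.1, frozenset("BC"): 0.05,
    frozenset("A"): 0.1, frozenset("B"): 0.2, frozenset("C"): 0.3,
}

def bel(m, s):
    """Bel(S): total mass on S and all of its subsets."""
    return sum(mass for subset, mass in m.items() if subset <= s)

def pls(m, s, frame):
    """Pls(S) = 1 - Bel(complement of S)."""
    return 1.0 - bel(m, frame - s)

s = frozenset("AB")
print(round(bel(M, s), 3), round(pls(M, s, FRAME), 3))   # 0.4 0.7
```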




                                                                        63
• Evidence combination (how to use D-S theory)
   – Each piece of evidence has its own m(.) function for the same
     frame of discernment $\theta$
      $\theta = \{A, B\}$: A: having cancer; B: not having cancer

                   m1(S), E1: smoking     m2(S), E2: living in high radiation area
        {A,B}             0.3                         0.1
        {A}               0.2                         0.7
        {B}               0.5                         0.2
        {}                0                           0

  – Belief based on combined evidence can be computed from

        $m(S) = m_1(S) \oplus m_2(S) = \dfrac{\sum_{X \cap Y = S} m_1(X)\, m_2(Y)}{1 - \sum_{X \cap Y = \emptyset} m_1(X)\, m_2(Y)}$

    The denominator is the normalization factor; the sum inside it runs
    over the incompatible combinations ($X \cap Y = \emptyset$).

                                                                            64
                 E1         E2        E1 ^ E2
    {A,B}        0.3        0.1        0.049
    {A}          0.2        0.7        0.607
    {B}          0.5        0.2        0.344
    {}           0          0          0

$m(\{A\}) = \dfrac{m_1(\{A\})\,m_2(\{A\}) + m_1(\{A\})\,m_2(\{A,B\}) + m_1(\{A,B\})\,m_2(\{A\})}{1 - [m_1(\{A\})\,m_2(\{B\}) + m_1(\{B\})\,m_2(\{A\})]}$
$\phantom{m(\{A\})} = \dfrac{0.2 \cdot 0.7 + 0.2 \cdot 0.1 + 0.3 \cdot 0.7}{1 - [0.2 \cdot 0.2 + 0.5 \cdot 0.7]} = \dfrac{0.37}{0.61} = 0.607$

$m(\{B\}) = \dfrac{m_1(\{B\})\,m_2(\{B\}) + m_1(\{B\})\,m_2(\{A,B\}) + m_1(\{A,B\})\,m_2(\{B\})}{1 - [m_1(\{A\})\,m_2(\{B\}) + m_1(\{B\})\,m_2(\{A\})]}$
$\phantom{m(\{B\})} = \dfrac{0.5 \cdot 0.2 + 0.5 \cdot 0.1 + 0.3 \cdot 0.2}{1 - [0.2 \cdot 0.2 + 0.5 \cdot 0.7]} = \dfrac{0.21}{0.61} = 0.344$

$m(\{A,B\}) = \dfrac{m_1(\{A,B\})\,m_2(\{A,B\})}{0.61} = \dfrac{0.03}{0.61} = 0.049$
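
The worked combination above can be checked with a short sketch of the combination rule. The dict-of-frozensets representation and the function name are assumptions made for illustration. The sketch also assumes the two pieces of evidence are not totally conflicting, so the normalization factor is nonzero.

```python
from itertools import product

def combine(m1, m2):
    """Combine two basic probability assignments over the same frame."""
    raw = {}
    conflict = 0.0
    for (x, mass_x), (y, mass_y) in product(m1.items(), m2.items()):
        common = x & y
        if common:
            raw[common] = raw.get(common, 0.0) + mass_x * mass_y
        else:
            conflict += mass_x * mass_y      # incompatible combination
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence cannot be combined")
    return {s: mass / (1.0 - conflict) for s, mass in raw.items()}

A, B, AB = frozenset("A"), frozenset("B"), frozenset("AB")
m1 = {AB: 0.3, A: 0.2, B: 0.5}     # E1: smoking
m2 = {AB: 0.1, A: 0.7, B: 0.2}     # E2: living in high radiation area
m = combine(m1, m2)
print({"".join(sorted(s)): round(v, 3) for s, v in m.items()})
```

For these inputs the combined masses round to 0.607 for {A}, 0.344 for {B}, and 0.049 for {A,B}, matching the table.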
                                                                              65
  – Ignorance is reduced
       (from m1({A,B}) = 0.3 to m({A,B}) = 0.049)
  – Belief interval is narrowed
       A: from [0.2, 0.5] to [0.607, 0.656]
       B: from [0.5, 0.8] to [0.344, 0.393]
• Advantages:
  – The only formal theory about ignorance
  – Disciplined way to handle evidence combination
• Disadvantages
  – Computationally very expensive (lattice size $2^{|\theta|}$, where $\theta$ is the
    frame of discernment)
  – Assumes hypotheses are ME and EXH
  – How to obtain m(.) for each piece of evidence is not clear,
    except subjectively


                                                                  66

								