# Uncertainty

Cholwich Nattee
Sirindhorn International Institute of Technology
Thammasat University

Lecture 10: Uncertainty

## Uncertainty

Logical agents have limitations in handling uncertain knowledge. For example, consider a KB containing:

∀p Symptom(p, Toothache) ⇒ Disease(p, Cavity) ∨ Disease(p, GumDisease) ∨ …

We cannot conclude that a patient with a toothache has a cavity.

Three main reasons make logical agents fail:

- Laziness: too much work to construct the complete KB.
- Theoretical ignorance: no complete theory for the domain.
- Practical ignorance: not all necessary percepts can be checked.

The best way to deal with these is to assign a degree of belief based on probability theory.
## Probability and Degree of Belief

Probability is used to denote the degree of belief, not the degree of truth.

- For example, a probability of 0.8 for a patient with a toothache having a cavity means we believe there is an 80% chance that the patient has a cavity.
- The probability is about the agent's beliefs, not directly about the world. For example, the agent draws a card from a shuffled pack. Before looking at the card, we assign a probability of 1/52 to it being the ace of spades. After looking at the card, the appropriate probability is just 0 or 1.
- An assignment of probability reflects the agent's state of knowledge with respect to the currently available knowledge base.

## Uncertainty: Example [1]

Let At be the action of leaving for the airport t minutes before the flight. Will At get me there on time?

Problems:

- partial observability (road state, other drivers' plans, etc.)
- noisy sensors
- uncertainty in action outcomes (flat tire, etc.)
- immense complexity of modeling and predicting traffic

A logical approach either:

1. Risks falsehood: "A25 will get me there on time", or
2. Leads to conclusions too weak for decision making: "A25 will get me there on time if there is no accident on the bridge, and it does not rain, and my tires remain intact, etc."
## Uncertainty: Example [2]

Probabilities relate propositions to the current KB, for example,

P(A25 gets me there on time | no accidents) = 0.06

Probabilities change with new evidence,

P(A25 gets me there on time | no accidents, 5 a.m.) = 0.15

## Making Decisions under Uncertainty

Suppose I believe the following:

P(A25 gets me there on time | …) = 0.04
P(A90 gets me there on time | …) = 0.70
P(A120 gets me there on time | …) = 0.95
P(A1440 gets me there on time | …) = 0.9999

Which action should I choose? It depends on my preferences for missing the flight vs. airport cuisine, etc.

- Utility theory is used to represent and infer preferences.
- Decision theory = utility theory + probability theory
## Probability Basics

- Let Ω be the sample space, e.g., the 6 possible rolls of a die.
- ω ∈ Ω is a sample point / possible world / atomic event.
- A probability model is a sample space with an assignment P(ω) for every ω ∈ Ω such that

  0 ≤ P(ω) ≤ 1
  Σ_ω P(ω) = 1

  E.g., P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- An event A is any subset of Ω:

  P(A) = Σ_{ω∈A} P(ω)

  E.g., P(die roll < 4) = P(1) + P(2) + P(3) = 0.5
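The die model above maps directly onto code. This is a minimal sketch in Python; the names `die` and `prob` are illustrative, not from the lecture.

```python
# A minimal sketch of a discrete probability model: the fair die above.
die = {outcome: 1 / 6 for outcome in range(1, 7)}  # P(omega) for each sample point

def prob(event, model):
    """P(A) = sum of P(omega) over the sample points omega in event A."""
    return sum(model[omega] for omega in event)

print(prob({1, 2, 3}, die))  # P(die roll < 4) ≈ 0.5
```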

## Random Variables

- A random variable refers to a part of the world whose status is initially unknown. For example, Cavity refers to whether the tooth has a cavity.
- Each random variable has a domain of values, e.g., the domain of Cavity might be {true, false}.
- Boolean random variables have the domain {true, false}. For example, Cavity = true is also written cavity, and we often write ¬cavity for Cavity = false.
- Discrete random variables take on values from a countable domain. For example, the domain of Weather might be {sunny, rainy, cloudy, snow}.
- Continuous random variables take on values from the real numbers. For example, Temp = 21.6 or Temp < 22.0.
## Atomic Event

An atomic event is a complete specification of the state of the world. For example, if the world consists of only two variables, Cavity and Toothache, then there are four distinct atomic events:

Cavity = true ∧ Toothache = true
Cavity = true ∧ Toothache = false
Cavity = false ∧ Toothache = true
Cavity = false ∧ Toothache = false

## Propositions [1]

A proposition can be thought of as the event where the proposition is true. For example, given Boolean random variables A and B:

- event a = set of sample points where A(ω) = true
- event a ∧ b = points where A(ω) = true and B(ω) = true

With Boolean variables, a sample point = a propositional logic model, e.g., A = true, or a ∧ ¬b.

A proposition is the disjunction of the atomic events in which it is true, e.g.,

P(a ∨ b) = P(¬a ∧ b) + P(a ∧ ¬b) + P(a ∧ b)
## Propositions [2]

From the definitions, logically related events must have related probabilities. For example,

P(a ∨ b) = P(a) + P(b) − P(a ∧ b)

(Venn diagram of overlapping events A and B omitted.)
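The inclusion-exclusion identity above can be checked numerically. A sketch on a fair die, with a = "roll is even" and b = "roll < 3" (events chosen purely for illustration):

```python
# Sketch verifying P(a ∨ b) = P(a) + P(b) − P(a ∧ b) on a fair die.
omega = {w: 1 / 6 for w in range(1, 7)}

def prob(event):
    return sum(p for w, p in omega.items() if event(w))

a = lambda w: w % 2 == 0   # "roll is even"
b = lambda w: w < 3        # "roll < 3"

lhs = prob(lambda w: a(w) or b(w))                       # P(a ∨ b)
rhs = prob(a) + prob(b) - prob(lambda w: a(w) and b(w))  # inclusion-exclusion
assert abs(lhs - rhs) < 1e-9  # both are 2/3 for this pair of events
```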

## Prior Probability

- Prior or unconditional probabilities of propositions correspond to belief prior to the arrival of any evidence. E.g., P(cavity) = 0.1 and P(Weather = sunny) = 0.72.
- A probability distribution gives values for all possible assignments: P(Weather) = ⟨0.72, 0.1, 0.08, 0.1⟩
- A joint probability distribution gives the probability of every atomic event. E.g., P(Weather, Cavity) is a 4 × 2 matrix of values:

| Weather | sunny | rain | cloudy | snow |
| --- | --- | --- | --- | --- |
| Cavity = true | 0.144 | 0.02 | 0.016 | 0.02 |
| Cavity = false | 0.576 | 0.08 | 0.064 | 0.08 |
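A joint distribution like the table above is easy to hold in a dictionary, and marginalizing out Cavity recovers the prior P(Weather) quoted on the slide. A sketch:

```python
# Sketch: store the P(Weather, Cavity) joint and marginalize out Cavity.
joint = {
    ("sunny", True): 0.144, ("rain", True): 0.02,
    ("cloudy", True): 0.016, ("snow", True): 0.02,
    ("sunny", False): 0.576, ("rain", False): 0.08,
    ("cloudy", False): 0.064, ("snow", False): 0.08,
}

p_weather = {}
for (w, _cavity), p in joint.items():  # sum over Cavity for each weather value
    p_weather[w] = p_weather.get(w, 0.0) + p

print(round(p_weather["sunny"], 3))  # 0.72, matching P(Weather = sunny)
```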
## Probability for Continuous Variables

A probability distribution is expressed as a parameterized function of the value, e.g., P(X = x) = U[18, 26](x) = uniform density between 18 and 26.

(Plot: a flat density of height 0.125 on the interval [18, 26].)

P(X = 20.5) = 0.125 really means

lim_{dx→0} P(20.5 ≤ X ≤ 20.5 + dx)/dx = 0.125
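The limit above can be seen numerically. A sketch for the uniform density U[18, 26], computing P(20.5 ≤ X ≤ 20.5 + dx)/dx for shrinking dx (the helper `cdf` is an assumed name, not from the lecture):

```python
# Sketch of the limit interpretation of a density for U[18, 26].
def cdf(x, lo=18.0, hi=26.0):
    """CDF of the uniform distribution on [lo, hi]."""
    return min(max((x - lo) / (hi - lo), 0.0), 1.0)

for dx in (1.0, 0.1, 0.001):
    ratio = (cdf(20.5 + dx) - cdf(20.5)) / dx
    print(dx, ratio)  # ratio ≈ 0.125 for every dx, since the density is flat here
```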

## Gaussian Density

P(X = x) = (1/(√(2π) σ)) e^(−(x−µ)²/(2σ²))

(Plot: the bell-shaped curve centered at µ.)
## Conditional Probability [1]

- Conditional or posterior probabilities are written P(a|b), where a and b are any propositions. This is read as "the probability of a, given that all we know is b." E.g., P(cavity|toothache) = 0.8.
- Notation for a conditional distribution, e.g., P(Cavity|Toothache).
- New evidence may change the probability, e.g.,

  P(cavity|toothache, cavity) = 1

- However, new evidence may be irrelevant, allowing simplification, e.g.,

  P(cavity|toothache, thaiWins) = P(cavity|toothache) = 0.8

## Conditional Probability [2]

Definition of conditional probability:

P(a|b) = P(a ∧ b) / P(b)   if P(b) ≠ 0

Alternative formulation (the product rule):

P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)

For conditional distributions,

P(Weather, Cavity) = P(Weather|Cavity)P(Cavity)

(View this as a 4 × 2 set of equations, not matrix multiplication.)
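The definition can be exercised on the die model again. A sketch with a = "roll is 2" and b = "roll is even" (illustrative events; note a entails b, so P(a ∧ b) = P(a)):

```python
# Sketch of the definition P(a|b) = P(a ∧ b)/P(b) on a fair die.
omega = {w: 1 / 6 for w in range(1, 7)}

def prob(event):
    return sum(p for w, p in omega.items() if event(w))

p_b = prob(lambda w: w % 2 == 0)   # P(b) = 1/2
p_ab = prob(lambda w: w == 2)      # P(a ∧ b) = 1/6, since a entails b
print(p_ab / p_b)                  # P(a|b) ≈ 1/3
```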
## Inference by Enumeration [1]

Inference by enumeration is a simple method for probabilistic inference: computing posterior probabilities of query propositions from observed evidence. The full joint distribution is used as the knowledge base. For example,

|  | toothache, catch | toothache, ¬catch | ¬toothache, catch | ¬toothache, ¬catch |
| --- | --- | --- | --- | --- |
| cavity | 0.108 | 0.012 | 0.072 | 0.008 |
| ¬cavity | 0.016 | 0.064 | 0.144 | 0.576 |

For any proposition φ, sum the atomic events where it is true:

P(φ) = Σ_{ω : ω ⊨ φ} P(ω)

## Inference by Enumeration [2]

For example, summing the entries of the toothache columns:

P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
## Inference by Enumeration [3]

P(toothache ∨ cavity) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28

## Inference by Enumeration [4]

P(¬cavity|toothache) = P(¬cavity ∧ toothache) / P(toothache)
                     = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
                     = 0.4
## Inference by Enumeration [5]

P(Cavity|toothache) = α P(Cavity, toothache)
                    = α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
                    = α [⟨0.108, 0.016⟩ + ⟨0.012, 0.064⟩]
                    = α ⟨0.12, 0.08⟩ = ⟨0.6, 0.4⟩

## Inference by Enumeration [6]

The general idea is to compute the distribution over the query variable by fixing the evidence variables and summing over the hidden variables:

P(X|e) = α P(X, e) = α Σ_y P(X, e, y)

where

- X is the query variable,
- e is the observed values of the evidence variables,
- y ranges over the remaining unobserved (hidden) variables.
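The whole enumeration procedure fits in a few lines of code. A sketch over the dental joint table from earlier slides; worlds are tuples ordered (Toothache, Catch, Cavity), and the function name `enumerate_ask` is an assumption for illustration:

```python
# Sketch of inference by enumeration over the dental joint distribution.
joint = {
    (True, True, True): 0.108,   (True, False, True): 0.012,
    (False, True, True): 0.072,  (False, False, True): 0.008,
    (True, True, False): 0.016,  (True, False, False): 0.064,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def enumerate_ask(query_index, evidence):
    """P(X|e) = alpha * sum over hidden variables y of P(X, e, y)."""
    dist = {}
    for x in (True, False):
        dist[x] = sum(
            p for world, p in joint.items()
            if world[query_index] == x
            and all(world[i] == v for i, v in evidence.items())
        )
    alpha = 1 / sum(dist.values())  # normalize so the entries sum to 1
    return {x: alpha * p for x, p in dist.items()}

# P(Cavity|toothache): query Cavity (index 2) given Toothache (index 0) = true.
print(enumerate_ask(2, {0: True}))  # ≈ {True: 0.6, False: 0.4}
```

Summing over hidden variables happens implicitly here: every world consistent with the query value and the evidence contributes, whatever its Catch value.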
## Exercises

From the given full joint distribution, compute the following probabilities and probability distributions.

|  | disease, TestA = low | disease, TestA = high | ¬disease, TestA = low | ¬disease, TestA = high |
| --- | --- | --- | --- | --- |
| TestB = low | 0.10 | 0.07 | 0.07 | 0.03 |
| TestB = norm | 0.03 | 0.07 | 0.20 | 0.07 |
| TestB = high | 0.17 | 0.13 | 0.03 | 0.03 |

1. P(disease ∧ TestB = low ∧ TestA = high)
2. P(disease ∨ TestA = low)
3. P(TestA = high ⇒ disease)
4. P(TestA = high | TestB = low, disease)
5. P(Disease | TestA = high)

## Absolute Independence

Variables A and B are independent iff

P(A|B) = P(A)   or   P(B|A) = P(B)   or   P(A, B) = P(A)P(B)

For example, Weather is independent of the dental variables, so {Toothache, Catch, Cavity, Weather} decomposes into {Toothache, Catch, Cavity} and {Weather}:

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity)P(Weather)

The number of entries reduces from 32 to 12.

- Absolute independence is powerful but very rare.
- Dentistry is a large field: its variables are not independent of one another.
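The 32-to-12 reduction can be made concrete: under absolute independence, the full joint is the product of the two smaller tables. A sketch, with placeholder (uniform) dental probabilities since only P(Weather) is given on the slides:

```python
from itertools import product

# Sketch: absolute independence builds the 32-entry joint
# P(Toothache, Catch, Cavity, Weather) from an 8-entry dental table
# and the 4-entry P(Weather). Dental numbers below are placeholders.
p_weather = {"sunny": 0.72, "rain": 0.1, "cloudy": 0.08, "snow": 0.1}
p_dental = {(t, c, cv): 1 / 8 for t, c, cv in product([True, False], repeat=3)}

joint = {(w,) + d: p_weather[w] * p_dental[d] for w in p_weather for d in p_dental}

print(len(joint), len(p_weather) + len(p_dental))  # 32 entries from only 12 numbers
```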
## Conditional Independence [1]

If a patient has a cavity, the probability that the probe catches in it does not depend on whether the patient has a toothache:

P(catch|toothache, cavity) = P(catch|cavity)

The same independence holds if the patient does not have a cavity:

P(catch|toothache, ¬cavity) = P(catch|¬cavity)

We say that Catch is conditionally independent of Toothache given Cavity:

P(Catch|Toothache, Cavity) = P(Catch|Cavity)

## Conditional Independence [2]

Two variables X and Y are conditionally independent given Z iff

P(X|Y, Z) = P(X|Z)
P(Y|X, Z) = P(Y|Z)
P(X, Y|Z) = P(X|Z)P(Y|Z)
## Conditional Independence [3]

The full joint distribution P(Toothache, Catch, Cavity) has 2³ − 1 = 7 independent entries. We can write the distribution using the chain rule:

P(Toothache, Catch, Cavity)
  = P(Toothache|Catch, Cavity)P(Catch, Cavity)
  = P(Toothache|Catch, Cavity)P(Catch|Cavity)P(Cavity)
  = P(Toothache|Cavity)P(Catch|Cavity)P(Cavity)

This requires only 2 + 2 + 1 = 5 independent entries. Knowing P(toothache|cavity) and P(toothache|¬cavity) is enough to specify P(Toothache|Cavity), and so on.
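The factorization can be verified against the joint table from the enumeration slides: the 5 numbers below are read off that table (P(cavity) = 0.2, P(toothache|Cavity) = 0.6/0.1, P(catch|Cavity) = 0.9/0.2), and multiplying them back reproduces every entry. A sketch:

```python
# Sketch: rebuild the 8-entry dental joint from the 5 independent numbers
# that the conditional-independence factorization needs.
p_cavity = 0.2
p_toothache_given = {True: 0.6, False: 0.1}  # P(toothache | Cavity)
p_catch_given = {True: 0.9, False: 0.2}      # P(catch | Cavity)

def joint(toothache, catch, cavity):
    """P(Toothache, Catch, Cavity) = P(T|Cv) P(C|Cv) P(Cv)."""
    p_cv = p_cavity if cavity else 1 - p_cavity
    p_t = p_toothache_given[cavity] if toothache else 1 - p_toothache_given[cavity]
    p_c = p_catch_given[cavity] if catch else 1 - p_catch_given[cavity]
    return p_t * p_c * p_cv

print(round(joint(True, True, True), 3))     # 0.108, matching the table
print(round(joint(False, False, False), 3))  # 0.576
```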

## Bayes' Rule

From the product rule,

P(a ∧ b) = P(a|b)P(b) = P(b|a)P(a)

Thus, we have

P(b|a) = P(a|b)P(b) / P(a)

This is known as Bayes' rule. More generally, in distribution form,

P(Y|X) = P(X|Y)P(Y) / P(X)
## Bayes' Rule: Normalization

Bayes' rule can be written with the denominator expanded over the values of Y:

P(Y|X) = P(X|Y)P(Y) / Σ_i P(X|Y = y_i)P(Y = y_i)

More compactly, with α standing for the normalization constant,

P(Y|X) = α P(X|Y)P(Y)
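Normalization in code means dividing by the sum of the unnormalized entries. A sketch, using the dental numbers P(toothache|cavity) = 0.6, P(toothache|¬cavity) = 0.1, P(cavity) = 0.2 for illustration:

```python
# Sketch of Bayes' rule with normalization: P(Y|x) = alpha * P(x|Y) P(Y).
def posterior(likelihood, prior):
    """likelihood[y] = P(x|Y=y); prior[y] = P(Y=y); returns normalized P(Y|x)."""
    unnormalized = {y: likelihood[y] * prior[y] for y in prior}
    alpha = 1 / sum(unnormalized.values())
    return {y: alpha * p for y, p in unnormalized.items()}

# P(Cavity|toothache) comes out as ≈ ⟨0.6, 0.4⟩ with these numbers.
print(posterior({True: 0.6, False: 0.1}, {True: 0.2, False: 0.8}))
```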

## Applying Bayes' rule: Example [1]

A doctor knows that the disease meningitis causes the patient to have a stiff neck 50% of the time. The doctor also knows some unconditional facts: the prior probability that a patient has meningitis is 1/50000, and the prior probability that any patient has a stiff neck is 1/20.

What is the probability that a patient who has a stiff neck has meningitis?
## Applying Bayes' rule: Example [2]

Let s be the proposition that the patient has a stiff neck, and m be the proposition that the patient has meningitis. We have

P(s|m) = 0.5
P(m) = 1/50000
P(s) = 1/20

Thus,

P(m|s) = P(s|m)P(m) / P(s) = (0.5 × 1/50000) / (1/20) = 0.0002
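A one-line numeric check of the computation above:

```python
# Numeric check of the meningitis example.
p_s_given_m = 0.5   # P(s|m): meningitis causes a stiff neck 50% of the time
p_m = 1 / 50000     # prior P(m)
p_s = 1 / 20        # prior P(s)

p_m_given_s = p_s_given_m * p_m / p_s  # Bayes' rule
print(p_m_given_s)  # ≈ 0.0002
```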

## Applying Bayes' rule: Example [3]

Using Bayes' rule and normalization, we have

P(M|s) = α ⟨P(s|m)P(m), P(s|¬m)P(¬m)⟩

In general, we have

P(Y|X) = α P(X|Y)P(Y)
## Combining Evidence: Example [1]

When we want to combine several pieces of evidence:

P(Cv|T, Ct) = P(T, Ct|Cv)P(Cv) / P(T, Ct)

However, since Catch and Toothache are conditionally independent given Cavity, we have

P(Cv|T, Ct) = P(T|Cv)P(Ct|Cv)P(Cv) / P(T, Ct)

So we can incorporate each piece of evidence sequentially.

## Naïve Bayes Model

P(Cause, Effect_1, …, Effect_n) = P(Cause) Π_i P(Effect_i|Cause)

(Diagram: a single Cause node with arrows to Effect_1, Effect_2, …, Effect_n.)
## Exercises

1. After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease and that the test is 99% accurate. The good news is that this is a rare disease, striking only 1 in 10,000 people of your age. What are the chances that you actually have the disease?¹
2. In a dishonest casino, one die out of 100 dice is loaded to make 6 come up 50% of the time. If someone rolls three 6's in a row, what is the probability that the die is loaded?²

¹ AIMA Exercise 13.8
² http://www.mscs.mu.edu/~cstruble/class/cosc159/spring2003/notes/

## Application to Data Mining [1]

The classification problem aims to predict the class of an object from a given set of evidence. For example, a credit card company wants to predict customers' credit risk from their applications, which consist of several attributes.

We need to compute P(Class|E = e), where Class ranges over the set of classes and E is the evidence. Then we select the class c with the highest probability:

c_predicted = argmax_c P(Class = c|E = e)
            = argmax_c P(E = e|Class = c)P(Class = c) / P(E = e)
            = argmax_c P(E = e|Class = c)P(Class = c)
## Application to Data Mining [2]

By gathering statistical data, we can estimate the probability of the evidence in each class, i.e., we know P(E = e|Class = c) and P(Class = c).

Generally, the evidence consists of several attributes: E = ⟨e_1, e_2, e_3, …⟩. To simplify computation, the attributes of E are assumed to be conditionally independent given Class. Thus,

P(E = e|Class = c) = Π_{i=1}^{k} P(E_i = e_i|Class = c)

This data mining technique is called "Naïve Bayesian Classification".

## Naïve Bayesian Classification: Example [1]

| ID | Credit History | Debt | Collateral | Income | Credit Risk? |
| --- | --- | --- | --- | --- | --- |
| 1 | bad | high | none | 0-15k | high |
| 2 | unknown | high | none | 15-35k | high |
| 3 | unknown | low | none | 15-35k | moderate |
| 4 | unknown | low | none | 0-15k | high |
| 5 | unknown | low | none | >35k | low |
| 6 | unknown | low | adequate | >35k | low |
| 7 | bad | low | none | 0-15k | high |
| 9 | good | low | none | >35k | low |
| 10 | good | high | adequate | >35k | low |
| 11 | good | high | none | 0-15k | high |
| 12 | good | high | none | 15-35k | moderate |
| 13 | good | high | none | >35k | low |
| 14 | bad | high | none | 15-35k | high |

Source: http://www.mscs.mu.edu/~cstruble/class/cosc159/spring2003/notes/
## Naïve Bayesian Classification: Example [2]

What is the credit risk of the following customer?

e = ⟨bad, low, none, 0-15k⟩

To predict the credit risk, we select the most suitable class:

c_predict = argmax_{c ∈ {low, mod, high}} P(E = e|Class = c)P(Class = c)

## Naïve Bayesian Classification: Example [3]

Since we have three classes, we compute the score for each case:

1. Case: Class = low

P(E = e|Class = low)P(Class = low)
  = (Π_i P(E_i = e_i|Class = low)) P(Class = low)
  = P(History = bad|Class = low) × P(Debt = low|Class = low) ×
    P(Collateral = none|Class = low) × P(Income = 0-15k|Class = low) × P(Class = low)
  ≈ 0/5 × 3/5 × 3/5 × 0/5 × 5/14 = 0.0
## Naïve Bayesian Classification: Example [4]

2. Case: Class = moderate

P(E = e|Class = moderate)P(Class = moderate)
  ≈ 1/3 × 2/3 × 2/3 × 0/3 × 3/14 = 0.0

3. Case: Class = high

P(E = e|Class = high)P(Class = high)
  ≈ 3/6 × 2/6 × 6/6 × 4/6 × 6/14 ≈ 0.05

Thus,

argmax_c P(E = e|Class = c)P(Class = c) = high
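The whole worked example can be sketched in code. Note one caveat: row ID 8 is missing from the table as reproduced here, so the counts below use 13 examples rather than the 14 behind the slides' fractions; the predicted class still comes out as "high".

```python
# Sketch of Naïve Bayesian Classification on the credit table above
# (13 of the original 14 rows; row 8 is missing from the extracted table).
data = [  # (history, debt, collateral, income, risk)
    ("bad", "high", "none", "0-15k", "high"),
    ("unknown", "high", "none", "15-35k", "high"),
    ("unknown", "low", "none", "15-35k", "moderate"),
    ("unknown", "low", "none", "0-15k", "high"),
    ("unknown", "low", "none", ">35k", "low"),
    ("unknown", "low", "adequate", ">35k", "low"),
    ("bad", "low", "none", "0-15k", "high"),
    ("good", "low", "none", ">35k", "low"),
    ("good", "high", "adequate", ">35k", "low"),
    ("good", "high", "none", "0-15k", "high"),
    ("good", "high", "none", "15-35k", "moderate"),
    ("good", "high", "none", ">35k", "low"),
    ("bad", "high", "none", "15-35k", "high"),
]

def score(e, c):
    """Estimate P(E = e|Class = c) P(Class = c) by counting."""
    rows = [r for r in data if r[-1] == c]
    likelihood = 1.0
    for i, value in enumerate(e):
        likelihood *= sum(1 for r in rows if r[i] == value) / len(rows)
    return likelihood * len(rows) / len(data)  # likelihood times the prior

e = ("bad", "low", "none", "0-15k")
prediction = max(["low", "moderate", "high"], key=lambda c: score(e, c))
print(prediction)  # high
```

The zero counts that drive the low and moderate scores to 0.0 are exactly why, in practice, naïve Bayes implementations usually add smoothing to the counts; the slides' unsmoothed estimates are kept here to match the worked example.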
