Probabilistic Approaches to Data Mining
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
Overview
Introduction to Probability
Bayes’ Rule
Naïve Bayesian Classification
Model Based Clustering
Other applications
MSCS 282: Data Mining - Craig A.
Struble 2
Probability
Let P(A) represent the probability that
proposition A is true.
Example: Let Risky represent that a customer is a
high credit risk. P(Risky) = 0.519 means that there is
a 51.9% chance a given customer is a high-credit
risk.
Without any other information, this probability is
called the prior or unconditional probability
Random Variables
Could also consider a random variable X, which
can take on one of many values in its domain
<x1,x2,…,xn>
Example: Let Weather be a random variable with
domain <sunny, rain, cloudy, snow>. The
probabilities of Weather taking on one of these
values is
P(Weather=sunny)=0.7 P(Weather=rain)=0.2
P(Weather=cloudy)=0.08 P(Weather=snow)=0.02
Probability Distributions
The notation P(X) is used to represent the probabilities
of all possible values of a random variable
Example, P(Weather) = <0.7,0.2,0.08,0.02>
The statement above defines a probability distribution
for the random variable Weather
The notation P(Weather, Risky) is used to denote the
probabilities of all combinations of the two variables.
Represented by a 4x2 table of probabilities
Conditional Probability
Probabilities of events change when we know
something about the world
The notation P(A|B) is used to represent the
conditional or posterior probability of A
Read “the probability of A given that all we know is
B.”
P(Weather = snow | Temperature = below freezing) = 0.10
Logical Connectives
We can use logical connectives for probabilities
P(Weather = snow ∧ Temperature = below freezing)
Can use disjunctions (or) or negation (not) as well
The product rule
P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
Using probability distributions
P(X,Y) = P(X|Y)P(Y)
which is equivalent to saying
P(X=xi ∧ Y=yj) = P(X=xi|Y=yj)P(Y=yj) for all i and j
Axioms of Probability
All probabilities are between 0 and 1
0 ≤ P(A) ≤ 1
Necessarily true propositions have prob. of 1,
necessarily false prob. of 0
P(true) = 1 P(false) = 0
The probability of a disjunction is given by
P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Joint Probability Distributions
Recall P(A,B) represents the probabilities of all
possible combinations of assignments to random
variables A and B.
More generally, P(X1, …, Xn) for random
variables X1, …, Xn is called the joint probability
distribution, or simply the joint
Joint Probability Distributions
Example
                 Risky    ¬Risky
Weather=sunny    0.30     0.40
Weather=rain     0.15     0.05
Weather=cloudy   0.05     0.03
Weather=snow     0.019    0.001
P(Weather=sunny) = 0.7 (add up row)
P(Risky) = 0.519 (add up column)
What about P(Weather=sunnyRisky)?
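The row and column sums on this slide can be checked with a short script. This is only a sketch: the dictionary layout and the function names are illustrative choices, and the numbers are the joint table entries from the slide. It also uses the product rule, rearranged, to get one conditional probability from the joint.

```python
# Joint distribution P(Weather, Risky) from the slide's table.
# Keys are (weather, risky) pairs; values are joint probabilities.
joint = {
    ("sunny", True): 0.30, ("sunny", False): 0.40,
    ("rain", True): 0.15, ("rain", False): 0.05,
    ("cloudy", True): 0.05, ("cloudy", False): 0.03,
    ("snow", True): 0.019, ("snow", False): 0.001,
}

def marginal_weather(w):
    """Add up a row: P(Weather=w)."""
    return sum(p for (weather, _), p in joint.items() if weather == w)

def marginal_risky(r):
    """Add up a column: P(Risky=r)."""
    return sum(p for (_, risky), p in joint.items() if risky == r)

p_sunny = marginal_weather("sunny")
p_risky = marginal_risky(True)

# Product rule rearranged: P(sunny | Risky) = P(sunny and Risky) / P(Risky)
p_sunny_given_risky = joint[("sunny", True)] / p_risky

print(round(p_sunny, 3), round(p_risky, 3), round(p_sunny_given_risky, 3))
```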
Bayes’ Rule
Bayes’ rule relates conditional probabilities
P(A ∧ B) = P(A|B)P(B)
P(A ∧ B) = P(B|A)P(A)
Bayes’ Rule
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
Generalizing Bayes’ Rule
For probability distributions
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$
Conditionalized on background evidence E
$$P(B \mid A, E) = \frac{P(A \mid B, E)\,P(B \mid E)}{P(A \mid E)}$$
Bayes’ Rule Example
Suppose we want P(Risky|Weather=sunny)
P(Weather=sunny|Risky) = 0.578,
P(Risky)=0.519, P(Weather=sunny)=0.7
Using Bayes’ Rule we get
$$P(\text{Risky} \mid \text{Weather}=\text{sunny}) = \frac{P(\text{Weather}=\text{sunny} \mid \text{Risky})\,P(\text{Risky})}{P(\text{Weather}=\text{sunny})}$$
P(Risky|Weather=sunny) = (0.578*0.519)/0.7=0.429
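The arithmetic in this example can be reproduced with a small helper. The function name `bayes` and its parameter names are illustrative, not from the slides; the three inputs are the values given on the slide.

```python
def bayes(p_a_given_b, p_b, p_a):
    """Return P(B|A) = P(A|B) P(B) / P(A)."""
    if p_a <= 0:
        raise ValueError("P(A) must be positive")
    return p_a_given_b * p_b / p_a

# A is Weather=sunny, B is Risky.
p_risky_given_sunny = bayes(p_a_given_b=0.578, p_b=0.519, p_a=0.7)
print(round(p_risky_given_sunny, 3))
```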
Bayes’ Rule Example
Dishonest casino: Loaded die where 6 comes up 50% of the time. 1 out of
every 100 dice is loaded
P(Dfair)=0.99 P(Dloaded) = 0.01
Let’s say someone rolls three 6’s in a row. What’s the probability that the die is
loaded?
$$\begin{aligned}
P(D_{loaded} \mid 3sixes) &= \frac{P(3sixes \mid D_{loaded})\,P(D_{loaded})}{P(3sixes)} \\
&= \frac{P(3sixes \mid D_{loaded})\,P(D_{loaded})}{P(3sixes \mid D_{loaded})\,P(D_{loaded}) + P(3sixes \mid D_{fair})\,P(D_{fair})} \\
&= \frac{(0.5^3)(0.01)}{(0.5^3)(0.01) + (1/6)^3(0.99)} \\
&\approx 0.21
\end{aligned}$$
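The casino calculation can be done exactly with rational arithmetic. This is a sketch under the slide's assumptions (independent rolls, the stated priors); the function name `posterior_after_sixes` is a hypothetical helper, and passing the number of sixes as a parameter is a small generalization of the three-sixes case.

```python
from fractions import Fraction

# Priors and per-roll probability of a six, as stated on the slide.
prior = {"fair": Fraction(99, 100), "loaded": Fraction(1, 100)}
p_six = {"fair": Fraction(1, 6), "loaded": Fraction(1, 2)}

def posterior_after_sixes(n):
    """P(die type | n sixes in a row), assuming independent rolls."""
    # Numerator of Bayes' rule for each die type: P(n sixes | D) P(D)
    unnormalized = {d: p_six[d] ** n * prior[d] for d in prior}
    # Denominator: P(n sixes), summed over both die types
    total = sum(unnormalized.values())
    return {d: v / total for d, v in unnormalized.items()}

post = posterior_after_sixes(3)
print(post["loaded"], float(post["loaded"]))
```

Working with `Fraction` avoids rounding until the final conversion to a float.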
Normalization
Direct assessment of P(A) may not be possible,
but we can use the fact
$$P(A) = \sum_i P(A \mid B = b_i)\,P(B = b_i)$$
to estimate the value.
Example:
P(Weather=sunny) = P(Weather=sunny | Risky)P(Risky)
+ P(Weather=sunny | ¬Risky)P(¬Risky)
Normalization
Using the previous fact, Bayes’ rule can be
written
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{\sum_i P(A \mid B = b_i)\,P(B = b_i)}$$
More generally
$$P(B \mid A) = \alpha\,P(A \mid B)\,P(B)$$
where $\alpha$ is a constant that makes the probability
distribution table for P(B|A) total to 1.
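The normalizing constant never has to be assessed directly: compute the unnormalized products for every value of B, then divide by their sum. A minimal sketch follows, where `normalize` is a hypothetical helper name and the two products are taken from the Weather/Risky joint table (by the product rule they equal the joint entries 0.30 and 0.40).

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1 (multiplication by alpha)."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    alpha = 1.0 / total
    return {k: alpha * v for k, v in scores.items()}

# P(sunny | B=b) P(B=b) for b in {Risky, not Risky}
unnormalized = {"Risky": 0.30, "not Risky": 0.40}
posterior = normalize(unnormalized)  # P(B | Weather=sunny)
print({k: round(v, 3) for k, v in posterior.items()})
```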
Application to Data Mining
In data mining, we have data which provides evidence
of probabilities
Let h be a hypothesis (classification) and D be a data
observation
P(h) - probability of classification before data
P(D) - probability of obtaining data sample
P(h|D) - probability of classification after observing data
P(D|h) - probability that we would observe D given that
h is the correct classification.
Naïve Bayesian Classifier
Choose the most likely classification using
Bayesian techniques
MAP (maximum a posteriori classification)
$$\arg\max_i P(h_i \mid D) = \arg\max_i \frac{P(D \mid h_i)\,P(h_i)}{P(D)} = \arg\max_i P(D \mid h_i)\,P(h_i)$$
P(D) is the same for every hi, so it can be dropped
Naïve Bayesian Classifier
ML (maximum likelihood)
Assume P(hi) = P(hj) (classifications are equally
likely)
Choose the hi given by
$$\arg\max_i P(h_i \mid D) = \arg\max_i P(D \mid h_i)$$
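The difference between the MAP and ML choices shows up whenever the priors are unequal. The sketch below uses made-up likelihoods and priors for two hypothetical hypotheses `h1` and `h2` (none of these numbers come from the slides), and the function names are illustrative.

```python
def map_hypothesis(likelihood, prior):
    """Pick the hypothesis maximizing P(D|h) P(h)."""
    return max(likelihood, key=lambda h: likelihood[h] * prior[h])

def ml_hypothesis(likelihood):
    """Pick the hypothesis maximizing P(D|h), i.e. MAP with equal priors."""
    return max(likelihood, key=lambda h: likelihood[h])

# Hypothetical numbers chosen so the two rules disagree.
likelihood = {"h1": 0.6, "h2": 0.3}   # P(D | h)
prior = {"h1": 0.1, "h2": 0.9}        # P(h)

print("ML :", ml_hypothesis(likelihood))
print("MAP:", map_hypothesis(likelihood, prior))
```

Here ML prefers `h1` because it explains the data better, while MAP prefers `h2` because its prior outweighs the likelihood gap.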
Naïve Bayesian Classifier
Generally speaking, D is a vector <d1, …, dk> of
values for attributes
To simplify things, the attributes of D are
assumed to be independent given the classification. That means we can
write
$$P(D \mid h) = \prod_{i=1}^{k} P(D_i = d_i \mid h)$$
Example
Let D=<unknown, low, none, 15-35>
Which risk category is D in?
Three hypotheses: Risk=low, Risk=moderate,
Risk=high
Because of naïve assumption, calculate
individual probabilities and then multiply
together.
Example
Conditional probabilities for each attribute value of D:
P(CH=unknown | Risk=low) = 2/5
P(CH=unknown | Risk=moderate) = 1/3
P(CH=unknown | Risk=high) = 2/6
P(Debt=low | Risk=low) = 3/5
P(Debt=low | Risk=moderate) = 1/3
P(Debt=low | Risk=high) = 2/6
P(Coll=none | Risk=low) = 3/5
P(Coll=none | Risk=moderate) = 2/3
P(Coll=none | Risk=high) = 6/6
P(Inc=15-35 | Risk=low) = 0/5
P(Inc=15-35 | Risk=moderate) = 2/3
P(Inc=15-35 | Risk=high) = 2/6
Likelihoods:
P(D|Risk=low) = 2/5*3/5*3/5*0/5 = 0
P(D|Risk=moderate) = 1/3*1/3*2/3*2/3 = 4/81 = 0.0494
P(D|Risk=high) = 2/6*2/6*6/6*2/6 = 48/1296 = 0.0370
Priors:
P(Risk=low) = 5/14
P(Risk=moderate) = 3/14
P(Risk=high) = 6/14
Likelihood times prior:
P(D|Risk=low)P(Risk=low) = 0*5/14 = 0
P(D|Risk=moderate)P(Risk=moderate) = 4/81*3/14 = 0.0106
P(D|Risk=high)P(Risk=high) = 48/1296*6/14 = 0.0159
Risk=high has the largest product, so D is classified as high risk.
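The whole calculation for D=<unknown, low, none, 15-35> can be replayed in a few lines. This sketch copies the conditional probabilities and priors from the slide; the variable names and the list layout (one entry per attribute, in the order CH, Debt, Coll, Inc) are illustrative choices.

```python
from fractions import Fraction as F

# P(attribute = value | Risk) for CH=unknown, Debt=low, Coll=none, Inc=15-35.
cond = {
    "low":      [F(2, 5), F(3, 5), F(3, 5), F(0, 5)],
    "moderate": [F(1, 3), F(1, 3), F(2, 3), F(2, 3)],
    "high":     [F(2, 6), F(2, 6), F(6, 6), F(2, 6)],
}
prior = {"low": F(5, 14), "moderate": F(3, 14), "high": F(6, 14)}

def likelihood(risk):
    """Naive Bayes assumption: P(D | Risk) is the product over attributes."""
    result = F(1)
    for p in cond[risk]:
        result *= p
    return result

# MAP score for each hypothesis: P(D | h) P(h)
scores = {risk: likelihood(risk) * prior[risk] for risk in cond}
best = max(scores, key=lambda risk: scores[risk])

for risk in cond:
    print(f"{risk:9s} P(D|h)={float(likelihood(risk)):.4f} "
          f"P(D|h)P(h)={float(scores[risk]):.4f}")
print("MAP class:", best)
```

Note that the likelihood alone is larger for Risk=moderate than for Risk=high; it is the prior that tips the MAP decision toward high.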
Considerations
When estimating probabilities from data, you
want to avoid 0 probabilities, because a single
zero factor makes the whole product 0 regardless
of the other attributes
One strategy is to use a Laplace estimator
Choose a small constant μ, split it into parts for the
numerators, and add μ to each denominator
Example
$$\frac{2 + \mu/4}{5 + \mu},\quad \frac{3 + \mu/4}{5 + \mu},\quad \frac{3 + \mu/4}{5 + \mu},\quad \frac{0 + \mu/4}{5 + \mu}$$
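The estimator is a one-line adjustment to the raw ratios. In the sketch below, `smoothed` is a hypothetical helper, the constant is split equally among the numerators as in the slide's example, and the value 1.0 for the constant is an arbitrary choice for illustration.

```python
def smoothed(counts, total, mu=1.0):
    """Return (count + mu/len(counts)) / (total + mu) for each count.

    mu is split into equal parts across the numerators and added
    once to the denominator, so no estimate is exactly zero.
    """
    share = mu / len(counts)
    return [(c + share) / (total + mu) for c in counts]

# Numerators 2, 3, 3, 0 over denominator 5, as in the slide's example.
estimates = smoothed([2, 3, 3, 0], total=5, mu=1.0)
print([round(e, 4) for e in estimates])
```

Setting `mu=0` recovers the unsmoothed ratios, which makes the effect of the constant easy to compare.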
Considerations
Missing values are just removed from the
computation
For numeric attributes, use a probability density
function, such as for a normal distribution
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
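For a numeric attribute, the density value stands in for the conditional probability in the naive Bayes product. The sketch below fits a mean and standard deviation to a handful of values and evaluates the density at a new value. The attribute values are invented for illustration and do not come from the slides, and using the sample standard deviation is one reasonable choice among several.

```python
import math
import statistics

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical income values (in thousands) observed for one class.
incomes_high_risk = [12.0, 18.0, 25.0, 9.0, 16.0]
mu = statistics.mean(incomes_high_risk)
sigma = statistics.stdev(incomes_high_risk)  # sample standard deviation

# Density used in place of P(Income = 20 | Risk = high).
density = normal_pdf(20.0, mu, sigma)
print(f"mu={mu:.2f} sigma={sigma:.2f} density={density:.4f}")
```

A density is not a probability and can exceed 1 for small standard deviations; only the relative sizes across classes matter when comparing hypotheses.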
Statistical Clustering (COBWEB)
Fisher
Incremental approach to clustering
Creates a classification tree, in which each node
describes a concept and a probabilistic
description of the concept
Prior probability of the concept
Conditional probabilities for the attributes given that
concept.
Classification Tree
Algorithm
Add each data item to the hierarchy one at a
time.
Try placing the data item in each existing node
(going level by level), and select the best node by
maximizing category utility
$$CU = \frac{\sum_{k=1}^{n} P(C_k)\left[\sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2\right]}{n}$$
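Category utility rewards partitions in which attribute values are more predictable inside a cluster than in the data as a whole. The sketch below evaluates it for nominal attributes, estimating every probability by relative frequency. The function name, the representation of a cluster as a list of tuples, and the toy data are all assumptions for illustration, and every cluster is assumed to be non-empty.

```python
from collections import Counter

def category_utility(clusters):
    """CU for a partition; each cluster is a non-empty list of instances,
    and each instance is a tuple of nominal attribute values."""
    all_items = [inst for cluster in clusters for inst in cluster]
    total = len(all_items)
    n_attrs = len(all_items[0])

    def sum_sq(instances):
        # sum over attributes i and values j of P(A_i = V_ij)^2,
        # with probabilities estimated from the given instances
        s = 0.0
        for i in range(n_attrs):
            counts = Counter(inst[i] for inst in instances)
            s += sum((c / len(instances)) ** 2 for c in counts.values())
        return s

    baseline = sum_sq(all_items)
    weighted = sum(len(c) / total * (sum_sq(c) - baseline) for c in clusters)
    return weighted / len(clusters)

separated = [[("a",), ("a",)], [("b",), ("b",)]]
mixed = [[("a",), ("b",)], [("a",), ("b",)]]
print(category_utility(separated), category_utility(mixed))
```

The cleanly separated partition scores higher than the mixed one, which is the comparison made when deciding where to place a new instance.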
Algorithm
Incorporating a new instance might cause the
two best nodes to merge
Calculate CU for the merged nodes
Alternatively, incorporating a new instance might
cause a split
Calculate CU for splitting the best node
Probability-Based Clustering
Consider clustering data into k clusters
Model each cluster with a probability distribution
This set of k distributions is called a mixture, and
the overall model is a finite mixture model.
Each probability distribution gives the probability
of an instance being in a given cluster
Probability-Based Clustering
Simplest case: A single numeric attribute and
two clusters A and B each represented by a
normal distribution
Parameters for A: μA - mean, σA - standard dev.
Parameters for B: μB - mean, σB - standard dev.
And P(A), P(B) = 1 - P(A), the prior probabilities of
being in cluster A and B respectively
Probability-Based Clustering
data
A 51 B 62 B 64 A 48 A 39 A 51
A 43 A 47 A 51 B 64 B 62 A 48
B 62 A 52 A 52 A 51 B 64 B 64
B 64 B 64 B 62 B 63 A 52 A 42
A 45 A 51 A 49 A 43 B 63 A 48
A 42 B 65 A 48 B 65 B 64 A 41
A 46 A 48 B 62 B 66 A 48
A 45 A 49 A 43 B 65 B 64
A 45 A 46 A 40 A 46 A 48
model
μA=50, σA=5, pA=0.6 μB=65, σB=2, pB=0.4
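Given the model parameters, the probability that a value belongs to cluster A follows from Bayes' rule, with the normal densities playing the role of the likelihoods. The sketch below uses the model values from this slide as defaults; the function names are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def prob_cluster_a(x, mu_a=50.0, sigma_a=5.0, p_a=0.6, mu_b=65.0, sigma_b=2.0):
    """P(A | x) for a two-cluster mixture, with P(B) = 1 - P(A)."""
    p_b = 1.0 - p_a
    weight_a = p_a * normal_pdf(x, mu_a, sigma_a)
    weight_b = p_b * normal_pdf(x, mu_b, sigma_b)
    return weight_a / (weight_a + weight_b)

for x in (45, 58, 65):
    print(x, round(prob_cluster_a(x), 4))
```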
Probability-Based Clustering
Question is, how do we know the parameters for
the mixture?
μA, σA, μB, σB, P(A)
If the data is labeled, this is easy
But clustering is more often used for unlabeled data
Use an iterative approach similar in spirit to the
k-means algorithm
Expectation Maximization
Start with initial guesses for the parameters
Calculate cluster probabilities for each instance
Expectation
Reestimate the parameters from probabilities
Maximization
Repeat
Maximization
Let wi be the probability of instance i belonging
to cluster A
Recalculated parameters are
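$$\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$
$$\sigma_A^2 = \frac{w_1 (x_1 - \mu_A)^2 + w_2 (x_2 - \mu_A)^2 + \cdots + w_n (x_n - \mu_A)^2}{w_1 + w_2 + \cdots + w_n}$$
$$P(A) = \frac{\left|\{\, i \mid w_i \text{ is the largest probability of all clusters} \,\}\right|}{n}$$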
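The three update formulas translate directly into code. This sketch follows the slide's definitions for cluster A in the two-cluster case, where a weight is the largest of all clusters when it exceeds 0.5 (a tie at exactly 0.5 is counted as not A here, which is an arbitrary choice). The function name and the toy inputs are illustrative, and the weights are assumed not to sum to zero.

```python
import math

def maximization_step(xs, ws):
    """Re-estimate (mu_A, sigma_A, P(A)) from values xs and weights ws,
    where ws[i] is the probability that xs[i] belongs to cluster A."""
    total_w = sum(ws)
    # Weighted mean
    mu_a = sum(w * x for w, x in zip(ws, xs)) / total_w
    # Weighted variance around the new mean
    var_a = sum(w * (x - mu_a) ** 2 for w, x in zip(ws, xs)) / total_w
    # Fraction of instances whose largest membership probability is for A
    p_a = sum(1 for w in ws if w > 0.5) / len(xs)
    return mu_a, math.sqrt(var_a), p_a

# Toy input: three values firmly in A, one firmly in B.
mu_a, sigma_a, p_a = maximization_step([40.0, 50.0, 60.0, 64.0], [1.0, 1.0, 1.0, 0.0])
print(mu_a, round(sigma_a, 3), p_a)
```

Many presentations of EM instead update P(A) as the average of the weights; the count-based version above is the one written on this slide.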
Termination
The EM algorithm converges to a maximum, but
never gets there
Continue until overall likelihood growth is
negligible
$$\prod_{i=1}^{n} \left( P(A)\,P(x_i \mid A) + P(B)\,P(x_i \mid B) \right)$$
Maximum could be local, so repeat several times
with different initial values
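The stopping test can be sketched as follows. Working with the logarithm of the overall likelihood is an implementation choice made here to avoid numerical underflow when many small factors are multiplied; the function names and the tolerance are assumptions. The sample values are the first row of the data slide, and the second parameter set is a deliberately poor guess used only for comparison.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def log_likelihood(xs, mu_a, sigma_a, mu_b, sigma_b, p_a):
    """Log of prod_i [P(A) P(x_i|A) + P(B) P(x_i|B)] for a two-cluster mixture."""
    p_b = 1.0 - p_a
    return sum(
        math.log(p_a * normal_pdf(x, mu_a, sigma_a) + p_b * normal_pdf(x, mu_b, sigma_b))
        for x in xs
    )

def has_converged(previous, current, tol=1e-6):
    """Stop when the log-likelihood changes by less than tol between iterations."""
    return abs(current - previous) < tol

xs = [51.0, 62.0, 64.0, 48.0, 39.0, 51.0]
ll_model = log_likelihood(xs, 50.0, 5.0, 65.0, 2.0, 0.6)
ll_poor = log_likelihood(xs, 30.0, 5.0, 80.0, 2.0, 0.6)
print(round(ll_model, 3), round(ll_poor, 3))
```

Because the maximum may be local, the same check would be applied in each of several runs started from different initial values, keeping the run with the highest final likelihood.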
Extending the Model
Extending to multiple clusters is straightforward, just use k
normal distributions
For multiple attributes, assume independence and multiply
attribute probabilities as in Naïve Bayes
For nominal attributes, can’t use normal distribution. Have to
create probability distributions for the values, one per cluster.
This gives kv parameters to estimate, where v is the number of
values for the nominal attribute.
Can use different distributions depending on data: e.g., log-
normal distribution for attributes with minimum