# Probabilistic Approaches to Data Mining

Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
## Overview

- Introduction to Probability
- Bayes' Rule
- Naïve Bayesian Classification
- Model-Based Clustering
- Other applications

## Probability

- Let P(A) represent the probability that proposition A is true.
  - Example: Let Risky represent the proposition that a customer is a high credit risk. P(Risky) = 0.519 means there is a 51.9% chance that a given customer is a high credit risk.
- Without any other information, this probability is called the prior or unconditional probability.
## Random Variables

- We can also consider a random variable X, which takes on one of many values in its domain ⟨x1, x2, …, xn⟩.
  - Example: Let Weather be a random variable with domain ⟨sunny, rain, cloudy, snow⟩. The probabilities of Weather taking on these values are:
    - P(Weather=sunny) = 0.7, P(Weather=rain) = 0.2
    - P(Weather=cloudy) = 0.08, P(Weather=snow) = 0.02
## Probability Distributions

- The notation **P**(X) is used to represent the probabilities of all possible values of a random variable.
  - Example: **P**(Weather) = ⟨0.7, 0.2, 0.08, 0.02⟩
- The statement above defines a probability distribution for the random variable Weather.
- The notation **P**(Weather, Risky) is used to denote the probabilities of all combinations of the two variables, represented by a 4×2 table of probabilities.

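To make the notation concrete, here is a minimal Python sketch of **P**(Weather) as a lookup table (the plain-dict representation is an illustrative choice, not something from the slides):

```python
# P(Weather): the distribution from the slide as a plain dict.
p_weather = {"sunny": 0.7, "rain": 0.2, "cloudy": 0.08, "snow": 0.02}

# A valid distribution must sum to 1.
assert abs(sum(p_weather.values()) - 1.0) < 1e-9

# P(Weather=rain) is then just a lookup.
print(p_weather["rain"])  # 0.2
```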
## Conditional Probability

- Probabilities of events change when we know something about the world.
- The notation P(A|B) is used to represent the conditional or posterior probability of A.
  - Read "the probability of A given that all we know is B."
  - Example: P(Weather=snow | Temperature=below freezing) = 0.10

## Logical Connectives

- We can use logical connectives in probabilities:
  - P(Weather=snow ∧ Temperature=below freezing)
  - Disjunctions (∨) and negation (¬) can be used as well.
- The product rule:
  - P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
- Using probability distributions:
  - **P**(X, Y) = **P**(X|Y)**P**(Y), which is equivalent to saying P(X=xi ∧ Y=yj) = P(X=xi|Y=yj)P(Y=yj) for all i and j.

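As a quick numeric check of the product rule, the sketch below multiplies two values that appear later in the deck (P(Weather=sunny | Risky) = 0.578 and P(Risky) = 0.519) and recovers the corresponding entry of the upcoming joint table:

```python
# Product rule: P(Weather=sunny AND Risky) = P(Weather=sunny | Risky) * P(Risky)
p_sunny_given_risky = 0.578
p_risky = 0.519
print(round(p_sunny_given_risky * p_risky, 2))  # 0.30, the joint table entry
```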
## Axioms of Probability

- All probabilities are between 0 and 1:
  - 0 ≤ P(A) ≤ 1
- Necessarily true propositions have probability 1, necessarily false propositions probability 0:
  - P(true) = 1, P(false) = 0
- The probability of a disjunction is given by
  - P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

## Joint Probability Distributions

- Recall that **P**(A, B) represents the probabilities of all possible combinations of assignments to random variables A and B.
- More generally, **P**(X1, …, Xn) for random variables X1, …, Xn is called the joint probability distribution, or joint.

## Joint Probability Distributions

- Example:

|                | Risky | ¬Risky |
|----------------|-------|--------|
| Weather=sunny  | 0.30  | 0.40   |
| Weather=rain   | 0.15  | 0.05   |
| Weather=cloudy | 0.05  | 0.03   |
| Weather=snow   | 0.019 | 0.001  |

- P(Weather=sunny) = 0.7 (add up the row)
- P(Risky) = 0.519 (add up the column)
- What about P(Weather=sunny ∨ Risky)?
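The marginals and the disjunction can be read straight off the table; a short sketch (the tuple-keyed dict is again just an illustrative encoding):

```python
# The joint P(Weather, Risky), keyed by (weather, risky) pairs.
joint = {
    ("sunny", True): 0.30,  ("sunny", False): 0.40,
    ("rain", True): 0.15,   ("rain", False): 0.05,
    ("cloudy", True): 0.05, ("cloudy", False): 0.03,
    ("snow", True): 0.019,  ("snow", False): 0.001,
}

# Marginalize by summing a row or a column of the table.
p_sunny = sum(p for (w, _), p in joint.items() if w == "sunny")  # 0.7
p_risky = sum(p for (_, r), p in joint.items() if r)             # 0.519

# Disjunction via the axiom P(A or B) = P(A) + P(B) - P(A and B).
print(p_sunny + p_risky - joint[("sunny", True)])  # 0.919
```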
## Bayes' Rule

- Bayes' rule relates conditional probabilities.
- From the product rule:
  - P(A ∧ B) = P(A|B)P(B)
  - P(A ∧ B) = P(B|A)P(A)
- Bayes' Rule:

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}$$

## Generalizing Bayes' Rule

- For probability distributions:

$$\mathbf{P}(B \mid A) = \frac{\mathbf{P}(A \mid B)\,\mathbf{P}(B)}{P(A)}$$

- Conditionalized on background evidence E:

$$\mathbf{P}(B \mid A, E) = \frac{\mathbf{P}(A \mid B, E)\,\mathbf{P}(B \mid E)}{P(A \mid E)}$$

## Bayes' Rule Example

- Suppose we want P(Risky | Weather=sunny).
- We know P(Weather=sunny | Risky) = 0.578, P(Risky) = 0.519, and P(Weather=sunny) = 0.7.
- Using Bayes' Rule we get

$$P(Risky \mid Weather{=}sunny) = \frac{P(Weather{=}sunny \mid Risky)\,P(Risky)}{P(Weather{=}sunny)} = \frac{0.578 \times 0.519}{0.7} \approx 0.429$$
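The same computation as a tiny helper function (the function name is my own; the numbers are the slide's):

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' rule: P(B|A) = P(A|B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

print(round(bayes(0.578, 0.519, 0.7), 3))  # 0.429 = P(Risky | Weather=sunny)
```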
## Bayes' Rule Example

- Dishonest casino: a loaded die on which 6 comes up 50% of the time. One out of every 100 dice is loaded.
- P(D_fair) = 0.99, P(D_loaded) = 0.01
- Say someone rolls three 6's in a row. What's the probability that the die is loaded?

$$P(D_{loaded} \mid 3\,sixes) = \frac{P(3\,sixes \mid D_{loaded})\,P(D_{loaded})}{P(3\,sixes \mid D_{loaded})\,P(D_{loaded}) + P(3\,sixes \mid D_{fair})\,P(D_{fair})}$$

$$= \frac{(0.5^3)(0.01)}{(0.5^3)(0.01) + (1/6)^3(0.99)} \approx 0.21$$

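A direct transcription in Python, confirming the 0.21:

```python
# Posterior probability the die is loaded after observing three 6's.
p_loaded, p_fair = 0.01, 0.99
lik_loaded = 0.5 ** 3        # loaded die: P(six) = 0.5
lik_fair = (1 / 6) ** 3      # fair die: P(six) = 1/6

evidence = lik_loaded * p_loaded + lik_fair * p_fair  # P(3 sixes)
print(round(lik_loaded * p_loaded / evidence, 2))     # 0.21
```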
## Normalization

- Direct assessment of P(A) may not be possible, but we can use the fact that

$$P(A) = \sum_i P(A \mid B{=}b_i)\,P(B{=}b_i)$$

to estimate the value.
- Example:
  - P(Weather=sunny) = P(Weather=sunny | Risky)P(Risky) + P(Weather=sunny | ¬Risky)P(¬Risky)

## Normalization

- Using the previous fact, Bayes' rule can be written

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{\sum_i P(A \mid B{=}b_i)\,P(B{=}b_i)}$$

- More generally,

$$\mathbf{P}(B \mid A) = \alpha\,\mathbf{P}(A \mid B)\,\mathbf{P}(B)$$

where α is a constant that makes the probability distribution table for **P**(B|A) total to 1.
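In code, α never has to be computed symbolically; you just rescale the unnormalized scores, as in this sketch:

```python
def normalize(scores):
    """Rescale unnormalized scores P(A|B=b)P(B=b) so they sum to 1;
    the implied scale factor is the constant alpha from the slide."""
    total = sum(scores.values())
    return {b: s / total for b, s in scores.items()}

# Unnormalized posterior over Risky given Weather=sunny,
# using the two entries 0.30 and 0.40 of the joint table:
print(normalize({"risky": 0.30, "not_risky": 0.40}))
# {'risky': 0.428..., 'not_risky': 0.571...}, matching the 0.429 above
```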
## Application to Data Mining

- In data mining, we have data which provides evidence of probabilities.
- Let h be a hypothesis (classification) and D be a data observation.
  - P(h) - probability of the classification before seeing data
  - P(D) - probability of obtaining the data sample
  - P(h|D) - probability of the classification after observing the data
  - P(D|h) - probability that we would observe D given that h is the correct classification
## Naïve Bayesian Classifier

- Choose the most likely classification using Bayesian techniques.
- MAP (maximum a posteriori) classification:

$$\max_i P(h_i \mid D) = \max_i \frac{P(D \mid h_i)\,P(h_i)}{P(D)} = \max_i P(D \mid h_i)\,P(h_i)$$

  - P(D) can be dropped because it is the same for every hypothesis.

## Naïve Bayesian Classifier

- ML (maximum likelihood):
  - Assume P(hi) = P(hj) (classifications are equally likely).
  - Choose hi such that

$$\max_i P(h_i \mid D) = \max_i P(D \mid h_i)$$

## Naïve Bayesian Classifier

- Generally speaking, D is a vector ⟨d1, …, dk⟩ of values for attributes.
- To simplify things, the attributes of D are assumed to be independent given the class. That means we can write

$$P(D \mid h) = \prod_{i=1}^{k} P(D_i{=}d_i \mid h)$$

## Example

[Figure: training data table of 14 loan applicants with attributes Credit History, Debt, Collateral, Income, and the class Risk (low / moderate / high).]
## Example

- Let D = ⟨unknown, low, none, 15-35⟩.
- Which risk category is D in?
  - Three hypotheses: Risk=low, Risk=moderate, Risk=high
- Because of the naïve assumption, calculate the individual probabilities and then multiply them together.

## Example

Conditional probabilities estimated from the training data:

- P(CH=unknown | Risk=low) = 2/5, P(CH=unknown | Risk=moderate) = 1/3, P(CH=unknown | Risk=high) = 2/6
- P(Debt=low | Risk=low) = 3/5, P(Debt=low | Risk=moderate) = 1/3, P(Debt=low | Risk=high) = 2/6
- P(Coll=none | Risk=low) = 3/5, P(Coll=none | Risk=moderate) = 2/3, P(Coll=none | Risk=high) = 6/6
- P(Inc=15-35 | Risk=low) = 0/5, P(Inc=15-35 | Risk=moderate) = 2/3, P(Inc=15-35 | Risk=high) = 2/6

Class priors:

- P(Risk=low) = 5/14, P(Risk=moderate) = 3/14, P(Risk=high) = 6/14

Likelihoods under the naïve assumption:

- P(D | Risk=low) = 2/5 · 3/5 · 3/5 · 0/5 = 0
- P(D | Risk=moderate) = 1/3 · 1/3 · 2/3 · 2/3 = 4/81 ≈ 0.0494
- P(D | Risk=high) = 2/6 · 2/6 · 6/6 · 2/6 = 48/1296 ≈ 0.0370

Posterior scores (likelihood × prior):

- P(D | Risk=low) P(Risk=low) = 0 · 5/14 = 0
- P(D | Risk=moderate) P(Risk=moderate) = 4/81 · 3/14 ≈ 0.0106
- P(D | Risk=high) P(Risk=high) = 48/1296 · 6/14 ≈ 0.0159

Risk=high has the largest score, so it is the MAP classification.

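A minimal sketch that reproduces the slide's numbers (the dict layout and variable names are mine; the fractions are transcribed from the slide):

```python
from math import prod

# Per-class conditional probabilities for D = <unknown, low, none, 15-35>,
# in attribute order CH, Debt, Coll, Inc, plus the class priors.
cond = {
    "low":      [2/5, 3/5, 3/5, 0/5],
    "moderate": [1/3, 1/3, 2/3, 2/3],
    "high":     [2/6, 2/6, 6/6, 2/6],
}
prior = {"low": 5/14, "moderate": 3/14, "high": 6/14}

# Naive assumption: P(D|h) is the product of the per-attribute probabilities.
scores = {h: prod(cond[h]) * prior[h] for h in cond}
print({h: round(s, 4) for h, s in scores.items()})
# {'low': 0.0, 'moderate': 0.0106, 'high': 0.0159}
print(max(scores, key=scores.get))  # 'high', the MAP classification
```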
## Considerations

- When estimating probabilities from data, you want to avoid 0 probabilities, because a single 0 dominates the whole product over attributes.
- One strategy is to use a Laplace estimator:
  - Choose a small constant μ, split it into equal parts added to the numerators, and add it to each denominator.
  - Example (for the four Risk=low estimates above):

$$\frac{2 + \mu/4}{5 + \mu},\quad \frac{3 + \mu/4}{5 + \mu},\quad \frac{3 + \mu/4}{5 + \mu},\quad \frac{0 + \mu/4}{5 + \mu}$$
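A sketch of the estimator (the default μ = 1 is an illustrative choice, not from the slides):

```python
def laplace(count, total, v, mu=1.0):
    """Laplace estimate (count + mu/v) / (total + mu), where v is the
    number of values the attribute can take; mu=1.0 is illustrative."""
    return (count + mu / v) / (total + mu)

# The slide's four smoothed estimates (v=4, total=5):
print([round(laplace(c, 5, 4), 3) for c in (2, 3, 3, 0)])
# No estimate is 0 anymore, so a single unseen value no longer
# zeroes out the whole product.
```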
## Considerations

- Missing values are simply omitted from the computation.
- For numeric attributes, use a probability density function, such as that of a normal distribution:

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

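For a numeric attribute, the class-conditional "probability" becomes the density at the observed value. A sketch with hypothetical numbers (the mean 25 and standard deviation 10 are made up for illustration):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), used in place of P(attribute=x | class)."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(round(normal_pdf(30, mu=25, sigma=10), 4))  # density at x = 30
```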
## Statistical Clustering (COBWEB)

- Due to Fisher.
- An incremental approach to clustering.
- Creates a classification tree, in which each node describes a concept and gives a probabilistic description of the concept:
  - the prior probability of the concept, and
  - conditional probabilities for the attributes given the concept.

## Classification Tree

[Figure: an example COBWEB classification tree; each node holds a concept's prior probability and its per-attribute conditional probabilities.]
## Algorithm

- Add each data item to the hierarchy one at a time.
- Try placing the data item in each existing node (going level by level), and select a good node by maximizing the category utility (see the sketch below):

$$CU = \frac{\sum_{k=1}^{n} P(C_k)\left[\sum_i \sum_j P(A_i{=}V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i{=}V_{ij})^2\right]}{n}$$
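A literal transcription of the category utility formula into Python, assuming the probability estimates have already been computed from the data (the function and parameter names are my own):

```python
def category_utility(p_cluster, p_attr_given_cluster, p_attr):
    """Category utility for a partition into n clusters.
    p_cluster[k]            -- P(C_k)
    p_attr_given_cluster[k] -- list of P(A_i=V_ij | C_k) over all i, j
    p_attr                  -- list of P(A_i=V_ij) over all i, j"""
    n = len(p_cluster)
    base = sum(p ** 2 for p in p_attr)
    gain = sum(
        p_cluster[k] * (sum(p ** 2 for p in p_attr_given_cluster[k]) - base)
        for k in range(n)
    )
    return gain / n
```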
## Algorithm

- Incorporating a new instance might cause the two best nodes to merge.
  - Calculate CU for the merged nodes.
- Alternatively, incorporating a new instance might cause a split.
  - Calculate CU for splitting the best node.

## Probability-Based Clustering

- Consider clustering data into k clusters.
- Model each cluster with a probability distribution.
- This set of k distributions is called a mixture, and the overall model is a finite mixture model.
- Each probability distribution gives the probability of an instance being in a given cluster.

## Probability-Based Clustering

- Simplest case: a single numeric attribute and two clusters A and B, each represented by a normal distribution.
  - Parameters for A: mean μ_A, standard deviation σ_A
  - Parameters for B: mean μ_B, standard deviation σ_B
  - And P(A) and P(B) = 1 − P(A), the prior probabilities of being in clusters A and B respectively.

## Probability-Based Clustering: Data

Data (cluster label, attribute value):

    A 51  B 62  B 64  A 48  A 39  A 51
    A 43  A 47  A 51  B 64  B 62  A 48
    B 62  A 52  A 52  A 51  B 64  B 64
    B 64  B 64  B 62  B 63  A 52  A 42
    A 45  A 51  A 49  A 43  B 63  A 48
    A 42  B 65  A 48  B 65  B 64  A 41
    A 46  A 48  B 62  B 66  A 48
    A 45  A 49  A 43  B 65  B 64
    A 45  A 46  A 40  A 46  A 48

Model:

- μ_A = 50, σ_A = 5, p_A = 0.6
- μ_B = 65, σ_B = 2, p_B = 0.4

## Probability-Based Clustering

- The question is: how do we know the parameters of the mixture, μ_A, σ_A, μ_B, σ_B, P(A)?
  - If the data is labeled, this is easy.
  - But clustering is more often used on unlabeled data.
- Use an iterative approach similar in spirit to the k-means algorithm.

## Expectation Maximization

- Start with initial guesses for the parameters.
- Calculate the cluster probabilities for each instance (expectation).
- Re-estimate the parameters from those probabilities (maximization).
- Repeat.

## Maximization

- Let w_i be the probability of instance i belonging to cluster A.
- The re-estimated parameters are (a full EM sketch follows):

$$\mu_A = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}$$

$$\sigma_A^2 = \frac{w_1(x_1-\mu_A)^2 + w_2(x_2-\mu_A)^2 + \cdots + w_n(x_n-\mu_A)^2}{w_1 + w_2 + \cdots + w_n}$$

$$P(A) = \frac{\left|\{\,i : w_i \text{ is the largest probability over all clusters}\,\}\right|}{n}$$
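Putting the expectation and maximization steps together, a compact sketch for the two-cluster, one-attribute case (the names are mine; a fixed iteration count stands in for the likelihood-based termination test on the next slide):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def em_two_clusters(xs, mu_a, sd_a, mu_b, sd_b, p_a, iters=20):
    """One-attribute, two-cluster EM as on the slides."""
    for _ in range(iters):
        # Expectation step: w_i = P(A | x_i) for each instance i.
        w = []
        for x in xs:
            a = p_a * normal_pdf(x, mu_a, sd_a)
            b = (1 - p_a) * normal_pdf(x, mu_b, sd_b)
            w.append(a / (a + b))
        # Maximization step: re-estimate the parameters from the weights.
        sw = sum(w)
        mu_a = sum(wi * x for wi, x in zip(w, xs)) / sw
        sd_a = sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, xs)) / sw)
        mu_b = sum((1 - wi) * x for wi, x in zip(w, xs)) / (len(xs) - sw)
        sd_b = sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, xs))
                    / (len(xs) - sw))
        # The slide's counting rule for P(A): the fraction of instances for
        # which cluster A has the largest probability (w_i > 0.5 here).
        p_a = sum(1 for wi in w if wi > 0.5) / len(xs)
    return mu_a, sd_a, mu_b, sd_b, p_a
```

Seeded with rough initial guesses and run on the values from the data slide, the estimates should drift toward the slide's model parameters (μ_A = 50, μ_B = 65).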
## Termination

- The EM algorithm converges toward a maximum but never quite reaches it.
- Continue until the growth of the overall likelihood is negligible:

$$\prod_{i=1}^{n}\left[\,P(A)\,P(x_i \mid A) + P(B)\,P(x_i \mid B)\,\right]$$

- The maximum could be a local one, so repeat several times with different initial values.
## Extending the Model

- Extending to multiple clusters is straightforward: just use k normal distributions.
- For multiple attributes, assume independence and multiply the attribute probabilities, as in Naïve Bayes.
- For nominal attributes, you can't use a normal distribution; you have to create probability distributions over the values, one per cluster. This gives kv parameters to estimate, where v is the number of values of the nominal attribute.
- Different distributions can be used depending on the data: e.g., a log-normal distribution for attributes that have a minimum value.