Probabilistic Approaches to Data Mining
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science
Marquette University
Overview
 Introduction to Probability
 Bayes’ Rule
 Naïve Bayesian Classification
 Model Based Clustering
 Other applications




Probability
 Let P(A) represent the probability that
  proposition A is true.
     Example: Let Risky represent the proposition that a customer is a
     high credit risk. P(Risky) = 0.519 means that there is a 51.9%
     chance that a given customer is a high credit risk.
 Without any other information, this probability is
  called the prior or unconditional probability
Random Variables
 Could also consider a random variable X, which
  can take on one of many values in its domain
  <x1,x2,…,xn>
     Example: Let Weather be a random variable with
     domain <sunny, rain, cloudy, snow>. The
     probabilities of Weather taking on these values are
       P(Weather=sunny) = 0.7     P(Weather=rain) = 0.2
       P(Weather=cloudy) = 0.08   P(Weather=snow) = 0.02
Probability Distributions
   The notation P(X) is used to represent the probabilities
    of all possible values of a random variable
        Example: P(Weather) = <0.7, 0.2, 0.08, 0.02>
 The statement above defines a probability distribution
  for the random variable Weather
 The notation P(Weather, Risky) is used to denote the
  probabilities of all combinations of the two variables.
       Represented by a 4x2 table of probabilities


Conditional Probability
 Probabilities of events change when we know
  something about the world
 The notation P(A|B) is used to represent the
  conditional or posterior probability of A
     Read “the probability of A given that all we know is
     B.”
      P(Weather = snow | Temperature = below freezing) = 0.10


Logical Connectives
    We can use logical connectives for probabilities
        P(Weather = snow ∧ Temperature = below freezing)
        Can use disjunctions (or) or negation (not) as well
    The product rule
        P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
    Using probability distributions
        P(X,Y) = P(X|Y)P(Y)
        which is equivalent to saying
        P(X=xi ∧ Y=yj) = P(X=xi|Y=yj)P(Y=yj) for all i and j

Axioms of Probability
 All probabilities are between 0 and 1
     0 ≤ P(A) ≤ 1
 Necessarily true propositions have prob. of 1,
  necessarily false prob. of 0
     P(true) = 1    P(false) = 0
 The probability of a disjunction is given by
     P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Joint Probability Distributions
 Recall P(A,B) represents the probabilities of all
  possible combinations of assignments to random
  variables A and B.
 More generally, P(X1, …, Xn) for random
  variables X1, …, Xn is called the joint probability
  distribution or joint


Joint Probability Distributions
    Example
                        Risky    ¬Risky
        Weather=sunny   0.30     0.40
        Weather=rain    0.15     0.05
        Weather=cloudy  0.05     0.03
        Weather=snow    0.019    0.001

 P(Weather=sunny) = 0.7 (add up the row)
 P(Risky) = 0.519 (add up the column)
 What about P(Weather=sunny ∨ Risky)?
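A small Python sketch of how these numbers fit together (the dictionary layout is an assumption; the probabilities are the table entries above, and the last line uses the disjunction axiom):

    # Joint distribution P(Weather, Risky) from the table above.
    joint = {
        ("sunny", True): 0.30,  ("sunny", False): 0.40,
        ("rain", True): 0.15,   ("rain", False): 0.05,
        ("cloudy", True): 0.05, ("cloudy", False): 0.03,
        ("snow", True): 0.019,  ("snow", False): 0.001,
    }

    # Marginals: add up a row or a column of the table.
    p_sunny = sum(p for (w, _), p in joint.items() if w == "sunny")   # 0.7
    p_risky = sum(p for (_, r), p in joint.items() if r)              # 0.519

    # Disjunction via the axiom P(A or B) = P(A) + P(B) - P(A and B).
    p_sunny_or_risky = p_sunny + p_risky - joint[("sunny", True)]     # 0.919
    print(p_sunny, p_risky, p_sunny_or_risky)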
Bayes’ Rule
 Bayes’ rule relates conditional probabilities
     P(A ∧ B) = P(A|B)P(B)
     P(A ∧ B) = P(B|A)P(A)
 Bayes’ Rule
     P(B|A) = P(A|B)P(B) / P(A)

Generalizing Bayes’ Rule
 For probability distributions
     P(B|A) = P(A|B)P(B) / P(A)
 Conditionalized on background evidence E
     P(B|A,E) = P(A|B,E)P(B|E) / P(A|E)


Bayes’ Rule Example
 Suppose we want P(Risky|Weather=sunny)
 P(Weather=sunny|Risky) = 0.578,
  P(Risky)=0.519, P(Weather=sunny)=0.7
 Using Bayes’ Rule we get
     P(Risky | Weather=sunny) = P(Weather=sunny | Risky) P(Risky) / P(Weather=sunny)
     P(Risky | Weather=sunny) = (0.578 × 0.519) / 0.7 = 0.429
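For a quick numerical check, a few lines of Python with the slide’s values (the variable names are my own):

    # Bayes' rule with the numbers given above.
    p_sunny_given_risky = 0.578
    p_risky = 0.519
    p_sunny = 0.7

    p_risky_given_sunny = p_sunny_given_risky * p_risky / p_sunny
    print(round(p_risky_given_sunny, 3))   # 0.429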
Bayes’ Rule Example
    Dishonest casino: a loaded die on which 6 comes up 50% of the time. 1 out of
     every 100 dice is loaded.
    P(Dfair) = 0.99    P(Dloaded) = 0.01
    Let’s say someone rolls three 6’s in a row. What’s the probability that the die is
     loaded?

     P(Dloaded | 3 sixes) = P(3 sixes | Dloaded) P(Dloaded) / P(3 sixes)
                          = P(3 sixes | Dloaded) P(Dloaded) /
                            [P(3 sixes | Dloaded) P(Dloaded) + P(3 sixes | Dfair) P(Dfair)]
                          = (0.5³)(0.01) / [(0.5³)(0.01) + (1/6)³(0.99)]
                          ≈ 0.21

Normalization
 Direct assessment of P(A) may not be possible,
  but we can use the fact
     P(A) = Σ_i P(A | B=bi) P(B=bi)
  to estimate the value.
 Example:
     P(Weather=sunny) = P(Weather=sunny | Risky) P(Risky)
                      + P(Weather=sunny | ¬Risky) P(¬Risky)


Normalization
 Using the previous fact, Bayes’ rule can be
  written
     P(B|A) = P(A|B)P(B) / Σ_i P(A | B=bi) P(B=bi)
 More generally
     P(B|A) = α P(A|B)P(B)
  where α is a constant that makes the probability
  distribution table for P(B|A) total to 1.
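A sketch of the normalization idea in Python. The first score is the product from the previous example; the second uses P(Weather=sunny | ¬Risky) ≈ 0.832, which I derived from the joint table slide (0.40 / 0.481), so treat it as an illustration rather than a given:

    # Unnormalized scores P(Weather=sunny | B) P(B) for each value of Risky.
    scores = {
        "risky": 0.578 * 0.519,        # ~0.30
        "not_risky": 0.832 * 0.481,    # ~0.40
    }

    alpha = 1 / sum(scores.values())                      # the normalization constant
    posterior = {b: alpha * s for b, s in scores.items()}
    print(posterior)   # {'risky': ~0.429, 'not_risky': ~0.571}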
Application to Data Mining
 In data mining, we have data which provides evidence
  of probabilities
 Let h be a hypothesis (classification) and D be a data
  observation
 P(h) - probability of the classification before seeing the data
 P(D) - probability of obtaining the data sample
 P(h|D) - probability of the classification after observing the data
 P(D|h) - probability that we would observe D given that
  h is the correct classification.
Naïve Bayesian Classifier
 Choose the most likely classification using
  Bayesian techniques
 MAP (maximum a posteriori) classification

     max_i P(hi | D) = max_i P(D | hi) P(hi) / P(D)
                     = max_i P(D | hi) P(hi)



Naïve Bayesian Classifier
 ML (maximum likelihood)
     Assume P(hi) = P(hj) (classifications are equally
    likely)
     Choose hi such that

           max_i P(hi | D) = max_i P(D | hi)




Naïve Bayesian Classifier
 Generally speaking, D is a vector <d1, …, dk> of
  values for attributes
 To simplify things, the attributes for D are
  assumed to be independent. That means we can
  write

     P(D | h) = Π_{i=1..k} P(Di = di | h)


Example
    [Training data table: 14 loan applicants described by Credit History,
    Debt, Collateral, and Income, each labeled Risk = low, moderate, or high]
Example
 Let D=<unknown, low, none, 15-35>
 Which risk category is D in?
     Three hypotheses: Risk=low, Risk=moderate,
     Risk=high
 Because of the naïve assumption, calculate the
  individual probabilities and then multiply them
  together.

Example
 P(CH=unknown | Risk=low) = 2/5          P(Risk=low) = 5/14
 P(CH=unknown | Risk=moderate) = 1/3     P(Risk=moderate) = 3/14
 P(CH=unknown | Risk=high) = 2/6         P(Risk=high) = 6/14
 P(Debt=low | Risk=low) = 3/5
 P(Debt=low | Risk=moderate) = 1/3
 P(Debt=low | Risk=high) = 2/6
 P(Coll=none | Risk=low) = 3/5
 P(Coll=none | Risk=moderate) = 2/3
 P(Coll=none | Risk=high) = 6/6
 P(Inc=15-35 | Risk=low) = 0/5
 P(Inc=15-35 | Risk=moderate) = 2/3
 P(Inc=15-35 | Risk=high) = 2/6

 P(D|Risk=low) = 2/5 × 3/5 × 3/5 × 0/5 = 0
 P(D|Risk=moderate) = 1/3 × 1/3 × 2/3 × 2/3 = 4/81 ≈ 0.0494
 P(D|Risk=high) = 2/6 × 2/6 × 6/6 × 2/6 = 48/1296 ≈ 0.037

 P(D|Risk=low)P(Risk=low) = 0 × 5/14 = 0
 P(D|Risk=moderate)P(Risk=moderate) = 4/81 × 3/14 ≈ 0.0106
 P(D|Risk=high)P(Risk=high) = 48/1296 × 6/14 ≈ 0.0159

 Risk=high has the largest value, so the MAP classification for D is Risk=high.
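The whole calculation can be written compactly in Python; the numbers are exactly the counts above, so this just reproduces the slide (the dictionary layout and variable names are mine):

    from math import prod

    # P(attribute value | Risk) for D = <CH=unknown, Debt=low, Coll=none, Inc=15-35>
    cond = {
        "low":      [2/5, 3/5, 3/5, 0/5],
        "moderate": [1/3, 1/3, 2/3, 2/3],
        "high":     [2/6, 2/6, 6/6, 2/6],
    }
    prior = {"low": 5/14, "moderate": 3/14, "high": 6/14}

    # Naive Bayes score: product of the conditionals times the class prior.
    scores = {h: prod(cond[h]) * prior[h] for h in prior}
    print(scores)                          # low: 0.0, moderate: ~0.0106, high: ~0.0159
    print(max(scores, key=scores.get))     # 'high' -- the MAP classification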




Considerations
 When estimating probabilities from data, you
  want to avoid 0 probabilities, because a single
  zero makes the entire product zero regardless of
  the other attributes
 One strategy is to use a Laplace estimator
     Choose a small constant μ, split it into parts added to the
     numerators, and add μ to each denominator
     Example: 2/5, 3/5, 3/5, 0/5 become
       (2 + μ/4)/(5 + μ),  (3 + μ/4)/(5 + μ),  (3 + μ/4)/(5 + μ),  (0 + μ/4)/(5 + μ)
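A sketch of that adjustment in Python, applied to the four Risk=low counts from the example (μ is a small constant of my choosing):

    mu = 1.0
    counts = [2, 3, 3, 0]   # the Risk=low numerators from the example slide
    total = 5               # each was a count out of 5

    # Split mu into len(counts) parts for the numerators, add mu to the denominator.
    smoothed = [(c + mu / len(counts)) / (total + mu) for c in counts]
    print(smoothed)         # no estimate is zero any more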
Considerations
 Missing values are just removed from the
  computation
 For numeric attributes, use a probability density
  function, such as for a normal distribution

     f(x) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²))
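For a numeric attribute, the density is substituted for the conditional probability in the naive Bayes product. A minimal sketch, with made-up income values for one class:

    from math import exp, pi, sqrt

    def normal_pdf(x, mu, sigma):
        """Normal density f(x) with mean mu and standard deviation sigma."""
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    # Hypothetical training incomes (in $1000s) for one risk class.
    incomes = [12.0, 18.0, 25.0, 15.0, 30.0, 22.0]
    mu = sum(incomes) / len(incomes)
    sigma = sqrt(sum((x - mu) ** 2 for x in incomes) / (len(incomes) - 1))

    # Density used in place of P(Income = 20 | class).
    print(normal_pdf(20.0, mu, sigma))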




Statistical Clustering (COBWEB)
 Developed by Fisher
 Incremental approach to clustering
 Creates a classification tree, in which each node
  represents a concept together with a probabilistic
  description of the concept
     Prior probability of the concept
     Conditional probabilities for the attributes given that
     concept.

Classification Tree




Algorithm
 Add each data item to the hierarchy one at a
  time.
 Try placing the data item in each existing node
  (going level by level); select a good node by
  maximizing the category utility

     CU = [ Σ_{k=1..n} P(Ck) ( Σ_i Σ_j P(Ai=Vij | Ck)² − Σ_i Σ_j P(Ai=Vij)² ) ] / n
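A sketch of computing this category utility for nominal attributes in Python; the data layout (a list of clusters, each a list of attribute→value dicts) is an assumption for illustration:

    from collections import Counter

    def category_utility(clusters):
        """CU of a partition, following the formula above."""
        everything = [inst for cluster in clusters for inst in cluster]
        total = len(everything)
        attributes = everything[0].keys()

        def sum_of_squares(instances):
            # Sum over attributes i and values j of P(Ai = Vij)^2 within `instances`.
            s = 0.0
            for a in attributes:
                counts = Counter(inst[a] for inst in instances)
                s += sum((c / len(instances)) ** 2 for c in counts.values())
            return s

        baseline = sum_of_squares(everything)
        weighted = sum((len(c) / total) * (sum_of_squares(c) - baseline)
                       for c in clusters)
        return weighted / len(clusters)

    # Tiny example: two clusters over a single nominal attribute.
    print(category_utility([[{"outlook": "sunny"}, {"outlook": "sunny"}],
                            [{"outlook": "rainy"}]]))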

Algorithm
 Incorporating a new instance might cause the
  two best nodes to merge
     Calculate CU for the merged nodes
 Alternatively, incorporating a new instance might
  cause a split
     Calculate CU for splitting the best node



Probability-Based Clustering
 Consider clustering data into k clusters
 Model each cluster with a probability distribution
 This set of k distributions is called a mixture, and
  the overall model is a finite mixture model.
 Each probability distribution gives the probability
  of an instance being in a given cluster


Probability-Based Clustering
 Simplest case: A single numeric attribute and
  two clusters A and B each represented by a
  normal distribution
     Parameters for A: μA (mean), σA (standard deviation)
     Parameters for B: μB (mean), σB (standard deviation)
     And P(A), with P(B) = 1 − P(A), the prior probabilities of
     being in cluster A and B respectively


Probability-Based Clustering
    Data (cluster label, value):
      A 51  B 62  B 64  A 48  A 39  A 51
      A 43  A 47  A 51  B 64  B 62  A 48
      B 62  A 52  A 52  A 51  B 64  B 64
      B 64  B 64  B 62  B 63  A 52  A 42
      A 45  A 51  A 49  A 43  B 63  A 48
      A 42  B 65  A 48  B 65  B 64  A 41
      A 46  A 48  B 62  B 66  A 48
      A 45  A 49  A 43  B 65  B 64
      A 45  A 46  A 40  A 46  A 48
    Model:
      μA = 50, σA = 5, pA = 0.6      μB = 65, σB = 2, pB = 0.4

Probability-Based Clustering
 Question is, how do we know the parameters for
  the mixture?
     μA, σA, μB, σB, P(A)
     If the data are labeled, this is easy
     But clustering is more often used for unlabeled data
 Use an iterative approach similar in spirit to the
  k-means algorithm

Expectation Maximization
 Start with initial guesses for the parameters
 Calculate cluster probabilities for each instance
     Expectation
 Reestimate the parameters from probabilities
     Maximization
 Repeat


Maximization
 Let wi be the probability of instance i belonging
  to cluster A
 Recalculated parameters are

     μA = (w1 x1 + w2 x2 + … + wn xn) / (w1 + w2 + … + wn)

     σA² = (w1 (x1 − μA)² + w2 (x2 − μA)² + … + wn (xn − μA)²) / (w1 + w2 + … + wn)

     P(A) = |{ i : wi is the largest probability over all clusters }| / n
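A minimal EM sketch in Python for the two-cluster, one-attribute setting on these slides. The data values are a handful taken from the data slide, the starting guesses are arbitrary, and (unlike the counting rule above) this sketch re-estimates P(A) as the average of the weights:

    from math import exp, pi, sqrt

    def pdf(x, mu, sigma):
        return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

    def em(data, mu_a, sig_a, mu_b, sig_b, p_a, iterations=50):
        for _ in range(iterations):
            # Expectation: w[i] = probability that instance i belongs to cluster A.
            w = []
            for x in data:
                a = p_a * pdf(x, mu_a, sig_a)
                b = (1 - p_a) * pdf(x, mu_b, sig_b)
                w.append(a / (a + b))
            # Maximization: re-estimate the parameters from the weights,
            # using the weighted mean and weighted variance formulas above.
            sw = sum(w)
            mu_a = sum(wi * x for wi, x in zip(w, data)) / sw
            sig_a = sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sw)
            mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / (len(data) - sw)
            sig_b = sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data))
                         / (len(data) - sw))
            p_a = sw / len(data)   # simpler than the counting rule on the slide
        return mu_a, sig_a, mu_b, sig_b, p_a

    # A handful of the values from the data slide.
    data = [51, 43, 62, 64, 45, 42, 46, 45, 45, 62, 47, 52, 64, 51, 65, 48, 49, 46]
    print(em(data, 45.0, 5.0, 60.0, 5.0, 0.5))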
Termination
 The EM algorithm converges to a maximum, but
  never gets there
 Continue until overall likelihood growth is
  negligible
     Π_{i=1..n} [ P(A) P(xi | A) + P(B) P(xi | B) ]



 Maximum could be local, so repeat several times
  with different initial values
Extending the Model
   Extending to multiple clusters is straightforward, just use k
    normal distributions
   For multiple attributes, assume independence and multiply
    attribute probabilities as in Naïve Bayes
   For nominal attributes, can’t use normal distribution. Have to
    create probability distributions for the values, one per cluster.
    This gives kv parameters to estimate, where v is the number of
    values for the nominal attribute.
    Can use different distributions depending on the data: e.g., a log-
     normal distribution for attributes that have a minimum value
