Probabilistic Approaches to Data Mining
Craig A. Struble, Ph.D.
Department of Mathematics, Statistics, and Computer Science, Marquette University

Overview
- Introduction to probability
- Bayes' rule
- Naïve Bayesian classification
- Model-based clustering
- Other applications

Probability
- Let P(A) represent the probability that proposition A is true.
- Example: Let Risky represent that a customer is a high credit risk. P(Risky) = 0.519 means that there is a 51.9% chance a given customer is a high credit risk.
- Without any other information, this probability is called the prior or unconditional probability.

Random Variables
- Could also consider a random variable X, which can take on one of many values in its domain <x1, x2, ..., xn>.
- Example: Let Weather be a random variable with domain <sunny, rain, cloudy, snow>. The probabilities of Weather taking on one of these values are:
  P(Weather=sunny) = 0.7
  P(Weather=rain) = 0.2
  P(Weather=cloudy) = 0.08
  P(Weather=snow) = 0.02

Probability Distributions
- The notation P(X) is used to represent the probabilities of all possible values of a random variable.
- Example: P(Weather) = <0.7, 0.2, 0.08, 0.02>. The statement above defines a probability distribution for the random variable Weather.
- The notation P(Weather, Risky) is used to denote the probabilities of all combinations of the two variables, represented by a 4x2 table of probabilities.

Conditional Probability
- Probabilities of events change when we know something about the world.
- The notation P(A|B) is used to represent the conditional or posterior probability of A, read "the probability of A given that all we know is B."
- Example: P(Weather=snow | Temperature=below freezing) = 0.10

Logical Connectives
- We can use logical connectives for probabilities, e.g. P(Weather=snow ∧ Temperature=below freezing). Can use disjunction (or) or negation (not) as well.
- The product rule: P(A ∧ B) = P(A|B)P(B) = P(B|A)P(A)
- Using probability distributions, P(X,Y) = P(X|Y)P(Y), which is equivalent to saying P(X=xi ∧ Y=yj) = P(X=xi|Y=yj)P(Y=yj) for all i and j.

Axioms of Probability
- All probabilities are between 0 and 1: 0 ≤ P(A) ≤ 1
- Necessarily true propositions have probability 1, necessarily false propositions have probability 0: P(true) = 1, P(false) = 0
- The probability of a disjunction is given by P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

Joint Probability Distributions
- Recall P(A,B) represents the probabilities of all possible combinations of assignments to random variables A and B.
- More generally, P(X1, ..., Xn) for random variables X1, ..., Xn is called the joint probability distribution, or joint.

Joint Probability Distributions: Example

                   Risky    ¬Risky
  Weather=sunny    0.30     0.40
  Weather=rain     0.15     0.05
  Weather=cloudy   0.05     0.03
  Weather=snow     0.019    0.001

- P(Weather=sunny) = 0.7 (add up the row)
- P(Risky) = 0.519 (add up the column)
- What about P(Weather=sunny ∨ Risky)?

Bayes' Rule
- Bayes' rule relates conditional probabilities.
- From the product rule, P(A ∧ B) = P(A|B)P(B) and P(A ∧ B) = P(B|A)P(A).
- Bayes' rule: P(B|A) = P(A|B)P(B) / P(A)

Generalizing Bayes' Rule
- For probability distributions: P(B|A) = P(A|B)P(B) / P(A)
- Conditionalized on background evidence E: P(B|A,E) = P(A|B,E)P(B|E) / P(A|E)
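To make the joint-table calculations concrete, here is a minimal Python sketch. The table entries and marginals are taken from the slides; the variable names are illustrative only. It answers the question posed above, P(Weather=sunny ∨ Risky), using the disjunction axiom, and checks Bayes' rule on the same table.

```python
# Joint distribution P(Weather, Risky) from the slide (tuple order: Risky, not Risky).
joint = {
    "sunny":  (0.30, 0.40),
    "rain":   (0.15, 0.05),
    "cloudy": (0.05, 0.03),
    "snow":   (0.019, 0.001),
}

# Marginals: sum across a row or down a column of the joint table.
p_sunny = sum(joint["sunny"])                         # 0.7
p_risky = sum(risky for risky, _ in joint.values())   # 0.519

# Disjunction axiom: P(A or B) = P(A) + P(B) - P(A and B).
p_sunny_or_risky = p_sunny + p_risky - joint["sunny"][0]   # 0.7 + 0.519 - 0.30 = 0.919

# Bayes' rule check: P(Risky | sunny) = P(sunny | Risky) P(Risky) / P(sunny).
p_sunny_given_risky = joint["sunny"][0] / p_risky               # about 0.578
p_risky_given_sunny = p_sunny_given_risky * p_risky / p_sunny   # about 0.429

print(p_sunny_or_risky, p_risky_given_sunny)
```

The conditional P(Weather=sunny | Risky) ≈ 0.578 computed here is the value used in the Bayes' rule example that follows.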
Bayes' Rule Example
- Suppose we want P(Risky | Weather=sunny).
- P(Weather=sunny | Risky) = 0.578, P(Risky) = 0.519, P(Weather=sunny) = 0.7
- Using Bayes' rule we get
  P(Risky | Weather=sunny) = P(Weather=sunny | Risky) P(Risky) / P(Weather=sunny)
  = (0.578 × 0.519) / 0.7 ≈ 0.429

Bayes' Rule Example
- Dishonest casino: a loaded die on which 6 comes up 50% of the time; 1 out of every 100 dice is loaded.
- P(D_fair) = 0.99, P(D_loaded) = 0.01
- Let's say someone rolls three 6's in a row. What's the probability that the die is loaded?
  P(D_loaded | 3 sixes) = P(3 sixes | D_loaded) P(D_loaded) / P(3 sixes)
  = P(3 sixes | D_loaded) P(D_loaded) / [P(3 sixes | D_loaded) P(D_loaded) + P(3 sixes | D_fair) P(D_fair)]
  = (0.5³)(0.01) / [(0.5³)(0.01) + (1/6)³(0.99)] ≈ 0.21

Normalization
- Direct assessment of P(A) may not be possible, but we can use the fact
  P(A) = Σ_i P(A | B=b_i) P(B=b_i)
  to estimate the value.
- Example: P(Weather=sunny) = P(Weather=sunny | Risky) P(Risky) + P(Weather=sunny | ¬Risky) P(¬Risky)

Normalization
- Using the previous fact, Bayes' rule can be written
  P(B | A) = P(A | B) P(B) / Σ_i P(A | B=b_i) P(B=b_i)
- More generally, P(B | A) = α P(A | B) P(B), where α is a constant that makes the probability distribution table for P(B | A) total to 1.

Application to Data Mining
- In data mining, we have data which provides evidence for probabilities.
- Let h be a hypothesis (classification) and D be a data observation.
  P(h) - probability of the classification before seeing the data
  P(D) - probability of obtaining the data sample
  P(h|D) - probability of the classification after observing the data
  P(D|h) - probability that we would observe D given that h is the correct classification

Naïve Bayesian Classifier
- Choose the most likely classification using Bayesian techniques.
- MAP (maximum a posteriori) classification:
  h_MAP = argmax_i P(h_i | D) = argmax_i P(D | h_i) P(h_i) / P(D) = argmax_i P(D | h_i) P(h_i)

Naïve Bayesian Classifier
- ML (maximum likelihood): assume P(h_i) = P(h_j) (classifications are equally likely).
- Choose h_i such that argmax_i P(h_i | D) = argmax_i P(D | h_i)

Naïve Bayesian Classifier
- Generally speaking, D is a vector <d_1, ..., d_k> of values for attributes.
- To simplify things, the attributes of D are assumed to be independent. That means we can write
  P(D | h) = ∏_{i=1}^{k} P(D_i = d_i | h)

Example
[Training data table not reproduced in this extraction: 14 instances described by credit history, debt, collateral, and income, each labeled with a risk level.]

Example
- Let D = <unknown, low, none, 15-35>. Which risk category is D in?
- Three hypotheses: Risk=low, Risk=moderate, Risk=high.
- Because of the naïve assumption, calculate the individual probabilities and then multiply them together, as in the sketch below and as worked out by hand on the next slide.
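A minimal Python sketch of this MAP calculation. The conditional probabilities and priors are the counts estimated from the training data on the next slide; the dictionary keys and variable names are illustrative only.

```python
# Naive Bayes MAP classification for D = <CH=unknown, Debt=low, Coll=none, Inc=15-35>.
# Conditionals and priors are the estimates worked out by hand on the next slide.
cond = {
    "low":      {"CH=unknown": 2/5, "Debt=low": 3/5, "Coll=none": 3/5, "Inc=15-35": 0/5},
    "moderate": {"CH=unknown": 1/3, "Debt=low": 1/3, "Coll=none": 2/3, "Inc=15-35": 2/3},
    "high":     {"CH=unknown": 2/6, "Debt=low": 2/6, "Coll=none": 6/6, "Inc=15-35": 2/6},
}
prior = {"low": 5/14, "moderate": 3/14, "high": 6/14}

D = ["CH=unknown", "Debt=low", "Coll=none", "Inc=15-35"]

# Score each hypothesis: P(D|h) P(h) = P(h) * product of per-attribute conditionals.
scores = {}
for h in prior:
    p = prior[h]
    for attr_value in D:
        p *= cond[h][attr_value]
    scores[h] = p

print(scores)                       # {'low': 0.0, 'moderate': ~0.0106, 'high': ~0.0159}
print(max(scores, key=scores.get))  # 'high' -- the MAP classification
```

Note how the zero count behind P(Inc=15-35 | Risk=low) forces the whole product for Risk=low to zero; this is what motivates the Laplace estimator discussed under Considerations below.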
Example
- Conditional probabilities estimated from the training data:
  P(CH=unknown | Risk=low) = 2/5     P(CH=unknown | Risk=moderate) = 1/3     P(CH=unknown | Risk=high) = 2/6
  P(Debt=low | Risk=low) = 3/5       P(Debt=low | Risk=moderate) = 1/3       P(Debt=low | Risk=high) = 2/6
  P(Coll=none | Risk=low) = 3/5      P(Coll=none | Risk=moderate) = 2/3      P(Coll=none | Risk=high) = 6/6
  P(Inc=15-35 | Risk=low) = 0/5      P(Inc=15-35 | Risk=moderate) = 2/3      P(Inc=15-35 | Risk=high) = 2/6
- Priors: P(Risk=low) = 5/14, P(Risk=moderate) = 3/14, P(Risk=high) = 6/14
- Likelihoods:
  P(D | Risk=low) = 2/5 × 3/5 × 3/5 × 0/5 = 0
  P(D | Risk=moderate) = 1/3 × 1/3 × 2/3 × 2/3 = 4/81 ≈ 0.0494
  P(D | Risk=high) = 2/6 × 2/6 × 6/6 × 2/6 = 48/1296 ≈ 0.037
- Scores:
  P(D | Risk=low) P(Risk=low) = 0 × 5/14 = 0
  P(D | Risk=moderate) P(Risk=moderate) = 4/81 × 3/14 ≈ 0.0106
  P(D | Risk=high) P(Risk=high) = 48/1296 × 6/14 ≈ 0.0159
- Risk=high has the largest score, so D is classified as a high credit risk.

Considerations
- When estimating probabilities from data, you want to avoid 0 probabilities, because they dominate over all other attributes.
- One strategy is to use a Laplace estimator: choose a small constant μ, split it into parts added to the numerators, and add μ to each denominator.
- Example: the estimates 2/5, 3/5, 3/5, 0/5 become
  (2 + μ/4)/(5 + μ),  (3 + μ/4)/(5 + μ),  (3 + μ/4)/(5 + μ),  (0 + μ/4)/(5 + μ)

Considerations
- Missing values are simply removed from the computation.
- For numeric attributes, use a probability density function, such as that of a normal distribution:
  f(x) = 1 / (√(2π) σ) · e^(−(x − μ)² / (2σ²))

Statistical Clustering (COBWEB)
- Fisher's COBWEB is an incremental approach to clustering.
- It creates a classification tree, in which each node describes a concept and gives a probabilistic description of the concept:
  - the prior probability of the concept
  - conditional probabilities for the attributes given that concept

Classification Tree
[Figure of an example classification tree not reproduced in this extraction.]

Algorithm
- Add each data item to the hierarchy one at a time.
- Try placing the data item in each existing node (going level by level), and select a good node by maximizing the category utility
  CU = (1/n) Σ_{k=1..n} P(C_k) [ Σ_i Σ_j P(A_i = V_ij | C_k)² − Σ_i Σ_j P(A_i = V_ij)² ]
  where n is the number of clusters C_k.

Algorithm
- Incorporating a new instance might cause the two best nodes to merge: calculate CU for the merged nodes.
- Alternatively, incorporating a new instance might cause a split: calculate CU for splitting the best node.

Probability-Based Clustering
- Consider clustering data into k clusters.
- Model each cluster with a probability distribution.
- This set of k distributions is called a mixture, and the overall model is a finite mixture model.
- Each probability distribution gives the probability of an instance being in a given cluster.

Probability-Based Clustering
- Simplest case: a single numeric attribute and two clusters A and B, each represented by a normal distribution.
- Parameters for A: μ_A (mean), σ_A (standard deviation)
- Parameters for B: μ_B (mean), σ_B (standard deviation)
- And P(A), P(B) = 1 − P(A), the prior probabilities of being in cluster A and B respectively.

Probability-Based Clustering
- data:
  A 51  B 62  B 64  A 48  A 39  A 51  A 43  A 47  A 51  B 64
  B 62  A 48  B 62  A 52  A 52  A 51  B 64  B 64  B 64  B 64
  B 62  B 63  A 52  A 42  A 45  A 51  A 49  A 43  B 63  A 48
  A 42  B 65  A 48  B 65  B 64  A 41  A 46  A 48  B 62  B 66
  A 48  A 45  A 49  A 43  B 65  B 64  A 45  A 46  A 40  A 46
  A 48
- model: μ_A = 50, σ_A = 5, p_A = 0.6;  μ_B = 65, σ_B = 2, p_B = 0.4
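Given this model, the probability that an instance belongs to cluster A follows from the normal density and Bayes' rule. A minimal Python sketch; the parameter values come from the model above, while the function and variable names are illustrative only.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density f(x) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Mixture model parameters from the slide.
mu_a, sigma_a, p_a = 50.0, 5.0, 0.6
mu_b, sigma_b, p_b = 65.0, 2.0, 0.4

def p_cluster_a(x):
    """P(A | x) via Bayes' rule: P(A) f(x; A) / [P(A) f(x; A) + P(B) f(x; B)]."""
    fa = p_a * normal_pdf(x, mu_a, sigma_a)
    fb = p_b * normal_pdf(x, mu_b, sigma_b)
    return fa / (fa + fb)

print(p_cluster_a(48))  # close to 1: 48 is far more likely under cluster A
print(p_cluster_a(64))  # close to 0: 64 is far more likely under cluster B
```

These membership probabilities are exactly the weights w_i used in the expectation step described next.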
Probability-Based Clustering
- The question is, how do we know the parameters for the mixture: μ_A, σ_A, μ_B, σ_B, P(A)?
- If the data is labeled, this is easy, but clustering is more often used for unlabeled data.
- Use an iterative approach similar in spirit to the k-means algorithm.

Expectation Maximization
- Start with initial guesses for the parameters.
- Calculate cluster probabilities for each instance (Expectation).
- Re-estimate the parameters from those probabilities (Maximization).
- Repeat.

Maximization
- Let w_i be the probability of instance i belonging to cluster A. The recalculated parameters are
  μ_A = (w_1 x_1 + w_2 x_2 + ... + w_n x_n) / (w_1 + w_2 + ... + w_n)
  σ_A² = (w_1 (x_1 − μ_A)² + w_2 (x_2 − μ_A)² + ... + w_n (x_n − μ_A)²) / (w_1 + w_2 + ... + w_n)
  P(A) = |{ i : w_i is the largest probability over all clusters }| / n

Termination
- The EM algorithm converges to a maximum, but never quite gets there.
- Continue until the growth of the overall likelihood
  ∏_{i=1}^{n} [ P(A) P(x_i | A) + P(B) P(x_i | B) ]
  is negligible (in practice its logarithm is monitored).
- The maximum could be local, so repeat several times with different initial values.

Extending the Model
- Extending to multiple clusters is straightforward: just use k normal distributions.
- For multiple attributes, assume independence and multiply the attribute probabilities, as in Naïve Bayes.
- For nominal attributes, we can't use a normal distribution; we have to create a probability distribution over the values, one per cluster. This gives kv parameters to estimate, where v is the number of values for the nominal attribute.
- Different distributions can be used depending on the data, e.g. a log-normal distribution for attributes with a minimum value.
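Pulling the pieces together, here is a minimal Python sketch of the EM loop just described for the two-cluster, one-attribute case. The data values are the ones on the data slide (with their labels dropped, since EM treats them as unlabeled); the initial guesses, the convergence threshold, and all names are illustrative assumptions, not a definitive implementation. The parameter updates follow the Maximization slide, and the termination test monitors the log of the overall likelihood from the Termination slide.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# The 51 attribute values from the data slide, treated as unlabeled.
data = [51, 62, 64, 48, 39, 51, 43, 47, 51, 64, 62, 48, 62, 52, 52, 51, 64, 64, 64, 64,
        62, 63, 52, 42, 45, 51, 49, 43, 63, 48, 42, 65, 48, 65, 64, 41, 46, 48, 62, 66,
        48, 45, 49, 43, 65, 64, 45, 46, 40, 46, 48]

# Arbitrary initial guesses for the parameters (illustrative only).
mu_a, sigma_a, mu_b, sigma_b, p_a = 45.0, 10.0, 60.0, 10.0, 0.5

prev_loglik = -math.inf
for iteration in range(100):
    # Expectation step: w_i = P(A | x_i) for every instance.
    w = []
    for x in data:
        fa = p_a * normal_pdf(x, mu_a, sigma_a)
        fb = (1 - p_a) * normal_pdf(x, mu_b, sigma_b)
        w.append(fa / (fa + fb))

    # Maximization step: re-estimate the parameters from the weights.
    sw_a = sum(w)
    sw_b = len(data) - sw_a
    mu_a = sum(wi * x for wi, x in zip(w, data)) / sw_a
    sigma_a = math.sqrt(sum(wi * (x - mu_a) ** 2 for wi, x in zip(w, data)) / sw_a)
    mu_b = sum((1 - wi) * x for wi, x in zip(w, data)) / sw_b
    sigma_b = math.sqrt(sum((1 - wi) * (x - mu_b) ** 2 for wi, x in zip(w, data)) / sw_b)
    # Prior update as stated on the Maximization slide: the fraction of instances
    # for which cluster A has the largest membership probability.
    p_a = sum(1 for wi in w if wi > 0.5) / len(data)

    # Termination: log of the overall likelihood prod_i [P(A)P(x_i|A) + P(B)P(x_i|B)];
    # stop when its growth is negligible.
    loglik = sum(math.log(p_a * normal_pdf(x, mu_a, sigma_a)
                          + (1 - p_a) * normal_pdf(x, mu_b, sigma_b)) for x in data)
    if loglik - prev_loglik < 1e-6:
        break
    prev_loglik = loglik

# Should recover parameters close to the two-cluster model shown on the data slide.
print(mu_a, sigma_a, mu_b, sigma_b, p_a)
```

Because the likelihood surface can have local maxima, the run would normally be repeated with several different initial guesses, keeping the result with the highest final likelihood.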