POL 602
Prof. Matthew Lebo
Week 6 – October 6th, 2005

Multivariate Probability Distributions

Often we are interested in the intersection of two or more events, or the outcomes of two or more random variables at the same time. For example, the probability that the Democrats would win the White House against incumbent G.W. Bush and the probability that the Democrats would take back control of the House in the same 2004 election.

First, let us consider the discrete, bivariate case. If Y1 and Y2 are discrete random variables, their joint probability distribution is given by

  p(y1, y2) = P(Y1 = y1, Y2 = y2),   −∞ < y1 < ∞, −∞ < y2 < ∞

where the function p(y1, y2) is often called the joint probability function. As in the case of a distribution of a single discrete variable,

  p(y1, y2) ≥ 0 for all y1, y2
  Σ_{y1, y2} p(y1, y2) = 1

For example, suppose two six-sided dice are rolled. This can be viewed as a single experiment, in which the outcome is distributed with the following discrete bivariate distribution:

  p(y1, y2) = P(Y1 = y1, Y2 = y2) = 1/36,   yk = 1, 2, ..., 6

[Figure: the joint probability function plotted over the 36 outcomes, each bar of height 1/36 ≈ .0278]

It is also possible to define a joint (or bivariate) distribution function for any bivariate probability function:

  F(y1, y2) = P(Y1 ≤ y1, Y2 ≤ y2),   −∞ ≤ yk ≤ ∞

In the discrete case, F has the form

  F(a, b) = Σ_{y1 ≤ a} Σ_{y2 ≤ b} p(y1, y2)

Example: In the same die-tossing experiment (two six-sided dice), what is the probability of getting a two or less on the first die and a three or less on the second die?

  F(2, 3) = P(Y1 ≤ 2, Y2 ≤ 3) = p(1,1) + p(1,2) + p(1,3) + p(2,1) + p(2,2) + p(2,3) = 6/36

[Figure: the corresponding volume under the joint probability function]

In the case of two continuous random variables, the joint distribution function is given by

  F(y1, y2) = ∫_{−∞}^{y1} ∫_{−∞}^{y2} f(t1, t2) dt2 dt1,   −∞ ≤ yk ≤ ∞

where the function f(y1, y2) is called the joint probability density function.
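The die-tossing example can be checked by brute-force enumeration. The following Python sketch (not part of the original notes; the names `p` and `F` are mine) builds the joint p.f. for two fair dice and evaluates F(2, 3):

```python
from fractions import Fraction

# Joint p.f. for two fair six-sided dice: every (y1, y2) pair has
# probability 1/36, and the 36 probabilities sum to 1.
p = {(y1, y2): Fraction(1, 36) for y1 in range(1, 7) for y2 in range(1, 7)}
assert sum(p.values()) == 1

# Joint distribution function F(a, b) = P(Y1 <= a, Y2 <= b),
# computed by summing the joint p.f. over the corner region.
def F(a, b):
    return sum(prob for (y1, y2), prob in p.items() if y1 <= a and y2 <= b)

print(F(2, 3))  # 1/6, i.e. 6/36
```

Using exact fractions avoids any floating-point rounding in the probabilities.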
So, as in the case of univariate distributions, the integral of the density function is the distribution function. The same constraints apply:

  f(y1, y2) ≥ 0 for all y1, y2
  ∫_{−∞}^{∞} ∫_{−∞}^{∞} f(y1, y2) dy1 dy2 = 1

Note that in the bivariate case we are, in effect, finding a volume under a surface in 3-space, versus finding an area under a single curve in the univariate case.

Example (modified from 5.3 in the text): Suppose that a legislator is going to introduce a bill in two policy dimensions (e.g., educational spending and military spending). The bill is randomly located in a two-dimensional policy space of unit length (i.e., each dimension is an element of [0,1]) and the two dimensions are independent:

  f(y1, y2) = 1 for 0 ≤ y1 ≤ 1, 0 ≤ y2 ≤ 1; 0 elsewhere

[Figure: the uniform joint density, a flat surface of height 1 over the unit square]

Find P(.1 ≤ Y1 ≤ .3, 0 ≤ Y2 ≤ .5):

  ∫_0^{.5} ∫_{.1}^{.3} f(y1, y2) dy1 dy2 = ∫_0^{.5} ∫_{.1}^{.3} 1 dy1 dy2 = ∫_0^{.5} [y1]_{.1}^{.3} dy2
                                         = ∫_0^{.5} .2 dy2 = [.2 y2]_0^{.5} = .10

which corresponds to the volume of the box with base .2 × .5 and height 1.

Another example: Let X and Y have the joint p.d.f.

  f(x, y) = (3/2) x² (1 − y),   −1 < x < 1, −1 < y < 1

Let A = {(x, y) : 0 < x < 1, 0 < y < x}. The probability that (X, Y) falls in A is given by:

  P[(X, Y) ∈ A] = ∫_0^1 ∫_0^x (3/2) x² (1 − y) dy dx = ∫_0^1 (3/2) x² [y − y²/2]_0^x dx
                = ∫_0^1 (3/2)(x³ − x⁴/2) dx
                = (3/2) [x⁴/4 − x⁵/10]_0^1
                = 9/40

It should be noted that all of the definitions and operations described so far for the bivariate case can be extended to the multivariate case (i.e., Y1, Y2, ..., Yn). Solutions of multiple integrals for more than 4 or 5 variables become computationally difficult, however.

Marginal and Conditional Probabilities

Return to the die-rolling experiment yet again, where Y1 is the number rolled on the first die and Y2 is the number rolled on the second. What are the probabilities associated with Y1 alone?
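The 9/40 result can be verified numerically. Here is a short Python sketch (my own, not from the notes) that approximates the double integral over the triangular region A with a midpoint Riemann sum:

```python
# Approximate P[(X, Y) in A] = integral from 0 to 1, and 0 to x,
# of (3/2) x^2 (1 - y) dy dx. The exact answer is 9/40 = .225.

def f(x, y):
    return 1.5 * x**2 * (1 - y)

n = 400  # grid resolution (an arbitrary choice)
dx = 1.0 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * dx   # midpoint of the x cell
    dy = x / n           # the inner region runs from y = 0 to y = x
    for j in range(n):
        y = (j + 0.5) * dy
        total += f(x, y) * dx * dy

print(round(total, 4))  # close to 0.225
```

The midpoint rule is exact in y here (the integrand is linear in y), so the only error comes from the outer discretization, which shrinks as n grows.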
  P(Y1 = 1) = p(1,1) + p(1,2) + p(1,3) + p(1,4) + p(1,5) + p(1,6) = 6/36 = 1/6

and likewise for P(Y1 = 2), etc. In summation notation, the probabilities for Y1 alone are:

  P(Y1 = y1) = p1(y1) = Σ_{y2=1}^{6} p(y1, y2)

This distribution of a single variable which is part of a multivariate distribution is known as a marginal probability function (in the discrete case) or a marginal density function (in the case of continuous random variables). In general terms, for the bivariate case, the discrete marginal p.f. is given by

  p1(y1) = Σ_{y2} p(y1, y2)   (and vice versa for y2)

and for the continuous case by

  f1(y1) = ∫_{−∞}^{∞} f(y1, y2) dy2   (and vice versa for y2).

The term "marginal" comes from the discrete case in which, if probabilities are presented in a table, the margins of the table show the probability functions of the individual variables.

Example: The following table (from example 5.5 in the text) shows the joint probability function for two random variables: Y1 = the number of Republicans and Y2 = the number of Democrats on a two-person committee chosen from 2 Democrats, 3 Republicans, and 1 Independent. (Note: leave table on board for the next section's example.)

                         Republicans (Y1)
                      0       1       2     Total
  Democrats (Y2)  0   0      3/15    3/15    6/15
                  1  2/15    6/15     0      8/15
                  2  1/15     0       0      1/15
  Total              3/15    9/15    3/15     1

Note that each of the margins sums to 1. In this case the marginal distribution of Democrats is given by the row totals, and that of Republicans by the column totals, i.e.:

  p1(Republicans = 0) = 3/15     p2(Democrats = 0) = 6/15
  p1(Republicans = 1) = 9/15     p2(Democrats = 1) = 8/15
  p1(Republicans = 2) = 3/15     p2(Democrats = 2) = 1/15

The concept of marginal probability distributions leads directly to a related concept: conditional probability distributions. Recall from the multiplicative law that

  P(A ∩ B) = P(A) P(B | A)

We can think of a bivariate event (y1, y2), i.e. P(Y1 = y1) and P(Y2 = y2), as the intersection of two univariate events.
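Summing a joint table down to its margins is mechanical, which a short Python sketch makes concrete (this is my own illustration; the dictionary layout and the helper `marginal` are not from the notes):

```python
from fractions import Fraction

# Joint p.f. for the committee example: keys are (republicans, democrats),
# values are P(Y1 = y1, Y2 = y2). Zero cells are omitted.
joint = {
    (1, 0): Fraction(3, 15), (2, 0): Fraction(3, 15),
    (0, 1): Fraction(2, 15), (1, 1): Fraction(6, 15),
    (0, 2): Fraction(1, 15),
}

def marginal(joint, axis):
    """Sum the joint p.f. over the other variable (axis 0 = Y1, axis 1 = Y2)."""
    out = {}
    for cell, prob in joint.items():
        out[cell[axis]] = out.get(cell[axis], Fraction(0)) + prob
    return out

p1 = marginal(joint, 0)  # Republicans: {0: 3/15, 1: 9/15, 2: 3/15}
p2 = marginal(joint, 1)  # Democrats:   {0: 6/15, 1: 8/15, 2: 1/15}
print(p1[1], p2[1])      # 3/5 8/15
```

Note that Fraction reduces 9/15 to 3/5 automatically; both margins sum to 1, as they must.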
If so, we can thus use the multiplicative law to state:

  p(y1, y2) = p1(y1) p(y2 | y1) = p2(y2) p(y1 | y2)

We can then rearrange one of these equations to calculate, for the discrete case, conditional discrete probability functions and, for the continuous case, conditional distribution functions. Each point in the discrete case is given by:

  p(y1 | y2) = P(Y1 = y1 | Y2 = y2) = P(Y1 = y1, Y2 = y2) / P(Y2 = y2) = p(y1, y2) / p2(y2)

Example (5.7): Returning to the table given in the previous example, what is the conditional distribution for the number of Republicans on the committee, given that the number of Democrats is 1? The table gives joint probabilities. To find the conditional distribution, we need to calculate the probabilities of three outcomes:

  P(Y1 = 0 | Y2 = 1) = p(0,1) / p2(1) = (2/15) / (8/15) = 1/4
  P(Y1 = 1 | Y2 = 1) = p(1,1) / p2(1) = (6/15) / (8/15) = 3/4
  P(Y1 = 2 | Y2 = 1) = p(2,1) / p2(1) = 0 / (8/15) = 0

These three numbers, 1/4, 3/4, 0, give us the conditional probability distribution.

For continuous random variables, we define a conditional distribution function:

  F(y1 | y2) = P(Y1 ≤ y1 | Y2 = y2) = ∫_{−∞}^{y1} f(t1, y2) / f2(y2) dt1

We can thus write a conditional density function,

  f(y1 | y2) = f(y1, y2) / f2(y2)   or   f(y2 | y1) = f(y1, y2) / f1(y1)

Example (5.8, p. 228): A soda machine has a random amount Y2 gallons of soda at the beginning of the day and dispenses Y1 gallons over the course of the day (which must be less than or equal to Y2). The two variables have the following joint density:

  f(y1, y2) = 1/2 for 0 ≤ y1 ≤ y2 ≤ 2; 0 elsewhere

Find the conditional density of Y1 given Y2 = y2, and the probability that less than 1/2 gallon will be sold if the machine has 1.5 gallons at the start of the day.
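The committee calculation in example 5.7 amounts to dividing one row of the table by its row total. A Python sketch of that step (my own illustration; the helper name `conditional_y1` is not from the notes):

```python
from fractions import Fraction

# Conditional p.f. p(y1 | y2) = p(y1, y2) / p2(y2) for the committee table.
joint = {
    (1, 0): Fraction(3, 15), (2, 0): Fraction(3, 15),
    (0, 1): Fraction(2, 15), (1, 1): Fraction(6, 15),
    (0, 2): Fraction(1, 15),
}

def conditional_y1(joint, y2):
    # Marginal p2(y2): sum the joint p.f. across the row.
    p2 = sum(prob for (a, b), prob in joint.items() if b == y2)
    return {y1: joint.get((y1, y2), Fraction(0)) / p2 for y1 in (0, 1, 2)}

cond = conditional_y1(joint, 1)
print(cond)  # {0: Fraction(1, 4), 1: Fraction(3, 4), 2: Fraction(0, 1)}
```

The three conditional probabilities sum to 1, as any probability distribution must.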
The marginal density of Y2 is given by:

  f2(y2) = ∫_{−∞}^{∞} f(y1, y2) dy1 = ∫_0^{y2} (1/2) dy1 = (1/2) y2 for 0 ≤ y2 ≤ 2; 0 elsewhere

so, using the definition of a conditional density function,

  f(y1 | y2) = f(y1, y2) / f2(y2) = (1/2) / ((1/2) y2) = 1/y2,   0 ≤ y1 ≤ y2

And, evaluating:

  P(Y1 ≤ 1/2 | Y2 = 3/2) = ∫_{−∞}^{1/2} f(y1 | y2 = 1.5) dy1 = ∫_0^{1/2} (2/3) dy1 = [(2/3) y1]_0^{1/2} = 1/3

Note that ∫_0^{1/2} (2/3) dy1 comes from substituting 1.5 for y2.

Next, two topics to discuss very briefly: independent random variables and expected values of functions of more than one random variable.

First, if Y1 and Y2 have joint distribution function F(y1, y2), then Y1 and Y2 are said to be independent if and only if:

  F(y1, y2) = F1(y1) F2(y2)

Thus, for discrete random variables with a joint probability function, Y1 and Y2 are independent iff

  p(y1, y2) = p1(y1) p2(y2)

and for two continuous random variables with a joint density function, iff

  f(y1, y2) = f1(y1) f2(y2)

These definitions can be extended (as can everything in this section) to an arbitrary number of random variables.

I also want to briefly present the definitions of expected value for a function of more than one random variable. We will not perform calculations of this nature in this class, but you may need to someday in your own work. If g(Y1, Y2, ..., Yk) is a function of k random variables, then for the discrete case:

  E[g(Y1, Y2, ..., Yk)] = Σ_{yk} ... Σ_{y2} Σ_{y1} g(y1, y2, ..., yk) p(y1, y2, ..., yk)

Likewise, for the continuous case,

  E[g(Y1, Y2, ..., Yk)] = ∫_{yk} ... ∫_{y2} ∫_{y1} g(y1, y2, ..., yk) f(y1, y2, ..., yk) dy1 dy2 ... dyk

The examples in the book are fairly straightforward.

Covariance and Correlation

The last topics I want to cover in the area of multivariate distributions are the related ideas of covariance and correlation.
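The soda-machine answer can also be checked by simulation. Since the joint density is constant (1/2) over the triangle 0 ≤ y1 ≤ y2 ≤ 2, we can draw uniformly from that triangle by rejection sampling and condition on Y2 falling in a narrow window around 1.5 (a stand-in for conditioning on Y2 = 1.5 exactly). This is a Monte Carlo sketch of my own, not a method from the notes:

```python
import random

random.seed(1)

# Draw (Y1, Y2) uniformly over the triangle 0 <= y1 <= y2 <= 2 by
# rejection from the square [0, 2] x [0, 2].
def draw():
    while True:
        y1, y2 = random.uniform(0, 2), random.uniform(0, 2)
        if y1 <= y2:
            return y1, y2

# Estimate P(Y1 <= 1/2 | Y2 near 1.5); the exact answer is 1/3.
n, hits, sold_half = 200_000, 0, 0
for _ in range(n):
    y1, y2 = draw()
    if 1.45 < y2 < 1.55:
        hits += 1
        if y1 <= 0.5:
            sold_half += 1

print(sold_half / hits)  # an estimate close to 1/3
```

The window width (here ±.05) trades bias against Monte Carlo noise: a narrower window is truer to "Y2 = 1.5" but keeps fewer draws.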
When we think of the dependence of two (or more, but it can be hard to think of more) random variables, it is natural to think of what is happening to the values of one variable, Y1, while another, Y2, is changing. For example:

[Figure: "Spending Vs. Poll Increases", a scatterplot of campaign spending ($M) against increase at the polls (%), showing an upward-sloping cloud of points]

This figure shows two random variables that appear to have a dependent relationship. As campaign spending increases, for example, the percentage of those polled who support the candidate also tends to increase. If we calculate the expected values (means) of the two variables in this case, we can then calculate the deviations from the means for each observed value of each variable:

  Spending        Poll Increase    (Spending_i − µ1)   (Poll Increase_i − µ2)
  (mean = 10.3)   (mean = 6.7)
     1               2                 −9.3                 −4.7
     3               2                 −7.3                 −4.7
     5               3                 −5.3                 −3.7
     7               4                 −3.3                 −2.7
     9               6                 −1.3                 −0.7
    11               7                  0.7                  0.3
    13               8                  2.7                  1.3
    15              10                  4.7                  3.3
    18              11                  7.7                  4.3
    21              14                 10.7                  7.3

We can then take the product of the two deviations, (y1 − µ1)(y2 − µ2), for each of the observed pairs of data:

  Products of deviations: 43.71, 34.31, 19.61, 8.91, 0.91, 0.21, 3.51, 15.51, 33.11, 78.11

Taking the average of these products of deviations gives us a single number that characterizes the relationship between the two random variables. In this case, that number is 23.79, which is positive, reflecting the fact that as spending increases, success at the polls tends to increase (and vice versa). This number is called the covariance and is defined:

  Cov(Y1, Y2) = E[(Y1 − µ1)(Y2 − µ2)]

By manipulating this equation, we can derive a useful computational formula:

  Cov(Y1, Y2) = E[(Y1 − µ1)(Y2 − µ2)]
              = E(Y1Y2 − µ1Y2 − µ2Y1 + µ1µ2)
              = E(Y1Y2) − µ1E(Y2) − µ2E(Y1) + µ1µ2
              = E(Y1Y2) − µ1µ2 − µ2µ1 + µ1µ2
              = E(Y1Y2) − µ1µ2

Or,

  Cov(Y1, Y2) = E(Y1Y2) − E(Y1)E(Y2)

The larger the absolute value of the covariance, the greater the linear dependence between the variables.
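The spending example can be reproduced in a few lines. This Python sketch (mine, not from the notes) computes the covariance both as the average product of deviations and via the computational formula E(Y1Y2) − µ1µ2, which gives the same 23.79:

```python
# Covariance of the spending / poll-increase data from the table above.
spending = [1, 3, 5, 7, 9, 11, 13, 15, 18, 21]
polls = [2, 2, 3, 4, 6, 7, 8, 10, 11, 14]
n = len(spending)

mu1 = sum(spending) / n   # 10.3
mu2 = sum(polls) / n      # 6.7

# Definition: average product of deviations from the means.
cov = sum((x - mu1) * (y - mu2) for x, y in zip(spending, polls)) / n

# Computational formula: E(Y1*Y2) - mu1*mu2.
cov2 = sum(x * y for x, y in zip(spending, polls)) / n - mu1 * mu2

print(round(cov, 2), round(cov2, 2))  # 23.79 23.79
```

(This is the population covariance, dividing by n; sample software often divides by n − 1 instead.)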
A positive value indicates that both variables "move the same way." A negative value indicates the opposite (one variable increases while the other decreases). Note that if two random variables are independent, their covariance is zero. However, the converse is not necessarily true: if two variables have zero covariance, they are not necessarily independent (example 5.24 in the book illustrates this).

A problem with covariance as a measure is that it is not scale invariant; that is, the value computed for the covariance depends on the scales of measurement used. The typical solution in statistics (and elsewhere) for this problem is to normalize the quantity: divide it by some other quantity that will allow for scale invariance. One way to do this is to divide the covariance by the product of the standard deviations of the variables. This yields the coefficient of correlation:

  ρ = Cov(Y1, Y2) / (σ1 σ2)

which returns a value between −1 and 1, where −1 implies perfect negative correlation and 1 implies perfect positive correlation (all points falling on a straight line with positive slope).

For an example of correlation, let us turn to some real data for the first time. The following is a graph of data from the 1996 American National Election Study:

[Figure: scatterplot of respondents' Clinton and Dole feeling-thermometer scores]

In this scatterplot of 1,516 real observations of people's reported opinions about Clinton and Dole (on a 100-point "feeling thermometer"), it is hard to see clearly what the relationship between the variables is. What do we expect it to be? What is a reasonable correlation coefficient? We find the actual correlation to be −.3529 in this case. In other words, as affect toward Clinton increases, affect toward Dole tends to decrease, but far from perfectly.

Pearson's r, a.k.a. the correlation coefficient, varies between −1 and +1. It tells us the strength and direction of a relationship.
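To see the normalization at work, we can divide the covariance of the spending data by the product of the two standard deviations. A Python sketch (my own; it reuses the spending data from the covariance example):

```python
import math

# Correlation for the spending / poll-increase data: the covariance
# divided by the product of the two standard deviations.
spending = [1, 3, 5, 7, 9, 11, 13, 15, 18, 21]
polls = [2, 2, 3, 4, 6, 7, 8, 10, 11, 14]
n = len(spending)

mu1, mu2 = sum(spending) / n, sum(polls) / n
cov = sum((x - mu1) * (y - mu2) for x, y in zip(spending, polls)) / n
sd1 = math.sqrt(sum((x - mu1) ** 2 for x in spending) / n)
sd2 = math.sqrt(sum((y - mu2) ** 2 for y in polls) / n)

rho = cov / (sd1 * sd2)
print(round(rho, 2))  # 0.99
```

The raw covariance (23.79) depends on the units of both variables, but ρ ≈ .99 says directly that the points lie almost exactly on an upward-sloping line.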
We look at how far each observation is from the mean on the two variables together, and compare this to how far the observations vary on each variable separately. Later we will learn R², the coefficient of determination.

How much X moves around its mean is given by the sum of squares for X:

  SSxx = Σ_{i=1}^{n} (Xi − X̄)²

Likewise, how much Y moves around its mean is given by the sum of squares for Y:

  SSyy = Σ_{i=1}^{n} (Yi − Ȳ)²

How much X and Y are moving together is given by the cross-product of X and Y:

  SSxy = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ)

Looking at the cross-product: for each observation we see how far its scores are from the means on X and Y and multiply them together. Where scores on the two variables are both above the mean, this gives a positive number, indicating a positive relationship. The same is true where both are below the mean. When one score is above the mean and the other is below the mean, multiplying them gives a negative number, indicating a negative relationship.

The statistic r tells us how much the two variables are moving together out of the total that they are moving separately:

  r = SSxy / √(SSxx · SSyy)

which expands to:

  r = Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) / √( Σ_{i=1}^{n} (Xi − X̄)² · Σ_{i=1}^{n} (Yi − Ȳ)² )

Example: What is the correlation between hours studied (X) and percentage scored on an exam (Y)?

  case    X    Xi−X̄   (Xi−X̄)²    Y     Yi−Ȳ   (Yi−Ȳ)²   (Xi−X̄)(Yi−Ȳ)
   A      2     −1        1       60     −20      400          20
   B      6      3        9       90      10      100          30
   C      1     −2        4       65     −15      225          30
   D      3      0        0       90      10      100           0
   E      2     −1        1       80       0        0           0
   F      4      1        1       95      15      225          15
  Sum    18              16      480             1050          95

With X̄ = 3 and Ȳ = 80, we have Σ(Xi − X̄)² = 16, Σ(Yi − Ȳ)² = 1050, and Σ(Xi − X̄)(Yi − Ȳ) = 95, so we get

  r = 95 / √(16 · 1050) = 95 / 129.61 = .73

This tells us there is a pretty strong positive relationship between the two variables. r will vary between −1 and 1. r-square will tell us the percentage of variance explained; it will always be positive and between 0 and 1. SSxx, SSyy, and SSxy will stay important to us.
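The table's arithmetic can be verified directly. A short Python sketch (mine, not from the notes) that computes the three sums of squares and r for the hours-studied example:

```python
import math

# Pearson's r for the hours-studied / exam-score example above.
x = [2, 6, 1, 3, 2, 4]
y = [60, 90, 65, 90, 80, 95]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n        # 3.0 and 80.0

ss_xx = sum((xi - xbar) ** 2 for xi in x)                       # 16
ss_yy = sum((yi - ybar) ** 2 for yi in y)                       # 1050
ss_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))  # 95

r = ss_xy / math.sqrt(ss_xx * ss_yy)
print(round(r, 2))  # 0.73
```

The same three quantities reappear later in regression: r² = SSxy² / (SSxx · SSyy) is the coefficient of determination.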
The Multivariate Normal Distribution

I want to conclude this section with an example of a commonly encountered multivariate continuous distribution: the multivariate normal. The easiest way to describe this distribution is with matrix notation; I will try to indicate vectors and matrices using "boldface" chalk writing.

We say that a vector y of k variables is jointly distributed multivariate normal with a vector of means

  µ = [µ1, µ2, ..., µk]′

and a variance-covariance matrix

      ⎡ σ1²  σ12  ...  σ1k ⎤
  Σ = ⎢ σ12  σ2²  ...  σ2k ⎥
      ⎢ ...  ...  ...  ... ⎥
      ⎣ σ1k  σ2k  ...  σk² ⎦

(where the main diagonal gives the variances and the off-diagonals are symmetric covariances) if they are distributed such that:

  f(y) = (2π)^(−k/2) |Σ|^(−1/2) exp[ −(1/2)(y − µ)′ Σ⁻¹ (y − µ) ]

This may look confusing, but it is actually just a simple extension of the univariate normal. In two dimensions, the so-called bivariate normal, we are concerned with two means:

  µ = [µ1, µ2]′

and a 2×2 variance-covariance matrix:

      ⎡ σ1²  σ12 ⎤
  Σ = ⎣ σ12  σ2² ⎦

where σ12 is the covariance between the two variables (often parameterized through the correlation, since σ12 = ρσ1σ2).

This distribution is fairly easy to visualize. Here is an example of a bivariate normal with means (0, 0), variances (1, 1), and covariance .5:

[Figure: surface plot of the bivariate normal density over (y1, y2)]

Note also that we can apply the idea of conditional and marginal distributions. For example, it is often much easier (computationally) to consider:

  f(y1 | µ1, µ2, σ1², σ2², ρ) ~ Normal

Think of this as taking a "slice" out of the picture above: conditional on all of those parameters, the distribution of y1 is univariate normal (and so is the marginal).

Homework Problems: 5.1, 5.2, 5.3, 5.7, 5.9, 5.11, 5.15, 5.17, 5.20, 5.21, 5.23, 5.30, 5.37, 5.39, 5.42, 5.43, 5.44, 5.48, 5.58*, 5.75, 5.77, 5.79

It's OK to just work your way through the solutions to the double-integral problems.
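The pictured distribution is easy to simulate. For the bivariate case with unit variances, the covariance equals the correlation ρ, and the standard construction Y1 = Z1, Y2 = ρZ1 + √(1 − ρ²)Z2 with independent N(0,1) draws produces exactly the right joint distribution. A Python sketch of my own (not from the notes):

```python
import math
import random

random.seed(7)
rho = 0.5  # covariance = correlation here, since both variances are 1

# Sample a pair from a bivariate normal with means (0, 0),
# variances (1, 1), and covariance rho.
def draw_pair():
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    return z1, rho * z1 + math.sqrt(1 - rho**2) * z2

pairs = [draw_pair() for _ in range(100_000)]

# The sample covariance should be close to .5.
m1 = sum(a for a, b in pairs) / len(pairs)
m2 = sum(b for a, b in pairs) / len(pairs)
cov = sum((a - m1) * (b - m2) for a, b in pairs) / len(pairs)
print(round(cov, 2))  # close to 0.5

# Height of the density at the mean, from the bivariate normal formula
# with mu = 0 and |Sigma| = 1 - rho^2: the peak of the surface plot.
peak = 1 / (2 * math.pi * math.sqrt(1 - rho**2))
```

This two-variable construction is the k = 2 case of multiplying independent standard normals by a Cholesky factor of Σ, which is how multivariate normal draws are generated in general.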