POL 602
Prof. Matthew Lebo
Week 6 – October 6th, 2005

Multivariate Probability Distributions

Often we are interested in the intersection of two or more events, or the outcome of two
or more random variables at the same time.

For example, consider the probability that the Democrats would win against incumbent G.W. Bush for the White House together with the probability that the Democrats would take back control of the House in the same 2004 election.

First, let us consider the discrete, bivariate case. If Y1 and Y2 are discrete random
variables, their joint probability distribution is given by:

p(y1, y2) = P(Y1 = y1, Y2 = y2),   −∞ < y1 < ∞, −∞ < y2 < ∞

where the function p(y1,y2) is often called the joint probability function.

As in the case of a distribution of a single discrete variable,

p(y1, y2) ≥ 0   ∀ y1, y2

∑_{y1, y2} p(y1, y2) = 1

For example, two six-sided dice are rolled. This can be viewed as a single experiment, in
which the outcome is distributed with the following discrete bivariate distribution:

p(y1, y2) = P(Y1 = y1, Y2 = y2) = 1/36,   yk = 1, 2, ..., 6

Graphically: (where 1/36 is approximately equal to .0278)

It is also possible to define a joint (or bivariate) distribution function for any bivariate
probability function,

F ( y1 , y2 ) = P(Y1 ≤ y1 , Y2 ≤ y2 ), − ∞ ≤ yk ≤ ∞

In the discrete case, F has the form:
a       b
F ( a, b) =   ∑ ∑
y1 =−∞ y2 =−∞
p( y1 , y2 )

Example: In the same die-tossing experiment (two six-sided dice), what is the probability
of getting a two or less on the first die, and a three or less on the second die?

F(2, 3) = P(Y1 ≤ 2, Y2 ≤ 3) = p(1,1) + p(1,2) + p(1,3) + p(2,1) + p(2,2) + p(2,3) = 6/36
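Results like F(2, 3) = 6/36 are easy to confirm by brute-force enumeration. A quick Python sketch (mine, not part of the original notes) builds the joint p.f. of the two dice and sums it over the event:

```python
from fractions import Fraction

# Joint p.f. of two fair six-sided dice: each (y1, y2) pair has probability 1/36.
p = {(y1, y2): Fraction(1, 36) for y1 in range(1, 7) for y2 in range(1, 7)}

# The joint probabilities must sum to 1.
assert sum(p.values()) == 1

# F(2, 3) = P(Y1 <= 2, Y2 <= 3): sum p(y1, y2) over y1 <= 2 and y2 <= 3.
F_2_3 = sum(prob for (y1, y2), prob in p.items() if y1 <= 2 and y2 <= 3)
print(F_2_3)  # 1/6, i.e. 6/36
```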

Graphically, this corresponds to the following volume:

In the case of two continuous random variables, the joint distribution function is given
by:
F(y1, y2) = ∫_{−∞}^{y1} ∫_{−∞}^{y2} f(t1, t2) dt2 dt1,   −∞ ≤ yk ≤ ∞

where the function f ( y1 , y2 ) is called the joint probability density function.

So, as in the case of the univariate distributions, the integral of the density function is the
distribution function. The same constraints apply:

f(y1, y2) ≥ 0   ∀ y1, y2

∫_{−∞}^{∞} ∫_{−∞}^{∞} f(y1, y2) dy1 dy2 = 1

Note that in the bivariate case we are, in effect, finding a volume under a surface in 3-space, vs. finding an area under a single curve in the univariate case.

Example (modified from 5.3 in the text): Suppose that a legislator is going to introduce a
bill in two policy dimensions (e.g. educational spending and military spending). The bill
is randomly located in a two dimensional policy space of unit length (i.e. each dimension
is an element of [0,1]) and the two dimensions are independent. Graphically, this function
looks like:

Find P(.1 ≤ Y1 ≤ .3, 0 ≤ Y2 ≤ .5):

f(y1, y2) = 1 for 0 ≤ y1 ≤ 1, 0 ≤ y2 ≤ 1, and 0 elsewhere

∫_0^{.5} ∫_{.1}^{.3} f(y1, y2) dy1 dy2 = ∫_0^{.5} ∫_{.1}^{.3} 1 dy1 dy2 = ∫_0^{.5} [y1]_{.1}^{.3} dy2

= ∫_0^{.5} .2 dy2 = [.2 y2]_0^{.5} = .10

This corresponds to the volume of the box with base .2 × .5 and height 1.
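Because the density is uniform on the unit square, the answer is simply the area of the rectangle. A quick Monte Carlo sketch (mine, not from the notes) confirms it:

```python
import random

random.seed(42)  # fixed seed so the estimate is reproducible

# Monte Carlo estimate of P(.1 <= Y1 <= .3, 0 <= Y2 <= .5) for the uniform
# density on the unit square: sample points, count those inside the rectangle.
n = 200_000
hits = 0
for _ in range(n):
    y1, y2 = random.random(), random.random()
    if 0.1 <= y1 <= 0.3 and y2 <= 0.5:
        hits += 1

estimate = hits / n
print(estimate)  # close to the exact answer .10
```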

Another example:

Let X and Y have the joint p.d.f.

f(x, y) = (3/2) x² (1 − y),   −1 < x < 1, −1 < y < 1

Let A = {( x, y ) : 0 < x < 1, 0 < y < x} .

The probability that (X,Y) falls in A is given by:

P[(X, Y) ∈ A] = ∫_0^1 ∫_0^x (3/2) x² (1 − y) dy dx = ∫_0^1 (3/2) x² [y − y²/2]_0^x dx

= ∫_0^1 (3/2)(x³ − x⁴/2) dx

= (3/2) [x⁴/4 − x⁵/10]_0^1

= 9/40
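The 9/40 answer can be checked numerically; here is a midpoint-Riemann-sum sketch (an illustration of mine, with the grid resolution n chosen arbitrarily):

```python
# Approximate P[(X, Y) in A] for f(x, y) = (3/2) x^2 (1 - y) over
# A = {(x, y): 0 < x < 1, 0 < y < x} with a midpoint Riemann sum.
n = 400            # grid cells per axis (arbitrary resolution)
h = 1.0 / n
total = 0.0
for i in range(n):
    x = (i + 0.5) * h          # midpoint of the i-th x cell
    for j in range(n):
        y = (j + 0.5) * h      # midpoint of the j-th y cell
        if y < x:              # keep only cells inside A
            total += 1.5 * x * x * (1.0 - y) * h * h
print(total)  # approximately 9/40 = 0.225
```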

It should be noted that all of the definitions and operations described so far for the bivariate case can be extended to the multivariate case (i.e. Y1, Y2, ..., Yn). Solving multiple integrals in more than 4 or 5 variables, however, becomes computationally difficult.

Marginal and Conditional Probabilities

Return to the die rolling experiment yet again, where Y1 is the number rolled on the first
die and Y2 is the number rolled on the second. What are the probabilities associated with
Y1 alone?

P(Y1 = 1) = p(1,1) + p(1,2) + p(1,3) + p(1,4) + p(1,5) + p(1,6) = 6/36 = 1/6

and likewise for P(Y1=2) etc. In summation notation, the probabilities for Y1 alone are:
P(Y1 = y1) = p1(y1) = ∑_{y2=1}^{6} p(y1, y2)

The distribution of a single variable that is part of a multivariate distribution is known as a marginal probability function (in the discrete case) or a marginal density function (in the case of continuous random variables). In general terms, for the bivariate case, the discrete marginal p.f. is given by:

p1(y1) = ∑_{y2} p(y1, y2)

(and vice-versa for y2 ) and for the continuous case by:
f1(y1) = ∫_{−∞}^{∞} f(y1, y2) dy2

(and vice-versa for y2 ). The term “marginal” comes from the discrete case in which, if
probabilities are presented in a table, the margins of the table show the probability
functions of the individual variables.

Example: The following table (from example 5.5 in the text) shows the joint probability
function for two random variables: Y1 = the number of Republicans and Y2 = the number
of Democrats on a two person committee chosen from 2 Democrats, 3 Republicans and 1
Independent. (Note—leave table on board for next section example):

                          Republicans
                     0        1        2      Total
Democrats    0       0       3/15     3/15    6/15
             1      2/15     6/15      0      8/15
             2      1/15      0        0      1/15
          Total     3/15     9/15     3/15     1

Note that each of the margins sums to 1.

In this case the marginal distribution of Democrats is given by the row totals, and that of
Republicans by the column totals i.e.:

p1(Republicans = 0) = 3/15,   p2(Democrats = 0) = 6/15
p1(Republicans = 1) = 9/15,   p2(Democrats = 1) = 8/15
p1(Republicans = 2) = 3/15,   p2(Democrats = 2) = 1/15

The concept of marginal probability distributions leads directly to a related concept:
conditional probability distributions. Recall from the multiplicative law that

P ( A ∩ B ) = P ( A) P ( B | A)

We can think of a bivariate event (Y1 = y1, Y2 = y2) as the intersection of two univariate events. If so, we can use the multiplicative law to state:

p(y1, y2) = p1(y1) p(y2 | y1)
= p2(y2) p(y1 | y2)

We can then rearrange one of these equations to calculate, for the discrete case,
conditional discrete probability functions and, for the continuous case, conditional
distribution functions.

Each point in the discrete case is given by:

p(y1 | y2) = P(Y1 = y1 | Y2 = y2) = P(Y1 = y1, Y2 = y2) / P(Y2 = y2) = p(y1, y2) / p2(y2)

Example: (5.7) Returning to the table given in the previous example, what is the
conditional distribution for the number of Republicans on the committee given that the
number of Democrats is 1?

The table gives joint probabilities. To find the conditional distribution, we need to
calculate the probabilities of three outcomes:

P(Y1 = 0 | Y2 = 1) = p(0,1)/p2(1) = (2/15)/(8/15) = 1/4
P(Y1 = 1 | Y2 = 1) = p(1,1)/p2(1) = (6/15)/(8/15) = 3/4
P(Y1 ≥ 2 | Y2 = 1) = p(2,1)/p2(1) = 0/(8/15) = 0

These three numbers (1/4, 3/4, and 0) give us the conditional probability distribution.
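The division-by-the-marginal step is mechanical enough to script; a sketch using the joint table, keyed as (republicans, democrats):

```python
from fractions import Fraction

F = Fraction
# Joint p.f. from the committee table, keyed by (republicans, democrats).
p = {(0, 0): F(0), (1, 0): F(3, 15), (2, 0): F(3, 15),
     (0, 1): F(2, 15), (1, 1): F(6, 15), (2, 1): F(0),
     (0, 2): F(1, 15), (1, 2): F(0), (2, 2): F(0)}

# Marginal p2(1) = P(Democrats = 1), then divide each joint cell by it.
p2_1 = sum(prob for (r, d), prob in p.items() if d == 1)
cond = {r: p[(r, 1)] / p2_1 for r in range(3)}
print(cond[0], cond[1], cond[2])  # 1/4 3/4 0
```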

For continuous random variables, we define a conditional distribution function:
F(y1 | y2) = P(Y1 ≤ y1 | Y2 = y2) = ∫_{−∞}^{y1} [f(t1, y2) / f2(y2)] dt1

We can thus write a conditional density function,

f(y1 | y2) = f(y1, y2) / f2(y2)   or   f(y2 | y1) = f(y1, y2) / f1(y1)

Example (5.8) p.228:

A soda machine has a random amount Y2 gallons of soda at the beginning of the day and
dispenses Y1 gallons over the course of the day (which must be less than or equal to Y2).
The two variables have the following joint density:

f(y1, y2) = 1/2 for 0 ≤ y1 ≤ y2 ≤ 2, and 0 elsewhere
Find the conditional density of Y1 given Y2 = y2, and the probability that less than 1/2 gallon will be sold if the machine holds 1.5 gallons at the start of the day.

The marginal density of Y2 is given by:

f2(y2) = ∫_{−∞}^{∞} f(y1, y2) dy1 = ∫_0^{y2} (1/2) dy1 = (1/2) y2 for 0 ≤ y2 ≤ 2, and 0 elsewhere
so, using the definition of a conditional density function,

f(y1 | y2) = f(y1, y2) / f2(y2) = (1/2) / ((1/2) y2) = 1/y2,   0 ≤ y1 ≤ y2

And, evaluating:

P(Y1 ≤ 1/2 | Y2 = 3/2) = ∫_{−∞}^{1/2} f(y1 | y2 = 1.5) dy1 = ∫_0^{1/2} (2/3) dy1 = [(2/3) y1]_0^{1/2} = 1/3

Note that ∫_0^{1/2} (2/3) dy1 comes from substituting 1.5 for y2.
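Two quick numerical checks on this example (my sketch, not from the text): that the triangular joint density integrates to 1, and that the conditional probability works out to 1/3.

```python
# Check 1: the joint density f(y1, y2) = 1/2 on 0 <= y1 <= y2 <= 2
# integrates to 1 (midpoint Riemann sum over the triangle).
n = 800                      # grid cells per axis (arbitrary resolution)
h = 2.0 / n
total = sum(0.5 * h * h
            for i in range(n) for j in range(n)
            if (i + 0.5) * h <= (j + 0.5) * h)   # keep cells with y1 <= y2
print(total)  # close to 1

# Check 2: P(Y1 <= 1/2 | Y2 = 3/2) -- the conditional density is the
# constant 1/y2 = 2/3 on [0, y2], so the integral is just (1/2) * (2/3).
prob = 0.5 * (2.0 / 3.0)
print(prob)  # 0.3333333333333333
```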

Next, two topics to discuss very briefly: independent random variables and expected
values of functions of more than one random variable. First, if Y1 and Y2 have joint
distribution function F(Y1,Y2), Y1 and Y2 are said to be independent if and only if:

F ( y1 , y2 ) = F1 ( y1 ) F2 ( y2 )

Thus, for discrete random variables with a joint probability function, Y1 and Y2 are
independent iff:

p( y1 , y2 ) = p1 ( y1 ) p2 ( y2 )
and for two continuous random variables with a joint density function,

f ( y1 , y2 ) = f1 ( y1 ) f 2 ( y2 )

These definitions can be extended (as can everything in this section) to an arbitrary
number of random variables.
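For the two-dice experiment, the independence condition can be verified exhaustively; a sketch:

```python
from fractions import Fraction

# Joint p.f. of two fair dice, plus the two marginals.
p = {(a, b): Fraction(1, 36) for a in range(1, 7) for b in range(1, 7)}
p1 = {a: sum(v for (x, y), v in p.items() if x == a) for a in range(1, 7)}
p2 = {b: sum(v for (x, y), v in p.items() if y == b) for b in range(1, 7)}

# Independence: p(y1, y2) = p1(y1) * p2(y2) must hold at every point.
independent = all(p[(a, b)] == p1[a] * p2[b]
                  for a in range(1, 7) for b in range(1, 7))
print(independent)  # True
```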

I also want to briefly present the definitions of expected value for a function of more than
one random variable. We will not perform calculations of this nature in this class, but you
may need to someday in your own work. If g (Y1 , Y2 ,...Yk ) is a function of k random
variables, then for the discrete case:

E[g(Y1, Y2, ..., Yk)] = ∑_{y1} ∑_{y2} ... ∑_{yk} g(y1, y2, ..., yk) p(y1, y2, ..., yk)

Likewise, for the continuous case,

E[g(Y1, Y2, ..., Yk)] = ∫_{y1} ∫_{y2} ... ∫_{yk} g(y1, y2, ..., yk) f(y1, y2, ..., yk) dy1 dy2 ... dyk

The examples in the book are fairly straightforward.

Covariance and Correlation

The last topics I want to cover in the area of multivariate distributions are the related ideas of covariance and correlation. When we think of the dependence of two (or more, but it can be hard to think of more) random variables, it is natural to think of what is happening to the values of one variable, Y1, while another, Y2, is changing. For example:

[Scatterplot "Spending Vs. Poll Increases": Spending ($M), 0–25, on the horizontal axis; Increase at Polls (%), 0–16, on the vertical axis]

This figure shows two random variables that appear to have a dependent relationship. As
campaign spending increases, for example, the percentage of those polled who support
the candidate also tends to increase.

If we calculate the expected values (means) of the two variables in this case, we can then
calculate the deviations from the means for each observed value of each variable:
Spending      Poll Increase   (Spending_i − μ1)   (Poll Increase_i − μ2)
(mean=10.3)   (mean=6.7)
 1             2                −9.3                −4.7
 3             2                −7.3                −4.7
 5             3                −5.3                −3.7
 7             4                −3.3                −2.7
 9             6                −1.3                −0.7
11             7                 0.7                 0.3
13             8                 2.7                 1.3
15            10                 4.7                 3.3
18            11                 7.7                 4.3
21            14                10.7                 7.3

We can then take the product of the two deviations, ( y1 − µ1 )( y2 − µ2 ) for each of the
observed pairs of data:

Product of Deviations
43.71
34.31
19.61
8.91
0.91
0.21
3.51
15.51
33.11
78.11

Taking the average of these products of deviations gives us a single number that
characterizes the relationship between the two random variables. In this case, that number
is 23.79, which is positive—reflecting the fact that as spending increases, success at the
polls tends to increase (and vice versa). This number is called the covariance and is
defined:

Cov(Y1 , Y2 ) = E[(Y1 − µ1 )(Y2 − µ2 )]

By manipulating this equation, we can derive a useful calculational formula:

Cov(Y1 , Y2 ) = E[(Y1 − µ1 )(Y2 − µ 2 )]
= E (Y1Y2 − µ1Y2 − µ 2Y1 + µ1µ 2 )
= E (Y1Y2 ) − µ1 E (Y2 ) − µ 2 E (Y1 ) + µ1µ 2
= E (Y1Y2 ) − µ1µ 2 − µ 2 µ1 + µ1µ 2
= E (Y1Y2 ) − µ1µ 2

Or, cov(Y1 , Y2 ) = E (Y1Y2 ) − E (Y1 ) E (Y2 )
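Both the defining formula and the shortcut give the 23.79 from the spending example; a sketch treating the ten observed pairs as the whole population:

```python
spending = [1, 3, 5, 7, 9, 11, 13, 15, 18, 21]
polls = [2, 2, 3, 4, 6, 7, 8, 10, 11, 14]

n = len(spending)
mu1 = sum(spending) / n  # 10.3
mu2 = sum(polls) / n     # 6.7

# Definition: Cov(Y1, Y2) = E[(Y1 - mu1)(Y2 - mu2)]
cov_def = sum((x - mu1) * (y - mu2) for x, y in zip(spending, polls)) / n

# Shortcut: Cov(Y1, Y2) = E(Y1 Y2) - E(Y1) E(Y2)
cov_short = sum(x * y for x, y in zip(spending, polls)) / n - mu1 * mu2

print(round(cov_def, 2), round(cov_short, 2))  # 23.79 23.79
```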

The larger the absolute value of covariance, the greater the linear dependence between
the variables. A positive value indicates that both variables “move the same way.” A
negative value indicates the opposite (one variable increases while the other decreases).
Note that if two random variables are independent, their covariance is zero. However, the
converse is not necessarily true: if two variables have zero covariance, they are not
necessarily independent (example 5.24 in the book illustrates this).
A problem with the measure of covariance is that it is not scale invariant; that is, the
value computed for covariance depends on the scales of measurement used.

The typical solution in statistics (and elsewhere) for this problem is to normalize the quantity: divide it by some other quantity so that the result is scale invariant. One way to do this is to divide the covariance by the product of the standard deviations of the two variables. This yields the coefficient of correlation:

ρ = Cov(Y1, Y2) / (σ1 σ2)

which returns a value between −1 and 1, where −1 implies perfect negative correlation (all points falling on a straight line with negative slope) and 1 implies perfect positive correlation (all points falling on a straight line with positive slope).

For an example of correlation, let us turn to some real data for the first time. The
following is a graph of data from the 1996 American National Election Study:

In this scatterplot of 1516 real observations of people’s reported opinions about Clinton
and Dole (on a 100-point “feeling thermometer”), it is hard to clearly see what the
relationship between the variables is. What do we expect it to be? What is a reasonable
correlation coefficient? We find the actual correlation to be -.3529 in this case. In other
words, as affect towards Clinton increases, affect towards Dole tends to decrease, but far
from perfectly.

Pearson’s r, a.k.a. Correlation coefficient.

Varies between -1 and +1.

Tells us the strength and direction of a relationship.

We look at how far each observation is from the means of the two variables together, and compare this to how far the observations vary on each variable separately.

Later we will learn R², the coefficient of determination.

How much X moves around its mean is given by the Sum of Squares for X:

SS_xx = ∑_{i=1}^{n} (X_i − X̄)²

Likewise, how much Y moves around its mean is given by the Sum of Squares for Y:

SS_yy = ∑_{i=1}^{n} (Y_i − Ȳ)²

How much X and Y are moving together is given by the cross-product of X and Y:

SS_xy = ∑_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ)

Looking at the cross-product: for each observation, we see how far its scores are from the means on X and Y and multiply the two deviations together. Where the scores on the two variables are both above the mean, this gives a positive number, indicating a positive relationship. The same holds where both are below the mean. When one score is above the mean and the other is below it, multiplying them gives a negative number, indicating a negative relationship.

The statistic r tells us how much the two variables are moving together out of the total
that they are moving separately.

r = SS_xy / √(SS_xx · SS_yy)

Which expands to:

r = ∑_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) / √( ∑_{i=1}^{n} (X_i − X̄)² · ∑_{i=1}^{n} (Y_i − Ȳ)² )

Example of this:
What is the correlation between hours studied and percentage scored on an exam?

case   X    X_i − X̄   (X_i − X̄)²    Y     Y_i − Ȳ   (Y_i − Ȳ)²   (X_i − X̄)(Y_i − Ȳ)
A      2      −1          1         60     −20         400              20
B      6       3          9         90      10         100              30
C      1      −2          4         65     −15         225              30
D      3       0          0         90      10         100               0
E      2      −1          1         80       0           0               0
F      4       1          1         95      15         225              15
Total                    16                            1050              95

X̄ = 3,   Ȳ = 80

∑_{i=1}^{n} (X_i − X̄)² = 16,   ∑_{i=1}^{n} (Y_i − Ȳ)² = 1050,   ∑_{i=1}^{n} (X_i − X̄)(Y_i − Ȳ) = 95

So we get

r = 95 / √(16 · 1050) = 95 / 129.61 = .73

Tells us there is a pretty strong positive relationship between the two variables.
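The whole calculation fits in a few lines; a sketch of the hours-vs-exam example:

```python
from math import sqrt

hours = [2, 6, 1, 3, 2, 4]         # X for cases A-F
scores = [60, 90, 65, 90, 80, 95]  # Y for cases A-F

n = len(hours)
xbar = sum(hours) / n    # 3
ybar = sum(scores) / n   # 80

ss_xx = sum((x - xbar) ** 2 for x in hours)                          # 16
ss_yy = sum((y - ybar) ** 2 for y in scores)                         # 1050
ss_xy = sum((x - xbar) * (y - ybar) for x, y in zip(hours, scores))  # 95

r = ss_xy / sqrt(ss_xx * ss_yy)
print(round(r, 2))  # 0.73
```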

r will vary between −1 and 1.
R² will tell us the percentage of variance explained; it will always be positive, between 0 and 1.

SS_xx, SS_yy and SS_xy will remain important to us.

The Multivariate Normal Distribution

I want to conclude this section with an example of a commonly-encountered multivariate
continuous distribution: the multivariate normal.

The easiest way to describe this distribution is with matrix notation; I will try to indicate vectors and matrices using "boldface" chalk writing. We say that a vector y of k variables is jointly distributed multivariate normal with a vector of means:
µ = [ µ1 , µ2 ,..., µk ]'
and a variance-covariance matrix:

Σ = ⎡ σ1²   σ21   ...   σk1 ⎤
    ⎢ σ12   σ2²   ...   ... ⎥
    ⎢ ...   ...   ...   ... ⎥
    ⎣ σ1k   ...   ...   σk² ⎦

where the main diagonal gives the variances and the off-diagonals are the symmetric covariances,

if they are distributed such that:

f(y) = (2π)^{−k/2} |Σ|^{−1/2} exp[ −(1/2)(y − μ)′ Σ⁻¹ (y − μ) ]

This may look confusing, but it is actually just a simple extension of the univariate normal.

In two dimensions, the so called bivariate normal, we are concerned with two means:

μ = ⎡ μ1 ⎤
    ⎣ μ2 ⎦

and a 2×2 variance-covariance matrix:

Σ = ⎡ σ1²   σ12 ⎤
    ⎣ σ12   σ2² ⎦
where the covariance between 1 and 2 is sometimes written in terms of the correlation, σ12 = ρ σ1 σ2 (so with unit variances it is just ρ).

This distribution is fairly easy to visualize. Here is an example of a bivariate normal with
means 0,0 variances 1,1 and covariance .5:

[3-D surface plot of the bivariate normal density over y1, y2 ∈ [−2, 2]]
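For the bivariate case the density formula can be evaluated directly, writing out the 2×2 determinant and inverse by hand. A sketch (my parameter names; the defaults match the example above, with means 0, 0, variances 1, 1, and covariance .5):

```python
from math import exp, pi, sqrt

def bvn_pdf(y1, y2, mu1=0.0, mu2=0.0, v1=1.0, v2=1.0, cov=0.5):
    """Bivariate normal density, expanding (y - mu)' Sigma^{-1} (y - mu) by hand."""
    det = v1 * v2 - cov * cov  # |Sigma| for the 2x2 case
    d1, d2 = y1 - mu1, y2 - mu2
    # Quadratic form using the explicit inverse of a 2x2 matrix.
    q = (v2 * d1 * d1 - 2.0 * cov * d1 * d2 + v1 * d2 * d2) / det
    return exp(-0.5 * q) / (2.0 * pi * sqrt(det))

# Peak of the surface at the mean: 1 / (2 pi sqrt(.75)), about 0.18.
print(round(bvn_pdf(0.0, 0.0), 4))
```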

Note also that we can apply the idea of conditional and marginal distributions. For
example, it is often much easier (computationally) to consider:

f(y1 | μ1, μ2, σ1², σ2², ρ) ~ N

Think of this as taking a "slice" out of the picture above: conditional on all of those parameters, the distribution is univariate normal.

Homework Problems: 5.1, 5.2, 5.3, 5.7, 5.9, 5.11, 5.15, 5.17, 5.20, 5.21, 5.23, 5.30, 5.37,
5.39, 5.42, 5.43, 5.44, 5.48, 5.58*, 5.75, 5.77, 5.79

It's OK to just work your way through the solutions to the double integral problems.
