Basic principles of probability theory by Z0Kgs0P


									          Short review of probabilistic concepts
Probability theory plays a central role in statistics. This lecture gives a short review
   of the basic concepts of probability theory.

Contents of this lecture
• Basic principles and definitions
• Conditional probabilities and independence
• Bayes’s theorem and postulate
• Random variables and probability distributions
• Bayes’s theorem and likelihood
• Expectations and moments
                            Random experiment
A random experiment satisfies the following conditions:
1.   All possible distinct outcomes are known in advance
2.   In any particular experiment the outcome is not known in advance
3.   The experiment can be repeated under identical conditions

The outcome space Ω is the set of all possible outcomes.
Example 1. Tossing a coin is a random experiment. The outcome space is {H,T}:
     head and tail.
Example 2. Rolling a die. The outcome space is {1,2,3,4,5,6}
Example 3. Drawing from an urn with N balls, M of them red and N-M white.
     The outcome space is {R,W}: red and white
Example 4. Measuring temperature (in °C or in K): What is the outcome space?

Something that might or might not happen, depending on the outcome of the experiment,
    is called an event. Formally, an event is a subset of the outcome space.
Example: Rolling a die. {1,2,3} and {2,4,6} are events.
Example: Measuring temperature in Celsius. Give an example of an event.
                   Classical definition of probability

If all the outcomes are equally likely then the probability of an event A is the number
    of outcomes in A, M(A), divided by the number of all outcomes, M:

$$ P(A) = \frac{M(A)}{M} $$
Example: If a coin is fair then the probability of H is ½ and probability of T is ½
Example: If a die is fair then the probability of {1} is 1/6
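
The classical definition can be checked by direct counting. The following is a minimal Python sketch, not part of the original lecture; the helper name classical_probability is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def classical_probability(event, outcome_space):
    """P(A) = M(A) / M for equally likely outcomes (hypothetical helper)."""
    # Keep only the outcomes of the event that lie in the outcome space.
    event = set(event) & set(outcome_space)
    return Fraction(len(event), len(outcome_space))

die = {1, 2, 3, 4, 5, 6}
p_one = classical_probability({1}, die)          # the event {1}
p_even = classical_probability({2, 4, 6}, die)   # the event "even number"
print(p_one, p_even)
```

For the fair die this gives 1/6 for the event {1} and 1/2 for the event {2,4,6}.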

If the outcome space is a continuous set, such as an interval of real numbers or a region of a plane,
    then probability is measured as the ratio of the size (length or area) of the event to that of the outcome space:

$$ P(A) = \frac{M(A)}{M(\Omega)} $$

where M is the length or area.
Example: Outcome space is the interval [0,2]. What is the probability of [0,1]?
                 Frequency definition of probability

Since a random experiment can (in theory) be repeated as many times as we wish under identical
   conditions, we can measure the relative frequency of the occurrences of
   an event. If the number of trials is m and the number of occurrences of A is
   m(A) then according to the frequency definition the probability of A is the limit:

$$ P(A) = \lim_{m \to \infty} \frac{m(A)}{m} $$

According to the law of large numbers this limit exists. When the number of trials is
  small there may be strong fluctuations in the relative frequency. As the number of trials increases
  the fluctuations tend to decrease.
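
The behaviour of the relative frequency can be illustrated with a small simulation. This is only a sketch under stated assumptions: a simulated fair coin (p_head = 0.5), Python's pseudo-random generator with a fixed seed, and a hypothetical helper name relative_frequency.

```python
import random

def relative_frequency(trials, p_head=0.5, seed=0):
    """Relative frequency m(A)/m of heads in `trials` simulated tosses."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < p_head for _ in range(trials))
    return heads / trials

# The relative frequency fluctuates for small m and settles near 0.5 for large m.
for m in (10, 100, 10_000, 100_000):
    print(m, relative_frequency(m))
```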
            Other (subjective) definitions of probability

There are also other definitions of probability:

• Degree of belief. How strongly a person believes in an event. In this sense one
  person’s probability can differ from another person’s. For example:
  the existence of extra-terrestrial life.
• Degree of knowledge. In many cases the exact value of a quantity exists but we do not
  know it. By carrying out experiments we want to find this value. Since experiments
  are prone to errors it is in general impossible to find the exact value, and we assign
  probabilities to its possible values. That is the purpose of most statistical procedures and
  techniques. According to Jaynes, if proper rules are designed then exactly the same
  information will produce exactly the same probabilities (see Jaynes, Probability
  Theory: The Logic of Science). This definition reflects our state of
  knowledge about events and can change as we update our knowledge.
                             Probability axioms

Probability is defined as a function from subsets of the outcome space Ω to the real line
     R that satisfies the following conditions:

1.    Non-negativity: P(A) ≥ 0
2.    Additivity: if A∩B = ∅ then P(A∪B) = P(A) + P(B)
3.    Probability of the whole space is 1: P(Ω) = 1

All the above definitions obey these rules, so any property that can be derived from
      these axioms is valid for all definitions.

Show that: P(∅) = 0 (Hint: Ω∪∅ = Ω)
Show that: 0 ≤ P(A) ≤ 1 (Hint: A and Ã = Ω-A do not intersect).

a)   Let us assume that the outcome space is a square with sides of 1 unit. The probability of
     an event A is the area of A. If A and B do not overlap, the probability of
     either A or B is the sum of the areas of A and B, and the
     probability of both A and B is zero.
b)   Same as in a), but A and B overlap. The probability of A is the area of A and the
     probability of B is the area of B. The probability of
     either A or B is not the sum of the areas of A
     and B: P(A∪B) = P(A) + P(B) - P(A∩B)
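
Both cases can be checked by counting outcomes on a fair die. This is a minimal sketch; the particular events A, B, C and D are illustrative assumptions, not taken from the figure.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability of an event on a fair die."""
    return Fraction(len(event & die), len(die))

# Overlapping events: the intersection {2} must be subtracted once.
A, B = {1, 2, 3}, {2, 4, 6}
union_overlapping = prob(A | B)
by_formula = prob(A) + prob(B) - prob(A & B)

# Disjoint events: the probabilities simply add.
C, D = {1, 2}, {5, 6}
union_disjoint = prob(C | D)

print(union_overlapping, by_formula, union_disjoint)
```

For these events P(A∪B) = 5/6, which is less than P(A) + P(B) = 1, while for the disjoint pair P(C∪D) = P(C) + P(D) = 2/3.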

             Conditional probability and independence

Suppose that an event B has occurred or will occur and we want to know the
   probability of A. Knowing B may change our knowledge about A, or the
   occurrence of B may influence the occurrence of A. The probability of A given B is
   called the conditional probability of A given B and is defined (for P(B) > 0) as:

$$ P(A \mid B) = \frac{P(A \cap B)}{P(B)} $$

In effect the event B has become the new outcome space. Events A and B are called
     independent if the occurrence of one does not influence the probability of the other:

$$ P(A \mid B) = P(A) \quad \text{and} \quad P(B \mid A) = P(B) $$

This can also be written as:

$$ P(A \cap B) = P(A) P(B) $$

Note that these equations are not independent of each other: when P(A) > 0 and P(B) > 0,
     any one of them implies the other two.
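
Conditional probability and independence can be illustrated on a fair die. This is a sketch under assumptions: the events even, low and upto4 are chosen for illustration and do not appear in the lecture.

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability of an event on a fair die."""
    return Fraction(len(event & die), len(die))

def cond_prob(a, b):
    """P(A | B) = P(A and B) / P(B), defined only for P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

even = {2, 4, 6}
low = {1, 2, 3}        # knowing `low` changes the probability of `even`
upto4 = {1, 2, 3, 4}   # knowing `upto4` does not change it

print(prob(even), cond_prob(even, low), cond_prob(even, upto4))
```

Here P(even | low) = 1/3 differs from P(even) = 1/2, so these events are dependent, whereas P(even | upto4) = 1/2 = P(even), so even and upto4 are independent and P(even ∩ upto4) = P(even) P(upto4).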

Figure: The conditional probability of A given B is the area of A∩B divided by the area of B.
This makes sense since we take it as a fact that B has certainly happened, so the
probability of A given B is determined by the set B only.
In some sense we normalise the area of A∩B by the area of B.
                      The Law of total probability

In many cases when the probability of an event is hard to calculate directly it is easier to
   divide the event into smaller parts, calculate their probabilities and then take
   a weighted average of them. This can be done using the law of total probability.
Let B1, B2, ..., Bn be a partition of Ω, i.e. they are mutually exclusive (Bi∩Bj = ∅ for i ≠ j) and their
   union is Ω (B1∪B2∪...∪Bn = Ω). Then from the axioms of probability:

$$ P(A) = \sum_{i=1}^{n} P(A \mid B_i) P(B_i) $$

(Here we do the inverse of what we did before: we remove the normalisation of A by the set Bi and
   then sum over all of them. P(A|Bi)P(Bi) = P(A∩Bi) is the probability of A and Bi with respect to the
   original outcome space.)
This law is a useful tool for calculating probabilities.
Consider a box with N balls, M of them red and N-M white. We make two
   draws. We do not know the colour of the first ball. What is the probability that the second ball
   is red? (Hint: Use the partition {R1}, {W1}, then use the law of total probability
   for {R2}. Here the subscript shows the first or the second draw.)
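
One way to work through the box exercise is sketched below. It assumes that the first ball is not returned to the box (drawing without replacement), which the text does not state explicitly; the function name p_second_red and the values N = 10, M = 4 are illustrative assumptions.

```python
from fractions import Fraction

def p_second_red(n_balls, n_red):
    """P(R2) by the law of total probability, assuming no replacement."""
    n, m = n_balls, n_red
    if n < 2 or not 0 <= m <= n:
        raise ValueError("need at least two balls and 0 <= n_red <= n_balls")
    p_r1 = Fraction(m, n)                    # P(R1): first ball red
    p_w1 = Fraction(n - m, n)                # P(W1): first ball white
    p_r2_given_r1 = Fraction(m - 1, n - 1)   # one red ball already removed
    p_r2_given_w1 = Fraction(m, n - 1)       # one white ball already removed
    # P(R2) = P(R2|R1) P(R1) + P(R2|W1) P(W1)
    return p_r2_given_r1 * p_r1 + p_r2_given_w1 * p_w1

print(p_second_red(10, 4))
```

Under this assumption the result simplifies to M/N: not knowing the first ball leaves the probability for the second draw the same as for the first.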
                                           Bayes’s theorem
Bayes’s theorem is a tool that updates the probability of an event in the light of
  evidence. It is written in various forms, all of which are equivalent. Let us again
  consider a partition of the outcome space, B1, B2, ..., Bn, so that the events are mutually exclusive
  and their union is equal to Ω. Then for one of these events (say the j-th) we can
  write:

$$ P(B_j \mid A) = \frac{P(A \mid B_j) P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i) P(B_i)} = \frac{P(A \mid B_j) P(B_j)}{P(A)} $$

Usually P(Bj|A) is called the posterior probability, P(Bj) the prior probability and P(A|Bj) the
   likelihood. The theorem is widely used in statistical inference.
Example: A box contains four balls. There are two possibilities: a) all balls are white
   (B1); b) two are white and two are red (B2). A ball is drawn and it is white (event A). What
   is the probability that all balls are white? B1 (all white) and B2 (two white and two
   red) are the two possible hypotheses, each with prior probability ½. If B1 is true then the
   probability of A is 1, and if B2 is true then the probability of A is ½. Calculate P(B1|A).
   What is the probability P(B2|A)?
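
The box example can be computed directly from Bayes’s theorem. The sketch below uses only the priors and likelihoods given in the example; the function name posterior is a hypothetical helper.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Posterior P(Bj | A) for each hypothesis Bj of a partition."""
    # Denominator P(A) from the law of total probability.
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

priors = [Fraction(1, 2), Fraction(1, 2)]      # P(B1), P(B2)
likelihoods = [Fraction(1), Fraction(1, 2)]    # P(A|B1), P(A|B2)

post_b1, post_b2 = posterior(priors, likelihoods)
print(post_b1, post_b2)
```

Drawing a white ball raises the probability of "all white" from the prior ½ to P(B1|A) = 2/3, and lowers P(B2|A) to 1/3.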
Bayes’s postulate: If there is no prior information available then prior probabilities
   should be assumed to be equal.
                              Random variables

A random variable is a function from the outcome space to the real line:

$$ X: \Omega \to R $$

Example 1: Consider the random experiment of tossing a coin twice. The outcome space is:

                 {(H,H), (H,T), (T,H), (T,T)}

Define a random variable as

                 X((T,T)) = 0, X((H,T)) = X((T,H)) = 1, X((H,H)) = 2

Example 2: Rolling a die. The outcome space is {1,2,3,4,5,6}. Define a random variable
X(j) = j.
                   Probability distribution function
Discrete case (the number of elements in the outcome space is finite or countable):
The probability function p assigns to each possible realisation x of a random variable X
       the probability P(X=x). Obviously Σx p(x) = 1.
Example: The number of heads turning up in two tosses is a random variable with
       probabilities p(0) = 1/4, p(1) = 1/2, p(2) = 1/4.
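
The probability function of the number of heads can be derived by enumerating the outcome space of two tosses. This is a minimal sketch assuming a fair coin, so that all four outcomes are equally likely; the names X and pmf are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Outcome space of two tosses: (H,H), (H,T), (T,H), (T,T).
outcomes = list(product("HT", repeat=2))

def X(outcome):
    """Random variable: number of heads in the outcome."""
    return outcome.count("H")

# Count how many outcomes map to each value of X, then normalise.
counts = Counter(X(w) for w in outcomes)
pmf = {x: Fraction(c, len(outcomes)) for x, c in counts.items()}

for x in sorted(pmf):
    print(x, pmf[x])
```

The enumeration gives p(0) = 1/4, p(1) = 1/2 and p(2) = 1/4, which sum to 1.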
For a continuous random variable it is not possible to define a probability for each
       realisation, since the probability of any single value is usually 0. For such variables it is
       more convenient to define a distribution function:

$$ F(x) = P(X \le x) $$

i.e. the probability that X is less than or equal to x. F(x) has the following properties:
  1) F(-∞) = 0, 2) F(x) is a monotonic non-decreasing function, 3) F(+∞) = 1.
This function is defined for discrete as well as continuous random variables. If the
       derivative of F(x) exists (as it usually does for well-behaved continuous
       random variables) then it is called the probability density function: f(x) =
       dF(x)/dx. Another relation between them is:

$$ F(x) = \int_{-\infty}^{x} f(t)\,dt $$
        Cumulative and density of probability
a) Cumulative probability of the uniform distribution on the interval [0,1]
b) Probability density of the uniform distribution on the interval [0,1]
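
The relation f(x) = dF(x)/dx for the uniform distribution on [0,1] can be checked numerically. This is a sketch only: the function names and the central-difference step h are assumptions made for illustration, and the approximation is not meaningful exactly at the corners x = 0 and x = 1.

```python
def uniform_cdf(x):
    """F(x) = P(X <= x) for the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)

def numeric_density(cdf, x, h=1e-6):
    """Central-difference approximation of f(x) = dF(x)/dx."""
    return (cdf(x + h) - cdf(x - h)) / (2 * h)

# Inside [0, 1] the density is close to 1; outside it is 0.
for x in (-0.5, 0.3, 0.9, 1.5):
    print(x, uniform_cdf(x), numeric_density(uniform_cdf, x))
```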
                       Joint probability distributions
If there is more than one random variable then their joint probability distribution is
    defined similarly. For the discrete case:

$$ p(x,y) = P((X,Y)=(x,y)) = P(X=x, Y=y) $$

Then Σx Σy p(x,y) = 1 and p(x,y) ≥ 0.
The marginal probability function pX(x) is derived by summing over all possible values
    of y:

$$ p_X(x) = \sum_y p(x,y) $$

The conditional probability function of X given Y=y is (for pY(y) > 0):

$$ p(x \mid y) = \frac{p(x,y)}{p_Y(y)} $$

For continuous random variables the joint distribution function is defined as
 F(x,y) = P(X ≤ x, Y ≤ y). The probability density f(x,y) is the mixed derivative of the distribution
    function with respect to its arguments, f(x,y) = ∂²F(x,y)/∂x∂y. It has the properties:

$$ f(x,y) \ge 0, \qquad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1 $$

Marginal and conditional probability densities are defined as for discrete random
  variables, replacing summation with integration.
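
Marginal and conditional probability functions can be computed from a joint table by summation. The sketch below uses an assumed example that is not in the lecture: two tosses of a fair coin, with X = 1 if the first toss is a head (else 0) and Y = the total number of heads.

```python
from collections import defaultdict
from fractions import Fraction

q = Fraction(1, 4)
# Joint probability function p(x, y) for the assumed example.
joint = {(0, 0): q, (0, 1): q, (1, 1): q, (1, 2): q}

def marginal(joint, axis):
    """Sum the joint probabilities over the other variable."""
    out = defaultdict(Fraction)
    for pair, p in joint.items():
        out[pair[axis]] += p
    return dict(out)

def conditional_x_given_y(joint, y):
    """p(x | y) = p(x, y) / pY(y); assumes pY(y) > 0."""
    p_y_val = marginal(joint, 1)[y]
    return {x: p / p_y_val for (x, yy), p in joint.items() if yy == y}

p_x = marginal(joint, 0)
p_y = marginal(joint, 1)
print(p_x, p_y, conditional_x_given_y(joint, 1))
```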
        Joint probability distributions and independence

Random events {X=x} and {Y=y} are independent if

$$ P(X=x, Y=y) = P(X=x) P(Y=y) $$

The random variables X and Y are independent if this relation holds for all pairs (x,y). It can
   also be written as

$$ p(x,y) = p_X(x) p_Y(y) $$

and then p(x|y) = pX(x) and p(y|x) = pY(y).

For continuous random variables the definition is analogous, with p replaced by f everywhere:

$$ f(x,y) = f_X(x) f_Y(y), \quad f(x \mid y) = f_X(x), \quad f(y \mid x) = f_Y(y) $$

For densities, Bayes’s theorem becomes:

$$ f(x \mid y) = \frac{f_X(x)\, f(y \mid x)}{f_Y(y)} $$

where f(x|y) is the posterior probability density, fX(x) is the prior probability density, f(y|x) is
   the likelihood of y given x, and fY(y) can be considered as a normalisation factor.
Usually we will drop the subscripts X and Y.
                       Expectation values. Moments
If X is a random variable and h(X) is a function of it then the expectation value (discrete case)
    is defined as:

$$ E(h(X)) = \sum_x h(x) p(x) $$

If h(x) = x then it is called the first moment. If h(x) = x^n then it is called the n-th moment.
    If h(x) = (x-E(X))^n then it is called the n-th central moment. The second central
    moment is called the variance of the random variable. The first moment and the second central
    moment play an important role in statistics and have special symbols:

$$ \mu = \sum_x x\,p(x), \qquad \sigma^2 = \sum_x (x-\mu)^2 p(x) $$

σ is called the standard deviation.
When there is more than one random variable and their joint probability function is
    known then their mixed moments are also defined. The most important of them are
    the covariance and the correlation:

$$ \mathrm{cov}(x,y) = \sum_{x,y} (x-\mu_x)(y-\mu_y)\,p(x,y), \qquad \rho(x,y) = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y} $$

For continuous random variables expectation values, moments, covariance and
   correlation are defined similarly, replacing summation with integration. If
   random variables are independent then their covariance is 0. The reverse is not true in general.
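
That zero covariance does not imply independence can be seen in a small example. The example is an assumption made for illustration and is not from the lecture: X takes the values -1, 0, 1 with equal probability and Y = X², so Y is completely determined by X.

```python
from fractions import Fraction

third = Fraction(1, 3)
# Joint probability function of (X, Y) with Y = X**2.
joint = {(-1, 1): third, (0, 0): third, (1, 1): third}

def expectation(joint, h):
    """E(h(X, Y)) = sum over (x, y) of h(x, y) p(x, y)."""
    return sum(h(x, y) * p for (x, y), p in joint.items())

mu_x = expectation(joint, lambda x, y: x)
mu_y = expectation(joint, lambda x, y: y)
cov = expectation(joint, lambda x, y: (x - mu_x) * (y - mu_y))

# Independence would require p(0, 0) = pX(0) * pY(0).
p_x0 = sum(p for (x, _), p in joint.items() if x == 0)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
independent_at_00 = joint[(0, 0)] == p_x0 * p_y0

print(cov, independent_at_00)
```

The covariance is exactly 0, yet p(0,0) = 1/3 while pX(0) pY(0) = 1/9, so X and Y are not independent.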
Let us take the example of tossing a coin. The coin is fair (i.e. the probability of head is 0.5 and that of
    tail is 0.5). Define the random variable X(H) = 0, X(T) = 1. Then the expectation value is:

$$ E(X) = 0 \cdot 0.5 + 1 \cdot 0.5 = 0.5 $$

The expectation value (first moment) is 0.5, the second moment is 0.5 and the standard deviation is 0.5.
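
The moments of the coin example follow directly from the definitions. This is a minimal sketch; the helper name moment and its center parameter are hypothetical.

```python
from math import sqrt

pmf = {0: 0.5, 1: 0.5}  # X(H) = 0, X(T) = 1 for a fair coin

def moment(pmf, n, center=0.0):
    """n-th moment about `center`: sum of (x - center)**n * p(x)."""
    return sum((x - center) ** n * p for x, p in pmf.items())

mean = moment(pmf, 1)                    # first moment
second = moment(pmf, 2)                  # second moment
variance = moment(pmf, 2, center=mean)   # second central moment
std = sqrt(variance)                     # standard deviation

print(mean, second, variance, std)
```

The variance is 0.25, so the standard deviation is 0.5; equivalently, the variance equals the second moment minus the square of the first.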
Let us take another example. Assume that the density of the probability distribution has the form
    (it is the uniform distribution over the interval [0,1]):

$$ f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{if } x > 1 \end{cases} $$

and the random variable is X(x) = x. Then:

$$ E(x) = \int_0^1 x f(x)\,dx = \int_0^1 x\,dx = 0.5 $$

$$ E(x^2) = \int_0^1 x^2 f(x)\,dx = \int_0^1 x^2\,dx = \frac{1}{3} $$

$$ E\big((x - E(x))^2\big) = \int_0^1 (x-0.5)^2 f(x)\,dx = \int_0^1 (x-0.5)^2\,dx = \frac{1}{12} $$
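
The integrals for the uniform distribution can be checked numerically. The sketch below uses a simple midpoint rule; the integrate helper and the number of steps are assumptions for illustration, not a method prescribed by the lecture.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

# On [0, 1] the uniform density is f(x) = 1, so it drops out of the integrands.
mean = integrate(lambda x: x, 0.0, 1.0)
second = integrate(lambda x: x ** 2, 0.0, 1.0)
variance = integrate(lambda x: (x - mean) ** 2, 0.0, 1.0)

print(mean, second, variance)
```

The numerical values agree with 1/2, 1/3 and 1/12 to within the accuracy of the midpoint rule.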
                             Further reading

1.   Berthold, M. and Hand, D.J. (2003) Intelligent Data Analysis
2.   Feller, W. (1968) An Introduction to Probability Theory and Its Applications: v.
3.   Feller, W. (1971) An Introduction to Probability Theory and Its Applications: v.
4.   Mardia, K.V., Kent, J.T. and Bibby, J.M. (2003) Multivariate Analysis
5.   Jaynes, E. (2003) Probability Theory: The Logic of Science
