Introduction to Statistics

Theoretical Distributions in Probability and Statistics

Decision-making
In a large family where it is known that there is genetic pre-disposition to
suffer from diabetes, how many children out of a possible 7 are likely to be
affected by diabetes?

A hospital administrator needs to decide how many staff to assign to the
Accident and Emergency Department of the hospital between 9am and 12pm on
weekdays. How should the administrator decide?

In 2004, the World Health Organisation (WHO) revised the body-mass index
(BMI) definitions for overweight and obese individuals in Asian populations.
Instead of a BMI range of 25 – 29 for defining overweight, and a BMI of > 30
for defining obese (as used in Caucasian populations), the corresponding
ranges for Asian populations are 23 – 27.5 and > 27.5. How did the scientists
at WHO decide on the new ranges?
Modeling the outcome variable with some appropriate theoretical framework
Data exploration and Statistical analysis

1. Data checking, identifying problems and characteristics
2. Understanding chance and uncertainty
3. How will the data for one attribute behave, in a
   theoretical framework?
Data → Data exploration (categorical / numerical outcomes) → Model each outcome with a theoretical distribution
                      Random variable
Definition:
A random variable is a theoretical description of the possible outcomes of
an event, before the actual outcome is observed.

Example:
In a survey of 5 students, how many female students are there?

The answer to this is a random variable. The possible outcomes are 0, 1, 2,
3, 4 or 5 female students. So the random variable describes what the
answer could have been, prior to finding out the actual answer.

Suppose we know that out of 5 students, there are 4 girls. Then there is no
longer any uncertainty or variability: the exact answer is known, and thus
this is no longer a random variable.
Discrete random variables
                Probability mass function
Definition:
The probability mass function (PMF) gives the probability of each possible
value of a discrete random variable.

Properties of a probability function:
Example 1:
 Let X denote the number of heads obtained when an unbiased coin is
tossed 3 times. Find the probability distribution of X. Find also P(|X – 2| 
1.2).
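The probability distribution asked for in Example 1 can be sketched by listing all equally likely sequences of 3 tosses and counting heads. This is only an illustrative sketch: the function name heads_pmf is made up for this example and is not part of the original slides.

```python
from fractions import Fraction
from itertools import product

def heads_pmf(n_tosses=3):
    """PMF of the number of heads in n_tosses tosses of an unbiased coin."""
    outcomes = list(product("HT", repeat=n_tosses))  # all equally likely sequences
    pmf = {}
    for outcome in outcomes:
        heads = outcome.count("H")
        pmf[heads] = pmf.get(heads, Fraction(0)) + Fraction(1, len(outcomes))
    return dict(sorted(pmf.items()))

pmf = heads_pmf()
for x, p in pmf.items():
    print(f"P(X = {x}) = {p}")
```

Any event about X can then be evaluated by summing the PMF over the values of x that satisfy it.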
         Cumulative distribution function
Definition:
The cumulative distribution function (CDF) gives the probability that the
random variable takes a value no greater than x, and is formally defined as
F(x) = P(X ≤ x) for any real x.

Properties of a CDF:
Example 2:
                     Uniform Distribution
Definition:
A random variable is said to follow a Uniform distribution if any of the
possible outcomes are equally likely.

Mathematically: P(X = x) = constant.

So if there are n possible outcomes, the chance of each of the outcomes is
1 / n.
Example 3:
In a game of chance, a gambler chooses an integer between 13 and 18
inclusive (including 13 and 18). There are equal chances for any number in
the set {13, 14, 15, 16, 17, 18} to be drawn. Let X be the random variable
denoting the number drawn. Find the probability distribution of X and also
P(X < 16).
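A minimal sketch of Example 3, assuming each of the six integers is drawn with probability 1/6 as stated. The variable names are illustrative only.

```python
from fractions import Fraction

values = range(13, 19)  # the set {13, 14, 15, 16, 17, 18}
pmf = {x: Fraction(1, len(values)) for x in values}  # uniform: 1 / n each

# P(X < 16) is the sum of the PMF over the values 13, 14 and 15.
p_less_than_16 = sum(p for x, p in pmf.items() if x < 16)
print(p_less_than_16)
```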
                      Bernoulli Distribution
A random experiment with two possible outcomes, conveniently defined as
“success” or “failure” is called a Bernoulli trial after Jacob Bernoulli (1654 –
1705). The choice of the event as “success” or “failure” is completely
arbitrary.
Example: a toss of a coin will show either a head or a tail. The “success” event can
be either the head, or the tail.

Conventionally, p denotes the probability of success and 1 – p denotes the probability
of failure.




                   Binomial Distribution
The number of “success” events out of n repeated Bernoulli trials follows a
Binomial distribution, provided that each trial results in one of 2 mutually
exclusive outcomes, the trials are mutually independent, and the probability
of success p is the same in every trial.
Example 4:
A batch of pregnancy test kits contains 50 kits, of which 10% are known to be
defective. If 3 test kits are randomly chosen with replacement from the batch,
what is the probability that:
         (i) all will be defective;
         (ii) none will be defective;
         (iii) at least one will be defective;
         (iv) exactly one will be defective;
         (v) exactly two will be defective;
         (vi) not more than two will be defective.
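Because the kits are drawn with replacement, the number of defective kits in Example 4 can be treated as Binomial(3, 0.1). The sketch below evaluates the six parts under that assumption. The helper binomial_pmf is written here for illustration and does not come from the slides.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 3, 0.10
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

answers = {
    "(i) all defective": probs[3],
    "(ii) none defective": probs[0],
    "(iii) at least one defective": 1 - probs[0],
    "(iv) exactly one defective": probs[1],
    "(v) exactly two defective": probs[2],
    "(vi) not more than two defective": 1 - probs[3],
}
for part, value in answers.items():
    print(f"{part}: {value:.3f}")
```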
                   Multinomial Distribution
The Binomial distribution has been used to obtain probabilities for the
number of times an event of interest (out of 2 possible events) occurs when
the same experiment is repeated several times.

Sometimes one is interested to count the number of occurrences of several
events simultaneously. In such a situation the multinomial distribution is
useful.

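Assuming there are k possible outcomes, and E1, E2, …, Ek denote the
corresponding number of occurrences of each of the possible outcomes out
of a total of n events, then

$$P(E_1 = n_1, E_2 = n_2, \ldots, E_k = n_k) = \frac{n!}{n_1! \, n_2! \cdots n_k!} \, p_1^{n_1} p_2^{n_2} \cdots p_k^{n_k}$$

with pi = P(Ei), the probability of the i-th outcome in a single trial, and
n1 + n2 + … + nk = n.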
Example 5:
When snapdragons with pink flowers are crossed, a randomly chosen
offspring has either red (with prob. 0.25), pink (with prob. 0.50) or white (with
prob. 0.25) flowers. What is the probability that among 10 randomly chosen
seeds, 3 will develop white flowers, 2 red ones and 5 pink flowers?
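Example 5 can be sketched with the standard multinomial probability n! / (n1! … nk!) × p1^n1 × … × pk^nk. The function name multinomial_pmf is an illustrative choice, and the outcomes are ordered as (red, pink, white).

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability of observing the given counts for outcomes with the given probabilities."""
    coefficient = factorial(sum(counts))
    for c in counts:
        coefficient //= factorial(c)  # exact integer division at every step
    return coefficient * prod(p**c for p, c in zip(probs, counts))

# 2 red, 5 pink and 3 white flowers among 10 seeds
p_example5 = multinomial_pmf(counts=[2, 5, 3], probs=[0.25, 0.50, 0.25])
print(round(p_example5, 4))
```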
                    Poisson Distribution
The Poisson distribution is usually used to calculate the probabilities of a
number of occurrences of a rare event. Often these cases are such that an
event can occur repeatedly over a long period of time or over a large area;
the distribution applies to the number of occurrences in a small interval of
time or over a small area.

Example: machine breakdowns, arrivals of calls at a telephone exchange,
faults developing in a pipeline, random arrival of customers at a service
station, accident occurrences, radioactive decay, gene mutations at a
particular locus
   Assumptions of a Poisson Distribution
• The outcomes occur randomly.

• The number of outcomes occurring in one time interval or specified region
is independent of the number that occur in any other disjoint time interval or
region.

• The probability that a single outcome will occur during a very short time
interval or in a small region is very small and is constant.

• The probability of 2 or more outcomes occurring in such a short time
interval or falling in such a small region is negligible.
       Properties of a Poisson Distribution




(A) If X ~ Binomial(n, p), then X → Poisson(np) as n → ∞ and p → 0, with np
held constant. That is, the Poisson distribution arises as the limiting case
of the Binomial distribution.

(B) Suppose that X1 and X2 are independent random variables with X1 ~
Poisson(λ1) and X2 ~ Poisson(λ2), then Y = X1 + X2 ~ Poisson(λ1 + λ2). That is,
the sum of two independent Poisson random variables also has a Poisson
distribution.
Example 6:
The number of emergency admissions each day to a hospital is found to
have a Poisson distribution with mean 2.

a) Evaluate the probability that on a particular day there will be no
   emergency admissions.

b) At the beginning of one day, the hospital has 5 beds available for
   emergencies. Calculate the probability that this will be an insufficient
   number for the day.

c) Calculate the probability that there will be exactly 3 admissions
   altogether on two consecutive days.
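A minimal sketch of Example 6, using the standard Poisson probability P(X = k) = e^(–λ) λ^k / k!. Part (c) assumes admissions on the two days are independent, so that the two-day total is Poisson with mean 2 + 2 = 4. The helper poisson_pmf is illustrative only.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean)."""
    return exp(-mean) * mean**k / factorial(k)

mean_per_day = 2

# (a) no emergency admissions on a given day
p_none = poisson_pmf(0, mean_per_day)

# (b) 5 beds are insufficient when there are 6 or more admissions
p_insufficient = 1 - sum(poisson_pmf(k, mean_per_day) for k in range(6))

# (c) exactly 3 admissions over two days (sum of two independent Poisson variables)
p_three_in_two_days = poisson_pmf(3, 2 * mean_per_day)

print(f"{p_none:.4f} {p_insufficient:.4f} {p_three_in_two_days:.4f}")
```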
Example 7:
Oranges are packed in crates each containing 250. On the average 0.6% are
found to be bad when the crates are opened. What is the probability that
there will be more than 2 bad oranges in a crate?
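Example 7 is a case where the Poisson distribution can stand in for the Binomial: n = 250 is large and p = 0.006 is small, so np = 1.5. The sketch below compares the two, assuming each orange is bad independently of the others. That assumption is not stated in the example.

```python
from math import comb, exp, factorial

n, p = 250, 0.006
mean = n * p  # 1.5 bad oranges per crate on average

# Exact Binomial(250, 0.006) probability of more than 2 bad oranges
p_exact = 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(3))

# Poisson(1.5) approximation of the same probability
p_poisson = 1 - sum(exp(-mean) * mean**k / factorial(k) for k in range(3))

print(f"exact binomial: {p_exact:.4f}, Poisson approximation: {p_poisson:.4f}")
```

The two values agree closely, which is what the limiting relationship between the Binomial and Poisson distributions suggests.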
                Recap – Numerical EDA

     • Calculating informative numbers which summarise the dataset
     • What are the numbers useful for describing the age of 1,059 individuals with diabetes?
        • Location parameters (mean, median, mode)
        • Spread (range, standard deviation, interquartile range)
        • Skewness

[Figure: histogram of AGE (axis from 20 to 80), with the mean age of 54.6 years marked]

Properties of means and variances in theoretical distributions play
important roles in determining variations in the definitions of the outcomes.
Mean (Expectation) of a discrete random variable
The expectation of a discrete outcome X, commonly known as the mean of X
or the expected value of X, is denoted as E(X) and defined as

$$E(X) = \sum_{x} x \, P(X = x)$$

The value of E(X) refers to the average value of x that one can expect after
sampling a large number of values from the distribution of X. In other words,
E(X) is the long-run average of observations of the variable X.

The expectation of any function g(.) which depends on the random variable X,
g(X), is defined as follows

$$E[g(X)] = \sum_{x} g(x) \, P(X = x)$$
    Variance of a discrete random variable

The variance of X, or the population variance of X, is denoted by Var(X) and is
defined as

$$\mathrm{Var}(X) = E\left[(X - E(X))^2\right] = \sum_{x} (x - E(X))^2 \, P(X = x)$$

Var(X) is usually denoted by σ², and σ is defined to be the standard deviation
of X.
Functions of means and variances
Example 8:
Find the expected score of a single roll of a fair die.
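Example 8 can be sketched by applying the usual definition E(X) = Σ x P(X = x) to the six equally likely faces. The variance is computed the same way as a by-product, although the example does not ask for it.

```python
from fractions import Fraction

pmf = {face: Fraction(1, 6) for face in range(1, 7)}  # fair die

mean = sum(x * p for x, p in pmf.items())
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

print(f"E(X) = {mean}, Var(X) = {variance}")
```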
             Continuous random variables

Definition:
A continuous random variable X takes any value in a given range, and
theoretically can be measured to any desired degree of accuracy. (E.g. height,
weight, age, etc.)

When the total number of possible outcomes is very large, the histogram will
approximate a smooth curve called a frequency curve or a probability
density curve. The function represented by this curve is called the frequency
function, or more commonly the probability density function, denoted by f.

As the function f denotes a probability function, it is never negative and
the total area under its curve is 1:

$$f(x) \ge 0 \ \text{for all } x, \qquad \int_{-\infty}^{\infty} f(x) \, dx = 1$$
Some notes on continuous random variables
Properties of continuous random variables

The cumulative distribution function (cdf) of a continuous random variable
is denoted FX(x) = P(X ≤ x) for any real x.
                     Uniform Distribution
Definition:
A random variable is said to follow a Uniform distribution in the interval [a, b]
if the probability density function is a constant in the interval.
Normal Distribution

[Figure: distribution of exam marks for a Mathematics exam (axis from 40 to 80). About 68% of the probability lies within 1 standard deviation of the mean, and about 95% within 2 standard deviations.]
                    Normal Distribution
Also known as the Gaussian distribution.

A useful distribution to model outcomes in the
natural world.




     Properties of the Normal distribution
- Special case: If μ = 0 and σ² = 1, then X has a Standard Normal distribution.
Usually, the probability density function of the standard normal is written
φ(x), and the cdf is written Φ(x).

- If X ~ N(0, 1), and Y = aX + b, then Y ~ N(b, a²). Conversely, if
X ~ N(μ, σ²), and Y = (X – μ) / σ, then Y ~ N(0, 1).

- If X1 ~ N(μ1, σ1²) and X2 ~ N(μ2, σ2²), and X1 and X2 are mutually
independent, then Y = X1 + X2 ~ N(μ1 + μ2, σ1² + σ2²).

- The plot of the density function f is bell-shaped and symmetrical about the
line x = μ with a single peak. So the mean, mode and median of the normal
distribution coincide.

- Practically all of the population (about 99.7%) lies in the interval μ ± 3σ,
about 95% of the population lies in the interval μ ± 2σ and about 68% of the
population lies in the interval μ ± σ.
     Properties of the Normal distribution
- Suppose X ~ Binomial(n, p). For large n, with p not too close to 0 or 1, the
normal distribution can be used as an approximation and X ≈ N(np, np(1 – p)).

- Suppose X ~ Poisson(λ). For large λ, the normal distribution can also be
used as an approximation and X ≈ N(λ, λ).

- When the Normal distribution is used to approximate a discrete
distribution, a continuity correction must be used. This is because the
discrete probability P(X = x) is equivalent to the continuous probability
P(x – 0.5 ≤ X < x + 0.5).

- For example, suppose X is discrete and the normal approximation is used.
Suppose also that the question asks for P(X < 35). This is equivalent to
finding the continuous probability P(X < 34.5): the discrete value x = 35 is
not included in the range X < 35, so the continuous random variable cannot be
bigger than 34.5 (since any value with 34.5 ≤ x < 35 would still round up to
give 35 in the discrete random variable).
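The effect of the continuity correction can be checked numerically. The sketch below uses a hypothetical X ~ Binomial(100, 0.4), chosen only for illustration and not taken from the slides, and compares the exact P(X < 35) with the normal approximation evaluated at 34.5 and at 35.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.4  # hypothetical parameters, for illustration only

# Exact discrete probability: P(X < 35) = P(X <= 34)
exact = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(35))

# Normal approximation N(np, np(1 - p)); NormalDist takes the standard deviation
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
with_correction = approx.cdf(34.5)
without_correction = approx.cdf(35)

print(f"exact {exact:.4f}, corrected {with_correction:.4f}, uncorrected {without_correction:.4f}")
```

For these parameters the corrected value sits much closer to the exact probability than the uncorrected one.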
            Calculating probabilities for N(0,1)
  - http://www.stat.psu.edu/~babu/418/norm-tables.pdf

  - Cumulative Standard
    Normal table




P(Z < 0.45) = 0.67364

P(Z > 1.12) = 1 – P(Z < 1.12)
            = 1 – 0.8686
            = 0.1314

P(Z < -0.45) = 1 – P(Z > -0.45)
             = 1 – P(Z < 0.45)
             = 1 – 0.67364
             = 0.32636
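As an alternative to reading a printed table, the same standard normal probabilities can be computed with Python's standard library. This is a sketch using statistics.NormalDist, not the RExcel workflow used in the course. Small differences in the last digit relative to a table come from rounding.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1

p_below_045 = Z.cdf(0.45)            # P(Z < 0.45)
p_above_112 = 1 - Z.cdf(1.12)        # P(Z > 1.12)
p_below_minus_045 = Z.cdf(-0.45)     # P(Z < -0.45), equals 1 - P(Z < 0.45) by symmetry

print(f"{p_below_045:.5f} {p_above_112:.4f} {p_below_minus_045:.5f}")
```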
RExcel and Normal distribution
Example 9:
Suppose X ~ N(0, 1), and x denotes a value that X can take. Find the following
probabilities, by using RExcel.
a)      P(X < x)         for x = 0.65
b)      P(X  x)         for x = 0.123
c)      P(X > x)         for x = 2.78
d)      P(X > x)         for x = 0
Example 10:
X and Y are independent random variables which are both normally
distributed, with X ~ N(100, 25) and Y ~ N(120, 20).
Calculate the following probabilities:
(a)      P(X > 92)
(b)      P(Y > X)
(c)      P(2X + Y < 300)
(d)      P(|X – Y| < 10)
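Example 10 relies on the fact that sums and differences of independent normal variables are again normal. The sketch below assumes the second parameter in N(100, 25) and N(120, 20) is the variance, following the N(μ, σ²) notation used in these notes. NormalDist itself takes the standard deviation, hence the square roots.

```python
from math import sqrt
from statistics import NormalDist

# Assumption: N(mean, variance) notation, so sigma = sqrt(variance)
X = NormalDist(mu=100, sigma=sqrt(25))
Y = NormalDist(mu=120, sigma=sqrt(20))

p_a = 1 - X.cdf(92)                   # (a) P(X > 92)
p_b = 1 - (Y - X).cdf(0)              # (b) P(Y > X) = P(Y - X > 0), Y - X ~ N(20, 45)
p_c = (2 * X + Y).cdf(300)            # (c) 2X + Y ~ N(320, 4*25 + 20)
D = X - Y                             # X - Y ~ N(-20, 45)
p_d = D.cdf(10) - D.cdf(-10)          # (d) P(|X - Y| < 10)

print(f"{p_a:.4f} {p_b:.4f} {p_c:.4f} {p_d:.4f}")
```

NormalDist arithmetic treats the two operands as independent, which matches the independence stated in the example.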
                   Exponential Distribution
Recall that, under certain assumptions, the number of occurrences of rare
events follows a Poisson distribution. Sometimes, the interest may be in the
time till the observation of the event.

Let Yt denote the number of occurrences of rare events in t time units.
Suppose the mean number of events is λ per time unit. Then Yt follows a
Poisson distribution with mean λt.

Let X denote the time, measured from an arbitrary moment to the first event.

Then    P(X > x)          = P(No events in an interval of x time units)
                          = P(Yx = 0)
                          = e^(–λx)

Therefore FX(x) = P(X ≤ x) = 1 – P(X > x) = 1 – e^(–λx),
and             f(x) = λ e^(–λx)
This is called the exponential distribution or the waiting time distribution.
                Exponential Distribution
The waiting time until an event occurs in a Poisson process follows the
exponential distribution.
Lack of memory property

$$P(X > s + t \mid X > s) = P(X > t) \quad \text{for all } s, t \ge 0$$

This is rather relevant to some of you! The waiting time for a bus follows an
Exponential distribution (prove this!), and this property of an Exponential
distribution is rather depressing.
It says that the chance that you have to wait another 5 minutes for the bus
is exactly the same whether you have just arrived or have already waited 20
minutes without seeing it arrive!
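The lack of memory property, P(X > s + t | X > s) = P(X > t), can be checked numerically from the exponential survival function P(X > x) = e^(–λx). The rate below (one bus every 10 minutes on average) is a made-up value for illustration.

```python
from math import exp

def survival(x, rate):
    """P(X > x) for an exponential waiting time with the given rate."""
    return exp(-rate * x)

rate = 0.1              # hypothetical: on average one bus every 10 minutes
waited, extra = 20, 5   # already waited 20 minutes, asking about 5 more

# P(X > 25 | X > 20) = P(X > 25) / P(X > 20)
conditional = survival(waited + extra, rate) / survival(waited, rate)
unconditional = survival(extra, rate)  # P(X > 5) for someone who has just arrived

print(f"{conditional:.6f} {unconditional:.6f}")
```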
Example 11:
 Assume that the number of radioactive particles emitted by a radioactive
substance is 1.5 per second. What is the chance that we have to wait more
than three seconds for the first emission to occur?
Example 12:
Assume that the average time between two subsequent visits of insects to a
certain flower is 12 minutes. You are starting to observe the flower. What is
the chance that you will have to wait for no more than 15 minutes for the first
insect to arrive? What is the chance that the time between the first and
second arriving insect is less than 15 minutes? What is the chance that less
than 3 insects will visit the flower, given that you observe the flower for one
hour?
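Examples 11 and 12 can be sketched as follows, assuming that emissions and insect visits each follow a Poisson process. The examples imply this but do not state it. Under that assumption the waiting time is exponential and the count in a fixed interval is Poisson. The helper names are illustrative.

```python
from math import exp, factorial

def exponential_cdf(x, rate):
    """P(X <= x) for an exponential waiting time with the given rate."""
    return 1 - exp(-rate * x)

def poisson_pmf(k, mean):
    """P(Y = k) for Y ~ Poisson(mean)."""
    return exp(-mean) * mean**k / factorial(k)

# Example 11: rate 1.5 emissions per second, wait more than 3 seconds
p_wait_over_3s = 1 - exponential_cdf(3, rate=1.5)

# Example 12: mean gap of 12 minutes, so the rate is 1/12 visits per minute
rate_per_min = 1 / 12
p_first_within_15 = exponential_cdf(15, rate_per_min)
p_gap_under_15 = exponential_cdf(15, rate_per_min)  # same distribution between visits
p_fewer_than_3 = sum(poisson_pmf(k, 60 * rate_per_min) for k in range(3))

print(f"{p_wait_over_3s:.4f} {p_first_within_15:.4f} {p_fewer_than_3:.4f}")
```

The first two answers to Example 12 coincide because the time to the first visit and the time between consecutive visits have the same exponential distribution.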
                               Entropy
Often in medical research, we are interested in predicting the outcome given
some probability statements.

Suppose there are four possible outcomes after chemotherapy treatment:
     (complete remission, partial remission, no change, early death)

If the probabilities of the four outcomes estimated from current data are:
                              (0.90, 0.08, 0.02, 0.00),
you will feel confident about the treatment, since current data intuitively
provide a lot of information and this information suggests a high
likelihood of positive outcomes.

Similarly, if the probabilities are
                               (0.01, 0.01, 0.08, 0.90)
you will also feel confident that you should avoid undergoing the treatment,
because again, current data provide a lot of information to suggest
negative outcomes.
                                 Entropy
However, if the probabilities are:
                            (0.25, 0.25, 0.25, 0.25)
you gain no information from previous data; that is, the previous data are
perfectly uninformative.

Entropy is a statistical measure to quantify the amount of information
available for prediction, and is calculated using all the probabilities of
the possible outcomes (i.e. from the probability function).

Statistical definition
The entropy of a random variable X with probability function p(x) is defined
to be the quantity

$$H(X) = -\sum_{x} p(x) \log p(x)$$

where terms with p(x) = 0 are taken to contribute 0.
                               Entropy
It can be shown that for a random variable with n possible values, the
entropy is always bounded between 0 and log(n), where:
-        0 corresponds to the situation with perfect information (one outcome is certain)
-        log(n) corresponds to the situation with no information (all n outcomes are equally likely).

Relative mutual information
It is increasingly common to define the relative mutual information (RMI) as

                         RMI(X) = 1 – [H(X)/log(n)]

to yield a more intuitive information criterion that is bounded between 0 and
1, where:
-        0 corresponds to the situation with no information
-        1 corresponds to the situation with perfect information.
Example 13:
Let X denote the outcome when flipping a fair coin and Y the outcome when
rolling a fair die. Let furthermore Z be one, if two fair dice show a double six
and zero otherwise. Notice that if you want to predict the outcome of these
random variables, you have the best chance to predict Z correctly. Y is
hardest to predict. Calculate the entropies and the relative mutual
information of these three random variables.
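Example 13 can be sketched using the usual definition of entropy, H(X) = –Σ p(x) log p(x) with natural logarithms, together with RMI(X) = 1 – H(X)/log(n). The function names are illustrative. Outcomes with zero probability are skipped, following the convention that they contribute 0.

```python
from math import log

def entropy(probs):
    """Entropy -sum p log p (natural log); zero-probability outcomes contribute 0."""
    return -sum(p * log(p) for p in probs if p > 0)

def relative_mutual_information(probs):
    """RMI = 1 - H / log(n), where n is the number of possible outcomes."""
    return 1 - entropy(probs) / log(len(probs))

coin = [1 / 2, 1 / 2]            # X: fair coin
die = [1 / 6] * 6                # Y: fair die
double_six = [35 / 36, 1 / 36]   # Z: 1 if two fair dice show a double six

for name, probs in [("X", coin), ("Y", die), ("Z", double_six)]:
    print(f"{name}: H = {entropy(probs):.4f}, RMI = {relative_mutual_information(probs):.4f}")
```

X and Y are both uniform over their outcomes, so each has RMI 0 even though Y has the larger entropy. Z is far from uniform, which gives it a low entropy and a high RMI.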
Something fun – practical application of what we have learnt so far!
  It is very common for students to go through the material on probability and
  theoretical distributions wondering what relevance any of it has to real
  life!

  Let’s look at something fun, which most of you will hopefully have some
  experience with:




                                    Monopoly
- 40 grids possible
- each player moves his avatar around the game board by rolling two dice
- Community Chest / Chance
- Acquire properties across the game board
- Develop properties of the same colour combination into houses and hotels
- Aim to bankrupt other players and be the richest (sounds familiar?)
- Potential of going to jail if landing on “Go to jail”
- or if you roll doubles 3 times in a row
- or if Chance / Community Chest sends you there.
Monopoly
Simple probability theory and knowledge of dice outcomes can provide a
marginal edge in games!

- 40 grids possible
- Every grid equally likely? (or 2.5% chance?)
- What are the properties that are most likely to be landed on?
- Computer simulation of Monopoly, with all the rules and regulations
- turns out that the Jail spot has the highest occupancy rate (5.88%)
- that inevitably results in the orange properties being the most frequented (8.47%)

Possible outcomes from roll of two dice:
Prob(X = 2) = 1 in 36
Prob(X = 3) = 2 in 36
Prob(X = 4) = 3 in 36
Prob(X = 5) = 4 in 36
Prob(X = 6) = 5 in 36
Prob(X = 7) = 6 in 36
Prob(X = 8) = 5 in 36
Prob(X = 9) = 4 in 36
Prob(X = 10) = 3 in 36
Prob(X = 11) = 2 in 36
Prob(X = 12) = 1 in 36
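The probabilities for the total of two dice can be reproduced by listing all 36 equally likely rolls. This sketch covers the dice only. It does not attempt the full Monopoly simulation mentioned on the slide.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count how many of the 36 equally likely (die1, die2) pairs give each total
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in sorted(counts.items())}

for total, n in sorted(counts.items()):
    print(f"Prob(X = {total}) = {n} in 36")
```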
[Figure: Monopoly board annotated with simulated landing percentages, ranging from 2.15% to 8.47%, including 5.88% for Jail]
                                Waiting time?
We can model the waiting time
for someone to land on a
particular grid with an
Exponential distribution.

For example, let’s suppose we
are interested in the most
expensive property on the
board.
[Figure: Monopoly board with a value shown against each grid, ranging from 32.7 to 47.9]
Students should be able to
• know the definitions of the various terminologies and distributions
• know how to calculate the probability mass/density function for the theoretical distributions, and in empirical situations
• calculate the probability of specific outcomes, when assuming a theoretical distribution for these outcomes
• understand the interpretation of entropy and know how to calculate the entropy

				