Distributions Probability by ZvVYkoIQ

VIEWS: 0 PAGES: 40

									Statistics for Health Research
       Introduction to
      Distributions and
         Probability
              Peter T. Donnan
 Professor of Epidemiology and Biostatistics
              Overview

•   Distributions
•   History of probability
•   Definitions of probability
•   Random variable
•   Probability density function
•   Normal, Binomial and Poisson
    distributions
   Introduction to Probability
       Density Functions

•Normal Distribution /
•Gaussian / Bell curve
•Poisson named after French
Mathematician
•Binomial related to binary factors
(Bernoulli Trials)
  Early use of Normal
      Distribution
• Gauss was a German mathematician
  who solved mystery of where Ceres
  would appear after it disappeared
  behind the Sun.
• He assumed the errors formed a
  Normal distribution and managed to
  accurately predict the orbit of Ceres
    What is the
relationship between
   the Normal or
Gaussian distribution
  and probability?
       Probability

“I cannot believe that God plays
     dice with the cosmos”
        Albert Einstein
“The probable is what usually
         happens”
          Aristotle
          Origins of Probability

• Early interest in permutations Vedic
    literature 400 BC
•   Distinguished origins in betting and
    gambling!
•   Pascal and Fermat studied division of
    stakes in gambling (1654)
•   Enlightenment – seen as helping
    public policy, social equity
•   Astronomy – Gauss (1801)
•   Social and genetic – Galton (1885)
•   Experimental design – Fisher (1936)
      Types of Probability

  Two basic definitions:

1) Frequentist       2) Subjectivist
Classical            Bayesian
Proportion of        Strength of belief
times an event       in event happening
occurs in a long
series of ‘trials’
 Frequentists vs. Bayesians


• Two entrenched camps
• Scientists tend to use the
  frequentist approach
• Bayesians gaining ground
• Most scientists use frequentist
  methods but incorrectly interpret
  results in a Bayesian way!
              Frequentists

   • Consider tossing a fair coin
   • In any trial, event may be a
     ‘head’ or ‘tail’ i.e. binary
   • Repeated tossing gives series
     of ‘events’
   • In long run prob of heads=0.5
THTTHHHHTHHHTHHHTTHTTTHHTTHTTHHHTTTHHTHHHTTTTTHHH
        0.6          0.56                    0.52
              Frequentist
              Probability
• Note the difference between ‘long run’
    probability and an individual trial
•   In an individual trial a head either occurs
    (X=1) or does not occur (X=0)
•   Patient either survives or dies following an
    MI
•   Prob of dying after MI ≈ 30% based on a
    previous long series from a population of
    individuals who experienced MI
         Subjective
         Probability

• Based on strength of belief
• But more akin to thinking of clinician
  making a diagnosis
• Faced with patient with chest pain,
  based on past experience, believes
  prob of heart disease is 20%
• Person tossing coin believes prob of
  head is 1/2
      Comparison of
definitions of Probability
• Problems of subjective probability
• Probability for same patient can vary even
    with same clinician
•   Person can believe prob of head is 0.1
    even if it is a fair coin
•   Subjectivists argue they are more realistic
•   This course sticks to ‘frequentist’ and
    ‘model-based’ methods of probability
Random Variable

• Consider rolling 2 dice and we want
  to summarise the probabilities of all
  possible outcomes
• We call the outcome a random
  variable X which can have any value
  in this case from 2 to 12
• Enumerate all probabilities in sample
  space S
• P (2) = 1/6x1/6 = 1/36, P (3)=2/36,
  P (4) = 3/36, etc…..
   Probability Density Function for
            rolling two dice
6/36


5/36


4/36


3/36


2/36


1/36




        2     3    4    5     6    7   8    9    10    11   12

       1/36   2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36


        2      3    4    5    6    7    8    9    10   11    12
  Probability Density Function for
           rolling two dice

What is probability of getting 12? Answer 1/36
What is probability of getting more than 8? Ans. 10/36

  6/36

  5/36


  4/36


  3/36
  2/36
  1/36




         2   3   4   5   6   7   8   9   10   11   12
   Probability Density Function for
         continuous variable
6/36


5/36


4/36


3/36


2/36


1/36




        2     3    4    5     6    7   8    9    10    11   12

       1/36   2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36


        2      3    4    5    6    7    8    9    10   11    12
        Consider distribution of weight in
         kg; all values possible not just
                     discrete
Probability




              20…….30……40…… 50 ……60…….70…….80…..90….100….110…… 120

                                           Weight in kilograms


              2     3   4    5    6    7    8    9   10   11     12
Probability Density Function
          in SPSS
Use Analyze / Descriptive Statistics / Frequencies
and select no table and charts box as below
Probability Density Function
          in SPSS
Data from ‘LDL Data.sav’ of baseline LDL cholesterol
        Normal Distribution
Note that a Normal or Gaussian
curve is defined by two parameters:
Mean µ and Standard Deviation σ
And often written as N ( µ, σ )
Hence any Normal distribution has mathematical form




Impossible to be integrated so area under the curve
obtained by numerical integration and tabulated!
        Normal Distribution
As noted earlier the curve is symmetrical about the
mean and so p ( x ) > mean = 0.5 or 50%
And          p ( x ) < mean = 0.5 or 50%
And p (a < x < b) = p(b) – p(a)




            50%         50%
    Normal Distribution and
        Probabilities
So we now have a way of working out the probability
of any value or range of values of a variables IF a
Normal distribution is a reasonable fit to the data
p (a < x < b) = p(b) – p(a) which is the area under
the curve between a and b




                 50%      50%
        Normal Distribution

Most of area lies between +1 and -1 SD (64%)
The large majority lie between +2 and -2 SDs (95%)
              Normal Distribution

Probability Density
Function (PDF) =
 How well does my data fit
  a Normal Distribution?
                          Statistics

Baseline LDL
N                         Valid            1383
                          Missing             0
Mean                                   3.454363
Median                                 3.506214
St d. Dev iation                       .9889157
Skewness                                   .039
St d. Error of Skewness                    .066
Minimum                                   .3345
Maximum                                  7.5650




Note median and mean virtually the same
Skewness = 0.039, close to zero
Skewness is measure of symmetry (0=perfect symetry)
Eyeball test - fitted normal curve looks good!
  Try Q-Q plot in Analyze /
Descriptive Statistics/ Q-Q plot

Plot compares Expected
Normal distribution
with real data and if
data lies on line y = x
then the Normal
Distribution is a good
fit
Note still an eyeball
test!
Is this a good fit?
  I used to be Normal until I
discovered Kilmogorov-Smirnoff!
           One-Sample Kolmogorov-Smirnov Test

                                           Baseline LDL
N                                                   1383
Normal Parameters a,b     Mean                3.454363
                          Std. Deviation      .9889157
Most Extreme              Absolute                  .043
Diff erences              Positive                  .043
                          Negative                 -.043
Kolmogorov -Smirnov Z                              1.617
Asy mp. Sig. (2-tailed)                             .011
  a. Test distribution is Normal.
  b. Calculated f rom data.


Eyeball Test indicates distribution is approximately
Normal but K-S test is significant indicating
discrepancy compared to Normal
WARNING: DO NOT RELY ON THIS TEST
 Consider the distribution of survival
times following surgery for colorectal
                cancer
                         Statistics

Time f rom Surgery
N                        Valid              476
                         Missing              0
Mean                                   848.3908
Median                                 835.5000
Std. Dev iation                       582.39657
Skewness                                  2.081
Std. Error of Skewness                     .112
Minimum                                   14.00
Maximum                                 5763.00



      Note median=835 days and mean=848
      Skewness = 2.081, very skewed (> 1.0)
      Strong tail to right! Approximately Normal?
   Try a log transformation for right
         positive skewed data?
                         Statistics

logtime
N                        Valid            476
                         Missing            0
Mean                                  6.4346
Median                                6.7286
Std. Deviation                        .95059
Skewness                               -1.504
Std. Error of Skewness                   .112
Minimum                                  2.67
Maximum                                  8.66




Better but now slightly skewed to left!
 Examples of skewed distributions
       in Health Research
Discrete random variables – hospital admissions,
cigarettes smoked, alcohol consumption, costs
Continuous RV – BMI, cholesterol, BP




                                             30%
           The Binomial
           Distribution
• ‘Binomial’ means ‘two numbers’.
• Outcomes of health research are often
    measured by whether they have occurred
    or not.
•   For example, recovered from disease,
    admitted to hospital, died, etc
•   May be modelled by assuming that the
    number of events n has a binomial
    distribution with a fixed probability of
    event p
         The Binomial
         Distribution

• Based on work of Jakob Bernoulli, a Swiss
    mathematician
•   Refused a church appointment and
    instead studied mathematics
•   Early use was for games of chance but
    now used in every human endeavour
•   When n = 1 this is called a Bernoulli trial
•   Binomial distribution is distribution for a
    series of Bernoulli trials
              The Binomial
              Distribution
• Binomial distribution written as B ( n , p)
    where n is the total number of events and p
    = prob of an event
•   This is a Binomial
    Distribution with
                                                       0.20




    p=0.25 and n=20       Probability of R Successes
                                                       0.15




                                                       0.10




                                                       0.05




                                                       0.00
                                                              0   1   2   3   4   5   6   7   8   9 10 11 12 13 14 15 16 17 18 19 20
                                                                                                  Successes
The Binomial
Distribution
        The Poisson
        Distribution
Poisson distribution (1838), named after its
inventor Simeon Poisson who was a French
mathematician. He found that if we have a
rare event (i.e. p is small) and we know the
expected or mean ( or µ) number of
occurrences, the probabilities of 0, 1, 2 ...
events are given by:
                                     R
                                   e    
                        P( R ) 
                                       R!
         The Poisson
         Distribution
Note similarity to Binomial
In fact when p is small and n is large
B(n, p) ~ P (µ = np)
Also for large values of µ:
P (µ) ~ N ( µ, µ )

Hence if n and p not known
could use Poisson instead
        The Poisson
        Distribution
In health research often used to model the
number of events assumed to be random:
     Number of hip replacement failures,
     Number of cases of C. diff infection,
     Diagnoses of leukaemia around nuclear
     power stations,
     Number of H1N1 cases in Scotland,
     Etc.
                   Summary
•Many of variables measured in Health Research form
distributions which approximate to common distributions
with known mathematical properties       40




•Normal, Poisson, Binomial, etc…         30




•Note a relationship for all centred
                                         20




                                         10

                                                                                     Std. Dev = .96




around the exponential distribution
                                                                                     Mean = -.04
                                          0                                          N = 501.00
                                          -2.95   -1.95   -.95   .05   1.05   2.05


                                              RANNORM




Where e = 2.718
• All belong to the Exponential Family of distributions
• These probability distributions are critical to applying
  statistical methods
           SPSS Practical

• Read in data file ‘LDL Data.sav’
• Consider adherence to statins, baseline
    LDL, min Chol achieved, BMI, duration of
    statin use
•   Assess distributions for normality
•   If non-normal consider a transformation
•   Try to carry out Q-Q plots

								
To top