Exploratory Data Analysis by nikeborome


Statistical Inference and the Normal Distribution

            STA 570 401-402
            Spring 2006
Review of Inference

   The group of all individuals we are interested
    in is called the population. We rarely
    actually observe the entire population. If our
    question is “will extending the school year by
    5 days increase student learning?” then we
    are interested in ALL students. We are never
    going to design an experiment involving ALL students.

   Numerical aspects of the population are
    called parameters. If our population is all
    people who drive to work, one parameter is
    their average drive time each morning.
   Because we rarely see the entire population,
    parameters are typically unknown.
   The goal of inference is to estimate these
    unknown parameters.
Samples and Statistics

   We typically observe a small fraction of the
    population (we’d prefer to see all of it, but
    that just typically isn’t practical). The group
    we observe is called the sample. We see
    them, we can measure them, etc.
   Any numerical aspect of the sample is called
    a statistic. Suppose again we are interested
    in the drive time of all drivers, and we send
    out a survey. The people who respond are
    the sample. Their average drive time is
    called the sample mean.
Statistics to Parameters

   Fortunately, probability theory tells us that if
    our sample is drawn correctly (i.e. randomly),
    then our statistic will be close to our
    parameter, allowing us to make educated
    guesses about the parameter of interest.
   Drawing a random sample is sometimes
    easy, and sometimes difficult (stay tuned,
    we’ll cover this more as we go). For now,
    we’re going to assume we have a good random sample.
Remember the main idea

   We do NOT see the parameter, we DO see
    the statistic.
   Probability theory says there is a little “tether”
    connecting the two.
   Imagine seeing a hot air balloon (the
    statistic) on a tether over some treetops. You
    can’t see where on the ground it is tethered
    (the parameter), but you can make a good guess.
Some limitations of the tether idea

   I like the tether idea, but there are limitations
    on how far it applies.
   The “tether” is only probabilistic. It says
    things like “there is a 95% chance the
    statistic will be within (some number) of the
    parameter” and “there is a 99% chance the
    statistic will be within (some other number) of
    the parameter”, and so on.
More on tethers, continued

   To get a larger probability, you have to increase the
    length of the tether. This, I hope, is intuitive. To be
    more sure of the result, you have to give the
    statistic more room to move.
   If you’re aiming at a dartboard, there is a small
    chance you’ll hit the little circle in the middle. There
    is a larger chance you’ll hit the dartboard (it’s
    bigger). There is a great chance you’ll hit the wall.
    The bigger the target, the better the chance of hitting
    it. Hence, the longer the tether, the better the chance
    of finding the parameter.
Binomial distribution review

   Recall a binomial setting consists of a set of
   1) dichotomous (two-valued) responses
   2) equal chance of success for each
   3) independence (responses do not influence
    each other)
Inference with Binomial distributions

   Under the binomial setting, if p is the
    population proportion, then the sample
    proportion phat has a 95% chance of being
    within the region p ± 1.96 sqrt(p(1-p)/n)
   In practice, p is unknown, so we use phat to
    construct our tether length as well. The
    length of the tether (really called the “margin
    of error”) is 1.96 sqrt(phat(1-phat)/n)
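   The 95% claim can be checked by simulation. The sketch below is
    only an illustration: the values p = 0.3 and n = 500, the number of
    trials, and the function name coverage are assumptions chosen for
    the example, not part of the course material.

```python
import math
import random

def coverage(p, n, trials=4000, seed=0):
    """Fraction of simulated samples whose phat lands within
    p ± 1.96 * sqrt(p(1-p)/n)."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    margin = 1.96 * math.sqrt(p * (1 - p) / n)
    hits = 0
    for _ in range(trials):
        # One binomial sample: n independent yes/no responses,
        # each a success with probability p.
        successes = sum(rng.random() < p for _ in range(n))
        phat = successes / n
        if abs(phat - p) <= margin:
            hits += 1
    return hits / trials

# Hypothetical population proportion and sample size.
print(coverage(p=0.3, n=500))
```

   The printed fraction should be close to 0.95, though not exactly,
    because the count of successes is discrete and the simulation is
    itself random.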
Binomial Confidence intervals

   In practice, suppose we have n observations in a
    binomial setting. We can use those to compute phat
    (p remains unknown). A 95% confidence interval for
    p is

   phat ± 1.96 sqrt(phat(1-phat)/n)
   To get a 90% confidence interval, replace 1.96 with
    1.645. To get a 99% confidence interval, replace
    1.96 with 2.576. Typically large values are used, but
    you could in theory find a 50% confidence interval,
    where the coefficient is 0.674
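   A minimal sketch of the interval formula at several confidence
    levels. The coefficients are the ones quoted above; the function
    name proportion_interval and the data (40 successes out of 100)
    are hypothetical.

```python
import math

# Coefficients quoted in the text for each confidence level.
Z_COEFFICIENTS = {0.50: 0.674, 0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def proportion_interval(successes, n, level=0.95):
    """Return (low, high) for phat ± coefficient * sqrt(phat(1-phat)/n)."""
    phat = successes / n
    margin = Z_COEFFICIENTS[level] * math.sqrt(phat * (1 - phat) / n)
    return phat - margin, phat + margin

# Hypothetical data: 40 successes in 100 observations.
for level in (0.50, 0.90, 0.95, 0.99):
    low, high = proportion_interval(40, 100, level)
    print(f"{level:.0%}: ({low:.3f}, {high:.3f})")
```

   Each interval is centered at the same phat; only the width changes
    with the confidence level.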
Another example

   Does a personal phone call make students
    more likely to enroll? Suppose you sample
    200 admitted students at random and make
    a personal phone call encouraging them to
    attend your university. Of those 200, 127
    eventually enroll. Construct a 90%
    confidence interval for the proportion of
    called students who enroll.
Another example continued

   Population = all students who may receive a phone call
   Sample = the students you actually called (the 200)
   phat = 127/200 = 63.5%
   For 90% confidence, the margin of error is 1.645
    sqrt(phat(1-phat)/n) = 1.645 sqrt(0.635*0.365/200) =
    1.645 * 0.034 = 0.056.
   The 90% confidence interval is 0.635 ± 0.056, or
    between 57.9% and 69.1%
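   The arithmetic for this example can be reproduced step by step.
    The variable names below are illustrative; the numbers are the
    ones from the example.

```python
import math

n = 200          # students called
enrolled = 127   # students who eventually enrolled

phat = enrolled / n                           # sample proportion
std_error = math.sqrt(phat * (1 - phat) / n)  # sqrt(phat(1-phat)/n)
margin = 1.645 * std_error                    # 90% coefficient from the text
low, high = phat - margin, phat + margin

print(f"phat           = {phat:.3f}")
print(f"standard error = {std_error:.3f}")
print(f"margin         = {margin:.3f}")
print(f"90% interval   = {low:.3f} to {high:.3f}")
```

   Rounded to three decimals, this gives a margin of 0.056 and an
    interval from 0.579 to 0.691.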
To repeat, because it’s important

   If you want more confidence (a better chance
    of your interval containing the parameter),
    you have to increase the width of your
    interval (that’s why the coefficients increase,
    from 1.645 for 90% to 2.576 for 99%)
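   Larger sample sizes produce narrower intervals
    (more precise estimates) than smaller sample
    sizes. Because n sits under the square root,
    quadrupling n halves the margin of error.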
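   A short sketch of how the margin of error shrinks with sample
    size. The value phat = 0.5 and the sample sizes are assumptions
    picked for illustration.

```python
import math

def margin_of_error(phat, n, coefficient=1.96):
    """Margin of error: coefficient * sqrt(phat(1-phat)/n)."""
    return coefficient * math.sqrt(phat * (1 - phat) / n)

# Hypothetical sample sizes, each four times the previous one.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 4))
```

   Each fourfold increase in n cuts the margin in half, so extra
    precision gets expensive quickly.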
Normal Distributions

   So where did the 1.96, the 1.645, and the
    2.576 come from?
   Answer – the normal distribution, also known
    as a Gaussian distribution, the error curve,
    the “bell curve”, and probably others.
   In any case, the normal distribution is your
You’ve probably all seen a bell curve…
The Normal distribution is common

   Lots of real data follows a normal shape. For example:
   1) Many/most biometric measurements
    (heights, femur lengths, skull diameters, etc.)
   2) Scores on many standardized exams (IQ
    tests) are forced into a normal shape before
   3) Many quality control measurements, if you
    take the log first, have a normal shape.
When sampling from a normal

   Normal distributions are typically
    characterized by two numbers, their mean or
    “expected value” which corresponds to the
    peak, and their “standard deviation” which is
    the distance from the mean to either inflection point.
   Large standard deviations result in “spread
    out” normals. Small standard deviations
    result in “strongly peaked” distributions.
Two normals, corresponding to
different standard deviations.

   Mean=100, std.dev = 16
   Mean=100, std.dev = 4
Probabilities from a Normal

   Normal distributions have a nice property
    that, knowing the mean (μ) and standard
    deviation (σ), we can tell how much data will
    fall in any region.
   Examples – the normal distribution is
    symmetric, so 50% of the data is smaller
    than μ and 50% is larger than μ.
More Normal Probabilities

   For any normal distribution, about 68% of the data
    appears within 1 standard deviation of the
    mean (so about 68% of the data appears in
    the region μ±σ)
Yet more normal probabilities

   It is also true that about 95% of the data appears
    within 2 standard deviations of the mean, and
    about 99.7% of the data appears within 3
    standard deviations of the mean (so it’s
    VERY rare to go beyond 3 standard deviations).
   Preview of coming attractions: the more precise
    number is that 95% of the data is within 1.96
    standard deviations of the mean. That’s
    where the 1.96 comes from.
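   These percentages, and the coefficients used earlier, can be
    recovered from the standard normal distribution. The sketch below
    uses Python's statistics.NormalDist rather than SAS; the helper
    names within and coefficient are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def within(k):
    """Probability of landing within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def coefficient(level):
    """Number of standard deviations that captures the central `level`."""
    return standard_normal.inv_cdf(0.5 + level / 2)

for k in (1, 2, 3):
    print(f"within {k} sd: {within(k):.4f}")

for level in (0.50, 0.90, 0.95, 0.99):
    print(f"{level:.0%} coefficient: {coefficient(level):.3f}")
```

   The second loop reproduces 0.674, 1.645, 1.96 and 2.576 after
    rounding.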
95% within 2 standard deviations,
99.7% within 3 standard deviations
Computing more general probabilities

   Suppose you want to know how much data
    appears within 1.5 standard deviations of the
    mean, or how much data appears between
    1.3 and 1.7 standard deviations of the mean.
   Real answer – use SAS or any of several
    other programs.
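   As a sketch of what such a program does, here are both questions
    answered with Python's statistics.NormalDist (a stand-in for SAS).
    Reading "between 1.3 and 1.7 standard deviations" as one side
    above the mean versus both sides is an assumption, so both
    versions are shown.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Within 1.5 standard deviations of the mean (both sides).
within_1_5 = standard_normal.cdf(1.5) - standard_normal.cdf(-1.5)

# Between 1.3 and 1.7 standard deviations ABOVE the mean only.
one_side = standard_normal.cdf(1.7) - standard_normal.cdf(1.3)

# Between 1.3 and 1.7 standard deviations from the mean on either
# side; by symmetry this is twice the one-sided amount.
both_sides = 2 * one_side

print(f"within 1.5 sd:             {within_1_5:.4f}")
print(f"1.3 to 1.7 sd, one side:   {one_side:.4f}")
print(f"1.3 to 1.7 sd, both sides: {both_sides:.4f}")
```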
Another way

   There is another way of computing normal
    probabilities that is 1) the way it used to be
    done, back in pre-handy-computer days, 2)
    useful for understanding more about the
    normal distribution.
   The number of standard deviations an
    observation is from the mean is called the Z-
    score for that observation.
Z-score examples

   If μ=100 and σ=16 (this is true of IQ scores in
    the U.S.), then an observation X=125 is 25
    points above the mean, which corresponds
    to 25/16 = 1.5625 standard deviations above
    the mean.
   In general, the Z-score for an observation X is
    Z = (X - μ)/σ
   Observations above the mean get positive Z-
    scores, observations below the mean get
    negative Z-scores.
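   A minimal sketch of the Z-score calculation. The function name
    z_score is illustrative, and the observation X = 84 is a made-up
    value added to show a negative Z-score.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# IQ example from the text: mean 100, standard deviation 16.
print(z_score(125, 100, 16))  # 1.5625, above the mean
print(z_score(84, 100, 16))   # -1.0, below the mean (hypothetical X)
```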
Computing probabilities with Z-scores

   Fortunately, the Z-score is all you need to
    know to compute probabilities from a normal
    distribution.
   The reason is that Z-scores map directly to
    percentiles.
   For each Z-score SAS can provide the
    percentile (to be shown in lab). For example,
    if the Z-score is 1, the percentile is 84.13%. If
    the Z-score is 2.3, then the percentile is
    98.93%.
Probabilities between Z-scores

   Again, IQ scores are normally distributed with mean
    100 and standard deviation 16.
   What fraction of people have IQ scores between 90 and
    120?
   Compute the corresponding Z-scores. For 90, the Z-
    score is (90-100)/16 = -0.625. For 120, the Z-score is
    (120-100)/16 = 1.25.
   Find the corresponding percentiles (SAS). The
    percentile for Z=1.25 is 89.43%. The percentile for
    Z=(-0.625) is 26.6%.
   The amount between these is 89.43 – 26.60 =
    62.83%, so about 63% of people have IQ scores
    between 90 and 120.
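   The same calculation, sketched with Python's statistics.NormalDist
    standing in for SAS. It follows the steps above (Z-scores, then
    percentiles, then subtraction) and also checks the answer by
    asking the IQ distribution directly. The variable names are
    illustrative.

```python
from statistics import NormalDist

mean, std_dev = 100, 16         # IQ scores, as stated in the text
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Step 1: Z-scores for the two endpoints.
z_low = (90 - mean) / std_dev    # -0.625
z_high = (120 - mean) / std_dev  # 1.25

# Step 2: percentiles for each Z-score.
pct_low = standard_normal.cdf(z_low)
pct_high = standard_normal.cdf(z_high)

# Step 3: the amount between them.
fraction = pct_high - pct_low

# Cross-check without Z-scores, using the IQ distribution itself.
iq = NormalDist(mu=mean, sigma=std_dev)
direct = iq.cdf(120) - iq.cdf(90)

print(f"percentiles: {pct_low:.4f} and {pct_high:.4f}")
print(f"fraction between 90 and 120: {fraction:.4f}")
```

   Both routes give the same fraction, a little under 63%.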
Comparing observations from different
normal distributions

   The central idea is that a Z-score
    corresponds to a percentile for the distribution.
   If you have observations from multiple
    normal distributions, you can compute the Z-score
    for each observation and compare
    which has the “better” score.

   Suppose you have two students, one with a 23 on
    the ACT (mean 22 and standard deviation 3) and
    another with a 1220 on the SAT (mean 900 and
    standard deviation 250).
   The Z-score for the student with the ACT is (23-22)/3
    = 0.33 while the Z-score for the student with the SAT
    is (1220-900)/250 = 1.28.
   The student with the SAT performed much better
    (relative to peers on the exam).
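   The comparison can be sketched in a few lines. The means and
    standard deviations are the ones given above; converting the
    Z-scores to percentiles assumes both exams are normally
    distributed, and the function name z_score is illustrative.

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    """Number of standard deviations x lies from the mean."""
    return (x - mean) / std_dev

act_z = z_score(23, 22, 3)        # ACT: mean 22, standard deviation 3
sat_z = z_score(1220, 900, 250)   # SAT: mean 900, standard deviation 250

# Percentiles, assuming scores on each exam are normally distributed.
standard_normal = NormalDist()
act_pct = standard_normal.cdf(act_z)
sat_pct = standard_normal.cdf(sat_z)

print(f"ACT: Z = {act_z:.2f}, percentile = {act_pct:.1%}")
print(f"SAT: Z = {sat_z:.2f}, percentile = {sat_pct:.1%}")
```

   The SAT student sits at roughly the 90th percentile and the ACT
    student at roughly the 63rd, which matches the conclusion above.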
