Stats 845

Applied Statistics
       This course will cover:
1. Regression
  –   Non-linear Regression
  –   Multiple Regression
2. Analysis of Variance and Experimental
   Design
      The emphasis will be on:
1. Learning techniques through examples.
2. Use of common statistical packages:
  •   SPSS
  •   Minitab
  •   SAS
  •   SPlus
       What is Statistics?
It is the major mathematical tool of
scientific inference - the art of drawing
conclusions from data, data that is to some
extent corrupted by some component of
random variation (random noise).
An analogy can be drawn between
data that is affected by
random components of
variation and signals that are
corrupted by noise.
Quite often sounds that are
heard or received by some
radio receiver can be thought
of as signals with
superimposed noise.
The objective in signal theory
is to extract the signal from
the received sound (i.e.
remove the noise to the
greatest extent possible). The
same is true in data analysis.
Example A:
 Suppose we are comparing
the effect of three different
diets on weight loss.
An observation on weight loss
can be thought of as being
made up of two components:
1. A component due to the effect
   of the diet being applied to the
   subject (the signal)
2. A random component due to
   other factors affecting weight
   loss that were not considered (initial
   weight of the subject, sex of the
   subject, metabolic makeup of
   the subject): the random noise.
Note:
 random assignment of
subjects to diets will ensure
that this component is a
random effect.
Example B
In this example we are again comparing the
effect of three diets on weight gain. Subjects
are randomly divided into three groups, and
diets are randomly distributed amongst the
groups. Measurements on weight gain are
taken at the following times:
       - one month
       - two months
       - six months
       - one year
after commencement of the diet.
In addition to both the factors Time and Diet
affecting weight gain, there are two random
sources of variation (noise):

     - between subject variation and
     - within subject variation
This can be illustrated in a schematic
fashion as follows:

    Deterministic factors:
           Diet
           Time          →   Response: weight gain

    Random noise:
           within-subject variation
           between-subject variation
The circle of Research

    Questions arise about a phenomenon
      → A decision is made to collect data
      → A decision is made as to how to collect the data    (Statistics)
      → The data is collected
      → The data is summarized and analyzed                 (Statistics)
      → Conclusions are drawn from the analysis
      → (and the cycle begins again with new questions)
Notice the two points on the
circle where statistics plays
an important role:
1. The analysis of the collected data.
2. The design of a data collection procedure.
    The analysis of the collected data
• This of course is the traditional use of statistics.
• Note that if the data collection procedure is well
  thought out and well designed, the analysis step of
  the research project will be straightforward.
• Usually experimental designs are chosen with the
  statistical analysis already in mind.
• Thus the strategy for the analysis is usually
  decided upon when any study is designed.
• It is a dangerous practice to select the form
  of analysis after the data has been collected
  (the choice may favour certain predetermined
  conclusions, resulting in a considerable loss
  of objectivity).
• Sometimes, however, a decision to use a
  specific type of analysis has to be made
  after the data has been collected (it was
  overlooked at the design stage).
   The design of a data collection procedure
• The importance of statistics is quite
  often ignored at this stage.
• It is important that the data collection
  procedure will eventually result in
  answers to the research questions,
• and will result in the most
  accurate answers for the resources
  available to the research team.
• Note: the success of a research
  project should not depend on the
  answers that it comes up with but on
  the accuracy of those answers.
  Accuracy, not any particular answer,
  is the usual indicator of a valuable
  research project.
Some definitions important to Statistics
A population:
this is the complete collection of subjects
(objects) that are of interest in the study.
There may be (and frequently are) more
than one population of interest, in which
case a major objective is that of comparison.
A case (elementary sampling
unit):
This is an individual unit (subject) of the
population.
A variable:

a measurement or type of measurement
that is made on each individual case in the
population.
Types of variables
Some variables may be measured on a
numerical scale while others are
measured on a categorical scale.

The nature of the variables has a great
influence on which analysis will be used.
For Variables measured on a numerical scale
the measurements will be numbers.

Ex: Age, Weight, Systolic Blood Pressure

For Variables measured on a categorical scale
the measurements will be categories.

Ex: Sex, Religion, Heart Disease
Types of variables

In addition some variables are labeled as
dependent variables and some variables
are labeled as independent variables.
This usually depends on the objectives of
the analysis.

Dependent variables are output or
response variables while the
independent variables are the input
variables or factors.
Usually one is interested in determining
equations that describe how the dependent
variables are affected by the independent
variables
A sample:

is a subset of the population.
         Types of Samples

Different types of samples are determined
by how the sample is selected.
     Convenience Samples
In a convenience sample the subjects that
are most convenient to the researcher are
selected as objects in the sample.
This is not a very good procedure for
inferential Statistical Analysis but is
useful for exploratory preliminary work.
           Quota samples
In quota samples subjects are chosen
conveniently until quotas are met for
different subgroups of the population.
This also is useful for exploratory
preliminary work.
         Random Samples
Random samples of a given size are
selected in such a way that all possible
samples of that size have the same
probability of being selected.
Convenience Samples and Quota samples
are useful for preliminary studies. It is
however difficult to assess the accuracy
of estimates based on this type of
sampling scheme.
Sometimes, however, one has to be
satisfied with a convenience sample and
assume that it is equivalent to a random
sampling procedure.
A population statistic
(parameter):

Any quantity computed from the values
of variables for the entire population.
A sample statistic:


Any quantity computed from the values
of variables for the cases in the sample.
Statistical Decision Making
• Almost all problems in statistics
  can be formulated as a problem of
  making a decision.
• That is, given some data observed
  from some phenomenon, a decision
  will have to be made about that
  phenomenon.
Decisions are generally broken
into two types:

• Estimation decisions
and
• Hypothesis Testing decisions.
Probability Theory plays a very
important role in these decisions
and the assessment of error made
by these decisions
Definition:

 A random variable X is a
 numerical quantity that is
 determined by the outcome of a
 random experiment
Example :

 An individual is selected at
 random from a population
 and
 X = the weight of the individual
The probability distribution of a
(continuous) random variable is
described by:
 its probability density curve f(x),
i.e. a curve which has the
following properties:
• 1. f(x) is always positive.
• 2. The total area under the curve f(x) is
  one.
• 3. The area under the curve f(x) between
  a and b is the probability that X lies
  between the two values.
[Figure: a probability density curve f(x), plotted over the range 0 to 120.]
Examples of some important
  Univariate distributions
1.The Normal distribution
A common probability density curve is the “Normal”
density curve - symmetric and bell shaped
Comment: If μ = 0 and σ = 1 the distribution is
called the standard normal distribution.
[Figure: two Normal density curves, one with μ = 50 and σ = 15, the other with μ = 70 and σ = 20.]
                     xm     2

          1               2
f(x)        e        2s
         2s
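As a quick concrete check, this density can be evaluated directly. A minimal sketch, assuming Python with numpy and scipy (not one of the course packages listed earlier):

import numpy as np
from scipy.stats import norm

mu, sigma = 50, 15   # the first curve in the figure above
x = 60.0

# density computed directly from the formula
f_manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

# density from scipy's built-in implementation
f_scipy = norm.pdf(x, loc=mu, scale=sigma)

print(f_manual, f_scipy)   # both approximately 0.0213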
2. The Chi-squared distribution
with ν degrees of freedom

f(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\, x^{(\nu-2)/2}\, e^{-x/2}  \quad  \text{if } x \ge 0
[Figure: the chi-squared density curve, plotted for x from 0 to 14.]
Comment: If z1, z2, ..., zν are
independent random variables each
having a standard normal distribution,
then

U = z_1^2 + z_2^2 + \cdots + z_\nu^2

has a chi-squared distribution with ν
degrees of freedom.
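This fact is easy to verify by simulation. A sketch, assuming Python with numpy (the value of ν and the number of replications are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
nu, reps = 5, 100_000
z = rng.standard_normal((reps, nu))
u = (z ** 2).sum(axis=1)   # U = z1^2 + ... + z_nu^2 for each replication

# a chi-squared variable with nu degrees of freedom has mean nu and variance 2*nu
print(u.mean(), u.var())   # close to 5 and 10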
3. The F distribution with
ν1 degrees of freedom in the
numerator and ν2 degrees of
freedom in the denominator

f(x) = K\, x^{(\nu_1 - 2)/2} \left(1 + \frac{\nu_1}{\nu_2}\,x\right)^{-(\nu_1 + \nu_2)/2}  \quad  \text{if } x \ge 0

where

K = \frac{\Gamma\!\left(\frac{\nu_1 + \nu_2}{2}\right)\left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2}}{\Gamma\!\left(\frac{\nu_1}{2}\right)\Gamma\!\left(\frac{\nu_2}{2}\right)}
[Figure: the F density curve, plotted for x from 0 to 6.]
Comment: If U1 and U2 are independent
random variables having chi-squared
distributions with ν1 and ν2 degrees of
freedom respectively, then

F = \frac{U_1/\nu_1}{U_2/\nu_2}

has an F distribution with ν1 degrees of
freedom in the numerator and ν2 degrees of
freedom in the denominator.
4. The t distribution with ν
degrees of freedom

f(x) = K \left(1 + \frac{x^2}{\nu}\right)^{-(\nu + 1)/2}

where

K = \frac{\Gamma\!\left(\frac{\nu + 1}{2}\right)}{\sqrt{\nu\pi}\;\Gamma\!\left(\frac{\nu}{2}\right)}
[Figure: the t density curve, plotted for x from -4 to 4.]
Comment: If z and U are independent
random variables, and z has a standard
Normal distribution while U has a Chi-
squared distribution with ν degrees of
freedom, then

t = \frac{z}{\sqrt{U/\nu}}

has a t distribution with ν degrees of
freedom.
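This construction can also be checked by simulation. A sketch under the same assumptions (Python with numpy and scipy):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
nu, reps = 10, 100_000
z = rng.standard_normal(reps)
u = rng.chisquare(nu, reps)        # drawn independently of z
t = z / np.sqrt(u / nu)

# compare the simulated values with the t(nu) distribution
print(stats.kstest(t, stats.t(df=nu).cdf))   # a large p-value is expected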
•    An Applet showing critical values and tail
     probabilities for various distributions:
1.   Standard Normal
2.   t distribution
3.   Chi-square distribution
4.   Gamma distribution
5.   F distribution
The Sampling distribution
      of a statistic
A random sample from a probability
distribution with density function
f(x) is a collection of n independent
random variables, x1, x2, ..., xn, with a
probability distribution described by
f(x).
If for example we collect a random
sample of individuals from a population
and
   – measure some variable X for each of
     those individuals,
   – the n measurements x1, x2, ...,xn will
     form a set of n independent random
     variables with a probability distribution
     equivalent to the distribution of X across
     the population.
A statistic T is any quantity
computed from the random
observations x1, x2, ...,xn.
• Any statistic will necessarily be
  also a random variable and
  therefore will have a probability
  distribution described by some
  probability density function fT(t).
• This distribution is called the
  sampling distribution of the
  statistic T.
• This distribution is very important if one is
  using this statistic in a statistical analysis.
• It is used to assess the accuracy of a
  statistic if it is used as an estimator.
• It is used to determine thresholds for
  acceptance and rejection if it is used for
  Hypothesis testing.
Some examples of Sampling
 distributions of statistics
Distribution of the sample mean for a
sample from a Normal population

Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard
deviation σ, and let

\bar{x} = \frac{\sum_i x_i}{n}

Then \bar{x} has a normal sampling distribution with mean

\mu_{\bar{x}} = \mu

and standard deviation

\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}

[Figure: the sampling distribution of the mean, plotted over the range 0 to 100.]
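This result is easy to see in simulation. A sketch assuming Python with numpy (the parameter values echo the normal curves pictured earlier):

import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 50, 15, 25, 100_000
xbar = rng.normal(mu, sigma, (reps, n)).mean(axis=1)   # many sample means

print(xbar.mean())   # close to mu = 50
print(xbar.std())    # close to sigma/sqrt(n) = 3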
     Distribution of the z statistic

Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard deviation σ,
and let

z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}

Then z has a standard normal distribution.
Comment:

 Many statistics T have a normal distribution
 with mean μ_T and standard deviation σ_T.
 Then

z = \frac{T - \mu_T}{\sigma_T}

 will have a standard normal distribution.
  Distribution of the χ² statistic for
          the sample variance
Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard deviation σ.
Let

s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}   (the sample variance)

and

s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}   (the sample standard deviation)

Let

\chi^2 = \frac{\sum_i (x_i - \bar{x})^2}{\sigma^2} = \frac{(n-1)\,s^2}{\sigma^2}

Then χ² has a chi-squared distribution with
ν = n - 1 degrees of freedom.
[Figure: the chi-squared density curve, plotted for x from 0 to 24.]
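A short simulation sketch of this result, again assuming Python with numpy (values are illustrative):

import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 50, 15, 10, 100_000
x = rng.normal(mu, sigma, (reps, n))
s2 = x.var(axis=1, ddof=1)        # sample variance with the n-1 divisor
chi2 = (n - 1) * s2 / sigma ** 2

# chi-squared with nu = n-1 df has mean n-1 and variance 2(n-1)
print(chi2.mean(), chi2.var())    # close to 9 and 18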
     Distribution of the t statistic
Let x1, x2, ..., xn be a sample from a normal
population with mean μ and standard deviation σ,
and let

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

Then t has Student's t distribution with ν = n - 1
degrees of freedom.
Comment:
 If an estimator T has a normal distribution with
 mean μ_T and standard deviation σ_T, and if s_T is
 an estimator of σ_T based on ν degrees of freedom,
 then

t = \frac{T - \mu_T}{s_T}

 will have Student's t distribution with ν degrees of
 freedom.

[Figure: the t distribution compared with the standard normal distribution.]
           Point estimation
• A statistic T is called an estimator of the
  parameter θ if its value is used as an
  estimate of the parameter θ.
• The performance of an estimator T will be
  determined by how “close” the sampling
  distribution of T is to the parameter, θ,
  being estimated.
• An estimator T is called an unbiased
  estimator of θ if μ_T, the mean of the
  sampling distribution of T, satisfies μ_T = θ.
• This implies that in the long run the average
  value of T is θ.
• An estimator T is called the Minimum
  Variance Unbiased estimator of θ if T is an
  unbiased estimator and it has the smallest
  standard error σ_T amongst all unbiased
  estimators of θ.
• If the sampling distribution of T is normal,
  the standard error of T is extremely
  important. It completely describes the
  variability of the estimator T.
          Interval Estimation
          (confidence intervals)
• Point estimators give only single values as
  an estimate. There is no indication of the
  accuracy of the estimate.
• The accuracy can sometimes be measured
  and shown by displaying the standard error
  of the estimate.
• There is however a better way.
• Using the idea of confidence interval
  estimates
• The unknown parameter is estimated with a
  range of values that have a given probability
  of capturing the parameter being estimated.
   Confidence Intervals
• The interval T_L to T_U is called a
  (1 - α) × 100% confidence interval for the parameter
  θ if the probability that θ lies in the range
  T_L to T_U is equal to 1 - α.
• Here T_L and T_U are
  – statistics
  – random numerical quantities calculated from
    the data.
                      Examples
Confidence interval for the mean of a Normal population
(based on the z statistic).

T_L = \bar{x} - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}  \quad\text{to}\quad  T_U = \bar{x} + z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}

is a (1 - α) × 100% confidence interval for μ, the mean of a
normal population.
Here z_{α/2} is the upper α/2 × 100% percentage point of the
standard normal distribution.
More generally, if T is an unbiased estimator of the parameter
θ and has a normal sampling distribution with known
standard error σ_T, then

T_L = T - z_{\alpha/2}\,\sigma_T  \quad\text{to}\quad  T_U = T + z_{\alpha/2}\,\sigma_T

is a (1 - α) × 100% confidence interval for θ.
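A sketch of the z-based interval in code, assuming Python with numpy and scipy (the data and the "known" σ below are made up for illustration):

import numpy as np
from scipy.stats import norm

x = np.array([52.1, 48.3, 55.0, 49.7, 51.2])   # made-up sample
sigma = 15.0                                   # treated as known
alpha = 0.05

z = norm.ppf(1 - alpha / 2)                    # upper alpha/2 point, about 1.96
half = z * sigma / np.sqrt(len(x))
print(x.mean() - half, x.mean() + half)        # T_L and T_U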
Confidence interval for the mean of a Normal
population
(based on the t statistic).

T_L = \bar{x} - t_{\alpha/2}\,\frac{s}{\sqrt{n}}  \quad\text{to}\quad  T_U = \bar{x} + t_{\alpha/2}\,\frac{s}{\sqrt{n}}

is a (1 - α) × 100% confidence interval for μ, the
mean of a normal population.
Here t_{α/2} is the upper α/2 × 100% percentage point
of the Student's t distribution with ν = n - 1 degrees of
freedom.
More generally, if T is an unbiased estimator of the parameter
θ and has a normal sampling distribution with estimated
standard error s_T, based on ν degrees of freedom, then

T_L = T - t_{\alpha/2}\,s_T  \quad\text{to}\quad  T_U = T + t_{\alpha/2}\,s_T

is a (1 - α) × 100% confidence interval for θ.
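When σ must be estimated, the t-based interval applies; scipy can compute it directly. A sketch under the same assumptions, with made-up data:

import numpy as np
from scipy import stats

x = np.array([52.1, 48.3, 55.0, 49.7, 51.2])   # made-up sample
lo, hi = stats.t.interval(0.95, df=len(x) - 1,
                          loc=x.mean(),
                          scale=stats.sem(x))  # sem(x) = s/sqrt(n)
print(lo, hi)                                  # T_L and T_U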
               Common Confidence intervals

Situation: Sample from the Normal distribution with unknown
mean and known variance (estimating μ) (n large)
  Interval:  \bar{x} \pm z_{\alpha/2}\,\sigma_0/\sqrt{n}

Situation: Sample from the Normal distribution with unknown
mean and unknown variance (estimating μ) (n small)
  Interval:  \bar{x} \pm t_{\alpha/2}\,s/\sqrt{n}

Situation: Estimation of a binomial probability p
  Interval:  \hat{p} \pm z_{\alpha/2}\,\sqrt{\hat{p}(1-\hat{p})/n}

Situation: Two independent samples from the Normal
distribution with unknown means and known variances
(estimating μ1 - μ2) (n, m large)
  Interval:  \bar{x} - \bar{y} \pm z_{\alpha/2}\,\sqrt{\sigma_x^2/n + \sigma_y^2/m}

Situation: Two independent samples from the Normal
distribution with unknown means and unknown but equal
variances (estimating μ1 - μ2) (n, m small)
  Interval:  \bar{x} - \bar{y} \pm t_{\alpha/2}\,s_{\text{Pooled}}\,\sqrt{1/n + 1/m}

Situation: Estimation of the difference between two binomial
probabilities, p1 - p2
  Interval:  \hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\,\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}
Multiple Confidence intervals

In many situations one is interested in estimating not
only a single parameter, θ, but a collection of
parameters, θ1, θ2, θ3, ... .

A collection of intervals, T_L1 to T_U1, T_L2 to T_U2, T_L3
to T_U3, ..., is called a set of (1 - α) × 100% multiple
confidence intervals if the probability that all the
intervals capture their respective parameters is 1 - α.
         Hypothesis Testing
• Another important area of statistical
  inference is that of Hypothesis Testing.
• In this situation one has a statement
  (Hypothesis) about the parameter(s) of the
  distributions being sampled and one is
  interested in deciding whether the statement
  is true or false.
• In fact there are two hypotheses
  – The Null Hypothesis (H0) and
  – the Alternative Hypothesis (HA).
• A decision will be made either to
  – Accept H0 (Reject HA) or to
  – Reject H0 (Accept HA).
• The following table gives the different
  possibilities for the decision and the
  different possibilities for the correctness of
  the decision

                     Accept H0   Reject H0

             H0      Correct      Type I
           is true   Decision      error
             H0      Type II     Correct
          is false    error      Decision
• Type I error - The Null Hypothesis H0 is
  rejected when it is true.
• The probability that a decision procedure
  makes a type I error is denoted by α, and is
  sometimes called the significance level of
  the test.
• Common significance levels that are used
  are α = .05 and α = .01.
• Type II error - The Null Hypothesis H0 is
  accepted when it is false.
• The probability that a decision procedure
  makes a type II error is denoted by β.
• The probability 1 - β is called the Power of
  the test and is the probability that the
  decision procedure correctly rejects a false
  Null Hypothesis.
A statistical test is defined by
• 1. Choosing a statistic for making the
  decision to Accept or Reject H0. This
  statistic is called the test statistic.
• 2. Dividing the set of possible values of
  the test statistic into two regions - an
  Acceptance Region and a Critical Region.
• If upon collection of the data and evaluation
  of the test statistic, its value lies in the
  Acceptance Region, a decision is made to
  accept the Null Hypothesis H0.
• If upon collection of the data and evaluation
  of the test statistic, its value lies in the
  Critical Region, a decision is made to reject
  the Null Hypothesis H0.
• The probability of a type I error, α, is
  usually set at a predefined level by choosing
  the critical thresholds (boundaries between
  the Acceptance and Critical Regions)
  appropriately.
• The probability of a type II error, β, is
  decreased (and the power of the test, 1 - β,
  is increased) by:
1. Choosing the “best” test statistic.
2. Selecting the most efficient experimental
   design.
3. Increasing the amount of information
   (usually by increasing the sample sizes
   involved) on which the decision is based.
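Putting these pieces together: a minimal sketch of a two-tailed z-test of H0: μ = μ0 with known σ, assuming Python with numpy and scipy (the function name and the data are illustrative, not part of the notes):

import numpy as np
from scipy.stats import norm

def z_test(x, mu0, sigma, alpha=0.05):
    """Two-tailed z-test of H0: mu = mu0 with known sigma."""
    z = np.sqrt(len(x)) * (x.mean() - mu0) / sigma   # test statistic
    crit = norm.ppf(1 - alpha / 2)                   # critical threshold
    return "reject H0" if abs(z) > crit else "accept H0"

x = np.array([55.2, 61.4, 49.8, 58.0, 52.6])         # made-up data
print(z_test(x, mu0=50, sigma=10))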
              Some common Tests

Situation: Sample from the Normal distribution with unknown
mean and known variance (testing μ) (n large)
  Test statistic:  z = \sqrt{n}\,(\bar{x} - \mu_0)/\sigma
  H0: μ = μ0
  HA: μ ≠ μ0    Critical region: z < -z_{α/2} or z > z_{α/2}
  HA: μ > μ0    Critical region: z > z_α
  HA: μ < μ0    Critical region: z < -z_α

Situation: Sample from the Normal distribution with unknown
mean and unknown variance (testing μ) (n small)
  Test statistic:  t = \sqrt{n}\,(\bar{x} - \mu_0)/s
  H0: μ = μ0
  HA: μ ≠ μ0    Critical region: t < -t_{α/2} or t > t_{α/2}
  HA: μ > μ0    Critical region: t > t_α
  HA: μ < μ0    Critical region: t < -t_α

Situation: Testing a binomial probability p
  Test statistic:  z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n}
  H0: p = p0
  HA: p ≠ p0    Critical region: z < -z_{α/2} or z > z_{α/2}
  HA: p > p0    Critical region: z > z_α
  HA: p < p0    Critical region: z < -z_α

Situation: Two independent samples from the Normal distribution
with unknown means and known variances (testing μ1 - μ2)
(n, m large)
  Test statistic:  z = (\bar{x} - \bar{y})/\sqrt{\sigma_x^2/n + \sigma_y^2/m}
  H0: μ1 = μ2
  HA: μ1 ≠ μ2   Critical region: z < -z_{α/2} or z > z_{α/2}
  HA: μ1 > μ2   Critical region: z > z_α
  HA: μ1 < μ2   Critical region: z < -z_α

Situation: Two independent samples from the Normal distribution
with unknown means and unknown but equal variances
(testing μ1 - μ2) (n, m small)
  Test statistic:  t = (\bar{x} - \bar{y})/(s_{\text{Pooled}}\,\sqrt{1/n + 1/m})
  H0: μ1 = μ2
  HA: μ1 ≠ μ2   Critical region: t < -t_{α/2} or t > t_{α/2}
  HA: μ1 > μ2   Critical region: t > t_α
  HA: μ1 < μ2   Critical region: t < -t_α

Situation: Testing the difference between two binomial
probabilities, p1 - p2
  Test statistic:  z = (\hat{p}_1 - \hat{p}_2)/\sqrt{\hat{p}(1-\hat{p})\,(1/n_1 + 1/n_2)}
  (here \hat{p} is the estimate of the common value of p1 and p2 under H0)
  H0: p1 = p2
  HA: p1 ≠ p2   Critical region: z < -z_{α/2} or z > z_{α/2}
  HA: p1 > p2   Critical region: z > z_α
  HA: p1 < p2   Critical region: z < -z_α
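As an illustration, the pooled-variance two-sample test in the fifth row of the table can be run with scipy. A sketch (Python assumed; the weight-gain data are made up):

import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 6.0, 5.5, 5.2])   # made-up weight gains, diet 1
y = np.array([4.2, 4.9, 4.1, 4.6, 4.4])   # made-up weight gains, diet 2

t, p = stats.ttest_ind(x, y, equal_var=True)   # pooled s, as in the table
print(t, p)                                    # reject H0 if p < alpha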
The p-value approach to
  Hypothesis Testing
 In hypothesis testing we need

       1. A test statistic
       2. A Critical and Acceptance region
          for the test statistic

The Critical Region is set up under the
sampling distribution of the test statistic,
with area α (0.05 or 0.01) above the critical
region. The critical region may be one-tailed
or two-tailed.
                   The Critical region:

[Figure: the standard normal curve with area α/2 in each tail;
the Reject H0 regions lie below -z_{α/2} and above z_{α/2},
with the Accept H0 region between them.]

P[Accept H0 when true] = P[-z_{α/2} ≤ z ≤ z_{α/2}] = 1 - α
P[Reject H0 when true] = P[z < -z_{α/2} or z > z_{α/2}] = α
The test is carried out by:
      1. Computing the value of the test
         statistic
      2. Making the decision:
         a. Reject if the value is in the Critical
            region, and
         b. Accept if the value is in the
            Acceptance region.
The value of the test statistic may be in the
Acceptance region but close to being in the
Critical region, or it may be in the Critical
region but close to being in the Acceptance
region.

To measure this we compute the p-value.
 Definition – Once the test statistic has been
 computed from the data, the p-value is defined
 to be:

p-value = P[the test statistic is as or more
            extreme than the observed value of
            the test statistic]

where “more extreme” means giving stronger
evidence for rejecting H0.
Example – Suppose we are using the z-test for the
mean μ of a normal population and α = 0.05.
z_{0.025} = 1.960
Thus the critical region is to reject H0 if
     z < -1.960 or z > 1.960.
Suppose z = 2.3; then we reject H0.

p-value = P[the test statistic is as or more extreme than
       the observed value of the test statistic]
        = P[z > 2.3] + P[z < -2.3]
       = 0.0107 + 0.0107 = 0.0214
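The same arithmetic in code, a sketch assuming Python with scipy:

from scipy.stats import norm

z = 2.3
p_value = 2 * (1 - norm.cdf(abs(z)))   # P[z > 2.3] + P[z < -2.3]
print(round(p_value, 4))               # 0.0214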
[Figure: the standard normal curve with the two-tailed
p-value shaded beyond -2.3 and 2.3.]
If the value of z = 1.2, then we accept H0.

p-value = P[the test statistic is as or more extreme than
       the observed value of the test statistic]
        = P[z > 1.2] + P[z < -1.2]
       = 0.1151 + 0.1151 = 0.2302
There is a 23.02% chance that the test statistic is as or
more extreme than 1.2. This is fairly high; hence 1.2 is
not very extreme.
[Figure: the standard normal curve with the two-tailed
p-value shaded beyond -1.2 and 1.2.]
          Properties of the p-value
1. If the p-value is small (<0.05 or 0.01) H0 should be
   rejected.
2. The p-value measures the plausibility of H0.
3. If the test is two tailed the p-value should be two
   tailed.
4. If the test is one tailed the p-value should be one
   tailed.
5. It is customary to report p-values when reporting
   the results. This gives the reader some idea of the
   strength of the evidence for rejecting H0.
           Multiple testing
Quite often one is interested in performing a
collection (family) of tests of hypotheses:
1. H0,1 versus HA,1.
2. H0,2 versus HA,2.
3. H0,3 versus HA,3.
etc.
• Let α* denote the probability that at least one type
  I error is made in the collection of tests that are
  performed.
• The value of α*, the family type I error rate, can
  be considerably larger than α, the type I error rate
  of each individual test.
• The value of the family error rate, α*, can be
  controlled by altering the thresholds of each
  individual test appropriately (a sketch of one such
  adjustment follows this list).
• A testing procedure of this nature is called a
  Multiple testing procedure.
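One standard threshold adjustment is the Bonferroni correction, in which each of k tests is run at level α/k, so that the family error rate α* is at most α. The notes do not single out a particular adjustment, so this sketch names Bonferroni only as a common example (plain Python; the p-values are made up):

p_values = [0.012, 0.049, 0.003, 0.21]   # made-up p-values from k tests
alpha = 0.05
k = len(p_values)

for i, p in enumerate(p_values, 1):
    # Bonferroni: compare each p-value with alpha/k instead of alpha
    decision = "reject H0" if p < alpha / k else "accept H0"
    print(f"test {i}: p = {p:.3f} -> {decision}")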
   A chart illustrating Statistical Procedures

                                            Independent variables

Dependent            Categorical                    Continuous                Continuous &
variables                                                                    Categorical

Categorical   Multiway frequency analysis     Discriminant analysis    Discriminant analysis
              (log-linear model)

Continuous    ANOVA (single dep. var.)        Multiple regression      ANACOVA (single dep. var.)
              MANOVA (mult. dep. var.)        (single dep. var.)       MANACOVA (mult. dep. var.)
                                              Multivariate multiple
                                              regression
                                              (multiple dep. var.)

Continuous &             ??                            ??                        ??
Categorical
Next topic: Fitting equations to data