
                        Design of experiment I
•   Motivations
•   Factorial (crossed) design
•   Interactions
•   Block design
•   Nested design
                                         Examples
If we have two samples, then under mild conditions we can use the t-test to test whether the
       difference between the means is significant. When there are more than two samples,
       using pairwise t-tests may become unreliable.
An example: suppose we want to test the effect of various exercises on weight loss. We want
       to test 5 different exercises. We recruit 20 men and assign four of them to each
       exercise. After a few weeks we record the weight loss. Let us denote by i=1,2,3,4,5 the
       exercise number and by j=1,2,3,4 the person's number. Then Yij is the weight loss for
       the jth person on the ith exercise programme. This is a one-way balanced design:
       one-way because we have only one category (exercise programme), and balanced because
       we have exactly the same number of men on each exercise programme.
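The one-way balanced design above can be sketched in R as follows; the weight-loss numbers are simulated purely for illustration.

```r
# One-way balanced design: 5 exercise programmes, 4 men per programme.
# The group means below are made up for the sake of the example.
set.seed(1)
exercise <- factor(rep(1:5, each = 4))   # i = 1..5 programmes
loss <- rnorm(20, mean = rep(c(2.0, 3.0, 2.5, 4.0, 3.5), each = 4), sd = 0.5)
fit <- lm(loss ~ exercise)
anova(fit)   # one-way ANOVA: tests equality of the five programme means
```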
Another example: now we want to subdivide each exercise into 4 subcategories. For each
       subcategory of the exercise we recruit four men. We measure the weight loss after a
       few weeks - Yijk, where:
       i – exercise category
       j – exercise subcategory
       k – kth man.
Then Yijk is the weight loss for the kth man in the jth subcategory of the ith category. The
       number of observations is 5x4x4 = 80. This is a two-fold nested design.
We want to test: a) there are no significant differences between categories; b) there are no
       significant differences between subcategories. This is a two-fold nested ANOVA.
                                            Examples
We have 5 categories of exercises and 4 categories of diets. We recruit 4 persons for each
        exercise-diet combination, so there will be 5x4x4=80 men. This is a two-way crossed
        design: two-way because we have categorised the men in two ways, by exercise and by
        diet. This model is also balanced: we have exactly the same number of men for each
        exercise-diet combination.
       i – exercise number
       j – diet number
       k – kth person
       Yijk – weight loss for the kth person on the ith exercise and jth diet.
In this case we can have two different types of hypothesis tests. Assume that the mean for
        each exercise-diet combination is μij. If we assume that the model is additive, i.e.
        the effects of exercise and diet add up, then we have μij = μ + αi + βj, where αi is
        the effect of the ith exercise and βj is the effect of the jth diet. Then we want to
        test the following hypotheses: a) μij does not depend on the exercise and b) μij does
        not depend on the diet.
Sometimes we do not want to assume additivity. Then we want to test one more hypothesis:
        that the model is additive. If the model is not additive, there might be problems in
        interpreting the other hypotheses. In this case it might be useful to apply a
        transformation to make the model additive.
Models used for ANOVA can be made more and more complicated. We can design three- or
        four-way crossed models or nested models, and we can combine nested and crossed
        models together. The number of possible ANOVA models is very large.
      Factors, levels, responses and randomisation

Factors: a variable whose effect we want to study is called a factor. For example,
   exercise is a factor and diet is another factor. In medical experiments where the effect
   of some drug is of interest, the drug is a factor.
Levels: several values of a factor are usually selected, and these values are called
   levels. For example, the different exercise types are levels of the exercise factor.
Response: the result of the experiment or observation.
Randomisation: individuals or subjects for experiments are chosen randomly. I.e.
   if we have n combinations of factor levels and we need k replications of each
   combination, then we take nk individuals randomly (which can be tricky in many
   situations) and each of them is assigned to a factor combination randomly, with
   uniform probability.
Orthogonality: if we fix a level of one factor and sum over the levels of the other
   factors, and the effects of the other factors sum to their average, then the design
   is called orthogonal. Orthogonal designs make analysis of the results simpler.
                              Design of Experiment
In general, the design of an experiment should be done carefully. The usual considerations
    for design are:

•   First the purpose of the experiment should be understood carefully
     – What is compared against what? Do you know anything about the system under study? Are
       there known variations between different subjects of the system (either the factors you
       are comparing or the subjects of the experiment)? What would you expect as a result of
       the experiment?
•   Randomisation
     – Care should be exercised when carrying out experiments. If this is not done with care,
       there could be variation or similarity simply because of the subjects of the study. For
       example, by studying people in one city you cannot draw conclusions about the whole world.
•   Blocking
     – To remove variations due to known or suspected effects you may do experiments in blocks.
•   Replication
     – Since every experiment has its own error, it is always a good idea to replicate each
       experiment. If this is done with care, the effects of random fluctuations due to the
       experiment can be reduced.
                             Factorial design
Let us assume that we have two factors and for each of them we set three levels; for
   example, three exercises and three diets. Exercise is one factor and the different
   exercises are levels of this factor; diet is the second factor and the different diets
   are levels of this factor. The number of all combinations is 3x3=9. This is a two-way
   crossed, or three-level two-factor, factorial design.

                           Ex1               Ex2               Ex3

          Diet1            Observations      Observations      Observations

          Diet2            Observations      Observations      Observations

          Diet3            Observations      Observations      Observations
                        Factorial design: model
The hypothesis that different levels of the factors have the same effect can be tested
   using a linear model and ANOVA. Let us consider the model for this case:
Y = Xb + e
where b = (m, b1, b2) and X is (for a two-factor, two-level experiment with two
   replications, if we use the formula ~f1+f2 in R, where f1 and f2 are the two factors):
           1 0 0
           1 0 0
           1 1 0
           1 1 0
           1 0 1
           1 0 1
           1 1 1
           1 1 1
This means that for the first two observations y = m, for the third and fourth
    y = m+b1, for the fifth and sixth y = m+b2, and for the last two y = m+b1+b2. If we
    use the lm command of R, the intercept estimates the mean response at the first
    levels of both factors, the coefficient of f1 estimates the difference between the
    effects of the two levels of factor 1, and the coefficient of f2 estimates the
    difference between the effects of the two levels of factor 2.
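The design matrix shown above can be reproduced directly in R with `model.matrix`, using the same formula `~f1+f2` (treatment contrasts, R's default):

```r
# Two two-level factors, two replications per combination, as in the slide.
f1 <- factor(c(0, 0, 1, 1, 0, 0, 1, 1))
f2 <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))
X <- model.matrix(~ f1 + f2)
X   # columns: intercept m, factor-1 contrast b1, factor-2 contrast b2
```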
                         Factorial design: model

If we want to estimate the mean values of the effects of each level directly, then we can
    use the formula ~f1+f2-1. The resulting matrix will look like:
          1 0 0
          1 0 0
          0 1 0
          0 1 0
          1 0 1
          1 0 1
          0 1 1
          0 1 1
which means that the effects of the levels of the first factor are estimated directly
    (the second factor is still parameterised as a difference between its levels).
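The same factors with the intercept dropped reproduce this matrix:

```r
# Formula ~f1+f2-1: removing the intercept turns the f1 columns into
# direct indicators of its two levels.
f1 <- factor(c(0, 0, 1, 1, 0, 0, 1, 1))
f2 <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))
X <- model.matrix(~ f1 + f2 - 1)
X   # columns: f1 level 1, f1 level 2, f2 contrast
```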
                    Factorial design: interactions
If we want to analyse interactions between the two factors, then we can fit the model:
yijk = μ + x1i β1 + x2j β2 + x1i x2j β12 + εijk
The matrix used in lm with the formula ~f1*f2 (shorthand for ~f1+f2+f1:f2) will look like
    (when there are two levels):
     1 0 0       0
     1 0 0       0
     1 1 0       0
     1 1 0       0
     1 0 1       0
     1 0 1       0
     1 1 1       1
     1 1 1       1
Note that, as expected, the final column of the matrix is the product of the second and
    third columns.
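This product structure can be checked directly in R:

```r
# Interaction model ~f1*f2: the interaction column is the elementwise
# product of the two main-effect columns.
f1 <- factor(c(0, 0, 1, 1, 0, 0, 1, 1))
f2 <- factor(c(0, 0, 0, 0, 1, 1, 1, 1))
X <- model.matrix(~ f1 * f2)
all(X[, 4] == X[, 2] * X[, 3])   # TRUE
```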
                              Factorial design
Usually factorial design is used for two-level experiments: for each factor two levels
  are selected, and experiments are run for all possible combinations of the levels.
  Each combination is also replicated (say nr times). Then for an n-factor two-level
  factorial design we need nr·2^n experiments in total. As soon as the number of factors
  becomes more than three or four, the number of experiments needed for this type of
  design becomes extremely large.
                        Fractional design
If we increase the number of factors, the number of experiments for a full factorial
   design increases dramatically. For example, with 4 factors at two levels each we need
   2^4 = 16 experiments, and with 8 factors we need 2^8 = 256. We also need to replicate
   each experiment, and we need test runs. Taking all of this into account, the number of
   experiments can become very large and resources may not be sufficient to perform all of
   them. To avoid this explosion in the number of experiments, a fraction of the full
   factorial design is used.
An example of a fractional design: assume that we have three factors and for each of them
   we use two levels (denote the levels -1 and 1). Then the number of experiments is 8:
   (-1,-1,-1), (-1,-1,1), (-1,1,-1), (-1,1,1), (1,-1,-1), (1,-1,1), (1,1,-1), (1,1,1).
   If we take (-1,-1,-1), (-1,1,1), (1,1,-1), (1,-1,1), then each level of each factor
   occurs in exactly the same number of experiments. If we also use the same number of
   replications for each combination of factors, then we have an orthogonal design.
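The half fraction above can be constructed and checked in R; note that the four runs listed are exactly those whose level product is -1.

```r
# Full 2^3 design, then the half fraction whose level product is -1
# (the defining relation for this fraction).
full <- expand.grid(A = c(-1, 1), B = c(-1, 1), C = c(-1, 1))
half <- as.matrix(full[full$A * full$B * full$C == -1, ])
half
colSums(half)      # all zero: each level of each factor occurs twice
crossprod(half)    # off-diagonal entries are zero: the design is orthogonal
```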
                                  Block design
In many cases, before designing the experiment we know that there are some parameters that
   affect the result of the experiment in a systematic way. For example, if we are
   interested in the effect of some substance on tree growth, then we know that factors
   such as the time of year (e.g. summer vs winter) or the location where the tree grows
   (e.g. England vs France) will have some systematic effect. If we planted trees randomly
   over time and location, the variation due to time and location would mask the effect of
   the substances we want to study. In the previous lectures, in the case of tests of two
   means, we mentioned that a paired design (see the shoes example) would increase the
   signal substantially. Paired design is a special case of a more general
   design-of-experiment technique - block design.
When we use a block design, the block becomes one or more additional factors. For example,
   if we want to remove variation due to location, we can repeat exactly the same
   experiment at different locations.
If we have one factor with 4 levels and one blocking parameter with 3 levels, then for a
   randomised block design we would have a 4x3 layout for the experiment. Randomisation
   takes place within blocks.
General rule: block whatever you can, randomise the rest.
Blocking should be done with care. If the effects of the block factors are not as expected,
   the results will be less reliable.
                           Latin square designs
Latin squares are kxk squares in which, in each column and each row, the numbers from 1
   to k appear exactly once. For example, a 3x3 Latin square could look like:

                   1       3        2
                   2       1        3
                   3       2        1
The columns and rows are the levels of factors one and two, and the values in the cells are
   the levels of the third factor.
There are higher-level extensions of Latin squares: for three dimensions Graeco-Latin
   squares and for four dimensions hyper-Graeco-Latin squares.
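One simple way to build a kxk Latin square is by cyclic shifts of 1..k; a sketch (the 3x3 square above is of this cyclic type, up to relabelling):

```r
# Build a k x k Latin square by cyclic shifts: cell (i, j) = (i + j) mod k + 1.
latin_square <- function(k) outer(0:(k - 1), 0:(k - 1),
                                  function(i, j) (i + j) %% k + 1)
ls3 <- latin_square(3)
ls3
# every row and every column contains each of 1..k exactly once
all(apply(ls3, 1, function(r) setequal(r, 1:3)))   # TRUE
```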
                                 Interactions

If we are testing two or more factors, then it is important to consider the interaction
    terms first. So the first question we should ask is whether there is an interaction
    between the factors. As mentioned above, the model for the two-factor case will be:
yijk = μ + x1i β1 + x2j β2 + x1i x2j β12 + εijk
If we reject the hypothesis that "there is no interaction", then interpretation of the
    effects of the factors (the main effects) could be unreliable. If possible,
    interactions should be removed before continuing with the analysis. One way of
    removing interactions is to transform the data. For example, if the effects are
    multiplicative, then a log transformation could make them additive.
                     Box and Cox transformation
One way of designing a transformation is to use a variance-stabilising transformation of
   the observations. Transformations should be applied with care. If a transformation is
   found, it might be better to use a generalised linear model with the corresponding link
   function.
The Box-Cox transformation has the parametric form:

         (y^λ - 1)/λ   if λ ≠ 0
y(λ) =
         log(y)        otherwise


  Box and Cox designed a likelihood for this transformation. This likelihood function is
  maximised with respect to λ. A typical plot of the Box-Cox profile likelihood looks
  like the one in the picture; if the maximum is near λ = -1, it means that 1/y could be
  a good variance-stabilising transformation.
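A sketch of finding the Box-Cox λ in R with MASS::boxcox; the data here are simulated so that the reciprocal scale is linear, which pushes the estimated λ towards -1, as in the picture described above.

```r
# Profile likelihood for the Box-Cox parameter lambda (simulated data).
library(MASS)
set.seed(1)
x <- runif(60, 1, 10)
y <- 1 / (0.5 + 0.1 * x + rnorm(60, sd = 0.02))   # reciprocal scale is linear
bc <- boxcox(y ~ x, plotit = FALSE)   # log-likelihood over a grid of lambda
bc$x[which.max(bc$y)]                 # lambda maximising the likelihood
```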
                     Nested (hierarchical) design
Sometimes the levels of one factor have nothing to do with the levels of another factor.
For example, if we want to test the performance of schools, we randomly select several
schools, choose several classes from each school, and consider the results in maths.
Each class in each school was taught by some teacher, and it is unlikely that the same
teacher taught in several schools. So it would not be reasonable to consider the effects
of school and class as additive (a class from one school has nothing to do with a class
from another school). This type of experiment is called a nested design.

[Diagram: a tree with a top node, branching into factor levels, each branching into
sublevels of the factor levels]

In factorial (crossed) designs we analyse interactions between the different factors first
and try to remove them. In nested designs we keep the interactions. The model for a two-
fold nested design is:
yijk = μ + x1i β1 + x1i x2j β12 + εijk
There is no interest in β2 on its own; it has no meaning here.
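A sketch of the two-fold nested model in R with simulated school/class data (all numbers made up). The class factor appears only inside the school:class term, matching the nested formula ~f1 + f1:f2 used later:

```r
# Nested design: classes are nested in schools, so class labels are reused
# across schools and class appears only in the interaction term.
set.seed(2)
school <- factor(rep(1:3, each = 8))
class_ <- factor(rep(rep(1:2, each = 4), times = 3))
score  <- rnorm(24, mean = 60 + 2 * as.integer(school))
fit <- lm(score ~ school + school:class_)
anova(fit)   # rows: school, school:class_ (nested effect), residuals
```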
                              R commands for ANOVA
There are basically two types of commands in R: the first fits a general linear model and the
    second analyses the results.
The command to fit a linear model is lm, used as
lm(response~factors)
The formula defines the design matrix; see the help for formula. For example, for the
    PlantGrowth data (available in R) we can use
data(PlantGrowth)       - load the data into R from the standard package
lmPlant = lm(PlantGrowth$weight~PlantGrowth$group)

Then the linear model is fitted to the data and the result is stored in lmPlant.
Now we can analyse it:
anova(lmPlant) will give the ANOVA table.
If there is more than one factor (category), then for a two-way crossed design we can use
lm(data~f1*f2) - fits the complete model with interactions
lm(data~f1+f2) - fits only the additive model
lm(data~f1+f1:f2) - fits f1 and the interaction between f1 and f2; used for nested models.
Other useful commands for linear model and analysis are
summary(lmPlant) – gives a summary after fitting
plot(lmPlant)    - produces several useful diagnostic plots
                    R commands for ANOVA

Another useful command for ANOVA is
confint(lmPlant)

This command gives confidence intervals for the coefficients and therefore for the
   differences between the effects of different factor levels.
To find confidence intervals between any two given effects one can use the
   bootstrap.
            ANOVA with generalised linear model
If the distribution of a data set is one of the members of the exponential family, then the
    command is

glm(data~formula,family=family)


where family is one of the distributions from the exponential family: poisson, binomial,
Gamma, etc.
All other analyses are done using similar commands such as anova, summary, etc.
Note that if you use glm, then anova will give an analysis of deviance instead of an
   analysis of variance, and it will not print probabilities by default; you need to
   specify the test. For example, if you have used the Poisson or binomial distribution,
   you can use
anova(glmresult,test="Chisq")
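Putting this together, a sketch with simulated Poisson counts (the factor trt and the rates are made up for illustration):

```r
# Poisson counts analysed with glm and an analysis-of-deviance table.
set.seed(3)
trt <- factor(rep(c("a", "b", "c"), each = 10))
counts <- rpois(30, lambda = rep(c(3, 5, 8), each = 10))
g <- glm(counts ~ trt, family = poisson)
anova(g, test = "Chisq")   # deviances with chi-squared p-values
```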
                           References

1.   Stuart, A., Ord, J.K. and Arnold, S. (1999) Kendall's Advanced Theory of
     Statistics, Volume 2A
2.   Box, G.E.P., Hunter, W.G. and Hunter, J.S. (1978) Statistics for
     Experimenters
3.   Berthold, M.J. and Hand, D.J. Intelligent Data Analysis
4.   Cox, G.M. and Cochran, W.G. Experimental Design
5.   Dalgaard, P. Introductory Statistics with R
