# Distribution theory and statistical inference

```
Today's lecture

• Inferential methods – review
  – Bayesian vs frequentist
  – Parametric, non-parametric, semi-parametric
• A more modern approach to non-parametric procedures
  – Randomisation tests
  – Bootstraps
• Next week
  – Revision
Web site
• Material for the last two weeks is now on
  www.maths.napier.ac.uk/~gillianr
• This includes materials for today's practical workshop on bootstraps
Parametric and non-parametric methods
• In both methods we are assuming that the data we are observing follow some model
• For parametric methods this is a model based on known probability distributions
• What we are saying is:
  "IF the model is true – then we can conclude … about the model and its parameters"
Parametric and non-parametric methods
• Non-parametric tests also make assumptions
• They imply that the DATA observed are a random sample from some unspecified distribution
• What we are saying is:
  "IF we have observed these data – then we can conclude … about the distribution(s) they come from"
Modern non-parametric methods
• Parametric methods condition (IF) on the model
• Non-parametric methods condition (IF) on the data

• Traditional non-parametric tests used ranks
  – This was for practical reasons in pre-computer days
• "Modern" non-parametric methods can use the data themselves
  – Randomisation tests for hypotheses (not so modern, as they go back to R. A. Fisher)
  – Bootstrap methods for confidence intervals
Randomisation test for a difference in means
• Tests the null hypothesis that the two samples come from a common distribution
• So in some ways this is more than a test of a difference in means, or even medians
• It is the same null hypothesis tested by traditional rank tests (e.g. the Wilcoxon Mann-Whitney test)
• Rank tests are not just a test of medians:
  "The Mann-Whitney test is not just a test of medians: differences in spread can be important"
  Anna Hart, BMJ 2001; 323: 391-393 (available from BMJ.com)
Sample data set: weights (in pounds) of 19 young people – 9 female, 10 male.
Are males or females heavier? Are the weights of males more variable than
those of females?

Obs  id  sex  weight
  1   1   F     84.0
  2   2   F     98.0
  3   3   F    102.5
  4   4   F     84.5
  5   5   F    112.5
  6   6   F     50.5
  7   7   F     90.0
  8   8   F     77.0
  9   9   F    112.0
 10  10   M    112.5
 11  11   M    102.5
 12  12   M     83.0
 13  13   M     84.0
 14  14   M     99.5
 15  15   M    150.0
 16  16   M    128.0
 17  17   M    133.0
 18  18   M     85.0
 19  19   M    112.0
T test output from SAS
Variable            sex            N    Mean      Std Dev
weight               F             9   90.111      19.384
weight               M            10   108.95      22.727
weight           Diff (1-2)            -18.84       21.22

T-Tests
Variable    Method               Variances   DF   t Value      Pr > |t|
weight      Pooled               Equal       17      -1.93     0.0702
weight      Satterthwaite        Unequal     17      -1.95     0.0680

Equality of Variances
Variable      Method        Num DF       Den DF   F Value     Pr > F
weight        Folded F           9            8      1.37     0.6645

P value for the difference in means is
0.0702 (pooled t test, assuming equal variances) or
0.0680 (Satterthwaite t test, allowing unequal variances)
Permutation/randomisation test
• Here the difference between the means (girls - boys) was -18.84 pounds
• Is this more than we would expect by chance if there were no difference between M and F?
  – Consider all 19 people
  – Select 9 of them at random to be 'female'
  – Get the weight difference for 'females' - 'males'
  – Repeating this gives the randomisation/permutation distribution under H0
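The resampling recipe above can be sketched in Python, using the 19 weights from the sample data set. This is a generic Monte Carlo permutation test, not a reproduction of the lecture's SAS macro; variable names are my own.

```python
import random

# Weights (pounds) from the sample data set: 9 females, 10 males
female = [84.0, 98.0, 102.5, 84.5, 112.5, 50.5, 90.0, 77.0, 112.0]
male = [112.5, 102.5, 83.0, 84.0, 99.5, 150.0, 128.0, 133.0, 85.0, 112.0]

observed = sum(female) / len(female) - sum(male) / len(male)  # about -18.84

combined = female + male
random.seed(1)
n_perm = 10000
extreme = 0
for _ in range(n_perm):
    random.shuffle(combined)          # relabel the 19 people at random
    f, m = combined[:9], combined[9:]
    diff = sum(f) / 9 - sum(m) / 10
    if abs(diff) >= abs(observed):    # two-sided: at least as far from zero
        extreme += 1
p_value = extreme / n_perm
print(round(observed, 2), round(p_value, 3))
```

With 10,000 random relabellings the two-sided p-value should come out close to the 0.0673 quoted on a later slide, up to Monte Carlo noise.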
Programming the randomisation test
• This can be done easily
• Details of a SAS macro to do this are on the next page
• An EXCEL macro to do this is also available on the class web page
  – http://www.bioss.ac.uk/smart/frames.html
  – It incorporates corrections from one of last year's Napier honours students
SAS program to do this
• On my web site: www.maths.napier.ac.uk/~gillianr
  – macro: randmac.sas (submit this first)
  – program: rand.sas
  – this reads in the data and runs the macro
  – you can alter it for your own data
• Go to SAS now if there is time (this is V8.1)
Randomisation distribution of the difference in means – the actual difference was -18.84.
The proportion of the distribution further away from zero than this is 0.0673.

[Histogram of the randomisation distribution: frequency against difference in means, midpoints roughly -30 to +34]

This compares with 0.0680 or 0.0702 from the t tests.
[Histogram of the randomisation distribution of the F ratio: frequency against f ratio, midpoints roughly 0.0 to 10.5]
Conclusions
• For this problem, all methods give very similar answers for both means and variances
• This is usually the case
• Exceptions are odd distributions with possible outliers
• For these a randomisation test is a good choice
• To go with it, use a bootstrap confidence interval
Comparing parametric and bootstrap confidence intervals

Variable   sex          N   Lower CL Mean     Mean   Upper CL Mean
weight     F            9          75.211   90.111          105.01
weight     M           10          92.692   108.95          125.21
weight     Diff (1-2)              -39.41   -18.84          1.7313

Bootstrap 95% confidence interval for the difference (1000 bootstraps):
(-37.87 to 0.16)
So again, very similar to the parametric interval
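A percentile-bootstrap interval for the difference in means can be sketched as follows, again using the 19 weights. I cannot confirm exactly how the lecture's %bootdiff macro resamples, so treat this as a generic sketch that resamples within each group with replacement.

```python
import random

# Weights (pounds): 9 females, 10 males, as in the lecture's data set
female = [84.0, 98.0, 102.5, 84.5, 112.5, 50.5, 90.0, 77.0, 112.0]
male = [112.5, 102.5, 83.0, 84.0, 99.5, 150.0, 128.0, 133.0, 85.0, 112.0]

random.seed(7)
diffs = []
for _ in range(1000):
    # resample each group WITH replacement, keeping the group sizes fixed
    f = random.choices(female, k=len(female))
    m = random.choices(male, k=len(male))
    diffs.append(sum(f) / len(f) - sum(m) / len(m))
diffs.sort()
lower, upper = diffs[24], diffs[974]  # percentile 95% CI from 1000 resamples
print(round(lower, 2), round(upper, 2))
```

The interval should land near the (-37.87, 0.16) quoted above, with some variation from the random seed.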
Bootstraps
• Methods developed in the late 1970s by Bradley Efron
• A textbook by Efron and Tibshirani is in the library
  – it also describes randomisation tests
What is a bootstrap?
• The data consist of n items
• Take a sample of size n from the data WITH replacement
  – data: 1 2 3 4
  – possible bootstrap samples: 1 1 2 3, 1 1 1 1, 1 2 3 4, …
• Take lots of bootstrap samples and calculate the quantity of interest (e.g. the mean) from each one
  – The variability between the quantities calculated from the bootstrap samples is like the variation you expect in your estimate
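A minimal illustration of this definition, using the toy data 1 2 3 4 from the slide: the standard deviation of the bootstrap means approximates the standard error of the sample mean.

```python
import random
import statistics

data = [1, 2, 3, 4]                    # the toy data from the slide
random.seed(0)
boot_means = []
for _ in range(2000):
    sample = random.choices(data, k=len(data))  # size n, WITH replacement
    boot_means.append(statistics.mean(sample))
# the spread of the bootstrap means mimics the sampling variation of the mean
boot_se = statistics.stdev(boot_means)
print(round(boot_se, 2))
```

For these four values the bootstrap standard error should come out near 0.56.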
Example of 3 bootstrap samples (n = 8 items)

DATA     Bootstrap 1   Bootstrap 2   Bootstrap 3
  12           23             2            12
 114            2            12           114
  16           16            23            16
  15           15            15             2
  23          114            12             2
  12           16            15            16
   2          114            15            12
   9           16            16             9

Mean     25.375      39.5       13.75     22.875
Median    13.5       16         15        12
s.d.      36.30796   46.34652   5.849298  37.22302
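The summary statistics in the DATA column above can be checked with a few lines of Python (using the sample standard deviation, i.e. the n-1 denominator):

```python
import statistics

data = [12, 114, 16, 15, 23, 12, 2, 9]   # the 8-item DATA column
print(statistics.mean(data))             # 25.375
print(statistics.median(data))           # 13.5
print(round(statistics.stdev(data), 5))  # 36.30796
```

The bootstrap-sample columns can be verified the same way.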
Go to EXCEL demo
• Found on the web at http://www.mste.uiuc.edu/
  (at the University of Illinois)
• M2T2 project: "Mathematics materials for tomorrow's teachers"
• Copied on my web page
Bootstrap macro for SAS
First submit the macro file (randmac.sas)
Then run the macro (example in file bootstrap.sas):
%boots(in=new,var=x,seed=233667,fun=mean,nboots=100,out=result);
An explanation of each parameter is in the sample program bootstrap.sas
Go to SAS now
Other bootstrap macros
Does a bootstrap CI for differences in means (example in program rand.sas):
%bootdiff(in=class,var=weight,group=sex,nboots=1000,seed=45345,out=bootdiff);
Does bootstraps for correlation coefficients (example in rand.sas):
%bootscor(in=new2,var1=score1,var2=score2,seed=5465,nboots=50,out=corr);
Pearson's correlation coefficient
• Calculated for sample data:

  r = Σ(x - x̄)(y - ȳ) / √[ Σ(x - x̄)² × Σ(y - ȳ)² ]

• It has values between -1 and +1
• 0 represents no association
• We can think of our sample value of r as estimating a population quantity ρ (rho)
• So we can calculate a bootstrap CI for ρ
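The formula above translates directly into code. The helper name pearson_r is my own; the function is a sketch, not part of any lecture macro.

```python
import math

def pearson_r(xs, ys):
    # sum of cross-products over the square root of the
    # product of the two sums of squares, as in the formula
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # exactly linear -> 1.0
```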
Bootstrap for correlation
The data consist of pairs of values:
(x1, y1) (x2, y2) (x3, y3) (x4, y4) (x5, y5)
Bootstrap samples are samples of PAIRS – with replacement, e.g.
(x1, y1) (x1, y1) (x3, y3) (x5, y5) (x5, y5)
The correlations from bootstrap samples will always be between -1 and +1.
Sample data next
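Resampling pairs can be sketched as below. The (x, y) pairs here are made up for illustration; the lecture's two-observer scoring data are not listed in full, so they are not reproduced.

```python
import random
import math

def pearson_r(pairs):
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# illustrative (x, y) pairs -- NOT the lecture's scoring data
pairs = [(30, 32), (40, 45), (48, 50), (55, 60), (62, 58),
         (70, 72), (75, 80), (81, 78), (85, 88), (90, 92)]
random.seed(3)
boot_rs = []
for _ in range(1000):
    resample = random.choices(pairs, k=len(pairs))  # resample PAIRS, not x and y separately
    if len(set(resample)) > 1:                      # skip a degenerate all-identical resample
        boot_rs.append(pearson_r(resample))
boot_rs.sort()
lo = boot_rs[round(0.025 * len(boot_rs))]
hi = boot_rs[round(0.975 * len(boot_rs)) - 1]
print(round(lo, 2), round(hi, 2))
```

Because whole pairs are resampled, every bootstrap correlation stays in [-1, +1].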
Confidence interval for a correlation coefficient
• Data on agreement in scoring between two observers, from a sample of 30 items scored
• The sample value is 0.966
• How well might this represent the true population value?
• A bootstrap confidence interval gives us the interval and also the bootstrap distribution
• Next we will look at the classical approach to this
[Scatter plot of score1 (vertical axis, 30 to 80) against score2 (horizontal axis, 30 to 90) for the two observers]
Distribution of r
• If we took many small samples, the distribution of r would not be normal
• Fisher showed that, for bivariate normal distributions,
  z = 0.5 ln[(1+r)/(1-r)]
  is approximately normal with a standard error of 1/sqrt(n-3)
• This can be used to get a confidence interval for r, transforming back with
  r = [exp(2z)-1]/[exp(2z)+1]
Example used for bootstraps
r = 0.966, n = 30
z = 0.5 ln[(1+0.966)/(1-0.966)] = 2.0287
standard error of z = 1/sqrt(30-3) = 1/sqrt(27) = 0.192
95% CI for z: 2.0287 +/- 1.96 x 0.1925
Transforming back to the r scale gives (0.93 to 0.98)

The bootstrap gave (0.94 to 0.99) - near enough
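The worked example above can be reproduced in a few lines, applying Fisher's transformation and then the inverse transformation to the interval endpoints:

```python
import math

r, n = 0.966, 30
z = 0.5 * math.log((1 + r) / (1 - r))      # Fisher's z transformation
se = 1 / math.sqrt(n - 3)                  # standard error 1/sqrt(n-3)
z_lo, z_hi = z - 1.96 * se, z + 1.96 * se

def back(z):
    # inverse transformation: r = [exp(2z)-1]/[exp(2z)+1]
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

print(round(z, 4), round(se, 4))                    # 2.0287 0.1925
print(round(back(z_lo), 2), round(back(z_hi), 2))   # 0.93 0.98
```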
Conclusions
• Randomisation tests and bootstraps are a useful alternative to parametric tests
• BUT they are more bother to do, although this should get easier in future
• In general they vindicate the robustness of classical tests, even when their assumptions are not true
• An exception may be data with outliers
• We will investigate this in the practical
Summary
• F tests for ratios of variances
• How F ratios are used in regression and analysis of variance (sketch)
• Randomisation / permutation tests
• Bootstraps
• Correlation coefficient (introduction)

```