The Chi-square test - PowerPoint Presentation


```
The Chi-square test - χ²

Peter Shaw
This test is noteworthy...
   Because it works on nominal data.
   It requires COUNTS OF
observations – how many fields
were ploughed? How many plants
were white and how many pink?
   It does not work on ordinal or
continuous data. If the data have
units (kg, cm, moles etc) you
cannot use χ².
   I would genuinely advise that you
do this one by hand not on a PC.
The calculations are trivial, and I
don’t trust a PC to run the correct
model for me!
Hitting a dartboard AT RANDOM:
Where do they go?

                 Board 1      Board 2          Board 3
Smallest zone    1            35               75
Middle zone      24           30               20
Largest zone     75           35               5
                 Plausible    Are you sure?    Very unlikely

All of these patterns could occur by chance but they
are not equally likely.
To go beyond gut
reactions, you need to
calculate how many
events (dart hits etc)
you would EXPECT in
each category.
In this case we assume
that expected number
of hits = % area * total
number of darts.

Zone areas: 2%, 24% and 74% of the board.
Real Life
     At Drax Power Station I set up
seeding trials on 6 mounds of
industrial waste (back in 1991).
Since 1995 orchids have flowered in
these plots - but only on the bases
of each mound, never on the top.
     Could this be due to chance?
     Yes! The question is how likely
this is to be due to chance.
     This is a chi-squared problem.
[Photos: mound top, no orchids; mound base, orchids in many places]
How to do it?
   1: set up H1, H0:
 H1: The distribution is non-random
 H0: The distribution is random

   2: Define significance (p=0.05)
   3: For each category calculate how
many events you would expect
under H0. Call this E (for
expected). Call the observed
number of events O.
   4: Calculate (O-E)²/E
Now find χ²
   χ² = Σ(O-E)²/E

   ie add up all the values of (O-E)²/E
   You need to find the df. This is N-1,
where N = number of categories (3 for the
dartboard)
   Compare your value of χ² with the
tabulated critical value: large values are significant.
My orchid data:
   Drax has 72 experimental plots, of which 12
support orchids.
   36 plots are on the mound top, 36 at mound
base.
   Expected values:
   12 orchid plots out of 72 should be 50:50 mound
base: mound top, IF their distribution were
random.
   Hence expected values are 6 orchid plots on
mound tops, 6 at mound base.
   Observed values are in fact 0 and 12
respectively.
2 = ?
   2 = (0-6)2/6 + (12-6)2/6
   = 6 + 6 = 12, with 1df.
   The critical value for 2 with 1 df at
p = 0.05 is 3.84

   Calculated value > tabulated value,
so result is significant. I reject H0
and accept that the distribution of
plants appears to be non-random.
Plagiarism in psychology UGs??
At a programme board in 2005 a paper was tabled giving results
of a trial in which every single essay submitted to an anthropology
module was carefully checked for any form of plagiarism. A
series of quiet gasps went around as people saw the figures:

               % plagiarised    Raw data
Biosciences    19               5/27
Psychology     39               12/31
This is a chi-squared question

               O        total    E       O-E      (O-E)²/E
Biosciences    5.00     27.00    7.91    -2.91    1.07
Psychology     12.00    31.00    9.09    2.91     0.93
sum            17.00    58.00                     2.01

chi-sq = 2, df = 1, p>0.05 (in fact p>0.1)
In other words this pattern is non significant; it is
easily within the expected range of random noise.
Now try the other half of the results; the number of students
whose referencing was missing or poor.

               % v poor refs    Raw data
Biosciences    30               8/27
Psychology     3                1/31
Until 1999 a doctor called Harold Shipman worked as a GP in north
Manchester. His mortality rates seemed a little high compared with a
neighboring GP practice.
The practice is that when deaths occur at home under GP supervision
another medical practice often signs the death certificate.

                             Shipman’s practice    Next GP practice along
People served                3100                  9800
Death certs signed / year    47                    14
One little catch:
   This concerns the “Expected” values.
Remember that χ² involves the term
(O-E)²/E. If you had a tiny value of E, χ² would
be huge. (As E -> 0, χ² -> infinity).
   This is not a problem most of the time.
   However, it is a problem if E gets too small,
and there is a rule of thumb to guide you
here:

   If E <5, the χ² value may be unreliable.
Let’s take the dartboard
example
   We have 3 zones, comprising 2,
24 and 74% of the area.
   Throw 100 darts randomly at
these and we expect them to contain
2, 24 and 74 darts respectively.
   Our E values are hence 2, 24
and 74.
3 dartboards: let’s use the Chi-sq to assess their likelihood.

        Board 1           Board 2           Board 3
O       1    24   75      35   30   35      75   20   5
E       2    24   74      2    24   74      2    24   74

The 2% zone has E = 2, which is below 5, so merge it with the 24% zone:

        Board 1           Board 2           Board 3
O       25   75           65   35           95   5
E       26   74           26   74           26   74

For board 1:
χ² = Σ (O-E)²/E
   = (25-26)²/26 + (75-74)²/74
   = 1/26 + 1/74

        Board 1           Board 2           Board 3
        χ² = 0.052        χ² = 79.05        χ² = 247.5
   In Sheffield we surveyed
tombstones with lichen cover.
There were 2 types of stone:
millstone and marble.

   One day we found 80 marble and
120 millstone tombstones. 80 of
the millstone ones carried Lecanora
conizaeoides but none of the marble
ones did. Is this significant?
2 way chi-square
This is very common, but needs a little thought. Here we have a
distribution of counts in 2 crossed categories (eg M/F * did/did not gain a
score, habitat type * present/absent). It is possible to test H0: random
distribution. If H0 is rejected you may conclude that the distribution is
not random, but you can’t go on to identify which observations / classes /
treatments are responsible for this effect.

M               F
Habitat 1              25             5
Habitat 2              10             15
Habitat 3              5              35
What are the constraints on this? That you sampled a certain total
number of individuals, and a certain total fell into each gender and a
certain total into each habitat. Given these totals we can predict the
expected number of observations under a random distribution.
Obs:                  M               F      Sum
Habitat 1              25              5      30
Habitat 2              10             15      25
Habitat 3              5              35      40
Sum                    40             55      95

Expected              M               F              Sum
Habitat 1              30*40/95       30*55/95        30
Habitat 2              25*40/95       25*55/95        25
Habitat 3              40*40/95       40*55/95        40
Sum                    40             55      95
Expected              M              F              Sum
Habitat 1             12.63          17.37          30
Habitat 2             10.53          14.47          25
Habitat 3             16.84          23.16          40
Sum                   40             55             95

(O-E)²/E              M              F
Habitat 1             12.11          8.81
Habitat 2             0.03           0.02
Habitat 3             8.33           6.06
Sum                   20.46          14.88

Total chi-squared (all six cells summed) = 35.35, with (3-1)*(2-1) = 2 df.
Yates’ Correction (the continuity correction)
This correction can probably be ignored under most circumstances. In
fact I would never use it, preferring instead a home-grown Monte-Carlo
approach (next slide..), but this correction could matter if you are
using chi-square to look for associations between events and have
small sample size.

                  Sp1
Sp2               Present    Absent    sum
Present           a          b         a+b
Absent            c          d         c+d

If you crank through the chi-square calculation on this association
matrix you find that
chi-square = (ad-bc)² * N / [(a+b)(c+d)(a+c)(b+d)], where N = a+b+c+d

O                 Sp1
Sp2               Present    Absent    sum
Present           0          20        20
Absent            40         60        100
sum               40         80        120

chi-square = (0*60-20*40)² * 120 / (20*100*40*80) = 12 QED

E                 Sp1
Sp2               Present    Absent    sum
Present           6.7        13.3      20
Absent            33.3       66.7      100
sum               40         80        120

Sum(O-E)²/E = 12 QED
Yates’ correction, contd!
Yates’ correction is a fudge applied to this
calculation in cases where E values are <5 or N <
100. It goes as follows:

chi-square = (|ad-bc| - N/2)² * N / [(a+b)(c+d)(a+c)(b+d)], where N = a+b+c+d

Essentially this reduces the calculated chi-square value by
reducing the (ad-bc)² term on the top line, making the test
more conservative (=more likely to accept H0). But I
wouldn’t do it that way…
More on small samples
Each time we have a PhD student who studies primates in
analyses on small datasets where the E values are
incorrigibly low. Since some M. Res. Primatology students

The good news: the E>4 rule is over-cautious, and in my
experience you can get away with E values as low as 2 and
still get accurate confidence levels.
How do I know? Because there is a back-door solution, not
available in the books or major packages, which I use that
allows me blithely to ignore E<5 and Yates’ correction. It is
a Monte-Carlo empirical determination of significance.
Monte-Carlo Chi-square
Remember the dartboard? A Monte-Carlo determination involves first
calculating your actual χ² and ‘writing this down’ (in a PC memory).
Let’s say that your data involves 20 observations. Then I get a PC to
randomly ‘throw 20 darts’ at a dartboard of the same construction as your
E values, and calculate a χ² value. This is a random χ² value. This is
stored, and a second set of random darts is thrown, and a second χ²
value calculated. This is repeated >200 times, to give you an empirical
insight into what would be expected from random positioning of your 20
observations. All the PC then has to do is compare your real χ² value
with these random ones to derive a safe, dependable significance level.

The catch is that you have to use a bit of old DOS code I wrote to do this,
hence the reason why these students keep knocking on my door..
Ymke’s warthogs..

        Number of respondents    Number of respondents
        to questionnaire         reporting problems
                                 with warthog
        39                       0
        13                       4
        10                       0
        21                       5
        7                        0
        31                       3
Sum:    121                      12
Oh Dear!! Not an E value >4

        Number of respondents    Number of respondents    Expected
        to questionnaire         reporting problems       warthog
                                 with warthog             reports
        39                       0                        3.87
        13                       4                        1.29
        10                       0                        0.99
        21                       5                        2.08
        7                        0                        0.69
        31                       3                        3.07
Sum:    121                      12                       12

So I loaded O and E values into my MC-chi-sq programme…
And the MC-output said:
χ² = 15.34 (correct!)
It calculated 1000 random χ² values based on
“throwing 12 darts” at a board divided into 6 zones
whose relative sizes were 3.87, 1.29,… 3.07 and
found that 95% of them were <11.72 and 99% of
them were <16.11.

So what was the significance of χ² = 15.34?
The standard tables, crudely ignoring E<5
problems, give this a significance level of
0.01>p>0.005.
Chi-square and model fitting
This is a whole lecture in itself! There is one simple, neat, easy-to-
understand way to use chi-square to see whether two variables are
associated. All that matters is that you can plot a graph of the variables:
consider the scattergraph of the relationship between two arbitrary
variables:
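[Scattergraph: Plant mass (g) against Fertiliser (g), divided into four quadrants by the median of each variable]
How many points in each quadrant (sum = 40)?
What is H0, and why?
H0: 10 in each

[Second scattergraph, quadrant counts: 76 obs, 35 obs, 77 obs, 35 obs]
χ² = 30.9, 1 df ***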
```
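The hand calculations in the slides (goodness of fit, two-way expected values, and the Monte-Carlo "darts" idea) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the author's DOS programme: the function names `chi_square`, `expected_two_way` and `monte_carlo_p`, the default of 1000 simulations and the fixed seed are all hypothetical choices made for illustration, and the Monte-Carlo routine is only one plausible reading of "throwing N darts at a board shaped like the E values".

```python
import random


def chi_square(observed, expected):
    """Return the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def expected_two_way(table):
    """Expected counts for a two-way table: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def monte_carlo_p(observed, expected, n_sims=1000, seed=0):
    """Empirical significance: the fraction of random 'dartboards' whose
    chi-square is at least as large as the real one.

    Assumption: each simulated dart lands in a zone with probability
    proportional to that zone's E value.
    """
    rng = random.Random(seed)
    n_darts = sum(observed)
    actual = chi_square(observed, expected)
    zones = range(len(expected))
    at_least_as_large = 0
    for _ in range(n_sims):
        counts = [0] * len(expected)
        for zone in rng.choices(zones, weights=expected, k=n_darts):
            counts[zone] += 1
        if chi_square(counts, expected) >= actual:
            at_least_as_large += 1
    return at_least_as_large / n_sims


# Orchid data from the slides: O = 0 and 12, E = 6 and 6.
print(chi_square([0, 12], [6, 6]))  # 12.0

# Two-way habitat table from the slides (rows = habitats, columns = M, F).
habitat = [[25, 5], [10, 15], [5, 35]]
habitat_expected = expected_two_way(habitat)
habitat_total = sum(
    chi_square(o_row, e_row) for o_row, e_row in zip(habitat, habitat_expected)
)
print(round(habitat_total, 2))

# Warthog data from the slides: E = 12 reports shared out by respondent numbers.
respondents = [39, 13, 10, 21, 7, 31]
reports = [0, 4, 0, 5, 0, 3]
warthog_expected = [sum(reports) * r / sum(respondents) for r in respondents]
print(round(chi_square(reports, warthog_expected), 2))  # the slides report 15.34
print(monte_carlo_p(reports, warthog_expected))
```

The simulated significance level will vary a little with the seed and the number of simulations, so it should be read as an estimate rather than an exact p-value.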