Statistics for AS biology Biology All

Document Sample

```					                                           AS statistics page 1

Statistics for AS Biology
Descriptive Statistics
Repeated measurements in biology are
mean
rarely identical, due to random errors

number of times each value
and natural variation. If enough                                          95% CI            95% CI
measurements are repeated they can be
plotted on a histogram, like the one on
the right. This usually shows a normal                                                               normal

occurs
distribution
distribution, with most of the repeats                                                               curve
close to some central value. Many
pattern: eg. peoples' heights, number of
peas in a pod, the breathing rate of
insects, etc.                                                                      values

The central value of the normal
small confidence limit,                      large confidence limit,
low variability,                             high variability,    distribution curve is the mean (also
data close together,                         data scattered,      known as the arithmetic mean or
mean is reliable                             mean is unreliable
average). But how reliable is this
mean? If the data are all close
together, then the mean is probably
good, but if they are scattered
widely, then the calculated mean
may not be very reliable. The width of the normal distribution curve is given by the standard
deviation (SD), and the larger the SD, the less reliable the data. For comparing different sets of data,
a better measure is the 95% confidence interval (CI). This is derived from the SD, and is the range
above and below the mean within which 95% of the repeated measurements lie (marked on the
histogram above). You can be pretty confident that the real mean lies somewhere in this range.
Whenever you calculate a mean you should also calculate a confidence limit to indicate the quality

In Excel the mean is calculated using the formula =AVERAGE (range) , the SD is calculated using
=STDEV (range) , and the 95% CI is calculated using =CONFIDENCE (0.05, STDEV(range), COUNT(range)) .

This spreadsheet shows two sets of
data with the same mean. In group A
the confidence interval is small
compared to the mean, so the data are
reliable and you can be confident that
the real mean is close to your
calculated mean. But in group B the
confidence interval is large compared
to the mean, so the data are unreliable,
as the real mean could be quite far
away from your calculated mean. Note
that Excel will always return the
results of a calculation to about 8
decimal places of precision. This is meaningless, and cells with calculated results should always be
formatted to a more sensible precision (Format menu > Cells > Number tab > Number).

HGS Biology                                                                                           NCM/11/01
AS statistics page 2

Plotting Data
Once you have collected data you will want to plot a graph or chart to show trends or relationships
clearly. With a little effort, Excel produces very nice charts. First enter the data you want to plot into
two columns (or rows) and select them.

Drawing the Graph. Click on the chart wizard         . This has four steps:
1. Graph Type. For a bar graph choose Column and for a scatter graph (also known as a line graph)
choose XY(Scatter) then press Next. Do not choose Line.
2. Source Data. If the sample graph looks OK, just hit Next. If it looks wrong you can correct it by
clicking on the Series tab, then the red arrow in the X Values box, then highlight the cells
containing the X data on the spreadsheet. Repeat for the Y Values box.
3. Chart Options. You can do these now or change them later, but you should at least enter suitable
titles for the graph and the axes and probably turn off the gridlines and legend.
4. Graph Location. Just hit Finish. This puts the chart beside the data so you can see both.

Changing the Graph. Once you have drawn the graph, you can now change any aspect of it by
double-clicking (or sometimes right-clicking) on the part you want to change. For example you can:
 move and re-shape the graph
 change the background colour (white is usually best!)
 change the shape and size of the markers (dots)
 change the axes scales and tick marks
 add a trend line or error bars (see below)

Lines. To draw a straight "line of best fit" right click on a point, select Add Trendline, and choose
linear. In the option tab you can force it to go through the origin if you think it should, and you can
even have it print the line equation if you are interested in the slope or intercept of the trend line. If
instead you want to "join the dots" (and you don't often) double-click on a point and set line to
automatic.

Error bars. These are used to show the confidence intervals on the graph. You must already have
entered the 95% confidence limits on the spreadsheet beside the X and Y data columns. Then
double-click on the points on the graph to get the Format Data Series dialog box and choose the Y
Error Bars tab. Click on the red arrow in the Custom + box, and highlight the range of cells
containing your confidence limits. Repeat for the Custom - box.

HGS Biology                                                                                    NCM/11/01
AS statistics page 3

Problems
1.   Here are the results of an investigation into the rate of photosynthesis in the pond weed Elodea.
The number of bubbles given off in one minute was counted under different light intensities,
and each measurement was repeated 5 times. Use Excel to calculate the means and 95%
confidence limits of these results, then plot a graph of the mean results with error bars and a
line of best fit.

light     repeat 1    repeat 2      repeat 3    repeat 4     repeat 5
intensity
(Lux)
0           5           2              0          2            1
500         12           4              5          8            7
1000           7          20             18         14           24
2000          42          25             31         14           38
3500          45          40             36         50           28
5000          65          54             72         58           36

HGS Biology                                                                                NCM/11/01
A2 statistics page 1

Statistics for A2 Biology
There is a bewildering variety of statistical tests available, and it is important to choose the right
one. This flow chart will help you to decide which statistical test to use, and the tests are described
in detail on the next 5 pages.

normal        Pearson correlation coefficient
data          =CORREL (range 1, range 2)
0 = no correlation
Testing for a                 1 = perfect correlation
correlation

non-normal Spearman correlation coefficient
data       =CORREL (range 1, range 2)
Plot                                      on ranks of data
scatter                                    0=no correlation/ 1=perfect correlation
graph

Testing for              Finding how one              Linear regression
a relation              factor affects another       Add Trendline to graph and
between 2 sets                                            Display Equation.
Gives slope and intercept of line
Calculate                                             What
mean and                                               kind
95% CI from                                                                           same          Paired t-test
of                           individuals   =TTEST(range1, range2, 2, 1)
replicates                                            test?                                        If P<5% then significant difference
If P>5% then no significant difference
Testing for             2 sets
Measurements

a difference
between sets                              different     Unpaired t-test
individuals   =TTEST(range1, range2, 2, 2)
Plot                                        If P<5% then significant difference
bar                                         If P>5% then no significant difference
graph

ANOVA
What                                                                >2 sets                      Tools menu > Data analysis > Anova
start       kind                                                                                            If P<5% then significant difference
here         of                                                                                             If P>5% then no significant difference
data?

2 test
Frequencies (counts)

Comparing observed
counts to a theory                     =CHITEST(obs range, exp range)
If P<5% then disagree with theory
If P>5% then agree with theory

What         Testing for a difference        2 test
kind        between counts                  =CHITEST(obs range, exp range)
of                                         If P<5% then significant difference
test?                                        If P>5% then no significant difference

Testing for an association             2 test for association
between groups of counts               =CHITEST(obs range, exp range)
If P<5% then significant association
If P>5% then no significant association

HGS Biology                                                                                                                              NCM/11/01
A2 statistics page 2

Statistics to Test for a Correlation
Correlation statistics are used to investigate an association between two factors such as age and
height; weight and blood pressure; or smoking and lung cancer. After collecting as many pairs of
measurements as possible of the two factors, plot a scatter graph of one against the other. If both
factors increase together then there is a positive correlation, or if one factor decreases when the
other increases then there is a negative correlation. If the scatter graph has apparently random points
then there is no correlation.

Positive Correlation                Negative Correlation                 No Correlation
variable 2

variable 2

variable 2
variable 1                           variable 1                        variable 1

There are two statistical tests to quantify a correlation: the Pearson correlation coefficient (r), and
Spearman's rank-order correlation coefficient (rs). These both vary from +1 (perfect correlation)
through 0 (no correlation) to –1 (perfect negative correlation). If your data are continuous and
normally-distributed use Pearson, otherwise use Spearman. In both cases the larger the absolute
value (positive or negative), the stronger, or more significant, the correlation. Values grater than 0.8
are very significant, values between 0.5 and 0.8 are probably significant, and values less than 0.5 are
probably insignificant.

In Excel the Pearson coefficient r is calculated using the formula: =CORREL (X range, Y range) . To
calculate the Spearman coefficient rs, first make two new columns showing the ranks (or order) of
the two sets of data, and then calculate the Pearson correlation on the rank data. The highest value is
given a rank of 1, the next highest a rank of 2 and so on. Equal values are given the same rank, but
the next rank should allow for this (e.g. if there are two values ranked 3, then the next value is
ranked 5).

In this example the size of breeding
pairs of penguins was measured to
see if there was correlation between
the sizes of the two sexes. The scatter
graph      and     both     correlation
coefficients clearly indicate a strong
positive correlation. In other words
large females do pair with large
males. Of course this doesn't say why,
but it shows there is a correlation to
investigate further.

HGS Biology                                                                                         NCM/11/01
A2 statistics page 3

Linear Regression to Investigate a Causal Relationship.
If you know that one variable causes the changes in the other variable, then there is a causal
relationship. In this case you can use linear regression to investigate the relation in more detail.
Regression fits a straight line to the data, and gives the values of the slope and intercept of that line
(m and c in the equation y = mx + c).

The simplest way to do this in Excel is to
plot a scatter graph of the data and use the
trendline feature of the graph. Right-click on
a data point on the graph, select Add
Trendline, and choose Linear. Click on the
Options tab, and select Display equation on
chart. You can also choose to set the
intercept to be zero (or some other value).
The full equation with the slope and
intercept values are now shown on the chart.

In this example the absorption of a yeast cell suspension is plotted against its cell concentration
from a cell counter. The trendline intercept was fixed at zero (because 0 cells have 0 absorbance),
and the equation on the graph shows the slope of the regression line.

The regression line can be used to make quantitative predictions. For example, using the graph
above, we could predict that a cell concentration of 9 x 107 cells per cm3 would have an absorbance
of 1.37 (9 x 0.152).

HGS Biology                                                                                   NCM/11/01
A2 statistics page 4

T-Test to Compare Two Sets of Data
Another common form of data analysis is to compare two sets of measurements to see if they are the
same or different. For example are plants treated with fertiliser taller than those without? If the
means of the two sets are very different, then it is easy to decide, but often the means are quite close
and it is difficult to judge whether the two sets are the same or are significantly different. To
compare two sets of data use the t-test, which tells you the probability (P) that there is no difference
between the two sets. This is called the null hypothesis.

P varies from 0 (impossible) to 1 (certain). The higher the probability, the more likely it is that the
two sets are the same, and that any differences are just due to random chance. The lower the
probability, the more likely it is that that the two sets are significantly different, and that any
differences are real. Where do you draw the line between these two conclusions? In biology the
critical probability is usually taken as 0.05 (or 5%). This may seem very low, but it reflects the facts
that biology experiments are expected to produce quite varied results. So if P > 5% then the two sets
are the same (i.e. accept the null hypothesis), and if P < 5% then the two sets are different (i.e. reject
the null hypothesis). For the t test to work, the number of repeats should be at least 5.

In Excel the t-test is performed using the formula: =TTEST (range1, range2, tails, type) . For the
examples you'll use in biology, tails is always 2 (for a "two-tailed" test), and type can be either 1 for
a paired test (where the two sets of data are from the same individuals), or 2 for an unpaired test
(where the sets are from different
individuals). The cell with the t test P
should be formatted as a percentage
(Format menu > cell > number tab >
percentage).      This       automatically
multiplies the value by 100 and adds the
% sign. This can make P values easier to
read and understand. It’s also a good idea
to plot the means as a bar chart with error
bars to show the difference graphically.

In the first example the yield of potatoes
in 10 plots treated with one fertiliser was
compared to that in 10 plots treated with
another fertiliser. Fertiliser B delivers a
larger mean yield, but the unpaired t-test
P shows that there is a 8% probability
that this difference is just due to chance.
Since this is >5% we accept the null
hypothesis that there is no significant
difference between the two fertilisers.

In the second example the pulse rate of 8
individuals was measured before and
after eating a large meal. The mean pulse
rate is certainly higher after eating, and
the paired t-test P shows that there is
only a tiny 0.005% probability that this
difference is due to chance, so the pulse
rate is significantly higher after a meal.

HGS Biology                                                                                    NCM/11/01
A2 statistics page 5

ANOVA to Compare >2 sets of Data
The t test is limited to comparing two sets of data, so to compare many groups at once you need
analysis of variance (ANOVA). From the Excel Tools menu select Data Analysis then ANOVA
Single Factor. This brings up the ANOVA dialogue box, shown here.
 Enter the Input Range by clicking in
the box then selecting the range of
cells containing the data, including
 Check that the columns/rows choice
is correct (this example is in three
columns), and click in Labels in
First Row if you have included
appear in the results table.
 Leave Alpha at 0.05 (for the usual
5% significance level).
 Click in the Output Range box and
click on a free cell on the worksheet,
which will become the top left cell
of the 8 x 15-cell results table.
 Finally press OK.

The output is a large
data table, and you may
it all. At this point you
should plot a bar graph
using the averages
column for the bars and
the variance column for
the error bars.

The most important cell
in the table is the P-
value, which as usual is
the probability that the
null hypothesis (that
there is no difference
between any of the data
sets) is true. This is the
same as a t-test
probability, and in fact if you try ANOVA with just two data sets, it returns the same P as a t test. If
P > 5% then there is no significant difference between any of the data sets (i.e. the null hypothesis is
true), but if P < 5% then at least one of the groups is significantly different from the others.

In the example on this page, which concerns the grain yield from three different varieties of wheat,
P is 0.14%, so is less than 5%, so there is a significant difference somewhere. The problem now is
to identify where the difference lies. This is done by examining the variance column in the summary
table. In this example, varieties 2 and 3 are very similar, but variety 1 is obviously the different one.
So the conclusion would be that variety 1 has a significantly lower yield than varieties 2 and 3.

HGS Biology                                                                                   NCM/11/01
A2 statistics page 6

Chi-squared Test for Frequency Data
Sometimes the data from an experiment are not measurements but counts (or frequencies) of
things, such as counts of different phenotypes in a genetics cross, or counts of species in different
habitats. With frequency data you can’t usually calculate averages or do a t test, but instead you do a
chi-squared (2) test. This compares observed counts with some expected counts and tells you the
probability (P) that there is no difference between them. In Excel the 2 test is performed using the
formula: =CHITEST (observed range, expected range) . There are three different uses of the test
depending on how the expected data are calculated.

Sometimes the expected data can be calculated from a quantitative theory, in which case you
1.   are testing whether your observed data agree with the theory. If P < 5% then the data do not
agree with the theory, and if P > 5% then the data do agree with the theory. A good example is a
genetic cross, where Mendel’s laws can be used to
predict frequencies of different phenotypes. In this
example Excel formulae are used to calculate the
expected values using a 3:1 ratio of the total
number of observations. The 2 P is 53%, which is
much greater than 5%, so the results do indeed
support Mendel’s law. Incidentally a very high P
(>80%) is suspicious, as it means that the results
are just too good to be true.

2.     Other times the expected data are calculated by assuming that the counts in all the categories
should be the same, in which case you are testing whether there is a difference between the
counts. If P < 5% then the counts are significantly
different from each other, and if P > 5% then there
is no significant difference between the counts. In
the example above the sex of children born in a
hospital over a period of time is compared. The
expected values are calculated by assuming there
should be equal numbers of boys and girls, and the
2 P of 6.4% is greater than 5%, so there is no
significant difference between the sexes.

If the count data are for categories in two groups, then the expected data can be calculated by
3.    assuming that the two groups are independent. If P < 5% then there is a significant association
between the two groups, and if P > 5% then the two groups are independent. Each group can have
counts in two or more categories, and the observed frequency data are set out in a table, called a
contingency table. A copy of this table is then made for the expected data, which are calculated for
each cell from the corresponding totals of the observed data, using the formula
E = column total x row total / grand total . In this example the flow rate of a stream (the two categories
fast / slow) is compared to the type of stream bed (the four categories weed-choked / some weeds /
shingle / silt) at 50
different sites to see if
there is an association
between them. The 2 P
of 1.1% is less than 5%,
so there is an association
between flow rate and
stream bed.

HGS Biology                                                                                      NCM/11/01
A2 statistics page 7

HGS Biology                          NCM/11/01
A2 statistics page 8

Problems
1. In a test of two drugs 8 patients were given one drug and 8 patients another drug. The number of hours of relief from
symptoms was measured with the following results:
Drug A        3.2    1.6      5.7     2.8       5.5    1.2     6.1     2.9
Drug B        3.8    1.0      8.4     3.6       5.0    3.5     7.3     4.8
Find out which drug is better by calculating the mean and 95% confidence limit for each drug, then use an
appropriate statistical test to find if it is significantly better than the other drug.

2. In one of Mendel's dihybrid crosses, the following types and numbers of pea plants were recorded in the F2
generation:
Yellow round seeds     Yellow wrinkled seeds        Green round seeds     Green wrinkled seeds
289                     122                         96                       39
According to theory these should be in the ratio of 9:3:3:1. Do these observed results agree with the expected ratio?

3. The areas of moss growing on the north and south sides of a group of trees were compared.
North side of tree      20    43    53     86    70     54
South side of tree      63    11    21     54     9     74
Is there a significant difference between the north and south sides?

4. Five mammal traps were placed in randomly-selected positions in a deciduous wood. The numbers of field mice
captured in each trap in one day were recorded. The results were:
Trap A         B     C     D     E
no. of mice 22        26    21    8    23
Trap D caught far fewer mice than the others. Did this happen by chance or is the result significant?

5. In an investigation into pollution in a stream, the concentration of nitrates was measured at six different sites, and a
diversity index was calculated for the species present.
Site                      1         2        3       4         5         6
Conductivity (S)      413.3     439.7      726     850     567.3     766.7
Diversity index         7.51      5.17     4.49    3.82      5.88      3.74
Is there a correlation between conductivity and diversity, and how strong is it? (The diversity index is calculated
from biotic data, so is not normally distributed.)

6. The blood groups of 400 individuals, from 4 different ethnic groups were recorded with the following results:
Ethnic group      Blood Group O     Blood Group A      Blood Group B      Blood Group AB
1                 46                40                  7                  3
2                 48                39                 12                  2
3                 53                33                 12                  4
4                 55                30                 13                  3
Is there as association between blood group and ethnic group?

7. The effect of enzyme concentration on rate of a reaction was investigated with the following results.
Enzyme concentration (mM)            0     0.1     0.2    0.5     0.8     1.0
Rate (arbitrary units)               0     0.8     1.1    3.2     6.6     7.2
Plot a graph of these results, fit a straight line to the data, and find the slope of this line. Use the slope to predict the
rate at an enzyme concentration of 0.7mM.

HGS Biology                                                                                                        NCM/11/01
A2 statistics page 9

Alternative flow chart, showing the non-parametric tests (which are not available in Excel):

Parametric Test                 Non-Parametric Test

Testing for a           Pearson correlation coefficient     Spearman correlation coefficient
correlation             =CORREL (range 1, range 2)          =CORREL (range 1, range 2)
0 = no correlation                  on ranks of data
1 = perfect correlation             0=no correlation
1=perfect correlation
Plot
scatter
graph            Finding how
one factor              Linear regression
affects another         Add Trendline to graph and
Testing for                                      Display Equation.
a relation                                      Gives slope and intercept of line
between 2 sets

Parametric Test                 Non-Parametric Test
Calculate                                            What
mean and                                              kind                     same          Paired t-test
95% CI from                                             of                      individuals   =TTEST(range1, range2, 2, 1)        Wilcoxon Matched Pairs test
replicates                                           test?                                   If P<5% then sig. difference        Not available in Excel
If P>5% then no sig. difference

2 sets
Testing for
a difference                                                 Parametric Test                 Non-Parametric Test
Measurements

between sets
different     Unpaired t-test
individuals   =TTEST(range1, range2, 2, 2)        Mann-Whitney U-test
If P<5% then sig. difference        Not available in Excel
Plot                                    If P>5% then no sig. difference
bar
graph

Parametric Test                 Non-Parametric Test
What                                                                                       ANOVA
start       kind                                                                                      Tools menu >Data analysis >Anova
here                                                                          >2 sets                                                     Kruskal-Wallis test
of                                                                                       If P<5% then sig. difference        Not available in Excel
data?                                                                                      If P>5% then no sig. difference
Frequencies (counts)

Comparing observed                 2 test
counts to a theory                 =CHITEST(obs range, exp range)
If P<5% then disagree with theory
If P>5% then agree with theory

Testing for a difference
2 test
What
kind        between counts             =CHITEST(obs range, exp range)
of                                    If P<5% then sig. difference
test?                                   If P>5% then no sig. difference

Testing for an association         2 test for association
between groups of counts           =CHITEST(obs range, exp range)
If P<5% then sig. association
If P>5% then no sig. association

HGS Biology                                                                                                                                                        NCM/11/01

```
DOCUMENT INFO
Shared By:
Categories:
Stats:
 views: 73 posted: 5/22/2011 language: English pages: 12
How are you planning on using Docstoc?