# Analysis of differences in species composition - Biology by hcj

VIEWS: 5 PAGES: 64

• pg 1
```									 1.) Alpha diversity –the diversity at a
single point
 2.) Beta diversity – the diversity in a
habitat
 3.) Gamma diversity – the diversity in
a landscape
Organism      # found        pi            ln(pi)       pi*ln(pi)
Water boatman       1   0.047619048    -3.044522438   -0.144977259
Hempitera           1   0.047619048    -3.044522438   -0.144977259
Water strider       2   0.095238095    -2.351375257   -0.223940501
Bivalvia            8   0.380952381    -0.965080896   -0.367649865
Gastropoda          3   0.142857143    -1.945910149   -0.277987164
Shiner              5   0.238095238    -1.435084525   -0.341686792
Sunfish             1   0.047619048    -3.044522438   -0.144977259
21                                 -1.646196099
H'               1.646196099
   Simpson’s Diversity Index is 1-D where

Organism            n      n(n-1)
Water boatman            1      0
Hempitera                1      0
Water strider            2      2
Bivalvia                 8     56
Gastropoda               3      6
Shiner                   5     20
Sunfish                  1      0
Sum                    21      84
0.2
Simpson's     0.8
   Diversity indices provide a summary of
richness and evenness by combining these
two facets of diversity into a single statistic.
   Common ones are Shannon-Weiner index,
Simpson’s index and the log alpha index.
   library(BiodiversityR)
   data(dune)
   Diversity.1 <- diversityresult(dune, index =
'Shannon', method = 's')
   Diversity.2 <- diversityresult(dune, index =
'Simpson', method = 's')
   Diversity.3 <- diversityresult(dune, index =
'Logalpha', method = 's')
   cbind(Diversity.1, Diversity.2, Diversity.3)
   Note that it is not
statistically
meaningful to use
parametric or not
parametric statistics
to determine if sites
differ “significantly”
from each other in
terms of diversity
n = 47 + 35 + 7 + 5 + 3
                                                  +2
Michigan     fi       fi log fi      fi log2 fi   n = 99
Oak      47    78.58859932     131.4078286
Corn      35    54.04238155      83.4451144
Blackberry    7     5.91568628     4.999334881
Beech       5    3.494850022     2.442795335
Cherry      3    1.431363764     0.682934075
Other      2    0.602059991     0.181238117
Louisiana    fi      fi log fi      fi log2 fi
Oak      48    80.69957939    135.6754607
Pine     23    31.31974023    42.64896209
Grape     11    11.45531954    11.92948597
Corn      13    14.48126358    16.1313073
Blueberry    8    7.224719896    6.524572197
Other      2    0.602059991    0.181238117
n = 48 + 23 + 11 + 13 +
                                               8+2
n = 105

   Basic diversity indices are often used to
measure alpha diversity
   The function diversity in the vegan library
finds the most commonly used diversity
indices.
   library(vegan)
   data(BCI)
   H <- diversity(BCI)
   H
◦ This will give you the Shannon-Weiner diversity
index for each site.
   To measure Simpson’s index of diversity, just
use
   H.simpson <- diversity(BCI, index =
'simpson')
   H.simpson
   Beta diversity typically relies upon
dimensionless metrics based on species
dissimilarity or turnover between samples.
S = # of specimens
C = shared species

A       B                      A          B
0.25

D            E
C                              C

Site 1                        Site 2
A         B        A             B

D              E
C                     C

A A       B       A A        B

A A                A A
C                 C
   For distances/dissimilarity indices, the
smaller the value, the less the difference in
species composition.
   Distance/dissimilarity matrices provide
information on ecological
distances/dissimilarities between all pairs of
   Under Euclidean distance, you can use each
species as an axis.
   Constrained to be between 0 and 1
◦ Bray-Curtis
◦ Kulczynski
   Values can be larger than 1
◦ Euclidean distance
◦ Hellinger distance
◦ Chi-square distance
   Presence/absence
◦ Jaccard
◦ Sorenson
   Most of the measures of distance/dissimilarity
are sensitive to the numbers of individuals
encountered
   In many cases, it is worthwhile to transform the
data into proportions
   Species proportions for each site are known as
species profiles.
   library(BiodiversityR)
   data(dune)
   community.hel <- disttransform(dune, method =
'hellinger')
   hellinger.distance <- vegdist(community.hel,
method = 'euclidean')
   hellinger.distance
   You may want to test for a correlation among
the values of an environmental gradient and
ecological distance matrices.
   For this, you will use a Mantel test – a test of
the correlation between two matrices
   data(dune.env)
   envir.euc <- vegdist(dune.env\$A1, method =
'euclidean')
   ecology.bray <- vegdist(dune, method =
'bray')
   mantel(envir.euc, ecology.bray, 'pearson')
   The correlation is fairly low (r = 0.2379) but
the results are significant (p = 0.042).
   The function betadiver in vegan estimates any
of the 24 indices of beta diversity reviewed by
Koleff et al. (2003)
   library(vegan)
   data(sipoo)
   What are the variables?
   ?sipoo
   Where A & B are the number of species in
samples A & B and C is the number of species
shared by the two samples
   d.sor
   The ANOSIM (ANalysis Of SIMilarity) test is
very useful for comparing among categorical
variables (e.g. management techniques)
   anosim(ecology.bray, dune.env\$Management)
   Hierarchical clustering analysis can provide a
method for examining how similar sites are
to each other.
   library(cluster)
   distmatrix <- vegdist(dune, method = 'bray')
   Cluster.1 <- agnes(distmatrix, method =
'single')
   This is an
example of
agglomerative
clustering
   It starts by
treating each site
as a cluster of 1
   The nearest two
sites are joined to
form a cluster.
The process
continues until a
cluster has been
created that
contains all the
sites
   Divisive hierarchical clustering starts at the
top and splits that level.
   The splitting continues until each observation
has been split out.
   Cluster.4 <- diana(distmatrix)
   plot(Cluster.4, which.plot = 2)
   Often species
variation
follows multiple
than just one, in
which case
ordination is
typically more
appropriate
   The clustering techniques so far shows the
ecological distance among clusters rather than
among all the sites.
   Cophenetic correlation is the correlation between
the distance of the original distance matrix and
the cophenetic distance (the distance at which
clusters are combined in a dendrogram).
   If the cophenetic correlation is large, than the
distance portrayed in the dendrogram is a good
representation of distances between individual
sites
   copheneticdist <- cophenetic(Cluster.1)
   mantel(distmatrix, copheneticdist)
   Cafeillo (Faramea
occidentalis) is a small
tree ranging from Cuba
central Mexico through
Brazil
   Imagine that we want to investigate the
abundance of this species on Barro Colorado
Island (Panama)
   We hypothesize that abundance is affected by
precipitation

Image from
http://www.fs.fed.us/global/iitf/pdf/shrubs
/Faramea%20occidentalis.pdf
   library(BiodiversityR)
   data(faramea)
   What are the names of the variables?

   Plot counts of the species against
precipitation
   Count.model1 <- lm(Faramea.occidentalis ~
Precipitation, data = faramea, na.action =
na.exclude)
   summary(Count.model1)
   fitted(Count.model1)
   residuals(Count.model1)
   Residuals are the part of the data that weren’t
modeled and so shouldn’t show any pattern
   library(akima)
   surface.gam <-
residualssurface(Count.model1,
na.omit(faramea), 'UTM.EW', 'UTM.NS', gam =
T, npol = 2, plotit = T, bubble = F, fill = T)
   This creates a trend-surface that models the
changes in the response variable
(i.e. the residuals)
   Residuals are
highest in the
center, decreasing
as you move away
indicating a
spatially-
autocorrelated
population
   The assumptions
of a linear
regression model
aren’t met here.
   Generalized linear models (GLMs) fit a wider,
more general class of models and can cope
with situations where observations are not
normally distributed
   GLMs are characterized by two functions
◦ The link function describes how the mean of the
response variable depends on linear predictors (the
explanatory variables)
◦ The variance function captures how the variance of
the response variable depends upon the mean
   Count.model3 <- glm(Faramea.occidentalis ~
Precipitation, family = poisson,
data=faramea, na.action=na.exclude)
   summary(Count.model3)
   sum(resid(Count.model3, type = "pearson")
^2, na.rm=TRUE) /41

   If this is >1, it is overdispersed
   Count.model4 <- glm(Faramea.occidentalis ~
Precipitation, family = quasipoisson,
data=faramea, na.action=na.exclude)
   summary(Count.model4)
   The negative binomial GLM is often used
where dispersion is >1 or where organisms
are clumped (and not randomly distributed)
   This is VERY useful in ecology!
   library(MASS)
   Count.model5 <-
glm.nb(Faramea.occidentalis ~ Precipitation,
maxit = 5000, init.theta = 1, data = faramea,
na.action = na.exclude)
   summary(Count.model5)
   Generalized additive models (GAMs) are more
general than GLMs.
◦ They are based on smoothing.
◦ Because of this smoothing, the relationships with
the explanatory variables are no longer linear
   Let’s fit a negative binomial GAM to the tree
data
   Count.model6 <- gam(Faramea.occidentalis
~ s(Precipitation), family = poisson(), data =
na.omit(faramea))
   summary(Count.model6)
   Let’s try including precipitation squared in
the negative binomial GLM
   Count.model7 <-
glm.nb(Faramea.occidentalis ~ Precipitation +
I(Precipitation^2), maxit = 5000, init.theta =
1, data = faramea, na.action = na.exclude)
   summary(Count.model7)
Species will often have
unimodel distributions, so
a 2nd-order polynomial
model will often be used
   Does the age of the forest affect the
presence/absence of Cafeillo?
   One of the easiest ways to begin exploring
with contingency tables
   table1 <-
table(faramea\$Faramea.occidentalis>0,
faramea\$Age.cat)
   table1
   Presabs.1 <- chisq.test(table1)
   Presabs.1\$observed

   Presabs.1\$expected

   Presabs.1
   Presabs.model2 <-
glm(Faramea.occidentalis>0 ~ Age.cat, family
= binomial(link = logit), data = faramea,
na.action = na.exclude)
   summary(Presabs.model2)
   Because of the logit link, you would need to
calculate out y = exp(x)/(1+exp(x)) to get the
estimates for each site.
   Or you can type
◦ predict(Presabs.model2, type = 'response', se.fit =
T)
   The model performance where presence
depends upon forest age wasn’t very
impressive
   Let’s try a quasibinomial model
   Presabs.model3 <-
glm(Faramea.occidentalis>0 ~ Age.cat, family
na.action = na.exclude)
   summary(Presabs.model3)
   Earlier, we saw that precipitation and
precipitation squared may have an effect on the
numbers of Cafeillo.
   Other variables might potentially have an effect
as well – geology, age of the forest, elevation,
etc.
   Let’s see what’s important!
   Presabs.model6 <- glm(Faramea.occidentalis>0
~ Precipitation + I(Precipitation^2) + Geology +
Age.cat + Elevation + I(Elevation^2), family =
binomial(link = logit), data = faramea, na.action
= na.exclude)
   summary(Presabs.model6)
   Rather than having a model that includes all
variables (and may overfit!), it may be
worthwhile to figure out which variables you
can remove to better balance simplicity (fewer
variables) and explained deviance
   drop1(Presabs.model6, test = 'Chi')
   Presabs.model7 <-
glm(Faramea.occidentalis>0
~ Precipitation +
I(Precipitation^2) + Geology
+ Age.cat + Elevation, family