Gene expression
Statistics 246, Week 3, 2002
Thesis: the analysis of gene
expression data is going to be big
in 21st century statistics
Many different technologies, including
High-density nylon membrane arrays
Serial analysis of gene expression (SAGE)
Short oligonucleotide arrays (Affymetrix)
Long oligo arrays (Agilent)
Fibre optic arrays (Illumina)
cDNA arrays (Brown/Botstein)*
Total microarray articles indexed in Medline
[Figure: bar chart of number of papers (0 to 600) by year, 1995 to 2001; the 2001 value is projected.]
Common themes
• Parallel approach to collection of very large
amounts of data (by biological standards)
• Sophisticated instrumentation, requires some
understanding
• Systematic features of the data are at least as
important as the random ones
• Often more like industrial process than single
investigator lab research
• Integration of many data types: clinical,
genetic, molecular … databases
Biological background
Transcription
DNA
G T A A T C C T C
| | | | | | | | |
C A T T A G G A G
RNA polymerase
mRNA
Idea: measure the amount of mRNA to see which
genes are being expressed in (used by) the cell.
Measuring protein might be better, but is currently
harder.
Reverse transcription
Clone cDNA strands, complementary to the mRNA
mRNA: G U A A U C C U C
Reverse transcriptase
cDNA: C A T T A G G A G
[Figure: a pool of cloned cDNA strands]
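The base pairing shown on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `cdna_from_mrna` is made up for the example, and the cDNA is written in the same left-to-right order as the mRNA, as on the slide, rather than reversed into 5'-to-3' orientation.

```python
# Complement of each mRNA base in DNA (U pairs with A, A pairs with T).
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def cdna_from_mrna(mrna):
    """Base-by-base DNA complement of an mRNA string, in the same order."""
    return "".join(COMPLEMENT[base] for base in mrna)

print(cdna_from_mrna("GUAAUCCUC"))  # CATTAGGAG, the strand shown on the slide
```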
cDNA microarray experiments
mRNA levels compared in many different contexts
Different tissues, same organism (brain v. liver)
Same tissue, same organism (treatment v. control, tumor v. non-tumor)
Same tissue, different organisms (wild type v. knock-out, transgenic, or mutant)
Time course experiments (effect of treatment, development)
Other special designs (e.g. to detect spatial patterns).
cDNA microarrays
[Figure: cDNA clones]
cDNA microarrays
Compare the genetic expression in two samples of cells
PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green, e.g. treatment / control, normal / tumor tissue
HYBRIDIZE: Add equal amounts of labelled cDNA samples to microarray.
SCAN: Laser, Detector
Biological question
Differentially expressed genes
Sample class prediction etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation Testing Clustering Discrimination
Biological verification
and interpretation
Some statistical questions
Image analysis: addressing, segmenting, quantifying
Normalisation: within and between slides
Quality: of images, of spots, of (log) ratios
Which genes are (relatively) up/down regulated?
Assigning p-values to tests/confidence to results.
Some statistical questions, ctd
Planning of experiments: design, sample size
Discrimination and allocation of samples
Clustering, classification: of samples, of genes
Selection of genes relevant to any given analysis
Analysis of time course, factorial and other special
experiments … and much more.
Some bioinformatic questions
Connecting spots to databases, e.g. to sequence,
structure, and pathway databases
Discovering short sequences regulating sets of
genes: direct and inverse methods
Relating expression profiles to structure and
function, e.g. protein localisation
Identifying novel biochemical or signalling
pathways … and much more.
Part of the image of one channel, false-coloured on a scale running from white (very high)
and red (high), through yellow and green (medium), to blue (low) and black
Does one size fit all?
Segmentation: limitation of the
fixed circle method
SRG Fixed Circle
Inside the boundary is spot (foreground), outside is not.
Some local backgrounds
Single channel
grey scale
We use something different again: a smaller, less variable value.
Quantification of expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
(fg = foreground, bg = background)
and combine them in the log (base 2) ratio
Log2( Red intensity / Green intensity)
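As a minimal sketch of this calculation for a single spot, the Python below subtracts background from foreground in each channel and takes the base 2 log of the ratio. The function name and the choice to return `None` when a background-corrected intensity is not positive are assumptions made for the example; the slides do not say how such spots are handled.

```python
import math

def log_ratio(rfg, rbg, gfg, gbg):
    """Return log2((Rfg - Rbg) / (Gfg - Gbg)) for one spot.

    Returns None when either corrected intensity is <= 0, because the
    log ratio is then undefined (an assumption, not from the slides).
    """
    red = rfg - rbg
    green = gfg - gbg
    if red <= 0 or green <= 0:
        return None
    return math.log2(red / green)

# Made-up intensities: red is twice green after background correction.
print(log_ratio(1200, 200, 600, 100))  # 1.0
```

A value of 1 therefore means twice as much red as green signal, and a value of -1 means half as much.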
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

                            Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...
Gene expression level of gene 5 in slide 4
= Log2( Red intensity / Green intensity)
These values are conventionally displayed
on a red (>0), yellow (0), green (<0) scale.
[Figure residue: olfactory system diagram; surviving labels: 2M, ~1,800 types, Neocortex]
Two principles: “zone-to-zone projection”, and “glomerular convergence”
Of interest: the hardwiring of the
vertebrate olfactory system
• Expression of a specific odorant receptor gene by
an olfactory neuron.
• Targeting and convergence of like axons to specific
glomeruli in the olfactory bulb.
The biological question in this case
Are there genes with spatially
restricted expression patterns within
the olfactory bulb?
Layout of the cDNA Microarrays
• Sequence verified mouse cDNAs
• 19,200 spots in two print groups of 9,600 each
– 4 x 4 grids per print group, each grid with 25 x 24 spots
– Controls on the first 2 rows of each grid.
[Figure: the two print groups, pg1 and pg2]
Design: How We Sliced Up the Bulb
[Figure: the bulb cut into six regions, labelled A, P, D, V, L and M]
Design: Two Ways to Do the
Comparisons
Goal: 3-D representation of gene expression
• Compare all samples to a common reference sample (e.g., whole bulb)
• Multiple direct comparisons between different samples (no common reference)
[Figure: the two designs drawn on the samples A, P, D, V, L, M; the first also includes the reference R]
An Important Aspect of Our Design
Different ways of estimating
the same contrast:
e.g. A compared to P
Direct = A-P
Indirect = (A-M) + (M-P) or
           (A-D) + (D-P) or
           -(L-A) - (P-L)
How do we combine these?
Analysis using a linear model
Define a matrix X so that E(M) = Xβ, where M is the vector of observed log ratios and β the vector of parameters.
Use least squares estimates for A-L, P-L, D-L, V-L, M-L
In practice, we use robust regression.
Estimates for other estimable contrasts follow in the usual way.
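\[
E\begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_n \end{pmatrix}
= X\beta,
\qquad
\beta = \begin{pmatrix} A-L \\ P-L \\ D-L \\ V-L \\ M-L \end{pmatrix},
\qquad
\hat{\beta} = (X'X)^{-1} X'M
\]
Here M = (m_1, ..., m_n)' and each row of X codes one hybridization in terms of the five parameters.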
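The least squares step can be sketched in plain Python as below. This is an illustration under stated assumptions, not the analysis used in the lecture: it does ordinary least squares via the normal equations, whereas the slide notes that robust regression is used in practice. The names `design_row`, `solve` and `least_squares`, the list of hybridizations and the "true" effects are all made up for the example, and a hybridization written `("A", "P")` is assumed to have expected log ratio A-P.

```python
REGIONS = ["A", "P", "D", "V", "M"]  # parameters are X-L for X in this list; L is the baseline

def design_row(num, den):
    """Row of X for a hybridization whose expected log ratio is num - den."""
    row = [0.0] * len(REGIONS)
    if num != "L":
        row[REGIONS.index(num)] += 1.0
    if den != "L":
        row[REGIONS.index(den)] -= 1.0
    return row

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: some contrasts are not estimable")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(hybs, m):
    """Estimate A-L, P-L, D-L, V-L, M-L as (X'X)^{-1} X'M."""
    x = [design_row(num, den) for num, den in hybs]
    p = len(REGIONS)
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xtm = [sum(r[i] * mk for r, mk in zip(x, m)) for i in range(p)]
    return dict(zip(REGIONS, solve(xtx, xtm)))

# Toy, noise-free data for one gene (made-up effects, hypothetical hybridizations).
truth = {"A": 1.0, "P": 0.5, "D": -0.5, "V": 0.0, "M": 0.25, "L": 0.0}
hybs = [("A", "P"), ("A", "M"), ("M", "P"), ("A", "D"), ("D", "P"),
        ("L", "A"), ("P", "L"), ("V", "L"), ("V", "D")]
m = [truth[a] - truth[b] for a, b in hybs]
est = least_squares(hybs, m)
print(round(est["A"] - est["P"], 6))  # 0.5 for this noise-free toy input
```

With noisy data the direct and indirect estimates of A-P would disagree, and the fit combines them with weights determined by the design.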
The Olfactory Bulb Experiments
completed so far
Contrasts & Patterns
Because of the connectivity of our experiment, we can estimate
all 15 different pairwise comparisons directly and/or indirectly.
For every gene we thus have a pattern based on the 15
pairwise comparisons.
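As a small sketch of how the 15 pairwise comparisons follow from five fitted parameters, the Python below assumes estimates of A-L, P-L, D-L, V-L and M-L are already available for one gene. The function name, the sign convention of the labels (first region minus second) and the numbers are assumptions for illustration only.

```python
from itertools import combinations

def pairwise_contrasts(est):
    """est maps A, P, D, V, M to estimates of X-L; returns all 15 pairwise contrasts."""
    effects = dict(est, L=0.0)  # L-L is zero by definition
    return {f"{a}-{b}": effects[a] - effects[b]
            for a, b in combinations("APDVML", 2)}

# Made-up estimates for a single gene.
pattern = pairwise_contrasts({"A": 1.0, "P": 0.5, "D": -0.5, "V": 0.0, "M": 0.25})
print(len(pattern))    # 15
print(pattern["A-P"])  # 0.5
```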
Gene #15,228
Contrasts & patterns: another way
Instead of estimating pairwise comparisons between each of the six
effects, we can come closer to estimating the effects themselves by
doing so subject to the standard zero sum constraint (6 parameters, 5
d.f.).
What we estimate for A, say, subject to this constraint, is in reality an
estimate of
A - 1/6(A + P + D + V + M + L).
This set of parameter estimates gives results similar to, but better than,
the ones we would have obtained had we carried out the experiments
with whole-bulb reference tissue.
In effect we have created the whole-bulb reference in silico.
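The zero sum constraint amounts to centring the six effects. A minimal Python sketch, assuming estimates of A-L, P-L, D-L, V-L and M-L for one gene (the function name and the numbers are made up for the example):

```python
def zero_sum_effects(est):
    """Turn estimates of X-L into effects that sum to zero over the six regions.

    Each returned value estimates X - (A + P + D + V + M + L) / 6.
    """
    effects = dict(est, L=0.0)  # L-L is zero by definition
    mean = sum(effects.values()) / len(effects)
    return {region: value - mean for region, value in effects.items()}

centred = zero_sum_effects({"A": 1.0, "P": 0.5, "D": -0.5, "V": 0.0, "M": 0.25})
print(abs(sum(centred.values())) < 1e-12)  # True: the six effects sum to zero
```

Differences between regions are unchanged by the centring, so the pairwise contrasts can still be recovered from the six centred values.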
Alternative pattern representation
Gene #15,228
once again.
Reconstruction of the Bulb as a Cube:
Expression of Gene # 15,228
[Colour scale: expression level, from high to low]
Patterns, More Globally...
Can we identify genes with interesting
patterns of expression across the bulb?
Two approaches:
1. Find the genes whose expression fits
specific, predefined patterns.
2. Perform cluster analysis - see what
expression patterns emerge.
Clustering procedure
Start with a set of genes exhibiting some minimal level of differential
expression across the bulb; here ~650 were chosen from all 15 contrasts.
Carry out hierarchical clustering, building a dendrogram: Mahalanobis
distance and Ward agglomeration (minimum variance) were used.
Now consider all clusters of 2 or more genes in the tree. Singles are
added separately.
Measure the heterogeneity h of a cluster by calculating the 15 SDs
across the cluster of each of the pairwise effects, and taking the largest.
Choose a score s (see plots) and take all maximal disjoint clusters with
h < s. Here we used s = 0.46 and obtained 16 clusters.
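The selection of maximal disjoint clusters can be sketched as a top-down walk of the tree. The Python below is an illustration under assumptions: the dendrogram is taken as given (the Mahalanobis/Ward clustering itself is not shown) and is represented as nested tuples of gene names, the SD is the sample SD from `statistics.stdev`, and the toy profiles use 2 contrasts instead of 15 to keep the example short. All names and numbers are hypothetical.

```python
from statistics import stdev

def leaves(node):
    """All gene names under a tree node (a node is a gene name or a tuple of nodes)."""
    if isinstance(node, tuple):
        return [g for child in node for g in leaves(child)]
    return [node]

def heterogeneity(genes, profiles):
    """h: the largest SD, taken contrast by contrast, across the genes in a cluster."""
    columns = zip(*(profiles[g] for g in genes))
    return max(stdev(col) for col in columns)

def maximal_clusters(node, profiles, s):
    """Maximal disjoint clusters of 2 or more genes with h < s."""
    if not isinstance(node, tuple):
        return []  # single genes are handled separately
    genes = leaves(node)
    if heterogeneity(genes, profiles) < s:
        return [genes]  # this whole subtree qualifies, so do not descend further
    return [c for child in node for c in maximal_clusters(child, profiles, s)]

# Toy input: five made-up genes, two contrasts each.
profiles = {"g1": [1.0, 0.0], "g2": [1.1, 0.1], "g3": [-1.0, 0.5],
            "g4": [0.0, -1.0], "g5": [0.1, -1.2]}
tree = ((("g1", "g2"), "g3"), ("g4", "g5"))
print(maximal_clusters(tree, profiles, s=0.46))  # [['g1', 'g2'], ['g4', 'g5']]
```

Raising s merges clusters higher up the tree, so fewer and larger clusters result; lowering it splits them.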
Plots guiding choice of clusters of genes
[Figure: number of clusters (patterns) and number of genes, plotted against cluster heterogeneity h (max of 15 SDs).]
[Figure: the 15 pairwise effects (PA, DA, VA, MA, LA, DP, VP, MP, LP, VD, MD, LD, MV, LV, LM). Red: genes chosen. Blue: controls.]
The 16 groups systematically arranged (6 point representation)
Validation of Gene # 15,228 Expression
Pattern by RNA In Situ Hybridization
[Figure: in situ hybridization images; region labels CTX, AOB, MOB; panel labels gluR and #15,228]
Gene 15,228: another in situ view
[Figure: in situ image; labels: 384 (group 3); orientation D, L, M, V]
3-dimension reconstruction from in-situ data
[Figure panels: 15,228; 5,291; 8,496; 384]
Are the patterns we found real?
Here's how we attempted to show that the answer is a qualified yes.
Each cluster average (pattern) has a 'strength' we can measure by
its root-mean-square (RMS). The n = 16 clusters we found have an average
RMS of av = 0.3, both n and av being strongly determined by our
heterogeneity cut-off score of s = 0.46.
Now consider randomizing the labels (e.g. P-A) on our hybridizations and
repeating the entire analysis, keeping the cut-off score at 0.46. We typically
get fewer, "weaker" patterns, with less contrast in the red-green
patchwork. One such is on the next page.
500 independent random relabellings had a mean av value of 0.18, an SD
of 0.07 and a max av value of 0.26, cf. 0.3 in our data. Our clusters are
definitely 'non-random' in some sense.
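The pieces of this randomization check can be sketched as follows. This is a hedged outline rather than the actual procedure: the function names are made up, the full pipeline (refit, recluster, recompute av for each relabelling) is not shown, and the toy numbers are invented.

```python
import math
import random

def rms(pattern):
    """Root-mean-square 'strength' of one cluster average."""
    return math.sqrt(sum(v * v for v in pattern) / len(pattern))

def average_strength(cluster_averages):
    """av: the mean RMS over all cluster averages."""
    return sum(rms(p) for p in cluster_averages) / len(cluster_averages)

def relabel(labels, seed):
    """One random relabelling: the same hybridization labels in shuffled order."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def fraction_at_least(observed, null_values):
    """Fraction of relabelled analyses whose av is at least the observed av."""
    return sum(v >= observed for v in null_values) / len(null_values)

print(average_strength([[1.0, -1.0], [0.0, 0.0]]))  # 0.5
print(fraction_at_least(0.3, [0.1, 0.2, 0.26]))     # 0.0
```

Since the largest av among the 500 relabellings was 0.26, none of them reached the observed 0.3, which is the sense in which the observed clusters stand apart from the relabelled ones.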
[Figure: patterns from the real data and from one random relabelling]
Problem
We later tried all this with a different set of data, one
which made use of reference mRNA, had generally
lower S/N, and where the investigator sought fewer
interesting patterns.
We found that the patterns the previous method
discovered were similarly quite distinct in av values
from those in randomly labelled hybs, but this time
the av values were 'significantly' lower than random.
It all depends where you are on the curve.
Where next?
I feel that we need a new idea. The previous one
doesn't seem to have worked. Or did it?
Just clustering and taking averages seems too easy…
But maybe clustering is all there is to patterns, once
we have decided on the appropriate, context-dependent
profile to cluster and selected the genes.
But I keep wondering…
Some statistical research stimulated
by microarray data analysis
Experimental design: Churchill & Kerr
Image analysis: Zuzan & West, …
Data visualization: Carr et al
Estimation: Ideker et al, …
Multiple testing: Westfall & Young, Storey, …
Discriminant analysis: Golub et al, …
Clustering: Hastie & Tibshirani, Van der Laan,
Fridlyand & Dudoit, …
Empirical Bayes: Efron et al, Newton et al, …
Multiplicative models: Li & Wong
Multivariate analysis: Alter et al
Genetic networks: D'Haeseleer et al and more
In closing: The pervasiveness of
microarray technology
and the statistical problems that go with it
Hybridization of target DNA or RNA to large
numbers of probes attached to a solid support
in a microarray format has a much wider
applicability.
All such applications have their own statistical
problems. Here are two relating to the
previous lectures.
Meiosis data in which all exchanges
are precisely located (from microarrays)
Yeast
Figure courtesy of J Derisi
Exon Arrays can validate Exon Predictions
and assemble Gene Structures
One or more Probes per Predicted Exon
Predicted exon Predicted exon
• Verify predicted exons on a genome-wide scale.
• Group exons into genes via co-regulation.
This and the next slide courtesy of Rosetta
Tiling arrays can identify exons and
refine gene structures
[Figure: two predicted exons covered by overlapping probes]
Oligonucleotides 60 bp in length (“60-mers”), tiled in 10 bp steps
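The tiling scheme on this slide, 60-mers placed every 10 bp, can be sketched as below. The function name and the toy sequence are made up for illustration, and real probe design would involve more than slicing a string.

```python
def tiling_probes(sequence, length=60, step=10):
    """Return (start, probe) pairs: probes of `length` bases every `step` bases."""
    return [(start, sequence[start:start + length])
            for start in range(0, len(sequence) - length + 1, step)]

# Toy 100-base sequence (made up).
probes = tiling_probes("ACGT" * 25)
print([start for start, _ in probes])  # [0, 10, 20, 30, 40]
```

Consecutive probes overlap by 50 bases, so each interior position of the region is covered by up to six probes.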
Acknowledgments
Statistical collaborators
Yee Hwa Yang (Berkeley)
Sandrine Dudoit (Berkeley)
Ingrid Lönnstedt (Uppsala)
Natalie Thorne (WEHI)
Mauro Delorenzi (WEHI)

CSIRO Image Analysis Group
Michael Buckley
Ryan Lagerstorm

WEHI
Glenn Begley
Suzie Grant
Rob Good

PMCI
Chuang Fong Kong

Ngai Lab (Berkeley)
Cynthia Duggan
Jonathan Scolnick
Dave Lin
Vivian Peng
Percy Luu
Elva Diaz
John Ngai

LBNL
Matt Callow

RIKEN Genomic Sciences Center
Yasushi Okazaki
Yoshihide Hayashizaki