Gene expression






Statistics 246, Week 3, 2002

Thesis: the analysis of gene expression data is going to be big in 21st century statistics



Many different technologies, including

High-density nylon membrane arrays

Serial analysis of gene expression (SAGE)

Short oligonucleotide arrays (Affymetrix)

Long oligo arrays (Agilent)

Fibre optic arrays (Illumina)

cDNA arrays (Brown/Botstein)*

Total microarray articles indexed in Medline

[Figure: number of papers per year, 1995 to 2001 (2001 projected); vertical axis from 0 to 600.]

Common themes



• Parallel approach to collection of very large amounts of data (by biological standards)
• Sophisticated instrumentation, requires some understanding
• Systematic features of the data are at least as important as the random ones
• Often more like industrial process than single investigator lab research
• Integration of many data types: clinical, genetic, molecular, ... databases

Biological background

Transcription



[Figure: RNA polymerase transcribing double-stranded DNA (GTAATCCTC paired with CATTAGGAG) into mRNA.]

Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be better, but is currently harder.

Reverse transcription

Clone cDNA strands, complementary to the mRNA





[Figure: reverse transcriptase copying mRNA (GUAAUCCUC) into complementary cDNA (CATTAGGAG), followed by a pool of cDNA fragments.]

cDNA microarray experiments

mRNA levels compared in many different contexts



Different tissues, same organism (brain v. liver)

Same tissue, same organism (treatment v. control, tumor v. non-tumor)

Same tissue, different organisms (wild type v. knockout, transgenic, or mutant)

Time course experiments (effect of treatment, development)

Other special designs (e.g. to detect spatial patterns).

cDNA microarrays



cDNA clones

cDNA microarrays

Compare gene expression in two samples of cells

PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green, e.g. treatment / control, normal / tumor tissue
HYBRIDIZE: add equal amounts of labelled cDNA samples to the microarray
SCAN: laser, detector

Biological question (differentially expressed genes, sample class prediction, etc.)
→ Experimental design
→ Microarray experiment
→ 16-bit TIFF files
→ Image analysis
→ (Rfg, Rbg), (Gfg, Gbg)
→ Normalization
→ R, G
→ Estimation, Testing, Clustering, Discrimination
→ Biological verification and interpretation

Some statistical questions



Image analysis: addressing, segmenting, quantifying



Normalisation: within and between slides



Quality: of images, of spots, of (log) ratios



Which genes are (relatively) up/down regulated?



Assigning p-values to tests/confidence to results.

Some statistical questions, ctd

Planning of experiments: design, sample size



Discrimination and allocation of samples



Clustering, classification: of samples, of genes



Selection of genes relevant to any given analysis



Analysis of time course, factorial and other special experiments ... and much more.

Some bioinformatic questions



Connecting spots to databases, e.g. to sequence, structure, and pathway databases

Discovering short sequences regulating sets of genes: direct and inverse methods

Relating expression profiles to structure and function, e.g. protein localisation

Identifying novel biochemical or signalling pathways ... and much more.

Part of the image of one channel, false-coloured on a scale running from white (v. high) and red (high), through yellow and green (medium), to blue (low) and black.

Does one size fit all?

Segmentation: limitation of the fixed circle method

[Figure panels: SRG (seeded region growing) and Fixed Circle]



Inside the boundary is spot (foreground), outside is not.
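
A minimal sketch of the fixed circle idea, and of why one size may not fit all: every pixel within a fixed radius of the spot centre is called foreground, everything else background. The function name, the toy image and the radius are illustrative assumptions, not the software behind these slides.

```python
def fixed_circle_segment(image, cx, cy, radius):
    """Split pixel values into (foreground, background) using a fixed circle."""
    fg, bg = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (fg if inside else bg).append(value)
    return fg, bg

# Toy 5 x 5 image: a bright diamond-shaped spot (value 9) on a dark background.
image = [
    [0, 0, 9, 0, 0],
    [0, 9, 9, 9, 0],
    [9, 9, 9, 9, 9],
    [0, 9, 9, 9, 0],
    [0, 0, 9, 0, 0],
]

# A circle of radius 1 is too small for this spot.
fg, bg = fixed_circle_segment(image, cx=2, cy=2, radius=1)
print(len(fg), sum(fg) / len(fg))   # foreground pixels and their mean
print(len(bg), sum(bg) / len(bg))   # background mean is inflated by spot pixels
```

With the circle smaller than the spot, eight bright pixels are counted as background, so the background estimate is too high and the background-corrected intensity too low.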

Some local backgrounds









[Figure: single channel, grey scale]









We use something different again: a smaller, less variable value.

Quantification of expression

For each spot on the slide we calculate

Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg

where fg = foreground and bg = background, and combine them in the log (base 2) ratio

log2(Red intensity / Green intensity)
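
The calculation above can be sketched in a few lines. The function name and the example intensities are made up for illustration, and returning None when a background-corrected intensity is not positive is an assumption of this sketch, not something the slides prescribe.

```python
import math

def log_ratio(r_fg, r_bg, g_fg, g_bg):
    """Return log2(red / green) after background subtraction, or None if undefined."""
    red = r_fg - r_bg
    green = g_fg - g_bg
    if red <= 0 or green <= 0:
        return None  # the log ratio is undefined for non-positive intensities
    return math.log2(red / green)

# Hypothetical spot: red = 1200 - 200 = 1000, green = 600 - 100 = 500.
m = log_ratio(1200, 200, 600, 100)
print(m)
```

A value of 1 on this scale means the red channel is twice as intense as the green one; a value of -1 means half.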

Gene Expression Data

On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

                          Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   ...
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene 5 in slide 4 = log2(Red intensity / Green intensity)

These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.

[Figure labels, partly lost in extraction: 2M, ~1,800 types; Neocortex]

Two principles: “zone-to-zone projection”, and “glomerular convergence”

Of interest: the hardwiring of the vertebrate olfactory system

• Expression of a specific odorant receptor gene by an olfactory neuron.
• Targeting and convergence of like axons to specific glomeruli in the olfactory bulb.

The biological question in this case

Are there genes with spatially restricted expression patterns within the olfactory bulb?

Layout of the cDNA Microarrays

• Sequence verified mouse cDNAs
• 19,200 spots in two print groups of 9,600 each
– 4 x 4 grids, each with 25 x 24 spots
– Controls on the first 2 rows of each grid.

[Figure: the two print groups, pg1 and pg2]

Design: How We Sliced Up the Bulb



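[Figure: the bulb cut into six regions: anterior (A), posterior (P), dorsal (D), ventral (V), lateral (L) and medial (M).]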

Design: Two Ways to Do the Comparisons

Goal: 3-D representation of gene expression

1. Compare all samples to a common reference sample (e.g., whole bulb)
2. Multiple direct comparisons between different samples (no common reference)

[Figure: two design graphs on the regions A, P, D, V, L, M, the first with a reference R at the centre, the second without.]

An Important Aspect of Our Design



Different ways of estimating the same contrast, e.g. A compared to P:

Direct = A-P
Indirect = (A-M) + (M-P), or
           (A-D) + (D-P), or
           -(L-A) - (P-L)









How do we combine these?

Analysis using a linear model

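Define a matrix X so that E(M) = Xβ, where M = (m_1, ..., m_n)' is the vector of observed log ratios for a gene and β = (A-L, P-L, D-L, V-L, M-L)'.

Use least squares estimates for A-L, P-L, D-L, V-L, M-L. In practice, we use robust regression. Estimates for other estimable contrasts follow in the usual way.

\[
E\begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_n \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & 0 & 1 & 1 \\
1 & 0 & 0 & 0 & 0 \\
\vdots & & & & \vdots \\
1 & 1 & 0 & 0 & 0
\end{pmatrix}
\begin{pmatrix} A-L \\ P-L \\ D-L \\ V-L \\ M-L \end{pmatrix}
\]

(The nonzero entries of X are +1 or -1; the minus signs did not survive in this copy.)

\[
\hat{\beta} = (X'X)^{-1} X'M
\]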
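
One way to see how the direct and indirect estimates get combined is to run ordinary least squares on a small made-up design. Everything specific below is an assumption for illustration: the list of hybridizations, the noise-free log ratios and the function names. The slides use robust regression in practice, which this sketch does not attempt.

```python
REGIONS = ["A", "P", "D", "V", "M", "L"]   # L is the baseline
PARAMS = REGIONS[:-1]                       # A-L, P-L, D-L, V-L, M-L

def design_row(num, den):
    """Row of X for a slide measuring log2(num / den)."""
    row = [0.0] * len(PARAMS)
    if num != "L":
        row[PARAMS.index(num)] += 1.0
    if den != "L":
        row[PARAMS.index(den)] -= 1.0
    return row

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(x, m):
    """beta_hat = (X'X)^{-1} X'M, computed from the normal equations."""
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xtm = [sum(r[i] * y for r, y in zip(x, m)) for i in range(p)]
    return solve(xtx, xtm)

# Hypothetical hybridizations (numerator, denominator) for one gene.
slides = [("A", "P"), ("A", "M"), ("M", "P"), ("A", "D"), ("D", "P"),
          ("L", "A"), ("P", "L"), ("V", "L"), ("M", "V")]
# Made-up region effects, used only to generate noise-free log ratios.
true_effect = {"A": 1.0, "P": -0.5, "D": 0.2, "V": 0.0, "M": 0.4, "L": -0.3}
m = [true_effect[n] - true_effect[d] for n, d in slides]

X = [design_row(n, d) for n, d in slides]
beta_hat = least_squares(X, m)
a_minus_p = beta_hat[0] - beta_hat[1]   # (A-L) - (P-L) = A-P
print(dict(zip(PARAMS, beta_hat)), a_minus_p)
```

Because the direct slide (A, P) and the indirect paths through M, D and L all enter the same fit, the estimate of A-P pools them automatically, with each path weighted by how much information it carries.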



The Olfactory Bulb Experiments completed so far

Contrasts & Patterns

Because of the connectivity of our experiment, we can estimate all 15 different pairwise comparisons directly and/or indirectly.

For every gene we thus have a pattern based on the 15 pairwise comparisons.



Gene #15,228
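
As a small illustration of where the number 15 comes from: six regions give 6 x 5 / 2 = 15 pairs, and each pairwise contrast is a difference of two effects measured relative to a common baseline. The numerical values and the function name below are hypothetical.

```python
from itertools import combinations

# Hypothetical estimates of each region's effect relative to L (so L itself is 0).
rel_to_L = {"A": 1.3, "P": -0.2, "D": 0.5, "V": 0.3, "M": 0.7, "L": 0.0}

def pairwise_pattern(effects):
    """All pairwise contrasts x - y, keyed as 'x-y'."""
    return {f"{x}-{y}": effects[x] - effects[y]
            for x, y in combinations(effects, 2)}

pattern = pairwise_pattern(rel_to_L)
print(len(pattern))
for name, value in pattern.items():
    print(name, round(value, 2))
```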

Contrasts & patterns: another way

Instead of estimating pairwise comparisons between each of the six effects, we can come closer to estimating the effects themselves by doing so subject to the standard zero sum constraint (6 parameters, 5 d.f.).

What we estimate for A, say, subject to this constraint, is in reality an estimate of

A - 1/6(A + P + D + V + M + L).

This set of parameter estimates gives results similar to, but better than, the ones we would have obtained had we carried out the experiments with whole-bulb reference tissue.

In effect we have created the whole-bulb reference in silico.
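
A minimal sketch of the zero sum constraint, assuming we already hold estimates of each effect relative to L: subtracting the mean of the six values gives, for A, an estimate of A - 1/6(A + P + D + V + M + L). The numbers and the function name are hypothetical.

```python
# Hypothetical estimates relative to L (so L itself is 0).
rel_to_L = {"A": 1.3, "P": -0.2, "D": 0.5, "V": 0.3, "M": 0.7, "L": 0.0}

def zero_sum(effects):
    """Re-express effects so that they sum to zero across regions."""
    mean = sum(effects.values()) / len(effects)
    return {region: value - mean for region, value in effects.items()}

centred = zero_sum(rel_to_L)
print({k: round(v, 3) for k, v in centred.items()})
```

Centring changes each effect by the same constant, so every pairwise contrast (for example A-P) is unchanged; only the reference point moves, from L to the average over the bulb.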

Alternative pattern representation



[Figure: Gene #15,228 once again.]

Reconstruction of the Bulb as a Cube: Expression of Gene # 15,228

[Figure: cube reconstruction of the bulb, coloured by expression level from high to low.]

Patterns, More Globally...

Can we identify genes with interesting patterns of expression across the bulb?

Two approaches:
1. Find the genes whose expression fits specific, predefined patterns.
2. Perform cluster analysis - see what expression patterns emerge.

Clustering procedure

Start with a set of genes exhibiting some minimal level of differential expression across the bulb; here ~650 were chosen from all 15 contrasts.

Carry out hierarchical clustering, building a dendrogram: Mahalanobis distance and Ward agglomeration (minimum variance) were used.

Now consider all clusters of 2 or more genes in the tree. Singles are added separately.

Measure the heterogeneity h of a cluster by calculating the 15 SDs across the cluster of each of the pairwise effects, and taking the largest.

Choose a score s (see plots) and take all maximal disjoint clusters with h < s. Here we used s = 0.46 and obtained 16 clusters.
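
The last two steps can be sketched as follows, assuming the dendrogram has already been built and is stored as nested pairs. The toy tree, the gene names and the three-contrast profiles (instead of 15) are made up, and using the sample SD is an assumption of this sketch.

```python
from statistics import stdev

def leaves(node):
    """Gene ids under a node (a leaf is a str, an internal node a 2-tuple)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)

def heterogeneity(genes, profiles):
    """h = the largest SD, taken across the genes in the cluster, over the contrasts."""
    columns = zip(*(profiles[g] for g in genes))
    return max(stdev(col) for col in columns)

def maximal_clusters(node, profiles, s):
    """Walk down from the root, keeping the largest cluster on each branch with h < s."""
    if isinstance(node, str):
        return []                      # singles are handled separately
    genes = leaves(node)
    if heterogeneity(genes, profiles) < s:
        return [genes]
    left, right = node
    return maximal_clusters(left, profiles, s) + maximal_clusters(right, profiles, s)

# Hypothetical profiles: one value per contrast for each gene.
profiles = {
    "g1": [1.0, 0.9, -0.2],
    "g2": [1.1, 1.0, -0.1],
    "g3": [-0.8, -0.9, 0.5],
    "g4": [-0.9, -1.0, 0.6],
    "g5": [0.0, 2.0, -2.0],
}
tree = ((("g1", "g2"), ("g3", "g4")), "g5")   # hypothetical dendrogram

clusters = maximal_clusters(tree, profiles, s=0.46)
print(clusters)
```

Stopping at the first node on each branch that satisfies h < s is what makes the selected clusters both maximal and disjoint.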

Plots guiding choice of clusters of genes

[Figure: number of clusters (patterns) and number of genes, each plotted against cluster heterogeneity h (max of 15 SDs).]

[Figure: one panel for each of the 15 pairwise effects (PA, DA, VA, MA, LA, DP, VP, MP, LP, VD, MD, LD, MV, LV, LM). Red: genes chosen. Blue: controls.]

The 16 groups systematically arranged (6 point representation)

Validation of Gene # 15,228 Expression Pattern by RNA In Situ Hybridization

[Figure: two in situ panels, gluR and #15,228, each labelled CTX, AOB and MOB.]

Gene 15,228: another in situ view

[Figure: in situ image labelled 384 (group 3), with orientation markers D, V, L and M.]

3-dimensional reconstruction from in situ data

[Figure: reconstructions for genes 15,228, 5,291, 8,496 and 384.]

Are the patterns we found real?

Here's how we attempted to show that the answer is a qualified yes.

Each cluster average (pattern) has a 'strength' we can measure by its root-mean-square (RMS). The n=16 clusters we found have an average RMS of av=0.3. Both n and av are strongly determined by our heterogeneity cut-off score of s=0.46.

Now consider randomizing the labels (e.g. P-A) on our hybridizations and repeating the entire analysis, keeping the cut-off score at 0.46. We typically get fewer, "weaker" patterns, with less contrast in the red-green patchwork. One such is on the next page.

500 independent random relabellings had a mean av value of 0.18, an SD of 0.07 and a max av value of 0.26, cf. 0.3 in our data. Our clusters are definitely 'non-random' in some sense.
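
The ingredients of this check can be sketched as below: the RMS strength of a pattern, a random relabelling of the hybridizations, and an empirical p-value comparing the observed av with the relabelled ones. The function names are hypothetical, the full re-analysis after each relabelling is not shown, and the null list is a placeholder that only respects the reported maximum of 0.26; it is not the actual set of 500 values.

```python
import math
import random

def rms(pattern):
    """Root-mean-square 'strength' of a cluster-average pattern."""
    return math.sqrt(sum(v * v for v in pattern) / len(pattern))

def average_rms(patterns):
    """av: the mean RMS over all cluster patterns."""
    return sum(rms(p) for p in patterns) / len(patterns)

def relabel(labels, rng):
    """Return a randomly permuted copy of the hybridization labels."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

def empirical_p(observed, null_values):
    """Share of relabellings at least as extreme, with the usual +1 correction."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (exceed + 1) / (len(null_values) + 1)

rng = random.Random(0)
labels = ["P-A", "D-A", "V-A", "M-A"]        # a few hypothetical slide labels
print(relabel(labels, rng))

# Placeholder null values: all we use is that none of the 500 exceeds 0.26.
null_avs = [0.26] * 500
p = empirical_p(0.3, null_avs)
print(round(p, 4))
```

Since no relabelling reached the observed av of 0.3, the empirical p-value is bounded by 1/501, which is one way to make "non-random in some sense" concrete.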

[Figure: cluster patterns from the real data (Real) and from one random relabelling (Random).]

Problem



We later tried all this with a different set of data, one which made use of reference mRNA, had generally lower S/N, and where the investigator sought fewer interesting patterns.

We found that the patterns the previous method discovered were similarly quite distinct in av values from those in randomly labelled hybs, but this time, the av values were 'significantly' lower than random. It all depends where you are on the curve.

Where next?



I feel that we need a new idea. The previous one doesn't seem to have worked. Or did it?

Just clustering and taking averages seems too easy...

But maybe clustering is all there is to patterns, once we have decided on the appropriate and context dependent profile to cluster, and selected the genes, but I keep wondering...

Some statistical research stimulated by microarray data analysis

Experimental design: Churchill & Kerr
Image analysis: Zuzan & West, ...
Data visualization: Carr et al
Estimation: Ideker et al, ...
Multiple testing: Westfall & Young, Storey, ...
Discriminant analysis: Golub et al, ...
Clustering: Hastie & Tibshirani, Van der Laan, Fridlyand & Dudoit, ...
Empirical Bayes: Efron et al, Newton et al, ...
Multiplicative models: Li & Wong
Multivariate analysis: Alter et al
Genetic networks: D'Haeseleer et al
and more

In closing: The pervasiveness of microarray technology and the statistical problems that go with it

Hybridization of target DNA or RNA to large numbers of probes attached to a solid support in a microarray format has a much wider applicability.

All such applications have their own statistical problems. Here are two relating to the previous lectures.

Meiosis data in which all exchanges are precisely located (from microarrays)

[Figure: yeast]

Figure courtesy of J Derisi

Exon Arrays can validate Exon Predictions and assemble Gene Structures

One or more Probes per Predicted Exon

[Figure: probes placed on two predicted exons.]

• Verify predicted exons on a genome-wide scale.
• Group exons into genes via co-regulation.

This and the next slide courtesy of Rosetta

Tiling arrays can identify exons and refine gene structures

[Figure: oligonucleotides 60 bp in length ("60-mers"), tiled in 10 bp steps across a region containing two predicted exons.]

Acknowledgments

Statistical collaborators
Yee Hwa Yang (Berkeley)
Sandrine Dudoit (Berkeley)
Ingrid Lönnstedt (Uppsala)
Natalie Thorne (WEHI)
Mauro Delorenzi (WEHI)

CSIRO Image Analysis Group
Michael Buckley
Ryan Lagerstorm

WEHI
Glenn Begley
Suzie Grant
Rob Good

PMCI
Chuang Fong Kong

Ngai Lab (Berkeley)
Cynthia Duggan
Jonathan Scolnick
Dave Lin
Vivian Peng
Percy Luu
Elva Diaz
John Ngai

LBNL
Matt Callow

RIKEN Genomic Sciences Center
Yasushi Okazaki
Yoshihide Hayashizaki

