Normalization exercise
DNA Microarray Data Analysis Workshop - Thailand
By H. Bjørn Nielsen
Strongly inspired by Laurent Gautier
1 Affymetrix Arrays
During this exercise, you will use data from oligonucleotide arrays sold by the
company Affymetrix (one of the primary manufacturers of oligonucleotide chips). The
way the data are stored is worth a look. It will help us understand what we are
manipulating and how to access what we want.
1.1 Affymetrix design
As you may have read too many times already, Affymetrix arrays consist of
short probes (oligonucleotides), with several probes related to one gene. Most
probe pair sets consist of 20 probe pairs.
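To make this layout concrete, here is a small sketch in base R of what one probe set could look like. It assumes that each probe pair holds one perfect match (PM) and one mismatch (MM) intensity. The column names and all values are made up for illustration and are not read from a real chip.

```r
# Simulated probe set: 20 probe pairs, each pair = one PM and one MM probe.
# All values are invented for illustration.
set.seed(42)
n_pairs <- 20
probe_set <- data.frame(
  pair = seq_len(n_pairs),
  pm   = round(rlnorm(n_pairs, meanlog = 7, sdlog = 0.4)),
  mm   = round(rlnorm(n_pairs, meanlog = 6, sdlog = 0.4))
)
head(probe_set, 3)

# One crude per-gene signal: the average PM - MM difference over the pairs
avg_diff <- mean(probe_set$pm - probe_set$mm)
avg_diff
```

The point is only that one gene is represented by many measurements, which have to be combined into a single value at some stage.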
1.2 Affymetrix files
The data you are about to use are stored in two different types of files, the CDF files
(standing for Chip Definition File) and the CEL files.
CDF files store information related to a specific type of oligonucleotide
array. All the arrays of a given type share this same information. This is an
important point, as the quantity of information in a CDF can be rather large.
It is stored in the package as a single cdf object, to which the corresponding CEL
files refer.
In most cases, the user of the R package affy does not have to worry about CDF data,
since the package takes care of this automatically.
The CEL file holds the intensity of the probes from a single GeneChip hybridization.
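The relationship between the two file types can be pictured with a toy example in base R. The objects cdf, cel_1 and cel_2 and the helper probes_for are hypothetical stand-ins invented for this sketch. Real CDF and CEL files hold far more information, and the affy package handles them for you.

```r
# Hypothetical stand-in for a CDF: probe set name -> positions of its probes
# on the chip. It is shared by all chips of the same type.
cdf <- list(
  geneA = c(1, 4, 7),
  geneB = c(2, 5, 8),
  geneC = c(3, 6, 9)
)

# Hypothetical stand-ins for two CEL files: one intensity per probe position
cel_1 <- c(120, 80, 300, 130, 95, 310, 110, 70, 290)
cel_2 <- c(240, 85, 150, 250, 90, 160, 230, 75, 140)

# The same CDF is used to pull out the probes of a gene from every chip
probes_for <- function(cel, cdf, gene) cel[cdf[[gene]]]
probes_for(cel_1, cdf, "geneA")
probes_for(cel_2, cdf, "geneA")
```

One description of the chip layout serves any number of hybridizations, which is why it only needs to be stored once.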
1.3 Loading the package
The affy package, dedicated to the analysis of oligonucleotide arrays, needs to be
loaded. It is a component of the Bioconductor project. If you restart your R session,
you will have to load it again.
Chunk Get_ready:
library(affy)
library(affydata)
And some more stuff we will need:
Chunk Get_ready2:
source("http://www.cbs.dtu.dk/courses/thaiworkshop/exercises/source.norm.r")
1.4 Loading data
In the course of this exercise, you will use two datasets.
First, you will use a subset of the data from a dilution series (enter
help(Dilution) if you want to know more).
Chunk Load_data:
data(Dilution)
affybatch <-Dilution
1.5 Observing the raw data
• Make an image of the data.
Chunk Image_chip:
image(affybatch[1], transfo = I)
You can also visualize log-transformed data:
Chunk Image_chip_log:
image(affybatch[1], transfo = log)
• What comment can you make about the transformation on the data?
• Do you think the picture is more detailed with a transformation?
It is usually wise to have a glance at the images before going further. Experimental
problems might be revealed and subsequent action concerning the data could be
taken.
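The effect of the transformation on an image can also be explored without a real chip. The sketch below uses a simulated 50 x 50 matrix of right-skewed intensities (an assumption, not the Dilution data) and a hypothetical helper, rel_pos, that reports where the median probe falls within the range of grey levels.

```r
set.seed(1)
# Simulated 50 x 50 chip with right-skewed intensities (made-up data)
chip <- matrix(rlnorm(2500, meanlog = 6, sdlog = 1.2), nrow = 50)

# Position of the median probe within the range of values (0 = min, 1 = max)
rel_pos <- function(m) (median(m) - min(m)) / (max(m) - min(m))
rel_pos(chip)       # close to 0: most probes share the darkest grey levels
rel_pos(log(chip))  # nearer the middle: grey levels are used more evenly

# Draw the two images only in an interactive session
if (interactive()) {
  opar <- par(mfrow = c(1, 2))
  image(chip, col = gray(0:64 / 64), main = "Raw")
  image(log(chip), col = gray(0:64 / 64), main = "Log")
  par(opar)
}
```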
The probe intensities stored in an AffyBatch can be accessed by the method
intensity. Scatterplots of probe intensities give an interesting view of your
data, as shown below. However, Affymetrix chips contain several hundred
thousand probes. This makes traditional scatterplots very heavy to display or print
and hard to read. To overcome the problem, we can use a hexagonal binning
plot. In brief, the plotting plane is divided into hexagons and the number of 'dots' that
fall in each hexagon is counted. The hexagons are then colored according to the
number of points they represent. The darker the hexagon, the more points it
represents.
Chunk Hexbin_plots:
opar <-par("mfrow")
par(mfrow = c(1, 2))
h <-hexbin(intensity(affybatch)[, 1],
intensity(affybatch)[,2])
plotHexbin(h, main = "Raw intensities")
[1] "done ‘grayscale’"
h <-hexbin(log(intensity(affybatch)[, 1]),
log(intensity(affybatch)[,2]))
plotHexbin(h, main = "Log transform intensities")
[1] "done ‘grayscale’"
par(mfrow = opar)
• Comments about the transformation on the data?
• Plot the intensities of different CEL files against each other.
• What observation can you make?
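For the curious, the counting step behind a hexagonal binning plot can be sketched in a few lines of base R. This is not the code used by hexbin or plotHexbin. The function hex_counts is a hypothetical helper written for this sketch, and the two "chips" are simulated. It relies on the fact that the centres of a regular hexagonal grid form two offset rectangular lattices, so each point is assigned to the nearer of its closest centre on each lattice.

```r
# Assign each point to the nearest centre of a regular hexagonal grid and
# count the points per hexagon. dx is the horizontal distance between centres.
hex_counts <- function(x, y, dx = 1) {
  dy <- sqrt(3) * dx
  ai <- round(x / dx); aj <- round(y / dy)      # nearest centre, lattice A
  bi <- floor(x / dx); bj <- floor(y / dy)      # nearest centre, lattice B
  ax <- ai * dx;         ay <- aj * dy
  bx <- (bi + 0.5) * dx; b_y <- (bj + 0.5) * dy
  use_a <- (x - ax)^2 + (y - ay)^2 <= (x - bx)^2 + (y - b_y)^2
  cell <- ifelse(use_a, paste("A", ai, aj), paste("B", bi, bj))
  n <- table(cell)
  first <- match(names(n), cell)                # one representative per cell
  data.frame(cx = ifelse(use_a, ax, bx)[first],
             cy = ifelse(use_a, ay, b_y)[first],
             n  = as.vector(n))
}

set.seed(1)
x <- rnorm(5000, mean = 8)            # simulated log intensities, chip 1
y <- x + rnorm(5000, sd = 0.5)        # simulated log intensities, chip 2
h <- hex_counts(x, y, dx = 0.25)
head(h[order(-h$n), ], 3)             # the most crowded hexagons

if (interactive()) {
  shade <- gray(1 - h$n / max(h$n))   # darker grey = more points
  plot(h$cx, h$cy, pch = 16, col = shade, xlab = "chip 1", ylab = "chip 2")
}
```

Plotting one symbol per hexagon instead of one per probe is what keeps the figure light, whatever the number of probes.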
The function pairs plots a matrix of scatterplots. We included a special variant of this
function for this course: pairs.custom. It works on objects of class AffyBatch and
displays hexbin plots on the lower triangular part of the plot (with the identity line in
green) and density estimates on the diagonal.
Run it and observe.
Chunk Pairs_raw:
pairs.custom(affybatch)
Compare this with the log-transformed pairs plot:
Chunk Pairs_log_transformed:
pairs.custom(affybatch,transfo=log)
A superposition of the density estimates for the intensities on each chip is also a very
helpful plot:
Chunk Density_plots:
opar<-par("mfrow")
par(mfrow=c(2,1))
hist(affybatch,main="Raw intensities",log=FALSE)
hist(affybatch,main="Log transformed intensities",log=TRUE)
par(mfrow=opar)
[Figure: density estimates of the probe intensities on each chip. Top panel: "Raw
intensities" (x axis: Intensity). Bottom panel: "Log transformed intensities" (x axis:
log intensity).]
• Comments?
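The idea of superposing density estimates can be tried on simulated data with base R alone. In the sketch below the three chips are made up, with the third one deliberately brighter. The density function from the stats package stands in for the hist method used on the AffyBatch above.

```r
set.seed(2)
# Three simulated chips; the third is brighter overall (made-up data)
chips <- cbind(chip1 = rlnorm(5000, 6.0, 1),
               chip2 = rlnorm(5000, 6.1, 1),
               chip3 = rlnorm(5000, 6.8, 1))

# One density estimate of the log intensities per chip
dens <- lapply(seq_len(ncol(chips)), function(j) density(log(chips[, j])))

# Location of each density peak; a shifted peak hints at a chip-wide effect
peaks <- sapply(dens, function(d) d$x[which.max(d$y)])
names(peaks) <- colnames(chips)
peaks

if (interactive()) {
  plot(dens[[1]], xlim = range(log(chips)), main = "Log intensities")
  for (j in 2:3) lines(dens[[j]], lty = j)
}
```

A curve that sits apart from the others, like the one for chip3 here, is the kind of feature normalization is meant to remove.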
The log transformation can sometimes be necessary. To create an instance of class
AffyBatch containing log-transformed probe intensities, one can do:
affybatch.log <- affybatch
intensity(affybatch.log) <- log(intensity(affybatch))
2.6 Normalization (Scaling)
The purpose of normalization is to adjust (or correct) a signal in order to make the
comparison with other signals more meaningful.
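The simplest adjustment of this kind is a scaling: every chip is multiplied by a constant so that its average intensity matches that of a reference chip. The base R sketch below illustrates the idea on simulated data. The function scale_to_reference is a hypothetical helper, not the implementation used in the affy package.

```r
set.seed(3)
# Two simulated chips measuring the same sample; chip2 is about 1.5x brighter
chip1 <- rlnorm(1000, meanlog = 6, sdlog = 1)
chip2 <- 1.5 * chip1 * exp(rnorm(1000, sd = 0.1))
x <- cbind(chip1, chip2)

# Scale every chip so that its mean matches the mean of a reference chip
scale_to_reference <- function(x, ref = 1) {
  factors <- mean(x[, ref]) / colMeans(x)
  sweep(x, 2, factors, `*`)
}

x_scaled <- scale_to_reference(x)
colMeans(x)         # the means differ before scaling
colMeans(x_scaled)  # the means agree after scaling
```

A single factor per chip can only correct an overall difference in brightness. It cannot correct a difference that changes with intensity, which is what the non-linear methods address.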
Qspline normalization
The normalization method 'Qspline' is one of the normalization methods available in
the affy package. If we decide to apply the normalization step to log-transformed
data:
Chunk Normalize_qspline:
affybatch.qsp <-normalize(affybatch.log,
method="qspline")
Chunk Pairs_norm_qspline:
pairs.custom(affybatch.qsp)
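To get a feeling for what a quantile-based spline normalization does, here is a simplified sketch in base R. It is not the Qspline algorithm as implemented in the affy package. It assumes simulated log intensities, uses the average quantile curve of all chips as the target, and fits one monotone interpolating spline per chip with splinefun.

```r
set.seed(4)
n <- 2001
base <- rnorm(n, mean = 8, sd = 1.5)          # simulated log intensities
x <- cbind(
  chip1 = base + rnorm(n, sd = 0.2),
  chip2 = 0.5 + 1.1 * base + rnorm(n, sd = 0.2),              # shifted, stretched
  chip3 = base - 0.02 * (base - 8)^2 + rnorm(n, sd = 0.2)     # curved
)

probs  <- seq(0, 1, length.out = 101)
q      <- apply(x, 2, quantile, probs = probs)  # quantiles of each chip
target <- rowMeans(q)                           # average quantile curve

# For each chip, a monotone spline maps its own quantiles onto the target
x_norm <- sapply(seq_len(ncol(x)), function(j) {
  f <- splinefun(q[, j], target, method = "hyman")
  f(x[, j])
})
colnames(x_norm) <- colnames(x)

apply(x, 2, median)       # the medians differ before normalization
apply(x_norm, 2, median)  # the medians agree after normalization
```

Because the correction is a curve rather than a single factor, it can remove differences between chips that depend on the intensity.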
A plot of the distribution of the probe intensities is a useful diagnostic plot:
Chunk Density_norm_qspline:
hist( affybatch.qsp, main = "Normalized by qspline")
[Figure: density estimates of the probe intensities, "Normalized by qspline" (x axis:
log intensity).]
• What observation can you make?
• Compare to the plot of non-normalized values (use the identity line drawn in green)
• How would you apply the normalization method 'constant'?
Complete data pre-processing
The normalization step is done at the probe level. Data analysts are often interested in
the expression values. The complete pre-processing of raw Affymetrix data into
expression values can be done using background correction, normalization, perfect
match correction and summary value computation. Several options exist for each step.
Should you need to pre-process Affymetrix data, you could do this:
eset <-expresso(affybatch, widget=TRUE)
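The four steps can also be walked through by hand on simulated data. The base R sketch below is only one possible combination of choices, all of them assumptions made for illustration: a low quantile as background, scaling to the first chip, perfect match values used as they are, and a median polish of the log2 values (medpolish from the stats package) as summary. It does not reproduce what expresso does internally.

```r
set.seed(5)
n_sets <- 5; n_probes <- 20
set_id <- rep(paste0("set", seq_len(n_sets)), each = n_probes)

# Simulated PM intensities: rows = probes, columns = chips (made-up data)
signal <- rep(c(100, 200, 400, 800, 1600), each = n_probes) *
  rlnorm(n_sets * n_probes, sdlog = 0.3)
pm <- sapply(c(1, 1.5, 0.8), function(k)       # chip-wide brightness factors
  k * signal * rlnorm(length(signal), sdlog = 0.1) + 40)  # plus background

# 1. Background correction: subtract a low quantile of each chip
bg <- sweep(pm, 2, apply(pm, 2, quantile, probs = 0.02), `-`)
bg <- pmax(bg, 1)                              # keep the values positive

# 2. Normalization: scale each chip to the mean of the first chip
nrm <- sweep(bg, 2, mean(bg[, 1]) / colMeans(bg), `*`)

# 3. Perfect match correction: here the PM values are used as they are

# 4. Summary: median polish of log2 values, one value per probe set and chip
eset <- t(sapply(split(seq_along(set_id), set_id), function(rows) {
  fit <- medpolish(log2(nrm[rows, ]), trace.iter = FALSE)
  fit$overall + fit$col
}))
round(eset, 2)
```

The result has one row per probe set and one column per chip, which is the shape of the expression values that later analyses work on.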
3 cDNA arrays
cDNA microarrays differ considerably from Affymetrix arrays, and the strategies to
normalize the data are different.
3.1 Load the package and data
The package marray addresses normalization for cDNA arrays. We will use the
dataset swirl. More information about it can be found using the help system.
library(marray)
data(swirl)
3.2 Observing your arrays
Once again, observing the spatial distribution of your intensities can reveal interesting
features (mainly artefacts):
maImage(swirl)
The function maImage has several options. You can see them using the help system,
and you may want to explore some of them.
3.3 Normalization
For cDNA array data the boxplot is often used.
maBoxplot(swirl)
You can normalize cDNA array data like this:
swirl.norm <-maNormMain(swirl)
You can observe the effect of normalization:
maBoxplot(swirl.norm)
Here again, there are different strategies to normalize, some being very
computationally intensive. The help files for maNormMain can tell you more.
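One common strategy for two-colour arrays is to remove an intensity-dependent trend from the log-ratios with a local regression. The base R sketch below illustrates it on simulated red (R) and green (G) spot intensities. The M and A notation is the usual convention for such data but is introduced here as an assumption, and the code uses loess from the stats package rather than the marray functions, so it may differ from what maNormMain does.

```r
set.seed(6)
n <- 2000
# Simulated spot intensities (made-up data): most genes are unchanged, and the
# red channel carries a dye bias that depends on intensity
true_log <- runif(n, 7, 14)
G <- 2^(true_log + rnorm(n, sd = 0.2))
R <- 2^(true_log + 0.8 - 0.05 * true_log + rnorm(n, sd = 0.2))

M <- log2(R / G)         # log-ratio
A <- 0.5 * log2(R * G)   # average log intensity

# Fit the intensity-dependent trend of M on A and subtract it
trend  <- loess(M ~ A, span = 0.4, degree = 1)
M_norm <- M - fitted(trend)

c(before = median(M), after = median(M_norm))

if (interactive()) {
  boxplot(list(raw = M, normalized = M_norm), ylab = "M")
}
```

After the correction the log-ratios are centred on zero, which is what the boxplots of normalized arrays should show when most genes are unchanged.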
4 Bibliography and links
• Laurent Gautier, Leslie Cope, Benjamin M. Bolstad and Rafael A. Irizarry. affy -
analysis of Affymetrix data at the probe level. Bioinformatics, 2003
• Robert Gentleman and Ross Ihaka. R: A language for data analysis and graphics.
Journal of Computational and Graphical Statistics, 5(3), 1996.
• Steen Knudsen. A Biologist’s Guide to Analysis of DNA Microarray Data. Wiley,
New York, 2002.
• C Workman, LJ Jensen, H Jarmer, R Berka, L Gautier, HB Nielsen, HH Saxild, C
Nielsen, S Brunak, and S Knudsen. A new non-linear normalization method to reduce
variability in DNA microarray experiments. Genome Biology, 2002.
• http://www.bioconductor.org/