GGG 298 – Lab 4
1) What is Bioconductor?
2) Bioconductor Packages
What is Bioconductor?
The default biocLite.R installation comes with the basic Bioconductor packages which allows you do
necessary processing of microarrays. You performed the necessary installations to add on the
To have a look at the bioconductor packages installed (and other CRAN packages too).
Packages in library '/usr/lib/R/library':
aCGH Classes and functions for Array Comparative
Genomic Hybridization data.
affy Methods for Affymetrix Oligonucleotide Arrays
affydata Affymetrix Data for Demonstration Purpose
affyio Tools for parsing Affymetrix data files
affyPLM Methods for fitting probe-level models
affyQCReport QC Report Generation for affyBatch objects
annaffy Annotation tools for Affymetrix biological
annotate Annotation for microarrays
AnnotationDbi Annotation Database Interface
base The R Base Package
beadarray Quality assessment and low-level analysis for
Illumina BeadArray data
beadarraySNP Normalization and reporting of Illumina SNP
Biobase Biobase: Base functions for Bioconductor
biomaRt Interface to BioMart databases (e.g. Ensembl,
Wormbase and Gramene)
The first column is the name of the library/package while the second column is the description of the
package. The power of R/Bioconductor lies in these packages. These packages are maintained by an
R-developer half-way around the globe and there's a lot of pride and hard-work that has gone into the
package. Most of R-developer community are scientists themselves and usually after publishing a
piece of work, ie a new algorithm or method, they would publish their package onto CRAN or
Nobody can remember all the packages installed in your system. So don't try to remember all
of them. There are thousands in the CRAN library and some are no longer maintained. The
thing about opensource is that if you like it, use it. If you find it absolutely useless, don't use
If you not found your favorite Bioconductor package in your library list but wish to have it there,
you can always do the following. The first line loads the biocLite R script from the bioconductor
website. The is the installation software for bioconductor. If you had used my biocInstall.R script
earlier, you would have noticed that it uses this same line to install the many Bioconductor packages
on your system.
source is a function for you to load R scripts (or tiny R programs). If you develop your own
programs, you can use source to load them into your workspace which can be used by you later.
The bioconductor team also made it a policy that every bioconductor package that is uploaded or
published is well documented via a pdf vignette. This is one of the key features of Bioconductor. We
will look through some vignettes available on bioconductor. They're slightly more user-friendly than
the help documents, with how-tos and nice examples. If fact, some of the examples from those
vigenettes are used in this lab.
Let's take a look at the Biobase vignette.
Please select a vignette:
1: Biobase - An introduction to Biobase and ExpressionSets
2: Biobase - Bioconductor Overview
3: Biobase - esApply Introduction
4: Biobase - Notes for eSet developers
5: Biobase - Notes for writing introductory 'how to' documents
6: Biobase - quick views of eSet instances
A drop down menu appears. Press 1[ENTER] to select Biobase – An introduction to Biobase and
ExpressionSets. If you are using windows, your default PDF viewer Adobe acrobat will start. For
linux users, your default pdf viewer will start. You can change your default pdf viewer by exporting
this variable into your environment.
Some packages have their own user guides which can be in more detail. For example
Or you can use the good old help to help you on the Bioconductor packages.
My usual workflow would be to look at the vignettes first. Vignettes offer a gloss overview
of the Bioconductor package. It's usually lighter material than the help but occasionally it
can be insufficient for customized analysis. It that happens, I turn to the help command. And
if it goes real bad and i'm completely stuck, I turn to the Bioconductor forum and ask the
Load a Bioconductor package you find interesting
Locate the Vignette for that package
SNP/CNVs (Single Nucleotide Polymorphisms / Copy Number Variations)
There are few Genotyping packages available to Bioconductor. Most of these packages are fairly
new, being released only last year or earlier.
The few packages which I think are worth looking into are V.J.Carey's 'GGBase', 'GGtools' and
Robert Scharpf's 'SNPchip'. V.J Carey has been with the Bioconductor project for quite some time
now, and I think he delivers very good packages.
Here are some outputs of SNPchip which are available in its Vignettes
Plots for first 3 chromosomes and a whole genome plot. This dataset has only 5500 probes, as it is a
sample dataset. A real Affy V6 SNP chip will be far denser and have over 500,000 datapoints. I
wouldn't recommend running these analysis on your desktop computer.
Output from GGtools
beadarraySNP is a Bioconductor package for the analysis of Illumina genotyping BeadArray data,
especially derived from the GoldenGate assay. The functionality includes importing dataﬁles
produced by Illumina software, doing LOH analysis and performing copy number analysis.
The package contains an artiﬁcially small example to show some of the key aspects of analysis.
Normally an experiment contains 96 samples spread over four probe panels of about 1500 SNPs
each. The example has data from 2 colorectal tumors and corresponding leukocyte DNA of 1 probe
(or OPA) panel. This particular OPA panel contains probes on chromosomes 16-20 and X and Y. In
this case all dataﬁles have been put in 1 directory. The import function read.SnpSetIllumina
can be invoked to use all separate directories that are used with the Illumina software.
We start by loading in the data into a SnpSetIllumina object. A samplesheet is used to identify
the samples in the experiment. This ﬁle can be created by the Illumina software. The next code shows
the example samplesheet. The import function searches for the line starting with [Data]. The ”Sample
Well”, ”Sample Plate”, and ”Sample Group” columns are not used, but the other columns should
have meaningful values. The ”Pool ID” and ”Sentrix ID” should be identical for all samples in 1
samplesheet. Samples from multiple experiments or multiple probe panels can be combined after
they are imported.
> datadir <-system.file("testdata", package = "beadarraySNP")
> readLines(paste(datadir, "4samples_opa4.csv", sep = "/"))
 "Investigator Name,Anon,,,,,"
 "Project Name,LOH,,,,,"
 "Experiment Name,LOH,,,,,"
> SNPdata <-read.SnpSetIllumina(paste(datadir, "4samples_opa4.csv",
+ sep = "/"), datadir)
SnpSetIllumina (storageMode: list)
assayData: 1453 features, 4 samples
element names: call, callProbability, G, R
sampleNames: 106NB, 106TV, 108NB, 108TV
varLabels and varMetadata description:
featureNames: rs935971, rs963598, ..., rs3093505 (1453 total)
fvarLabels and fvarMetadata description:
experimentData: use 'experimentData(object)'
The targets ﬁle contains extra information on the samples which are important during normalization.
The NorTum column indicates normal and tumor samples. The tumor samples are not used for the
invariant set in between sample normalization. The Gender column is used to properly normalize
the sex chromosomes. The following lines show a possible way to add these columns to the
phenoData slot of the object. If a NorTum and/or Gender column is in the phenoData slot
they will be automatically used later on.
> pd <-read.AnnotatedDataFrame(paste(datadir, "targets.txt", sep = "/"),
+ sep = "\t")
> pData(SNPdata) <-cbind(pData(SNPdata), pData(pd))
Illumina Sentrix arrays contain 96 wells to process up to 96 samples. Each well can be used with a
separate set of probes or OPA panel. The example shows the QC of the full experiment form which
the example ﬁles were taken. This experiment consisted of 24 samples with 4 probe panels each
> qc <-calculateQCarray(SNPdata)
> plotQC(QC.260, "greenMed")
Other types of plots (”intensityMed”,”greenMed”,”redMed”,”validn”,”annotation”, ”samples”) show
the median intensity, median red intensity or identifying information.
> par(mfrow = c(2, 2), mar = c(4, 2, 1, 1))
> reportSamplePanelQC(QC.260, by = 8)
> SNPdata <-removeLowQualitySamples(SNPdata, 1500, 100, "OPA")
GS0005704-OPA 4 subsamples, 0 removed
Based on these ﬁndings, low quality samples can be removed from the experiment
Normalization to calculate copy number is a multi-step process. In Oosting(2007) we have
determined that the following procedure provides the optimal strategy:
1. Perform quantile normalization between both colors of a sample. This is allowed because the
frequencies of both alleles throughout a sample are nearly identical in practice. This action
also neutralizes any dye bias.
2. Scale each sample using the median of the high quality heterozygous SNPs as the
normaliza¬tion factor. Genomic regions that show copy number alterations are likely to show
LOH(loss of heterozygosity), or are harder to genotype leading to a decrease quality score of
3. Scale each probe using the normal samples in the experiment. Assume that these samples
are diploid, and have a copy number of 2.
> SNPnrm <-normalizeBetweenAlleles.SNP(SNPdata)
> SNPnrm <-normalizeWithinArrays.SNP(SNPnrm, callscore = 0.8,
+ relative = TRUE, fixed = FALSE, quantilepersample = TRUE)
> SNPnrm <-normalizeLoci.SNP(SNPnrm, normalizeTo = 2)
Although the OPA panels contain distinct chromosomes, also a few spurious SNPs on other
chromosomes are in it. We ﬁrst select the probes that are located on the chromosomes for this OPA
> SNPnrm <-SNPnrm[featureData(SNPnrm)$CHR %in% c("4", "16", "17",
+ "18", "19", "20", "X", "Y"), ]
> reportSamplesSmoothCopyNumber(SNPnrm, normalizedTo = 2, smooth.lambda = 4)
A ﬁgure is created of all 4 samples in the dataset with copynumber along the chromosomes.
> reportSamplesSmoothCopyNumber(SNPnrm, normalizedTo = 2, paintCytobands = TRUE,
+ smooth.lambda = 4, organism = "hsa", sexChromosomes = TRUE)
By using the organism a ﬁgure is created that shows all chromosomes in 2 columns. Experimental
data is plotted wherever it is available in the dataset.
Oosting J, Lips EH, van Eijk R, Eilers PH, Szuhai K, Wijmenga C, Morreau H, van Wezel T. High-
resolution copy number analysis of paraﬃn-embedded archival tissue using SNP BeadArrays.
Genome Res. 2007 Mar;17(3):368-76. Epub 2007 Jan 31.
The version number of R and packages loaded for generating the vignette were:
• R version 2.10.0 (2009-10-26), x86_64-unknown-linux-gnu
• Locale: LC_CTYPE=en_US.UTF-8, LC_NUMERIC=C, LC_TIME=en_US.UTF-8,
LC_COLLATE=en_US.UTF-8, LC_MONETARY=C, LC_MESSAGES=en_US.UTF-8,
LC_PAPER=en_US.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C,
• Base packages: base, datasets, graphics, grDevices, grid, methods, stats, tools, utils
• Other packages: beadarraySNP 1.12.0, Biobase 2.6.0, limma 3.2.1, lodplot 1.1, quantreg 4.43,
quantsmooth 1.12.0, SparseM 0.80