Lecture2 by SUSB


									Metadata with Bioconductor

         Educational Materials
   ©2004–2007 R. Gentleman, W. Huber

                                       – p. 1/44
Integrative biology - fuse information from biological
research databases of many different types.
In this lecture we focus attention on resources that can
help to make use of metadata in different analyses.

                                                           – p. 2/44
     Alternative Strategies
The biomaRt package provides a set of tools that allow
you to access online databases such as Ensembl,
Vega, Uniprot, MSD, Wormbase.
We have begun development of a new set of
metadata packages that make use of a database
(SQLite) rather than R’s environments. This will result
in smaller size, slightly slower access, but much more
general queries.
Make use of the chip manufacturer’s resources (for
Affymetrix this is the NetAffx site).

                                                          – p. 3/44
              Per chip annotation
An early design decision was that we should provide metadata on a
per chip-type basis.
> library("hgu133a")
> ls("package:hgu133a")

 [1]   "hgu133a"                     "hgu133aACCNUM"
 [3]   "hgu133aCHR"                  "hgu133aCHRLENGTHS"
 [5]   "hgu133aCHRLOC"               "hgu133aENTREZID"
 [7]   "hgu133aENZYME"               "hgu133aENZYME2PROBE"
 [9]   "hgu133aGENENAME"             "hgu133aGO"
[11]   "hgu133aGO2ALLPROBES"         "hgu133aGO2PROBE"
[13]   "hgu133aLOCUSID"              "hgu133aMAP"
[15]   "hgu133aMAPCOUNTS"            "hgu133aOMIM"
[17]   "hgu133aORGANISM"             "hgu133aPATH"
[19]   "hgu133aPATH2PROBE"           "hgu133aPFAM"
[21]   "hgu133aPMID"                 "hgu133aPMID2PROBE"
[23]   "hgu133aPROSITE"              "hgu133aQC"
[25]   "hgu133aREFSEQ"               "hgu133aSUMFUNC"
[27]   "hgu133aSUMFUNC_DEPRECATED"   "hgu133aSYMBOL"
[29]   "hgu133aUNIGENE"
                                                                    – p. 4/44
       A brief description
These packages contain R environments, which are
used as hash tables.
For each package, data provenance information is
provided, e. g. ? hgu133a
Quality control information is available, e. g.
hgu133a(). This reports how many of each of the
different types of mappings were found.

                                                   – p. 5/44
 Accessing annotation packages
You can access the data directly using any of the standard subsetting
or extraction tools for environments: get, mget, $ and [[.
> get("201473_at", hgu133aSYMBOL)

[1] "JUNB"

> mget(c("201473_at","201476_s_at"), hgu133aSYMBOL)

$ 201473_at
[1] "JUNB"

$ 201476_s_at
[1] "RRM1"

> hgu133aSYMBOL$"201473_at"

[1] "JUNB"

> hgu133aSYMBOL[["201473_at"]]

[1] "JUNB"
                                                                        – p. 6/44
                   Metadata I
EntrezGene is a catalog of genetic loci that connects
    curated sequence information to official nomenclature.
    It replaced LocusLink.
UniGene defines sequence clusters. UniGene focuses on
   protein-coding genes of the nuclear genome
   (excluding rRNA and mitochondrial sequences).
RefSeq is a non-redundant set of transcripts and proteins
    of known genes for many species, including human,
    mouse and rat.
Enzyme Commission (EC) numbers are assigned to
   different enzymes and linked to genes through
InParanoid A collection of eukaryotic ortholog groups,
    similar to the COGS data base
                                                            – p. 7/44
                  Metadata II
Gene Ontology (GO) is a structured vocabulary of terms
   describing gene products according to molecular
   function, biological process, or cellular component
PubMed is a service of the U.S. National Library of
   Medicine. PubMed provides a rich resource of data
   and tools for papers in journals related to medicine
   and health. While large, the data source is not
   comprehensive, and not all papers have been

                                                          – p. 8/44
                  Metadata III
OMIM Online Mendelian Inheritance in Man is a catalog
  of human genes and genetic disorders.
NetAffx Affymetrix’ NetAffx Analysis Center provides
   annotation resources for Affymetrix GeneChip
KEGG Kyoto Encyclopedia of Genes and Genomes; a
   collection of data resources including a rich collection
   of pathway data.
IntAct Protein Interaction data, mainly derived from
Pfam Pfam is a large collection of multiple sequence
   alignments and hidden Markov models covering many
   common protein domains and families.

                                                              – p. 9/44
                Metadata IV
Chromosomal Location Genes are identified with
   chromosomes, and where appropriate with strand.
Data Archives The NCBI coordinates the Gene
   Expression Omnibus (GEO); TIGR provides the
   Resourcerer database, and the EBI runs

                                                     – p. 10/44
       Working with Metadata
Suppose we are interested in the gene BAD.
> gsyms <- unlist(as.list(hgu133aSYMBOL))
> whBAD <- grep("^BAD$", gsyms)
> gsyms[whBAD]

  1861_at 209364_at
    "BAD"     "BAD"

> hgu133aGENENAME$"1861_at"

[1] "BCL2-antagonist of cell death"

                                             – p. 11/44
                       BAD Pathways
Find the pathways that BAD is associated with.
> BADpath <- hgu133aPATH$"1861_at"
> kegg <- mget(BADpath, KEGGPATHID2NAME)
> unlist(kegg)

        "Neurodegenerative Disorders"
             "VEGF signaling pathway"
                     "Focal adhesion"
          "Insulin signaling pathway"
"Amyotrophic lateral sclerosis (ALS)"
                  "Colorectal cancer"
                  "Pancreatic cancer"
           "Chronic myeloid leukemia"
                                                 – p. 12/44
                       BAD Pathways
We can get the GeneChip probes and the unique EntrezGene loci in each of these
pathways. First, we obtain the Affymetrix IDs, and then, in the next slide we map them to
>   allProbes <- mget(BADpath, hgu133aPATH2PROBE)
>   str(allProbes)

List of 10
 $ 01510: chr   [1:81] "206679_at" "209462_at" "203382_s_at" "200602_at" ...
 $ 04210: chr   [1:161] "207163_s_at" "203808_at" "203809_s_at" "211453_s_at" ...
 $ 04370: chr   [1:135] "207163_s_at" "203808_at" "203809_s_at" "211453_s_at" ...
 $ 04510: chr   [1:414] "200801_x_at" "217211_at" "AFFX-HSAC07/X00351_3_at" "AFFX-
 $ 04910: chr   [1:251] "212186_at" "214358_at" "214584_x_at" "221928_at" ...
 $ 05030: chr   [1:41] "1861_at" "209364_at" "208478_s_at" "211833_s_at" ...
 $ 05210: chr   [1:168] "205209_at" "208218_s_at" "208219_at" "208222_at" ...
 $ 05212: chr   [1:156] "205209_at" "208218_s_at" "208219_at" "208222_at" ...
 $ 05218: chr   [1:136] "207163_s_at" "203808_at" "203809_s_at" "211453_s_at" ...
 $ 05220: chr   [1:163] "202123_s_at" "205209_at" "208218_s_at" "208219_at" ...

                                                                                        – p. 13/44
                      BAD Pathways
And then we can map these to their Entrez Gene values using the code below.
>   getEG = function(x) unique(unlist(mget(x, hgu133aENTREZID)))
>   allEG = sapply(allProbes, getEG)
>   str(allEG)

List of 10
 $ 01510: int   [1:37] 320 333 348 351 572 581 596 598 834 836 ...
 $ 04210: int   [1:81] 207 208 317 329 330 331 355 356 472 572 ...
 $ 04370: int   [1:66] 207 208 572 842 998 1432 3265 3791 3845 4772 ...
 $ 04510: int   [1:186] 60 70 71 81 87 88 89 207 208 329 ...
 $ 04910: int   [1:131] 31 32 207 208 369 572 673 801 805 808 ...
 $ 05030: int   [1:18] 572 581 596 598 847 2876 3735 4741 4744 4747 ...
 $ 05210: int   [1:82] 91 207 208 324 332 369 572 581 595 596 ...
 $ 05212: int   [1:72] 91 207 208 369 572 595 598 673 675 842 ...
 $ 05218: int   [1:67] 207 208 369 572 595 673 999 1019 1021 1026 ...
 $ 05220: int   [1:74] 25 91 207 208 369 572 595 598 613 673 ...

                                                                              – p. 14/44
           Annotating a Genome
Bioconductor also provides some comprehensive annotations for
whole genomes (e.g. S. cerevisae).
These packages are like the chip annotation packages, except a
different set of primary keys is used (e.g. for yeast we use the
systematic names such as YBL088C)
> library("YEAST")
> ls("package:YEAST")[1:12]

 [1]   "YEAST"                     "YEASTALIAS"
 [3]   "YEASTCHR"                  "YEASTCHRLENGTHS"
[11]   "YEASTGO"                   "YEASTGO2ALLPROBES"

                                                                   – p. 15/44
          The annotate package
    Functions for harvesting of curated persistent data sources
    functions for simple HTTP queries to web service providers
    interface code that provides common calling sequences for the
    assay based metadata packages such as getSEQ perform web
    queries to NCBI to extract the nucleotide sequence
    corresponding to a GenBank accession number.

> gsq <- getSEQ("M22490")

> substring(gsq,1,40)


M22490: mapped to locus HUMBMP2B; Human bone morphogenetic
protein-2B (BMP-2B) mRNA.

                                                                    – p. 16/44
              The annotate package
      other interface functions include getGO, getSYMBOL, getPMID, and getLL
      functions whose names start with pm work with lists of PubMed identifiers for
      journal articles.

> hgu133aSYMBOL$"209905_at"

[1] "HOXA9"

> pm.getabst("209905_at", "hgu133a")

$ 209905_at
$ 209905_at [[1]]
An object of class ³pubMedAbst³:
Title: Vertebrate homeobox gene nomenclature.
PMID: 1358459
Authors: MP Scott
Journal: Cell
Date: Nov 1992

$ 209905_at [[2]]
An object of class ³pubMedAbst³:
Title: Nomenclature for human homeobox genes.
PMID: 1973146
Authors: PJ McAlpine, TB Shows
Journal: Genomics
                                                                                     – p. 17/44
        Working with GO
An ontology is a structured vocabulary that
characterizes some conceptual domain.
The Gene Ontology (GO) Consortium defines three
ontologies characterizing aspects of knowledge about
genes and gene products.
These ontologies are
   molecular function (MF),
   biological process (BP)
   cellular component (CC).

                                                       – p. 18/44
molecular function of a gene product is what it does at the
biochemical level. This describes what the gene product
can do, but without reference to where or when this activity
actually occurs. Examples of functional terms include
“enzyme," “transporter," or “ligand."
biological process is a biological objective to which the
gene product contributes. There is often a temporal aspect
to a biological process. Biological processes usually
involve the transformation of a physical thing. The terms
“DNA replication” or “signal transduction” describe general
biological processes.
cellular component is a part of a cell that is a component of
some larger object or structure. Examples of cellular
components include “chromosome”, “nucleus” and

                                                            – p. 19/44
      GO Characteristics
                 Number of Terms
            BP             13155
            CC              1864
            MF              7527

Table 1: Number of GO terms per ontology.

                                            – p. 20/44
                 GO hierarchy
GO terms can be linked by two relationships:
    is a: class-subclass relationship, for example, nuclear
    chromosome is a chromosome
    part of: C part of D means that when C is present, it is
    a part of D, but C does not always have to be present.
    For example, nucleus is part of cell.
The ontologies are structured as directed acyclic graphs.
DAGs are similar to hierarchies but a child term can have
multiple parent terms. For example, the biological process
term hexose biosynthesis has two parents, hexose
metabolism and monosaccharide biosynthesis.

                                                              – p. 21/44
             Working with GO
For precision and conciseness, all indexing of GO
resources employs 7-digit tags with prefix GO: for example
GO:0008094. Three basic tasks that are commonly
performed in conjunction with GO are
    navigating the hierarchy, determining parents and
    children of selected terms, and deriving subgraphs of
    the overall DAG constituting GO;
    resolving the mapping from GO tag to natural
    language characterizations of function, location, or
    resolving the mapping between GO tags or terms and
    elements of catalogs of genes or gene products.

                                                            – p. 22/44
   Navigating the hierarchy
Finding parents and children of different terms is handled by
using the PARENT and CHILDREN mappings.
To find the children of "GO:0008094" we use:
> get("GO:0008094", GOMFCHILDREN)
         isa          isa          isa          isa
"GO:0003689" "GO:0004003" "GO:0015616" "GO:0043142"
We use the term offspring to refer to all descendants (children,
grandchildren, and so on) of a node.
Similarly we use the term ancestor to refer to the parents,
grandparents, and so on, of a node.
> get("GO:0008094", GOMFOFFSPRING)
[1] "GO:0003689" "GO:0004003" "GO:0015616" "GO:0017116"
[5] "GO:0043140" "GO:0043141" "GO:0043142"

                                                                   – p. 23/44
                    GO terms
All GO terms are provided in the GOTERM environment.
> GOTERM$"GO:0002021"

GOID = GO:0002021

Term = response to dietary excess

Definition = The physiological process by which
     dietary excess is sensed by the central nervous
     system and results in a reduction in food intake
     and increased energy expenditure.

Ontology = BP

                                                       – p. 24/44
              Searching for terms
Let’s search for terms containing the word chromosome using eapply
and grep. We us the value=TRUE option to grep to retrieve the text. If
there is no match then a zero length character vector is returned.

> chrterms = eapply(GOTERM,
+     function(x) grep("chromosome", Term(x), value=TRUE))
> chrterms[[1]]


Now we can find those that have positive length.

> chrT2 = chrterms[sapply(chrterms, length) > 0]
> chrT2[[1]]

[1] "mitochondrial chromosome"

                                                                         – p. 25/44
               Evidence Codes
The mapping of genes to GO terms is carried out by GOA,
a project run by the European Bioinformatics Institute that
aims to provide assignments of gene products to GO
Four environments in the GO package address the
association between EntrezGene sequence entries and
GO terms:

 GOENTREZID           GO → EntrezGene ID (non-redundant)
 GOALLENTREZID        GO → EntrezGene ID (incl. implied)
 GOENTREZID2GO        EntrezGene → GO term
 GOENTREZID2ALLGO     EntrezGene → GO term

                                                              – p. 26/44
                Evidence Codes
             GO Definition
 [1,]   "IMP"          "inferred from mutant phenotype"
 [2,]   "IGI"          "inferred from genetic interaction"
 [3,]   "IPI"          "inferred from physical interaction"
 [4,]   "ISS"          "inferred from sequence similarity "
 [5,]   "IDA"          "inferred from direct assay"
 [6,]   "IEP"          "inferred from expression pattern"
 [7,]   "IEA"          "inferred from electronic annotation"
 [8,]   "TAS"          "traceable author statement"
 [9,]   "NAS"          "non-traceable author statement"
[10,]   "ND"           "no biological data available"
[11,]   "IC"           "inferred by curator"

                                                               – p. 27/44
             GO Evidence Codes
Find the GO identifier for “transcription factor binding” and use that to
get all EntrezGene IDs with that annotation. We then summarize them,
to show how many are due to different evidence codes.
>   uterms = unlist(eapply(GOTERM, Term))
>   tfb <- names(which(uterms=="transcription factor binding"))
>   gg1 <- get(tfb, GOENTREZID)
>   table(names(gg1))

 26 89    1 48 90     5        1 40

                                                                       – p. 28/44
Consider the gene with EntrezGene ID 7355, SLC35A2
> z <- get("7355", GOENTREZID2GO)
> length(z)

[1] 10

> sapply(z, "[[" , "Ontology")

GO:0000139 GO:0005338 GO:0005351 GO:0005459 GO:0006012
      "CC"       "MF"       "MF"       "MF"       "BP"
GO:0008643 GO:0015780 GO:0015785 GO:0016020 GO:0016021
      "BP"       "BP"       "BP"       "CC"       "CC"

there are 10 different GO terms. Note that these are only the most
specific terms this gene maps to, the other less specific terms can be
obtained using the child-parent relationships in GO.
We get those from the BP ontology by using the helper function
> getOntology(z, "BP")

[1] "GO:0006012" "GO:0008643" "GO:0015780" "GO:0015785"
                                                                       – p. 29/44
                Evidence Codes
We get the evidence codes using getEvidence and can drop codes
using dropEcode:
> getEvidence(z)

GO:0000139 GO:0005338 GO:0005351 GO:0005459 GO:0006012
     "IEA"      "IEA"      "IEA"      "TAS"      "TAS"
GO:0008643 GO:0015780 GO:0015785 GO:0016020 GO:0016021
     "IEA"      "IEA"      "TAS"      "IEA"      "IEA"

> zz <- dropECode(z, code="IEA")
> getEvidence(zz)

GO:0005459 GO:0006012 GO:0015785
     "TAS"      "TAS"      "TAS"

                                                                 – p. 30/44
                              GO graphs
For any set of selected genes, and any of the three GO ontologies the induced GO graph
is the set of GO terms that the genes are associated with, together with all less specific
The term “transcription factor activity” is in the molecular function (MF) ontology and has
the GO label GO:0003700

> library("GO")
> library("GOstats")

> GOTERM$"GO:0003700"

GOID = GO:0003700

Term = transcription factor activity

Synonym = GO:0000130

Secondary = GO:0000130

Definition = The function of binding to a specific DNA
     sequence in order to modulate transcription. The
     transcription factor may or may not also interact
     selectively with a protein or macromolecular

Ontology = MF                                                                             – p. 31/44
              Induced GO graph
The induced graph, based on the MF hierarchy, can be produced using
the GOGraph function of the package GOstats
> tfG = GOGraph("GO:0003700", GOMFPARENTS)
We can plot the induced GO graph using Rgraphviz and the code below.
>   library("Rgraphviz")
>   tfG = removeNode("all", tfG)
>   mt = match(nodes(tfG), names(uterms))
>   stopifnot(!any(is.na(mt)))
>   nattr = makeNodeAttrs(tfG,
+      label     = uterms[mt], shape      = "ellipse",
+      fillcolor = "#f2f2f2", fixedsize = FALSE)

> plot(tfG, nodeAttrs=nattr)

                                                                   – p. 32/44
                      GO graph
                         transcription factor activity

                  DNA binding

          nucleic acid binding             transcription regulator activity



GO relationships for term “transcription factor activity”.

                                                                              – p. 33/44
               Induced GO graphs
> tfch <- GOMFCHILDREN$"GO:0003700"


> tfchild <- mget(tfch, GOTERM)

$ GO:0003705
GOID = GO:0003705

Term = RNA polymerase II transcription factor
     activity, enhancer binding

Definition = Functions to initiate or regulate RNA
     polymerase II transcription by binding an
     enhancer region of DNA.

Ontology = MF

                                                     – p. 34/44
KEGG provides mappings from genes to pathways
We provide these in the package KEGG, you can also
query the site directly using KEGGSOAP or other
One problem with the KEGG is that the data is not in a
form that is amenable to computation. The cMAP
project provides data that is somewhat more useful for
constructing networks.

                                                     – p. 35/44
        Data in KEGG package
KEGGEXTID2PATHID provides mapping from either
   EntrezGene (for human, mouse and rat) or Open
   Reading Frame (yeast) to KEGG pathway ID.
   KEGGPATHID2EXTID contains the mapping in the other
KEGGPATHID2NAME provides mapping from KEGG
   pathway ID to a textual description of the pathway.
   Only the numeric part of the KEGG pathway
   identifiers is used (not the three letter species codes)

                                                             – p. 36/44
         Counts per species
         ath dme   eat hsa mmu rno    sce
Counts   115 116   85 191   185 177    99

  Table 2: Pathway Counts Per Species

                                            – p. 37/44
                 Exploring KEGG
Consider pathway 00362.

[1] "Benzoate degradation via hydroxylation"

Species specific mapping from pathway to genes is indicated by
glueing together three letter species code, e. g. texttthsa, and numeric
pathway code.

[1] "10449" "1891"      "30"      "3032"    "59344" "83875"


[1] "YIL160C" "YKR009C"

                                                                           – p. 38/44
                Exploring KEGG
PAK1 has EntrezGene ID 5058 in humans

[1] "hsa04010" "hsa04360" "hsa04510" "hsa04650" "hsa04660"
[6] "hsa04810" "hsa05120"


[1] "MAPK signaling pathway"

We find that it is involved in 300 pathways. For mice, the MAPK
signaling pathway contains
> mm <- KEGGPATHID2EXTID$mmu04010
> str(mm)

 chr [1:252] "102626" "109689" "109880" "109905" ...

                                                                 – p. 39/44
Two genes are said to be homologous if they have
descended from a common ancestral DNA sequence.
There is one homology package for each species; a
three letter species name (e. g. hsa) and suffix
The current system is going to be changed and
improved. For the time being, one possible alternative
is described in the biomaRt lecture.

                                                     – p. 40/44
Sequence information is becoming widely available
and can be used for a variety of purposes.
The Biostrings package provides a number of useful
string handling and searching functions.
It provides tools to read FASTA files, to carry pattern
matching (the matchPattern function), and has an
implemenation of the Needleman-Wunsch global
alignment algorithm.

                                                         – p. 41/44
Protein-protein interactions
There are a variety of resources for protein interaction data.
We are currently using IntAct (http://www.ebi.ac.uk/intact/)
There is a package, Rintact that can retrieve and work with data
sets from this repository.

                                                                   – p. 42/44
     Protein-protein interactions
The package ppiData provides datasets for all recent HT protein
interaction experiments using Y2H or AP-MS technologies.
> library("ppiData")
> data("Gavin2006BPGraph")
> Gavin2006BPGraph

A graphNEL graph with directed edges
Number of Nodes = 2551
Number of Edges = 19105

> nodes(Gavin2006BPGraph)[1:7]

[1] "YBR238C" "YLR274W" "YDR460W" "YGL240W" "YPR189W"
[6] "YLR347C" "YBR167C"

> edgeL(Gavin2006BPGraph)[[1]]

 [1] 161 843 1032 1356 1394 1753 1754 1755 1756 1757 1758
[12] 1759 1760 1761 1762 1763
                                                                  – p. 43/44
R version 2.6.0 Under development (unstable) (2007-06-12 r41918),
Locale: C
Base packages: base, datasets, grDevices, graphics, methods, splines, stats,
tools, utils
Other packages: AnnotationDbi 0.0.78, Biobase 1.15.15, Category 2.3.14,
DBI 0.2-3, GO 1.17.0, GOstats 2.3.6, KEGG 1.17.0, RBGL 1.13.2,
RSQLite 0.5-4, Rgraphviz 1.15.5, XML 1.9-0, YEAST 1.17.0, annotate 1.15.2,
cMAP 1.15.1, genefilter 1.15.6, geneplotter 1.15.1, graph 1.15.4, hgu133a 1.17.0,
lattice 0.15-8, ppiData 0.1.7, survival 2.31, xtable 1.4-6
Loaded via a namespace (and not attached): KernSmooth 2.22-20,
RColorBrewer 0.2-3, cluster 1.11.7, grid 2.6.0

                                                                               – p. 44/44

To top