
# Catch-up

Access to scripts, sample data, etc.:

http://www.ece.ualberta.ca/~markw/
Perl Unit

Regular Expressions
(see notes from Boris Steipe)
To Start:
review extraction of text from a file in Perl

http://www.ece.ualberta.ca/~markw/data_files/myGenbankFile.gb

open (IN, "./myGenbankFile.gb") or die "can't open file $!\n";

my $line = <IN>; # read a line from that file

# do something with the line here

print $line;

close IN;
Prints the first line of the GenBank file to the screen

• The code opens this file and reads in
the first line, then prints that line to the screen.
Using what you know about regular
expressions:
1.   Match the word “LOCUS”
2.   Does a word in that line start with “C”? Print “yes” if true.
3.   Match the year at the end of the line
4.   Match the month at the end of the line
5.   How many amino acids are in the record?
– Print that number out to the screen
6. Match the date, and print it out as “MAR 21, 1995”
7. Print each element in that line on its own line
– E.g. LOCUS
CAA28783
425
aa
linear
8. Get the definition line from the file and print only the definition
9. ***** Get the Title line (this is NOT in a predictable location, and is on more
than one line!) and print out the title without extra spaces.
10. ***** Get the author line and print out the authors in the following format:
MJ Radeke, TP Misko, C Hsu, LA Herzenberg….
Solutions

1.  $line =~ m/LOCUS/;

2.  if ($line =~ m/\bC/){
        print "yes";
    }
Solutions

3.  $line =~ m/\d\d\d\d$/;

4.  $line =~ m/\-\w+\-/;
Solutions

5.  $line =~ m/(\d+)\saa/;
    print "$1\n";

6.  $line =~ m/(\d+)\-(\w+)\-(\d\d\d\d)/;
    print "$2 $1, $3\n";
Solutions

7.  while ($line =~ m/(\S+)/g){
        print "$1\n";
    }

8.  $line = <IN>;
    $line = <IN>;
    $line =~ /DEFINITION\s+(.*)/;
    print "$1\n";
Solutions

9.  while (!($line =~ /TITLE/)){
        $line = <IN>;
    }
    $line =~ /TITLE\s+(.*)/;
    $title = $1;
    $line = <IN>;
    while (!($line =~ /JOURNAL/)){
        $title = "$title $line";
        $line = <IN>;
    }
    $title =~ s/\s\s+/ /g; # tricky!
    print "the title is $title\n";
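The "tricky" substitution above is worth a closer look. Here is a minimal, self-contained sketch of what it does, using an invented two-line title string (the sample text is an assumption, not a real record):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up title spanning two lines, the way multi-line GenBank
# fields do: the continuation line carries leading padding spaces.
my $title = "Gene structure and\n            expression of a title";

# Any run of two or more whitespace characters (including the newline
# plus the padding) collapses to a single space.
$title =~ s/\s\s+/ /g;

print "$title\n";
```

Single spaces between words are untouched, because the pattern requires at least two consecutive whitespace characters.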
Solutions

10. while (!($line =~ /AUTHORS/)){
        $line = <IN>;
    }
    $line =~ /AUTHORS\s+(.*)/;
    my $authors = $1;
    # /g is needed here, or the loop would re-match the same
    # author forever
    while ($authors =~ /(\w+),(\w)\.(\w)?\.?/g){
        $surname = $1;
        $initial1 = $2;
        $initial2 = $3;
        $initial2 = "" unless $initial2;
        print "$initial1$initial2 $surname, ";
    }
Solutions
(alternative solution to #10)

while (!($line =~ /AUTHORS/)){
    $line = <IN>;
}
$line =~ /AUTHORS\s+(.*)/;
my $authors = $1;
while ($authors =~ /\s+(\S+)\,(\w)[\.\,](\w)?/g){
    $surname = $1;
    $initial1 = $2;
    $initial2 = $3;
    $initial2 = "" unless $initial2;
    print "$initial1$initial2 $surname, ";
}
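Solution 10 can be tried without a GenBank file at all. Below is a self-contained sketch that runs its regular expression against a hard-coded AUTHORS string (the names come from the exercise, but the exact line format is an assumption). Note the /g modifier: without it the while loop would re-match the first author forever.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical AUTHORS field content, in "Surname,I.J." style
my $authors = 'Radeke,M.J., Misko,T.P., Hsu,C. and Herzenberg,L.A.';

my @formatted;
# /g makes each iteration resume where the previous match ended
while ($authors =~ /(\w+),(\w)\.(\w)?\.?/g) {
    my $surname  = $1;
    my $initial1 = $2;
    my $initial2 = defined $3 ? $3 : '';
    push @formatted, "$initial1$initial2 $surname";
}

my $out = join(', ', @formatted);
print "$out\n";   # prints "MJ Radeke, TP Misko, C Hsu, LA Herzenberg"
```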
End of day

have fun in the city!
Unix Unit
CPAN: The
Comprehensive Perl
Archive Network
& Makefiles
CPAN – Comprehensive Perl Archive
• This is where you go to download and install most Perl-based
software libraries

• To search for stuff, go to: http://search.cpan.org
– e.g. Search for “LWP”
– Click the first link and have a quick look at the documentation
to see what the “LWP” library does.
• Click on the “an example” link at the top of the page and
stay there as we will use that example in a moment...

– Search for Boulder::Genbank
• Make a mental note of what that module does… we'll
use it shortly.
Installing from CPAN
• There is a program that allows you easy access to
CPAN installations

• Unfortunately, this wants to install files by default
into a location that you probably do not have
permissions to access!!

• Therefore we have to do some re-configuration to
change this default behaviour.

• You can do this on ANY *nix machine you use
Creating a personal
CPAN Config.pm
First you need to specify a location to store the CPAN tools
$ cd ~                        change to your own home folder
$ mkdir perl                  create a new folder called "perl"

Then start CPAN, using the command line as follows:

$   export FTP_PASSIVE=1      this is not necessary at U of A
$   perl -MCPAN -e shell
Creating a personal CPAN
Config.pm
The first time CPAN runs it will ask you a lot of questions.
Just accept the defaults UNTIL you get to this one:
Parameters for the 'perl Makefile.PL' command?
Typical frequently used settings:

PREFIX=~/perl   non-root users (please see manual for more hints)

Your choice:   [INSTALLDIRS=site] PREFIX=~/perl...

Type "PREFIX=~/perl"
This will tell CPAN to install
your modules in the ~/perl folder
that you just created
Creating a personal CPAN Config.pm
And there’s one more series that you have to answer:
(1) Africa
(2) Asia
(3) Central America
(4) Europe
(5) North America
(6) Oceania
(7) South America
Select your continent (or several nearby continents) [] 5

(1) Bahamas
(2) Canada
(3) Mexico
(4) United States
Select your country (or several nearby countries) [] 2

(1) ftp://CPAN.mirror.rafal.ca/pub/CPAN/
(2) ftp://cpan.sunsite.ualberta.ca/pub/CPAN/
(3) ftp://ftp.nrc.ca/pub/CPAN/
(4) ftp://mirror.arcticnetwork.ca/pub/CPAN
(5) ftp://theoryx5.uwinnipeg.ca/pub/CPAN/
Select as many URLs as you like (by number),
put them on one line, separated by blanks, e.g. '1 4 5' [] 2 3 4 5 1
Editing your PERL5LIB environment

Now you also have to tell Perl that it should look in this folder

You only need to do this once – it will now be set every time you login

$ pico .profile    (note this is "dot profile", your startup 'profile' in bash)

add the following lines to your .profile file

export PERL5LIB=$PERL5LIB:$PWD/perl/share/perl/5.8.7/
export FTP_PASSIVE=1

now save the file and exit

NOW CLOSE ALL YOUR TERMINAL WINDOWS AND START THEM AGAIN!
Finally, we are ready to install stuff!
• Now CPAN knows where it should put things, and Perl
knows where to look for them...
• The command to start the CPAN shell is:
$ perl -MCPAN -e shell

now try installing something…

cpan> install Boulder::Genbank

CPAN installs often ask you questions as they go along... unless you know
better you should ALWAYS just accept the default (hit “enter”)
Watch for error messages...
• You may see an error message reported
at the end of an installation
Failed 2/28 test scripts, 92.86% okay. 46/388 subtests failed, 88.14%
okay.
*** Error code 29
make: Fatal error: Command failed for target `test'
/usr/ccs/bin/make test -- NOT OK
Running make install
make test had returned bad status, won't install without force

So... now what?? Well, unless you really know what
you are doing, you have no choice but to use force!

cpan> force install Boulder::Genbank

Don't worry, this is safer than it sounds ;-)
Congratulations, you have just installed your first
CPAN module

• If you want to look at what got installed browse to the folder:

~/perl/lib/perl5/site_perl/5.6.2

• If all went well, you should now be able to execute the
following command without error:

$ perl -e 'use Boulder::Genbank'

IF THAT COMMAND GENERATES AN ERROR
TELL ME NOW!!!
Break!
• After the break we are going to talk about
object oriented perl...
Perl Unit

Object Oriented Perl
Perl Unit: Object Oriented Perl
• The two most common programming “paradigms” are Procedural,
and Object-Oriented

• Procedural programming is most commonly used in simple scripts
where all data and data manipulations are achieved in a single
program with all variables “visible” and all subroutines present in
the same program... can get messy! (this is what you have been
doing so far)

• Object Oriented programming allows you to collect together a set
of data as well as the subroutines that may act on it into an
“Object” and hide them inside of a “magic variable”...
Perl Unit: Object Oriented Perl

• What installed from CPAN was a “module” (actually a
set of related modules)
– a module could be imagined to be an 'empty' object,
or an object 'template'.

• To use an object, you must instantiate it, and fill it with
data

• In Perl, this is most often accomplished using the “new”
command.
Perl Unit: Object Oriented Perl
#!/usr/bin/perl -w

use SomeObject; # first we need to load the module
my $x = new SomeObject(); # then create an "instance"

OR

my $x = SomeObject->new(); # -> means "do this to it"

$x now contains an "instance" of SomeObject!
A commonly used paradigm
#!/usr/bin/perl -w

use SomeObject;
my $x = SomeObject->new(
    'attribute1' => 'value1',
    'attribute2' => 'value2',
);

$x has now been created, and initialized with "value1"
for attribute1, and "value2" for attribute2
Invoking “methods” of the object
#!/usr/bin/perl -w

use SomeObject;
my $x = SomeObject->new(
    'attribute1' => 'value1'
);

# ask the object to do something with that data
print $x->getValue('attribute1');
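To demystify what new and getValue might be doing behind the scenes, here is a minimal sketch of such a class written in plain Perl. SomeObject and getValue are the illustrative names from the slides, not a real CPAN module, and a real module would normally live in its own SomeObject.pm file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package SomeObject;

sub new {
    my ($class, %args) = @_;       # capture 'attribute' => 'value' pairs
    my $self = { %args };          # store them in a hash reference
    return bless $self, $class;    # bless the hashref into the class
}

sub getValue {
    my ($self, $attribute) = @_;
    return $self->{$attribute};    # look the attribute up again
}

package main;

my $x = SomeObject->new( 'attribute1' => 'value1' );
print $x->getValue('attribute1'), "\n";   # prints "value1"
```

The "magic variable" is just a blessed hash reference: bless tags it with a class name so that method calls like $x->getValue know which package to look in.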
Try your LWP installation
• You should still have the LWP documentation visible
– If not, you can type 'perldoc LWP' at your
command prompt to get the same information

$ perldoc LWP

• Copy/paste the LWP example code into a new file
on your system called “test_lwp.pl”

• Edit the file to connect to some website of
interest... for example, cnn.com, as follows...
The code may be simplified to this:
# Create a user agent object
use LWP::UserAgent;
$ua = LWP::UserAgent->new;

# Create a request
my $req = HTTP::Request->new(GET => 'http://www.cnn.com');

# Pass request to the user agent and get a Response Object back
my $res = $ua->request($req);

# Check the outcome of the response
if ($res->is_success) {
    print $res->content;
}
else {
    print $res->status_line, "\n";
}
***Script: LWP Example
Now you can retrieve web pages without a browser!
You're FREE!
Explanation

# Create a user agent object
use LWP::UserAgent;
$ua = LWP::UserAgent->new;          # create a new UserAgent object

# Create a request
my $req = HTTP::Request->new(GET => 'http://www.cnn.com');   # create a new Request object

# Pass the Request object as an argument to the 'request' method
# of the UserAgent to get a "Response" object in return
my $res = $ua->request($req);

# Check what the 'is_success' method returns (T/F)
if ($res->is_success) {
    print $res->content;            # get the content of the Response object
}
else {
    print $res->status_line, "\n";
}
Exercises: Connecting
Regular Expressions and LWP
1.    Write a script that uses LWP to retrieve a web page from any
address you specify at the command line
– Hint: $arg = <STDIN>; chomp $arg;

2.    Write a script that uses LWP to get the function ontology
from GO, then look through it line by line to find the GO Term
associated with a GO id entered by the user

function ontology is here:
http://www.geneontology.org/ontology/function.ontology
e.g. $ perl test_getGOTerm.pl GO:0033041

Should return “sweet taste receptor activity”

…This will require you to write a clever regular expression!
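For exercise 2, the "clever" regular expression can be prototyped against a single hard-coded line before any LWP code is written. The sketch below assumes the flat ontology file marks terms with a leading %, <, or $ and a trailing " ; GO:nnnnnnn"; the sample line is invented, so check the real file's format before relying on this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $goid = 'GO:0033041';                                   # as if from @ARGV
my $line = ' %sweet taste receptor activity ; GO:0033041'; # invented sample

my $term;
# [%<\$] allows any of the assumed line-prefix characters;
# \Q...\E quotes the ':' in the GO id so it matches literally
if ($line =~ /[%<\$]\s*([^;]+?)\s*;.*\Q$goid\E/) {
    $term = $1;
}

print "$term\n" if defined $term;   # prints "sweet taste receptor activity"
```

In the real script the if-test would sit inside a loop over every line of the downloaded file, stopping at the first match.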
• The same methodology can be used to
“scrape” information from ANY web page!
–Try writing some “screen-scrapers”
tonight for yourself
–It's far easier than cut-n-paste!

• Screen-scraping is an awful (but
common!) way to do bioinformatics...

• BEWARE OF SCRAPING GENBANK!
Exercise: Replacing & Enhancing
Regular Expressions with OO Modules
• A few minutes ago we installed Boulder::Genbank
• This takes at least some of the “pain” out of building
Regular Expressions for parsing GenBank (and other!)
records.
• Examine the Boulder Genbank Parser code I have made
available to you. ***Script: Boulder Genbank Parser

• Exercise: combine and edit
(a) your regular expression code that modified the
author-name/initial format, with
(b) a piece of Boulder::Genbank code you write
yourself that extracts the author names from a
Genbank flatfile.
Data Integration

Syntactic and Semantic
Integration of Data
The Holy Grail:

Align the promoters of all serine threonine kinases
involved exclusively in the regulation of nitrogen
fixation in root nodules

Retrieve and align 2000nt 5' from every serine/threonine
kinase in Legumes expressed exclusively in the root
whose expression increases 5X or more within 5 hours of
infection by rhizobium but is not activated during the
normal development of the root, and is <40% homologous
in the active site to kinases known to be involved in cell-
cycle regulation in any other species.
➔   Restrict to Legume
➔   SRS

➔   Find all serine/threonine kinases
➔   Blast, SRS, Model Organism DB's, 1' literature,
biological knowledge
➔   Restrict by WT expression pattern
➔   Entirely by hand, primarily from 1' literature

➔Restrict by chip expression experiments
➔   SMD, ArrayExpress, (TIGR?), probably don't exist yet…

➔Identify active sites
➔   Prosite/Prints + biological knowledge

➔Restrict by homology/biological function
➔   SRS, Blast, & by hand from 1' literature

➔Get upstream 2000nt
➔   If available, entirely by hand
What's wrong with this picture??
Options for the Integration of Biological
Data
for Hosts and Consumers

(some of the following slides based on
Lincoln Stein’s GMOD 2003 meeting
presentation)
Starting from the provider's
perspective…
(False?) Utopia:
Fully Warehoused Database

[Diagram: a single central DATA store, with editing, visualization, query, analysis, and other tools all plugged directly into it]

Why false?
Current Situation:
Modular, Non-Integrated

[Diagram: many separate DATA stores (genome databases, expression databases, literature databases), each with its own set of tools]

In what ways is this good?
What we want:

[Diagram: a single visualization and query layer sitting on top of literature data, expression data, and genome data together]

Why is this better?

Why is it difficult to achieve?
URL Link Integration of Modular System

[Diagram: genome data, stock center data, and literature modules, each with its own tools, linked to one another through URL space (the "WWW" model)]

Why is it so successful?

Why is this insufficient?
Data Warehouse:
Common Data Model & API

[Diagram: queries go through a massive common API, or as direct DB queries, against a massive common DB schema (centralized)]

What is an API?
What are the +'s?
What are the –'s?

e.g. SeqHound from Blueprint
Data Federation I:
Common Data Model & API

[Diagram: queries go through a massive common API, or as direct DB queries, against a massive common DB schema (de-centralized)]

What are the +'s?
What are the –'s?

e.g. CHADO from GMOD
Data Federation II:
Adaptors & Common API

[Diagram: queries go through a common API, which talks to per-database adaptors wrapping differently-structured data sources]

What are the +'s?
What are the –'s?

e.g. BioPerl from open-bio
Common Data Exchange Format

[Diagram: each resource keeps its own local visualization and query APIs, but all exchange data in a common format ("BeigeML")]

What are the +'s?
What are the –'s?

e.g. TIGRML, MAGEML
Web-Service Architecture

[Diagram: visualization and query clients ask a central REGISTRY who can provide/accept 'beige' types of data, then call each matching service's API]

What are the +'s?
What are the –'s?

e.g. MOBY-S
Semantic Web Architecture

[Diagram: visualization and query built on a common generic, 100% expressive data and logic description language]

What are the +'s?
What are the –'s?

e.g. ?? S-MOBY, myGrid
Unfortunately, every one of
these integrative methodologies
(alone or in combination) is used
by one organization or another

…so we are nowhere near
achieving integration…
This is not a solved
problem!!
I cannot give you a tool or a methodology that will
answer these types of questions for you.

Generally speaking, the only way to
collect/integrate disparate datasets at the moment
is through brute force.

But there are some tools that make the brute-force
method less backbreaking!
SeqHound
A Canadian Data Warehouse

published in:
http://www.biomedcentral.com/1471-2105/3/32
surf to:
http://www.blueprint.org/seqhound
Update cycle report:
http://seqhound.blueprint.org/report.html
• Starting April 3rd, access to SeqHound via
seqhound.blueprint.org may no longer be
available.

• Here’s what you need to do from now on
Unleashed Informatics
• http://bond.unleashedinformatics.com
Make sure you select ******

Now go back to…
This is where you get SeqHound
Click this!
Save the file to disk
To install
• Go to where you saved the file

$   tar -zxvf seqhound-perl-4.0.tar.gz
$   cd seqhound-perl-4.0
$   mv *.pm ~/perl/share/perl/5.8.7/
$   mv .shoundremrc ~/
$   cd ~

• Note that /…share/perl/5.8.7 is also
where you are installing your CPAN modules.
We’re just making sure that everything is in its
correct place so that Perl can find it
Your .shoundremrc file content
Edit the file to contain the following:

[remote]
server1 = dogboxonline.unleashedinformatics.com
CGI = /cgi-bin/seqrem
port=80

[soap]
server1 = dogboxonline.unleashedinformatics.com
CGI = /soap/services/
port=8081
Need to add several more
CPAN modules for the new
SeqHound
$ perl -MCPAN -e shell
cpan>        install IO::Wrap
cpan>        install MIME::Tools
cpan>        install SOAP::Lite

The SOAP::Lite installation will ask you a lot of questions.
Simply accept all of the defaults (i.e. hit “enter” to every one)
SeqHound
A Data Warehouse for Academic and
Commercial Use

Based on a presentation by
Ian Donaldson
April 23, 2004

A program of the Samuel Lunenfeld                          A University of Toronto affiliated
Research Institute                                                         research institute
What is SeqHound?
Biological data sets exist in various locations, formats and data
structures that serve to represent each single data set.

For example, a biological sequence database may exist in
ASN.1 format, an expression database may exist in a flat-file
format and both data structures may make reference to an
organism but by different names (say by taxon identifier and
organism name).
What is SeqHound?
Since biological data sets are increasingly collected
individually but used together, it would be most convenient to
collect and store them:

• in one place
• by a means that is independent of their original format
• where they are represented in a single, cohesive data
representation.
• in some way that does not require the data user to be
responsible for this collection and storage of these data.

Question: How can this be achieved?
What is SeqHound?

[Diagram: sequences, sequence annotation, structure, literature, and interaction data all feed into SeqHound; the user reaches them through a single layer of access methods (the API)]
What is SeqHound?
An application programming interface (API) provides a uniform
method to access data in table columns and to data inside binary
large objects.

Again, sets of functions in the API communicate with different
modules. API functions will only be available for those data
"modules" that have been imported.

The API is available locally (in the C language) or remotely via an
http interface that can be accessed via corresponding APIs written
in C, C++, Java, Perl and BioPerl.
Who are the users of SeqHound?

Local Programmer

• bioinformatics developer in a biotech start-up
• graduate student anywhere in the world
Remote Programmer
• bioinformatics developer in a high-throughput lab

Web user
Help: Manual and resources
List of API functions
(scroll down to “Sequence Fetch
FASTA” and click on
SHoundGetFasta)
Description
of what that
function
does, and how,
in Perl.
What can the remote API do?
SEQUENCE FETCH - FASTA
SHoundGetFasta

SHoundGetDefline
SEQUENCE FETCH – GENBANK FLAT
FILE
SHoundGetGenBankff
SHoundGiFrom3D

SHoundDNAFromProtein

SHoundProteinFromDNA
SHoundGetReferenceIDFromGi

SHoundTaxIDFromGi

SHoundProteinsFromTaxIDIII
STRUCTURE FETCH
SHoundGet3D

SHoundGetXML3D

SHoundGetPDB3D
COMPLETE GENOME ITERATORS
SHoundProteinsFromOrganism

SHoundAllGenomes

SHoundChromosomeFromGenome

SHoundDNAFromOrganism
REDUNDANT (EQUIVALENT)
SEQUENCES

SHoundRedundantGroup

SHoundFirstOfRedundantGroupFromID
TAXONOMY
SHoundGetTaxNameFromTaxID

SHoundGetTaxChildNodes

SHoundGetTaxParent
SEQUENCE NEIGHBOURS
SHoundGetBlastResult

SHoundNeighboursFromGi

SHound3DNeighboursFromGi
FUNCTIONAL ANNOTATION

SHoundGOIDFromGi

SHoundOMIMFromGi
FUNCTIONAL ANNOTATION
SHoundCDDIDFromGi

SHoundLLIDFromGi
GO HIERARCHY
SHoundGODBGetNameByID

SHoundGODBGetChildrenOf

SHoundGODBGetParentOf
RPS BLAST DOMAINS
SHoundGetDomainsFromGi

SHoundGetGisByDomainId
Test your set up using example code

#!/usr/bin/perl -w
use strict;
use SeqHound;

print "Starting Program\n";
my $init = SHoundInit("TRUE", "myapp");
print "SHoundInit $init\n";
my $isinit = SHoundIsInited();
print "SHoundIsInited $isinit\n";
my $gi = SHoundFindAcc("CAA28783");
print "SHoundFindAcc returned gi number $gi\n";
my $fasta = SHoundGetFasta($gi);
print "SHoundGetFasta returned FASTA\n$fasta\n";

***Script:
Seq Hound Getting Started
Pre-computed blasts
• Unfortunately, not all SeqHound functions
are available in Perl, including the Blast
functions.
• However, pre-computed Blasts are
available!
• These are called “Neighbours”
• e.g. SHoundNeighboursFromGi
– look up the function yourself… it isn’t exactly
like the others…
Exercise
• Extend the sample code to retrieve the
Neighbours (i.e. BLAST hits) for the gi
number retrieved in your code
• Then get the FASTA for each of those.
• hint: perldoc -f split
– The Perl "split" function will take a string and
split it into an array based on a particular
character… like a ","

my @ids = split ",", $result;
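A quick way to convince yourself what split returns, using a made-up comma-separated list (the gi numbers below are hypothetical, not real SeqHound output):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $result = '111076,12345,67890';   # hypothetical comma-separated gi list
my @ids = split ",", $result;

print scalar(@ids), " ids\n";        # prints "3 ids"
foreach my $id (@ids) {
    print "$id\n";                   # one gi number per line
}
```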
Exercise: Combine everything you
know so far
Starting with a list of gi numbers that are of interest to you

1.   Use SeqHound to retrieve the GenBank file
2.   Use Boulder::Genbank to extract the journal reference
3.   Use Boulder::Genbank plus your own regular expression to
retrieve the SwissProt record ID (can’t do this in Perl
SeqHound, unfortunately, but you can in Java!)
4.   Use LWP to retrieve that SwissProt record

What you just built is commonly referred to as a “workflow”.

Workflows are often still written exactly as you just did, but
there are now better ways!!
Break for a glass of water
What data-retrieval problems are
still not solved?
●   Many biological data types are still not "covered" by
SeqHound
● e.g. microarray, two-hybrid, citations

●   I want to be able to move data from one place to another
without having to reformat it

●   I want the machine to tell me what data is available,
what I can do with it, and to automate the process of
doing it.

●   I want to be able to build complex analytical pipelines
without having to write a lot of code!
What new 'tool' is needed?
A mechanism by which a researcher is able
to simultaneously interact with multiple
sources of highly disparate biological data
regardless of the underlying data format or
schema.

The mechanism must also allow for the
automated, dynamic identification of data,
and the relationships between data from
different sources.
BioMOBY – the Genome Canada Data
Integration Platform
●   Determine the API for
● A web-services registry
● An ontology-based, data-driven, service discovery system
● A simple data representation and transport layer

●   That fulfills the following criteria:
● Presents the biologist with only relevant choices
● Presents the biologist with all such choices, without
necessitating their prior knowledge
● Constructs and executes queries against disparate service APIs
without human intervention
● Presents the resulting output in a predictable, machine-readable
format
● Is extendible into non-biological areas of study
● Requires minimal effort by service providers
BioBabel

Issues and Tools for Integration
of Biological Data
From Disparate Sources

STUDENTS: PLEASE READ THIS SECTION (up to BioMoby) ON
YOUR OWN TIME, AS I AM NOT GOING TO TEACH IT, BUT IT
MAY STILL BE INTERESTING TO YOU…
A Blast using the "Sevenless" protein (a
tyrosine protein kinase) from Drosophila
retrieves:

● Sevenless
● Ros-1 (capital R, with a '1')
● c-Ros (with a 'c' and no '1')
● c-ros-1 (with a 'c' and a '1', lower case r)
● c-ros-1 unknown protein (?!!??!?!)
● Tyrosine-specific protein kinase
● Tyrosine protein kinase
● Transmembrane tyrosine-specific protein kinase
(three ways to say the same thing)
● UR2 sarcoma oncogene
● Kinase related protein ros precursor

And that's just in Genbank… in SwissProt it is
described as:
•   Tyrosine-protein kinase receptor
The Tower of Babel analogy
for the genomics era:
● Dispersed creative process
● No overall coordination

● Massive, multidisciplinary projects

● Long duration

● How should the “end product” look?

Unfortunately, we have to live with it for at least the next few years…
So what can we do?
If we want to integrate data in a flexible, distributed manner, then
there are some things that we clearly MUST do.

e.g. For sequence data, we must agree:
● How to describe its structure and function

● common vocabulary!

● What a sequence query “looks like”

● common data representation

● How to execute a sequence query

● common API (or at least, self-describing API)

● To make it known that you can answer such queries

● “BioGoogle” of sequence retrieval services
Integration on which level?
• Integration of the way we describe the
data, or “semantic” integration
– Controlled vocabularies
– Ontologies
• Gene Ontology
• Sequence Ontology
• Other
• Integration of “data proper”
– Federation
– Warehousing
– Other...
Data Integration

Semantic Integration...
GO: the Gene Ontology
Consortium
Arose from an ISMB discussion paper presented by
Michael Ashburner (FlyBase) June, 1998.

“The goal of the Gene Ontology (GO) Consortium is
to produce a controlled vocabulary that can be
applied to all organisms even as knowledge of
gene and protein roles in cells is accumulating and
changing. GO provides three structured networks of
defined terms to describe gene product attributes.
GO is one of the controlled vocabularies of the
Open Biological Ontologies.”
What is An Ontology?
Formal definitions

1: A systematic account of existence.

2: An explicit formal specification of how to
represent the objects, concepts and other
entities that are assumed to exist in some area
of interest and the relationships that hold
among them.

3: The hierarchical structuring of knowledge
about things by subcategorizing them
according to their essential (or at least relevant
and/or cognitive) qualities.
Why GO?
• Provide a common gene-product vocabulary &
definition spanning all organisms

• Make it hierarchical to allow a controlled
ambiguity in biological role designations &
functional descriptions.

• relate records in existing genome databases
based on this common vocabulary.
---> DATA INTEGRATION!
GO structure:“DAG”
(Directed Acyclic Graph)
●   Hierarchical tree-like structure

●   Edges ('branches') are unidirectional, from
~less specific to ~more specific nodes.

●   Any node may have multiple parents.

●    Any path from a node must not lead back to
that node, hence acyclic.
●   (necessary for automated annotation)
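The acyclicity requirement is easy to check mechanically. The sketch below is illustrative only, not real GO tooling: the DAG is a hash mapping each node to its parents (the node names are a tiny invented fragment), and has_cycle() does a depth-first walk that fails if any path leads back to a node already on the current path:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# child => [ parents ]  (multiple parents are fine in a DAG)
my %parents = (
    'signal transduction'  => ['cell communication'],
    'response to stimulus' => ['cell communication'],
    'phototransduction'    => ['signal transduction', 'response to stimulus'],
);

sub has_cycle {
    my ($graph) = @_;
    my %state;    # per node: 0 = unvisited, 1 = on current path, 2 = done
    my $visit;
    $visit = sub {
        my ($node) = @_;
        my $s = $state{$node} || 0;
        return 0 if $s == 2;
        return 1 if $s == 1;    # reached a node on our own path: a cycle!
        $state{$node} = 1;
        foreach my $parent (@{ $graph->{$node} || [] }) {
            return 1 if $visit->($parent);
        }
        $state{$node} = 2;
        return 0;
    };
    foreach my $node (keys %$graph) {
        return 1 if $visit->($node);
    }
    return 0;
}

print has_cycle(\%parents) ? "cyclic\n" : "acyclic\n";   # prints "acyclic"
```

Automated annotation tools rely on exactly this property: walking "up" the parent links is guaranteed to terminate at a root.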
Generic DAG Structure

[Diagram: a generic DAG, illustrating multiple parents and cross-branching between branches]
Current GO Tree:
roots, nodes and vertices

[Diagram: the three roots (Biological Process, Molecular Function, Cellular Component) with "is a" edges down to nodes such as Death, Chaperone, and Cell, plus a "part of" edge linking Membrane to Cell]

Example

[Diagram: Cell Communication branches into Signal Transduction and Response to External Stimulus; these cross-link through nodes for response, stimulus response, and perception, down to SOS, Phototransduction, and Calcium ion release]
How GO affects data discovery:
•   Genome Database Queries
• “Retrieve all proteinase inhibitors localized in the vacuolar
lumen which are involved in regulation of apoptosis”
– GO:0004866 proteinase inhibitor
– GO:0005775 vacuolar lumen
– GO:0006915 apoptosis

•   Literature Database Queries
• “pull out all papers dealing with guard cell membrane-
bound receptors involved in water stress responses”

•   Interpretation of Microarray Data
• Functional clustering of microarray spots according to
their GO "biological-function" annotation.
• Clustering of spots by cellular location?!?
Activity Session: Ontologies

FOR BEGINNERS:
• Explore the GO homepage (geneontology.org)
• GO Mappings
• GO tools
• GO Browsers ("AmiGO")

Q1 How many genes in Arabidopsis are involved in
"specification of floral organ identity"?
- retrieve their sequences to a file

Q2 The ATCEL2 gene has what molecular function?
How many other Arabidopsis genes are recorded as
having that activity?

Browse the OBO website to see what
other ontologies are out there!
http://obo.sourceforge.net/cgi-bin/table.cgi

FOR EXPERIENCED:
• Browse to the "tools" section of the gene ontology website
• Install DAG-Edit (the ontology editor)
• Download one of the existing ontologies
• Explore it, and then try creating your own ontology from scratch!
• Browse the OBO website to see what other ontologies are out
there; load up one of them that is in DAG-Edit format and
explore it...
• http://obo.sourceforge.net/cgi-bin/table.cgi
BioMoby
What does BioMoby do?
The BioMoby Plan
•   Create an ontology of bioinformatics data-types
•   Define an XML representation of this ontology (syntax)
•   Create a public update/edit interface for this ontology
•   Define Web Service inputs and outputs with respect to the ontology
•   Register Services in an ontology-aware Registry

• Machines can find an appropriate service
• Machines can execute that service unattended
• Ontology is community-extensible
Overview of BioMoby Transactions

[Diagram: MOBY Central sits at the hub of the MOBY hosts & services (sequence, alignment, gene names, expression, phylogeny, protein, primers, alleles, …), registering them and brokering requests]
The BioMoby Ontologies
• The Namespace Ontology defines all of the different
categories of data we might want to talk about
– Genbank records, PDB records, PubMed abstracts, etc.
• The Object Ontology defines structures for these
different data categories
– FASTA, Blast reports, PDB 3-D structure,

• Both are 100% open to the public for update and
modification

• New data-type?
– First check that it doesn’t already exist in the ontology
– If not, then register it in the ontology
– VOILA, everyone else on earth can understand it and use it!
The Namespace Ontology
• Effectively a list of all different types of data
records, and their abbreviation
– NCBI_gi, PDB, EMBL, PubMed, OMIM, etc.

• Importantly

IN BIOMOBY WE DO NOT ASSOCIATE A
NAMESPACE WITH A PARTICULAR SERVICE
PROVIDER
ANYONE CAN SAY ANYTHING ABOUT ANYTHING!

– If you have some information about a GenBank record,
you are allowed to use the NCBI_gi namespace to
indicate which record you are talking about.
Data-Types in BioMoby

• Data-types are the core of BioMoby’s
interoperable behaviours.

• Every BioMoby data-type begins as an “identifier”
for some piece of data somewhere on the Web.
– Namespace + ID number

• Extra bits of information are then added to the
identifier to say more about it

• The syntax for this is formalized by using the
Object Ontology
The MOBY-S Object Ontology

• Has a similar structure to the Gene Ontology
– Data Class name at each node
– Edges define the relationships between Classes

• Edges define one of three relationships
– IS A
• Inheritance relationship
• All properties of the parent are present in the child
– HAS A
• Container relationship of 'exactly 1'
– HAS
• Container relationship with '1 or more'
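In Perl, the IS A relationship corresponds directly to class inheritance via @ISA. The sketch below is illustrative only (these are not the real MOBY-S Perl classes): VirtualSequence inherits the namespace/id behaviour of a base class standing in for the ontology's root Object:

```perl
#!/usr/bin/perl
use strict;
use warnings;

package MobyObject;    # stand-in for the ontology's root "Object"

sub new {
    my ($class, %args) = @_;
    return bless { namespace => $args{namespace}, id => $args{id} }, $class;
}
sub namespace { return $_[0]->{namespace} }
sub id        { return $_[0]->{id} }

package VirtualSequence;
our @ISA = ('MobyObject');   # IS A: everything the parent has, the child has

package main;

my $seq = VirtualSequence->new( namespace => 'NCBI_gi', id => '111076' );
print $seq->namespace, ":", $seq->id, "\n";   # prints "NCBI_gi:111076"
print "yes\n" if $seq->isa('MobyObject');     # the IS A test, literally
```

HAS A and HAS, by contrast, would appear as attributes inside the object (a contained Integer or a list of contained Strings), not as entries in @ISA.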
The Simplest Moby Data Object

<Object namespace='NCBI_gi' id='111076'/>

The combination of a namespace and an
identifier within that namespace
uniquely identify a data entity, not its
location(s), nor its representation
A Primitive Data-type

[Diagram: DateTime, Float, Integer, and String each ISA Object]

<Integer namespace='' id=''>38</Integer>
A Derived Data-Type

<VirtualSequence namespace='NCBI_gi' id='111076'>
  <Integer namespace='' id='' articleName='length'>38</Integer>
</VirtualSequence>

[Diagram: VirtualSequence ISA Object, and HASA an Integer (its length); Integer and String each ISA Object]
A Derived Data-Type

<GenericSequence namespace='NCBI_gi' id='111076'>
<Integer namespace='' id='' articleName='length'>38</Integer>
<String namespace='' id='' articleName='SequenceString'>
ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
</String>
</GenericSequence>

• GenericSequence ISA VirtualSequence
• GenericSequence HASA String (articleName ‘SequenceString’)
A Derived Data-Type

<DNASequence namespace='NCBI_gi' id='111076'>
<Integer namespace='' id='' articleName='length'>38</Integer>
<String namespace='' id='' articleName='SequenceString'>
ATGATGATAGATAGAGGGCCCGGCGCGCGCGCGCGC
</String>
</DNASequence>

• DNASequence ISA GenericSequence
Legacy file formats
• Containing “String” allows us to define ontological classes that represent
legacy data types (e.g. the 20 existing sequence formats!)
<NCBI_Blast_Report namespace='NCBI_gi' id='115325'>
<String namespace='' id='' articleName='content'>
TBLASTN 2.0.4 [Feb-24-1998]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
(1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query=   gi|1401126
(504 letters)

Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
336,723 sequences; 677,679,054 total letters

Searching..........done

Score      E
Sequences producing significant alignments:                           (bits)   Value

gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...    1009    0.0
emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...      58    4e-07
emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                     53    1e-05
</String>
</NCBI_Blast_Report>
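Producing such a wrapper programmatically is mostly a matter of escaping the legacy text; a sketch with Python's standard library (the report text is truncated here, and the element names simply copy the example above):

```python
import xml.etree.ElementTree as ET

# Wrap a legacy flat-file report (any text) in a MOBY-style object
# with a 'content' String, as in the NCBI_Blast_Report example.
report_text = "TBLASTN 2.0.4 [Feb-24-1998]\nQuery= gi|1401126 ..."

wrapper = ET.Element("NCBI_Blast_Report", namespace="NCBI_gi", id="115325")
content = ET.SubElement(wrapper, "String",
                        namespace="", id="", articleName="content")
content.text = report_text  # ElementTree escapes <, >, & automatically

print(ET.tostring(wrapper, encoding="unicode"))
```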
Binaries – pictures, movies
• We base64-encode binaries, and then define a hierarchy of data classes that
contain String

• base64_encoded_jpeg ISA text/base64 ISA text/plain HASA String

<base64_encoded_jpeg namespace='TAIR_image' id='3343532'>
<String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
BAgTDFdlc3Rlcm4gQ2FwZTESMBAGA1UEBxMJQ2FwZSBUb3duMQ8wDQYDVQQKEwZUaGF3dGUx
HTAbBgNVBAsTFENlcnRpZmljYXRlIFNlcnZpY2VzMSgwJgYDVQQDEx9QZXJzb25hbCBGcmVl
bWFpbCBSU0EgMjAwMC44LjMwMB4XDTAyMDkxNTIxMDkwMVoXDTAzMDkxNTIxMDkwMVowQjEf
MB0GA1UEAxMWVGhhd3RlIEZyZWVtYWlsIE1lbWJlcjEfMB0GCSqGSIb3DQEJARYQamprM0Bt
</String>
</base64_encoded_jpeg>
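The encoding step itself is one call in most languages; a Python sketch (the byte string is a stand-in, not a real JPEG):

```python
import base64

# Prepare binary data (e.g. a JPEG) for a base64_encoded_jpeg object.
image_bytes = b"\xff\xd8\xff\xe0fake-jpeg-bytes"
encoded = base64.b64encode(image_bytes).decode("ascii")

# ...and the receiving service reverses it:
decoded = base64.b64decode(encoded)
assert decoded == image_bytes
```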
Extending legacy data types
•   With legacy data-types defined, we can extend them as we see fit
•   annotated_jpeg ISA base64_encoded_jpeg
•   annotated_jpeg HASA 2D_Coordinate_set
•   annotated_jpeg HASA Description

<annotated_jpeg namespace='TAIR_Image' id='3343532'>

<2D_Coordinate_set namespace='' id='' articleName='pixelCoordinates'>
<Integer namespace='' id='' articleName='x_coordinate'>3554</Integer>
<Integer namespace='' id='' articleName='y_coordinate'>663</Integer>
</2D_Coordinate_set>

<String namespace='' id='' articleName='Description'>
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
</String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description

<annotated_jpeg namespace='TAIR_Image' id='3343532'>

<2D_Coordinate_set namespace='' id='' articleName='pixelCoordinates'>
<Integer namespace='' id='' articleName='x_coordinate'>3554</Integer>
<Integer namespace='' id='' articleName='y_coordinate'>663</Integer>
</2D_Coordinate_set>

<String namespace='' id='' articleName='Description'>
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
</String>
</annotated_jpeg>
The same object…
annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace='TAIR_Image' id='3343532'>
<CrossReference>
<Object namespace='TAIR_Allele' id='ufo-1'/>
</CrossReference>
<2D_Coordinate_set namespace='' id='' articleName='pixelCoordinates'>
<CrossReference>
<Object namespace='TAIR_Tissue' id='122'/>
</CrossReference>
<Integer namespace='' id='' articleName='x_coordinate'>3554</Integer>
<Integer namespace='' id='' articleName='y_coordinate'>663</Integer>
</2D_Coordinate_set>
<String namespace='' id='' articleName='Description'>
This is the phenotype of a ufo-1 mutant under long daylength, 16°C
</String>
<String namespace='' id='' articleName='content'>
MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAoIIJQDCC
</String>
</annotated_jpeg>
How to think about MOBY Objects
and Namespaces

[Figure: two data perspectives, X and Y, each view the same record in the
“gi” Namespace (a Genbank record) as their own Object]
Why define Objects in an ontology?

Bioinformatics service providers are not all experienced
programmers

The Moby Object Ontology provides an environment
within which “naïve” service providers can create new
complex data-types WITHOUT generating new flatfile
formats, and without having to understand XML
Schema

Minimize future heterogeneity between new data-types
to improve interoperability without requiring endless
schema-to-schema mapping efforts.
A portion of the MOBY-S
Object Ontology

…community-built!
How do we use the Moby
Ontologies?

[Figure: a client asks MOBY Central to discover services (Align, Phylogeny,
Primers, …) that consume things LIKE sequences; the Object ontology answers
“what is a sequence?” — a sequence is a ___ that has these features ___]
Let’s see BioMoby in action

Some screenshots of a Gbrowse
Moby browsing session
To retrieve this session
File/Save

This is SCUFL – Simple Conceptual
Unified Flow Language

It is a complete record of everything
you just did, and it can be saved for
use in the Taverna workflow
application that we will look at later…

** Taverna Workflow:
gbrowse_moby_workflow.xml
Browsing myriad resources through
a single interface

• No explicit coordination between providers

• Dynamic discovery of ~appropriate Resources
(Web Services)

• Automated execution of services

• Appropriate rendering of the output

• PIPELINING of output data into next service
Try a MOBY Browsing Session

http://www.mobycentral.cbr.nrc.ca
Using and Editing
BioMoby Workflows
In Taverna
Taverna Workbench
Tom Oinn and Martin Senger
myGrid Project

• Taverna can be obtained from:
http://taverna.sourceforge.net
• Once at the site, click on Download
• Download the version appropriate for your
operating system.
• Between releases, the Moby functionality may be updated and you
can find instructions on how to acquire those updates from:
http://biomoby.open-bio.org/index.php/moby-clients/taverna_plugin
Running
• For a more comprehensive guide on Running and using
Taverna, please refer to
http://taverna.sourceforge.net/usermanual/manual.html
• Assuming that you have downloaded and unzipped
Taverna, you can run it by double-clicking on runme.bat
(windows) or executing runme.sh (Unix/Linux/OS X)
• You may need to:

$ chmod u+x runme.sh
• Taverna’s splash screen
• Once Taverna has loaded, you will see 3 windows:
– Advanced Model Explorer
– Workflow Diagram
– Available Services
• The Advanced model explorer is Taverna’s primary editor and allows
you to load, save and edit any property of a workflow.
• The Workflow diagram contains a read-only graphical
representation of your workflow.
• The Available services window lists all of the
services available to a workflow designer.
• Under the node ‘Biomoby
@ …’ Moby services and
Moby data types are
represented.
• The Object ontology is
available as children of the
MOBY Objects node
• Services are sorted by
service provider authority
• If you wish to use registries other than the default one,
you can add a new Moby ‘Scavenger’ by choosing to
‘Add new Biomoby scavenger…’
• Enter the registry’s location and click okay.
Creating Workflows

• We will start by adding the Object ontology node
Object to our workflow.
• The Advanced model explorer now shows that we have a
processor called Object
– Object has 3 input ports: id, namespace and article
name
– Object has 1 output port: mobyData
• The Workflow diagram illustrates our processor
• To discover services that consume our data type,
context click on ‘Object’ and choose ‘Moby Object
Details’
• A window will pop up that tells you what services Object
feeds into and is produced by
• Expanding the Feeds into node results in a list of service
provider authorities
• Expanding an authority, for example,
bioinfo.icapture.ubc.ca, reveals a list of services
• We will choose to add the service called
‘MOBYSHoundGetGenBankFasta’ to our workflow.
• A look at the state of our current workflow.
• And graphically.
• The service consumes Object, with article name
identifier, and produces FASTA, with article name fasta.
• To discover more services, context click on the service
whose output data type you would like to find consuming
services for, and choose Moby Service Details.
• The resultant window displays the service’s inputs and
outputs.
• There are also tool tips that show up when your mouse
hovers over any particular input or output, telling you
what namespaces the data type is valid in
• Context clicking on an output reveals a menu with 3 options.
– A brief search for services that consume our datatype
– A semantic search for services that consume our datatype
– Adding a parser to the workflow that understands our datatype
• The result of choosing to add a parser for FASTA to our workflow.
• The parser allows us to extract:
– The namespace and id from FASTA
– The namespace and id from the child String
– The textual content from the child String
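What the parser extracts can be approximated in a few lines; a hedged Python sketch over a simplified message (element layout mirrors the slide examples, not a full MOBY envelope):

```python
import xml.etree.ElementTree as ET

# Peel the MOBY wrapper from a FASTA object: get the object's
# namespace/id and the raw text from the child String.
message = """
<FASTA namespace="NCBI_gi" id="656461">
  <String namespace="" id="" articleName="content">&gt;gi|656461
ATGATGATAGATAGAGGG</String>
</FASTA>
"""
fasta = ET.fromstring(message)
namespace, record_id = fasta.get("namespace"), fasta.get("id")
content = fasta.find("String").text
print(namespace, record_id)  # NCBI_gi 656461
```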
• The result of choosing to conduct a brief search for
services that consume FASTA
• We will add the service getDragonBlastText to our
workflow by choosing ‘Add service -…’ from the context menu
• The current state of our workflow shown graphically.
• A more complex view of our workflow
• Finding services that consume NCBI_BLAST_Text starts
by viewing the details of the service ‘getDragonBlastText’
• Conduct a brief search
• Add the service ‘parseBlastText’ to our workflow
• Our current workflow
• Workflow inputs are added by context clicking on
Workflow inputs in the Advanced model explorer and
choosing ‘Create New Input…’
• The result from adding 2 inputs:
– Id
– namespace
• The workflow input id will be connected to Object’s input
port ‘id’
• Workflow after connecting the workflow input ‘id’
• The workflow input namespace will connect to Object’s
input port ‘namespace’
• Workflow after connecting the workflow inputs.
• Workflow outputs are added by context clicking on
Workflow outputs in the Advanced model explorer and
choosing ‘Create New Output…’
• The result from adding 2 workflow outputs:
– moby_blast_ids
– fasta_out
• The output moby_blast_ids will be connected to
parseBlastText’s output port Object(Collection –’hit_ids’)
• The output fasta_out will be connected to
Parse_Moby_Data_FASTA’s output port fasta_’content’
• To run the workflow, click on ‘Tools and Workflow
Invocation’
• Choose ‘Run workflow’
• A prompt to add values to our 2 workflow inputs
• To add a value to the input ‘id’ click on id from the left
pane and choose ‘New Input’
• Enter 656461 as the id
• Choose namespace from the left and click on ‘New Input’
• Enter NCBI_gi as the value for namespace
• Once you are done, click on ‘Run Workflow’
• Our workflow in action
• Once the workflow is complete, we can examine the
results of our workflow.
• A detailed report is available outlining what happened
when and in what order.
• We can examine the intermediate inputs and output, as
well as visualize our workflow.
• If we choose the Graph tab, our workflow is illustrated.
• Intermediate inputs allow us to examine what a service
has accepted as input
• Similarly, Intermediate outputs allows us to examine the
output from any particular service.
• Without the parser, FASTA is represented as a Moby
message, fully enclosed in its wrapper.
• Non-Moby services do not expect this kind of message
• Non-Moby services expect just the sequence; using
the Parse_Moby_Data processor, we can extract
exactly that
• Moby services can interact with the other services in
Taverna.
• Let’s add a Soaplab service.
• We will choose a
nucleic_restriction
soaplab service called
‘restrict’
• Choose the restrict
service and add it
to the workflow.
• We will connect the output port fasta_’content’ from the
service Parse_Moby_Data_FASTA to the input port
‘sequence_direct_data’ from the service restrict
• The result of our actions
so far.
• We will need to add
another workflow output
to capture the output of
restrict.
• Create an output called restrict_out
• Connect the output port ‘outfile’ from the service restrict
to the workflow output restrict_out
• Once the connections
have been made, run the
workflow again using the
same inputs.
• The workflow on the left has some
extra services added to it.
– FASTA2HighestGenericSequenceObject
from the authority
bioinfo.icapture.ubc.ca
– runRepeatMasker     from the authority
genome.imim.es
– A Moby parser for the output
– A workflow output Masked_Sequence
• The service runRepeatMasker is configurable, i.e. it
consumes Secondary parameters.
• To edit these parameters, context click on the service
and choose ‘Configure Moby Service’
• The name of the parameter is on the left and the value is
on the right.
• Clicking on the Value will bring up a drop down menu, an
input text field, or any other appropriate field depending
on the parameter.
• The parameter species contains an enumerated list of
possibilities.
• Select human.
• When you have made your selection, you may close the
window.
• Let’s run the workflow
• We will run our workflow with a list
– Click on id in the left pane and then click on New
Input twice
• Enter 656461 and 654321 as the ids
• Enter NCBI_gi as the value for namespace
• Our workflow will now run using each id with the single namespace
• Notice how the workflow is running with iterations. This is
happening because the Enactor is performing a
cross-product on the inputs
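The iteration behaviour can be pictured as a cross-product over the input lists; a sketch of the idea, not Taverna's actual code:

```python
from itertools import product

# Two ids and one namespace, as entered in the run dialog above.
ids = ["656461", "654321"]
namespaces = ["NCBI_gi"]

# The enactor invokes the workflow once per (id, namespace) pair.
for record_id, ns in product(ids, namespaces):
    print(f"invoking workflow with id={record_id}, namespace={ns}")
# Two ids x one namespace -> the workflow runs twice.
```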
• You can still view intermediate inputs and outputs.
• Using the queryIDs, you can track each invocation of a
moby service through the whole workflow
•   Imagine now that you want to run the workflow using a FASTA sequence that you
input yourself (without the gi identifier)
•   To do this, context click on getDragonBlastText and choose Moby Service Details
–   Expand the Inputs node and context click on FASTA(‘sequence’)
–   Choose Add Datatype – FASTA(‘sequence’) to the workflow
•   A FASTA datatype will be added to the workflow and the appropriate links created
• Notice the datatype FASTA
on the left of the workflow
– Since the datatype FASTA
HASA String, a String was
also added to our workflow
and the appropriate links created
• We will now have to add
another workflow input and
connect it to the String
component of FASTA.
•   A workflow input ‘sequence’ was
added to the workflow and a
connection was made from the
workflow input to the input port ‘value’
of String.

•   We also removed the link between
MOBYSHoundGetGenBankFasta and
getDragonBlastText by context clicking
on the link in the Advanced model
explorer’s Data links and choosing to
remove it
•   Now when we choose to run our
workflow, we will also have the chance
to enter a FASTA sequence
• Go ahead and enter any FASTA sequence as the input to
the workflow input ‘sequence’
• Run the workflow
•   Any results can be saved by simply choosing to Save to disk
– You will be prompted to enter a directory to save the results.
– Each workflow output will be saved in a folder with the same name as a workflow
output and the contents of the folder will be the results
•   You can also choose Excel, which produces an Excel worksheet with
columns representing the workflow outputs and with rows that represent the
actual data.
Load the SCUFL workflow you
saved earlier
• After your gbrowse_moby browsing session, you
saved the workflow.
– http://www.ece.ualberta.ca/~markw/taverna_workflows/

• Load it up again now, look at it, and
perhaps run it

• Please don’t ALL run it…when all of you run the
same workflow simultaneously it will cause an
awful load on the poor service providers! ;-)
