									Access to scripts, sample data, etc

Perl Unit

Regular Expressions
     (see notes from Boris Steipe)
                            To Start:
      review extraction of text from a file in Perl
First, download your own copy of the file at:


 open (IN, "./myGenbankFile.gb") or die "can't open file $!\n";

 my $line = <IN>; # read a line from that file

 # do something with the line here

 print $line;

 close IN;
 Prints the first line of the genbank
            file to screen

• The code opens up this file and reads in
  the first line. Print out that line to the
  screen. It should start with “LOCUS”
 Using what you know from regular expressions:
1.   Match the word “LOCUS”
2.   Does a word in that line start with “C”? Print “yes” if true.
3.   Match the year at the end of the line
4.   Match the month at the end of the line
5.   How many amino acids are in the record?
      – Print that number out to the screen
6. Match the date, and print it out as “MAR 21, 1995”
7. Print each element in that line on its own line
      – E.g. LOCUS
8. Get the definition line from the file and print only the definition
9. ***** Get the Title line (this is NOT in a predictable location, and is on more
    than one line!) and print out the title without extra spaces.
10. ***** Get the author line and print out the authors in the following format:
     MJ Radeke, TP Misko, C Hsu, LA Herzenberg….
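    # 1. Match the word "LOCUS"
    $line =~ m/LOCUS/;

    # 2. Does a word in that line start with "C"?
    if ($line =~ m/\bC/){
       print "yes";
    }

    # 3. Match the year at the end of the line
    $line =~ m/\d\d\d\d$/;

    # 4. Match the month at the end of the line
    $line =~ m/\-\w+\-/;

    # 5. How many amino acids are in the record?
    $line =~ m/(\d+)\saa/;
    print "$1\n";

    # 6. Match the date, and print it out as "MAR 21, 1995"
    $line =~ m/(\d+)\-(\w+)\-(\d\d\d\d)/;
    print "$2 $1, $3 \n";

    # 7. Print each element in that line on its own line
    while ($line =~ m/(\S+)/g){
       print "$1\n";
    }
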
    # 8. Read on to the DEFINITION line and print only the definition
    $line   = <IN>;
    $line   = <IN>;
    $line   =~ /DEFINITION\s+(.*)/;
    print   "$1\n";

    # 9. Find the TITLE line, then keep appending lines until JOURNAL
    while (!($line =~ /TITLE/)){
      $line = <IN>;
    }
    $line =~ /TITLE\s+(.*)/;
    $title = $1;
    $line = <IN>;
    while (!($line =~ /JOURNAL/)){
      $title = "$title $line";
      $line = <IN>;
    }
    $title =~ s/\s\s+/ /g; # tricky! collapses runs of whitespace
    print "the title is $title\n";

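    # 10. Find the AUTHORS line and print "initials surname" for each author
    while (!($line =~ /AUTHORS/)){
       $line = <IN>;
    }
    $line =~ /AUTHORS\s+(.*)/;
    my $authors = $1;
    # /g is needed so that each pass of the loop moves on to the next author
    while ($authors =~ /(\w+),(\w)\.(\w)?\.?/g){
         $surname = $1;
         $initial1 = $2;
         $initial2 = $3;
         $initial2 = "" unless $initial2;
         print "$initial1$initial2 $surname, ";
    }
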
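       (alternative solution to #10)
while (!($line =~ /AUTHORS/)){
   $line = <IN>;
}
$line =~ /AUTHORS\s+(.*)/;
my $authors = $1;
# \s* (not \s+) so the first author, which has no leading space, is matched;
# /g so that each pass of the loop moves on to the next author
while ($authors =~ /\s*(\S+)\,(\w)[\.\,](\w)?/g){
     $surname = $1;
     $initial1 = $2;
     $initial2 = $3;
     $initial2 = "" unless $initial2;
     print "$initial1$initial2 $surname, ";
}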
  End of day

have fun in the city!
Unix Unit
        CPAN: The
Comprehensive Perl
   Archive Network
       & Makefiles
  CPAN – Comprehensive Perl Archive Network
• This is where you go to download and install most Perl-based
  software libraries

• To search for stuff, go to: http://search.cpan.org
   – e.g. Search for “LWP”
   – Click the first link and have a quick look at the documentation
     to see what the “LWP” library does.
       • Click on the “an example” link at the top of the page and
         stay there as we will use that example in a moment...

   – Search for Boulder::Genbank
       • Make a mental note of what that module does… we’ll
         return to it later
            Installing from CPAN
• There is a program that allows you easy access to
  CPAN installations

• Unfortunately, this wants to install files by default
  into a location that you probably do not have
  permissions to access!!

• Therefore we have to do some re-configuration to
  change this default behaviour.

• You can do this on ANY *nix machine you use
               Creating a personal
                CPAN Config.pm
    First you need to specify a location to store the CPAN tools
$ cd ~                        change to your own home folder
$ mkdir perl                  create a new folder called “perl”

    Then start CPAN, using the command line as follows:

$   export FTP_PASSIVE=1      this is not necessary at U of A
$   perl -MCPAN -e shell
           Creating a personal CPAN Config.pm
      The first time CPAN runs it will ask you a lot of questions.
      Just accept the defaults UNTIL you get to this one:
Parameters for the 'perl Makefile.PL' command?
Typical frequently used settings:

  PREFIX=~/perl   non-root users (please see manual for more hints)

Your choice:   [INSTALLDIRS=site] PREFIX=~/perl...

                                    Type “PREFIX=~/perl”
                                    This will tell CPAN to install
                                    your modules in the ~/perl folder
                                    that you just created
       Creating a personal CPAN Config.pm
     And there’s one more series that you have to answer:
(1) Africa
(2) Asia
(3) Central America
(4) Europe
(5) North America
(6) Oceania
(7) South America
Select your continent (or several nearby continents) [] 5

(1) Bahamas
(2) Canada
(3) Mexico
(4) United States
Select your country (or several nearby countries) [] 2

(1) ftp://CPAN.mirror.rafal.ca/pub/CPAN/
(2) ftp://cpan.sunsite.ualberta.ca/pub/CPAN/
(3) ftp://ftp.nrc.ca/pub/CPAN/
(4) ftp://mirror.arcticnetwork.ca/pub/CPAN
(5) ftp://theoryx5.uwinnipeg.ca/pub/CPAN/
Select as many URLs as you like (by number),
put them on one line, separated by blanks, e.g. '1 4 5' [] 2 3 4 5 1
            Editing your PERL5LIB environment

 Now you also have to tell Perl that it should look in this folder
 for additional libraries...

 You only need to do this once – it will now be set every time you login

$ pico .profile    (note this is “dot profile”, your startup 'profile' in bash)

               add the following lines to your .profile file

   export PERL5LIB=$PERL5LIB:$HOME/perl/share/perl/5.8.7/
   export FTP_PASSIVE=1

              now save the file and exit
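
A quick way to confirm that the setting took effect is to print Perl's library search path, @INC, after logging in again. The sketch below is only an illustration: the helper name on_search_path is made up here, and the folder it looks for assumes the ~/perl/share/perl/5.8.7 layout used on these slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if $dir (ignoring any trailing slash) is one of the given paths.
# on_search_path is a hypothetical helper written for this example.
sub on_search_path {
    my ($dir, @path) = @_;
    $dir =~ s{/+$}{};
    for my $entry (@path) {
        (my $clean = $entry) =~ s{/+$}{};
        return 1 if $clean eq $dir;
    }
    return 0;
}

# Assumed location, following the PERL5LIB line on the slide.
my $home   = defined $ENV{HOME} ? $ENV{HOME} : '.';
my $wanted = "$home/perl/share/perl/5.8.7";

# Show every folder Perl will search, then report on the one we care about.
print "$_\n" for @INC;
print on_search_path($wanted, @INC)
    ? "found $wanted\n"
    : "$wanted is NOT in \@INC\n";
```

If the folder is reported as missing, check that .profile was saved and that you have logged in again since editing it.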

        Finally, we are ready to install stuff!
• Now CPAN knows where it should put things, and Perl
  knows where to look for them...
• The command to start the CPAN shell is:
        $ perl -MCPAN -e shell

             now try installing something…

        cpan> install Boulder::Genbank

 CPAN installs often ask you questions as they go along... unless you know
 better you should ALWAYS just accept the default (hit “enter”)
      Watch for error messages...
    • You may see an error message reported
      at the end of an installation
Failed 2/28 test scripts, 92.86% okay. 46/388 subtests failed, 88.14%
*** Error code 29
make: Fatal error: Command failed for target `test'
 /usr/ccs/bin/make test -- NOT OK
Running make install
 make test had returned bad status, won't install without force

    So... now what?? Well, unless you really know what
    you are doing, you have no choice but to use force!

 cpan> force install Boulder::Genbank

           Don't worry, this is safer than it sounds ;-)
   Congratulations, you have just installed your first
                    CPAN module

• If you want to look at what got installed browse to the folder:


• If all went well, you should now be able to execute the
  following command without error:

        $ perl -e 'use Boulder::Genbank'
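
The same check can be done from inside a script, which is handy when several modules are needed. This is a minimal sketch, not part of the course scripts: module_available is a hypothetical helper, and it simply tries to load each module by name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
# module_available is a hypothetical helper written for this example.
sub module_available {
    my ($module) = @_;
    # Accept only plain module names (Foo or Foo::Bar) before using string eval.
    return 0 unless $module =~ /^\w+(?:::\w+)*$/;
    return eval "require $module; 1" ? 1 : 0;
}

for my $module ('Boulder::Genbank', 'LWP::UserAgent') {
    printf "%-20s %s\n", $module,
        module_available($module) ? 'ok' : 'NOT installed';
}
```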

                    TELL ME NOW!!!
      • After the break we are going to talk about
        object oriented perl...
Perl Unit

Object Oriented Perl
              Perl Unit: Object Oriented Perl
  • The two most common programming “paradigms” are Procedural,
    and Object-Oriented

  • Procedural programming is most commonly used in simple scripts
    where all data and data manipulations are achieved in a single
    program with all variables “visible” and all subroutines present in
    the same program... can get messy! (this is what you have been
    doing so far)

  • Object Oriented programming allows you to collect together a set
    of data as well as the subroutines that may act on it into an
    “Object” and hide them inside of a “magic variable”...
       Perl Unit: Object Oriented Perl

• What you installed from CPAN was a “module” (actually a
  set of related modules)
   – a module could be imagined to be an 'empty' object,
     or an object 'template'.

• To use an object, you must instantiate it, and fill it with
  data

• In Perl, this is most often accomplished using the “new”
  method
  Perl Unit: Object Oriented Perl
#!/usr/bin/perl -w

use SomeObject; # first we need to load the empty object
my $x = new SomeObject(); # then create an "instance"

# or, equivalently (use one form or the other, not both):
my $x = SomeObject->new(); # -> means "do this to it"

  $x now contains an “instance” of SomeObject!
      A commonly used paradigm
#!/usr/bin/perl -w

use SomeObject;
my $x = SomeObject->new(
   'attribute1' => 'value1',
   'attribute2' => 'value2',
);

  $x has now been created, and initialized with “value1”
  for attribute 1, and “value2” for attribute 2
      Invoking “methods” of the object
#!/usr/bin/perl -w

use SomeObject;
my $x = SomeObject->new(
   'attribute1' => 'value1'
);

# ask the object to do something with that data
print $x->getValue('attribute1');
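
To see what is going on behind "new" and "getValue", here is a minimal sketch of how such a class could be written. SomeObject is the placeholder name from the slides, not a real CPAN module, and real modules usually do more than this (validation, defaults, and so on).

```perl
#!/usr/bin/perl
use strict;
use warnings;

package SomeObject;   # placeholder class from the slides, for illustration only

# Constructor: copy the attribute/value pairs into a hash and bless it,
# which turns the plain hash reference into an "instance" of the class.
sub new {
    my ($class, %attributes) = @_;
    my $self = { %attributes };
    return bless $self, $class;
}

# Method: return the stored value for one attribute name.
sub getValue {
    my ($self, $name) = @_;
    return $self->{$name};
}

package main;

my $x = SomeObject->new(
    'attribute1' => 'value1',
    'attribute2' => 'value2',
);
print $x->getValue('attribute1'), "\n";   # prints value1
```

The data lives inside $x, and the only way the rest of the script touches it is through the object's methods.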
        Try your LWP installation
• Go back to your web browser where you should
  still have the LWP documentation visible
   – If not, you could also type 'perldoc LWP' at your
      command prompt to get the same information

     $ perldoc LWP

• Copy/paste the LWP example code into a new file
  on your system called “test_lwp.pl”

• Edit the file to connect to some website of
  interest... for example, cnn.com, as follows...
The code may be simplified to this:
  # Create a user agent object
  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;

  # Create a request
  my $req = HTTP::Request->new(GET => 'http://www.cnn.com');

  # Pass request to the user agent and get a Response Object back
  my $res = $ua->request($req);

  # Check the outcome of the response
  if ($res->is_success) {
      print $res->content;
  }
  else {
      print $res->status_line, "\n";
  }
                                                            *** Script
                                                            LWP Example
Now you can retrieve web pages without a browser!
                   You're FREE!
                               Explanation

  # Create a new UserAgent object
  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;

  # Create a new Request object
  my $req = HTTP::Request->new(GET => 'http://www.cnn.com');

  # Pass the Request object as an argument to the 'request' method of
  # the UserAgent to get a "Response" object in return
  my $res = $ua->request($req);

  # Check what the 'is_success' method returns (T/F)
  if ($res->is_success) {
      # Get the content of the response object
      print $res->content;
  }
  else {
      print $res->status_line, "\n";
  }
        Exercises: Connecting
     Regular Expressions and LWP
1.    Write a script that uses LWP to retrieve a web page from any
      address you specify at the command line
     – Hint: $arg = <STDIN>; chomp $arg;

2.    Write a script that uses LWP to get the function ontology
      from GO, then look through it line by line to find the GO Term
      associated with a GO id entered by the user

      function ontology is here:
     e.g. $ perl test_getGOTerm.pl GO:0033041

     Should return “sweet taste receptor activity”

     …This will require you to write a clever regular expression!
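
One possible shape for the lookup part of exercise 2 is sketched below. It assumes the ontology text uses an OBO-style layout, with an "id:" line followed by a "name:" line in each stanza; the real file may be laid out differently, so treat the regular expressions as a starting point. The two sample stanzas are built from IDs and names quoted on these slides, and find_go_term is a hypothetical helper. Fetching the file with LWP is left out so the sketch runs without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Look through the text line by line and return the name that follows the
# "id:" line matching $wanted_id. Returns nothing if the ID is not found.
# Assumes an OBO-style layout ("id: GO:..." then "name: ...").
sub find_go_term {
    my ($wanted_id, $text) = @_;
    my $in_wanted_stanza = 0;
    for my $line (split /\n/, $text) {
        if ($line =~ /^id:\s*(GO:\d+)\s*$/) {
            $in_wanted_stanza = ($1 eq $wanted_id);
        }
        elsif ($in_wanted_stanza && $line =~ /^name:\s*(.+?)\s*$/) {
            return $1;
        }
    }
    return;
}

# Illustrative sample only; in the exercise this text would come from LWP.
my $sample = <<'END';
[Term]
id: GO:0006915
name: apoptosis

[Term]
id: GO:0033041
name: sweet taste receptor activity
END

my $id   = @ARGV ? $ARGV[0] : 'GO:0033041';
my $term = find_go_term($id, $sample);
print defined $term ? "$term\n" : "no term found for $id\n";
```

Anchoring the patterns with ^ keeps the match from firing on lines that merely mention a GO ID somewhere in the middle.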
• The same methodology can be used to
  “scrape” information from ANY web page!
   –Try writing some “screen-scrapers”
    tonight for yourself
   –It's far easier than cut-n-paste!

• Screen-scraping is an awful (but
  common!) way to do bioinformatics...

  Exercise: Replacing & Enhancing
Regular Expressions with OO Modules
• A few minutes ago we installed Boulder::Genbank
• This takes at least some of the “pain” out of building
  Regular Expressions for parsing GenBank (and other!)
• Examine the Boulder Genbank Parser code I have made
  available to you. ***Script: Boulder Genbank Parser

• Exercise: combine and edit
  (a) your regular expression code that modified the
  author-name/initial format, with
  (b) a piece of Boulder::Genbank code you write
  yourself that extracts the author names from a
  Genbank flatfile.
Data Integration

Syntactic and Semantic
  Integration of Data
               The Holy Grail:

Align the promoters of all serine threonine kinases
involved exclusively in the regulation of nitrogen
fixation in root nodules

Retrieve and align 2000nt 5' from every serine/threonine
kinase in Legumes expressed exclusively in the root
whose expression increases 5X or more within 5 hours of
infection by rhizobium but is not activated during the
normal development of the root, and is <40% homologous
in the active site to kinases known to be involved in cell-
cycle regulation in any other species.
➔   Restrict to Legume
     ➔   SRS

➔   Find all serine/threonine kinases
     ➔   Blast, SRS, Model Organism DB's, 1° literature,
         biological knowledge

➔   Restrict by WT expression pattern
     ➔   Entirely by hand, primarily from 1° literature

➔   Restrict by chip expression experiments
     ➔   SMD, ArrayExpress, (TIGR?), probably don't exist yet…

➔   Identify active sites
     ➔   Prosite/Prints + biological knowledge

➔   Restrict by homology/biological function
     ➔   SRS, Blast, & by hand from 1° literature

➔   Get upstream 2000nt
     ➔   If available, entirely by hand
What's wrong with this picture??
Options for the Integration of Biological Data
      for Hosts and Consumers

(some of the following slides are based on
 Lincoln Stein’s GMOD 2003 meeting presentation)
Starting from the providers
       (False?) Utopia:
Fully Warehoused Database

  Why false?

[Diagram labels: a central DATA store; Editing Tool; Other Tool; Query; Analysis]
         Current Situation:
       Modular, Non-Integrated

In what ways is this good?

[Diagram labels: Genome Databases, Expression Databases and Literature Databases, each shown with its own DATA and Tools]
                What we want:

Why is this better?

Why is it difficult to achieve?

[Diagram labels: Visualize; Query; Literature Data; Expression Data; Genome Data]
 URL Link Integration of Modular System

Why is it so successful?

Why is this insufficient?

[Diagram labels: URL Space; Genome Data; Stock Center Data; Literature Module; Tools under each module]
         Data Warehouse:
      Common Data Model & API
What is an API?

What are the +’s
What are the –’s

e.g. SeqHound from Blueprint

[Diagram labels: Query; Massive Common API; Massive Common DB Schema (Centralized)]
        Data Federation I:
     Common Data Model & API

What are the +’s

What are the –’s

e.g. CHADO from GMOD

[Diagram labels: Query; Massive Common API; Direct DB; Massive Common DB Schema (de-centralized)]
      Data Federation II:
    Adaptors & Common API

What are the +’s

What are the –’s

e.g. BioPerl from open-bio

[Diagram labels: Query; Common API; Aqua Schema Adaptor; Tan Schema Adaptor]
 Common Data Exchange Format

What are the +’s

What are the –’s
                      Query API

          Web-Service Architecture

What are the +’s

What are the –’s

e.g. MOBY-S

[Diagram labels: Visualize; Service API; partial callout: "Who can ... ‘beige’ types of data?"]
     Semantic Web Architecture

What are the +’s

What are the –’s

e.g. ?? S-MOBY, myGrid

[Diagram labels: Visualize; Query; Common Generic, 100% expressive Data and Logic Description Language]
   Unfortunately, every one of
these integrative methodologies
(alone or in combination) is used
 by one organization or another

   …so we are nowhere near
    achieving integration…
       This is not a solved problem
I cannot give you a tool or a methodology that will
answer these types of questions for you.

Generally speaking, the only way to
collect/integrate disparate datasets at the moment
is through brute force.

But there are some tools that make the brute-force
method less backbreaking!
    A Canadian Data Warehouse

              published in:
                     surf to:
               Update cycle report:
 Access to SeqHound Changes
• Starting April 3rd, access to SeqHound via
  seqhound.blueprint.org may no longer be available

• Here’s what you need to do from now on
       Unleashed Informatics
• http://bond.unleashedinformatics.com
Register for a free account
Make sure
you select
Now go back to…
This is where you get SeqHound
Click this!
Save the file
To disk
                    To install
• Go to where you saved the file

  $   tar -zxvf seqhound-perl-4.0.tar.gz
  $   cd seqhound-perl-4.0
  $   mv *.pm ~/perl/share/perl/5.8.7/
  $   mv .shoundremrc ~/
  $   cd ~

• Note that /…share/perl/5.8.7 is also
  where you are installing your CPAN modules.
  We’re just making sure that everything is in its
  correct place so that Perl can find it
Your .shoundremrc file content
   Edit the file to contain the following:

   server1 = dogboxonline.unleashedinformatics.com
   CGI = /cgi-bin/seqrem

   server1 = dogboxonline.unleashedinformatics.com
   CGI = /soap/services/
Need to add several more
CPAN modules for the new
  $ perl       -MCPAN -e shell
  cpan>        install IO::Wrap
  cpan>        install MIME::Tools
  cpan>        install SOAP::Lite

The SOAP::Lite installation will ask you a lot of questions.
Simply accept all of the defaults (i.e. hit “enter” to every one)
A Data Warehouse for Academic and
        Commercial Use

                                 Based on a presentation by
                                      Ian Donaldson
                                       April 23, 2004

   A program of the Samuel Lunenfeld                          A University of Toronto affiliated
   Research Institute                                                         research institute
What is SeqHound?
Biological data sets exist in various locations, formats and data
structures that serve to represent each single data set.

For example, a biological sequence database may exist in
ASN.1 format, an expression database may exist in a flat-file
format and both data structures may make reference to an
organism but by different names (say by taxon identifier and
organism name).
What is SeqHound?
Since biological data sets are increasingly collected
individually but used together, it would be most convenient to
have access to them:

• in one place
• by a means that is independent of their original format
• where they are represented in a single, cohesive data
• in some way that does not require the data user to be
responsible for this collection and storage of these data.

Question: How can this be achieved?
  What is SeqHound?

Sequences   Sequence annotation         Structure           Literature   Interactions


                                  Access Methods (API)

What is SeqHound?
An application programming interface (API) provides a uniform
method to access data in table columns and to data inside binary
large objects.

Again, sets of functions in the API communicate with different
modules. API functions will only be available for those data
“modules” that have been Imported.

The API is available locally (in the C language) or remotely via an
http interface that can be accessed via corresponding APIs written
in C, C++, Java, Perl and BioPerl.
Who are the users of SeqHound?

  Local Programmer

                     • bioinformatics developer in a biotech start-up
                     • graduate student anywhere in the world
 Remote Programmer
                     • bioinformatics developer in a high-throughput lab

     Web user
Help: Manual and resources
List of API functions
(scroll down to “Sequence Fetch
FASTA” and click on
of what that
does, and how,
in Perl.
What can the remote API do?

  Test your set up using example code

#!/usr/bin/perl -w
use strict;
use SeqHound;

print "Starting Program\n";
my $init = SHoundInit("TRUE", "myapp");
print "SHoundInit $init\n";
my $isinit = SHoundIsInited();
print "SHoundIsInited $isinit\n";
my $gi = SHoundFindAcc("CAA28783");
print "SHoundFindAcc returned gi number $gi\n";
my $fasta = SHoundGetFasta($gi);
print "SHoundGetFasta returned FASTA\n$fasta\n";

                        Seq Hound Getting Started
        Pre-computed blasts
• Unfortunately, not all SeqHound functions
  are available in Perl, including the Blast
  functions
• However, pre-computed Blasts are available
• These are called “Neighbours”
• e.g. SHoundNeighboursFromGi
  – look up the function yourself… it isn’t exactly
    like the others…
• Extend the sample code to retrieve the
  Neighbours (i.e. BLAST hits) for the gi
  number retrieved in your code
• Then get the FASTA for each of those.
• hint: perldoc -f split
  – The Perl “split” function will take a string and
    split it into an array based on a particular
    character… like a “,”

    my @ids = split ",", $result;
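
As a small illustration of the hint, the sketch below splits a comma-separated string and loops over the pieces. The string is a made-up placeholder, not real SeqHound output; check the function's documentation for what it actually returns.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder string, made up for illustration (not real gi numbers).
my $result = "111,222,333";

# Split on commas; each piece becomes one element of the array.
my @ids = split /,/, $result;

# In the exercise, this loop is where you would fetch the FASTA for each id.
for my $id (@ids) {
    print "would fetch FASTA for $id\n";
}
```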
 Exercise: Combine everything you
            know so far
Starting with a list of gi numbers that are of interest to you

   1.   Use SeqHound to retrieve the GenBank file
   2.   Use Boulder::Genbank to extract the journal reference
   3.   Use Boulder::Genbank plus your own regular expression to
        retrieve the SwissProt record ID (can’t do this in Perl
        SeqHound, unfortunately, but you can in Java!)
   4.   Use LWP to retrieve that SwissProt record

What you just built is commonly referred to as a “workflow”.

Workflows are often still written exactly as you just did, but
   there are now better ways!!
Break for a glass of water
What data-retrieval problems are
       still not solved?
 ●   Many biological data types are still not "covered" by
      ● e.g. microarray, two-hybrid, citations

 ●   I want to be able to move data from one place to another
     without having to reformat it

 ●   I want the machine to tell me what data is available,
     what I can do with it, and to automate the process of
     doing it.

 ●   I want to be able to build complex analytical pipelines
     without having to write a lot of code!
What new 'tool' is needed?
 A mechanism by which a researcher is able
 to simultaneously interact with multiple
 sources of highly disparate biological data
 regardless of the underlying data format or

 The mechanism must also allow for the
 automated, dynamic identification of data,
 and the relationships between data from
 different sources.
    BioMOBY – the Genome Canada Data
           Integration Platform
●   Determine the API for
    ● A web-services registry
    ● An ontology-based, data-driven, service discovery system
    ● A simple data representation and transport layer

●   That fulfills the following criteria:
      ● Presents the biologist with only relevant choices
      ● Presents the biologist with all such choices, without necessitating
        their prior knowledge.
      ● Constructs and executes queries against disparate service APIs
        without human intervention
      ● Presents the resulting output in a predictable, machine-readable
        format
      ● Is extendible into non-biological areas of study
      ● Requires minimal effort by service providers


       Issues and Tools for Integration
              of Biological Data
           From Disparate Sources

A Blast using the “Sevenless” protein (a
tyrosine protein kinase) from Drosophila
        ●Sevenless
        ●c-ros-1 unknown protein (?!!??!?!)
        ●Tyrosine-specific protein kinase
        ●Tyrosine protein kinase
        ●Transmembrane tyrosine-specific protein kinase
        ●UR2 sarcoma oncogene
        ●Kinase related protein ros precursor

  (callouts on the slide: "Capital R with a ‘1’"; "With a ‘c’ and no ‘1’";
   "With a ‘c’ and a ‘1’, lower case R"; "... ways to say it")

 And that’s just in Genbank… in SwissProt it is
 described as:
    •   Tyrosine-protein kinase receptor
      The Tower of Babel analogy
         for the genomics era:
● Dispersed creative process
● No overall coordination

● Massive, multidisciplinary projects

● Long duration

● How should the “end product” look?

    Unfortunately, we have to live with it for at least the next few years…
  So what can we do?
If we want to integrate data in a flexible, distributed manner, then
      there are some things that we clearly MUST do.

e.g. For sequence data, we must agree:
    ● How to describe its structure and function

      ● common vocabulary!

    ● What a sequence query “looks like”

      ● common data representation

    ● How to execute a sequence query

      ● common API (or at least, self-describing API)

    ● To make it known that you can answer such queries

      ● “BioGoogle” of sequence retrieval services
   Integration on which level?
• Integration of the way we describe the
  data, or “semantic” integration
  – Controlled vocabularies
  – Ontologies
     • Gene Ontology
     • Sequence Ontology
     • Other
• Integration of “data proper”
  – Federation
  – Warehousing
  – Other...
Data Integration

Semantic Integration...
         GO: the Gene Ontology
   Arose from an ISMB discussion paper presented by
         Michael Ashburner (FlyBase) June, 1998.

“The goal of the Gene Ontology (GO) Consortium is
   to produce a controlled vocabulary that can be
   applied to all organisms even as knowledge of
 gene and protein roles in cells is accumulating and
changing. GO provides three structured networks of
 defined terms to describe gene product attributes.
   GO is one of the controlled vocabularies of the
            Open Biological Ontologies.”
    What is An Ontology?
           Formal definitions

1: A systematic account of existence.

2: An explicit formal specification of how to
represent the objects, concepts and other
entities that are assumed to exist in some area
of interest and the relationships that hold
among them.

3: The hierarchical structuring of knowledge
about things by subcategorizing them
according to their essential (or at least relevant
and/or cognitive) qualities.
                      Why GO?
• Provide a common gene-product vocabulary &
  definition spanning all organisms

• Make it hierarchical to allow a controlled
  ambiguity in biological role designations &
  functional descriptions.

• relate records in existing genome databases
  based on this common vocabulary.
         ---> DATA INTEGRATION!
        GO structure:“DAG”
    (Directed Acyclic Graph)
●   Hierarchical tree-like structure

●   Edges ('branches') are unidirectional, from
    ~less specific to ~more specific nodes.

●   Any node may have multiple parents.

●    Any path from a node must not lead back to
    that node, hence acyclic.
    ●   (necessary for automated annotation)
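
The properties above can be made concrete with a small sketch. It stores a graph as a hash that maps each node to its list of parents, walks upward to collect ancestors, and checks that no node is its own ancestor. The node names are placeholders and both subroutines are hypothetical helpers, not part of any GO software.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# child => [parents]. Placeholder node names, for illustration only.
my %parents = (
    root => [],
    a    => ['root'],
    b    => ['root'],
    c    => ['a', 'b'],   # a node may have multiple parents
);

# Collect every ancestor of $node by following parent edges upward.
sub ancestors {
    my ($graph, $node, $seen) = @_;
    $seen ||= {};
    for my $parent (@{ $graph->{$node} || [] }) {
        next if $seen->{$parent}++;        # already visited
        ancestors($graph, $parent, $seen);
    }
    my @names = sort keys %$seen;
    return @names;
}

# The graph is acyclic if no node turns up among its own ancestors.
sub is_acyclic {
    my ($graph) = @_;
    for my $node (keys %$graph) {
        return 0 if grep { $_ eq $node } ancestors($graph, $node);
    }
    return 1;
}

my @anc = ancestors(\%parents, 'c');
print "ancestors of c: @anc\n";
print "acyclic: ", (is_acyclic(\%parents) ? "yes" : "no"), "\n";
```

Because every ancestor is reachable, an annotation made at a specific node can be treated as an annotation to all of its less specific ancestors.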
Generic DAG Structure

     Current GO Tree:
 roots, nodes and vertices

[Diagram: three roots (Biological Process, Mol. Function, Cell Component) with
 "Is a" and "Part Of" edges to nodes including Death, Chaperone, Cell,
 Signal Transduction, Response to External stimulus, Cell Adhesion,
 Radiation response, Abiotic Stim. Resp., Sensory perception, SOS ..., Calcium ion ...]
How GO affects data discovery:
 •   Genome Database Queries
         • “Retrieve all proteinase inhibitors localized in the vacuolar
           lumen which are involved in regulation of apoptosis”
             – GO:0004866 proteinase inhibitor
             – GO:0005775 vacuolar lumen
             – GO:0006915 apoptosis

 •   Literature Database Queries
           • “pull out all papers dealing with guard cell membrane-
             bound receptors involved in water stress responses”

 •   Interpretation of Microarray Data
           • Functional clustering of microarray spots according to
             their GO "biological-function" annotation.
           • Clustering of spots by cellular location?!?
        Activity Session: Ontologies

•Geneontology.org Website
•Explore the GO homepage
•GO Mappings
•GO Downloads
•GO tools
•GO Browsers “Amigo”

Q1 How many genes in Arabidopsis are involved in “specification of floral organ ...”?
     - retrieve their sequences to a file

Q2 the ATCEL2 gene has what molecular function? How many other
Arabidopsis genes are recorded as having that activity?

Browse the OBO website to see what other ontologies are out there!

FOR EXPERIENCED:
•Browse to the “tools” section of the gene ontology website
•Download “DAG-Edit”
•Install
•Download one of the existing ontologies ('downloads' section of the website)
•Explore it, and then try creating your own ontology from scratch!
    •Browse the OBO website to see what other ontologies are out there;
     load up one of them that is in DAG-Edit format and explore it...
    •http://obo.sourceforge.net/cgi-bin/table.cgi
What does BioMoby do?
               The BioMoby Plan
•   Create an ontology of bioinformatics data-types
•   Define an XML representation of this ontology (syntax)
•   Create a public update/edit interface for this ontology
•   Define Web Service inputs and outputs v.v. Ontology
•   Register Services in an ontology-aware Registry

• Machines can find an appropriate service
• Machines can execute that service unattended
• Ontology is community-extensible
Overview of BioMoby Transactions

[Diagram labels: MOBY hosts & services; Alignment; Sequence; names; Align; Express.; Central]
       The BioMoby Ontologies
• The Namespace Ontology defines all of the different
  categories of data we might want to talk about
   – Genbank records, PDB records, PubMed abstracts, etc.
• The Object Ontology defines structures for these
  different data categories
   – FASTA, Blast reports, PDB 3-D structure,

• Both are 100% open to the public for update and

• New data-type?
   – First check that it doesn’t already exist in the ontology
   – If not, then register it in the ontology
   – VOILA, everyone else on earth can understand it and use it!
      The Namespace Ontology
• Effectively a list of all different types of data
  records, and their abbreviation
   – NCBI_gi, PDB, EMBL, PubMed, OMIM, etc.

• Importantly


   – If you have some information about a GenBank record,
     you are allowed to use the NCBI_gi namespace to
     indicate which record you are talking about.
          Data-Types in BioMoby

• Data-types are the core of BioMoby’s
  interoperable behaviours.

• Every BioMoby data-type begins as an “identifier”
  for some piece of data somewhere on the Web.
  – Namespace + ID number

• Extra bits of information are then added to the
  identifier to say more about it

• The syntax for this is formalized by using the
  Object Ontology
     The MOBY-S Object Ontology

• Has a similar structure to the Gene Ontology
  – Data Class name at each node
  – Edges define the relationships between Classes

• Edges define one of three relationships

  – IS A
     • Inheritance relationship
     • All properties of the parent are present in the child
  – HAS A
     • Container relationship of ‘exactly 1’
  – HAS
     • Container relationship with ‘1 or more’
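
The edge types can be pictured with a small Python sketch. The class names are borrowed from the sequence examples later in this deck; the dictionary layout and the two helper functions are assumptions made for illustration only, not BioMoby's storage format or API.

```python
# Toy model of Object Ontology edges (illustrative layout, not BioMoby's own).
ONTOLOGY = {
    "Object": {"isa": None, "hasa": [], "has": []},
    "String": {"isa": "Object", "hasa": [], "has": []},
    "Integer": {"isa": "Object", "hasa": [], "has": []},
    "VirtualSequence": {"isa": "Object", "hasa": ["Integer"], "has": []},
    "GenericSequence": {"isa": "VirtualSequence", "hasa": ["String"], "has": []},
    "DNASequence": {"isa": "GenericSequence", "hasa": [], "has": []},
}


def ancestors(name):
    """Follow IS A edges from a class up to the root."""
    chain = []
    parent = ONTOLOGY[name]["isa"]
    while parent is not None:
        chain.append(parent)
        parent = ONTOLOGY[parent]["isa"]
    return chain


def inherited_members(name):
    """Collect HAS A members of the class and of all its IS A ancestors."""
    members = []
    for cls in [name] + ancestors(name):
        members.extend(ONTOLOGY[cls]["hasa"])
    return members


print(ancestors("DNASequence"))
print(inherited_members("DNASequence"))
```

Because IS A is inheritance, a DNASequence picks up the String and Integer members declared further up the chain without redeclaring them.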
The Simplest Moby Data Object

 <Object namespace="NCBI_gi" id="111076"/>

             The combination of a namespace and an
             identifier within that namespace
             uniquely identifies a data entity, not its
             location(s), nor its representation
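
As a rough sketch, the element above can be produced with Python's standard xml.etree.ElementTree module. The helper name simple_object is hypothetical; only the tag and the two attributes come from the slide.

```python
import xml.etree.ElementTree as ET


def simple_object(namespace, identifier):
    """Build the simplest Moby object: a namespace plus an id, no content."""
    return ET.Element("Object", {"namespace": namespace, "id": identifier})


xml_text = ET.tostring(simple_object("NCBI_gi", "111076"), encoding="unicode")
print(xml_text)
```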
        A Primitive Data-type

      DateTime ISA Object
      Float    ISA Object
      Integer  ISA Object      <Integer namespace="" id="">38</Integer>
      String   ISA Object
        A Derived Data-Type

<VirtualSequence namespace="NCBI_gi" id="111076">
    <Integer namespace="" id="" articleName="length">38</Integer>
</VirtualSequence>

[Diagram: Object, String; VirtualSequence ISA Object]
     A Derived Data-Type
<GenericSequence namespace="NCBI_gi" id="111076">
    <Integer namespace="" id="" articleName="length">38</Integer>
    <String namespace="" id="" articleName="SequenceString">...</String>
</GenericSequence>

[Diagram: GenericSequence ISA VirtualSequence ISA Object; GenericSequence HASA String]
     A Derived Data-Type
<DNASequence namespace="NCBI_gi" id="111076">
    <Integer namespace="" id="" articleName="length">38</Integer>
    <String namespace="" id="" articleName="SequenceString">...</String>
</DNASequence>

[Diagram: DNASequence ISA GenericSequence ISA VirtualSequence ISA Object; GenericSequence HASA String]
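
A derived object is just the identifier element with primitive children attached. The following Python sketch assembles a DNASequence that way; the helper names and the short sequence string are made up for illustration (it is not the real content of record 111076), while the tag and articleName values are taken from the slides.

```python
import xml.etree.ElementTree as ET


def primitive(tag, article_name, value):
    """A primitive child (Integer, String, ...) with empty namespace and id."""
    element = ET.Element(tag, {"namespace": "", "id": "", "articleName": article_name})
    element.text = str(value)
    return element


def dna_sequence(namespace, identifier, sequence):
    """Identifier element plus the length and SequenceString members."""
    root = ET.Element("DNASequence", {"namespace": namespace, "id": identifier})
    root.append(primitive("Integer", "length", len(sequence)))
    root.append(primitive("String", "SequenceString", sequence))
    return root


placeholder = "ATGGCGTTAGC"  # made-up sequence, for illustration only
obj = dna_sequence("NCBI_gi", "111076", placeholder)
print(ET.tostring(obj, encoding="unicode"))
```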
                      Legacy file formats
• Containing “String” allows us to define ontological classes that represent
  legacy data types (e.g. the 20 existing sequence formats!)
<NCBI_Blast_Report namespace="NCBI_gi" id="115325">
   <String namespace="" id="" articleName="content">
            TBLASTN 2.0.4 [Feb-24-1998]

            Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A.
            Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman
            (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search
             programs", Nucleic Acids Res. 25:3389-3402.

            Query=   gi|1401126
                      (504 letters)

            Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
                       336,723 sequences; 677,679,054 total letters


                                                                                  Score      E
            Sequences producing significant alignments:                           (bits)   Value

            gb|U49928|HSU49928 Homo sapiens TAK1 binding protein (TAB1) mRNA...    1009    0.0
            emb|Z36985|PTPP2CMR P.tetraurelia mRNA for protein phosphatase t...      58    4e-07
            emb|X77116|ATMRABI1 A.thaliana mRNA for ABI1 protein                     53    1e-05
                   Binaries – pictures, movies
• We base64 encode binaries, and then define a hierarchy of data classes that
  contain String

• base64_encoded_jpeg ISA text/base64 ISA text/plain HASA String

   <base64_encoded_jpeg namespace="TAIR_image" id="3343532">
      <String namespace="" id="" articleName="content">
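
A minimal Python sketch of the idea: the binary is base64 encoded and stored as the text of the contained String. The function names and the stand-in bytes are hypothetical; the element and attribute names follow the slide.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(tag, namespace, identifier, raw_bytes):
    """Base64 encode raw bytes and store them in a 'content' String child."""
    root = ET.Element(tag, {"namespace": namespace, "id": identifier})
    content = ET.SubElement(
        root, "String", {"namespace": "", "id": "", "articleName": "content"}
    )
    content.text = base64.b64encode(raw_bytes).decode("ascii")
    return root


def unwrap_binary(element):
    """Recover the original bytes from the 'content' String child."""
    return base64.b64decode(element.find("String").text)


fake_jpeg = b"\xff\xd8\xff\xe0 stand-in bytes, not a real image"
obj = wrap_binary("base64_encoded_jpeg", "TAIR_image", "3343532", fake_jpeg)
print(ET.tostring(obj, encoding="unicode"))
```

Decoding the String content with unwrap_binary returns the original bytes, so nothing is lost by carrying the picture as text.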
             Extending legacy data types
•   With legacy data-types defined, we can extend them as we see fit
•   annotated_jpeg ISA base64_encoded_jpeg
•   annotated_jpeg HASA 2D_Coordinate_set
•   annotated_jpeg HASA Description

    <annotated_jpeg namespace="TAIR_Image" id="3343532">

        <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
             <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
             <Integer namespace="" id="" articleName="y_coordinate">663</Integer>

        <String namespace="" id="" articleName="Description">
              This is the phenotype of a ufo-1 mutant under long daylength, 16°C
        <String namespace="" id="" articleName="content">
                   The same object…
 annotated_jpeg ISA base64_encoded_jpeg HASA 2D_Coordinate_set HASA Description
<annotated_jpeg namespace="TAIR_Image" id="3343532">
   <Object namespace="TAIR_Allele" id="ufo-1"/>
   <2D_Coordinate_set namespace="" id="" articleName="pixelCoordinates">
        <Object namespace="TAIR_Tissue" id="122"/>
        <Integer namespace="" id="" articleName="x_coordinate">3554</Integer>
        <Integer namespace="" id="" articleName="y_coordinate">663</Integer>
   <String namespace="" id="" articleName="Description">
   This is the phenotype of a ufo-1 mutant under long daylength, 16°C
   How to think about MOBY Objects
           and Namespaces

[Diagram: Object X (data perspective X) and Object Y (data perspective Y)
 both refer to the same record in “gi” (a GenBank record)]
  Why define Objects in an ontology?

Bioinformatics service providers are not all experienced

The Moby Object Ontology provides an environment
within which “naïve” service providers can create new
complex data-types WITHOUT generating new flatfile
formats, and without having to understand XML.

Minimize future heterogeneity between new data-types
to improve interoperability without requiring endless
schema-to-schema mapping efforts.
A portion of the MOBY-S
Object Ontology

    How do we use the Moby

• Discovery of services that consume things LIKE sequences!
• What is a sequence? A sequence is a ___ that has these features ___

[Diagram: Align and Phylogeny services; MOBY Central]
Let’s see BioMoby in action

 Some screenshots of a Gbrowse
    Moby browsing session
To retrieve this session

            This is SCUFL – Simple Conceptual
            Unified Flow Language

            It is a complete record of everything
            you just did, and it can be saved for
            use in the Taverna workflow
            application that we will look at later…

            ** Taverna Workflow:
Browsing myriad resources through
        a single interface

• No explicit coordination between providers

• Dynamic discovery of ~appropriate Resources
  (Web Services)

• Automated execution of services

• Appropriate rendering of the output

• PIPELINING of output data into next service
Try a MOBY Browsing Session

 Using and Editing
BioMoby Workflows
    In Taverna
Taverna Workbench
Tom Oinn and Martin Senger
myGrid Project
        Downloading Taverna

• Taverna can be obtained from:
• Once at the site, click on Download
• Download the version appropriate for your
  operating system.
• Between releases, the Moby functionality may be updated and you
  can find instructions on how to acquire those updates from:
• For a more comprehensive guide on Running and using
  Taverna, please refer to
• Assuming that you have downloaded and unzipped
  Taverna, you can run it by double-clicking on runme.bat
  (Windows) or executing runme.sh (Unix/Linux/OS X)
• You may need to make the script executable first:

          $ chmod u+x runme.sh
• Taverna’s splash screen
• Once Taverna has loaded, you will see 3 windows:
   – Advanced Model Explorer
   – Workflow Diagram
   – Available Services
• The Advanced model explorer is Taverna’s primary editor and allows
  you to load, save and edit any property of a workflow.
• Workflow diagram contains a read only graphical
  representation of your workflow.
• The Available services window lists all of the
  services available to a workflow designer.
• Under the node ‘Biomoby
  @ …’ Moby services and
  Moby data types are
• The Object ontology is
  available as children of
  MOBY Objects node
• Services are sorted by
  service provider authority
• If you wish to use registries other than the default one,
  you can add a new Moby ‘Scavenger’ by choosing to
  ‘Add new Biomoby scavenger…’
• Enter the registry’s location and click okay.
     Creating Workflows

• I have the workflow saved and would
  like to offer it for download.
• We will start by adding the Object ontology node
  Object to our workflow.
• The Advanced model explorer now shows that we have a
  processor called Object
   – Object has 3 input ports: id, namespace and article
   – Object has 1 output port: mobyData
• The Workflow diagram illustrates our processor
• To discover services that consume our data type,
  context click on ‘Object’ and choose ‘Moby Object
• A window will pop up that tells you what services Object
  feeds into and is produced by
• Expanding the Feeds into node results in a list of service
  provider authorities
• Expanding an authority, for example,
  bioinfo.icapture.ubc.ca, reveals a list of services
• We will choose to add the service called
  ‘MOBYSHoundGetGenBankFasta’ to our workflow.
• A look at the state of our current workflow.
• And graphically.
• The service consumes Object, with article name
  identifier, and produces FASTA, with article name fasta.
• To discover more services, context click on the service
  that outputs the data type that you would like to discover
  consuming services for and choose Moby Service
• The resultant window displays the service’s inputs and
• There are also tool tips that show up when your mouse
  hovers over any particular input or output that tell you
  what namespaces the data type is valid in
• Context clicking on an output reveals a menu with 3 options.
   – A brief search for services that consume our datatype
   – A semantic search for services that consume our datatype
   – Adding a parser to the workflow that understands our datatype
• The result of choosing to add a parser for FASTA to our workflow.
• The parser allows us to extract:
   – The namespace and id from FASTA
   – The namespace and id from the child String
   – The textual content from the child String
• The result of choosing to conduct a brief search for
  services that consume FASTA
• We will add the service getDragonBlastText to our
  workflow by choosing ‘Add service -…’ from the context
• The current state of our workflow shown graphically.
• A more complex view of our workflow
• Finding services that consume NCBI_BLAST_Text starts
  by viewing the details of the service ‘getDragonBlastText’
• Conduct a brief search
• Add the service ‘parseBlastText’ to our workflow
• Our current workflow
• Workflow inputs are added by context clicking on
  Workflow inputs in the Advanced model explorer and
  choosing ‘Create New Input…’
• The result from adding 2 inputs:
   – id
   – namespace
• The workflow input id will be connected to Object’s input
  port ‘id’
• Workflow after connecting the workflow input ‘id’
• The workflow input namespace will connect to Object’s
  input port ‘namespace’
• Workflow after connecting the workflow inputs.
• Workflow outputs are added by context clicking on
  Workflow outputs in the Advanced model explorer and
  choosing ‘Create New Output…’
• The result from adding 2 workflow outputs:
   – moby_blast_ids
   – fasta_out
• The output moby_blast_ids will be connected to
  parseBlastText’s output port Object(Collection –’hit_ids’)
• The output fasta_out will be connected to
  Parse_Moby_Data_FASTA’s output port fasta_’content’
• To run the workflow, click on ‘Tools and Workflow
• Choose ‘Run workflow’
• A prompt to add values to our 2 workflow inputs
• To add a value to the input ‘id’ click on id from the left
  pane and choose ‘New Input’
• Enter 656461 as the id
• Choose namespace from the left and click on ‘New Input’
• Enter NCBI_gi as the value for namespace
• Once you are done, click on ‘Run Workflow’
• Our workflow in action
• Once the workflow is complete, we can examine the
  results of our workflow.
• A detailed report is available outlining what happened
  when and in what order.
• We can examine the intermediate inputs and outputs, as
  well as visualize our workflow.
• If we choose the Graph tab, our workflow is illustrated.
• Intermediate inputs allow us to examine what a service
  has accepted as input
• Similarly, Intermediate outputs allow us to examine the
  output from any particular service.
• Without the parser, FASTA is represented as a Moby
  message, fully enclosed in its wrapper.
• Non-Moby services do not expect this kind of message
• Non-Moby services expect just the sequence,
  and using the Parse_Moby_Data processor, we
  can extract just that
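
What the parser does can be approximated in a few lines of Python. This is only a sketch: the sample FASTA object is hand-written with a placeholder sequence, the articleName “content” is assumed from the fasta_’content’ port name, the outer Moby message wrapper is left out, and parse_fasta_object is a hypothetical name rather than Taverna's implementation.

```python
import xml.etree.ElementTree as ET

# Hand-written sample object; the sequence is a placeholder, not real data.
MOBY_FASTA = """<FASTA namespace="NCBI_gi" id="656461">
  <String namespace="" id="" articleName="content">&gt;placeholder header
ATGGCGTTAGC</String>
</FASTA>"""


def parse_fasta_object(xml_text):
    """Pull the namespace, the id and the plain text out of a FASTA object."""
    root = ET.fromstring(xml_text)
    child = root.find("String")
    return {
        "namespace": root.get("namespace"),
        "id": root.get("id"),
        "content": child.text,
    }


parsed = parse_fasta_object(MOBY_FASTA)
print(parsed["namespace"], parsed["id"])
print(parsed["content"])
```

Only parsed["content"] would be handed to a non-Moby service; the namespace and id stay behind as bookkeeping.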
• Moby services can interact with the other services in
• Let’s add a Soaplab service.
• We will choose a
  soaplab service called
• Choose the restrict
  service and add it
  to the workflow.
• We will connect
  the output port
  from the service
  ta_FASTA to the
  input port
  _data’ from the
  service restrict
• The result of our actions
  so far.
• We will need to add
  another workflow output
  to capture the output of
• Create an output called restrict_out
• Connect the output port ‘outfile’ from the service restrict
  to the workflow output restrict_out
• Once the connections
  have been made, run the
  workflow again using the
  same inputs.
• The workflow on the left has some
  extra services added to it.
   – FASTA2HighestGenericSequenceObject
      from the authority
   – runRepeatMasker     from the authority
   – A Moby parser for the output
     DNASequence from runRepeatMasker.
   – A workflow output Masked_Sequence
• Add them to your workflow
• The service runRepeatMasker is configurable, i.e. it
  consumes Secondary parameters.
• To edit these parameters, context click on the service
  and choose ‘Configure Moby Service’
• The name of the parameter is on the left and the value is
  on the right.
• Clicking on the Value will bring up a drop down menu, an
  input text field, or any other appropriate field depending
  on the parameter.
• The parameter species contains an enumerated list of
• Select human.
• When you have made your selection, you may close the
• Let’s run the workflow
• We will run our workflow with a list
  – Click on id in the left pane and then click on New
    Input twice
• Enter 656461 and 654321 as the ids
• Enter NCBI_gi as the value for namespace
• Our workflow will now run using each id with the single namespace
• Notice how the workflow is running with iterations. This is
  happening because the Enacter is performing a cross-
  product on the input
• You can still view intermediate inputs and outputs.
• Using the queryIDs, you can track each invocation of a
  moby service through the whole workflow
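
The iteration behaviour can be pictured as a cross product over the input lists. The Python sketch below only illustrates that idea with the two ids and the single namespace entered above; it is not how the Enactor is implemented.

```python
from itertools import product

ids = ["656461", "654321"]
namespaces = ["NCBI_gi"]

# One invocation per (id, namespace) pair: 2 ids x 1 namespace = 2 runs.
invocations = [
    {"id": identifier, "namespace": namespace}
    for identifier, namespace in product(ids, namespaces)
]

for number, invocation in enumerate(invocations, start=1):
    print(number, invocation)
```

Adding a second namespace to the list would double the number of invocations, since every id is paired with every namespace.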
•   Imagine now that you want to run the workflow using a FASTA sequence that you
    input yourself (without the gi identifier)
•   To do this, context click on getDragonBlastText and choose Moby Service Details
     –   Expand the Inputs node and context click on FASTA(‘sequence’)
     –   Choose Add Datatype – FASTA(‘sequence’) to the workflow
•   A FASTA datatype will be added to the workflow and the appropriate links created
• Notice the datatype FASTA
  on the left of the workflow
   – Since the datatype FASTA
     HASA String, a String was
     also added to our workflow
     and the appropriate
     connection was made
• We will now have to add
  another workflow input and
  connect it to the String
  component of FASTA.
•   A workflow input ‘sequence’ was
    added to the workflow and a
    connection was made from the
    workflow input to the input port ‘value’
    of String.

•   We also removed the link between
    MOBYSHoundGetGenBankFasta and
    getDragonBlastText by context clicking
    on the link in the Advanced model
    explorer’s Data links and choosing to
    remove the link

•   Now when we choose to run our
    workflow, we will also have the chance
    to enter a FASTA sequence
• Go ahead and enter any FASTA sequence as the input to
  the workflow input ‘sequence’
• Run the workflow
•   Any results can be saved by simply choosing to Save to disk
     – You will be prompted to enter a directory to save the results.
     – Each workflow output will be saved in a folder with the same name as a workflow
       output and the contents of the folder will be the results
•   You can also choose Excel, which produces an Excel worksheet with
    columns representing the workflow outputs and with rows that represent the
    actual data.
   Load the SCUFL workflow you
           saved earlier
• After your gbrowse_moby browsing session, you
  saved the workflow.
         – http://www.ece.ualberta.ca/~markw/taverna_workflows/

• Load it up again now, look at it, and
  perhaps run it

• Please don’t ALL run it…when all of you run the
  same workflow simultaneously it will cause an
  awful load on the poor service providers! ;-)
