					      Taverna Workflows

                Paul Fisher
       University of Manchester


This tutorial is designed to introduce you to the Taverna 1.7
                      workflow workbench
                                Prerequisites - 1


   In order to run Taverna 1.7 on your computer you will need to have the latest Java installed. If you do
    not have Java already installed, you can download it from this URL:

   You will be offered a choice of downloads. Download the JDK with Java EE packaged up
    too. This will give you the opportunity to develop web services and use the ones deployed by Java
    developers at a later date. The Java Runtime Environment (JRE) you download must be version 1.5 or
    later for Taverna to work.
   If you have Java installed, but it is an earlier version, you will need to update it to 1.5 or later,
    otherwise Taverna will NOT work.
   The minimal installation you will need is the standard JDK package.
   Download the desired JDK by following the link on the website and choose a location on your
    computer to save it to.
   Open the saved file and follow the installation instructions to install Java on your computer.
   Restart your computer to complete the installation.
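
   Once Java is installed, a small program can confirm which version is in use. The
    following is a minimal sketch and not part of Taverna: the class name JavaVersionCheck
    and the helper isAtLeast15 are hypothetical. It reads the standard "java.version"
    system property and checks it against the 1.5-or-later requirement stated above.

```java
class JavaVersionCheck {
    // Returns true if a version string such as "1.5.0_22", "1.6.0" or "17.0.2"
    // denotes Java 1.5 or later. Hypothetical helper, for illustration only.
    static boolean isAtLeast15(String version) {
        String[] parts = version.split("[._-]");
        int major = Integer.parseInt(parts[0]);
        if (major > 1) {
            return true; // newer numbering scheme, e.g. "9" or "17.0.2"
        }
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major == 1 && minor >= 5;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("Java version: " + version);
        System.out.println(isAtLeast15(version)
                ? "Meets the 1.5-or-later requirement"
                : "Too old: update Java to 1.5 or later");
    }
}
```

   Running "java -version" in a command window reports the same information without any code.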
                               Prerequisites - 2

                                           A zip package

   You will also need a tool to unzip the downloaded workbench. There are various tools available on
    the internet, including WinZip, 7-Zip, and a few others. Personally I prefer 7-Zip, which is free and
    easy to use, available at the following URL:

   You will need to choose the appropriate file to download for your operating system, i.e. Windows,
    Linux, or Apple Mac.
   Choose a location to save the file in and save it.
   Locate your saved file and follow the installation instructions to install it on your computer.
   Restart your computer to complete the installation.
                               Prerequisites - 3

                                      Linux users - Graphviz

   Those who are installing Taverna on Linux will also have to install Graphviz onto the system. This is
    available at the following URL:


   At the time of writing, I have no installation instructions for this package, so please refer to the user
    documentation provided on the web site.
                   Downloading Taverna
   Open your usual web browser and go to the myGrid homepage at the following URL:


   Find and follow the link to download Taverna 1 on the web page


   Once on the ‘Download’ page, identify the relevant Taverna distribution you need.

   Follow the link to download the workbench. The web page should redirect you to the
    SourceForge page.

   Choose a location to save the file and click OK.
            Unzipping the workbench
   Choose to “Unzip/Extract the files”, but not into the current directory.

   You will need to choose a directory in which to unzip the files. I recommend
    somewhere in the root drive of your computer so you can easily access it,
    e.g. C:\myGrid\ .

   You can change the name of the folder at this stage, e.g. to “Taverna”.

   If you are using Taverna on Linux, please be sure that you have the relevant
    access permissions to install and run Taverna in the desired directory.

        If you need a Zip package – download and install “7-ZIP”
                            (find it using Google)
                           Opening Taverna
   Locate your Taverna installation and open the Taverna folder.
   Start Taverna by double-clicking on “runme.bat” (Windows users) or “runme.sh”
    (Linux and Mac users).

   If you have successfully installed Java, you should see a dialog box or command window
    open, shortly followed by the Taverna application.
   The first time you start Taverna after installing it, it will need to update all of its
    components. You do not need to do anything for this, as it happens while the workbench is
    opening. You should see a graphic in the centre of your screen showing the download
    progress. Each component will be shown loading in this progress bar in turn. Once this
    has completed (about 5 minutes, depending on connection speed), the Taverna
    workbench will open.

   The Taverna workbench consists of 3 main panels for constructing workflows:
       The Available services pane (Top Left side)
       The Advanced Model Explorer pane (Bottom Left side)
       The Diagram pane (Right side)
                     The 3 Panes of Taverna
   The Available services pane is used to display the web services to the user. This list contains default
    services from when the workbench starts. Once you become more experienced with the workbench,
    you will be able to add your own services, including adding default services so they load
    automatically when Taverna opens. This list contains WSDL web services, local BioJava widgets,
    Soaplab services, and BioMoby objects. Each of these can be added to the workflow model
    (the workflow being constructed) so that a task can be achieved.

   The Advanced Model Explorer (AME) pane contains the services used in the current workflow,
    including the inputs, outputs, and data links between each service. Once populated with services, each
    service can be expanded using the “+” button. This provides a list of the inputs and outputs that the
    service takes in and expels. It is these inputs and outputs that allow you to connect services together.

   The Diagram pane shows a graphical representation of the workflow being used/constructed. The
    diagram can be adapted to view different aspects of the current workflow, to show all the ports for
    all the services, only those ports that have been connected or bound, or to change the layout of the
    workflow from portrait to landscape.
   [Figure: the 3 panes of Taverna, with the Diagram pane and the Model Explorer labelled]

          Advanced Model Explorer

   AME – (bottom left panel)
    The AME is the primary editing component within Taverna.
    Through it you can load, save, and edit any property of a workflow.
    It enables you to:
         build a workflow
         add nested workflows
         edit workflows by connecting services
         add metadata to a workflow
                   Diagram Pane

   Shows inputs / outputs, services and control flows
   It allows you to change the view of a workflow,
    save the visual representation, and explode or
    implode nested workflows
               Available services

Lists services available by default in Taverna – top left
   ~ 3500 services
       Local java services
       Simple web services
       Soaplab services – legacy command-line application
       R Processor
       BioMart database services
       BioMoby services
       Beanshell processor
        Allows the user to add new services or workflows from the web or from
        file systems
New services can be gathered from anywhere on the web. The
  default list is just a few that we already know about, and importing
  others is very straightforward
  Go to the DDBJ list of available web services at:
     These services were not designed for use in Taverna, but Taverna can use
    them if you supply the address of the WSDL file
   Click on the DDBJ blast service (http://xml.nig.ac.jp/wsdl/Blast.wsdl) and
    copy the web page address
   Go to the services panel in Taverna, and right-click on
    ‘Available Processors’ (at the top of the list). For each type of
    service, you are given the option to add a new service, or set of services.
   Select ‘Add new WSDL scavenger’. A window will pop-up asking for
    a web address
   Enter the Blast Web service address you just copied
   Scroll down to the bottom of the Services list and look at the
    new DDBJ service that is now included, clicking on the “+” icon
    next to the service
Go to the Services Panel
 Type ‘binfo’ into the search box at the top of the panel (we will
  start with simple information retrieval from KEGG)
 You may see several services highlighted in red

 Scroll down to the KEGG services, to ‘binfo’

  This service returns information about the KEGG databases,
  depending on the information you supply to it, e.g. the word
  ‘pathway’ gives info on the KEGG pathway database
   Right click on the ‘binfo’ service and select ‘Invoke service’
   In the pop-up ‘Run workflow’ window add the word “pathway”
    by clicking on the input document ‘db’ and selecting to ‘add
    new input’ from the dialog menu.
   Click ‘Run workflow’ and the service is invoked
   Click on the ‘Results’ tab in the Taverna tool bar
     The database information is displayed on the right
      when you select ‘click to view’
   Click on the ‘Process Report’ tab
     Look at the processes. This shows the experimental provenance: where
      and when processes were run, and their timings
   Click on the ‘Status’ tab
     Look at the options. As workflows run, you can monitor their progress
      here (Note: this workflow was probably too fast to see this feature
      properly; we will come back to it later)
The processes for running and invoking a single service are the
basics for any workflow. The tracking of processes and
generation of results are the same however complicated a
workflow becomes.

In the next few exercises, we will look at some example
workflows and build some of our own from scratch
      Installing The Whip Plug-in
   You are going to use the ‘new’ myExperiment plug-in

   Firstly you need to install WHIP - http://www.whipplugin.org/
      This allows you to interact with the myExperiment server

   In Taverna, go to “Tools” and then select “Plug-in Manager”
   Click “Find New Plug-ins”, and select the “myExperiment and
    WHIP (beta) plug-in” from the list
   Then click “Install” to install the plug-in
   You should now see the myExperiment plug-in in the
    toolbar menu

   Browse through the example workflows in the first
    tab of the plug-in
   To view a workflow, select “Preview” from the
    buttons under the workflow diagram
   To open a workflow in the workbench, click on the
    open button under the workflow diagram
   Previewing a workflow allows you to see all the metadata
    associated with the workflow on the myExperiment website,
       TAGS
       AUTHOR
       CREDITS

       You can also view the latest workflows, search for keywords, and
        even browse using a tag cloud
       Choose a workflow to load and click on “Open”
              Opening from a URL

   Select ‘Open Workflow Location’ from the File menu at the top
    of the workbench. In the pop-up window, add the following web
    address to load a workflow from the web
   The ‘Mouse Pathways and Gene annotations for QTL Phenotype’ workflow will
    be loaded
   View the workflow diagram - you will see services in a couple
    of different colours
   [Figure: the Open from URL option. Paste in the file location (the URL); the diagram and the AME are then populated]
   In the Advanced Model explorer panel – click on the name of
    the workflow at the top of the window (just above Inputs) – in
    this case ‘Pathways and Gene annotations for QTL Phenotype’ and then
    select the ‘workflow metadata’ tab at the top of the AME.

    You will see a text description of the workflow, its author and its unique LSID
    (Life Science Identifier). When publishing workflows for others, this
    annotation is useful information and allows the acknowledgement of
    intellectual property
   Now that you have loaded your workflow you can execute it
   To execute your workflow open the “File” menu at the top of the workbench
   Choose “Run Workflow” from the options given. This will open a
    pop-up box to input your data
   Each input requires you to enter data – to enter data into each of
    the inputs, click on one input and then click on the “New Data” option
    in the pop-up menu system
   Once you have entered these details, press the “Run Workflow”
    button at the bottom of the pop-up box
   [Figure: the Run Workflow option and the Input pop-up box. Click on an input, click on “New Input”, then click Run Workflow]
                                    Viewing Results

   Once you have executed the workflow, the Taverna workbench will change views from “Design” to
    “Results”. You should see this change behind your Input pop-up box
   You can minimise the Input pop-up box to view the progress of the workflow being executed – the
    different colours indicate whether a service has run or not
       Green = Completed
       Purple = Currently being executed
       Grey = Awaiting execution
   Once completed, the results will appear as separate tabs at the top of the workflow diagram
    (indicated in the following diagram as workflow outputs)
   Each tab contains an output file of results – the results can be viewed by clicking on the file in the left
    hand pane where it says “click to view”
   The file can then be searched through using the right hand pane, allowing you to verify the results – if
    they are wrong simply maximise the pop-up window and hit the “Run workflow” button again, making
    sure that the inputs are correct
   Each file can then be saved to the local machine – to do this simply click on the button marked “Save
    to disk” and enter the location to save the files
   Then click OK
   [Figure: the Results pane, showing the workflow outputs, a result file, and the option to save results]
   Import the ‘get_genes_by_pathway’ service into a new
    workflow model. First, you will need to either close the current workflow
    from the file menu, or select ‘New Workflow’ then find the above service
    again in the ‘services’ search panel.
   Right-click on ‘get_genes_by_pathway’ and import it into the
    workbench by selecting ‘Add to Model’
   Go to the AME and expand the [+] next to the newly imported
    service. You will see:
     1 input (green arrow pointing up)
     1 output (purple arrow pointing down)
   Define a new workflow input by right-clicking on ‘Workflow
    Input’ and selecting ‘Create New Input’
   Supply a suitable name e.g. ‘pathway_identifier’
   Connect this new input to the ‘get_genes_by_pathway’ service
    by right-clicking on ‘pathway_identifier’ and selecting
    ‘get_genes_by_pathway ->pathway_id’

      You always build workflows with the flow of data
   Define a new workflow output by right-clicking on ‘workflow
    output’ and selecting ‘create new output’
   Supply a suitable name e.g. ‘gene_outputs’
   Connect the ‘get_genes_by_pathway’ service to the new output,
    remembering to build with the flow of data
    You have now built a simple workflow from scratch!
   Run the workflow by selecting ‘run workflow’ from the ‘File’
    menu at the very top of the workbench. You will again need to
    supply a KEGG pathway identifier – “path:mmu03010”
   Select a ‘string constant’ from the ‘Available Services’ list (by
    searching for ‘constant’ in the text search box)
   Right-click and select ‘add to model with name…’
   Insert ‘pathway_id’ in the pop-up window
   In the AME, right-click on ‘pathway_id’ and select ‘edit me’
   Edit the text to ‘path:mmu03010’.
   Replace the workflow input with this string constant
   Run the workflow – it runs in the same way
   Add a description and your name as author to the metadata
   Save the workflow by selecting ‘save’ in the file menu
Exercise 7 Defining Output Formats

    So far, most of the outputs we have seen have been text, but in
    bioinformatics, we often want to view a graph, a 3D structure,
    an alignment etc. Taverna is able to display results using a
    specific type of renderer if the workflow output is configured
   Load the ‘Fetch PDB flatfile from RCSB server’ workflow from
   Run the workflow with the ID ‘1crn’, or another PDB id you
    know of

   Look at the results. For ‘pdbFlatFile’, you will see the results are
    displayed graphically. This is achieved by specifying a particular MIME
    type in the output, given as ‘chemical/x-pdb’ in the service metadata tab.
   Go back to the AME and look at the metadata for ‘pdbFlatFile’.
    HINT: when you click on something in the AME, a metadata tab
    will appear at the top of the window
   Click on the Metadata window and select the MIME Types tab.
    As you can see, the output has a MIME type associated with it. If you
    wish to render results in anything other than plain text, you MUST specify
    the MIME type in the workflow output, e.g. PDF etc.
Exercise 7 Taverna MIME-Types
The following mime-types are currently used by Taverna
text/plain=Plain Text
text/xml=XML Text
text/html=HTML Text
text/rtf=Rich Text Format
text/x-graphviz=Graphviz Dot File
image/png=PNG Image
image/jpeg=JPEG Image
image/gif=GIF Image
application/zip=Zip File
chemical/x-swissprot=SWISSPROT Flat File
chemical/x-embl-dl-nucleotide=EMBL Flat File
chemical/x-ppd=PPD File
chemical/seq-aa-genpept=Genpept Protein
chemical/seq-na-genbank=Genbank Nucleotide
chemical/x-pdb=Protein Data Bank Flat File
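
One way to picture how a renderer is chosen from a MIME type is a simple lookup
table. The sketch below is illustrative only and is not Taverna's own code: the
class MimeTypeLookup and the method describe are hypothetical names, only a few
entries from the list above are included, and the fall-back to plain text for an
unknown type is an assumption made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MimeTypeLookup {
    // A few of the MIME types listed above, mapped to their descriptions.
    static final Map<String, String> MIME_TYPES = new LinkedHashMap<String, String>();
    static {
        MIME_TYPES.put("text/plain", "Plain Text");
        MIME_TYPES.put("image/png", "PNG Image");
        MIME_TYPES.put("chemical/x-pdb", "Protein Data Bank Flat File");
    }

    // Assumption for this sketch: an unrecognised type is treated as plain text.
    static String describe(String mimeType) {
        String description = MIME_TYPES.get(mimeType);
        return description != null ? description : MIME_TYPES.get("text/plain");
    }

    public static void main(String[] args) {
        System.out.println(describe("chemical/x-pdb"));
        // A misspelt type is not found, so it falls back to plain text.
        System.out.println(describe("chemical/x-pbd"));
    }
}
```

The second call shows why the MIME type must be typed exactly: a single swapped
letter is enough for the lookup to miss.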
Exercise 8 Sharing Workflows

   Go to http://www.myexperiment.org
   myExperiment is a social networking site for sharing
    workflows and workflow expertise and experiences
   Browse around the site and see what it contains
   Create yourself an account and join the group
    called “Newcastle MSc.” (this will be necessary for
    the next exercise)

   Find all the workflows containing BLAST searches. How did you
    find them? How many are there? Can they all be downloaded?
   Which is the most downloaded workflow?
   Which is the most viewed workflow? Is it the same?
   How many workflows are tagged with ‘protein_structure’ ?
   If you wish to share your workflows with the rest of the class,
    upload them and set the permissions so that only those in the
    ‘Newcastle MSc.’ group can see them – make sure you add a
    description and author details to the workflow metadata first!
Exercise 9
Workflow Reuse – Nested Workflows
   Reload your KEGG workflow from exercise 6
   We will extend this workflow to get descriptions of
    each gene identifier, and find the pathways for
    each gene.
   In the myExperiment plug-in, find all the workflows
    that are tagged with KEGG
   Select the ‘Get Kegg Gene information’ workflow
   Go back to Taverna and look at the original workflow
   In the AME, click on ‘add nested workflow’.
   Go back to the myExperiment plug-in, and choose to
    “import from URL” for the workflow you found in
   You can change the name of the nested workflow by
    right-clicking on the nested workflow processor and
    selecting ‘rename’
   You need to connect up the workflow as if it were any
    other kind of service

   The nested workflow has 1 input and 2 outputs. We
    have to connect the input, but we can choose which
    outputs to display
   In the outer workflow create a new output called
    ‘gene_descriptions’ - hint: to switch between
    workflows, use the “Workflows” option in the file
    menu system
   Connect gene_descriptions to the nested workflow
    output ‘gene_descriptions’

   Save the workflow (remembering to embed the
    nested workflow, using the supplied check box) and
    run the workflow
   Look at the results
Exercise 10 Iteration

    Taverna has an implicit iteration framework. If you connect a
    set of data objects (for example, a set of fasta sequences) to
    a process that expects a single data item at a time, the process
    will iterate over each sequence
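
    In programming terms, implicit iteration behaves like a loop that the workbench
    writes for you. The sketch below is only an analogy, not Taverna's implementation:
    the class ImplicitIteration and the stand-in service sequenceLength are
    hypothetical. A service that accepts one sequence is given a list, and it is
    called once per item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ImplicitIteration {
    // Stand-in for a service that accepts a single sequence (hypothetical).
    static int sequenceLength(String sequence) {
        return sequence.length();
    }

    // What implicit iteration amounts to: one call of the single-item
    // service per element of the list, with the results collected in order.
    static List<Integer> iterate(List<String> sequences) {
        List<Integer> results = new ArrayList<Integer>();
        for (String sequence : sequences) {
            results.add(sequenceLength(sequence));
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(iterate(Arrays.asList("ACGT", "GGC", "T")));
    }
}
```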

   Load the ‘Mouse Pathways and Gene annotations for QTL Phenotype’
    workflow from the myExperiment plug-in using any of the
    previously used import methods
   Watch the progress report. You will see several services with
    ‘Invoking with Iteration’

    The user can also specify more complex iteration strategies
    using the service metadata tab
   Find and load the workflow ‘Demonstration of configurable
    iteration’ from the myExperiment plug-in
   Read the workflow metadata to find out what the workflow does
   Select the ‘ColourAnimals’ service and read the metadata for
    that service. Under the description is the iteration strategy
   Click on ‘dot product’. This allows you to switch to cross product

 Run the workflow twice: once with ‘dot
  product’ and once with ‘cross product’.
 Save the first results so you can compare them.
  What is the difference? What does it mean to
  specify dot or cross product?
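
  After you have tried the exercise, the two strategies can be compared with the
  sketch below. It is an analogy written for this tutorial, not Taverna's code:
  the class IterationStrategies and its methods are hypothetical, and stopping at
  the shorter list in the dot product is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IterationStrategies {
    // Dot product: pair up the items found at the same position in each list.
    // Assumption for this sketch: pairing stops at the end of the shorter list.
    static List<String> dot(List<String> first, List<String> second) {
        List<String> pairs = new ArrayList<String>();
        int count = Math.min(first.size(), second.size());
        for (int i = 0; i < count; i++) {
            pairs.add(first.get(i) + " " + second.get(i));
        }
        return pairs;
    }

    // Cross product: combine every item of the first list with every item
    // of the second list.
    static List<String> cross(List<String> first, List<String> second) {
        List<String> pairs = new ArrayList<String>();
        for (String a : first) {
            for (String b : second) {
                pairs.add(a + " " + b);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> colours = Arrays.asList("red", "green");
        List<String> animals = Arrays.asList("cat", "dog");
        System.out.println("dot:   " + dot(colours, animals));
        System.out.println("cross: " + cross(colours, animals));
    }
}
```

  With two lists of two items, the dot product gives two combinations and the
  cross product gives four.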
Exercise 11 Substituting Services

    Taverna does not own many of the bioinformatics services it provides. This means
    that it cannot control their reliability. Instead, Taverna provides strategies for
    dealing with services being unavailable

   Load the ‘BiomartAndEMBOSSAnalysis’ from the myExperiment website this time,
    using the ‘Launch in Taverna’ button.


   Look at the metadata for the ‘emma’ service. It is an implementation of clustalw
   Find the DDBJ clustalw service. HINT: go to the DDBJ services homepage, and
    import the service from URL into the Available Services palette


   Instead of adding the new service normally, right-click and
    select ‘add as alternate’

   In the resulting menu select ‘emma’

   The DDBJ version of the ClustalW service is now added as an
    alternative to emma in the AME. It will appear at the bottom
    of the input/output list of the Emma service

   Select the new service (which should be called ‘analyzeSimple’)
    and look at the inputs and outputs. These need to be connected
    to the correct inputs and outputs in Emma (it is unlikely the
    inputs and outputs will have the same names! See if you can
    figure them out)

   Right-click on the ‘query’ input in analyzeSimple and map it to
    ‘sequence_direct_data’. In both services, these inputs expect a
    set of fasta sequences.

   Right-click on the ‘result’ output and map it to ‘outseq’ in emma
    in the same way.

   Now you have a workflow which will run using emma when it is
    available, but will substitute DDBJ clustalw for it if emma is unavailable
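
   The idea of an alternate service can be sketched as a try-then-fall-back call.
    This is an illustration of the concept only, not how Taverna implements it:
    the class AlternateService and the method callWithAlternate are hypothetical,
    and the two services are simulated with simple functions.

```java
import java.util.function.Supplier;

class AlternateService {
    // Try the primary service first; if it fails, use the alternate instead.
    static <T> T callWithAlternate(Supplier<T> primary, Supplier<T> alternate) {
        try {
            return primary.get();
        } catch (RuntimeException primaryFailure) {
            return alternate.get();
        }
    }

    public static void main(String[] args) {
        // Simulated outage of the primary service.
        Supplier<String> emma = () -> {
            throw new RuntimeException("emma is unavailable");
        };
        // Simulated alternate that responds normally.
        Supplier<String> ddbjClustalw = () -> "alignment from DDBJ clustalw";

        System.out.println(callWithAlternate(emma, ddbjClustalw));
    }
}
```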
Exercise 12 Failover

    Taverna also allows the user to specify the number of times a
    service is retried before it is considered to have failed.
    Sometimes network traffic is heavy, so a working service needs to be retried.
   Select ‘tmap’ from the same workflow. To the right of the service
    name are a series of 0s and 1s. By simply typing the numbers, the user can
    specify the number of retries and the time between the retries
   Change it to 3 retries for ‘tmap’ and set the status to ‘critical’
    using the final tickbox. Now it is critical, it means the whole workflow
    will be aborted if ‘tmap’ fails after 3 retries. Failures in non-critical services
    will not abort the workflow run.
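
    The retry and ‘critical’ settings can be pictured with the sketch below. It is
    a hedged illustration, not Taverna's implementation: the class RetrySketch, the
    method invoke and its parameters are hypothetical. A failed call is retried up
    to a set number of times with a delay in between. If it still fails, a critical
    service aborts the run, while a non-critical one simply produces no result.

```java
import java.util.function.Supplier;

class RetrySketch {
    // Calls the service once, then retries up to maxRetries times on failure,
    // waiting delayMillis between attempts.
    static <T> T invoke(Supplier<T> service, int maxRetries, long delayMillis, boolean critical) {
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return service.get();
            } catch (RuntimeException e) {
                lastFailure = e;
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        if (critical) {
            // A critical failure stops the whole run.
            throw new IllegalStateException("critical service failed; aborting workflow", lastFailure);
        }
        return null; // non-critical: the run carries on without this result
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated service that fails twice before succeeding.
        Supplier<String> flakyTmap = () -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("network busy");
            }
            return "tmap result";
        };
        System.out.println(invoke(flakyTmap, 3, 100L, true));
    }
}
```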
Exercise 13 Spotlight on BioMart

    Biomart enables the retrieval of large amounts of
    genomic data e.g. from Ensembl and Sanger, as well
    as Uniprot and MSD datasets
   After saving any workflows you want to keep, reset
    the workbench in the AME (by closing open workflows
    in the File menu)
   Keep open the workflow
   Run the Workflow

This workflow starts by fetching all gene IDs from Ensembl
   corresponding to human genes on chromosome 22 implicated
   in known diseases and with homologous genes in rat and mouse.

For each of these gene IDs it fetches the 200bp after the five-
  prime end of the genomic sequence in each organism and
  performs a multiple alignment of the sequences using the
  EMBOSS tool 'emma' (a wrapper around ClustalW). It then
  returns PNG images of the multiple alignment along with three
  columns containing the human, rat and mouse gene IDs used in
  each case.

   Right-click on the ‘hsapiens_gene_ensembl’ service
    and select ‘configure BioMart query’
   By selecting ‘Filters’ and then ‘Region’ – change the
    chromosome from 22 to 21 – now the workflow will
    retrieve all disease genes from chromosome 21 with
    rat and mouse homologues
   Run the workflow and look at the results
   See how some of the other options were configured
    by finding them in the other pull-down lists (Gene,
    Multi-species comparison etc)

    Find out which Gene Ontology terms are associated with the
    genes in your region by adding a new Biomart query
   Select another copy of ‘hsapiens_gene_ensembl’ from the
    services panel (under Biomart and Ensembl 50 genes (Sanger))
    and select ‘add to model with name….’ (as there is already a
    service with that name!) and call the service ‘hsapiens_GO’
   Configure ‘hsapiens_GO’ by right-clicking and selecting
    ‘configure Biomart query’ and selecting ‘filters’. In filters,
    select ‘gene’ and the ‘id list limit’ tick-box next to ‘ensembl gene
   Configure the output (by selecting attributes) and select ‘GO
    ID’ for each GO partition under the ‘External -> GO
    Attributes’ tab in the attributes section

   Connect the input to the ‘hsapiens_gene_ensembl’
    service via the ‘ensembl_gene_id’
   Create 3 new workflow outputs, ‘CCGOID’, ‘MFGOID’
    and ‘BPGOID’. Connect the outputs of the biomart
    processor to them
   Re-run the workflow and view which GO terms are
    associated with your chromosomal region
   NOTE: Having 3 outputs for related terms like this is
    inefficient and hard to read – we will come back to a
    solution to fix this problem in the next session
This exercise highlights the services that do not perform
biological functions, but are vital for running life science
Exercise 14

   A shim is a service that doesn’t perform an
    experimental function, but acts as a connector, or
    glue when 2 experimental services have
    incompatible outputs and inputs
   A shim can be any type of service – WSDL,
    Soaplab etc. Many are simple BeanShell scripts
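
   As an illustration of what a shim does, the sketch below wraps a bare sequence
    in FASTA format so that it can be passed to a service that expects FASTA input.
    It is a hypothetical example written for this tutorial: the class FastaShim
    and the method toFasta are not taken from any workflow.

```java
class FastaShim {
    // Builds a FASTA record: a header line starting with ">" followed by
    // the sequence on the next line.
    static String toFasta(String id, String sequence) {
        return ">" + id + "\n" + sequence + "\n";
    }

    public static void main(String[] args) {
        System.out.print(toFasta("seq1", "ACGTACGT"));
    }
}
```

   The shim performs no analysis of its own. It only reshapes the output of one
    service into the input another service expects.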
Exercise 14 – Finding Shims

   In the ‘BiomartandEmbossAnalysis’, work out which
    services are shims
   What do the shims do?
Exercise 14 Other Shims

   There are many myGrid shim services. These are
    currently being described in a shim library, but for
    now, a small collection are documented here:
   Find a shim that will return a DNA file in Fasta format
    from an id. Load the example workflow and run it in
   Find a shim that will translate DNA
    HINT: these services might be in the feta registry

 The emboss suite of programs have a
  subdivision – edit
 All the edit services are shims

 Experiment with the edit services

 Find a service that will remove gaps from
Exercise 14 Beanshell

   Open Taverna and load the workflow
   Look at the diagram. Each brown service is a
    BeanShell script
   In the ‘Advanced Model Explorer’ (AME) select the
    BeanShell ‘CreateFasta’
   Right-click and select ‘configure beanshell’

   Look at the script and see if you can work out its
   Look at the ports and their types as well as the
   Note the names of the ports and where they
    appear in the script, you will need to know how to
    specify an input/output in the next exercise
    Exercise 14
    Beanshell – Writing your Own

Beanshell scripts let users write small, bespoke Java scripts
  that allow incompatible services to work together
 Create a new workflow by selecting ‘file’ and ‘New Workflow’

   Add a new beanshell processor by right-clicking “Beanshell scripting host” in
    the service panel and selecting “Add to model” (you may change the name
    of the processor)
   Right click the beanshell processor created and select “ Configure
   Create 2 input ports named myName and mySurname
   Create 1 output port named myFullname
Note that these ports are automatically added to the AME window

  Select the script tab and paste in the following script:
myFullname = myName + "\t" + mySurname;
 Create 2 workflow inputs and 1 workflow output by going to
  the port menu, and choosing to add a new port for both input
  and output.
 Connect them to the configured beanshell processor.

 Run the workflow

 You should get your full name printed in the output
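
 For comparison, the same logic can be written as an ordinary Java method. This is
  a sketch only: the class FullNameScript and the method fullName are hypothetical.
  In the Beanshell processor, the input ports supply the variables myName and
  mySurname, and the value assigned to myFullname becomes the output. In plain
  Java these become method parameters and a return value.

```java
class FullNameScript {
    // Joins the two inputs with a tab character, as the Beanshell script does.
    static String fullName(String myName, String mySurname) {
        return myName + "\t" + mySurname;
    }

    public static void main(String[] args) {
        System.out.println(fullName("Paul", "Fisher"));
    }
}
```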

BioCatalogue is a social networking site that allows you to discover Web
   Services, to include in your workflows
  Go to http://www.biocatalogue.org
  Familiarise yourself with the page
  Go to ‘Project information’ and look at the roadmap to see what
   features are coming
  If you want to try BioCatalogue, you can sign up to the friends email
   list (found on the front page at the bottom left), and you can try the
   Pilot out by signing up for the beta testing:

      1.   Username: biocat
      2.   Password: biodog
