
ECCB_Tech_Talk


									Software for the Data-Driven
 Researcher of the Future


          Dr. Paul Fisher

       Paul.Fisher@manchester.ac.uk
      http://www.cs.man.ac.uk/~fisherp
                      What is myGrid?
• An e-Science Collaboration Since 2001

• Numerous partners involved:
    –   Manchester
    –   Southampton
    –   Oxford
    –   EMBL-EBI

• Provides sustainable, production-quality software
    – Supported by OMII-UK, EPSRC and BBSRC

• Mixture of developers, bioinformaticians and researchers

        Software | Services | Content | Skills | Community
                       myGrid Open Suite of Tools
• Workflow Repository
• Workflow GUI Workbench and 3rd party plug-ins
• Client User Interfaces
• Web Portals
• Service Catalogue
• Provenance Store
• Workflow Server
• Programming and APIs
• Activity and Service Plug-in Manager
• Open Provenance Model
• Secure Service Access, and Programming APIs
 Huge amounts of data
• QTL regions: 100+ genes
• Microarray: 1000+ genes
• Next Gen Sequencing: 10,000+ genes

              How do I look at ALL the
               genes systematically?
      Issues with current approaches
•   Scale of analysis task overwhelms researchers – lots of data

•   User bias and premature filtering of datasets – cherry picking

•   Hypothesis-Driven approach to data analysis

•   Constant changes in data - problems with re-analysis of data

•   Implicit methodologies (hyper-linking through web pages)

•   Error proliferation from any of the listed issues – notably human error



                      Solution: Automate
• Web Services
  – Technology and standard for exposing
    code and data resources by a means that
    can be consumed remotely by a third party
  – Describes how to interact with it, e.g.
    service parameters


• Workflows
  – General technique for describing and
    executing a process
  – Describes what you want to do, including
    the services to use
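
As a rough illustration of the idea above (not Taverna's own workflow format), a workflow can be sketched as an ordered list of named steps, each wrapping one service. The two "services" here are hypothetical local functions with made-up data standing in for remote Web Services; the names `run_workflow`, `genes_in_region` and `pathways_for_genes` are assumptions introduced for this sketch.

```python
from typing import Any, Callable, List, Tuple

# A step is a name plus the service (here: a plain function) it invokes.
Step = Tuple[str, Callable[[Any], Any]]


def run_workflow(steps: List[Step], data: Any) -> Tuple[Any, List[str]]:
    """Run each step in order, feeding the output of one step into the next.

    Returns the final result and the list of step names that were executed,
    which serves as a very simple record of what was done.
    """
    executed = []
    for name, service in steps:
        data = service(data)
        executed.append(name)
    return data, executed


# Hypothetical stand-ins for remote services, backed by made-up lookup tables.
def genes_in_region(region: str) -> List[str]:
    lookup = {"qtl1": ["geneA", "geneB", "geneC"]}
    return lookup.get(region, [])


def pathways_for_genes(genes: List[str]) -> List[str]:
    lookup = {"geneA": "pathway1", "geneB": "pathway1", "geneC": "pathway2"}
    return sorted({lookup[g] for g in genes if g in lookup})


workflow = [
    ("genes_in_region", genes_in_region),
    ("pathways_for_genes", pathways_for_genes),
]
result, executed = run_workflow(workflow, "qtl1")
```

The workflow says what to do and which services to use; swapping a step for a different service does not change `run_workflow` itself.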
        What kind of Services?
•   WSDL Web Services
•   REST
•   BioMart
•   R-processor
•   BioMoby
•   SoapLab
•   Grid Services
•   Local Java services
•   Beanshell
•   Workflows
                   Who Provides the Services?

•   Open domain services and resources
•   Taverna accesses 3500+ services (11,874 operations)
•   Third party – we don’t own them – we didn’t build them
•   All the major providers
    – NCBI, DDBJ, EBI …
•   NO common data model is enforced.




    Can include your
    own services and
    resources too !!!
Where can I find these services?
         www.BioCatalogue.org
• A public centralised and curated registry of Life
  Science Web Services

• ‘Web 2.0’-style website and API

• Allows anyone to register, discover and curate Web
  Services

• Community oriented with expert guidance

• Open content, open source, open platform
[Figure: the Taverna workbench, with panels labelled Available services, Workflow Explorer and Workflow diagram]

                 http://www.taverna.org.uk
What are Workflows used for?
                     Taverna
• Taverna first released 2004
• Current version Taverna 2.2
• Currently 1500+ users per month, 350+ organizations,
  ~40 countries, 80,000+ downloads across versions

• Freely available, open source LGPL
• Windows, Mac OS, and Linux

• http://www.taverna.org.uk
• User and developer workshops
• Documentation
• Public Mailing list and direct email support
                                                                   Steve Kemp
                                                                  Andy Brass
                                                         + many Others

Trypanosomiasis in Africa
   http://www.genomics.liv.ac.uk/tryps/trypsindex.html
                   Reuse, Recycle,
                 Repurpose Workflows
• Dr Paul Fisher: Identify biological pathways implicated in resistance to
  trypanosomiasis in cattle, using mouse as a model organism.

• Dr Jo Pennock: Identify the biological pathways involved in colitis and
  helminth infections in the mouse model.

DOI: 10.1002/ibd.21326    |    PMID: 20687192
Where can I find workflows?
Recycling, Reuse, Repurposing
                           • Share
                           • Search
                           • Re-use
                           • Re-purpose
                           • Execute
                           • Communicate
                           • Record


            http://www.myexperiment.org/
Taverna Plug-in




            Bringing
            myExperiment to the
            Taverna user
              Take a breath…..
• myGrid

• Taverna
  – Workflows good for automation
  – Reduce errors


• BioCatalogue
  – Publicly curated repository of Web Services


• myExperiment
  – Web 2.0 repository supporting Workflow discovery and re-use
  Taverna and the ‘Cloud’


                    +



Analysing Next Generation Sequencing Data
             Analysing African Cattle with
                     Taverna 2.2
Different breeds of African Cattle
     • 10,000 years separation

African Livestock adaptations:
         • More productive
         • Increased disease resistance

Potential outcomes:
    • Food security
    • Understanding resistance
    • Understanding environmental
    • Understanding diversity

http://www.bbc.co.uk/news/10403254
                        The study
• Many sites involved in the study:
   – University of Liverpool
   – University of Manchester
   – ILRI (Nairobi)……

• Genetic variation in cattle breeds
   – African breeds: N’Dama, Boran and Sahiwal

• Resistance to African trypanosomiasis infection (sleeping sickness)
   – Which genetic differences make one breed more resistant?
   – What are the potential consequences of those genetic differences?
   – Which pathways are affected by those changes?
          The Analysis Problem
• Sequenced DNA from 3 cattle breeds using SOLiD / Illumina

• 22 million SNPs for Sahiwal alone
   – N’Dama, Boran: ~11 million SNPs each
   – Large data

• Comparing new data with reference genomes

• Identifying interesting differences
   – e.g. non-synonymous SNPs, stop lost, stop gained,
     splicing regions etc
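
To make the categories above concrete, here is a minimal sketch of how a single-base change inside a codon could be labelled as synonymous, non-synonymous, stop gained or stop lost. It uses the standard genetic code and assumes the reference and alternative codons are already known; the function name `classify_codon_change` is an assumption for this sketch, and this is not the tool used in the study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_change(ref_codon: str, alt_codon: str) -> str:
    """Label the effect of replacing ref_codon with alt_codon ('*' is a stop)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "non_synonymous"


examples = {
    ("GAA", "GAG"): classify_codon_change("GAA", "GAG"),  # Glu -> Glu
    ("GAA", "GTA"): classify_codon_change("GAA", "GTA"),  # Glu -> Val
    ("TGG", "TGA"): classify_codon_change("TGG", "TGA"),  # Trp -> stop
    ("TAA", "CAA"): classify_codon_change("TAA", "CAA"),  # stop -> Gln
}
```

Splice-region SNPs need transcript coordinates rather than a codon table, so they are outside this sketch.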
  The Analysis Pipeline (in Perl)
• Input SNP data from sequencer
• MAP: map between genome builds (Liftover)
• FILTER: filter for SNPs in exons
• ANALYSIS: SNP consequences; identifying damaging SNPs (PolyPhen)

Harry Noyes – University of Liverpool
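
The filter stage of the pipeline above can be sketched as an interval lookup. This is an illustration in Python, not the original Perl code. It assumes exons are given per chromosome as 1-based inclusive (start, end) pairs that do not overlap; the coordinates and the names `build_index`, `in_exon` and `filter_exonic` are made up for the example.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Exon = Tuple[int, int]   # (start, end), 1-based inclusive
Snp = Tuple[str, int]    # (chromosome, position)
Index = Dict[str, Tuple[List[int], List[int]]]


def build_index(exons: Dict[str, List[Exon]]) -> Index:
    """Sort exons per chromosome into parallel start/end lists.

    Assumes exons on the same chromosome do not overlap.
    """
    index: Index = {}
    for chrom, intervals in exons.items():
        ordered = sorted(intervals)
        index[chrom] = ([s for s, _ in ordered], [e for _, e in ordered])
    return index


def in_exon(index: Index, chrom: str, pos: int) -> bool:
    """Return True if pos falls inside an exon on chrom."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    i = bisect_right(starts, pos) - 1  # last exon starting at or before pos
    return i >= 0 and pos <= ends[i]


def filter_exonic(snps: List[Snp], exons: Dict[str, List[Exon]]) -> List[Snp]:
    """Keep only the SNPs that lie within an exon."""
    index = build_index(exons)
    return [snp for snp in snps if in_exon(index, snp[0], snp[1])]


# Made-up coordinates for illustration only.
exons = {"chr1": [(100, 200), (500, 650)], "chr2": [(10, 50)]}
snps = [("chr1", 150), ("chr1", 300), ("chr1", 650), ("chr2", 5), ("chr3", 20)]
exonic = filter_exonic(snps, exons)
```

A binary search per SNP keeps the lookup logarithmic in the number of exons, which matters when the input runs to millions of SNPs.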
Workflow and phases
                      MSc Student - Mohammad Khodadadi


                            Input SNP file

                            Populate DB with starting SNPs and
                            resource version numbers

                            Lift-over: maps between UMD3
                            and BTA4 cow assemblies

                            Exon positions from Ensembl

                             Find SNPs in Exon regions


                                 PolyPhen to mark “dangerous” SNPs




  The result can be either a MySQL database
            or TSV / CSV download
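
For the TSV option, a minimal sketch of the export step might look like the following. The column names and the example row are hypothetical; the talk does not list the actual output fields.

```python
import csv
import io
from typing import Dict, List

# Hypothetical column names; the real output schema is not given in the talk.
COLUMNS = ["chromosome", "position", "consequence", "polyphen_prediction"]


def to_tsv(rows: List[Dict[str, object]]) -> str:
    """Serialise annotated SNP rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for row in rows:
        writer.writerow([row[column] for column in COLUMNS])
    return buffer.getvalue()


# Made-up row for illustration only.
rows = [
    {
        "chromosome": "chr1",
        "position": 150,
        "consequence": "non_synonymous",
        "polyphen_prediction": "probably damaging",
    }
]
tsv_text = to_tsv(rows)
```

Switching `delimiter` to `","` would give the CSV variant.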
     What we will demonstrate
1. Uploading Next Generation Sequencing SNP
   data to the cloud

2. Creating a new experiment

3. Running a workflow on multiple cloud
   instances

4. Showing result output, including links to
   annotated SNPs
Demo
Managing and Processing Data
Accessing Taverna on the Cloud
Loading inputs


                    Experiment
                     Metadata


                 Input Provenance




                  Jobs Status




                 Input data
                 summary
Summary of Workflow Output


                    11 million SNPs for N’Dama




                  Non-synonymous coding
                  SNPs




                   PolyPhen predictions:
                   probably damaging

                    N.B. Numbers vary because of the workflow and PolyPhen filtering process
New Developments in myGrid
                      Taverna
• Taverna 2.2 execution engine
   – Large data processing
   – Pausing, resuming and cancelling workflows
   – Retry and parallelisation layer

• Taverna 2.2 server
   – Remote workflow execution
   – Workflows launched from web pages
   – Workflows executed on the cloud
                                              Essential for
                                                 cloud
            Other New features
• Validation reporting




• Loading and sharing service sets
• Support for offline editing
• New provenance features
BioCatalogue Plug-in




        ISMB 10
                         Training
• Tutorials and Training
   – 58+ tutorials to >900
     people.
   – >20 universities, Life
     Science Institutes, and
     networks.
   – Major Bio conferences
   – Summer schools in Biology
     and Middleware

• Developer and User Days
   – Annotation Jamborees

• Undergraduate and
  Postgraduate Bioinformatics in
  > 30 universities.
            More Information
• myGrid
 – http://www.mygrid.org.uk

• Taverna
 – http://www.taverna.org.uk

• myExperiment
 – http://www.myexperiment.org

• BioCatalogue
 – http://www.biocatalogue.org
Visit us at the myGrid Silver
       Sponsor Stand
FIN

								