					          Workflow Systems in
         Bioinformatics and the
Bioinformatics Educational Grid
                              Tan Tin Wee
                       Associate Professor
          National University of Singapore
                    tinwee@bic.nus.edu.sg
  Shoba Ranganathan, Victor Tong, Justin Choo, Richard Tan,
     G.S.Ong, Simon See, TS Lim, Mark de Silva and KSLim.


            International Symposium on Grid Computing ISGC2004
               “Making the World Wide Grid a Reality” 27 July 2004
              In a Nutshell
 Weaving several threads of development in
  Bioinformatics such as Workflow Integration
  and DataGrid (over the past 5 yrs or so)
 to build an integrated educational grid “info”-
  structure
 that will support HR development, education,
  training, self-learning etc in the emerging
  discipline of bioinformatics
 for conventional as well as “E-” education
 Making the World Wide Grid
  a reality: Contribution of
       Bioinformatics
 Bioinformatics is the science of using
  information and ICT to understand
  biology
 Despite being driven by rapid progress in allied
  disciplines in the “New Biology” (genomics,
  proteomics, metabolomics, transcriptomics, other
  ’omics, computational biology, systems biology)
  that generate unprecedented volumes of data,
  Grid computing is not yet ubiquitous in life sciences
[Figure labels: In Vitro, In Vivo, In Situ; In Silico Biology and Personalised Medicine; Imaging, Modeling, Simulation, Theoretical Biology]
D. Hanahan and R. A. Weinberg.
The hallmarks of cancer. Cell, 100(1):57–70, 2000 (review).
“Tools have changed, but the job hasn’t”
Cartoons from talk by Rozhan Mohammed Idrus & Hanafi Atan, Universiti Sains
Malaysia, APAN 2003



Bioinformatics - Emergent and almost pervasive
in all biological and life science disciplines
 Computational Demands and Data
   Processing in Life Sciences
         are expanding!
                   ‘omics
        Genomics            Proteomics

            Bioinformatics


Computational                  Medical
   Biology                   Informatics   BioStatistics



      LIFE SCIENCE INFORMATICS

       LIFE SCIENCES and HEALTH SCIENCES
Where does the Grid fit in?




              Life Science
              Informatics

BIOTECHNOLOGY         INFOCOMMUNICATIONS
and NEW BIOLOGY           TECHNOLOGY
        Why no Grid here yet?
 Lack of widespread awareness and training in computational
  skills in the life sciences community
 Few computational, networking and grid computing experts
  with first hand domain knowledge in life sciences
 Data-intensive nature of life science grid computing
  applications
 Labour-intensive nature of building life science grids
 Lack of Killer Applications
 Bioinformatics is a rapidly changing target
     Biotech and InfoComm Technology
     - Parallel Growth
[Timeline figure, 1990 to 2004, in two parallel tracks]
 Biotechnology: Genome Project, Microbial Genomes, Worm Genome, Dolly & DNA chips, Human Genome, Systems Biology, BioX
 InfoCommunication Technology: Wais, Gopher, WWW boom, ISP, Java, Internet 2, Dotcom boom and crash, Grid Computing, Lambda Networking
Grids applied to Life Sciences
  Internet2 demos of late 1990s
   Quasi-realtime data collection from
   synchrotrons for 3D structure determination
  iGrid98, SC’98, SC’99, SC2003 (most
   geographically dispersed grid computing
   award – arthropod phylogenetics)
  Anthrax research – United Devices
  Encyclopedia of Life (EOL)
  OBIGrid
  Kansai BioGrid
   Large scale mega projects
   Not a WorldWide Grid
       iGrid98
        SC’98
http://www.startap.net/startap/igrid98/maxLikeAnApbionet98.html
         INET’99
          demo
   http://www.bic.nus.edu.sg/admin/News/Jun99/inet/inet99.html
http://www.startap.net/startap/APPLICATIONS/collabForStruct.html
 When will World Wide Grid
be a reality for Life sciences?
 Like the World Wide Web – everyone uses it,
  from publishing to accessing content
 Plug and play: Tap computational cycles
  anywhere from everywhere anytime
 Secure to use
 Killer application like Mosaic in 1993
 Generate meaningful results
 Control key tools and automate mundane
  processes
 Connect people, computation, data,
  instruments
   Focus on two key areas
 Grid-enabled bioinformatics workflow
  systems as the killer application
 Building a bioinformatics educational
  grid
          Workflow integration
 1996/7 Java based FlowBot project
 1998 Inet98 Internet Flowbot Protocol
 http://www.isoc.org/inet98/proceedings/8x/8x_1.htm
 1998 Application to Life Sciences – Workflow
  Integration – BIC-CNPR joint project – Lim et al
 1998 PSB’98 From Sequence to Structure to
  Literature: The protocol approach to Bioinformation
  Wu et al → spinoff company GeneticXchange.com
 2001 Spinoff Company KOOPrime Pte Ltd
 2002 BioWorldWideWorkFlow initiative in APBioNet

 Workflow integration is the Killer Application for a
  World Wide Life Science Grid!
  Bioinformatics Educational
             Grid
 2001 - S* Life Sciences Informatics Alliance –
         3 years of experience in online bioinformatics education:
         5 courses and >1000 persons worldwide trained in
         basic bioinformatics
         Team of Online Teaching Assistants
 Workshop on Education in Bioinformatics: WEB01,
  WEB02, WEB03, WEB04
 2004 – Problem Based Learning PBL
         in Bioinformatics online using
         emeet.nus.edu.sg
 2004 – Building the Bioinformatics Educational Grid

 Education is the answer to making the
  World Wide Grid a reality
             Background
 Biologists and Biotechnologists need to be
  equipped and trained to carry out tomorrow’s
  biological research today!
 Integration of
   Network Infrastructure
   Databases
   Software
   Computational Grid
    Online educational, teaching and learning
     materials
   Education + Killer Application
1993
   1. Network Infrastructure




      APAN Advanced
Research Network 1996-2004
      Internet2 and beyond
 1st Country outside North America to connect:
  SINGAREN – Singapore Advanced
  Research and Education Network
 TANET2 from Taiwan and APAN-Transpac
  were next.
 Then Abilene….
 … Today’s Starlight and Lambda
  networking
                 2. Databases




 Key major databases - 1.5 Terabytes today!
 Publicly accessible data over the Internet doubling every
  12 to 18 months http://www.bio-mirror.net/
 Mirroring Moore’s Law for chip technology
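
The doubling claim above can be turned into a rough projection. This is a minimal sketch that assumes pure exponential growth from the 1.5-terabyte figure; the function name `projected_size_tb` and the three- and six-year horizons are illustrative choices, not figures from the talk.

```python
def projected_size_tb(start_tb: float, years: float, doubling_months: float) -> float:
    """Projected volume after `years`, assuming it doubles every `doubling_months`."""
    doublings = (years * 12.0) / doubling_months
    return start_tb * 2.0 ** doublings


start = 1.5  # terabytes, the figure quoted on the slide
for months in (12, 18):        # the two ends of the quoted doubling range
    for years in (3, 6):       # illustrative horizons (assumption)
        size = projected_size_tb(start, years, months)
        print(f"doubling every {months} months, after {years} years: {size:.1f} TB")
```

Under these assumptions, 1.5 TB grows to between 6 TB and 12 TB after three years, which is the kind of planning number a mirror site would need.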
BIODATABASES
               Genbank
           Genbank Genomes
               InterPro
                 PDB
               BlastDB
               BLOCKS
                DDBJ        PIR
                EMBL       PFAM
               ENZYME     REBASE
               PROSITE NCBI REFSEQ
                            SRC
                        SWISSPROT
                         Taxonomy
                          TrEMBL
                          UniGene
                          euGenes
   BioDataGrid: Registry of
         Databases
 NUS BioDataGrid initiative
  everest.bic.nus.edu.sg/lsdb
 Singapore National Grid Office has a
  new initiative – to be announced soon.
 Facilitate varying levels of granularity
  of access to structured and
  unstructured biological data
            3. Software
 APBioBox project
 Funded by IDRC Pan Asia Networking
  R&D Grant
 Rapid and Easy Replication of Grid
  enabled software crucial to grid growth
3. Software - APBioBox
   Funded by the International Development
    Research Centre (IDRC) of Canada, under its
    Pan Asia Networking (PAN) ICT grant
   To build an easily installable, widely
    and freely accessible, integrated suite
    of bioinformatics applications to facilitate
    training and research amongst biologists
    in developing countries

    A/P Tan Tin Wee,
     National University of Singapore
    Adjunct Professor Shoba Ranganathan, NUS and Chair Professor, Macquarie
     University, Sydney
    Ong Guan Sin, Consultant programmer, Singapore Computer Systems Pte Ltd
3. Software - APBioBox

 Shrink-wrapped bundle of some 300
  software applications used in bioinformatics
 Preconfigured and integrated
 Takes about 15 minutes to install on a Red Hat Linux 9
  platform; setting up the same software by hand
  typically takes several weeks.
 Partnered with Sun Microsystems to come up
  with Bio-Cluster Grid, the equivalent for the Sun
  Solaris platform.
 Available on CD-ROM and as a download:
  http://www.apbionet.org/apbiogrid/apbiobox
3. Software – APBioBox applications
 Logical Abstraction through Java Wrappers built for:
  EMBOSS ~160 applications
  PHYLIP ~30 applications
  HMMER
  CLUSTALW
  BLAST
  FASTA, SSEARCH (in progress)
  MySQL
  SRS (Lion Bioscience)
  Globus Toolkit 2.4
  Unix Utilities
  KOOPlite
  Key Bioinformatics Databases (in progress)
BioBox for Solaris
Grid Engine Portal
          4. Computational Grid
APBioGrid 2002
To facilitate the building of shared computational grid resources for the
   Asia Pacific region.
              APBioGRID Project

                                      CRAY



APBioGrid
Aims to provide computational resources to bioinformaticians and
   biological researchers to facilitate education and research through
   sharing each other’s computers over the Grid
Why is the APBioNet Grid needed?
 Large-scale [life] science [..] are done
  through the interaction of people,
  heterogeneous computing resources,
  information systems, and instruments, all of
  which are geographically and
  organizationally dispersed.
 The overall motivation for “Grids” is to
  facilitate the routine interactions of these
  resources in order to support large-scale [life]
  science […].
                       Altered from Bill Johnston 27 July 01
           Why the “Grid”?
 1998: advent of Grid Computing – distributed
  computing
 E.g. Tapping idle CPU cycles globally in the
  SETI project or the Anthrax online projects.
 “Like tapping electrons from the power grid,
  just plug in the appliance into the socket”
 Currently, one of the
  hottest areas in ICT.
 So the basis for BioGrids
  has been laid
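
The idea of tapping many CPUs for independent pieces of work can be sketched on a single machine. This is only an illustration: a local thread pool stands in for remote grid nodes, and the task (GC content of short DNA strings) and the names `gc_content` and `farm_out` are assumptions chosen for the example, not part of any project named in the talk.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for an empty string)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def farm_out(sequences: Sequence[str], workers: int = 4) -> List[float]:
    """Hand each sequence to a worker; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gc_content, sequences))


print(farm_out(["GGCC", "ATAT", "ACGT"]))
```

Each task is independent of the others, which is what makes this style of work easy to spread over idle machines.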
               5. Online Learning Material

            Eight institutions from 5 continents since 2001
         – The S* Life Science Informatics Alliance
 Sweden: University of Uppsala, Karolinska Institutet
 USA: Stanford University; University of California, San Diego
 Singapore: National University of Singapore
 Australia: University of Sydney, Macquarie University
 South Africa: University of the Western Cape
Sample Lecture- Slide View
      Wide Range of S* learning
             materials
- Tutorial ppt presentation materials on introductory bioinformatics
- Frequently Asked Questions in Forum discussion archives
- Overview lectures on:
1.    Introductory Molecular Biology
2.    An Overview of the Computational Analysis of Biological Sequences
3.    Transcript Analysis and Reconstruction
4.    Comparative Genomics
5.    Representations and Algorithms for Computational Molecular Biology
6.    Protein Structure Primer, Structure Prediction and Protein Physics
7.    Genomics and Computational Molecular Biology Genomics
8.    Protein and Nucleic Acid Structure, Dynamics, and Engineering
9.    Proteomics and Proteomes
10.   Structure Prediction for Macromolecular Interactions
11.   Protein - Ligand Modeling
12.   Microarray informatics
                 Goals of S*
• Provide a GLObal Bioinformatics Unified Learning
  Environment (GLOBULE) made up of modular
  courses in the disciplines of bioinformatics, medical
  informatics and genomics
• Provide accessibility to the highest possible quality
  of online courseware approved by the educators
  from the host institutions.
• Develop an integrated modular learning
  environment that allows a student to select from
  both pre-requisite modules and advanced modules
  in order to build a comprehensive program.
S* course-3: by country
S* course Growing List of
 Participants’ Countries
S* Geographical Comparison
    Participant Distribution in the 1st Course Compared
                  Against the 3rd Course
[Bar chart: participants by region (Australasia, North America, Asia, Africa, Europe, South America), 1st Course vs 3rd Course; vertical axis 0 to 70]
                     Feedback
 Pretty good. A few rough edges but I'm sure you'll work
  them out over time. I really enjoyed it. Most of the lectures
  were very well presented and the participants in the forums
  helpful. I'm very impressed at the amount of work that has
  obviously gone into setting up the course. ~ Alan
  Wardroper, Thailand

 The international participation of the lecturers and students.
  The relevance of the field of bioinformatics in meeting the
  biomedical needs of today. The level of communication
  provided by the IVLE system enhanced learning
  considerably. The range of professional and academic
  background of students. The technical support provided by
  SStar was rapid and efficient to queries.
 ~ C.A.O. IDOWU, England
                      Feedback
 To think that a world-class, web based education with such
  valued lectures is brought to your desk free of cost is
  impossible elsewhere. The course was wonderfully well
  managed. Our requests and problems were quickly and
  well attended to. I had a great time doing this course and
  thank the S*STAR team whole heartedly for making me a
  fortunate participant with this fantastic experience.
 ~ Naidu Ratnala Thulaja, Singapore

 I think it is a very useful course, it is exactly what it says it
  is: an introduction to bioinformatics. It covers nicely major
  topics and provides enough information in order for us to
  understand what bioinformatics is all about. I enjoyed it
  very much and I am even a bit sad it is over. Thank you
  very much! ~ Patricia Severino, Romania
        Emergence of Grid
          Technologies
 “The Grid” - Grid Computing
 Next Generation Internet technologies
  (Internet2) and their applications
 Computational Grids
 Informational Grids
 Access Grid
 Educational Grids → do the same for the
  educational process – the learner or the
  teacher can tap into learning materials, tools,
  information, computational hands-on, in the
  so-called classroom without walls!
          Educational Grid for
            Bioinformatics
 Increase repository of regularly used bioinformatics software
 Registry of tools, software and databases
 Higher level abstraction of resources
 Virtual classrooms and discussions
 Distributed repository of learning objects and materials
 Self assessment tests
 Project Based modules
 Problem Based Learning
 Integrated learning environment for the practice of
  bioinformatics in the life sciences
 Support both conventional and e-learning/e-education
    Problem-Based Learning
            (PBL)
• Started at McMaster University Medical School over
  25 years ago

• Encourages hands-on work and critical thinking. Its
  hands-on approach is particularly suited to bioinformatics,
  where many of the skills require practical execution
  and the problems encountered are generally
  open-ended.

• PBL encourages :
   acquisition of critical knowledge.
   problem solving proficiency; problems tackled are generally
    open-ended.
   self-motivated learning.
   team participation.
            Role Change
• In PBL, there’s a fundamental change
  in the role played by the participants.
  a facilitator guides the entire session.
  a scribe records the entire session.
  some participants field questions; others
   try to brainstorm and provide answers.
   There is no student-teacher
   relationship; everybody is treated equally.
   Focus is on peer learning
PBL Asynchronous Sessions
• S* is currently experimenting with PBL sessions
  using the IVLE discussion forum and, eventually,
  a web-based collaboration platform, TWiki
  (http://twiki.org)
• Considerations/Issues to resolve:
   How to accommodate so many participants
   How to host so many TWiki pages
   Will participants with slow connections be able to
    access them?
 PBL synchronous sessions
 Emeet.nus.edu.sg
 CENTRA technology
 Low bandwidth requirement
 VOIP for voice, Video if necessary
 Agenda, Whiteboard, Shared
  applications, File transfer, Web Safari
              Projects
 8 different projects
 8 teams of volunteer facilitators
 300 students divided into 8 groups
 Two phases
 Set them up to solve various topical
  bioinformatics problems from bottom
  up in PBL style.
 Online Delivery Mechanism
• We are considering, and want to explore, various
  advanced networking technologies,
  particularly video conferencing software.
   e.g. AccessGrid(TM)
    http://www.accessgrid.org/
                 AccessGrid(TM)
• It is a suite of resources including multimedia
  large-format displays, presentation and interactive
  environments, and interfaces to Grid middleware and to
  visualization environments.
• Developed by the Futures Laboratory at Argonne
  National Laboratory and deployed by the NCSA PACI
  Alliance, it is now used at over 150 institutions worldwide,
  with each institution hosting one or more Access Grid
  (AG) nodes.
• Each node employs high-end audio and visual
  technology needed to provide a high-quality compelling
  user experience.
           Immersive Learning
 Enable group-to-group
  interactions across the Grid.

 Activities such as large-scale
  distributed meetings,
  collaborative work sessions,
  seminars, lectures, tutorials,
  and training are made
  possible.

Fig 1: Controlling Audio/Visual Quality
Fig 2: Group-to-Group Live Interaction
    Issues & Considerations
 Infrastructure (high speed network,
  connection/bandwidth)
 Cost of setting up
 Location of set-up
 Manpower required
 Technical competency
   Workflow as the killer app
 KOOPrime’s LivePortal/LifeBase and
  KOOPlatform
 Carole Goble’s myGrid, Taverna, etc
 Anabench
 Vibe
 Bingo
 All with killer GUI
 Others such as ASP model –
  Bioinformatics.com/Entigen’s BioNavigator
KooP Testbed on APBioGRID
Management Microarray
      Database




                      Email




              DB
             Update
  Remote Management of
biosamples and distributed
    statistical analysis
                                    Implementation:
                                  Laboratory Integration




 Allows users to select vendor plates for processing
 Generate in-house plates from vendor plates
 Print barcodes for each selected plate

 Start up legacy dispenser software

 Auto-import output files of dispenser into database
                                                                 Email
 Email user if there is any error in processing
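
The laboratory steps listed above can be sketched as an ordered list of functions, with a notification when any step fails. This is a hypothetical illustration, not the KOOP implementation: `PlateRun`, the step functions, the plate and barcode formats, and the `notify` callback (standing in for the email to the user) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PlateRun:
    """State carried through one processing run (illustrative fields only)."""
    vendor_plates: List[str]
    inhouse_plates: List[str] = field(default_factory=list)
    barcodes: List[str] = field(default_factory=list)
    imported_files: List[str] = field(default_factory=list)


def generate_inhouse_plates(run: PlateRun) -> None:
    run.inhouse_plates = [f"IH-{plate}" for plate in run.vendor_plates]


def print_barcodes(run: PlateRun) -> None:
    run.barcodes = [f"BC:{plate}" for plate in run.inhouse_plates]


def run_dispenser(run: PlateRun) -> None:
    # Placeholder for starting the legacy dispenser software.
    if not run.inhouse_plates:
        raise RuntimeError("no plates to dispense")


def import_dispenser_output(run: PlateRun) -> None:
    run.imported_files = [f"{plate}.out" for plate in run.inhouse_plates]


STEPS: List[Callable[[PlateRun], None]] = [
    generate_inhouse_plates,
    print_barcodes,
    run_dispenser,
    import_dispenser_output,
]


def process(run: PlateRun, notify: Callable[[str], None]) -> bool:
    """Run the steps in order; on the first error, notify the user and stop."""
    for step in STEPS:
        try:
            step(run)
        except Exception as exc:
            notify(f"{step.__name__} failed: {exc}")
            return False
    return True


messages: List[str] = []
ok_run = PlateRun(vendor_plates=["V1", "V2"])
print(process(ok_run, messages.append), ok_run.imported_files)
```

Keeping the steps in a plain list makes the order explicit and lets a failure in any one of them trigger the same notification path.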


                                                         DB
                                                        Update
Analysis of Results and
Distributed Computing
                     Grid Based Workflows:
                       A High-Level View
                         Operator View
Administrator View
Browse Drag Drop Connect
                Scheduling Functions


Search and
Resource
Discovery
functions

  Description
  Annotation
  Authoring
  Function
                Sharing/Publishing/
                Resource Browsing
                functions
Future BioGRID Components
[Layered architecture diagram: two stacks, top to bottom, serving the Bio End User]
   Web Interface                         KOOP Interface
   Bio Applications (EMBOSS,             Bio Applications
   PHYLIP, FASTA, SSEARCH)               (EMBOSS, PHYLIP)
   Globus-aware Scheduler (Nimrod-G)     Globus-aware Scheduler (Nimrod-G)
   Globus                                Globus
   OS                                    OS
   CPU                                   CPU
[Side labels: Sun, SGE, LSF]
       Weaving the threads of
          development
 In Networking
 In Bioinformatics Software application
  packages
 In BioDataGrid
 In Online Educational Learning Objects
 Bioinformatics Educational Grid and
  Bio World Wide WorkFlow Bio W3F
Output from previous object -> Workflow Object -> Output is input to the next object
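
The chaining idea, where each object's output becomes the next object's input, can be sketched in a few lines. This is a minimal illustration and not the W3F or KOOP design: `WorkflowObject`, `run_workflow` and the three toy steps are hypothetical names introduced for the example.

```python
from typing import Any, Callable, Sequence


class WorkflowObject:
    """A named step that turns one input into one output."""

    def __init__(self, name: str, func: Callable[[Any], Any]) -> None:
        self.name = name
        self.func = func

    def run(self, data: Any) -> Any:
        return self.func(data)


def run_workflow(objects: Sequence[WorkflowObject], initial_input: Any) -> Any:
    """Feed the output of each object into the next one, in order."""
    data = initial_input
    for obj in objects:
        data = obj.run(data)
    return data


steps = [
    # Strip whitespace and upper-case a raw DNA string.
    WorkflowObject("clean", lambda s: "".join(s.split()).upper()),
    # Transcribe DNA to RNA by replacing T with U.
    WorkflowObject("transcribe", lambda dna: dna.replace("T", "U")),
    # Report the length of the resulting sequence.
    WorkflowObject("length", len),
]

print(run_workflow(steps[:2], "atg gcc\ntaa"))  # AUGGCCUAA
print(run_workflow(steps, "atg gcc\ntaa"))      # 9
```

Because every object exposes the same `run` interface, objects can be reordered or reused in other workflows without changes to the runner.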
        Ingredients of W3F
W3F orchestrator                          KOOPserver

W3F service providers                     KOOPsdk
Apps developers

W3F enactors                              KOOPdaemon

W3F browser                               KOOPbrowser

 W3F editor                                KOOPeditor

 •Users can browse workflows and cobble together
 app objects to reuse, repurpose objects or workflows
 •Apps developers can wrap their applications and
 advertise to potential users and service providers
 •Service providers can mount apps from apps developer
 •W3F orchestrator coordinates scheduling, load balancing, security etc
    Why suitable for grid?
 KOOPdaemons can call grid
  commands through grid portals
 KOOPsdk easily wraps your existing
  applications, including grid ones
 KOOPsdk can also call grid commands
  of, say, the Globus Toolkit
 Layered approach for rapid uptake.
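
Wrapping an existing command-line application behind a uniform interface can be sketched as follows. This does not show the KOOPsdk API, which the slides do not describe; `WrappedApp` is a hypothetical class built on Python's standard `subprocess` module, and the "tool" being wrapped is just the running Python interpreter reversing its input, standing in for a real bioinformatics or grid command.

```python
import subprocess
import sys
from typing import Sequence


class WrappedApp:
    """Uniform wrapper around a command-line program (illustrative only)."""

    def __init__(self, name: str, base_command: Sequence[str]) -> None:
        self.name = name
        self.base_command = list(base_command)

    def run(self, *args: str, stdin_text: str = "") -> str:
        """Run the program with extra arguments and return its standard output."""
        completed = subprocess.run(
            self.base_command + list(args),
            input=stdin_text,
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
        return completed.stdout


# Stand-in tool (assumption): a one-line Python script that reverses stdin.
reverse = WrappedApp(
    "reverse",
    [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read()[::-1])"],
)

print(reverse.run(stdin_text="ACGT"))
```

The same wrapper shape would apply whether the underlying command is a local tool or a grid submission command, which is the layering the slide describes.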
Framework of Bioinformatics
Development in Asia Pacific
      from 1991-2004
             POLICY

           RESEARCH

         EDUCATION &
       Manpower Training

   Compute INFRASTRUCTURE

     DATA INFRASTRUCTURE

   NETWORK INFRASTRUCTURE

    Collaboration Cooperation
                        The Future
 Defining an Evolving Educational Grid for bioinformatics
 Continuing Major Impact of ICT in the Life Sciences
 Synergistic and sustained growth of two major late 20th
  Century technologies




 Building the framework for World Wide Workflow
 Share resources, access resources seamlessly
 Build sophisticated automated workflows comprising interconnection of
  people, computation, data and bioinstrumentation
 Thank you for this opportunity to share this with
  you.

 Tan Tin Wee
 Tinwee@bic.nus.edu.sg

				