

									WinBioinfTools: Bioinformatics Tools
  for Windows High Performance
      Computing Server 2008

            joint project between
                Nile University,
             Microsoft Egypt, and
      Cairo Microsoft Innovation Center



          Mohamed Abouelhoda
                Nile University




                                 Nile University

•   Established in 2006 as Egypt's first non-profit research university

•   Specialized in
     • Information and Communication Technology and related fields and their
         applications

•   Research centers
     • Center for Informatics Sciences (CIS)
     • Center for Wireless Intelligent Networks (WINC)
     • Center for Innovation & Competitiveness (CIC)

•    Modern Master Programs
     •  9 Master programs in IT, Micro-electronics, Management, Business,
        Transportation systems, and construction management

•   Recent undergraduate program
     •   Engineering and management programs


     Research Groups
• Established in June 2008
• 9 senior scientists, 36 junior
  scientists
• Mission: address information-rich
  problems of importance to the
  region and Egypt




                                   State of the art

[Figure: layered view of the research environment. Research areas: Bioinformatics, Medical Imaging, Data Mining, Data Management, HPC, Ubiquitous Networking. Scientists and knowledge workers reach scientific discovery and business insights through data analysis, decision making and collaboration tools. These tools build on local computing resources, data and information integration tools, and local data and software tools, which draw on distributed scientific information and resources: distributed computing resources and remote software access, distributed data sources (SQL, Web sources, ...), and distributed sensors and devices.]

                           Infrastructure of CIS
Local CIS resources (first phase):
• 21 servers with 160 AMD/Intel cores and 1 TB RAM in total
• 24 TB total storage

Extensible resources via partners
• Microsoft, Imperial College, Bridge Project

[Figure: bioinformatics applications at Nile University run on shared middleware (standardized SOA interfaces, service composition, utility-based computing, ….) that connects Bibliotheca Alexandrina, Microsoft CMIC, Nile University, Imperial College London, the Bridge Project, and other resources.]

Group Leader: Mohamed Abouelhoda
Co-Workers: 7 RAs

Projects and Research:
  • NUBIOS: Nile University Bioinformatics Server
  • Plant, animal, bacterial, and virus computational genomics
  • Cancer Bioinformatics
  • High Performance Computing for Bioinformatics Applications

Collaborators:
  Academic
  • Imperial College, Prof. Hani Gabra
  • National Cancer Institute, Egypt
  • Bielefeld University, Prof. Robert Giegerich
  • Agriculture Research Institute
  Industry
  • Cairo Microsoft Innovation Center (CMIC), Egypt
  • IBM

Website: http://www.bioinf.nileu.edu.eg

WinBioinfTools: Bioinformatics Tools for Windows High
        Performance Computing Server 2008
                               Motivation

• Bioinformatics tools are essential for current molecular biology research

• Obstacles:
    • Open source bioinformatics tools are usually written for Unix/Linux, which
      is not widely used in the life science community
    • Data sets have become too large to analyze on an ordinary PC




                        Project Objectives

• Providing WinBioinfTools to the biological community, a package that
    - runs under MS Windows
    - runs on a computer cluster (Windows HPC Server 2008)

• Primary focus on sequence analysis and comparative genomics
    - Distributed Sequence Alignment
    - Distributed BLAST (Basic Local Alignment Search Tool)
    - CoCoNUT (Computational Comparative GeNomics Utilities Toolkit)

• Comparing the performance of the Windows-based versions of these tools with the
  corresponding Linux-based versions
                                Resources

 Human Resources
    o Mohamed Abouelhoda, Hisham Mohamed (Nile University)
    o Mohamed Zahran (collaborator, New York City University)
    o Tamer Shaalan (CMIC)




 CMIC Lab:
    • Cluster of 4 nodes (2 quad-core 2.6 GHz processors, 16 GB RAM, 250 GB HD)
    • 1 Gigabit Ethernet network
    • Windows HPC Server 2008 with HPC Pack 2008




                 Why Sequence Analysis First?

- We focused on sequence analysis tools
    1. Comparing short sequences → Parallel Sequence Alignment
    2. Comparing large genomic sequences → Parallel CoCoNUT
    3. Database search → Parallel BLAST

- Sequence analysis helps in elucidating the function and structure of genomic regions

[Figure: pipeline stages labelled Database search; Genome comparison, sequence alignment; Database search.]

- An example pipeline used in practice is HAVANA
   (Human And Vertebrate Analysis aNd Annotation)
                Cluster Modes of Operation

1. Load balancing: task-level parallelism
    – Most bioinformatics problems fit this category well, because their data can be
      decomposed into independent parts

2. (High Performance) Compute cluster: instruction-level parallelism
    – Problems in this category are the critical ones and form a bottleneck

   Basic features of the Windows (HPC) Server 2008

• High performance:
     - 64-bit version, able to address large memory (16, 32, 64, 128 GB RAM)
     - Cluster and multi-core support

• Cluster management and monitoring tools

• Load balancing: Job Scheduler

• Parallel computing: MS-MPI

• Interoperability: SUA (Subsystem for UNIX-based Applications); Cygwin also works

• Virtualization: Hyper-V for virtual machine support

Sequence Alignment




                      Sequence Alignment

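                     S1  TACAATCAA              T_ACAATCAA
                     S2  TCACTCAC               TCAC__TCAC

[The aligned pair on the right illustrates a mismatch (last column) and insertions/deletions (gaps, written "_").]

• Dynamic programming algorithms take $O(n^2)$ time for two sequences of length n, and on the
  order of $n^k$ for k genomes (k = number of genomes, n = average genome length)

                                                                         Needleman-Wunsch, 1970
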
                 Dynamic Programming Algorithm

  Sequence alignment aims at maximizing the similarities between sequences.

  Optimal sequence alignment can be computed using dynamic programming.

  For two sequences, the best alignment is computed by filling a 2D matrix, where the
 score at cell (i,j) is computed as follows:



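\[
\mathrm{score}(i,j) = \min
\begin{cases}
\mathrm{score}(i-1,j-1) + 1, & \text{if } S_1[i] \neq S_2[j] \\
\mathrm{score}(i-1,j-1), & \text{if } S_1[i] = S_2[j] \\
\mathrm{score}(i-1,j) + 1 & \text{(character deletion cost)} \\
\mathrm{score}(i,j-1) + 1 & \text{(character insertion cost)}
\end{cases}
\]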
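
The recurrence for score(i, j) can be sketched in a few lines of Python. This is a minimal illustration, not the WinBioinfTools implementation: the function name `alignment_score` is hypothetical, and the boundary values score(i, 0) = i and score(0, j) = j are assumed (the usual choice for unit costs), since the slide does not state them.

```python
def alignment_score(s1: str, s2: str) -> int:
    """Fill the 2D matrix row by row and return score(len(s1), len(s2))."""
    n, m = len(s1), len(s2)
    # score[i][j] = minimal cost of aligning the first i characters of s1
    # with the first j characters of s2.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i  # assumed boundary: delete the first i characters
    for j in range(1, m + 1):
        score[0][j] = j  # assumed boundary: insert the first j characters

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if s1[i - 1] == s2[j - 1] else 1
            score[i][j] = min(
                score[i - 1][j - 1] + mismatch,  # match or mismatch
                score[i - 1][j] + 1,             # character deletion
                score[i][j - 1] + 1,             # character insertion
            )
    return score[n][m]


if __name__ == "__main__":
    print(alignment_score("TACAATCAA", "TCACTCAC"))
```

The two nested loops visit every cell once, which is where the quadratic running time for two sequences comes from.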




                Parallelization of the DP Algorithm
• The cluster nodes cooperate in filling the matrix (Compute Cluster Model)

• The filling proceeds diagonal-wise, and the master node synchronizes the filling

• The complexity reduces to $O(n^2/k + t k')$, where t is the communication time, k is the number
  of cores, and k' is the number of cluster nodes.

[Figure: the DP matrix, filled according to the recurrence for score(i, j), is divided among node 1 to node 4; a synchronizing line marks the filling front and is synchronized by the master node.]
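
Why diagonal-wise filling parallelizes can be seen in a small sketch. All cells with i + j = d lie on one anti-diagonal and depend only on the two previous anti-diagonals, so they can be computed in any order or on different workers. The code below is an assumption-laden illustration: it runs in a single process, the function name `score_by_antidiagonals` and the strided split into `workers` chunks are hypothetical, and the actual partitioning used on the cluster is not described in the slides.

```python
def score_by_antidiagonals(s1: str, s2: str, workers: int = 4) -> int:
    """Compute the alignment score by filling the matrix one anti-diagonal at a time."""
    n, m = len(s1), len(s2)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        score[i][0] = i  # assumed boundary values
    for j in range(m + 1):
        score[0][j] = j

    for d in range(2, n + m + 1):
        # All inner cells (i, j) with i + j == d.
        cells = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
        # Hypothetical split: on a cluster each chunk would go to one worker.
        chunks = [cells[w::workers] for w in range(workers)]
        for chunk in chunks:
            for i, j in chunk:
                mismatch = 0 if s1[i - 1] == s2[j - 1] else 1
                score[i][j] = min(
                    score[i - 1][j - 1] + mismatch,
                    score[i - 1][j] + 1,
                    score[i][j - 1] + 1,
                )
        # Synchronization point: diagonal d must be complete before d + 1 starts.
    return score[n][m]


if __name__ == "__main__":
    print(score_by_antidiagonals("kitten", "sitting"))
```

The synchronization after each diagonal is the source of the communication term in the complexity: more workers shorten the per-diagonal work but every diagonal still ends with a barrier.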
                        Experimental Results
 The running times (in seconds) for pairwise sequence alignment on one node and on 4 nodes.

                                   Time on 4 nodes
    Sequence length    Communication    Processing    Total       Time on one node
                           time            time
    100 x 100            0.03623         0.000665     0.001765        0.0034
    1000 x 1000          0.152653        0.005        0.014           0.04
    5000 x 5000          0.142311        0.3          1               3.9
    10000 x 10000        1.19            1.1          2.6             8.4
    20000 x 20000        3.679           2            8               18
    30000 x 30000        4               11           15              40

- In the first column, we list the sequence sizes; 100 x 100, for example, means that we
aligned two sequences, each 100 characters long.
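
To read the table, it helps to turn the timings into speedup (time on one node divided by total time on 4 nodes) and efficiency (speedup divided by the number of nodes). The short script below does this arithmetic for the "Total" and "Time on one node" columns; the variable and function names are illustrative only.

```python
# (sequence length, total time on 4 nodes, time on one node), in seconds,
# copied from the "Total" and "Time on one node" columns of the table.
timings = [
    (100, 0.001765, 0.0034),
    (1000, 0.014, 0.04),
    (5000, 1.0, 3.9),
    (10000, 2.6, 8.4),
    (20000, 8.0, 18.0),
    (30000, 15.0, 40.0),
]


def speedup_and_efficiency(t_parallel: float, t_serial: float, nodes: int = 4):
    """Return (speedup, efficiency) for one row of the table."""
    speedup = t_serial / t_parallel
    return speedup, speedup / nodes


if __name__ == "__main__":
    for length, t4, t1 in timings:
        s, e = speedup_and_efficiency(t4, t1)
        print(f"{length} x {length}: speedup {s:.2f}, efficiency {e:.2f}")
```

For example, the 5000 x 5000 row gives a speedup of 3.9 / 1 = 3.9 on 4 nodes, while the 30000 x 30000 row gives 40 / 15, roughly 2.7.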
                    Experimental Results




- On the x-axis, we list the sequence sizes; 100 x 100, for example, means that we
aligned two sequences, each 100 characters long.
Database Search




       Querying Biological Databases using BLAST

Biological database formatting and querying

[Figure: (1) the biological database is formatted, (2) a query is submitted, and (3) the results are returned as pairwise alignments; the example output aligns RBP (residues 1-136) with lactoglobulin (residues 1-135).]

               Large Scale Application of BLAST

• BLAST (Basic Local Alignment Search Tool): given a biological sequence, it searches the
  database for similar (sub)regions (Altschul et al. 1997)

• The database is extremely large

• The search time is proportional to the database length

• A computer cluster is well suited to speeding up BLAST searches

[Figure: queries arriving from the Internet, an institution, or an enterprise are run against the database segments DB1, DB2, DB3, and DB4.]

          Database segmentation, where the whole database DB
          is divided into subsets DB1,…,DB4
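
Database segmentation can be illustrated with a toy search. The sketch below is not BLAST and does not call any BLAST program: `search_segment` is a hypothetical stand-in that counts shared k-mers, threads stand in for cluster nodes, and the database is split by record count (a real split would more likely balance total sequence length, since search time is proportional to database length). It only shows the pattern: split DB into DB1,…,DB4, search each part independently, then merge the hits.

```python
from concurrent.futures import ThreadPoolExecutor


def segment_database(records, parts):
    """Split a list of (name, sequence) records into `parts` segments."""
    return [records[p::parts] for p in range(parts)]


def search_segment(query, segment, k=4):
    """Toy stand-in for a database search: count query k-mers found in each sequence."""
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    hits = []
    for name, seq in segment:
        shared = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in kmers)
        if shared:
            hits.append((shared, name))
    return hits


def distributed_search(query, records, parts=4):
    """Search every segment independently, then merge and rank the hits."""
    segments = segment_database(records, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial = list(pool.map(lambda seg: search_segment(query, seg), segments))
    merged = [hit for hits in partial for hit in hits]
    return sorted(merged, reverse=True)


if __name__ == "__main__":
    db = [("a", "ACGTACGT"), ("b", "TTTTTTTT"), ("c", "GGACGTGG"),
          ("d", "CCCC"), ("e", "ACGT")]
    print(distributed_search("ACGT", db))
```

Because each segment is searched without reference to the others, the merged result is the same as a search over the whole database; only the ranking step needs all partial results in one place.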
              Running Time for 1000 Queries on BLAST

                                  Running on 4 nodes                  One node
         Database      Communication    Processing    Total time     (Windows)
                           time            time
         Drosoph         0.014522        0.023478      0.038           0.08
         Pataa           0.01835         0.116         0.13435         0.5
         est_others      0.0343          0.5456        0.5799          1
         env_nr          0.53077         3.5           4.03077         18
         Nr              0.4077          6.8           7.2077          27

Running times are in hours for biological databases
The first 3 databases are DNA while the others are proteins
The query sequence is of the same type as the database

Comparative Genomics




                           Genome Comparison
Genome Comparison:
Given two genomic sequences, locate the regions of similarity and difference.




[Figure: the human genome and the mouse genome, shown as human chromosomes and mouse chromosomes.]

                                     CoCoNUT

• CoCoNUT is written in Perl and C/C++ and was intended to run under Linux/Unix

• CoCoNUT (Abouelhoda-Kurtz-Ohlebusch, 2008)
      - compares two or multiple genomes
      - compares draft genomes
      - analyzes repeats
      - maps cDNA to complete genomes

                                 CoCoNUT

• CoCoNUT was ported to run under Windows
     - Parts of the code are compiled and run directly on Windows
     - Third-party packages (GenomeTools) run using SUA and Cygwin

• The correctness of the port was verified by comparison with results obtained
  before

• The large-scale comparison runs
     - using the Job Scheduler
     - using our MPI-based script to save some computations

       Pairwise Comparison of Multi-chromosomal Genomes

Chromosome comparisons are independent of each other → divide the comparisons among the
cluster nodes

[Figure: Genome 1 x Genome 2 is broken into chromosome-by-chromosome comparisons, which are assigned to nodes N1, N2, ..., Nm.]
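
Dividing independent chromosome comparisons among nodes is a scheduling problem, and one simple way to sketch it is a greedy assignment that always gives the next largest comparison to the least loaded node. Everything here is illustrative: the function name `assign_comparisons` is hypothetical, and estimating the cost of a comparison as the product of the two chromosome lengths is an assumption made for the example, not a statement about how CoCoNUT or the Job Scheduler behaves.

```python
import heapq
from itertools import product


def assign_comparisons(genome1, genome2, nodes):
    """Assign every chromosome pair to a node, largest estimated cost first.

    genome1, genome2: dicts mapping chromosome name -> length.
    Returns a dict mapping node index -> list of (chrom1, chrom2) pairs.
    """
    # Assumed cost model: product of the two chromosome lengths.
    pairs = [((a, b), genome1[a] * genome2[b]) for a, b in product(genome1, genome2)]
    pairs.sort(key=lambda p: p[1], reverse=True)

    # Min-heap of (current load, node index, assigned pairs).
    heap = [(0, node, []) for node in range(nodes)]
    heapq.heapify(heap)
    for pair, cost in pairs:
        load, node, assigned = heapq.heappop(heap)  # least loaded node
        assigned.append(pair)
        heapq.heappush(heap, (load + cost, node, assigned))
    return {node: assigned for _, node, assigned in heap}


if __name__ == "__main__":
    g1 = {"chr1": 3, "chr2": 1}
    g2 = {"chrA": 2, "chrB": 1}
    for node, work in sorted(assign_comparisons(g1, g2, 2).items()):
        print(node, work)
```

With a job scheduler, the same effect can be reached by submitting one job per chromosome pair and letting the scheduler pick free nodes; the explicit assignment above just makes the balancing visible.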
  Comparison between the Human and Mouse Genome
• Total time of 20 comparisons on one node: approx. 47 h on Windows
• Total time of 20 comparisons on 4 nodes: approx. 12 h on Windows
• Estimated total time for the whole human-mouse comparison: 75 days on 1 node
  and 19 days on 4 nodes

[Figure: comparison plots of Human Chr. 13 to Mouse Chr. 14 and Human Chr. 18 to Mouse Chr. 18.]

                     Availability of our Tool
• The outcome of this project is the package WinBioinfTools
     - open source
     - available to download from
         • NUBIOS: Nile University Bioinformatics Server
                  http://www.nubios.nileu.edu.eg/tools/WinBioinfTools
         • CodePlex: Microsoft repository for open source tools
                  http://winbioinftools.codeplex.com

• September: 138 downloads in total since the release in May 2009
• 2 November: 176 downloads

                  Conclusions and Lessons

• Porting applications to HPC is not always straightforward → research is still
  needed on the algorithmic level

• Parallelization does not scale perfectly

• A mixed model of parallelization is becoming the trend

• The Windows cluster solution provides the features required for high
  performance computing and for migrating applications from Unix/Linux in a user-friendly
  way

• The Linux version is slightly faster; this is attributable not to the cluster modules
  of Windows HPC Server but to the third-party packages used
  (GenomeTools)

Thanks for your attention




            Advantages of Windows (HPC) Server 2008

For users
 • user-friendly GUI → users can focus more on the application
 • job scheduler with an intuitive interface that is efficient in practice
 • no sophisticated command lines

For developers
 • MS-MPI (the Microsoft implementation of MPI) runs smoothly from within Visual Studio, i.e., it is
   easy to compile and run. It can also run virtually over many cores.
 • the debugging features of Visual Studio 2008 support parallel algorithms

For administrators
 • easy to download, with rich and informative documentation
      Full configuration (including HPC, security, networking) takes 1.5 h per node.
 • efficient, easy-to-use cluster management tools

								