					                              LHSG-CT-2004-512092

                                   EMBRACE

A European Model for Bioinformatics Research and Community Education


                               Network of Excellence

             Life Sciences, Genomics and Biotechnology for Health



             D3.3.3 Updated Report on EMBRACE Grid Deployment




                         Due date of deliverable:  31 Jul 2009
                           Actual submission date:   TODO


Start date of project:            1 Feb 2005         Duration: 60 months

Organisation name of lead contractor for this deliverable: CNRS
Contributing authors

Jean Salzemann, Vincent Bloch, Vincent Breton




Abstract

        The EMBRACE Grid was started as a testbed on the EGEE grid in 2006. The first
goal was to use this testbed to evaluate the WP3 technology recommendation by deploying
test cases that are CPU intensive and produce large amounts of data. The EMBRACE grid
has since evolved from a specific set of resources to a set of available services that can, if
needed, be deployed on a specific grid infrastructure. This document presents the current
status of the resources, and some of the technology that allows service providers to integrate
their applications on the grid.
1 Introduction
       In 2006, the EMBRACE Grid was started on the EGEE grid as a testbed to deploy
applications and services that needed intensive computation. The EGEE infrastructure was
chosen because it was mature enough to support a large range of computing-intensive
applications with high-level, stable services, and offered many available resources. In its
previous stage, the EMBRACE Grid was to be considered a standard computing grid,
based on its own virtual organization and middleware. The only way to interact with this grid
was to access a user interface and learn how to use the specific commands and APIs.
Nowadays, the EMBRACE grid has shifted toward a grid of services, providing the whole
bioinformatics community with a large number of web services that can be composed into
workflows and that provide a real abstraction of the underlying technology.



2 The evolution of the computing grid infrastructure

         2.1 Resources


           As described in D3.3.2 “Report on EMBRACE Grid Deployment”, WP3 decided to
build a new virtual organization on the EGEE grid as a starting point to allow EMBRACE
users to access resources for their applications. A virtual organization, as its name implies,
is a virtual federation of users who share the same scientific interests and work on the same
topics or applications. The administration and gathering of resources for the EMBRACE
virtual organization was a very tedious task, as EMBRACE lacked the visibility of other
virtual organizations, and many sites were reluctant to open their resources to uncommon
virtual organizations. Recall that in EGEE, each site that provides computing resources can
choose which VOs' members are allowed to run jobs on those resources. So eventually,
EMBRACE partners provided most of the resources, and the number of collected resources
was very small in comparison with other virtual organizations.

        Giving the EMBRACE Grid its very own resources was a strong political move,
because it identifies EMBRACE as an independent initiative that promotes its own
deployment and administration policies; but while the available EMBRACE resources were
sufficient for a testbed, they proved weak and limited for large deployments. As the purpose
of the EMBRACE grid is to run biological and bioinformatics applications, it falls squarely
within the range of topics addressed by the biomed virtual organization. As noted in D3.3.2,
the PDB refinement deployment was also run in large part on the resources of the biomed
VO, because the resources of the EMBRACE VO were not sufficient to finish all the
computations in a limited amount of time. The key point is that the biomed VO is the largest
non-physics virtual organization on EGEE, and one of the most active ones. The number of
resources in the biomed VO is very large, and the resources are well maintained. Most of
the sites that opened their resources for EMBRACE also support the biomed VO, so we
believe that for the purpose of application deployment it makes more sense to use the
biomed VO than the VO created specifically for EMBRACE.

       The procedure for users to join a virtual organization is the same for any VO: they
must own a personal certificate and ask the VO administrators to join the collaboration. Of
course, they have to work on applications that match the interests of the VO. But since the
EMBRACE and biomed VOs host the same type of applications, there is no overhead in
applying to the biomed VO instead of the EMBRACE VO. Moreover, as described in section
2.2 of this report, we present a technical solution to access the grid transparently, so that
membership in a specific VO is no longer needed.


                       Number of sites                                198
                    Total number of CPUs                            30,000
                       Total disk space                             800 TB
                      Number of users                                 135
                  Number of resource brokers                           42

                      Table 1. Biomedical Virtual Organization resources.



      2.2 Abstraction Mechanisms (the example of the WISDOM
production environment)

        The EMBRACE Grid is not to be seen as a common computing grid, because the goal
of the project is not to build an infrastructure but to define standards for the interoperability of
services. The grid built in EMBRACE is therefore a collection of services offering business-
logic interfaces, rather than a raw collection of computing and storage resources. The
recommended technology is a step toward this goal, as it allows hiding any infrastructure
and providing a standardized view of the services. The process of integrating a service on a
specific resource type generally takes three steps:

       - Tune the application execution for a specific infrastructure, which often means
compiling the binaries for a specific computing architecture, editing the scripts depending on
how the application is run, or deploying the data in a suitable way so the software can access
them from where it is executed.
       - Build an interfacing web service that will be the entry point to the system.
       - Adapt the web service code to match the specific commands and APIs of the
       underlying middleware; that means handling stage-in and stage-out data transfers,
       job submission, monitoring and fault tolerance (a sketch follows this list).
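
As an illustration of the second and third steps, here is a minimal sketch of an interfacing
service wrapping a command-line application. It is a sketch under stated assumptions, not
actual EMBRACE service code: the run_analysis.sh script and the local subprocess call
are hypothetical stand-ins for a real middleware submission.

    # Minimal sketch of an interfacing service (steps 2 and 3), with the
    # middleware-specific submission replaced by a local subprocess call.
    # run_analysis.sh is a hypothetical application launcher.
    import subprocess
    import uuid

    JOBS = {}  # job id -> process handle (a real service would persist state)

    def submit_job(input_file: str) -> str:
        """Entry point: stage in the data and launch the application."""
        job_id = str(uuid.uuid4())
        # On a grid, this is where the middleware submission would go
        # (job description, stage-in of input_file, submission, bookkeeping).
        JOBS[job_id] = subprocess.Popen(["./run_analysis.sh", input_file])
        return job_id

    def poll_job(job_id: str) -> str:
        """Monitoring entry point: report the job status."""
        proc = JOBS[job_id]
        if proc.poll() is None:
            return "RUNNING"
        return "DONE" if proc.returncode == 0 else "FAILED"

Exposing submit_job and poll_job as SOAP operations behind the normalized WSDL is then
the service provider's remaining task.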

This work has to be performed for every service, and in each case the quality of service
and fault tolerance may vary widely depending on the effort spent tuning and optimizing the
integration code; the quality of the service will also depend strongly on the developer's
knowledge of the underlying technology.

To avoid the tedious task of building new application software for each service, we
designed, based on our experience of grid computing, an environment that can sit on top of
grid systems, or more generally on computing resources and clusters, handle the data and
jobs, and share the workload across all the integrated resources even if they follow different
technology standards. Based on this meta-middleware, we can simply build web services
that interact with the services of the system, and the integration work is performed
automatically by the system.
                       Fig 1: Implementation of the meta-middleware.


The meta-middleware is really to be seen as a set of generic services that abstract the
specific resources and provide generic management of data and jobs, so the application
services can use any of the underlying systems in a very transparent way. The top business-
specific services are just normal EMBRACE services, and the important point of this concept
is that all the EMBRACE services on top of the system can share all the infrastructures on
which the system is based, in a completely transparent way (fig 2).




                        Fig 2: Architecture of the meta-middleware


Users will not interact with the grids below, or have to know how they work, since they
will just interact with the top services like with any other web services. The important
aspect of this feature is that users will not have to apply to a specific virtual
organisation or ask for credentials to access a specific grid. Once an infrastructure is
integrated in the system, all users and services based on the system benefit from it. So we
can progressively integrate new computing infrastructures into the system without having to
redevelop specific services every time, or requiring users to acquire specific knowledge
about them. The whole system allows having one single service per application, avoiding the
need for a specific service for each underlying grid or cluster. The system can virtually
integrate any computing resources: dedicated grids like EGEE or OSG, clusters, simple
computers, or even cloud computing or desktop computing grids like BOINC.
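
To make the concept concrete, here is a minimal sketch of such an abstraction layer. The
class and method names are illustrative assumptions, not the actual WISDOM production
environment API: one generic job interface, and one adapter per integrated infrastructure.

    # Sketch of the meta-middleware idea: one generic job interface,
    # many interchangeable resource adapters. All names are illustrative
    # assumptions, not the actual WISDOM environment API.
    from abc import ABC, abstractmethod

    class ResourceBackend(ABC):
        """Anything that can run a job: an EGEE VO, an OSG VO, a cluster."""
        @abstractmethod
        def submit(self, executable: str, inputs: list) -> str: ...
        @abstractmethod
        def status(self, job_id: str) -> str: ...

    class EGEEBackend(ResourceBackend):
        def __init__(self, vo: str):
            self.vo = vo                      # e.g. "biomed"
        def submit(self, executable, inputs):
            return f"egee-{self.vo}-job-001"  # would call the EGEE middleware
        def status(self, job_id):
            return "RUNNING"

    class ClusterBackend(ResourceBackend):
        def submit(self, executable, inputs):
            return "cluster-job-001"          # would call the batch scheduler
        def status(self, job_id):
            return "RUNNING"

    class MetaScheduler:
        """Shares the workload across all integrated resources."""
        def __init__(self, backends):
            self.backends = backends
            self._next = 0
        def submit(self, executable, inputs):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            return backend.submit(executable, inputs)

    scheduler = MetaScheduler([EGEEBackend("biomed"), ClusterBackend()])
    print(scheduler.submit("interproscan", ["batch1.fasta"]))

An application service written against the generic interface never changes when a new
infrastructure is plugged in; only a new adapter is added.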

Finally, this system enables the EMBRACE grid to concentrate on services rather than
resources: through this system, the EMBRACE grid becomes an integration of all the
underlying infrastructures and clusters. As stated in section 2.1, this makes registration
with a specific virtual organisation pointless, since all of them can be accessed transparently
by the users as soon as the system embeds them, without the need for a certificate.

Currently the system integrates the following resources:
       - biomed virtual organisation on EGEE
       - Auvergrid virtual organisation on EGEE
       - EMBRACE virtual organisation on EGEE
       - Engagement virtual organisation on OSG
       - C4 cluster in South Africa
       - Digital Ribbon company cluster

Work is in progress to enable the use of the system with BOINC-like middlewares and to
provide interoperability with the DEISA European grid infrastructure. The use of grid
technology is not mandatory in every case, but whenever services are to be used by many
users simultaneously and must scale well, the grid can significantly increase performance,
because there will, presumably, be sufficient resources to handle all the simultaneous
queries. For any service intended for large-scale deployment, whether because of a heavy
workload or because of intensive usage by many concurrent users, the grid becomes the
best approach to improve scalability.



3 The EMBRACE Registry

Another notable aspect of the EMBRACE grid is the development and set-up of a service
registry. As stated before, the EMBRACE grid is not per se a computing or resource grid,
but rather a grid of services, or a knowledge grid. With the establishment of the EMBRACE
technology recommendation, many service providers started to make their services available
to the community under the normalized WSDL. As more and more services became
available, it became clear that the wiki on the EMBRACE website was badly suited to
keeping track of them all. The project therefore decided to develop a registry to allow easier
tracking of the services, offering a high-level view of them: descriptions, links to service
WSDLs, service availability, statistics, etc. In the paradigm of grid computing, there is
always a service that references the resources, or provides information on the available
resources; this service is generally called the grid information system. The EMBRACE
registry can be seen exactly as the EMBRACE grid information system, as it references all
the available components of the EMBRACE grid. In a classical computing grid, the
components are resources, but the EMBRACE grid is made of web services.


                Total Number of services                             802
          Number of Services with EMBRACE tag                        350
                     Number of users                                 146

                                 Table 2. Registry information


The current number of registered services is 802; the first service was registered at the
end of August 2008, and since then new services have been added regularly. Not only
EMBRACE services have been registered, but also services from ENFIN, BioSapiens,
MOBY, etc. The registry offers anyone who creates an account the ability to interactively
add new services and tag them with labels that can be used to categorize and search the
services. All the services can be used transparently by any end user by simply taking the
service endpoint and invoking the service with a web service client such as SoapUI or
Taverna. The registry just provides information on the web services, with descriptions,
keywords and some basic testing features to allow simple monitoring of the services. Those
web services can be based on any computing infrastructure, but as the registry is not a user
interface to run the services, it cannot run a specific service depending on the requirements
of the users. This feature could be achieved by developing a new service, a workload
management system, that would analyze the needs of the users, select the corresponding
service and run it. If one needs a specific service deployed on a grid for large-scale
deployment (see section 4 for deployment examples), one has to go through the service
descriptions in the registry, or look for a specific tag, to select a service deployed on a
computing grid, and then call the service with a web service client. Even though the registry
is not sufficient to fully exploit the services, it is still a mandatory service and the reference
point for information about existing resources, just as on a normal grid.
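
As an illustration, invoking a registry-listed service from a script needs nothing more than a
generic SOAP client. The sketch below uses the third-party Python library zeep; the WSDL
URL and the operation name are placeholders, not a real registry entry.

    # Sketch: calling a registry-listed web service with the zeep SOAP
    # client (pip install zeep). The WSDL URL and the runAnalysis
    # operation are placeholders; take the real endpoint and operation
    # names from the service description in the registry.
    from zeep import Client

    wsdl_url = "http://example.org/services/SomeAnalysis?wsdl"  # placeholder
    client = Client(wsdl_url)

    # zeep builds a Python proxy from the WSDL; list what it exposes.
    for service in client.wsdl.services.values():
        print("Service:", service.name)

    # Invoke an operation exactly as described in the WSDL (placeholder).
    result = client.service.runAnalysis(sequence="MKTAYIAKQR")
    print(result)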



4 Application deployment
In this section we report on the applications that were deployed using EMBRACE
technology and resources in 2008 and 2009, and show good examples of what can be
achieved on the EMBRACE grid at a production level.


4.1 Drug Discovery application using Gold

Gold is a licensed docking software package. This work was performed in the framework of
the WISDOM initiative, and the jobs were submitted and managed through the WISDOM
production environment. The goal of this deployment was to dock more than 3,000 ligands
against 7,000 protein structures, for a total of approximately 25,000,000 dockings. The
particularity of the experiment is that we could use 3,000 licences to run the computations,
so we were limited by the number of licences even though we had more CPUs at our
disposal.
The deployment was performed on both a dedicated computing grid and a cluster located in
South Africa using the same meta-middleware; up to 4,500 CPUs were used during the
deployment thanks to the use of agents, though of course only 3,000 instances of the
software could run simultaneously. Even though the PDB-redo application had already been
deployed on both grids and clusters, this deployment was a step further in interoperation, as
all the computations and data were piloted through a single user interface, and the WISDOM
production environment dealt with the specificities of each computing resource. For the
PDB-redo application, several people had to share the workload and compute their own
subsets of data.
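
The licence constraint is a classic bounded-concurrency problem. The following minimal
sketch, in which run_docking() is a hypothetical stand-in for launching one Gold docking on
a grid worker, shows how a submission loop can keep at most 3,000 instances in flight even
when more CPUs are available:

    # Sketch of licence-bounded scheduling: tasks may queue on thousands
    # of CPUs, but at most MAX_LICENCES dockings run at any one time.
    # run_docking() is a hypothetical stand-in for one Gold docking.
    from concurrent.futures import ThreadPoolExecutor
    from threading import BoundedSemaphore

    MAX_LICENCES = 3000
    licences = BoundedSemaphore(MAX_LICENCES)

    def run_docking(ligand: str, target: str) -> str:
        with licences:  # blocks until one of the 3,000 tokens is free
            # ... launch Gold on (ligand, target) and wait for it ...
            return f"{ligand} x {target}: done"

    tasks = [(l, t) for l in ("lig1", "lig2") for t in ("prot1", "prot2")]
    with ThreadPoolExecutor(max_workers=64) as pool:
        for result in pool.map(lambda lt: run_docking(*lt), tasks):
            print(result)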


4.2 InterproScan Deployment

InterproScan is a combination of several protein signature recognition algorithms and their
associated databases. Given a sequence, InterproScan runs all the different algorithms to
find the relationships between the sequence and the known protein families. All the
algorithms are independent and can be run in parallel without any major constraint.
InterproScan is currently available through a web portal at EBI
(http://www.ebi.ac.uk/Tools/InterProScan/) as well as a Web Service.

Due to heavy usage and the limited amount of resources at EBI, users are not allowed to
submit many sequences at the same time when using either the web portal or the Web
Service. InterproScan was ported to the grid to allow a single user to easily submit a large
dataset and get the results as soon as possible.

During the deployment, 50,000 sequences were run against the following algorithms from
InterproScan:
       - fingerPrintScan
       - gene3d
       - pfam
       - pir
       - prodom
       - scanregexpf
       - prosite
       - smart
       - tiger
       - superfamily
Using grid resources, it was possible to run all the sequences, representing more than 110
days of computation, in only a few days.
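
Since the algorithms are independent, the grid port essentially fans out (algorithm, sequence
chunk) pairs as separate jobs. Below is a toy sketch of that decomposition, in which
run_job() is a hypothetical stand-in for one real grid job:

    # Toy sketch of the InterproScan decomposition: every (algorithm,
    # chunk-of-sequences) pair is an independent job, so all pairs can
    # run in parallel. run_job() stands in for one real grid job.
    from concurrent.futures import ProcessPoolExecutor
    from itertools import product

    ALGORITHMS = ["fingerPrintScan", "gene3d", "pfam", "prosite", "smart"]

    def chunks(seqs, size):
        for i in range(0, len(seqs), size):
            yield seqs[i:i + size]

    def run_job(args):
        algo, seq_chunk = args
        # A real job would stage the chunk in, run the algorithm binary,
        # and stage the result file out to grid storage.
        return f"{algo}: scanned {len(seq_chunk)} sequences"

    if __name__ == "__main__":
        sequences = [f"seq{i}" for i in range(50_000)]
        jobs = list(product(ALGORITHMS, chunks(sequences, 1000)))
        with ProcessPoolExecutor() as pool:
            for line in pool.map(run_job, jobs):
                print(line)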

However, to be fully exploitable by users, the data require a post-processing step that has
not yet been implemented on the grid. Moreover, the deployment has shown that data
management can become a real issue without the proper tools: while the computation itself
can be achieved very quickly, gathering the result files scattered all around the grid can take
many days, if not weeks. Thus the grid version of InterproScan still needs improvements
before it can be released to the public.
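
One plausible mitigation, sketched below, is to gather the files concurrently and to retry
transient failures instead of pulling them one by one. fetch_file() is an assumed
placeholder for whatever data-management client the infrastructure provides; the point is the
retry-and-parallelise pattern, not a specific tool.

    # Sketch of concurrent result gathering with retries. fetch_file()
    # is an assumed placeholder for the grid's data-management client.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def fetch_file(remote: str, local: str) -> None:
        # Placeholder: a real version would invoke the data-management
        # client of the grid (middleware-specific, hence an assumption).
        print(f"copy {remote} -> {local}")

    def fetch_with_retries(remote: str, retries: int = 5) -> str:
        local = remote.rsplit("/", 1)[-1]
        for attempt in range(1, retries + 1):
            try:
                fetch_file(remote, local)
                return f"{remote}: ok"
            except OSError:
                time.sleep(2 ** attempt)  # exponential back-off
        return f"{remote}: FAILED after {retries} attempts"

    result_files = [f"lfn:/grid/biomed/results/part{i}.out" for i in range(10)]
    with ThreadPoolExecutor(max_workers=20) as pool:
        for line in pool.map(fetch_with_retries, result_files):
            print(line)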
5 Conclusion

The EMBRACE Grid was first designed as a testbed to evaluate the WP3 technology
recommendations, with the ultimate goal of hosting EMBRACE applications and making
them available through Web Services. As of today, the EMBRACE Grid is made of all the
resources integrated through the WISDOM production environment: 30,000 CPUs on the
biomed VO, several clusters already integrated, and some that are yet to be integrated. The
main evolution since the previous report on the EMBRACE Grid is that we decided not to
rely specifically on a VO on the EGEE grid, but instead to provide the community with more
flexible and transparent access through a high-level meta-middleware. The main advantage
of this strategy is that we can simply build web services integrated with this meta-
middleware and register them in the EMBRACE registry, and anybody can use them and
submit jobs on the available resources without asking for specific credentials. The tedious
procedure of requesting a certificate and registering in a VO to access a computing grid is
no longer mandatory, since only the server that hosts the WISDOM production environment
has to hold and share the credentials used to run the deployed bioinformatics applications.
Even though the EMBRACE project is not about building a production infrastructure, we
were able to use the technology recommendation and the resources available in the
framework of EMBRACE to perform real production experiments. Moreover, with the
technology we have developed alongside the EMBRACE technology recommendation, we
have mechanisms to easily deploy applications on production grids and to provide truly
transparent access to users. Ultimately, we could support service providers in porting their
current services to grids whenever they require high-end computing power.

				