A Challenge towards Next-Generation Research Infrastructure for Advanced Life Science 1
A Challenge towards Next-Generation Research
Infrastructure for Advanced Life Science
Haruki NAKAMURA
Institute for Protein Research, Osaka Univ.,
3-2 Yamadaoka, Suita, Osaka 565-0871 JAPAN
harukin@protein.osaka-u.ac.jp
Susumu DATE, Hideo MATSUDA
Graduate School of Information Science and Technology, Osaka Univ.,
1-3 Machikaneyama, Toyonaka, Osaka 560-8531 JAPAN
sdate@ist.osaka-u.ac.jp, matsuda@ist.osaka-u.ac.jp
Shinji SHIMOJO
Cybermedia Center, Osaka Univ.,
5-1 Mihogaoka, Ibaraki, Osaka 567-0047 JAPAN
shimojo@cmc.osaka-u.ac.jp
Received 15 June 2003
Abstract
Recently, life scientists have expressed a strong need for computational
power sufficient to complete their analyses within a realistic time as well as for a
computational power capable of seamlessly retrieving biological data of interest
from multiple and diverse bio-related databases for their research infrastructure.
This need implies that life science strongly requires the benefits of advanced IT.
In Japan, the Biogrid project has been promoted since 2002 toward the estab-
lishment of a next-generation research infrastructure for advanced life science.
In this paper, the Biogrid strategy toward these ends is detailed along with the
role and mission imposed on the Biogrid project. In addition, we present the
current status of the development of the project as well as the future issues to be
tackled.
Keywords BioPfuga, UDS-XML, Metadata-based Database Federation, GSI-
SFS, Computational Grid, Data Grid
1 Introduction
The recent proliferation of bio-related databases, along with the advancement
of bioinformatics and IT has produced two outstanding problems. One is the shortage
of computational power, and the other is the lack of a capability for seamlessly and
quickly retrieving data of interest from a diversity of bio-related databases, each of
which is distributed and managed on the Internet. These two problems must be solved
for the further advancement of life science.
Recently, the Grid, which is cutting-edge information technology, has attracted
people who are engaged in life science, due to its promise of solving these two prob-
lems. In reality, however, current Grid technologies cannot fully satisfy present de-
mands and expectations. In other words, a kind of technical gap exists between the
computational platform realized with current Grid technologies and what is actually
expected by bio scientists and researchers.
The goal of this research is the establishment of a next-generation research in-
frastructure for advanced life science. The research infrastructure must not only solve
the two problems described above but must also provide a research platform that allows
bio-researchers and scientists to perform their analyses seamlessly. For this purpose,
we have promoted the Biogrid project since 2002 as a research and development plat-
form where knowledge, techniques, and technologies regarding both IT and life science
easily converge. Furthermore, in order to make the role and mission of the project
complete, the Biogrid project is composed of the following four research groups: the
Computing Grid, the Data Grid, the Core Grid, and the online Data analysis Grid. The
following sections describe the research goal along with the role and mission imposed
on the first three groups.
2 Computing Grid: Integration of Large-scale Com-
putations on a Grid Architecture for Modeling Biological
Systems on Multiple Levels
2.1 Development of BioPfuga (Biosimulation Platform United on
Grid Architecture)
The usual use of the Grid architecture is to run one computation on many
distributed CPUs through a high-speed network. However, in order to analyze much
more complicated biological systems, which require simulations at different levels under
a new paradigm for biological science, more integrated computational approaches are
required. The members of the Computing Grid group have already developed their
own biological simulation programs, which cover a wide range of fields in biological
science from electronic analyses of biological macromolecules to cellome and phys-
iome research. In particular, AMOSS (Ab initio Molecular Orbital System for Super
computer) for electronic state simulation by Sakuma et al.1) and prestoX-basic for protein
molecular dynamics simulations by Fukunishi et al.2)3) have already been driven on
our BioGrid architecture. The individual programs should be driven on their own corre-
sponding machines on the Grid system. For this purpose, we have designed and devel-
oped a new platform, BioPfuga (Biosimulation Platform United on Grid Architecture)
where individual applications, corresponding to the different levels of bio-simulations,
are united and executed as a hybrid application.
Fig. 1 (a) An example of UDS-XML (Universal Data Set-XML) in Base64 form for the data exchange
between different computational programs for actual execution of Grid computations. (b) Work flow of
BioPfuga for a hybrid-QM/MM calculation using AMOSS1) and prestoX-basic2, 3) .
BioPfuga requires that (1) application programs be divided into a set of many
pieces, each of which corresponds to a unit simulation procedure, and that (2) data
communication be made between the program pieces by a standard description. For
the former requirement, the simulation unit should not be too small, so that the data
communication time among different machines stays small relative to the computation
time. For the latter requirement, we have proposed a simple, standard description using
XML, UDS-XML (Universal Data Set-XML), for the data exchange between different computational
programs for the actual execution of Grid computations. Until now, many application
programs have used only binary data for intermediate and temporary data in the field of
High-Performance Computing (HPC). This is because
data files become very large and access becomes very slow if the data are described in a text format.
For the XML description, we designed three forms: a text form, a hexadecimal form,
and a Base64 form (Fig. 1(a)). When the Base64 form is used, the data size is only about
1.3 times that of the binary form. The advantage of the XML form for
intermediate and output data is that any meta-data can be easily added as an attribute or
as tagged information in addition to the actual computed data to be exchanged among
the different application programs. It should be emphasized that the unit of data can
always be provided in UDS-XML, so that the different application programs recognize
and confirm the unit system for computation and analysis. Researchers in different
scientific fields usually use their own conventional units, and a barrier to research
integration is therefore encountered without a common understanding of the units. From the
point of view described above, we have proposed and utilized UDS-XML to facilitate
both the seamless data exchange among a variety of programs and the breaking of programs
into Grid-enabled modules. The schema of UDS-XML and the details of the BioPfuga platform
will be available on our Web page (http://www.biogrid.jp/).
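The three forms can be illustrated with a minimal Python sketch. It does not reproduce the actual UDS-XML schema, which is not given in this paper: the element name `dataset`, the attribute names `name`, `unit`, `form`, and `count`, and the little-endian 64-bit float layout are all assumptions made for illustration. The sketch shows the unit traveling with the data as an attribute, and the size ratios of the hexadecimal and Base64 payloads relative to the binary data.

```python
import base64
import struct
import xml.etree.ElementTree as ET


def encode_values(values, form):
    """Encode a sequence of floats as element text in one of three forms."""
    raw = struct.pack(f"<{len(values)}d", *values)  # little-endian float64 (assumed layout)
    if form == "text":
        return " ".join(repr(v) for v in values)
    if form == "hex":
        return raw.hex()
    if form == "base64":
        return base64.b64encode(raw).decode("ascii")
    raise ValueError(f"unknown form: {form}")


def decode_values(text, form, count):
    """Invert encode_values for the given form."""
    if form == "text":
        return [float(token) for token in text.split()]
    if form == "hex":
        raw = bytes.fromhex(text)
    elif form == "base64":
        raw = base64.b64decode(text)
    else:
        raise ValueError(f"unknown form: {form}")
    return list(struct.unpack(f"<{count}d", raw))


def make_dataset(name, values, unit, form):
    """Build a hypothetical <dataset> element that carries its unit as metadata."""
    elem = ET.Element(
        "dataset",
        {"name": name, "unit": unit, "form": form, "count": str(len(values))},
    )
    elem.text = encode_values(values, form)
    return elem


def read_dataset(elem):
    """Return (values, unit) so the receiving program can confirm the unit system."""
    values = decode_values(elem.text, elem.get("form"), int(elem.get("count")))
    return values, elem.get("unit")


if __name__ == "__main__":
    coords = [0.5, -1.25, 3.0] * 100
    binary_size = 8 * len(coords)
    for form in ("text", "hex", "base64"):
        payload = encode_values(coords, form)
        print(form, len(payload) / binary_size)
```

Base64 maps every 3 bytes to 4 characters, which is consistent with the factor of about 1.3 quoted above, whereas the hexadecimal form doubles the size.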
2.2 Application of BioPfuga to hybrid-QM/MM calculations
Both the static and dynamic features of protein structures are currently ana-
lyzed by molecular simulations. Chemical reactions at the active sites of proteins re-
quire information about the dynamic features of electrons, which cannot be attained by a
classical molecular mechanics simulation. Thus, we have combined quantum chemical
(QM) simulations, AMOSS 1) , with molecular mechanics (MM) simulations, prestoX-basic 2, 3) ,
in a way that is suitable for Grid computation. We first divided the two big
programs into a set of many pieces. Then, a hybrid calculation was performed follow-
ing the work flow as shown in Fig. 1(b). As a computational platform, we used PC
clusters composed of Pentium-3 and Pentium-4 processors, and the MD part was also
driven on the special-purpose computers for MD simulations, MDGrape2 4) . The current
Globus provides no advanced APIs or tools required for the function of dynamic spawning
in the integration process of BioPfuga modules; therefore, we have tentatively used
MPI/LAM5) in the current implementation of BioPfuga.
As an example of this hybrid-QM/MM calculation, a simple model system
composed of ethanol in water was simulated. Here, the ethanol molecule and the two
water molecules close to the ethanol hydroxyl group were treated as the QM region, and
the Hartree-Fock molecular orbitals (MO) were analyzed using the 35 basis functions
of MINI-4 6) . The other 226 water molecules were represented using a conventional
classical model, TIP4P 7) . The Nosé-Hoover algorithm was applied at a constant (283 K)
temperature without any truncation of the non-bonded interactions. The CAP boundary
with a 13 Å radius was used. From the canonical ensemble of the molecular system, it
was found that the gauche rotamer with respect to the ethanol dihedral angle around the -C-O-
covalent bond was as stable as the trans rotamer. In contrast, when the
classical force field8) was used for ethanol, only the trans conformer was stable. In the
current computational system, we used a Gigabit Ethernet among the PCs, and it took
about 0.1 s for 100 kBytes data transfer with the hexadecimal UDS-XML form plus the
corresponding XML parsing and writing procedures at one MD unit step.
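The reported timing can be put in perspective with a short back-of-the-envelope calculation. The sketch below assumes that 1 kByte means 1000 bytes and uses the nominal 1 Gbit/s line rate while ignoring protocol overhead; both are assumptions, not measurements from this work.

```python
PAYLOAD_BYTES = 100_000        # 100 kBytes per MD unit step (assuming 1 kByte = 1000 bytes)
STEP_SECONDS = 0.1             # reported time for transfer plus XML parsing and writing
LINK_BYTES_PER_SECOND = 1e9 / 8  # nominal Gigabit Ethernet rate, protocol overhead ignored


def effective_throughput(payload_bytes, seconds):
    """Bytes per second achieved end to end, including parsing and writing."""
    return payload_bytes / seconds


def nominal_wire_time(payload_bytes, link_bytes_per_second):
    """Time the payload alone would occupy the link at its nominal rate."""
    return payload_bytes / link_bytes_per_second


if __name__ == "__main__":
    throughput = effective_throughput(PAYLOAD_BYTES, STEP_SECONDS)
    wire = nominal_wire_time(PAYLOAD_BYTES, LINK_BYTES_PER_SECOND)
    print(f"effective throughput: {throughput / 1e6:.2f} MB/s")
    print(f"nominal wire time: {wire * 1e3:.2f} ms")
    print(f"wire time as a fraction of the step: {wire / STEP_SECONDS:.3f}")
```

Under these assumptions the end-to-end rate is about 1 MB/s, and the nominal wire time is under 1% of the 0.1 s, which suggests that the encoding, parsing, and writing steps, rather than the network itself, may account for most of the cost.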
3 Data Grid: Application of Grid Technology to Het-
erogeneous Database Federation
3.1 Metadata-based database federation
Fig. 2 Outline of Data Grid system: (a) Metadata-based database federation, and (b) An example of an
application associating protein-compound relationships with protein-protein and compound-compound simi-
larities.
One of the key technologies we are developing is the metadata-based database
federation (Fig. 2(a)). Two types of metadata described in XML are introduced: (1)
Application metadata (or AP-Metadata) mediates between applications and
databases. (2) Data service metadata (or DS-Metadata) is used for hiding the
heterogeneity of various records of biology-related databases.
AP-Metadata is designed for filling the gaps between applications and databases.
For example, in the drug design process, it is necessary to explore chemical compounds (or ligands)
that exhibit affinity with disease-related proteins (such as receptors).
In this type of application, one would like to classify a large number of possible pairs
between compounds and proteins according to their interaction types: for example,
activators (or agonists) and inhibitors (or antagonists). However, such databases bridg-
ing interdisciplinary areas (i.e., biology and chemistry) are very few compared to the
number of protein-specific or compound-specific databases. Thus, establishing such
interoperating relationships, which we call AP-Metadata, is necessary. AP-Metadata
describes a drug-related relationship between a protein and a compound (see below for
details). AP-Metadata includes the evidentiary information of their relationships (what
databases they come from, and how likely the relationship is to hold). We use the Gene
Ontology annotation rule 9) for describing this information. We are making the
protein-compound AP-Metadata by exploring PDB complex data and Medline abstracts.
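As an illustration of the kind of record described above, the following Python sketch models one protein-compound relationship together with its evidence and groups relationships by interaction type. The class name `ApRelation`, its field names, and the identifiers in the usage example are hypothetical; representing the evidence as a Gene Ontology style evidence code (such as TAS or IDA) is our assumption about how the annotation rule could be applied, and the real AP-Metadata is described in XML rather than as Python objects.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ApRelation:
    """One hypothetical AP-Metadata entry linking a protein and a compound."""
    protein_id: str
    compound_id: str
    interaction: str    # e.g. "agonist" or "antagonist"
    source_db: str      # where the relationship was found, e.g. "PDB" or "Medline"
    evidence_code: str  # GO-style evidence code (assumed representation)


def group_by_interaction(relations):
    """Classify protein-compound pairs according to their interaction type."""
    groups = defaultdict(list)
    for rel in relations:
        groups[rel.interaction].append((rel.protein_id, rel.compound_id))
    return dict(groups)


if __name__ == "__main__":
    # Placeholder identifiers, not real database entries.
    relations = [
        ApRelation("PROT_A", "CMPD_1", "agonist", "PDB", "IDA"),
        ApRelation("PROT_A", "CMPD_2", "antagonist", "Medline", "TAS"),
        ApRelation("PROT_B", "CMPD_1", "agonist", "Medline", "TAS"),
    ]
    print(group_by_interaction(relations))
```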
In contrast to AP-Metadata, DS-Metadata is used when many information
sources exist for the same type of data. For example, many databases store protein-related
information (such as SWISS-PROT, PIR, and PDB). However, due to
their high heterogeneity and the frequency of updates, building a single database that
integrates all the related data is difficult. Thus, we have designed a common XML
format10) , DS-Metadata, to describe related data extracted from many databases. These
metadata not only hold data for a protein or a compound but also keep reference pointers
to the original databases. These pointers are represented by their database names
(expressed using the namespace in XML) and their database IDs. We have developed
a system for constructing DS-Metadata by converting original database formats to an
XML-based common database format.
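The pointer mechanism can be sketched in Python with the standard `xml.etree.ElementTree` module. This is not the actual DS-Metadata format: the element names `protein` and `entry`, the `id` attribute, the namespace URIs, and the database IDs in the usage example are placeholders chosen for illustration. The sketch only shows how an XML namespace can stand for the database name while an attribute carries the database ID.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs standing in for database names.
NAMESPACES = {
    "swissprot": "urn:example:swissprot",
    "pdb": "urn:example:pdb",
}


def build_record(name, pointers):
    """Build a common-format record with pointers back to the original databases.

    `pointers` is a list of (database_name, database_id) pairs.
    """
    root = ET.Element("protein", {"name": name})
    for db_name, db_id in pointers:
        uri = NAMESPACES[db_name]
        ref = ET.SubElement(root, f"{{{uri}}}entry")  # namespace encodes the database
        ref.set("id", db_id)
    return root


def list_pointers(record):
    """Recover (database_name, database_id) pairs from a record."""
    uri_to_name = {uri: name for name, uri in NAMESPACES.items()}
    found = []
    for child in record:
        # ElementTree stores qualified tags as "{namespace-uri}localname".
        uri, _, _local = child.tag[1:].partition("}")
        found.append((uri_to_name[uri], child.get("id")))
    return found


if __name__ == "__main__":
    record = build_record(
        "example protein",
        [("swissprot", "EXAMPLE_0001"), ("pdb", "EXAMPLE_0002")],
    )
    print(ET.tostring(record, encoding="unicode"))
    print(list_pointers(record))
```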
In order to achieve seamless federation of the databases distributed in the net-
work (making use of the independence of the different database management policies),
a prototype of the Grid Database Infrastructure System was developed. In this way the
databases will be integrated virtually using Web mechanisms such as SOAP to enable
database services to operate within the XML scheme. We are now porting the system
to work in a practical Grid environment, using the Globus Toolkit 3 with OGSA-DAI
(Open Grid Services Architecture - Database Access and Integration,
http://www.ogsa-dai.org/).
3.2 A preliminary application example
Figure 2(b) shows a schematic view of an application example of the Data
Grid. By using the two metadata described above, one can extract a large number of
relationships between proteins and ligands. Furthermore, by associating those relation-
ships with protein-protein and compound-compound structural similarities, elucidating
the following becomes possible: (1) What parts of compound structures contribute to
their binding affinities and functions (activation or inhibition) to a specific (e.g., drug
target) protein? (2) What parts of protein structures (such as domains or motifs) are
recognized by the compound? (3) By combining (1) and (2), how are the protein-ligand
interactions systematically characterized? We are now implementing this application
on the Globus Toolkit 3 / OGSA-DAI environment.
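Question (1) can be illustrated with a deliberately simplified Python sketch. It is not the method implemented in the project: a plain set intersection over hypothetical substructure labels stands in for a real structural similarity measure, and the function name, the tuple layout of the relationships, and all identifiers in the example are assumptions made for illustration.

```python
def shared_features(relations, features, protein, interaction):
    """Return substructure labels common to all ligands of `protein`
    that have the given interaction type.

    relations: iterable of (protein_id, compound_id, interaction) tuples
    features:  dict mapping compound_id to a set of substructure labels
    """
    ligands = [c for p, c, kind in relations if p == protein and kind == interaction]
    if not ligands:
        return set()
    common = set(features[ligands[0]])
    for compound in ligands[1:]:
        common &= features[compound]  # keep only labels present in every ligand
    return common


if __name__ == "__main__":
    # Placeholder data, not taken from any database.
    relations = [
        ("PROT_A", "CMPD_1", "antagonist"),
        ("PROT_A", "CMPD_2", "antagonist"),
        ("PROT_A", "CMPD_3", "agonist"),
    ]
    features = {
        "CMPD_1": {"ring_x", "group_y", "group_z"},
        "CMPD_2": {"ring_x", "group_y"},
        "CMPD_3": {"group_z"},
    }
    print(shared_features(relations, features, "PROT_A", "antagonist"))
```

Features shared by all inhibitors of one protein but absent from its activators would be candidates for the parts of the compound structure that contribute to the function; a practical analysis would need graded similarity scores rather than exact set membership.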
4 Core Grid: Development of Grid Middleware for
Bioinformatics
4.1 Secure Grid environment
Seamless access to databases, which are geographically distributed on the Internet,
will play an important role in life science research. This also means
that the success of advanced life science after the paradigm shift will
strongly depend on the establishment of an efficient and effective research infrastruc-
ture, which will allow bio-researchers and scientists to seamlessly access databases in
an on-demand manner. Bio-related data are exchanged on public networks such as
the Internet; therefore, data security will become a pressing problem. Moreover, the
bio-related data to be used in the field of advanced life science is expected to contain
information related to an individual’s privacy, such as SNPs. In addition, this data may
contain company secrets, such as know-how and techniques in the process of drug de-
sign. In both cases, in order to reduce the risk of data leaks, data protection with strong
cryptography is one of the most important problems to be solved.
4.2 A Secure Grid Filesystem: GSI-SFS
Current Grid technologies offer us a few promising capabilities, such as third-
party data transfer, in terms of data operation on the Grid. However, data access to a
remote computer on the Grid requires knowledge of the Grid and its techniques. The
lack of convenient data access methods prevents bio-researchers and scientists who
really want to benefit from the Grid from making maximum use of the Grid.
To enhance user-convenience of data access on the Grid, we have developed
new Grid middleware that provides data access as a filesystem. The Grid filesystem, named
GSI-SFS, has been developed so that it can achieve five targeted requirements based on user
demands and requests.
1. Single Disk Image (SDI): Bio-scientists and researchers want to access data located
on a remote computer on the Grid without having to be aware of the data location.
Although the Globus Toolkit, which is a de facto standard implementation of the Grid,
provides GASS (Global Access to Secondary Storage) as a data access service, its data
access is performed on the basis of URIs (Uniform Resource Identifiers). This data access
method with URIs is not difficult to use because it provides a single-tree structure
of data. Nevertheless, users want to access data of interest as if the data were located
on a local disk. Thus, a data access method that provides a single disk image like NFS
on the Grid is required.
2. Data Confidentiality: On the Grid, a diversity of confidential bio-related data should
be exchanged in a secure manner. In order to protect such confidential data, strong en-
cryption must be performed for data in transit on the network so that malicious users
never see the content of the data.
3. Exclusiveness: Sometimes, access information to certain data, rather than the data
itself, must be protected. The following situation illustrates this concern. Assume that
pharmaceutical company A plans to secretly discover a medicine by focusing on a cer-
tain compound and needs to access a text-based public database containing the com-
pound for a detailed investigation on the compound. If the access information to the
data on the database is revealed to pharmaceutical company B, company B can guess
what kind of drug is being discovered by company A. In this case, company A, as a re-
sult, may incur a heavy loss. Taking this situation and others like it into consideration,
data access should be made exclusive.
4. On-demand remote data access: Access to remote data resources occurs intermittently
rather than continuously. For such data, access should be performed on demand by the user
in order to reduce the risk of data access information leaks.
5. User Convenience: Users do not want to use a data access method that forces them
to perform certain procedures such as typing passwords and passphrases many times.
Although a trade-off problem between security and user convenience exists, a data ac-
cess method that allows users to access a variety of data distributed on the Grid in a
single sign-on manner is required.
In order to realize a data access method that achieves the five requirements
described above, we have combined two existing promising technologies to realize a
secure Grid filesystem: GSI (Grid Security Infrastructure)11) and SFS (Self-Certifying
Filesystem)12) .
Figure 3 shows a system overview of the secure Grid filesystem named GSI-
SFS. Originally, SFS was designed so that the SFS server/client module mediates NFS
communication, which allows each user to hold a single disk image. However, because
a pair of RSA keys must be managed for each SFS server and client pair, a serious
problem in scalability exists in SFS. In order to avoid the complexity of key manage-
ment and thus enhance the scalability and user-convenience of the filesystem, we have
developed an authentication module composed of a GSI-SFS server and a client, which
authenticate each other with X.509 certificates instead of relying on SFS. The most advanced
feature of the authentication module is that it allows users to benefit from single sign-on
functionality. In other words, after presenting their credentials, such as a passphrase, once,
users can freely access data of interest through the underlying
SFS functionalities without typing passphrases many times.
Fig. 3 System overview of GSI-SFS. Components shown: Application, SFS Agent, GSI-SFS Client, GSI-SFS Server, SFS Client, SFS Server, NFS 3 Client, NFS 3 Server, and Kernel (shown twice), with the application connected through system calls. Step labels: 1. Invoke; 2. GSI authentication/encryption; 3. Public key; 4. Secret key, etc.; 5. Secret key.
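The scalability concern about key management can be made concrete with a small counting sketch. It rests on two assumptions of ours: that plain SFS needs one managed RSA key pair per server-client pair, as stated above, and that with X.509 certificates each client and each server holds a single certificate issued by a commonly trusted authority. The function names are hypothetical.

```python
def per_pair_key_pairs(n_clients, n_servers):
    """Key pairs to manage when every server-client pair needs its own."""
    return n_clients * n_servers


def per_entity_certificates(n_clients, n_servers):
    """Certificates to manage when every client and server holds exactly one."""
    return n_clients + n_servers


if __name__ == "__main__":
    for clients, servers in [(10, 5), (100, 20), (1000, 50)]:
        print(
            clients,
            servers,
            per_pair_key_pairs(clients, servers),
            per_entity_certificates(clients, servers),
        )
```

Under these assumptions the per-pair count grows with the product of the numbers of clients and servers, whereas the per-entity count grows only with their sum.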
At present we plan to further develop high-level functionalities such as VO
(Virtual Organization)-based access control and replica management on top of GSI-SFS
so that it satisfies the needs and demands that bioscientists place on a Grid filesystem.
5 Summary
In this paper, the strategic view of the BioGrid project promoted by Osaka
University is described with its research products. The Computing Grid group has
explored the best solution for the computationally intensive problem of biological sim-
ulations at multiple levels. The new platform, BioPfuga, however, is still preliminary,
and more semantic information still needs to be added with the development of a GUI
for actual applications. The Data Grid group has built the Data Grid environment to al-
low researchers and scientists to seamlessly retrieve biological data of interest for their
analyses with the promising OGSA-DAI. Moreover, in order to support the research
and development implemented by other groups, the Core Grid group is responsible for
the creation of Grid middleware that can satisfy demands and requests from researchers
and scientists who are engaged in life science. We would like to promote the BioGrid
project with a focus on the establishment of a next-generation research infrastructure,
namely, a BioGrid environment for advanced life science.
References
1) Sakuma, T., Kashiwagi, H., Takada, T., and H. Nakamura, “Ab initio MO study of the
chlorophyll dimer in the photosynthetic reaction center. I. A theoretical treatment of
the electrostatic field created by the surrounding proteins,” Int. J. Quant. Chem., 61, 1,
pp.137-151, 1997.
2) Fukunishi, Y., Mikami, Y., Nakamura, H., “The filling potential method: a method for
estimating the free energy surface for protein-ligand docking,” J. Phys. Chem. B, 2003, in
press.
3) Nakajima, N., Higo, J., Kidera, A., and Nakamura, H., “Free energy landscapes of pep-
tides by enhanced conformational sampling,” J. Mol. Biol., 296, 1, pp. 197-216, 2000.
4) Narumi, T., Susukita, R., Ebisuzaki, T., McNiven, G., and Elmegreen, B., “Molecular
dynamics machine: Special-purpose computer for molecular dynamics simulations,” Mol.
Simulation, 21, 5/6, pp. 401-415, 1999.
5) Squyres, J. M., Lumsdaine, A., George, W. L., Hagedorn, J. G., and Devaney, J. E., “The
Interoperable Message Passing Interface (IMPI) Extensions to LAM/MPI,” in Proceed-
ings of MPIDC2000, 2000.
6) Huzinaga, S., Gaussian Basis Sets for Molecular Calculations, Elsevier, New York, 1984.
7) Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., “Comparison of simple potential
functions for simulating liquid water,” J. Chem. Phys., 79, 2, pp. 926-935, 1983.
8) Jorgensen, W.L., “Optimized Intermolecular Potential Functions for Liquid Alcohols,” J.
Phys. Chem., 90, 7, pp. 1276-1284, 1986.
9) The Gene Ontology Consortium, “Creating the gene ontology resource: design and im-
plementation,” Genome Research, 11, 8, pp.1425-1433, 2001.
10) Matsuda, H., “Development of Bio-Information Environment on the Grid,” GlobusWorld,
San Diego, 2003.
11) Foster, I., Kesselman, C., Tuecke, S., Volmer, J., Welch, V., “A national-scale authentica-
tion infrastructure”, IEEE Computer, 33, 12, pp. 60-66, 2000.
12) Mazières, D., “Self-certifying file system”, PhD thesis, MIT, 2000.
Acknowledgment This work was supported by the IT-program of the Ministry
of Education, Culture, Sports, Science and Technology of Japan (MEXT). The authors
thank the Biogrid project members. This product includes software developed by
and/or derived from the Globus project (http://www.globus.org/).