488 Genome Informatics 13: 488–489 (2002)
New Features of PDBj-ML,
an XML Format for Protein Data Bank
Nobutoshi Ito1,2 Hisashi Sakamoto1,2
Kaori Kobayashi1,2 Yoshikazu Kaneta4
Yohei Kawaguchi5 Takenao Ohkawa5 Haruki Nakamura1,3
email@example.com firstname.lastname@example.org email@example.com
Institute for Protein Research, Osaka University, 3-2 Yamadaoka, Suita, Osaka 565-
BIRD, Japan Science and Technology Corporation, 5-3 Yonbanchou, Chiyoda-ku,
Tokyo 102-0081, Japan
Genome Science Center, RIKEN, Yokohama Institute, 1-7-22 Suehiro-chou, Tsurumi,
Yokohama 230-0045, Japan
Graduate School of Engineering, Osaka University, 2-1 Yamadaoka, Suita, Osaka 565-
Graduate School of Information Science and Technology, Osaka University, 2-1 Yamadaoka,
Suita, Osaka 565-0871, Japan
Keywords: protein data bank, XML, XPath
1 Introduction
The Protein Data Bank (PDB) has been a primary source of structural information on biological
macromolecules [1, 2]. The advance of structural genomics is expected to produce an enormous
amount of structure and function data on biological macromolecules. Efficient interfaces between
such data and various bioinformatics applications will be essential to exploit their potential. In
an attempt to provide such an interface, we have developed PDBj-ML, an XML format for PDB.
2 Method and Results
The basic structure of PDBj-ML, which is defined in XML Schema, is based on that of the Macromolecular
Crystallographic Information Format (mmCIF) [3], with some modifications to make use of features
of the XML format. To circumvent the large file sizes that are often problematic in XML implementations,
atomic information such as coordinates and temperature factors is stored in a separate file. As such
data are not expected to be used as search terms, this separation also helped to achieve faster searches.
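The separation described above can be illustrated with a minimal Python sketch. The element names (`entry`, `atoms`, `atom_data`) and the sample values are hypothetical and are not taken from the PDBj-ML schema; the sketch only shows the idea of moving bulky per-atom records out of the searchable header document.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry; tag names and values are illustrative only,
# not the actual PDBj-ML schema.
ENTRY_XML = """
<entry id="demo">
  <resolution>2.0</resolution>
  <author>Doe</author>
  <atoms>
    <atom id="1" x="10.1" y="5.2" z="-3.4" b="21.7"/>
    <atom id="2" x="11.0" y="6.3" z="-2.9" b="19.4"/>
  </atoms>
</entry>
"""

def split_entry(xml_text):
    """Return (header, atom_doc): the entry without atoms, and the atoms alone."""
    header = ET.fromstring(xml_text)
    atoms = header.find("atoms")
    header.remove(atoms)  # the header keeps only the searchable metadata
    # The atom document records which entry it belongs to.
    atom_doc = ET.Element("atom_data", {"entry": header.get("id")})
    atom_doc.extend(list(atoms))
    return header, atom_doc

header, atom_doc = split_entry(ENTRY_XML)
```

In this sketch a search only has to parse `header`, while `atom_doc` can be written to a second file and loaded on demand.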
Several search methods similar to those of the current PDB, such as searches by resolution and
author name, are provided to users via a MySQL database (Fig. 1). Limited searches using XPath
expressions are also available. The search is as rapid as the original PDB search facility.
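As an illustration of what a limited XPath search over header documents might look like, the following Python sketch uses the XPath subset of the standard `xml.etree.ElementTree` module. The document layout, tag names, and the helper functions `by_author` and `by_max_resolution` are assumptions made for this example, not the actual PDBj-ML search interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue of header documents; not real PDB entries.
CATALOG_XML = """
<entries>
  <entry id="a1"><resolution>1.8</resolution><author>Doe</author></entry>
  <entry id="b2"><resolution>2.5</resolution><author>Roe</author></entry>
  <entry id="c3"><resolution>1.5</resolution><author>Doe</author></entry>
</entries>
"""

root = ET.fromstring(CATALOG_XML)

def by_author(root, name):
    """IDs of entries whose <author> child has exactly this text."""
    return [e.get("id") for e in root.findall(f"./entry[author='{name}']")]

def by_max_resolution(root, limit):
    """IDs of entries with resolution <= limit (in angstroms).

    ElementTree's XPath subset has no numeric comparison, so the path
    selects entries that have a <resolution> child and Python filters them.
    """
    hits = []
    for e in root.findall("./entry[resolution]"):
        if float(e.findtext("resolution")) <= limit:
            hits.append(e.get("id"))
    return hits
```

Calling `by_author(root, "Doe")` or `by_max_resolution(root, 2.0)` returns the matching entry IDs in document order.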
In addition to the format conversion, an effort to enhance the content of the database is also under way.
Missing data, such as those about data collection and model refinement, are being supplemented from
the literature. At the same time, PDB Remark Transcoder, a program to extract relevant information from
the REMARK lines of PDB files and to describe it in XML, is being developed. Furthermore, information
about biochemical functions of the molecules at the residue/atom level, rather than at the molecular level
as in the conventional PDB, is also included.
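The idea of extracting information from REMARK lines and describing it in XML can be sketched as follows. This is not the PDB Remark Transcoder itself: the regular expressions, the output tag names, and the sample lines (written in the style of PDB REMARK records) are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Sample lines in the style of PDB REMARK records (illustrative values).
SAMPLE_REMARKS = [
    "REMARK   2 RESOLUTION.    2.00 ANGSTROMS.",
    "REMARK   3   R VALUE            (WORKING SET) : 0.184",
    "REMARK   4 FREE TEXT THAT NO PATTERN MATCHES",
]

# (output tag, pattern) pairs; the tag names are hypothetical.
PATTERNS = [
    ("resolution",
     re.compile(r"^REMARK\s+2\s+RESOLUTION\.\s+([\d.]+)\s+ANGSTROMS")),
    ("r_value_working",
     re.compile(r"^REMARK\s+3\s+R VALUE\s+\(WORKING SET\)\s*:\s*([\d.]+)")),
]

def transcode(lines):
    """Build an XML element holding every value a known pattern extracts."""
    root = ET.Element("remarks")
    for line in lines:
        for tag, pattern in PATTERNS:
            match = pattern.match(line)
            if match:
                ET.SubElement(root, tag).text = match.group(1)
                break  # at most one pattern per line
    return root

xml_out = ET.tostring(transcode(SAMPLE_REMARKS), encoding="unicode")
```

Lines that match no pattern are simply skipped here; a real transcoder would need many more patterns and some handling of free-text remarks.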
Figure 1: System diagram of the database.
We have converted the entire content of the current PDB into PDBj-ML, and the database is regularly
updated. Use of the powerful schema definition language XML Schema, rather than a conventional
DTD, made validation of the data stricter. In fact, validation of our XML database revealed various
problems in the mmCIF files, which demonstrates this advantage.
To exploit other advantages of XML further, we are planning to implement services through SOAP
so that users can access the database directly, rather than through the web interface.
We thank Dr. Masami Kusunoki and Mr. Takashi Kosada (IPR, Osaka Univ.) for their discussion
and Ms. Chisa Kamata, Yukiko Shimizu and Aki Takahashi for their contribution to the data curation.
This study was supported by a grant-in-aid from the Institute for Bioinformatics Research and Development,
Japan Science and Technology Corporation (BIRD-JST).
[1] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N.,
and Bourne, P.E., The protein data bank, Nucleic Acids Research, 28(1):235–242, 2000.
[2] Bernstein, F.C., Koetzle, T.F., Williams, G.J., Meyer, E.E.Jr., Brice, M.D., Rodgers, J.R.,
Kennard, O., Shimanouchi, T., and Tasumi, M., The protein data bank: A computer-based archival
file for macromolecular structures, J. Mol. Biol., 112(3):535–542, 1977.
[3] Bourne, P.E., Berman, H.M., McMahon, B., Watenpaugh, K.D., Westbrook, J., and
Fitzgerald, P.M.D., The macromolecular crystallographic information file (mmCIF), Meth. Enzymol.,