Automatic Synchronization and Distribution of Biological Databases and Software over Low-Bandwidth Networks among Developing Countries
Proposed by: Prince of Songkla University, Thailand
The Asia-Pacific Bioinformatics Network (APBioNet)
Background
Bioinformatics and the need for network bandwidth
Bioinformatics involves the collection, organization and analysis of large amounts of biological data, using networks of computers and databases. Publicly available biological databases, such as GenBank, are currently nearly doubling in size every year; taken together, the essential databases are now approaching the terabyte level. Bioinformatics centres around the world have to update their database repositories regularly with the latest releases. This is normally done by file transfer over FTP, but the large and growing sizes of these databases mean that a large network bandwidth is required to ensure that new releases are downloaded quickly and without failure. To assist with this, a network of database mirror sites was established in several countries worldwide in 1997 under the Bio-Mirror project. The Singapore node, for example, is hosted at the National University of Singapore and has mirrors of databases updated as often as twice weekly.
Developing countries in the Asia-Pacific region are just moving into this new field of bioinformatics, but the computational infrastructure and network bandwidth available in those countries are still at a primitive level compared with those in more developed countries. Network bandwidth within these countries is still very low, often only 512kbps – 1Mbps, and the low reliability of connections means that breaks and aborts in downloads are common. For example, the Centre of Excellence in Molecular Biology in Pakistan has only a 2Mbps network, while the Prince of Songkla University in Thailand, one of the larger institutes in that region, has only a 10Mbps connection serving a community of about 20,000 professors and students.
This can be compared with a university in a more advanced country, such as the National University of Singapore, which has international links to the Internet / Internet 2 at 155Mbps. These figures are far below those of the advanced nations, which typically enjoy Gigabit Internet access and lambda networking. So, in spite of the Bio-Mirrors nodes being made available, many developing countries still face a major problem in regularly updating these databases. With the large and growing sizes of these databases, the problem will only get worse in the coming years, because the growth of the databases outstrips the rate at which bandwidth reaches the end user. For example, even with APAN[1] or TEIN2[2], there is still significant difficulty in reaching developing-country universities with bandwidth that guarantees regular nightly updates of databases.

[1] APAN – Asia Pacific Advanced Network (http://www.apan.net)
[2] TEIN2 – Trans-Eurasia Information Network (http://www.tein2.net)

A revolution in file sharing technology
In the late 1990s, the Internet community witnessed the start of a major revolution in the way people share files: Peer-to-Peer (P2P) file exchange was introduced with the wildly popular Napster in 1999. Internet users used it to share mp3 music and video files throughout the world. P2P technology does not just exchange files between a central server and the multiple clients that connect to it, but focuses on using clients to exchange files amongst one another. The technology continued to evolve and improve, with the second-generation P2P FastTrack / Kazaa network in 2001. In 2002, the BitTorrent protocol was introduced. This third-generation P2P technology was a major advance over previous P2P protocols. With BitTorrent, a large file to be distributed is broken up into smaller fragments, typically around a quarter of a megabyte each. These fragments are distributed to each peer, and amongst peers, in a random manner, and are reassembled at the requesting machine. This difference between traditional client/server distribution of files and 3rd generation P2P distribution is illustrated in Figures 1 and 2 below:
These figures illustrate the power of the concept introduced by 3rd generation P2P technology. As the number of downloading clients in the traditional distribution architecture increases, the demands for bandwidth placed on the server only increase, leading to a bottleneck. In the 3rd generation P2P architecture, however, the more peers there are, the more nodes are available to distribute fragments of the file. High demand actually leads to greater throughput, as more bandwidth from additional nodes becomes available to the group.
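As a minimal sketch of the fragment-and-reassemble idea described above, the following Python snippet cuts a payload into quarter-megabyte pieces, records a SHA-1 digest per piece (the hash BitTorrent uses for piece verification), shuffles the arrival order to mimic pieces coming from different peers, and reassembles the file. The function names and the in-memory payload are illustrative assumptions and are not taken from any existing client.

```python
import hashlib
import random

PIECE_SIZE = 256 * 1024  # a quarter of a megabyte, as quoted in the text


def split_into_pieces(data: bytes, piece_size: int = PIECE_SIZE) -> list[bytes]:
    """Cut the payload into fixed-size pieces (the last one may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def piece_hashes(pieces: list[bytes]) -> list[bytes]:
    """One SHA-1 digest per piece, so each piece can be checked on arrival."""
    return [hashlib.sha1(piece).digest() for piece in pieces]


def reassemble(received: dict[int, bytes], hashes: list[bytes]) -> bytes:
    """Verify every piece against its digest, then join them in index order."""
    for index, expected in enumerate(hashes):
        piece = received.get(index)
        if piece is None or hashlib.sha1(piece).digest() != expected:
            raise ValueError(f"piece {index} is missing or corrupt")
    return b"".join(received[i] for i in range(len(hashes)))


# Hypothetical 1 MiB payload standing in for a database release.
payload = bytes(range(256)) * 4096
pieces = split_into_pieces(payload)
hashes = piece_hashes(pieces)

# Pieces arrive in random order, as they would from several peers.
arrival_order = list(range(len(pieces)))
random.shuffle(arrival_order)
received = {index: pieces[index] for index in arrival_order}

restored = reassemble(received, hashes)
print(len(pieces), "pieces reassembled, identical to original:", restored == payload)
```

Because each piece is verified independently, a corrupt or interrupted piece can be requested again from any other peer without restarting the whole download.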
Using P2P technology in distributing biological data
From the comparison above, it can be seen that 3rd generation P2P technology offers to solve simultaneously the two major problems plaguing the distribution of biological data to developing countries:
1) Low international bandwidth
With a P2P architecture, downloads need not come from a central server in another country. Every peer that connects to synchronize its databases or software, whether from the same institute, state, country or region, provides additional bandwidth that speeds up the overall download rate of all the peers.
2) Unreliable connections
In the conventional server/client architecture, all downloads come from a single server, and if this connection becomes very slow or unreliable, there can be no 'failover' to continue downloading automatically from another source. In the 3rd generation P2P architecture, however, downloads are automatically sourced from the peers with the best connections, and if a connection experiences a bottleneck, downloads automatically continue from the next best connections.
This can be applied in three areas: the distribution of biological software, courseware, and databases.
1. Using 3rd generation P2P technology to distribute biological computing software
(size of files ~ < 1GB)
In the past few years, IDRC-PAN has funded the APBioBox project between the National University of Singapore and APBioNet. The outcome is a publicly accessible compilation of frequently used bioinformatics software, with built-in, easily deployable Grid software. In view of rapid developments in this field, we have moved on to the next step of deploying the APBioBox compilation package in a remastered version of Knoppix 3.6[3], which is now called APBioKnoppix. This is a 600 Mbyte LiveCD which is freely distributable to educational institutions for teaching purposes. Since then, it has been used in the practical hands-on courses offered by the S* Life Science Informatics Alliance[4]. Distribution of this software by physical CD has been the preferred option for low-bandwidth developing countries, as it is difficult to transfer an entire LiveCD over the Internet given bandwidth constraints.
[3] Knoppix (http://www.knoppix.org) – a Linux OS distribution that boots and runs fully from a CD
[4] S* Life Science Informatics Alliance (www.s-star.org)
In the past six months, we have upgraded the release to APBioKnoppix2. This consists of a LiveCD plus the ability to maintain a stable, persistent Knoppix image of the home directories etc. This feature is based on the UnionFS support in Knoppix 3.8 and upwards. As a result, users can easily install any new software they wish and still keep it across sessions, unlike with previous LiveCDs. However, the size of the LiveCD has increased to 703 Mbytes, and we anticipate that, with the proliferation of bioinformatics software, we will have to go beyond CD capacity to DVD sizes. To overcome this problem, we are planning to distribute bootable CDs or their ISO images, and to distribute separately the full maxi-ISO release containing more than 700 Mbytes. Knoppix has a “bootfrom=” feature that allows the bootable CD to boot from a softcopy ISO image on the hard disk. This copy can be downloaded from the Internet and need not be limited to the CD maximum of 700 Mbytes. Downloading 0.7 to 1 Gigabyte of data, or more, to the end user's machine requires good network bandwidth, particularly for rapid dissemination in view of frequent releases. Therefore, we propose that the P2P approach to disseminating databases to bioinformatics nodes in Asia is also applicable to disseminating the latest releases of bioinformatics software.
2. Using 3rd generation P2P technology to distribute bioinformatics training material
(size of files ~ 1GB – 10GB)
For the past five years, the S* Life Science Informatics Alliance has been actively promoting an Introductory Bioinformatics Course. The course material consists of audio-video streaming material as well as AV records of online e-meetings (emeet.nus.edu.sg) of Problem Based Learning (PBL) sessions, which include PowerPoint presentations and their annotations, text chats, audio discussions etc., recorded into a session recording. Six courses have been run so far, training more than 1000 students. In these 3-monthly courses, students take one lecture and one MCQ assessment a week, participate in an online discussion forum to revise and discuss the lectures given by eminent bioinformatics professors from S* Alliance members such as Stanford and Karolinska, and take part in PBL-based mini-projects. The course material built up over the years needs to be rapidly disseminated to students, especially those in rural institutions and universities in developing countries. P2P nodes set up in topologically closer locations to these end users are a critical development of the technology, and will help overcome the bandwidth limitation that is currently the major hindrance to greater adoption of this pedagogy. We hope that, by establishing collaboration between countries like Thailand and Singapore, we can establish the metrics and parameters for efficient dissemination of large datasets, of the software that utilizes these datasets, and of the pedagogical material that helps end users understand and make effective use of both.
3. Using 3rd generation P2P technology to distribute biological databases
(size of files ~ 10GB - 100GB)
The Bio-Mirrors sites were established before P2P technology was introduced. Given the increasing sizes of biological databases and the emerging thrust of developing countries into bioinformatics, we propose a new system of global biological database distribution and synchronization based on 3rd generation P2P technology. Such a system would realize a much-needed next generation of Bio-Mirrors, built on the latest file distribution technologies, that can address the problems faced by developing countries.
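To put the three size classes above in perspective, the arithmetic below estimates ideal transfer times over the link speeds quoted in the Background section. It is a rough sketch that assumes a dedicated link, decimal units (1 GB = 8 × 10^9 bits) and no protocol overhead or retries, so real downloads over shared, unreliable links would take longer. The function name and the `efficiency` parameter are illustrative assumptions.

```python
def transfer_hours(size_gb: float, link_mbps: float, efficiency: float = 1.0) -> float:
    """Ideal transfer time in hours for size_gb gigabytes over a link_mbps link.

    efficiency is the assumed fraction of the nominal link rate that is
    actually usable (1.0 means the whole link, with no overhead).
    """
    bits = size_gb * 8e9                       # decimal gigabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600


# Link speeds quoted in the Background section (in Mbps).
LINKS_MBPS = {"512 kbps": 0.512, "2 Mbps": 2.0, "10 Mbps": 10.0, "155 Mbps": 155.0}
# Upper ends of the three size classes used in this proposal (in GB).
SIZES_GB = {"software": 1, "courseware": 10, "databases": 100}

for content, size_gb in SIZES_GB.items():
    for link, mbps in LINKS_MBPS.items():
        hours = transfer_hours(size_gb, mbps)
        print(f"{content:<11} {size_gb:>4} GB over {link:<9} {hours:9.1f} h")
```

Under these assumptions, a 100 GB database release over a fully dedicated 2 Mbps link already needs roughly 111 hours, which is incompatible with nightly or even twice-weekly updates.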
Project Objectives
1) To develop a client application based on 3rd generation P2P protocols, or extend an existing open-source one, for use in the distribution of biological software, courseware, and databases
2) To set up and test the performance of this P2P distribution network for biological software, courseware, and databases, with nodes in countries in the Asia-Pacific region, starting with Singapore and Thailand and extending beyond. These tests will include:
o Benchmarking performance against more traditional rsync and FTP techniques
o Assessing the effect of bandwidth saturation in using P2P
o Identifying the P2P architecture and topology variations most suited to distributing datasets of different sizes
Project Beneficiaries
The major beneficiaries will be research and educational institutes in developing countries, especially those in the Asia-Pacific region that are just moving into bioinformatics but do not have the necessary large Internet bandwidth available, such as Thailand, India, China, Indonesia, and the Philippines.
Project Sustainability
Once the research and development work under this funding is completed and the test deployment results prove its feasibility, APBioNet will coordinate the next level of adoption by its constituent members in the Asia-Pacific region, as well as through its link-ups with other regional and global networks such as ASTRENA. Our track record includes the Bio-Mirrors project, which is now institutionalized; the continual development of APBioBox, APBioKnoppix and APBioKnoppix2, as proof of active development over a sustained period; and a history of consistently providing rigorous, certified bioinformatics education online to more than a thousand students over the past five years.
Project Methodology
Objective 1: Development or extension of an existing open-source P2P client application for use in the distribution of biological software, courseware, and databases
1) First, the features required of a client application for use in automatic synchronization and distribution of biological software, courseware, and teaching material will be identified. These requirements may include such items as:
a. Scheduling specific items for automatic download at regular intervals
b. Running scripts after download to format the databases, and to link them up with the deployed tools that utilize these databases
2) The available open-source 3rd generation P2P client applications will be studied, to assess whether an existing one can be suitably extended for use in this application
3) If a suitable application has been identified in Step (2), the necessary extensions will be developed on top of that tool. If no suitable application has been found, a new client will have to be developed based on 3rd generation P2P protocols.
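Requirements (a) and (b) above can be sketched as follows. This is only an illustration of the intended behaviour, not the design of the eventual client: the `SyncJob` class, the `run_due_jobs` function and the `download` callback are hypothetical names, and the post-download command is a placeholder for whatever database formatting script a site would actually run.

```python
import subprocess
import sys
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class SyncJob:
    """One item scheduled for automatic download at a regular interval."""
    name: str
    interval: timedelta
    post_download: list[str]              # command run after a successful download
    last_run: Optional[datetime] = None

    def is_due(self, now: datetime) -> bool:
        return self.last_run is None or now - self.last_run >= self.interval


def run_due_jobs(jobs: list[SyncJob], now: datetime,
                 download: Callable[[str], bool]) -> list[str]:
    """Download every due item, then run its post-download script."""
    completed = []
    for job in jobs:
        if not job.is_due(now):
            continue
        if not download(job.name):
            continue                      # last_run unchanged, so it is retried next pass
        subprocess.run(job.post_download, check=True)
        job.last_run = now
        completed.append(job.name)
    return completed


# Placeholder hook: a real deployment would call its database formatting script here.
format_hook = [sys.executable, "-c", "print('formatting database')"]
jobs = [SyncJob("genbank", timedelta(days=1), format_hook)]


def fake_download(name: str) -> bool:
    """Stand-in for the P2P transfer; always reports success."""
    return True


print(run_due_jobs(jobs, datetime(2006, 1, 1), fake_download))
```

Keeping the transfer behind a callback means the same scheduling logic could drive a P2P client, rsync or FTP, which is convenient for the benchmarking planned under Objective 2.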
Objective 2: Setup of a test-bed with nodes in countries in the Asia-Pacific region, starting with Singapore and Thailand
1) The Centre for Genomics and Bioinformatics Research at the Prince of Songkla University (PSU), Thailand, will acquire, set up, and test a bioinformatics server that will serve as a node in the P2P distribution network.
2) The other test node, in Singapore, will be provided by the National University of Singapore. This node and the one at PSU will be used to carry out a series of tests to prove the effectiveness of the distribution network.
3) Once this setup has been completed and tested, APBioNet will coordinate the addition of more test nodes amongst its members based in other countries, to test the effect of scaling up the network.
Tests to be performed may include:
o Transfer rate benchmarks for performance of the P2P network against more conventional rsync or FTP techniques
o Assessing effects of bandwidth saturation in using P2P
o Identifying optimal P2P implementation schemes for different applications / file sizes (~1GB biological software / ~10GB courseware / ~100GB databases):
i. Full P2P
ii. Leech mode and partial P2P
iii. Full leech mode for low-bandwidth / last-mile users, and P2P for peers connected via high-bandwidth TEIN2 or APAN Gbps networks
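Before running such tests, the three schemes can be compared on paper with the standard textbook fluid model of file distribution, which gives a lower bound on the time needed for every node to obtain a full copy. The sketch below is an idealization, not a model of BitTorrent's actual behaviour: a leech is represented simply as a peer with zero upload rate, and all rates and the node count are made-up illustrative values rather than measurements.

```python
def client_server_time(file_bits: float, server_up: float,
                       downloads: list[float]) -> float:
    """Lower bound when the server alone uploads a full copy to every client."""
    n = len(downloads)
    return max(n * file_bits / server_up, file_bits / min(downloads))


def p2p_time(file_bits: float, server_up: float,
             uploads: list[float], downloads: list[float]) -> float:
    """Fluid-model lower bound when peers re-share; a leech has upload rate 0."""
    n = len(downloads)
    total_up = server_up + sum(uploads)
    return max(
        file_bits / server_up,        # the seed must send each bit at least once
        file_bits / min(downloads),   # the slowest downloader limits completion
        n * file_bits / total_up,     # n copies must pass through the total upload capacity
    )


# Illustrative assumptions: 10 GB of courseware, a 10 Mbps seed, 20 peers,
# each with a 10 Mbps downlink and (when sharing) a 2 Mbps uplink.
FILE_BITS = 10 * 8e9
SERVER_UP = 10e6
N = 20
DOWN = [10e6] * N
PEER_UP = 2e6

full_p2p = p2p_time(FILE_BITS, SERVER_UP, [PEER_UP] * N, DOWN)
partial = p2p_time(FILE_BITS, SERVER_UP,
                   [PEER_UP] * (N // 2) + [0.0] * (N // 2), DOWN)
full_leech = p2p_time(FILE_BITS, SERVER_UP, [0.0] * N, DOWN)
baseline = client_server_time(FILE_BITS, SERVER_UP, DOWN)

for label, seconds in [("full P2P", full_p2p),
                       ("half leech, half P2P", partial),
                       ("full leech", full_leech),
                       ("client/server baseline", baseline)]:
    print(f"{label:<24} {seconds / 3600:6.1f} h")
```

In this model a network of pure leeches degenerates to the client/server case, which is why the proportion of sharing peers is one of the variables worth measuring on the test-bed.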
Project Timeline
Timeline
Milestone
Identification of features and functionality required for a P2P client application for use in database distribution and synchronization
Study of available open-source P2P clients to identify those suitable for use
PSU to set up bioinformatics server
Development / extension of P2P client application for use in distribution of biological databases
Test of P2P distribution of biological databases between Singapore (NUS) and Thailand (PSU)
Test of P2P distribution of biological databases between more nodes in Asia-Pacific, coordinated by APBioNet
Pan Asia-Pacific workshop on use of P2P technology in bioinformatics, and results of project, to set the stage for wider adoption amongst other Asia-Pacific countries
Jan 2006 – Mar 2006
Apr 2006 – Dec 2006
Jan 2007 – Jun 2007
Jun 2007
Project Outputs
o Open-source client software based on 3rd generation P2P technology, for automatic downloading and synchronization of biological software, courseware, and databases. This software will be made available free of charge through APBioNet.
o Reports on the performance and results of the test-bed setup, which will be published in respected journals and conferences and made available through the APBioNet website.
Project Monitoring
A progress report will be submitted for evaluation every 6 months.
Project Budget
[Please refer to attached Application Form and Appendix B]
Project Applicants
Co-Principal Investigators:
Amornrat Phongdara
Director, Centre for Genomics and Bioinformatics Research, Faculty of Science, Prince of Songkla University, Thailand
Tan Tin Wee
Associate Professor, Dept of Biochemistry, National University of Singapore
Secretariat, APBioNet
Secretariat, S* Life Science Informatics Alliance
Chairman, Academic Committee, ASEAN Virtual Institute of Science and Technology (AVIST)
Board Director, International Society for Computational Biology (ISCB)
Darran Nathan
Project Manager, APBioNet
Course Coordinator, S* Life Science Informatics Alliance
Tong Joo Chuan, Victor
Project Manager, APBioNet
Secretariat, S* Life Science Informatics Alliance
Choo Khar Heng, Justin
Project Manager, APBioNet
Secretariat, S* Life Science Informatics Alliance
Mark De Silva
IT Manager for Bioinformatics Resources, Office of Life Sciences, National University of Singapore
Lim Kuan Siong
IT Manager for Bioinformatics Resources, Office of Life Sciences, National University of Singapore
Applicant Organizations
Centre for Genomics and Bioinformatics Research, Prince of Songkla University
The Centre for Genomics and Bioinformatics Research was set up in December 2004 by a group of molecular biologists at the Prince of Songkla University, one of the most prestigious universities in the southern region of Thailand. Biology is changing rapidly, and the integration of information technology into biological research is becoming very important. We recognize the need for university infrastructure and for personnel with skills in mathematics and computing as well as in the biological sciences. The research group has established comprehensive wet-lab facilities in molecular biology and genomics, but has yet to deploy the backend computational infrastructure needed to support research and teaching in bioinformatics and computational biology. In its first five-year plan, the Centre is seeking funding and partners for the execution of its bioinformatics technology initiatives, including human resources development and investment in research facilities and infrastructure.
Asia-Pacific Bioinformatics Network (APBioNet)
The Asia Pacific Bioinformatics Network (APBioNet) is a non-profit, non-governmental, international organization. It focuses on the promotion of bioinformatics in the Asia Pacific Region. Since 1998, its mission has been to pioneer the growth and development of bioinformatics awareness, training, education, infrastructure, resources and research amongst member countries and economies. Its work includes the technical coordination and liaison with other international bodies such as the EMBnet. APBioNet has more than 20 organizational and 300 individual members from over 12 countries in the region, and members include those from industry, academia, research, government, investors and international organisations. APBioNet has coordinated or co-organised more than 20 international and national meetings in cooperation with members in different economies. It is spearheading a number of key bioinformatics initiatives in the region in collaboration with international organisations such as APAN, APEC, S* Alliance and A-IMBN.
Other Information
This proposal has received in-principle endorsement from the East Asia Bioinformatics Network (EABN) at the ASEAN COST PLUS THREE workshop (1st EABN workshop, Busan, Sep 22-23, 2005), and strong interest in participation from countries such as Korea, Malaysia and Thailand. This signifies the importance of the project in establishing the next generation of biological information nodes based on the latest file exchange technology.
Proposed Collaborators include:
1) National University of Singapore (NUS) and the Prince of Songkla University (PSU), Thailand, as the first nodes for initial tests
o NUS with high-bandwidth 155Mbps – 1Gbps links to Internet 2
o PSU with low-bandwidth network connections over multiple campuses
2) East Asia Bioinformatics Network nodes under the proposed Virtual Bioinformatics Centres initiative of APBioNet. It is expected that at least half a dozen nodes will be set up over the next two years in Singapore, Malaysia, Indonesia, Philippines and Thailand, using the TEIN2 network infrastructure as it rolls out, with the PLUS Three nations, China, Japan and Korea, starting with the National Genomics Information Centre (NGIC), KRIBB, Korea.
3) Cooperating nodes in India and ASEAN, under the ASEAN-India Bioinformatics Network (to be discussed at the ASEAN-India Bioinformatics HRD workshop, Oct 2005 in Hyderabad), starting with the Centre of Excellence in Medical Informatics, and the Centre for DNA Fingerprinting and Diagnostics (CDFD), which is the official node of the European Molecular Biology Network (EMBnet).
References
[1] Don Gilbert, Yoshihiro Ugawa, Markus Buchhorn, Tan Tin Wee, Akira Mizushima, Hyunchul Kim, Kilnam Chon, Seyeon Weon, Juncai Ma, Yoshihiro Ichiyanagi, DerMing Liou, Somnuk Keretho, and Suhaimi Napis (2004) Bio-Mirror project for public bio-data distribution. Bioinformatics 20: 3238-3240 (doi:10.1093/bioinformatics/bth219; Bioinformatics Advance Access published on April 1, 2004).
[2] Ranganathan S, Subbiah S and Tan T.W. (2002) APBioNet: The Asia Pacific regional consortium for bioinformatics. Applied Bioinformatics 1: 101-105.
[3] Lim YP, Hoog JO, Gardner P, Ranganathan S, Andersson S, Subbiah S, Tan TW, Hide W, Weiss AS. (2003) The S-Star trial bioinformatics course - An on-line learning success. Biochem Mol Biol Edu 31 (1): 20-23.