Random Community Genomics
Rob Edwards April 2006
Rob Edwards, Fellowship for Interpretation of Genomes, Burr Ridge, IL, and San Diego State University, San Diego, CA Email: RobE@theFIG.info
Table of Contents
About this Document
Abstract
Background
General high-throughput sequencing
Low cost, high-throughput ABI sequencing
Pyrosequencing
Costs of pyrosequencing
Advantages of pyrosequencing
Disadvantages of pyrosequencing
Uses of pyrosequencing in environmental genomics
Single Stranded DNA
DNA vs. RNA
Sequence storage and retrieval
Bioinformatics Challenges From Pyrosequencing
Functional analysis and annotation
16S rDNA comparisons
Other comparisons
Assembly
ORF calling
Direct comparisons of the data
Currently Available Tools
FABULOUS
SEED
MIG
Immediate Problems
Short Term Requirements
Longer term requirements
Summary
Acknowledgements
References
About this Document
These are some thoughts and comments on the directions of metagenomics and community genomics following recent discussions with researchers in the US and UK. The views reflect my bias towards data analysis as I see it evolving in the next 3-5 years. I do not address what metagenomics or random community genome sequencing should be used for; that question should be left to individual scientists to justify. This document merely describes what I see as the current state of the technology.
Abstract
Random community genomics, sequencing whole community DNA without growing the microbes or cloning their DNA, is now a reality. Our group alone has sequenced in excess of 850 Mbp of DNA from environmental samples. Sample preparation and sequencing are very cheap and easy, costing less than $500 per million bp. The major limitation to advancing our understanding of environmental genomics is no longer our ability to see the DNA. It is the lack of access to high-performance computing. Once the technologies and techniques that are being used by selected labs today become commonplace, there will be overwhelming demand for computational power beyond anything that is readily available.
Background
“We are sitting on a gold-mine without any shovels” – Mya Breitbart, USF, author of the largest collection of random community genomics papers published to date.
In this white paper I’m adopting the two phrases metagenomics and random community genomics to mean different things. There is already confusion over the use of the two terms, and so for the purposes of this document I am defining them here.
The term “metagenomics” is being used to describe what used to be called recombinant techniques, where you clone a gene or a group of genes, and then look for a specific function or gene of interest. This technique is how most of the genes that have been experimentally characterized were found. The technique has been widely used and perfected for gene discovery for more than 30 years, and over the years there have been many reports that describe techniques for gene finding. Some years ago I used this approach to find genes involved in nitrogen regulation and their neighbors, sequencing the genes using 35S, 32P, and 33P. Perhaps the single biggest innovation in recent years has been the application of high-throughput sequencing to enable people to sequence more DNA per clone, and consequently to screen large libraries for more than just one or two functional genes. This work will continue in the usual fashion and will be affected by pyrosequencing because the cost of sequencing will drop, but it is not the focus of this white paper.
I use the term “random community genomics” for the sequencing of whole communities from the environment with or without cloning, but without screening for any functional component. This is a reasonably new technique, first described in 2002 (6). Since that publication, there have been several community sequences described (Table 1). The main innovations that were required for the application of random community genome sequencing to environmental samples were (i) isolation of sufficient quantities of pure DNA; (ii) techniques
such as linker-amplified shotgun libraries (LASLs; http://www.lucigen.com) for improved cloning efficiency from community DNA (6); and (iii) the ready availability of low-cost high-throughput sequencing.
Table 1. Published Random Community Genome Sequences
Year   Community                           Ref
2002   Marine water                        (6)
2003   Human feces                         (4)
2003   Drinking water biofilms             (14)
2004   Marine sediment                     (3)
2004   Acid Mine                           (16)
2004   Sargasso Sea                        (17)
2005   Horse feces                         (7)
2005   Human blood                         (5)
2005   Human plasma                        (11)
2005   Human respiratory tract aspirates   (1)
2005   Whales, Farm Soils                  (15)
2006   Marine depth profile                (8)
2006   Iron Mine (454)                     (9)
2006   Human Feces (RNA)                   (18)
Random community sequencing essentially addresses two fundamental questions: what is there, and what is it doing? Typically, the “what is there?” question has been addressed by 16S rDNA sequencing, and comparing those sequences to databases such as GreenGenes (http://greengenes.lbl.gov/), the ribosomal database project (http://rdp.cme.msu.edu/), or the European ribosomal RNA database (http://www.psb.ugent.be/rRNA/). The “what is it doing?” question has previously been addressed by screening and isolation of genes with interesting functions. The advent of random community genomics, however, provides the tools for asking what the entire metabolism of the community is. As shown below, current technology provides the tools for addressing both these questions cheaply and efficiently.
General high-throughput sequencing
Commercial sequencing companies such as Agencourt (http://www.agencourt.com) charge approximately $3-$5 per reaction for sequencing, including arraying but not library construction. The cost of high-throughput sequencing is currently around $1 per reaction, so sequencing a reasonable sample of environmental DNA will still require a significant financial commitment.
Many of the bioinformatics tools that have been developed (see below) were targeted towards summarizing data generated using this approach and are suited to the manipulation of hundreds to thousands of sequences typical of random community genomics approaches using clone libraries.
Low cost, high-throughput ABI sequencing
Efforts by a few sequencing centers, most notably the Joint Genome Institute’s Community Sequencing Program (CSP), have greatly facilitated the application of random community genomics to many different environments. The JGI has allocated 13 Gbp in FY05 and 22 Gbp in FY06 for community sequencing projects that are open to public competition.
Pyrosequencing
Our publication of the first random community genome created by pyrosequencing represents a watershed in the availability of environmental genomics (9). The effective cost per read has dropped from about $3 to about 3 cents. In addition to the raw sequencing cost, the methods developed for pyrosequencing have eliminated the need for PCR amplification, library construction, cloning, colony-picking and arraying. This technology, therefore, has the ability to put random community genome sequencing within reach of the average bench biologist. A typical pyrosequencing run generates 300,000 sequences from an environmental sample. The average sequence length is 105 bp, meaning that a typical run provides approximately 30 Mbp of raw sequence data. Our group has sequenced more than 30 samples using pyrosequencing, generating in excess of 850 Mbp of sequence data in under 1 year.
To achieve this level of sequencing, whole community DNA is purified from a sample by traditional means, typically using an extraction kit. Depending on the amount of DNA extracted, an optional whole genome amplification step is performed using one of the many commercially available kits (e.g., 10). The target is for approximately 3-5 µg of DNA to be shipped to 454 Life Sciences, Inc. I have not described the protocol for the sequencing here, as you can find specific details in the original paper (12) and on the company’s website (http://www.454.com). Sequencing takes a few weeks, and the data is currently supplied as a fasta file, a quality file, and potentially a flowgram file.
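As a rough check on figures like these, the delivered fasta file can be summarized directly. The following is a minimal sketch in Python, not part of any existing pipeline; the function names and the tiny in-memory example are hypothetical, and a real run would read the fasta file supplied by the sequencing provider instead.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def summarize(handle):
    """Return the number of reads, total bases, and mean read length."""
    lengths = [len(seq) for _, seq in read_fasta(handle)]
    total = sum(lengths)
    mean = total / len(lengths) if lengths else 0.0
    return {"reads": len(lengths), "bases": total, "mean_length": mean}


# Hypothetical two-read example standing in for a real fasta file.
example = io.StringIO(">read1 sample\nACGTACGT\nACGT\n>read2\nGGCC\n")
summary = summarize(example)
print(summary)
```

With the numbers quoted above, 300,000 reads at an average of 105 bp come to 31.5 Mbp, which is consistent with the approximate figure of 30 Mbp per run.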
Costs of pyrosequencing
The quoted price for commercial sequencing is approximately $12,000 per pyrosequencing run, although discounts may be negotiated for bulk orders. Much of the cost of sequencing may be mitigated by technological advancements that are either already in place or underway. First, each run of the sequencing machine can be split into four or more separate runs by masking areas of the sequencing plate. The cost per sample therefore drops proportionally. The disadvantage of this technique is that the total yield is decreased slightly because of the physical presence of the mask. Second, it is technically possible (although apparently not yet practically feasible) to sequence both ends of a fragment, providing approximately twice as much sequence data per sample along with paired-end information. A typical read generated in this fashion may have two short segments separated by an unknown number of bases. The approximate size of the fragments would be known, so an upper bound on the distance between the two ends could be provided. Third, it may be possible in the near future to combine separate samples during the process but
have each sample tagged so that the sequences can be computationally sorted after sequencing. This would allow several different samples to be combined in a single run without a loss of yield. The tag will typically be a 4-bp sequence, allowing 256 different tags. Fourth, the length of the sequences may increase. Reportedly, 200 bp sequences are achieved in the laboratory, although this has not yet reached the commercial sequencing arena.
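The computational sorting of tagged samples could look something like the sketch below. It is an illustration only: it assumes the 4-bp tag sits at the start of each read, and the sample names, tags, and reads are made up for the example. The actual tagging scheme may differ.

```python
from collections import defaultdict
from itertools import product

TAG_LENGTH = 4  # 4-bp tags, as described in the text


def all_tags(length=TAG_LENGTH):
    """Enumerate every possible tag of the given length."""
    return ["".join(p) for p in product("ACGT", repeat=length)]


def demultiplex(reads, tag_to_sample, tag_length=TAG_LENGTH):
    """Group reads by their leading tag and strip the tag from each read.

    Reads whose tag is not in tag_to_sample are collected under the key None.
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length].upper(), read[tag_length:]
        bins[tag_to_sample.get(tag)].append(insert)
    return dict(bins)


# Hypothetical tags, sample names, and reads.
tag_to_sample = {"ACGT": "mine_water", "TTGA": "soil"}
reads = ["ACGTGGGCCC", "TTGAAAATTT", "ACGTTTTT", "NNNNACGT"]
bins = demultiplex(reads, tag_to_sample)
print(len(all_tags()), bins)
```

A practical scheme would probably use fewer than the full 256 tags, chosen so that a single sequencing error cannot turn one tag into another, and would avoid tags that are themselves homopolymers.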
Advantages of pyrosequencing
The obvious advantage of the pyrosequencing approach is the large amount of sequence that is generated at a low cost. The technology has reduced the barrier to enter the random community genome sequencing arena by at least two orders of magnitude and enabled typical bench researchers to consider using this approach to answer environmental and ecological questions. Another advantage of pyrosequencing for community genomics is the lack of bias in the sample preparation and analysis. In the random community genome sequencing approach, if sufficient DNA can be extracted from the sample there should be essentially no bias in the sequence generated. If whole genome amplification techniques are used, this can potentially introduce end-effect biases, but our preliminary unpublished data suggest that these may be limited in scope and nature.
Disadvantages of pyrosequencing
The principal problem with the approach is the short sequence fragments that are generated. This, of course, limits most of the bioinformatics analyses that are currently used, such as gene finding, protein similarity searches, and sequence assembly. However, several groups (including our group) are interested in facilitating these analyses as described in more detail below. The second problem, which is well known and characterized, is the issue of homopolymeric runs in sequence data. The nature of pyrosequencing means that the lengths of continuous stretches of the same nucleotide are difficult to discriminate. Controls showed that the software is able to distinguish up to about 7 or 8 homopolymeric bases without difficulty, but beyond that problems occur. The nature of biology may also influence that analysis, because large tracts of single nucleotides may be polymorphic in microbial sequences as strand-slippage during DNA replication causes inexact DNA duplication, a mechanism that is frequently used for phase-variation of gene expression.
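One simple way to see how much of a library is affected is to flag reads that contain long homopolymeric runs. The sketch below is a hypothetical helper, not the base-calling software itself; the default threshold of 8 is an assumption taken from the "about 7 or 8" figure above and should be adjusted to taste.

```python
import re


def homopolymer_runs(sequence, min_length=8):
    """Return (start, base, run_length) for each run of one base that is
    at least min_length long. Positions are 0-based."""
    # One base followed by at least (min_length - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(sequence.upper())
    ]


# Hypothetical reads: the first has a run of nine A's, the second only seven T's.
long_run = "ACGT" + "A" * 9 + "GT"
short_run = "ACG" + "T" * 7 + "GA"
print(homopolymer_runs(long_run))
print(homopolymer_runs(short_run))
```

Reads flagged this way are not necessarily wrong, but the reported length of the run deserves less confidence than the rest of the read.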
Uses of pyrosequencing in environmental genomics
For environmental microbiology there are two main approaches that are currently using pyrosequencing. The first is whole genome random sequencing (9). In this approach community genomic DNA is extracted and sequenced as-is. Usually purification steps are included before DNA extraction to exclude any free, viral, microbial, or eukaryotic DNA that is unwanted. Depending on the quantity of DNA that is generated, a whole genome amplification step may be applied. This introduces a bias against the ends of linear DNA (Rohwer et al, unpublished) but does not appear to introduce overt taxonomic biases into the sequences. The second common approach that is being used is to sequence 16S rDNA libraries to extinction. In this approach, 16S rDNA genes are amplified by PCR, but instead of cloning, the genes are sequenced by pyrosequencing. This should provide complete and thorough coverage
of the 16S rDNA genes in the sample, with the obvious caveat that it will be biased by the PCR amplification step. This approach has the potential to replace traditional 16S rDNA clone libraries and DGGE analysis as the approach for examining microbial communities for the next few years, provided it reaches a cost that most labs can afford. Based on discussions with researchers regularly using DGGE approaches for classifying microbial communities, the cost would have to be close to or less than $100 per sample, with each sample providing in excess of 1,000 sequences, for pyrosequencing to replace DGGE approaches. If 256 samples could be combined and tagged in a single run, the cost would be less than $40 per sample and more than 1,000 sequences would be generated per sample, enabling the replacement of DGGE and cloning for most environmental laboratories.
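The per-sample arithmetic can be sketched as follows, using the quoted list price of approximately $12,000 per run and the typical yield of 300,000 sequences per run. The function name is hypothetical, and the calculation assumes that reads divide evenly among samples and ignores any loss of yield.

```python
def per_sample(run_price, reads_per_run, samples_per_run):
    """Return (cost per sample, reads per sample) for an evenly split run."""
    return run_price / samples_per_run, reads_per_run // samples_per_run


RUN_PRICE = 12_000        # approximate quoted price per run, in dollars
READS_PER_RUN = 300_000   # typical number of sequences per run
N_TAGS = 4 ** 4           # 256 possible 4-bp tags

cost, reads = per_sample(RUN_PRICE, READS_PER_RUN, N_TAGS)
print(f"${cost:.2f} per sample, {reads} reads per sample")

# Run price at which 256 samples would cost exactly $40 each.
print(40 * N_TAGS)
```

At the undiscounted list price this works out to about $47 per sample and roughly 1,170 sequences per sample. A per-sample cost under $40 would correspond to a run price below $10,240, for example through the bulk discounts mentioned earlier. Either way the figure is well under the $100 threshold.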
Single Stranded DNA
The process used to generate the pyrosequence data includes a single-stranded DNA intermediate. Unlike cloning-based approaches, this does not limit the approach to double-stranded environmental DNA, and heretofore uncharacterized single-stranded molecules are included in the sequence.
DNA vs. RNA
Currently pyrosequencing has focused on using DNA templates. However, with minimal alterations to the protocol the same approach could also be used for sequencing RNA. This opens up RNA viruses, gene expression, ESTs, and other largely unexplored areas of environmental microbiology.
Sequence storage and retrieval
Several groups (including 454, Sanger, and NCBI) have agreed on a common format for exchanging flowgram data from pyrosequencing machines. For public storage and exchange, the NCBI trace archive is being amended to store flowgram data. As with all sequence data, ownership and secrecy become an issue at some point. My view, which is borne out by joint publications with other groups, is that we have so much data we could never perform every analysis ourselves. As with other sequencing projects, there is always the valid concern of being scooped with your own data. These issues will also have to be decided if central resources are developed for community genomics.
Bioinformatics Challenges From Pyrosequencing
The generation of sequence data using pyrosequencing has become routine. For the typical molecular biology laboratory the sample is prepared and the DNA shipped out for sequencing. Although a few institutions have purchased 454 machines, the economies of scale suggest that most pyrosequencing will be performed at central facilities, as it currently is with ABI sequencing. Although the specifics of sample collection and preparation vary, these are not challenges for random community genomics any more than they are challenges for other aspects of environmental microbiology. The current bioinformatics challenges from pyrosequencing data can be separated into several categories. Each of these categories is being researched, many by our group as well as other groups in the US and overseas.
Functional analysis and annotation
Although the sequences are only about 100 bp long, this is sufficient to search against extant sequence databases. Depending upon the library, the sample, and the cutoffs applied, a typical search will show that up to 95% of the sequences have no known homologs in non-redundant databases such as the SEED nr or GenBank nr. However, with 300,000 sequences in a sample, even a 10% hit rate provides 30,000 sequences to examine, significantly more than other approaches. The strength of these types of analysis is obvious when a comparative random community genome approach is applied. If two samples are treated similarly and then compared, the differences between samples become significant and the biases are negated. Statistical approaches to comparing metagenomes yield insights into the important metabolic potential that can be found in different environments (9, 13). Functional analysis is typically performed using BLASTX to compare the DNA sequences from the 454 library with the protein sequences from the non-redundant database. A standard analysis using this approach takes approximately 1,000 CPU hours to perform (i.e., it would take about 20 hours to perform this analysis on a cluster with 50 nodes). This is currently the single biggest hurdle for sequence analysis, since even a 12-node cluster is hard to come by. Efforts are underway by several teams, including the group at Argonne National Labs, to facilitate entry-level cluster access for sequence comparisons against the non-redundant databases.
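To make the hit-rate and run-time figures concrete, here is a minimal sketch that counts the reads with at least one significant hit and estimates wall-clock time on a cluster. It assumes the BLASTX results were written in NCBI's 12-column tab-separated tabular format, with the query ID in the first column and the E-value in the eleventh. The E-value cutoff of 1e-5, the function names, and the three example lines are assumptions for illustration. The wall-clock estimate assumes perfect scaling with one CPU per node.

```python
def reads_with_hits(tabular_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue."""
    hits = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return hits


def wall_clock_hours(cpu_hours, nodes):
    """Idealized run time if the search is split evenly across nodes."""
    return cpu_hours / nodes


# Hypothetical results for a three-read library (read3 has no hits at all).
example = [
    "read1\tprotA\t85.0\t33\t5\t0\t1\t99\t10\t42\t1e-10\t60.1",
    "read1\tprotB\t70.0\t30\t9\t0\t4\t93\t7\t36\t0.002\t35.0",
    "read2\tprotC\t60.0\t25\t10\t0\t1\t75\t1\t25\t0.5\t28.0",
]
hits = reads_with_hits(example)
print(sorted(hits), len(hits) / 3)
print(wall_clock_hours(1000, 50), wall_clock_hours(1000, 12))
```

Under these idealized assumptions, the 1,000 CPU-hour search takes 20 hours on 50 nodes and a little over 83 hours, about three and a half days, on 12 nodes.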
16S rDNA comparisons
When 16S rDNA sequences are extracted from a random community genome sequence, approximately 100 sequences are found by similarity using BLAST. This is a small enough sample that the sequences can be manually aligned and curated using other software to identify at least the genus/species information. It is difficult to generate a single 16S tree from this data, since these sequences are randomly distributed along the 16S rDNA gene. However, if a 16S tree is desired then specific 16S rDNA sequencing can be performed. When 16S rDNA sequences are specifically sequenced, the identification challenge will also focus on assembly of the sequences. Sequences with exact matches are grouped to provide species information, rather than being flattened as happens during whole genome assembly.
Other comparisons
Comparisons of random community genome sequence data against boutique databases are very fast and extremely informative. Our current analysis includes comparisons to protein families such as COG (http://www.ncbi.nlm.nih.gov/COG/), FIG (http://theseed.uchicago.edu/FIG/proteinfamilies.cgi), PFAM (http://pfam.wustl.edu/), and PIR (http://pir.georgetown.edu/). In addition, we compare data to the ACLAME database of mobile elements (http://aclame.ulb.ac.be/) and our in-house database of phage and prophage genomes (http://phage.sdsu.edu/~rob/PhageTree/v4/).
Assembly
Accurate assemblies are critical for environmental genomics, not only for the ability to create longer sequences that may provide full length protein sequence and functional coupling data, but also for the potential for modeling microbial communities based on sequence-read overlaps (2). Pyrosequencing provides some unique challenges for sequence assembly. The short sequences and the flowgram nature of the data preclude using off-the-shelf solutions to sequence assembly such as phrap (http://www.phrap.org) or Sequencher (http://www.genecodes.com). Our group has had some success assembling subsamples of the entire library in Sequencher, the TIGR assembler, or phrap (ms in preparation). However, this has taken considerable optimization of the process and we still do not have perfect assemblies. Most likely, alternative assemblers that incorporate the flowgram data format will be written and released in the next 12-18 months.
ORF calling
Short fragments such as those presented by pyrosequencing do not lend themselves to the identification of open reading frames. Nonetheless, we can use them for identification of protein sequences. End-to-end in frame sequences suggest a likely polypeptide, and other techniques may be applied to provide best-guess fragments of polypeptides that span different reading frames.
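The idea of end-to-end in-frame sequences can be illustrated with a six-frame translation that keeps only the frames free of stop codons. This is a minimal sketch assuming the standard genetic code; the function names and the example read are hypothetical. It does not attempt the more involved step of piecing together polypeptide fragments that span different reading frames.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def open_frames(read):
    """Return {frame: peptide} for each of the six frames without a stop codon."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            peptide = translate(seq[offset:])
            if "*" not in peptide:
                frames[f"{strand}{offset + 1}"] = peptide
    return frames


# Hypothetical 9-bp read; real reads are around 100 bp.
print(open_frames("ATGGCCTAA"))
```

On a read of about 100 bp, several frames often remain open by chance, so the surviving frames are candidates to compare against protein databases and not confident gene calls.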
Direct comparisons of the data
Individual researchers with a protein, a small family of proteins, or a pathway may want to identify gene fragments from within random community genomes. Therefore, sites that allow 454 libraries to be used as the database in a BLAST search, so that researchers can supply DNA or protein sequences to compare against the library, will be a critical resource for the near future. A typical library-library comparison using BLASTN takes approximately 20 CPU hours.
Currently Available Tools
A few tools have been developed for the manipulation of random community genomics data, but these tend to be specifically designed for individual groups. In this section I touch on some of the tools that our group has produced to deal with the data flow. This will give a flavor of the computational resources that pyrosequencing data will require.
FABULOUS
(http://phage.sdsu.edu/~rob/cgi-bin/parse.cgi)
A suite of web interfaces creates a manual pipeline for researchers in the group. This site was designed for use with ABI generated sequences, and is therefore really suitable for sequencing projects in the range of 100-10,000 sequences. The suite includes tools that allow researchers to use base calling and assembly software (phred and phrap) on their projects, and then submit the data to a BLAST pipeline. Large BLAST searches (such as those against the GenBank non-redundant database) are separated out and submitted using network tools developed by NCBI. Small scale searches and searches against boutique databases are performed in-house. Post-processing tools provide abilities for parsing data, generating reports on the variety of sequences that are present, and searching results. The interface is designed for the end user who is knowledgeable in biology and has some experience with computers, but is not a programmer.
SEED
(http://theseed.uchicago.edu/FIG/index.cgi and http://seed.sdsu.edu/FIG/index.cgi)
The SEED database is a public resource of complete and draft genome sequences. The SEED contains more genomes than any other database, and also includes manual curation of those genomes. Therefore, the SEED database is ideal for connecting random community genome sequences to metabolic function. The database contains precomputed similarities, tools for the functional analysis of proteins, and tools for gene identification. The SEED database contains its own protein families, but also contains families from PFAM, COG, KEGG, PIR, and others. The standard SEED installation (such as the server in Chicago) contains common random genome sequences including the Sargasso and AMD datasets. Local installations can contain other datasets, including pyrosequencing data. The SDSU SEED installation currently houses approximately 25 pyrosequencing datasets, and includes the similarities to other proteins in the database, and annotations of sequences, where available. The SEED also provides a straightforward interface for searching the data and sifting through the large datasets that are generated by pyrosequencing. For example, entering the search terms “4444444 integrase” and clicking the “Allow substring match” option will identify all the integrase genes in all the pyrosequenced metagenomes. Again, this interface is designed with casual users in mind but has a significant overhead on the back-end to introduce the data to the databases.
MIG
(http://phage.sdsu.edu/~rob)
Our Microbial Informatics Group is rapidly developing tools for analyzing pyrosequencing data. Our current pipeline (outlined on the website) includes renumbering and storing the sequences in a database, comparing against simple databases as outlined above, and generating summaries and reports. All of this software is released through our subversion repository under open source licenses.
Immediate Problems
At the moment random community genome sequencing is being performed by those that have, or have access to, the computational power to perform the analysis. The development of pyrosequencing will give all biological researchers the ability to generate data in unprecedented quantities. The cost and complications of generating sequencing data are no longer the barriers to entry into the environmental genomics arena. Very quickly the primary barrier will move downstream, beyond the sequencing facility and into the computational analysis field. In both the short term (less than 12 months) and the mid- to long-term (up to 5 years and beyond), the difficulties will be handling and storing the data, and performing meaningful functional analyses.
Short Term Requirements
The most immediate problem for random community genome sequencing is the availability of the computational horse-power to handle the searches required for functional analyses. Some
of this is being provided on an ad-hoc basis by distributed resources, but much is by word-of-mouth and friends-of-friends. Central computational resources capable of receiving a complete 454 sample and comparing those sequences to the non-redundant databases (using both BLASTX and the computationally more expensive TBLASTX algorithms) are desperately needed. The resource must not only return the individual sequence results, but must be hooked into an annotation framework in a manner that provides the most up-to-date annotations of the proteins that are in the databases. Smaller boutique databases are easier to manage for end users. However, truly facilitating the access of traditional bench biologists to random community genome sequences will require incorporating those databases into a distributed server system, and pipelining the whole analysis. The remaining problem is data visualization. Each sequence set contains hundreds of thousands of sequences, and these are compared to databases with similar numbers of sequences. Providing the results of comparisons to end users in a manner that allows them to search for results of interest, generate summaries in appropriate formats, and so on, will require some exploration of alternative data presentation formats.
Longer term requirements
The longer term requirements will include the ability to compare databases between sites. This will not be limited to pyrosequencing datasets. For example, comparing random community genome sequences from one part of the ocean to the already-released and soon-to-be-released oceanic microbial sequences will require significant computational power that is not readily available to most researchers. In the longer term, different kinds of analyses will also come online. For example, BLAST may not be the most appropriate tool for the comparison of short sequences with large databases, or the comparison of two databases with each other. However, for the immediate future it is clearly the gold standard in quality of results, ease of implementation across different machines and architectures, and in its widespread acceptance for publications. Assemblies, gene finding, and other analytical techniques will become common and prevalent, allowing researchers to expand their analyses from single reads to longer fragments of DNA. Currently the computational power is centralized in a few laboratories, but every molecular laboratory will require access to similar power. One of the most important long term goals must be to facilitate the research of the average bench researcher who is not a computer programmer, but is quite capable of using a computer.
Summary
Access to cheap, reliable, and easy sequencing is about to transform molecular environmental microbiology. The majority of labs that are currently using 16S rDNA cloning and sequencing or DGGE will at least attempt some form of pyrosequencing “just to see if it works”. The generation of the samples and the generation of the sequence data are no longer significant barriers to most microbiological researchers. We are about to enter a period where large amounts of sequence data are going to be generated, and we need to provide access to the existing computational infrastructure and to
develop the future computational infrastructure to ensure that the analysis does not become the next hindrance to the biological researcher.
Acknowledgements
Thanks to Andy Lilley and Dawn Field (CEH, Oxford), and Ian Head (Newcastle) for great discussions about random community genomics. Thanks to Mya Breitbart (USF, St. Petersburg, FL) and Forest Rohwer (SDSU, San Diego, CA) for data, comments on this paper, cajoling, and discussions. Thanks to Forest for sequencing the world so that no one else needs to.
References
1. Allander, T., M. T. Tammi, M. Eriksson, A. Bjerkner, A. Tiveljung-Lindell, and B. Andersson. 2005. Cloning of a human parvovirus by molecular screening of respiratory tract samples. Proc Natl Acad Sci U S A 102:12891-6.
2. Angly, F., B. Rodriguez-Brito, D. Bangor, P. McNairnie, M. Breitbart, P. Salamon, B. Felts, J. Nulton, J. Mahaffy, and F. Rohwer. 2005. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics 6:41.
3. Breitbart, M., B. Felts, S. Kelley, J. M. Mahaffy, J. Nulton, P. Salamon, and F. Rohwer. 2004. Diversity and population structure of a near-shore marine-sediment viral community. Proc R Soc Lond B Biol Sci 271:565-74.
4. Breitbart, M., I. Hewson, B. Felts, J. M. Mahaffy, J. Nulton, P. Salamon, and F. Rohwer. 2003. Metagenomic analyses of an uncultured viral community from human feces. J Bacteriol 185:6220-3.
5. Breitbart, M., and F. Rohwer. 2005. Method for discovering novel DNA viruses in blood using viral particle selection and shotgun sequencing. Biotechniques 39:729-36.
6. Breitbart, M., P. Salamon, B. Andresen, J. M. Mahaffy, A. M. Segall, D. Mead, F. Azam, and F. Rohwer. 2002. Genomic analysis of uncultured marine viral communities. Proc Natl Acad Sci U S A 99:14250-5.
7. Cann, A. J., S. E. Fandrich, and S. Heaphy. 2005. Analysis of the virus population present in equine faeces indicates the presence of hundreds of uncharacterized virus genomes. Virus Genes 30:151-6.
8. DeLong, E. F., C. M. Preston, T. Mincer, V. Rich, S. J. Hallam, N. U. Frigaard, A. Martinez, M. B. Sullivan, R. Edwards, B. R. Brito, S. W. Chisholm, and D. M. Karl. 2006. Community genomics among stratified microbial assemblages in the ocean's interior. Science 311:496-503.
9. Edwards, R. A., B. Rodriguez-Brito, L. Wegley, M. Haynes, M. Breitbart, D. M. Peterson, M. O. Saar, S. Alexander, E. C. Alexander, Jr., and F. Rohwer. 2006. Using pyrosequencing to shed light on deep mine microbial ecology under extreme hydrogeologic conditions. BMC Genomics 7:57.
10. Hawkins, T. L., J. C. Detter, and P. M. Richardson. 2002. Whole genome amplification--applications and advances. Curr Opin Biotechnol 13:65-7.
11. Jones, M. S., A. Kapoor, V. V. Lukashov, P. Simmonds, F. Hecht, and E. Delwart. 2005. New DNA viruses identified in patients with acute viral infection syndrome. J Virol 79:8230-6.
12. Margulies, M., M. Egholm, W. E. Altman, S. Attiya, J. S. Bader, L. A. Bemben, J. Berka, M. S. Braverman, Y. J. Chen, Z. Chen, S. B. Dewell, L. Du, J. M. Fierro, X. V. Gomes, B. C. Godwin, W. He, S. Helgesen, C. H. Ho, G. P. Irzyk, S. C. Jando, M. L. Alenquer, T. P. Jarvie, K. B. Jirage, J. B. Kim, J. R. Knight, J. R. Lanza, J. H. Leamon, S. M. Lefkowitz, M. Lei, J. Li, K. L. Lohman, H. Lu, V. B. Makhijani, K. E. McDade, M. P. McKenna, E. W. Myers, E. Nickerson, J. R. Nobile, R. Plant, B. P. Puc, M. T. Ronan, G. T. Roth, G. J. Sarkis, J. F. Simons, J. W. Simpson, M. Srinivasan, K. R. Tartaro, A. Tomasz, K. A. Vogt, G. A. Volkmer, S. H. Wang, Y. Wang, M. P. Weiner, P. Yu, R. F. Begley, and J. M. Rothberg. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature.
13. Rodriguez-Brito, B., F. Rohwer, and R. Edwards. 2006. An application of statistics to comparative metagenomics. BMC Bioinformatics 7:162.
14. Schmeisser, C., C. Stockigt, C. Raasch, J. Wingender, K. N. Timmis, D. F. Wenderoth, H. C. Flemming, H. Liesegang, R. A. Schmitz, K. E. Jaeger, and W. R. Streit. 2003. Metagenome survey of biofilms in drinking-water networks. Appl Environ Microbiol 69:7298-309.
15. Tringe, S. G., C. von Mering, A. Kobayashi, A. A. Salamov, K. Chen, H. W. Chang, M. Podar, J. M. Short, E. J. Mathur, J. C. Detter, P. Bork, P. Hugenholtz, and E. M. Rubin. 2005. Comparative metagenomics of microbial communities. Science 308:554-7.
16. Tyson, G. W., J. Chapman, P. Hugenholtz, E. E. Allen, R. J. Ram, P. M. Richardson, V. V. Solovyev, E. M. Rubin, D. S. Rokhsar, and J. F. Banfield. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428:37-43.
17. Venter, J. C., K. Remington, J. F. Heidelberg, A. L. Halpern, D. Rusch, J. A. Eisen, D. Wu, I. Paulsen, K. E. Nelson, W. Nelson, D. E. Fouts, S. Levy, A. H. Knap, M. W. Lomas, K. Nealson, O. White, J. Peterson, J. Hoffman, R. Parsons, H. Baden-Tillson, C. Pfannkoch, Y. H. Rogers, and H. O. Smith. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304:66-74.
18. Zhang, T., M. Breitbart, W. H. Lee, J. Q. Run, C. L. Wei, S. W. Soh, M. L. Hibberd, E. T. Liu, F. Rohwer, and Y. Ruan. 2006. RNA viral community in human feces: prevalence of plant pathogenic viruses. PLoS Biol 4:e3.