The rapid accumulation of genetic information generated by the Human Genome Project and related research has heightened public awareness of genetics issues. Education in genome science is needed at all levels of our society, for both specific audiences and the general public, so that individuals can make well-informed decisions on public policy and on issues such as genetic testing. Many scientists have found that high school biology courses are an effective vehicle for reaching a broad sector of society. From an educational perspective, genome science offers many ways to meet emerging science learning goals, which are influencing science teaching nationally. To meet the goals of the science and education communities effectively, genome education needs to include several major components: accurate and current information about genomics, hands-on experience with DNA techniques, education in ethical decision-making, and career counseling and preparation. To be most successful, we have found that genome education programs require the collaborative efforts of science teachers, genome researchers, ethicists, genetic counselors, and business partners. This report is intended as a guide for genome researchers with an interest in participating in pre-college education, providing a rationale for their involvement, recommending ways they can contribute, and highlighting a few exemplary programs. World Wide Web addresses for all of the programs discussed in this report are given in Table 1. We are developing a database of outreach programs offering genetics education (http://geneticseducation.mbt.washington.edu/database) and request that readers submit an entry describing their programs. We invite researchers to contact us for more information about activities in their local area.
cDNA Project: The Mammalian Gene Collection (MGC)
The National Institutes of Health's Mammalian Gene Collection (MGC) project was designed to generate and sequence a publicly accessible cDNA resource containing a complete open reading frame (ORF) for every human and mouse gene. The project initially used a random strategy to select clones from a large number of cDNA libraries from diverse tissues. Candidate clones were chosen based on 5′-EST sequences, and then fully sequenced to high accuracy and analyzed by algorithms developed for this project. Currently, more than 11,000 human and 10,000 mouse genes are represented in MGC by at least one clone with a full ORF. The random selection approach is now reaching a saturation point, and a transition to protocols targeted at the missing transcripts is now required to complete the mouse and human collections. Comparison of the sequence of the MGC clones to reference genome sequences reveals that most cDNA clones are of very high sequence quality, although it is likely that some cDNAs may carry missense variants as a consequence of experimental artifact, such as PCR, cloning, or reverse transcriptase errors. Recently, a rat cDNA component was added to the project, and ongoing frog (Xenopus) and zebrafish (Danio) cDNA projects were expanded to take advantage of the high-throughput MGC pipeline.
The International HapMap Project Web site
The HapMap Web site at http://www.hapmap.org is the primary portal to genotype data produced as part of the International Haplotype Map Project. In phase I of the project, >1.1 million SNPs were genotyped in 270 individuals from four worldwide populations. The HapMap Web site provides researchers with a number of tools that allow them to analyze the data as well as download data for local analyses. This paper presents step-by-step guides to using those tools, including guides for retrieving genotype and frequency data, picking tagSNPs for use in association studies, viewing haplotypes graphically, and examining marker-to-marker LD patterns.
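The marker-to-marker LD patterns mentioned above reduce to pairwise statistics such as r². As a minimal illustration of the underlying calculation (not the HapMap site's own implementation; the function name and 0/1 allele coding are assumptions for this sketch), r² for two biallelic SNPs can be computed from phased haplotypes:

```python
def r_squared(site1, site2):
    """r^2 linkage disequilibrium between two biallelic SNPs.

    site1, site2: lists of 0/1 alleles observed on the same set of
    phased chromosomes (index i is chromosome i at each site).
    """
    n = len(site1)
    p1 = sum(site1) / n                      # frequency of allele 1 at site 1
    p2 = sum(site2) / n                      # frequency of allele 1 at site 2
    p12 = sum(a and b for a, b in zip(site1, site2)) / n  # freq of 1-1 haplotype
    D = p12 - p1 * p2                        # LD coefficient
    return D * D / (p1 * (1 - p1) * p2 * (1 - p2))
```

Two sites in perfect LD (alleles always co-occurring, or always opposed) give r² = 1; independent sites give values near 0.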
BAC Resources for the Rat Genome Project
Two 11-fold redundant bacterial artificial chromosome (BAC) libraries (RPCI-32 and CHORI-230) have been constructed to support the rat genome project. The first library was constructed using a male Brown Norway (BN/SsNHsd) rat as a DNA source long before plans for rat genome sequencing had been launched. The second library was prepared from a highly inbred female (BN/SsNHsd/MCW) rat in support of the rat genome sequencing project. The use of an inbred rat strain is essential to avoid problems with genome assembly resulting from the difficulty of distinguishing haplotype variation from variation among duplicons. We have demonstrated the suitability of the libraries through a detailed quality assessment, showing large insert sizes, a narrow size distribution, consistent redundancy for many markers, and long-range continuity of BAC contig maps. The widespread use of the two libraries as an integral part of the rat genome project has led to database annotations for many clones, providing rat researchers with a rich resource of BAC clones that can be screened in silico for genes of interest.
Cancer Genome Anatomy Project
SNPs (single-nucleotide polymorphisms), the most common type of DNA variant in humans, represent a valuable resource for the genetic analysis of cancer and other illnesses. These markers may be used in a variety of ways to investigate the genetic underpinnings of disease. In gene-based studies, the correlations between allelic variants of genes of interest and particular disease states are assessed. An extensive collection of SNP markers may enable entire molecular pathways regulating cell metabolism, growth, or differentiation to be analyzed by this approach. In addition, high-resolution genetic maps based on SNPs will greatly facilitate linkage analysis and positional cloning. The National Cancer Institute's CGAP-GAI (Cancer Genome Anatomy Project Genetic Annotation Initiative) group has identified 10,243 SNPs by examining publicly available EST (expressed sequence tag) chromatograms. More than 6800 of these polymorphisms have been placed on expression-based integrated genetic/physical maps. In addition to a set of comprehensive SNP maps, we have produced maps containing single-nucleotide polymorphisms in genes expressed in breast, colon, kidney, liver, lung, or prostate tissue. The integrated maps, a SNP search engine, and a Java-based tool for viewing candidate SNPs in the context of EST assemblies can be accessed via the CGAP-GAI web site (http://cgap.nci.nih.gov/GAI/). Our SNP detection tools are available to the public for noncommercial use.
Navigating the Human Genome Project
The Human Genome Project has increased the rate of DNA sequence accumulation to the point where information management has become a formidable task. The central repositories for this avalanche of data, GenBank, EMBL (European Molecular Biology Laboratory), and DDBJ (DNA Data Bank of Japan), continue to accumulate DNA sequences at an unprecedented rate. For example, the total number of nucleotides stored in the GenBank database more than doubles every 18 months (Benson et al. 1997). The scientific community is clearly interested in supporting rapid access to high-quality DNA sequence and, although this remains controversial (Adams and Venter 1996; Bentley 1996), in supporting release of "unfinished" DNA sequence data generated by the sequencing centers. (Unfinished DNA sequences generated from a cosmid, BAC, or P1 clone may include nucleotide errors and may consist of unordered or ordered contigs with one or more gaps.) Since the process of "finishing" a sequence (which includes resolving any ambiguous bases, contig assembly, gap closure, and annotation) proceeds at a much slower pace than the initial production of sequence, a considerable amount of unfinished sequence can accumulate at the sequencing centers. Growing interest in timely dissemination of all the data, plus the perception that uneven access to the unfinished DNA sequences could confer an unfair advantage (or disadvantage) to research groups, resulted in increasing pressure on the sequencing centers.
Against a Whole-Genome Shotgun
The human genome project is entering its decisive final phase, in which the genome sequence will be determined in large-scale efforts in multiple laboratories worldwide. A number of sequencing groups are in the process of scaling up their throughput; over the next few years they will need to attain a collective capacity approaching half a gigabase per year to complete the 3-Gb genome sequence by the target date of 2005. At present, all contributing groups are using a clone-by-clone approach, in which mapped bacterial clones (typically 40–400 kb in size) from known chromosomal locations are sequenced to completion. Among other advantages, this permits a variety of alternative sequencing strategies and methods to be explored independently without redundancy of effort. Although it is not too late to consider implementing a different approach, any such approach must have as high a probability of success as the current one and offer significant advantages (such as decreased cost). I argue here that the whole-genome shotgun proposed by Weber and Myers satisfies neither condition.
For purposes of comparison it is helpful to first outline a specific implementation of clone-by-clone sequencing. Although by no means the only one possible, this implementation is being used by several of the larger groups and seems likely to be the method of choice for the major part of the genome. One starts with a set of mapped sequence-tagged sites (STSs) (Olson et al. 1989) from a particular chromosomal region. These are screened against a bacterial artificial chromosome (BAC) (or other large bacterial clone) library (Kim et al. 1996) to obtain overlapping clusters of clones from that region. Since whole-genome mapping efforts are nearing the target density of 1 STS per 100 kb
Big Time for Small Genomes
A new field of research has emerged that would have been unthinkable to most people as recently as 2 years ago. There is no generally accepted name for this discipline, but "genome-based microbiology" seems to be appropriate. As indicated in the title of the recent meeting "Small Genomes: Sequencing, Functional Characterization and Comparative Genetics" (The Institute for Genomic Research Genomic Series, Hilton Head, SC, January 25–28, 1997), this field capitalizes on complete sequences of "small" genomes, which effectively means genomes of unicellular organisms. The information derived from complete genome sequences and particularly from their comparative analysis is used explicitly to study the biology of the microbial cells and to infer phylogenetic conclusions. At the meeting, approximately one-half of the talks still belonged to the genomic era, that is, they were progress reports on projects aimed at complete sequencing of a particular genome or, in some cases, several genomes. The remaining talks ventured into the postgenomic era and presented either genome comparison studies or functional analysis based on the knowledge of a complete genome sequence. As a substantive indication of the importance attached to the new field, the meeting featured brief presentations by representatives of three major funding agencies, National Institutes of Health (USA), Department of Energy (USA), and the Wellcome Trust (UK), all of which provide significant support for further development and worldwide coordination of the microbial genomics effort. In his introductory overview of the small genome sequencing efforts, J. Craig Venter (The Institute for Genomic Research, Rockville, MD, hereafter TIGR) indicated that even though the number of completely sequenced genomes is currently very small, the trend toward exponential (if not faster) growth is apparent—two complete genomes appeared in 1995, four in 1996, and it is realistic to expect eight or more in 1997.
Therefore, there is no doubt that …
The basidiomycete fungus Cryptococcus neoformans is an important opportunistic pathogen of humans that poses a significant threat to immunocompromised individuals. Isolates of C. neoformans are classified into serotypes (A, B, C, D, and AD) based on antigenic differences in the polysaccharide capsule that surrounds the fungal cells. Genomic and EST sequencing projects are underway for the serotype D strain JEC21 and the serotype A strain H99. As part of a genomics program for C. neoformans, we have constructed fingerprinted bacterial artificial chromosome (BAC) clone physical maps for strains H99 and JEC21 to support the genomic sequencing efforts and to provide an initial comparison of the two genomes. The BAC clones represented an estimated 10-fold redundant coverage of the genomes of each serotype and allowed the assembly of 20 contigs each for H99 and JEC21. We found that the genomes of the two strains are sufficiently distinct to prevent coassembly of the two maps when combined fingerprint data are used to construct contigs. Hybridization experiments placed 82 markers on the JEC21 map and 102 markers on the H99 map, enabling contigs to be linked with specific chromosomes identified by electrophoretic karyotyping. These markers revealed both extensive similarity in gene order (conservation of synteny) between JEC21 and H99 as well as examples of chromosomal rearrangements including inversions and translocations. Sequencing reads were generated from the ends of the BAC clones to allow correlation of genomic shotgun sequence data with physical map contigs. The BAC
maps therefore represent a valuable resource for the generation, assembly, and finishing of the genomic sequence of both JEC21 and H99. The physical maps also serve as a link between map-based and sequence-based data, providing a powerful resource for continued genomic studies.
A “Quality-First” Credo for the Human Genome Project
The Human Genome Project is lurching toward large-scale genomic sequencing. Although we find the arguments in favor of this path compelling, there remain sobering uncertainties about the cost of producing genomic sequence on a gigabase-pair scale, the rate at which the needed sequencing capacity can be developed, and the extent to which compromises will be required in the quality of the final product. Of course, it is these very uncertainties that make the sequencing of the human genome a scientific and managerial challenge worthy of the committed attention of many hard-working and talented people. Policy decisions are now being made that will greatly affect how this talent is deployed during the years ahead. We argue here, on both scientific and managerial grounds, that it is essential that the Human Genome Project adopt a "quality-first" credo. Scientific arguments for quality are rooted in a view of how the sequence will be used. The most common uses of reference sequences involve comparisons with other data. A vast number of sequences, derived from many sources—human and nonhuman—will be compared with the human reference sequence by future scientists. We should not prejudge either the comparison methods or the questions that the comparisons will address. Our goal should be to produce a reference sequence sufficiently good that comparisons made with it will rarely fail or produce misleading results.
Sites of the Human Genome Project
In this last decade of the twentieth century the scientific community has witnessed the progression of a historic project—The Human Genome Project (HGP)—and a corresponding escalation in the rate of DNA sequence data accumulation. Managing, understanding, tracking, and storing this volume of data are not trivial tasks. The public data repositories have expanded their resources to receive, store, and retrieve sequence and map data. Furthermore, recognizing that the full extent of the data (including project descriptions and protocols) should be organized and publicly accessible, the HGP sequencing centers provide their own web sites. These sites allow for rapid dissemination of a wealth of data and information, including map and sequence data, protocols, software tools, and overviews of goals and progress. As such, it is useful to have an understanding of the general organization and content of these web sites.
The HGP Web Sites
Over the past year, this WebWise series has reviewed a dozen web sites hosted by the larger HGP sequencing centers. These sites and the URLs to their home pages are listed, in order of review date, in Table 1. Each review provided an indication of the contents and main features, as well as an organizational diagram, of the center's web site. As web sites are often subject
to revision, these sites were revisited for this final review to determine whether any large changes have been made since the original review.
Assessing the Quality of the DNA Sequence from The Human Genome Project
It is sometimes hard to remember that the first DNA sequence of the entire genome of a free-living organism, Hemophilus influenzae, was reported <4 years ago (Fleischmann et al. 1995). Since then, the genomes of >17 other prokaryotes (http://linkage.rockefeller.edu/wli/seq/), a unicellular eukaryote, Saccharomyces cerevisiae (Nature 1996), and a multicellular organism, Caenorhabditis elegans (The C. elegans Sequencing Consortium 1998), have been completely sequenced. Progress toward determination of the human DNA sequence has also become more rapid; at the time of this writing, the public databases contain 227.2 Mb of nonredundant, finished sequence available in contigs of >30 kb (and another 152.7 Mb of unfinished sequence) (http://www.ncbi.nlm.nih.gov/genome/seq/weekly_report.html). In comparison, there was 84.4 Mb of finished data (http://www.ebi.ac.uk/∼sterk/genome-MOT/) in February 1998. It is increasingly likely that the human sequence will be complete by 2003, and a working draft will be in hand even sooner (Collins et al. 1998; Venter et al. 1998). One consequence of our increased sequencing capacity is that within the next couple of years, we expect the rate of deposition of sequence data to increase from the current ∼3 Mb per week to an average of well over 10 Mb per week worldwide. Very few scientific fields can measure progress as easily as can be done for large-scale genomic sequencing, quantifiable as it is into base pairs per unit time. However, mere numbers can be deceptive—the essential "production" nature of large-scale genomic sequencing leaves it susceptible to errors in ways other scientific endeavors are not. Because of the rapid accumulation of human genomic sequence data, there is little opportunity for, or even possibility of, direct peer review of data prior to publication. The major venue for primary publication of genomic data is not the peer-reviewed literature at all, but public databases.
An Efficient DNA Sequencing Strategy Based on the Bacteriophage Mu in Vitro DNA Transposition Reaction
A highly efficient DNA sequencing strategy was developed on the basis of the bacteriophage Mu in vitro DNA transposition reaction. In the reaction, an artificial transposon with a chloramphenicol acetyltransferase (cat) gene as a selectable marker integrated into the target plasmid DNA containing a 10.3-kb mouse genomic insert to be sequenced. Bacterial clones carrying plasmids with the transposon insertions in different positions were produced by transforming transposition reaction products into Escherichia coli cells that were then selected on appropriate selection plates. Plasmids from individual clones were isolated and used as templates for DNA sequencing, each with two primers specific for the transposon sequence but reading the sequence in opposite directions, thus creating a minicontig. By combining the information from overlapping minicontigs, the sequence of the entire 10,288-bp region of the mouse genome, including six exons of the mouse Kcc2 gene, was obtained. The results indicated that the described methodology is extremely well suited for DNA sequencing projects in which considerable sequence information is needed. In addition, massive DNA sequencing projects, including those of full genomes, are expected to benefit substantially from the Mu strategy.
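The assembly step, combining overlapping minicontigs into one contiguous sequence, can be illustrated with a greedy exact-overlap merge. This is a simplified sketch, not the pipeline used in the study; the function name and the minimum-overlap threshold are illustrative, and real assemblers tolerate mismatches rather than requiring exact suffix/prefix matches:

```python
def merge_if_overlap(left, right, min_overlap=20):
    """Merge two minicontig sequences if a suffix of `left` exactly
    matches a prefix of `right` by at least `min_overlap` bases.
    Returns the merged sequence, or None if no sufficient overlap."""
    max_o = min(len(left), len(right))
    # Prefer the longest overlap, scanning down to the minimum allowed.
    for o in range(max_o, min_overlap - 1, -1):
        if left[-o:] == right[:o]:
            return left + right[o:]
    return None
```

Repeatedly applying such a merge across all pairs of minicontigs (in both orientations) grows the contig until the full insert sequence is covered.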
Inference of population genetic parameters in metagenomics: A clean look at messy data
Metagenomic projects generate short, overlapping fragments of DNA sequence, each deriving from a different individual. We report a new method for inferring the scaled mutation rate, θ = 2Neu, and the scaled exponential growth rate, R = Ner, from the site-frequency spectrum of these data while accounting for sequencing error via Phred quality scores. After obtaining maximum likelihood parameter estimates for θ and R, we calculate empirical Bayes quality scores reflecting the posterior probability that each apparently polymorphic site is truly polymorphic; these scores can then be used for other applications such as SNP discovery. For realistic parameter ranges, analytic and simulation results show our estimates to be essentially unbiased with tight confidence intervals. In contrast, choosing an arbitrary quality score cutoff (e.g., trimming reads) and ignoring further quality information during inference yields biased estimates with greater variance. We illustrate the use of our technique on a new project analyzing activated sludge from a lab-scale bioreactor seeded by a wastewater treatment plant.
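The role of Phred quality scores in this kind of inference rests on the standard conversion Q = −10 log₁₀(p_error). A minimal sketch of that conversion, and of the per-site probability that every base call covering a site is correct, follows; the helper names are illustrative and the independence-of-errors assumption is a simplification, not the authors' full likelihood model:

```python
def phred_to_error_prob(q):
    # Phred definition: Q = -10 * log10(p_error), so p_error = 10^(-Q/10).
    return 10 ** (-q / 10)

def prob_site_correct(quals):
    """Probability that all base calls at a site are correct,
    assuming sequencing errors are independent across reads."""
    p = 1.0
    for q in quals:
        p *= 1 - phred_to_error_prob(q)
    return p
```

So a Q20 call has a 1% error probability and a Q30 call 0.1%; a site covered by several low-quality reads can have a substantial chance of a spurious "polymorphism", which is why discarding quality information after a hard cutoff loses power.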
novoSNP, a novel computational tool for sequence variation discovery
Technological improvements shifted sequencing from low-throughput, work-intensive, gel-based systems to high-throughput capillary systems. This resulted in broad use of genomic resequencing to identify sequence variations in genes and regulatory regions, as well as in extended genomic regions. We describe a software package, novoSNP, that discovers single nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (INDELs) in sequence trace files in a fast, reliable, and user-friendly way. We compared the performance of novoSNP with that of PolyPhred and PolyBayes on two data sets. The first data set comprised 1028 sequence trace files obtained from diagnostic mutation analyses of SCN1A (neuronal voltage-gated sodium channel α-subunit type I gene). The second data set comprised 9062 sequence trace files from a genomic resequencing project aimed at the construction of a high-density SNP map of MAPT (microtubule-associated protein tau gene). Visual inspection of these data sets had identified 38 sequence variations for SCN1A and 488 for MAPT. novoSNP automatically identified all 38 SCN1A variations, including five INDELs, while for MAPT only 15 of the 488 variations were not correctly marked. PolyPhred detected far fewer SNPs than novoSNP and missed nearly all INDELs. PolyBayes,
designed for the sequence analysis of cloned templates, detected only a limited number of the variations present in the data set. Besides the significant improvement in the automated detection of sequence variations both in diagnostic mutation analyses and in SNP discovery projects, novoSNP also offers a user-friendly interface for inspecting possible genetic variations.
Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences.
A large set of mRNA and encoded protein sequences, from orthologous murine and human genes, was compiled to analyze statistical, biological, and evolutionary properties of coding and noncoding transcribed sequences. Protein sequence conservation varied between 36% and 100% identity, with an average value of 85%. The average degree of nucleotide sequence identity for the corresponding coding sequences was also approximately 85%, whereas 5' and 3' untranslated regions (UTRs) were less conserved, with aligned identities of 67% and 69%, respectively. For some mouse and human genes, nucleotide sequences are more highly conserved than the encoded protein sequences. A subset of 32 sequences, consisting of only mouse/human protein pairs for which the human sequence represents a positionally cloned disease gene, had properties very similar to the larger data set, suggesting that our data are representative of the genome as a whole. With respect to sequence conservation, two interesting outliers are the breast cancer (BRCA1) gene product and the testis-determining factor (SRY), both of which display among the lowest degrees of sequence identity. The occurrence of both introns and repetitive elements (e.g., Alu, B1) in 5' and 3' UTRs was also studied. These results provide one benchmark for the "comparative genomics" of mice and humans, with practical implications for the cross-referencing of transcript maps. Also, they should prove useful in estimating the additional sampling diversity provided by mouse EST sequencing projects designed to complement the existing human cDNA collection.
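Pairwise identity figures like the 85% average above come from counting matching positions in an alignment. A minimal sketch of that count (illustrative only; it assumes the two sequences are already aligned with '-' for gaps and scores identity over columns where both sequences have a residue, one of several conventions in use):

```python
def percent_identity(seq_a, seq_b):
    """Percent identity between two pre-aligned sequences of equal
    length, with '-' marking gaps. Identity is computed over columns
    where neither sequence has a gap."""
    assert len(seq_a) == len(seq_b), "sequences must be aligned"
    matches = sum(a == b and a != '-' for a, b in zip(seq_a, seq_b))
    aligned = sum(a != '-' and b != '-' for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / aligned
```

Note that whether gapped columns count toward the denominator changes the result, which is one reason published identity values are only comparable when the convention is stated.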
Pattern of Sequence Variation Across 213 Environmental Response Genes
To promote the clinical and epidemiological studies that improve our understanding of human genetic susceptibility to environmental exposure, the Environmental Genome Project (EGP) has scanned 213 environmental response genes involved in DNA repair, cell cycle regulation, apoptosis, and metabolism for single nucleotide polymorphisms (SNPs). Many of these genes have been implicated by loss-of-function mutations associated with severe diseases attributable to decreased protection of genomic integrity. Therefore, the hypothesis for these studies is that individuals with functionally significant polymorphisms within these genes may be particularly susceptible to genotoxic environmental agents. On average, 20.4 kb of baseline genomic sequence, or 86% of each gene, including a substantial amount of introns, all exons, and 1.3 kb upstream and downstream, was scanned for variations in the 90 samples of the Polymorphism Discovery Resource panel. The average nucleotide diversity across the 4.2 Mb of these 213 genes is 6.7 × 10^-4, or one SNP every 1500 bp, when two random chromosomes are compared. The average candidate environmental response gene contains 26 PHASE-inferred haplotypes, 34 common SNPs, 6.2 coding SNPs (cSNPs), and
2.5 nonsynonymous cSNPs. SIFT and PolyPhen analysis of 541 nonsynonymous cSNPs identified 57 potentially deleterious SNPs. An additional eight polymorphisms predict altered protein translation. Because these genes represent 1% of all known human genes, extrapolation from these data predicts the total genomic set of cSNPs, nonsynonymous cSNPs, and potentially deleterious nonsynonymous cSNPs. The implications for the use of these data in direct and indirect association studies of environmentally induced diseases are discussed.
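The reported diversity and SNP spacing are two views of the same number: a per-site nucleotide diversity of 6.7 × 10⁻⁴ implies, on average, one difference per 1/(6.7 × 10⁻⁴) ≈ 1500 bp when two random chromosomes are compared. A one-line sketch of that conversion (the function name is illustrative):

```python
def mean_snp_spacing(pi):
    """Expected number of bases between pairwise differences, given
    average nucleotide diversity `pi` (the per-site probability that
    two randomly chosen chromosomes differ)."""
    return 1.0 / pi

spacing = mean_snp_spacing(6.7e-4)  # ~1493 bp, i.e. roughly one SNP per 1500 bp
```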
Connecting Sequence and Biology in the Laboratory Mouse
The Mouse Genome Sequencing Consortium and the RIKEN Genome Exploration Research group have generated large sets of sequence data representing the mouse genome and transcriptome, respectively. These data provide a valuable foundation for genomic research. The challenges for the informatics community are how to integrate these data with the ever-expanding knowledge about the roles of genes and gene products in biological processes, and how to provide useful views to the scientific community. Public resources, such as the National Center for Biotechnology Information (NCBI; http://www.ncbi.nih.gov), and model organism databases, such as the Mouse Genome Informatics database (MGI; http://www.informatics.jax.org), maintain the primary data and provide connections between sequence and biology. In this paper, we describe how the partnership of MGI and NCBI LocusLink contributes to the integration of sequence and biology, especially in the context of the large-scale genome and transcriptome data now available for the laboratory mouse. In particular, we describe the methods and results of integration of 60,770 FANTOM2 mouse cDNAs with gene records in the databases of MGI and LocusLink.
Computational Comparison of Human Genomic Sequence Assemblies for a Region of Chromosome 4
Much of the available human genomic sequence data exist in a fragmentary draft state following the completion of the initial high-volume sequencing performed by the International Human Genome Sequencing Consortium (IHGSC) and Celera Genomics (CG). We compared six draft genome assemblies over a region of chromosome 4p (D4S394–D4S403): two consecutive releases by the IHGSC at the University of California, Santa Cruz (UCSC), two consecutive releases from the National Center for Biotechnology Information (NCBI), the public release from CG, and a hybrid assembly we have produced using IHGSC and CG sequence data. This region presents particular problems for genomic sequence assembly algorithms as it contains a large tandem repeat and is sparsely covered by draft sequences. The six assemblies differed both in terms of their relative coverage of sequence data from the region and in their estimated rates of misassembly. The CG assembly method attained the lowest level of misassembly, whereas NCBI and UCSC assemblies had the highest levels of coverage. All assemblies examined included <60% of the publicly available sequence from the region. At least 6% of the sequence data within the CG assembly for the D4S394–D4S403 region was not present in publicly available sequence data. We also show
that even in a problematic region, existing software tools can be used with high-quality mapping data to produce genomic sequence contigs with a low rate of rearrangements.