Introduction to sequence databases NM

ECOL/MCB/BIOC 453/553
Lectures will be posted on the web site a day or so after the lecture, sometimes before. Usually.
The homework page should be working now (LINK on course website).
Username: your UA login
PW: first 3 letters of username (please change your password during 1st use)
Email (nmoran@) if you have problems.
The exercise will also be at the website as a .doc file.

Aug 29 2006 Introduction to sequence databases


From “A User’s Guide to the Human Genome”, Wolfsberg et al. 2002:
“The user must understand the capabilities - and limitations - of the programs being used. In the same way that molecular biologists need to understand the chemistry underlying a routine assay or the physics behind separation techniques, they must have a basic understanding of what search or analysis methods actually do once the 'Submit' button has been pressed.”

• Core sequence database is the International Nucleotide Sequence Database Collaboration
– GenBank (at National Center for Biotechnology Information, National Institutes of Health)
– DNA DataBank of Japan (DDBJ)
– European Molecular Biology Laboratory (EMBL)
– These three organizations exchange data daily to provide comprehensive worldwide coverage
– Entrez is the integrated search & retrieval system for nucleotide and other information at NCBI

http://www.ncbi.nlm.nih.gov/
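
Entrez can also be queried by a program rather than through the web pages. The sketch below only builds a request URL; it assumes NCBI's E-utilities `efetch` service and its `db`, `id`, `rettype` and `retmode` parameters, none of which appear on the slide, and the function name `genbank_record_url` is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities "efetch" service (an assumption of this
# sketch; the slide only names Entrez and the NCBI home page).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build a URL asking Entrez for one nucleotide record as a flat file."""
    params = {
        "db": "nuccore",    # Entrez nucleotide database
        "id": accession,    # accession number, e.g. "X16972"
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",  # plain text rather than XML
    }
    return EFETCH + "?" + urlencode(params)


url = genbank_record_url("X16972")
print(url)
```

Opening the printed URL in a browser (or with `urllib.request.urlopen`) should return the flat file for that accession, subject to NCBI's usage policies.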


Screenshot: NCBI cross-database search page (All Databases), http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi, with links to PubMed, OMIM, GenBank (all nucleotides), protein sequences, and genome sequences.

Screenshot: Molecular Databases page, shows links, http://www.ncbi.nlm.nih.gov/Database/


NUCLEOTIDE DATABASES
NCBI's sequence databases accept genome data from sequencing projects from around the world and serve as the cornerstone of bioinformatics research.
GenBank: An annotated collection of all publicly available nucleotide and amino acid sequences, contributed by researchers.
dbEST: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).
GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.
Trace Archive: A permanent repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects. Updated weekly; contains data from ongoing sequencing projects.
dbSNP: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes.

• Where do Sequence Data in GenBank come from?
– The most important source of new data for GenBank is direct submissions from individual research labs and sequencing centers
– GenBank depends on contributors to keep the database comprehensive, current, and accurate
– NCBI processes submissions and reviews the information in new entries and updates, e.g.
• checking ORFs for integrity
• checking taxonomic information
• detecting vector sequence contamination
• detecting redundancy with previous sequences
• monitoring notation for consistent format

– NCBI assists authors who have new data to submit and provides software (“Sequin”) for preparing data submissions.
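
One of the review steps listed above, checking ORFs for integrity, can be illustrated with a few simple tests. This is only a sketch of the idea, not NCBI's actual procedure: it assumes the standard genetic code, an ATG start codon, and a coding sequence supplied in frame; the function name `orf_problems` is hypothetical.

```python
# Stop codons of the standard genetic code (an assumption; some organisms
# and organelles use other codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_problems(cds: str) -> list[str]:
    """Return a list of problems found in a putative coding sequence."""
    seq = cds.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    if any(c in STOP_CODONS for c in codons[:-1]):
        problems.append("contains an internal stop codon")
    return problems


print(orf_problems("atggcagaggaatactaa"))  # a clean toy ORF
print(orf_problems("ATGTAAGGGTGA"))        # stop codon in the middle
```

An empty list means none of these particular checks failed; it does not show that the ORF is a real gene.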


• Why do scientists submit their data to GenBank?

– Most journals require that DNA and amino acid sequences used in published research are submitted to a sequence database before publication (accession number must be provided).
– Many funding agencies require that sequence data from a funded project be submitted promptly to public databases.
– Sequence data submitted in advance of publication can be kept confidential if requested, but are released if the accession # is published.
– Submission of data underlying published results is sometimes a contested issue, especially for data collected by for-profit institutions that wish to publish but not to reveal potentially valuable data.

GenBank grows at an exponential rate: the number of nucleotide bases doubles about every 14 months. In April 2006, GenBank held 130 billion bases from over 100,000 species.
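
A constant doubling time means the size after t months is N(t) = N0 × 2^(t/14). The short sketch below works through that arithmetic from the April 2006 figure; it simply assumes the 14-month doubling time keeps holding, which is an extrapolation and not a statement from the slide.

```python
def projected_bases(n0: float, months: float, doubling_months: float = 14.0) -> float:
    """Size after `months`, assuming a constant doubling time."""
    return n0 * 2 ** (months / doubling_months)


APRIL_2006 = 130e9  # bases in GenBank, from the slide

# Implied growth factor over 12 months (about 1.8-fold per year).
yearly_factor = projected_bases(1.0, 12)
print(round(yearly_factor, 2))

# Projected size one and two doubling times later.
print(projected_bases(APRIL_2006, 14))
print(projected_bases(APRIL_2006, 28))
```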


• What materials are the sources of sequences in GenBank?
– Sequences can come from a wide variety of source organisms
• Long-term laboratory strains
• Environmental samples, including population samples
• Phylogenetic studies across related taxa
• Genetic studies, mutants

– Sequences can originate from
• cDNA derived from mRNA as whole or partial sequences
• RNA sequenced directly (in the past)
• direct sequencing of DNA templates
• PCR products – directly sequenced or cloned
– usually cloned using some vector (must remove vector sequence)

• Sequences are given a standard format, in a GenBank flat file
– Some information is essential, e.g.
• The sequence
• The taxon (species or strain)
• The contributors of the sequence

– Some information is optional, e.g.
• Annotation on promoters, other functional information
• Translation of the ORF, annotation of ORFs
• Information on polymorphic sites


1: NM_121197. Arabidopsis thali...[gi:18416608] Links

LOCUS       NM_121197                711 bp    mRNA    linear   PLN 31-JUL-2003
DEFINITION  Arabidopsis thaliana AP2 domain transcription factor TINY,
            putative (At5g11590) mRNA, complete cds.
ACCESSION   NM_121197
VERSION     NM_121197.1  GI:18416608
KEYWORDS    .
SOURCE      Arabidopsis thaliana (thale cress)
  ORGANISM  Arabidopsis thaliana
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids II; Brassicales; Brassicaceae; Arabidopsis.
REFERENCE   1  (bases 1 to 711)
  AUTHORS   Town,C.D., Haas,B.J., Maiti,R., Hannick,L.I., Chan,A.P.,
            Ronning,C.M., Smith Jr.,R.K., Yu,C., Wortman,J.R., White,O. and
            Fraser,C.M.
  TITLE     Arabidopsis thaliana chromosome 5 genomic sequence
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 711)
  AUTHORS   Haas,B.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-MAY-2003) The Institute for Genomic Research, 9712
            Medical Center Dr, Rockville, MD 20850, USA, bhaas@tigr.org
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. This record is derived from an annotated genomic
            sequence (NC_003076). The reference sequence was derived from
            mrna.At5g11590.1.
            Release 4.0 of the Arabidopsis genome annotation from TIGR
            database. The current tiling path and associated information can
            be viewed at: http://www.tigr.org/tdb/e2k1/ath1/ath1.shtml.
            Chromosome annotation is available in TIGR XML format at
            ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/PSEUDOCHROMOSOMES.
FEATURES             Location/Qualifiers
     source          1..711
                     /organism="Arabidopsis thaliana"
                     /mol_type="mRNA"
                     /cultivar="Columbia"
                     /db_xref="taxon:3702"
                     /chromosome="5"
     gene            1..711
                     /locus_tag="At5g11590"
                     /note="synonym: F15N18.180; AP2 domain transcription
                     factor TINY, putative; go_component: [goid GO:0003677]
                     [evidence ISS]"
     CDS             1..711
                     /locus_tag="At5g11590"
                     /note="similar to transcription factor TINY (GI:1246403)
                     [Arabidopsis thaliana]"
                     /codon_start=1
                     /product="AP2 domain transcription factor TINY, putative"
                     /protein_id="NP_196720.1"
                     /db_xref="GI:15239113"
                     /translation="MAEEYYSLRSERVTQLLVPNSESDSVSDKSKAEQSEKKTKRGRD
                     SGKHPVYRGVRMRNWGKWVSEIREPRKKSRIWLGTFPTPEMAARAHDVAALSIKGTAA
                     ILNFPELADSFPRPVSLSPRDIQTAALKAAHMEPTTSFSSSTSSSSSLSSTSSLESLV
                     LVMDLSRTESEELGEIVELPSLGASYDVDSANLGNEFVFYDSVDYCLYPPPWGQSSED
                     NYGHGISPNFGHGLSWDL"
BASE COUNT      167 a    174 c    197 g    173 t
ORIGIN
        1 atggcagagg aatactacag cctccgctcg gagagagtaa ctcagcttct tgtccctaac
       61 tcggagtctg actcagtgag tgacaaaagc aaagctgagc aaagcgagaa gaagactaaa
      121 cgtgggagag actccggtaa acaccctgtt tatcgcggag taaggatgag gaactgggga
      181 aaatgggtgt cggagattcg tgagccgagg aagaaatcac gtatttggct gggaactttc
      241 ccgacgccgg agatggcggc gcgtgcacac gacgtggcgg ctctgagcat taaaggaacg
      301 gccgctatac taaacttccc tgaactcgct gactcattcc ctcgacccgt ttcattaagc
      361 cctcgagaca ttcagacagc agctcttaaa gcagctcaca tggaaccgac gacgtcgttt
      421 tcatcttcca cgtcttcgtc gtcgtctttg tcttctacgt cttcgctcga gtctcttgtg
      481 ttggtgatgg acctctcgag gactgagtcg gaggagctcg gtgagattgt ggagcttcca
      541 agtctcgggg cgagttacga cgtcgactcg gctaaccttg ggaacgagtt tgtcttctat
      601 gactcagttg actactgttt atatccgccg ccgtggggac agtcgtccga agataactat
      661 ggtcacggaa ttagccctaa ttttggccat ggcttgtcat gggatctcta a

Example of a (tiny) GenBank file

Header with information describing the sequence

the sequence, 60 nt per line, numbered
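
Because the sequence block is numbered and split into groups of ten bases, a program has to strip the position numbers and spaces before using it. The following is a minimal sketch of that step, assuming the sequence follows a line beginning with ORIGIN and that the record ends with a `//` line (the usual flat-file terminator, which is not shown on the slide); `read_origin` is a hypothetical helper name.

```python
def read_origin(flat_file_text: str) -> str:
    """Return the sequence found between the ORIGIN line and the closing //."""
    bases = []
    in_origin = False
    for line in flat_file_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # First token is the position number; the rest are 10-base blocks.
            parts = line.split()
            bases.extend(parts[1:])
    return "".join(bases)


# First two sequence lines of NM_121197, the second one shortened.
example = """\
ORIGIN
        1 atggcagagg aatactacag cctccgctcg gagagagtaa ctcagcttct tgtccctaac
       61 tcggagtctg actcagtgag tgacaaaagc
//
"""
seq = read_origin(example)
print(len(seq), seq[:10])
```

For real work a maintained parser (for example in a bioinformatics library) is usually a better choice than hand-written code like this.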

Fields highlighted in the header of a GenBank record:
• Accession # (ACCESSION)
• Length (LOCUS)
• Definition of sequence (DEFINITION)
• Journal articles in which sequence appeared (REFERENCE)
• Date last modified (LOCUS)
• Organism & classification (SOURCE, ORGANISM)

LOCUS       DMCECPN                 4838 bp    DNA     linear   INV 10-MAR-2001
DEFINITION  Drosophila melanogaster cecropin gene cluster.
ACCESSION   X16972
VERSION     X16972.1  GI:7712
KEYWORDS    andropin; cecropin; cecropin A; cecropin B.
SOURCE      fruit fly.
  ORGANISM  Drosophila melanogaster
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 421 to 4838)
  AUTHORS   Kylsten,P., Samakovlis,C. and Hultmark,D.
  TITLE     The cecropin locus in Drosophila; a compact gene cluster involved
            in the response to infection
  JOURNAL   EMBO J. 9 (1), 217-224 (1990)
  MEDLINE   90107946
   PUBMED   2104802
REFERENCE   2  (bases 1 to 620)
  AUTHORS   Samakovlis,C., Kylsten,P., Kimbrell,D.A., Engstrom,A. and
            Hultmark,D.
  TITLE     The andropin gene and its product, a male-specific antibacterial
            peptide in Drosophila melanogaster
  JOURNAL   EMBO J. 10 (1), 163-169 (1991)
  MEDLINE   91114699
   PUBMED   1899226
REFERENCE   3  (bases 421 to 4838)
  AUTHORS   Kylsten,P., Samakovlis,C. and Hultmark,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (20-OCT-1989) Kylsten P., Samakovlis C., Hultmark D.,
            Department of Microbiology, University of Stockholm, S-106 91
            Stockholm, Sweden

. . .


source feature: info on origin of sequence, how it was amplified

. . .
FEATURES             Location/Qualifiers
     source          1..4838
                     /organism="Drosophila melanogaster"
                     /strain="Canton S"
                     /db_xref="taxon:7227"
                     /chromosome="3"
                     /map="99E"
                     /clone="pTZ18R, pT2 18/19, Bluescript YS"
                     /clone_lib="lambda ZAP II, lambda Charon 4"
                     /dev_stage="adult"

. . .

Often necessary to look at journal article for methods...

Positions of coding regions, introns, exons, promoters, other elements, transcribed regions

     TATA_signal     1147..1153
     mRNA            join(1177..1349,1411..1585)
                     /note="cecropin A1"
     prim_transcript <1177..>1585
     exon            1177..1349
     CDS             join(1251..1349,1411..1503)
                     /codon_start=1
                     /product="cecropin A1"
                     /protein_id="CAA34843.1"
                     /db_xref="GI:1515177"
                     /db_xref="FLYBASE:FBgn0000276;"
                     /db_xref="SWISS-PROT:P14954"
                     /translation="MNFYNIFVFVALILAITIGQSEAGWLKKIGKKIERVGQHTRDAT
                     IQGLGIAQQAANVAATARG"
     mat_peptide     join(1251..1349,1411..1500)
     intron          1350..1410
     exon            1411..1585
     polyA_signal    1563..1568
     TATA_signal     2429..2435
     mRNA            join(2459..2639,2698..2872)
                     /note="cecropin A2"
     prim_transcript <2459..>2872
     exon            2459..2639
     CDS             join(2541..2639,2698..2790)
                     /codon_start=1
                     /product="cecropin A2"
                     /protein_id="CAA34844.1"
                     /db_xref="GI:1515178"
                     /db_xref="FLYBASE:FBgn0000277;"
                     /db_xref="SWISS-PROT:P14954"
                     /translation="MNFYNIFVFVALILAITIGQSEAGWLKKIGKKIERVGQHTRDAT
. . .


Sequence of the region containing cecropin A1. Slide callouts: tata box (1147..1153), exons (1177..1349 and 1411..1585), intron (1350..1410), transcribed region for cecropin A1, poly A tail.

     1021 ccgattgttc cctagatgtg cagatgtgtg cttggaatca gatcggttac cttcagtgta
     1081 cttttctctg caaaaatccc cgtgcatgcc ttatctgtca ttttgttttt caagctgctg
     1141 ttcgcctata aaagctctcg ccttttgtat cgcagtcatc agtcgctcag acctcactgc
     1201 aatatcaata tctttagctt ctcctaagaa aaaatcaaga aaatatcacc atgaacttct
     1261 acaacatctt cgttttcgtc gctctcattc tggccatcac cattggacaa tcggaagctg
     1321 ggtggctgaa gaaaattggc aagaaaatcg taagttcttc catttgaaat ctgttaagac
     1381 ggaaactaac tgactaactt cttttcgaag gaacgcgttg gtcagcacac tcgggatgcc
     1441 acaatccagg gactgggaat cgctcaacaa gccgccaatg tcgccgcaac tgcccgaggt
     1501 tgaccacgat gattatttat aattatttat ttaaagatct atttattctg ttgctccctg
     1561 taaataaaac aattttaaaa atttaaagaa ttctattcaa actttgtttt ttaaagagtt
     1621 ggagaaaagc gaactcttga atttatacac acattttaaa tacacttaag aggcattatt
     1681 tatacaggat attacaaatc gcttcttttc cgatttggaa aggccgagat tatgtcttat
     1741 ctgttgaaat ataattcgtt tcacctataa aaggaccagt cttttagttt aaattatcag
     1801 tcgcttgtca aatactgaaa caattagatt aatttgtgga ttttatttgt cctcatcctg
     1861 accacttatt ggccacaatt ggaagctggc ttcgacggga cattagtaag cttagtcatt

CDS feature: info on protein-coding sequences and the corresponding protein, including links to the protein database, genome database, and model organism site (/db_xref qualifiers), and the translated sequence (/translation qualifier).
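
Feature locations such as `join(1251..1349,1411..1503)` list the exon segments that are spliced together. As a sanity check on an annotation, the segment lengths can be added up and compared with the length of the translation. The sketch below handles only simple `start..end` segments (optionally marked with `<` or `>`); it does not handle `complement(...)` or single-base positions, and the function names are hypothetical.

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Extract (start, end) pairs from a location string such as join(a..b,c..d)."""
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return [(int(start), int(end)) for start, end in pairs]


def spliced_length(location: str) -> int:
    """Total number of bases covered by all segments (coordinates are inclusive)."""
    return sum(end - start + 1 for start, end in parse_location(location))


cds = "join(1251..1349,1411..1503)"   # cecropin A1 CDS from the record above
n_bases = spliced_length(cds)
n_residues = n_bases // 3 - 1         # subtract the stop codon

print(parse_location(cds))  # [(1251, 1349), (1411, 1503)]
print(n_bases, n_residues)  # 192 63
```

The 63 residues agree with the length of the /translation given for cecropin A1, which is the kind of consistency a user can check before trusting an annotation.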


• GenBank provides software for generating the file format from the annotated sequence
• GenBank staff check the entry and may add some information
• The depositors of a sequence can update it or correct it anytime after it is deposited
– Depositors “own” their sequences, which cannot be changed or “corrected” by a 3rd party
– Date of modification is added
– E.g., translation

• Some entries have errors or missing information in the sequence or annotation
– User must evaluate
– The annotation is usually just a hypothesis about the sequence features and can be quite wrong

The Difference between deposited GenBank sequences and “RefSeq” sequences
The GenBank archival sequence database includes publicly available DNA sequences submitted from individual laboratories or sequencing projects.
As an archival database, GenBank can be very redundant for some loci.

RefSeq sequences are derived from GenBank and provide non-redundant curated data representing our current knowledge of known genes.

Some records include additional sequence information that was never submitted to an archival database but is available in the literature.
Some sequence records are provided through collaboration; the underlying primary sequence data are available in GenBank, but may not be available in any one GenBank record.
RefSeq sequences are not submitted primary sequences. RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional sequence information.
It can be more efficient to search RefSeq because it is not redundant and can combine overlapping fragments of different submissions.


• Included in GenBank (= the International Nucleotide Sequence Database Collaboration)
– NCBI Database of “Expressed Sequence Tags” (dbEST)
– NCBI Database of “Genome Survey Sequences” (dbGSS)
• Random sequencing of a DNA sample
– “Complete” genome sequences of organelles, chromosomes or whole genomes (Genome)

http://www.ncbi.nih.gov/Database/datamodel/index.html

What is an EST? “Expressed sequence tag”

mRNA (from some species/strain under certain environmental conditions)
   ↓ reverse transcription, DNA polymerization
cDNA: double-stranded DNA copy of mRNA (cloned using a vector)
   ↓ sequence
EST: sequence of a cDNA, usually a partial sequence from one end and terminated at some arbitrary point, dependent on the sequence quality

Gives partial sequences of coding genes expressed at one stage, in one environment, or in one tissue
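
Why an EST ends "at some arbitrary point, dependent on the sequence quality" can be illustrated with a toy trimming rule: keep the single-pass read until the base-call quality first drops below a cutoff. This is a deliberately simplified sketch; the cutoff of 20 and the quality values are made up for illustration, and real pipelines typically use more elaborate, window-based trimming.

```python
def trim_read(bases: str, qualities: list[int], min_quality: int = 20) -> str:
    """Keep the read up to (not including) the first base below min_quality."""
    for i, q in enumerate(qualities):
        if q < min_quality:
            return bases[:i]
    return bases


# A hypothetical single-pass read whose quality decays toward the end.
read = "ATGGCAGAGGAATAC"
quals = [40, 40, 38, 37, 35, 33, 30, 28, 25, 22, 18, 15, 12, 9, 5]

est = trim_read(read, quals)
print(est, len(est))  # ATGGCAGAGG 10
```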


Why obtain ESTs rather than whole genome sequences or sequences straight from genomic DNA?
– ESTs are a way to enrich for sequences important to organism function
– Proportion of DNA that encodes genes varies among organisms
– mRNA corresponds mostly to functioning genes
• E.g., the sequence of humans is 3200 Mb (megabase pairs); only ~1% is functional protein-coding genes
• ESTs used mainly for eukaryotes (which usually have more non-coding DNA)
• How true is this? Nonfunctional regions are sometimes transcribed, and stable RNAs (not translated) are another important kind of gene

– Major use is as an aid in gene annotation: recognizing which DNA regions are “real” genes
– Can reveal alternatively spliced transcripts (different exon combinations that make different proteins)
– Can be used to examine differential gene expression of tissues or cell types or among phenotypes such as healthy vs. diseased
– ESTs can provide candidate genes for a given process since genes usually must be transcribed to have an effect on the phenotype

www.ncbi.nlm.nih.gov/dbEST/


• What nucleotide sequence databases are not included in GenBank (= the International Nucleotide Sequence Database Collaboration)?
– Individual genome sequencing projects produce databases consisting of sequence obtained in the course of sequencing a genome or parts of a genome
• Not available for downloading as GenBank entries, but can be searched using sequence similarity at NCBI or at individual Sequencing Center websites
– Trace Archive available for most large ongoing genome sequencing projects
• Weekly deposition of new sequences, along with sequence quality information (traces from electropherograms)
– NCBI Database of Single Nucleotide Polymorphisms (dbSNP)
• Separate database on genetic variants within species, available at NCBI
– Private databases

Trace Archive: raw sequence data from ongoing projects
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?


• Genome sequences started with bacteria

– Smaller genomes (1-10 megabases)
– Most DNA encodes functional proteins
• more “meaning” per nucleotide
– Haemophilus influenzae published in 1995
• First full genome of a cellular organism
• Showed that random, “shotgun” sequencing could work (more in 2 weeks) and paved the way for larger projects (human genome)
• Idea that we could fully define the blueprint for life, a list of all the required “parts” of a functioning cell
– Mycoplasma genitalium published later in 1995
– Idea of “Minimal Genome”: the set of genes essential for cell replication


A genome sequence is deposited as a GenBank file or as a set of large GenBank files…

Visualizing features of a genome
Circular representation of the Yersinia pestis genome.
The outer scale is marked in megabases.
Circles 1 and 2: all genes, color coded by function, forward and reverse strand*
Circles 3 and 4: pseudogenes
Circles 5 and 6: insertion sequence elements
Circle 7: G + C content (higher values outward)
Circle 8: GC bias = (G - C)/(G + C); khaki indicates values >1, purple <1
Callout: indicates position of ORIGIN OF REPLICATION

*Color coding by FUNCTIONAL CATEGORY: dark blue, pathogenicity or adaptation; black, energy metabolism; red, information transfer; dark green, surface associated; cyan, degradation of large molecules; magenta, degradation of small molecules; yellow, central or intermediary metabolism; pale blue, regulators; orange, conserved hypothetical; brown, pseudogenes; pink, phage and insertion sequence elements; pale green, unknown; grey, miscellaneous.
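
The two inner circles of such a plot are simple per-window statistics. The sketch below computes G + C content, (G + C)/length, and GC bias, (G - C)/(G + C), in non-overlapping windows; the window size and the toy sequence are arbitrary choices for illustration, and the function names are hypothetical. In many bacterial genomes the sign of the GC bias switches near the origin of replication, which is why it is plotted.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g + c) / len(s) if s else 0.0


def gc_bias(seq: str) -> float:
    """GC bias (skew): (G - C) / (G + C); 0.0 if the window has no G or C."""
    s = seq.upper()
    g, c = s.count("G"), s.count("C")
    return (g - c) / (g + c) if (g + c) else 0.0


def windows(seq: str, size: int, step: int):
    """Yield (start, subsequence) for each full window along seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


# Toy "genome": a G-rich stretch followed by a C-rich stretch.
genome = "GGGCGGATGC" + "CCCGCCATCG"
profile = [(start, gc_content(w), gc_bias(w)) for start, w in windows(genome, 10, 10)]
for row in profile:
    print(row)
```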


Genome of Buchnera APS

Shigenobu et al 2000 Nature


Genome sequences, 8/06
• Genomes “finished” & on GenBank
– 369 bacterial genomes finished
– 30 archaeal genomes finished
– ~25 eukaryotes, “finished”

• Small genomes
– 1000 mitochondrial genomes
– >2000 viruses
– 790 plasmids

• Many more in various stages of completion, not on GenBank but links on “Eukaryotic Projects” page – http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
– ~20 Drosophila species
– Rice species
– Many yeast species

Some major sequencing landmarks in eukaryotes

Organism                   Genome size   Year
• Yeast                    12 Mb         1996
• C. elegans               97 Mb         1998
• Drosophila melanogaster  180 Mb        2000 and after
• Arabidopsis              125 Mb        2000
• Human Genome draft       3200 Mb       Feb 2001
• Human Genome finished    3020 Mb       Feb 2004


• Sequences in ongoing genome projects may be deposited in GenBank or other databases as fragments
– “contigs”
• fragment of the genome sequence that is assembled from multiple overlapping sequence reads
• continuous sequence, of varying quality

– “scaffolds”
• fragments consisting of sets of ordered and oriented contigs with statistically derived estimates of the gap sizes between component contigs
• produced in the course of sequencing and assembling a genome

– Usually available for downloading or searching online
• NCBI or other sites (most linked to NCBI)
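
The relationship between contigs and scaffolds can be made concrete with a small data structure: a scaffold is an ordered list of contigs, each with an orientation and an estimated gap to the next one. The sketch below writes a scaffold out as one sequence, filling each gap with the estimated number of N characters. The class and field names are hypothetical and the sequences are toy data.

```python
from dataclasses import dataclass


@dataclass
class PlacedContig:
    sequence: str   # contig sequence as assembled from overlapping reads
    reverse: bool   # True if the contig is placed in reverse orientation
    gap_after: int  # estimated gap (bp) before the next contig; 0 for the last


COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def scaffold_sequence(contigs: list[PlacedContig]) -> str:
    """Join oriented contigs, representing each estimated gap as a run of N."""
    parts = []
    for contig in contigs:
        oriented = reverse_complement(contig.sequence) if contig.reverse else contig.sequence
        parts.append(oriented)
        parts.append("N" * contig.gap_after)
    return "".join(parts)


scaffold = [
    PlacedContig("ATGGCA", reverse=False, gap_after=5),
    PlacedContig("TTTCCC", reverse=True, gap_after=0),
]
print(scaffold_sequence(scaffold))  # ATGGCANNNNNGGGAAA
```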

A scaffold from Anopheles gambiae


Unfinished sequences on GenBank, 2005: making data available in useful form before the “final” sequence is done.

Genome Projects pages with links to:
• Sequence data
• Protein databases
• ESTs
• Similarity (BLAST) searching
• Annotation
• Biological information
• Publications
• Researchers, etc.

Example of genome project page: Anopheles gambiae


Example of genome project page: Anopheles gambiae, Mapview of X chromosome

Searching sequences directly
• Without depending on annotation
• Use sequence similarity to a known sequence
• Nucleotide or protein sequences
• Common to search protein sequences that are derived from nucleotide sequences
• Protein sequences are more informative, especially for divergent sequences…
• Why?
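
One commonly given part of the answer is the redundancy of the genetic code: two genes can differ at many nucleotide positions (synonymous changes) and still encode the same protein. The toy example below uses a partial codon table containing only the codons it needs; the two sequences are invented for illustration and are assumed to be already aligned and in frame.

```python
# Partial codon table (standard genetic code), limited to the codons used here.
CODONS = {
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
    "GGT": "G", "GGC": "G",
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence using the partial table above."""
    return "".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length, aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


gene_a = "CTGAAAGGT"
gene_b = "TTAAAGGGC"

dna_identity = identity(gene_a, gene_b)                            # 5 of 9 positions
protein_identity = identity(translate(gene_a), translate(gene_b))  # LKG vs LKG

print(round(dna_identity, 2), protein_identity)  # 0.56 1.0
```

The larger amino acid alphabet (20 letters rather than 4) also makes chance matches less likely, which is a second reason protein comparisons tend to stay informative at greater divergence.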


• BLAST = Basic Local Alignment Search Tool

– Most common method to find sequences in a database that are similar to a query sequence
– Similarity is (usually) based on evolutionary homology
– Many variants of BLAST searching, using DNA, protein, different criteria for scoring matches
– More on this later in this course; here just mentioned as a basic tool for finding similar sequences in a database
– For small questions, such as finding a match to a newly obtained single sequence read within a genome database, can do a BLAST query online
– For large questions, such as finding matches to a large set of ESTs within a genome, can download the database and BLAST locally
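
As a preview, the first step of a BLAST-style search, finding short exact word matches ("seeds") shared by the query and a database sequence, can be sketched in a few lines. This is not BLAST itself: it omits seed extension, scoring, and statistics, and the word size of 4 and the toy sequences are chosen only to keep the example small.

```python
from collections import defaultdict


def index_words(subject: str, k: int) -> dict[str, list[int]]:
    """Map every k-letter word in the subject to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(subject) - k + 1):
        index[subject[i:i + k]].append(i)
    return index


def find_seeds(query: str, subject: str, k: int = 4) -> list[tuple[int, int]]:
    """Return (query_pos, subject_pos) for every exact k-letter word match."""
    index = index_words(subject, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for s in index.get(query[q:q + k], []):
            seeds.append((q, s))
    return seeds


subject = "TTGACCATGGCAGAGGTT"  # toy database sequence
query = "ATGGCAGA"              # toy query

seeds = find_seeds(query, subject)
print(seeds)  # [(0, 6), (1, 7), (2, 8), (3, 9), (4, 10)]
```

All five seeds lie on the same diagonal (subject position minus query position equals 6), which is the signal that a longer local match starts at position 6 of the subject.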

Query sequence

Database to search, type of search, Type of query, other parameters (more in 2 weeks)
Blasting unfinished genomes at NCBI--some of these are not annotated and some are not GenBank entries. Can be accessed at sequencing centers also.


Model organism sites, eg,


Genome sequences can be searched with query sequences to find similar sequences--many variants available depending on purposes (also available at NCBI)

Can also search by gene name, other identifier


Linking sequence data to lab studies, older experimental results

Examples of databases that are not nucleotide sequences
NCBI Protein Database
The Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL and DDBJ as well as protein sequences submitted to Protein Information Resource (PIR), SWISSPROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB) (sequences from solved structures).

NCBI Structure Database
The Structure database or Molecular Modeling DataBase (MMDB) contains experimental data from crystallographic and NMR structure determinations. The data for MMDB are obtained from the Protein Data Bank (PDB). The NCBI has cross-linked structural data to bibliographic information, to the sequence databases, and to the NCBI taxonomy.

Gene Expression Omnibus, SAGEmap (http://www.ncbi.nlm.nih.gov/geo/)
Microarray data on gene expression


Precomputed BLAST scores in NCBI protein database Precomputed phylogenetic trees for protein families

Many specialized databases exist for particular aspects of genomic data, including evolutionary comparisons, phylogenetics, gene expression, structure, metabolism

Genome alignment of bacterial species in Enterix

Metabolic map interfaced with genome sequence at BioCyc


Human genome resources OMIM

Exercise and Readings
Exercise 1 is on the homework site. Due Tues Sept 5 at noon.
Username = UA username
PW = first 3 letters of username
CHANGE your PASSWORD!!

If there is a problem in making this system work, please note what happened and email me (moran) or (if all else fails) do the exercise on paper. The exercise will also be placed on the course homepage.
Readings: See syllabus!! GM 147-153 plus Li and Graur handout
