Sequencing a genome by Ma216Dm


									Sequencing a genome
                Definition
• Determining the identity and order of
  nucleotides in the genetic material – usually
  DNA, sometimes RNA, of an organism
             Basic problem
• Genomes are large (typically millions or
  billions of base pairs)
• Current technology can only reliably ‘read’
  a short stretch – typically hundreds of base
  pairs
       Elements of a solution
• Automation – over the past decade, the
  amount of hand-labor in the ‘reads’ has
  been steadily and dramatically reduced
• Assembly of the reads into sequences is an
  algorithmic and computational problem
           A human drama
• There are competing methods of assembly
• The competing – public and private –
  sequencing teams used competing assembly
  methods
           Assembly:
• Putting sequenced fragments of DNA
  into their correct chromosomal
  positions
                 BAC
• Bacterial artificial chromosome:
  bacterial DNA spliced with a medium-
  sized fragment of a genome (100 to
  300 kb) to be amplified in bacteria and
  sequenced.
              Contig
• Contiguous sequence of DNA created
  by assembling overlapping sequenced
  fragments of a chromosome (whether
  natural or artificial, as in BACs)
              Cosmid
• DNA from a bacterial virus spliced
  with a small fragment of a genome (45
  kb or less) to be amplified and
  sequenced
   Directed sequencing
• Successively sequencing DNA from
  adjacent stretches of chromosome
       Draft sequence
• Sequence with lower accuracy than a
  finished sequence; some segments are
  missing or in the wrong order or
  orientation
                  EST
• Expressed sequence tag: a unique
  stretch of DNA within a coding region
  of a gene; useful for identifying full-
  length genes and as a landmark for
  mapping
               Exon
• Region of a gene’s DNA that encodes
  a portion of its protein; exons are
  interspersed with noncoding introns
             Genome
• The entire chromosomal genetic
  material of an organism
               Intron
• Region of a gene’s DNA that is not
  translated into a protein
        Kilobase (kb)
• Unit of DNA equal to 1000 bases
               Locus
• Chromosomal location of a gene or
  other piece of DNA
       Megabase (mb)
• Unit of DNA equal to 1 million bases
                 PCR
• Polymerase chain reaction: a technique
  for amplifying a piece of DNA quickly
  and cheaply
          Physical map
• A map of the locations of identifiable
  markers spaced along the
  chromosomes; a physical map may
  also be a set of overlapping clones
               Plasmid
• Loop of bacterial DNA that replicates
  independently of the chromosomes;
  artificial plasmids can be inserted into
  bacteria to amplify DNA for
  sequencing
     Regulatory region
• A segment of DNA that controls
  whether a gene will be expressed and
  to what degree
       Repetitive DNA
• Sequences of varying lengths that occur
  in multiple copies in the genome; it
  represents much of the genome
    Restriction enzyme
• An enzyme that cuts DNA at specific
  sequences of base pairs
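A minimal sketch of what "cuts at specific sequences" means: the Python below splits one strand of DNA at every occurrence of a recognition site. The site GAATTC, cut after its first base, is the EcoRI pattern; the function name `digest` and the example sequence are illustrative assumptions, and only a single strand is modeled.

```python
def digest(dna: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut `dna` at every occurrence of `site`, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = dna.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(dna[start:cut])   # piece up to the cut point
        start = cut
        pos = dna.find(site, pos + 1)      # look for the next site
    fragments.append(dna[start:])          # whatever remains after the last cut
    return fragments


pieces = digest("AAGAATTCCCGAATTCTT")
print(pieces)                    # ['AAG', 'AATTCCCG', 'AATTCTT']
print([len(p) for p in pieces])  # [3, 8, 7]
```

If a mutation destroys one of the sites, two neighboring fragments fuse into one longer fragment. A length difference of that kind is what an RFLP marker detects.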
                RFLP
• Restriction fragment length
  polymorphism: genetic variation in the
  length of DNA fragments produced by
  restriction enzymes; useful as markers
  on maps
               Scaffold
• A series of contigs that are in the right
  order but are not necessarily connected
  in one continuous stretch of sequence
   Shotgun sequencing
• Breaking DNA into many small pieces,
  sequencing the pieces, and assembling
  the fragments
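The "breaking into pieces" step can be imitated in a few lines of Python. This is only a simulation sketch: it assumes error-free reads of one fixed length drawn at uniformly random positions, and the names `shotgun` and `covered_fraction` are made up for illustration.

```python
import random


def shotgun(target: str, read_len: int, n_reads: int, seed: int = 0):
    """Return (start, sequence) pairs for reads sampled at random positions."""
    rng = random.Random(seed)            # fixed seed so runs are repeatable
    last_start = len(target) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)
        reads.append((start, target[start:start + read_len]))
    return reads


def covered_fraction(target_len: int, reads) -> float:
    """Fraction of target positions touched by at least one read."""
    covered = [False] * target_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            covered[i] = True
    return sum(covered) / target_len


target = "".join(random.Random(1).choice("ACGT") for _ in range(2000))
reads = shotgun(target, read_len=100, n_reads=60)
print(covered_fraction(len(target), reads))
```

A real assembler does not know the start positions. They are kept here only so that the coverage achieved by the random sampling can be measured.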
                  STS
• Sequence tagged site: a unique stretch
  of DNA whose location is known;
  serves as a landmark for mapping and
  assembly
                YAC
• Yeast artificial chromosome: yeast
  DNA spliced with a large fragment of
  a genome (up to 1 mb) to be amplified
  in yeast cells and sequenced
                     Readings
• Myers, “Whole Genome DNA Sequencing,”
  http://www.cs.arizona.edu/people/gene/PAPERS/whole.IEEE.pdf
• Venter et al., “The Sequence of the Human Genome,”
  Science, 16 Feb 2001, Vol. 291, No. 5507, 1304 (parts 1 & 2)
• Waterston, Lander, Sulston, “On the sequencing of the
  human genome,” PNAS, 19 March 2002, Vol. 99, No. 6,
  3712-3716
• Myers et al., “On the sequencing and assembly of the
  human genome,”
  www.pnas.org/cgi/doi/10.1073/pnas.092136699
      Hierarchical sequencing
• Create a high-level physical map, using
  ESTs and STSs
• Shred genome into overlapping clones
• Multiply clones in BACs
• ‘Shotgun’ each clone
• Read each ‘shotgunned’ fragment
• Assemble the fragments
Physical map
Whole genome sequencing (WGS)
• Make multiple copies of the target
• Randomly ‘shotgun’ each target, discarding
  very big and very small pieces
• Read each fragment
• Reassemble the ‘reads’
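The "reassemble" step can be illustrated with a toy greedy overlap-merge in Python. This is a sketch of one simple strategy, not the method used by either sequencing team: it assumes error-free reads, all in the same orientation, from a target with no repeats. The names `overlap` and `greedy_assemble` and the example reads are assumptions.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of contigs with the largest overlap."""
    contigs = list(dict.fromkeys(reads))        # drop exact duplicate reads
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:                         # nothing left to merge
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs)
                   if n not in (best_i, best_j)] + [merged]


reads = ["GGTCATTC", "ACGTTGCA", "ATTCGGAA", "TGCAGGTC"]
print(greedy_assemble(reads))   # ['ACGTTGCAGGTCATTCGGAA']
```

If the reads do not cover the whole target, the function returns several contigs instead of one. Repeats, errors, and unknown orientation all break the simple "largest overlap wins" rule.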
Hierarchical v. whole-genome
 The fragment assembly problem
• Aim: infer the target from the reads
• Difficulties –
  – Incomplete coverage. Leaves contigs separated
    by gaps of unknown size.
  – Sequencing errors. Rate increases with length
    of read, but stays below some bound.
  – Unknown orientation. Don’t know whether to
    use read or its Watson-Crick complement.
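The orientation difficulty is easy to see in code. The sketch below computes the Watson-Crick (reverse) complement of a read and tries both orientations against a known sequence. The function names and the example sequence are illustrative assumptions. A real assembler has no reference to search, so it must instead test overlaps between reads in both orientations.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse, to get the opposite strand 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def locate(read: str, target: str):
    """Return (position, strand) of the read in target, or None if absent."""
    pos = target.find(read)
    if pos != -1:
        return pos, "+"
    pos = target.find(reverse_complement(read))
    if pos != -1:
        return pos, "-"
    return None


target = "ACGTTGCAGGTCATTCGGAA"
read = reverse_complement(target[4:12])   # a read taken from the other strand
print(read)                               # GACCTGCA
print(locate(read, target))               # (4, '-')
```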
     Scaling and computational
            complexity
• Increasing size of target G.
  – 1990 – 40kb (one cosmid)
  – 1995 – 1.8 mb (H. influenzae)
  – 2001 – 3,200 mb (H. sapiens)
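To get a feel for the scaling, the arithmetic below counts how many reads are needed to cover each target a given number of times over. The read length of 500 bases and the 10-fold coverage are illustrative assumptions (the slides only say reads are "hundreds of base pairs"), and `reads_needed` is a made-up helper name.

```python
def reads_needed(target_len: int, read_len: int, coverage: int) -> int:
    """Smallest number of reads whose total length is coverage * target_len."""
    total_bases = coverage * target_len
    return -(-total_bases // read_len)        # ceiling division on integers


targets = {
    "1990 cosmid": 40_000,
    "1995 H. influenzae": 1_800_000,
    "2001 H. sapiens": 3_200_000_000,
}
for name, size in targets.items():
    print(name, reads_needed(size, read_len=500, coverage=10))
# 1990 cosmid 800
# 1995 H. influenzae 36000
# 2001 H. sapiens 64000000
```

Under these assumptions the read count grows in direct proportion to the target size, so the human genome needs 80,000 times as many reads as a single cosmid.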
         The repeat problem
• Repeats
  – Bigger G means more repeats
  – Complex organisms have more repetitive
    elements
  – Small repeats may appear multiple times in a
    read
  – Long repeats may be bigger than reads (no
    unique region)
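The last point can be checked directly: when a repeat is at least as long as a read, some reads are identical no matter which copy they came from, so nothing in the read itself says where it belongs. The sketch below lists such reads for a toy target. The sequences and the name `ambiguous_reads` are assumptions for illustration.

```python
from collections import defaultdict


def ambiguous_reads(target: str, read_len: int) -> dict[str, list[int]]:
    """Map each read-length substring that occurs more than once to its positions."""
    positions = defaultdict(list)
    for i in range(len(target) - read_len + 1):
        positions[target[i:i + read_len]].append(i)
    return {seq: pos for seq, pos in positions.items() if len(pos) > 1}


repeat = "GATTACAGATTC"                                  # 12-base repeat, two copies
target = "AAAC" + repeat + "CCGT" + repeat + "TTGG"

print(ambiguous_reads(target, 8)["GATTACAG"])            # [4, 20]
print(ambiguous_reads(target, 13))                       # {}
```

With 8-base reads, several reads fall entirely inside the repeat and are ambiguous. With 13-base reads, every read reaches into unique flanking sequence and can be placed.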
                  Gaps
• Read length $L_R$ hasn’t changed much
• $\rho = L_R / G$ gets steadily smaller
• Gaps ~ $R e^{-\rho R}$, where $R$ is the number of reads (Waterman & Lander)
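The gap estimate can be evaluated numerically. The sketch below assumes the usual reading of the Lander-Waterman model: R reads of length L_R land at random on a target of length G, coverage is c = R * L_R / G, the expected number of gaps is about R * e^(-c), and the expected covered fraction is 1 - e^(-c). The function names and the example numbers are assumptions, and the model ignores the minimum overlap needed to detect that two reads join.

```python
import math


def coverage(read_len: float, target_len: float, n_reads: float) -> float:
    """Average number of reads covering each base: c = R * L_R / G."""
    return n_reads * read_len / target_len


def expected_gaps(read_len: float, target_len: float, n_reads: float) -> float:
    """Expected number of gaps, approximately R * exp(-c)."""
    return n_reads * math.exp(-coverage(read_len, target_len, n_reads))


def expected_covered_fraction(read_len: float, target_len: float, n_reads: float) -> float:
    """Expected fraction of the target covered by at least one read."""
    return 1.0 - math.exp(-coverage(read_len, target_len, n_reads))


G, L = 1_800_000, 500                      # illustrative target and read length
for c in (1, 2, 4, 8, 10):
    R = c * G // L                         # reads needed for c-fold coverage
    print(c, R, round(expected_gaps(L, G, R), 1))
```

The number of gaps first rises with the number of reads and then falls off exponentially, which is why fairly deep coverage is needed before the gaps become few.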
How deep must coverage be?
        Double-barreled shotgun
              sequencing
•   Choose longer fragments (say, 2 x LR)
•   Read both ends
•   Such fragments probably span gaps
•   This gives an approximate size of the gap
•   This links contigs into scaffolds
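The gap-size step is simple arithmetic once the fragment length is known. The sketch below assumes a fragment of known length whose first read lands in contig A and whose second read lands in contig B, with both contigs in the same orientation. All names and numbers are illustrative assumptions.

```python
from statistics import mean


def gap_from_mate_pair(insert_len: int, contig_a_len: int,
                       start_in_a: int, end_in_b: int) -> int:
    """Gap implied by one fragment that starts in contig A and ends in contig B.

    start_in_a: offset in A where the fragment begins.
    end_in_b:   offset in B just past the fragment's last base.
    """
    bases_in_a = contig_a_len - start_in_a   # part of the fragment inside A
    bases_in_b = end_in_b                    # part of the fragment inside B
    return insert_len - bases_in_a - bases_in_b


def estimate_gap(insert_len: int, contig_a_len: int, mate_pairs) -> float:
    """Average the gap estimates from several (start_in_a, end_in_b) pairs."""
    return mean(gap_from_mate_pair(insert_len, contig_a_len, s, e)
                for s, e in mate_pairs)


print(gap_from_mate_pair(1000, 1000, 700, 250))              # 450
print(estimate_gap(1000, 1000, [(700, 250), (800, 340)]))    # 455
```

Fragment lengths are only approximately known in practice, so each pair gives a rough estimate. Averaging several pairs that span the same gap tightens it, and a consistent set of spanning pairs is what justifies ordering the two contigs into a scaffold.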
Genomic results
HGSC v Celera results
          To do or not to do?
• “The idea is gathering momentum. I shiver
  at the thought.” – David Baltimore, 1986
• “If there is anything worth doing twice, it’s
  the human genome.” – David Haussler,
  2000
          Public or private?
• “This information is so important that it
  cannot be proprietary.” – C Thomas Caskey,
  1987
• “If a company behaves in what scientists
  believe is a socially responsible manner,
  they can’t make a profit.” – Robert Cook-
  Deegan, 1987
             HW for Feb 17
• Comment on these assertions (500-1000
  words):
  – WLS – “Our analysis indicates that the Celera
    paper provides neither a meaningful test of the
    WGS approach nor an independent sequence of
    the human genome.”
  – Venter – “This conclusion is based on incorrect
    assumptions and flawed reasoning.”

								