Genome assembly and features

Document Sample
Genome assembly and features Powered By Docstoc
					Genome assembly and an initial
     look at features
Celera believed they could assemble the whole
human genome from shotgun sequence fragments in
this way. But this approach failed. They had to use
the public domain map data to resolve problems in
their assembly.
      Assembly of large DNA
• Several assembly programs exist and can be run with
  different degrees of success: Phrap, TIGR
  Assembler, CAP, STROLL, etc.
• Most fragment assembly algorithms include the
  following three steps:
  – Overlap. Finding potentially overlapping fragments.
  – Layout. Finding the order of fragments.
  – Consensus. Deriving the DNA sequence from the

  – New method:
• The overlap problem is to find the best match
  between the suffix of one sequence and the
  prefix of another.
• If no sequencing errors, simply find the longest
  suffix of one string that exactly matches the
  prefix of another string.
• Since errors are small, the common practice is
  to use filtration method and to filter out pairs of
  fragments that do not share a significantly long
  common substring.
          TIGR assembler
• Finds exact 32 base matches between
  sequences; alignment between two
  sequences is scored based on the number
  and uniqueness of the 32-mer match (how
  often does 32-mer appear?)
• Interestingly, 32 was not chosen in a
  particularly rigorous manner, 16 gave too
  many alignments, >32 too few
     32-mer table example

• 8-MER             Occurences
• AGCTTAGA              1
• GCTTAGAT              1
• CTTAGATC              1
• TTAGATCT              2
Internal repeat sequences are ignored,
  because they confuse the assembler
         32-mer table…cont.

• 8-mer   Occurences            Belongs to:
• …
• ACATTGCA 10              seqA, seqB,…
• …
Sequences seqA and seqB are said to overlap
  when they share 32-mers. Quality of overlap
  depends on number of 32-mers and their
• Many algorithms select a pair of fragments with
  the best overlap at every step.
• The score of overlap is either the similarity score
  or a more involved probablilistic score.
• The selected pair of fragments with the best
  overlap score is checked for consistency.
• If this check is accepted, the two fragments are
          Sorting fragments
• Assembler sorts all potential merges
  according to their 32-mer scores
• Merges are performed in order of their
  scores (subject to quality restrictions =
  Phred scores)
• After half of the merges are performed, all
  scores are re-evaluated and list is re-
  sorted..continued until no more merges
        Merging two sequences
• Percent identity = 18/19% = 94.7%
• Overlap = region of similarity between regions
• Overhang = unaligned sequences at ends (underlined)

• The assembler screens merges based on:
   – Length of overlap
   – % identity in overlap region (TIGR default = 97.5%)
   – Maximum overhang size (can be trimmed)
• At later stages of the algorithm the collections
  of fragments (contig) – rather than individual
  fragments – are merged.
• The difficulty with the layout step is deciding
  whether two fragments with a good overlap
  really overlap (i.e. their differences are
  caused by sequencing errors) or represent a
  repeat in a genome (i.e. their differences are
  caused by mutations).
• Use additional “scaffolding” measures –
• The simplest way to build the consensus is
  to report the most frequent character in the
  substring layout that is (implicitly)
  constructed after the layout step is
         The Human Touch
• Consed – A Graphical Tool for Editing
  Phrap Assemblies.
   Assembly can be greatly
enhanced through use of maps
• Genetic maps based on recombination
  frequencies at meiosis. Linked markers
  are co-inherited (closer the higher
  frequency of co-inheritance) – only maps
• Physical maps describe location of DNA
  sequences, use several physical mapping
• Expression maps - mRNA
  Sequence tagged sites (STS)
     are used for each map
• An STS is a stretch of DNA ~300 bp in length
  generated using PCR, which tags the larger
  DNA molecule from which it is derived
• The nucleotide sequence of the STS is used to
  specify the sequence of two synthetic
  oligonucleotides that will bind in opposite
  orientations at either end of the STS
• Can be used to detect length polymorphisms or
• Allow different sources of DNA fragments
  to be examined for common sequences
• Sequences for STS are widely available
• Small number of false positives
• Automation
               Genetic Maps
• Linkage between markers measured in cM
• Haplotypes
   – Closely linked alleles that tend to be co-inherited
     (can be >2)
• CEPH families
   – Permanent cell lines derived from Mormons and
     French-Venezuelian families (Centre dEtude
     Polymorphism Human). Each family consists of
     three generations with four grandparents, 2
     parents and minimum of 6 children – great
      Physical mapping markers
• Minisatellites
    – VNTR’s
•   Microsatellites
•   Radiation hybrid mapping
•   FISH
•   EST maps
•   Clone maps
 Restriction fragment length
– Based on presence or absence of a target for
  a restriction enzyme usually due to a
  polymorphism at one base (only two alleles at
  any one locus; either there or not)
– Used extensively in pre-natal screening
– Can be performed on high MW fragments
  using Pulsed Field Gel Electrophoresis and
  • Can also be used for long range restriction
    mapping (ie. 8 bp or 16 bp cutters)
• Variable number tandem repeats
• Determine the different lengths by PCR or
• Multiple AluI repeats at a particular
• However, use is limited by their distribution
  in the genome, as they tend to be
  clustered near telomeres
• Southerns can be laborious and PCR can
  be difficult with large minisatellites
• More common and more evenly distributed
  than minisatellites
• These are variable number of dinucleotide
• Microsatellite based on CA repeats is the
  standard in construction of genetic maps
• Both mini and microsatellites are used in
  forensics as DNA fingerprints
    Radiation Hybrid mapping
Cells (human) are irradiated to fragment
Irradiated cells fused with a cell line (rat) to form a
   panel of hybrids (retains ~20% of donor fragments
   of ~ 10Mb)
Radiation hybrids have an assortment of human
   chromosome fragments; further apart two markers
   are, less likely to be on same fragment (map units
   are centiRays, analogous to cM but depend on
   radiation dose)
            Clone maps
• Generate YAC, PAC, or BAC library
• Order by detecting sequences in common
  (overlapping clones): STS content,
  hybridizations (using EST cDNA’s), and
     The human genetic map
• Took 15 million separate PCR reactions
  performed by a robotic line
• Results description of ensuing paper
  required 900 printed pages
• Check out:
Restriction mapping on a
     genomic scale

The age of complete genomic sequences
    DNA has distinctive, non-
    random base composition
• In all DNA, regardless of species, the
  number of A’s equals # of T’s, and # G’s =
  # C’s, such that A + G = T + C
• DNA specimens from different tissues of
  same organism have same base
• Base composition of DNA can vary wildly
  among organisms (25% GC vs. 80% GC)
• Non-randomness generates signals
     Signals within nucleotide
    sequences (and structure?)
•   Promoter
•   Transcriptional terminators
•   Ribosome binding site
•   Genetic code (genes)
•   Splice sites
•   Restriction sites
 Genes = Functional portions of
• Untranscribed genes
  – Replicator genes: sites for initiation and
    termination of DNA replication
  – Segregator genes sites of attachment to spindle
    machinery during meiosis and mitosis
  – Etc.
• Pseudogenes
  – Looks similar to functional genes but contains
    mutations such as frameshift and nonsense
• Protein-encoding genes
  – Transcribed and translated (and maybe modified)
• RNA-specific genes
  – Transcribed (and maybe modified)
                     The Minimal Genome

               E. coli


                         1,129         1
H.influenzae                                           M.genitalium
   A consequence of mechanism:
            GC skew

• Looking at microbial genomes, a plot G-C/(G+C)
  is an excellent predictor of origin of replication,
  where polarity switches
Cumulative GC skew can help
      Why is there GC skew?
• Consequence of asymmetry in replication or
• Leading vs. lagging strand synthesis
• Transcription-repair coupling
  – Removes most frequent types of DNA damage
    (deaminated cytosines and pyrimidine dimers, both of
    which lead to base substitution
  – This repair would occur on template strand (not
    coding), which then should become pyrimidine rich
  – Also, the template strand is significantly protected
    against DNA damage during transcription, while the
    coding strand is exposed. This should help increase
    the purine load of the coding strand.
  Another example of base skew:
           CpG islands
• CpG refers to the dinucleotide, while C-G
  to the base pair
• In the human genome, the C of CpG is
  typically methylated, and there is a high
  chance of this methyl C mutating to a T
• As a consequence, CpG are rarer in this
  genome than one would expect
             CpG islands
• However, for biologically important
  reasons methylation is suppressed in short
  stretches, such as around promoters
• As a result, CpG islands have been used
  to define start sites for putative genes
Eukaryotes -vs- Prokaryotes
  Many eukaryotic genes contain
 introns, few prokaryotic genes do
• Fig1.1 patthy
    Intron splicing and phases
• Fig 1.2 and 1.3
Group I and II self-splicing introns
Genetic code(s)
Third codon position often
Types of substitution
Frameshift mutations affect
 multiple codon positions
        Additional “codes” are intrinsic to
          a protein’s primary sequence
Acquire secondary structure either coming off ribosome (a),
or by interactions with chaperones (b), however,some
proteins fold into tertiary structures autonomously

Post-translational processing and modifications

            Insulin and proteases

            Signal Sequences

            Glycosylation, etc.
    Protein targeting utilizes signal peptides

Various protein activities are regulated by processing
     to yield a “mature” enzyme
Formation of disulfide bonds; addition of prosthetic groups
      How do we interpret this
• Attempt to define “rules”
  – Genetic code, but as you can see there are
    several variations within these rules depending on
    cell or organelle type
• Rules lead to development of algorithms or
  software for characterizing new data (a gene
  in a fruit fly should look like a gene in
• Evolution and cell physiology offer a
  framework to interpret this information, but
  false positives and false negatives remain

Shared By: