					Finding genes in genomes

      Hedi Hegyi, PhD
 @ Institute of Enzymology, Budapest

 BioSapiens Permanent School of
         Bioinformatics
    Budapest, Sept 4-8, 2006

           Day 1, Part II
1953: Watson and Crick discover
the double-helix nature of DNA…
  2000: Academia vs. Celera: who will
       own the human genome?
Jim Kent, graduate student @ UCSC          Craig Venter, Head of Celera

                   GigAssembler

Jim Kent                                                     David Haussler




       (meanwhile Celera working on whole genome shotgun version)
     …2001: 3,000,000,000 base pairs
     of the human genome decoded


February,                        February,
2001                             2001
      How many genes in human genome?
• 2000: must be at least 100,000 (rice has 40,000,
  C. elegans has 19,000)

• 2001: only 35,000?

• 2005, Dec: Ensembl NCBI 35 release: 22,218 genes
  (33,869 transcripts)

• 2006, Apr: Ensembl NCBI 36 release: 23,710 genes
  (48,851 transcripts)

Problem not trivial!
Human genome: 3 billion base pairs of
              DNA
        DNA: 4-letter alphabet, A (adenine), T
        (thymine), C (cytosine) and G (guanine).

        RNA: same as DNA but T -> U (uracil)

        3 letters (triplet – a codon) code for one
        amino acid in a protein.

        Proteins: units are the amino acid residues
        A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S,
        T, V, W and Y.
The Genetic Code
      Reverse codon table
Amino Acid         SLC   DNA codons
Isoleucine         I     ATT, ATC, ATA
Leucine            L     CTT, CTC, CTA, CTG, TTA, TTG
Valine             V     GTT, GTC, GTA, GTG
Phenylalanine      F     TTT, TTC
Methionine         M     ATG
Cysteine           C     TGT, TGC
Alanine            A     GCT, GCC, GCA, GCG
Glycine            G     GGT, GGC, GGA, GGG
Proline            P     CCT, CCC, CCA, CCG
Threonine          T     ACT, ACC, ACA, ACG
Serine             S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine           Y     TAT, TAC
Tryptophan         W     TGG
Glutamine          Q     CAA, CAG
Asparagine         N     AAT, AAC
Histidine          H     CAT, CAC
Glutamic acid      E     GAA, GAG
Aspartic acid      D     GAT, GAC
Lysine             K     AAA, AAG
Arginine           R     CGT, CGC, CGA, CGG, AGA, AGG
Stop codons Stop         TAA, TAG, TGA
After completing the human genome we face 3 Gigabytes of this
Not immediately apparent where the genes are…
                      atg

                      caggtg
  ggtgag

           cagatg
ggtgag
              cagttg
                     ggtgag
                       caggcc
                ggtgag


                    tga
Early gene predicting approaches

• Focused on individual features
  –   Coding regions (ORFs)
  –   Splice sites
  –   Promoters
  –   Codon bias
  –   CpG islands
  –   GC content
      Six Frames in a DNA Sequence

CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG

GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG



 • start codons – ATG
 • stop codons – TAA, TAG, TGA
  Stop codons: 3 out of 64 codons ~ 1 in 20
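
Since 3 of 64 codons are stops, a random sequence hits a stop about every 64/3 ≈ 21 codons, so a long stretch without one is a hint of coding sequence. A minimal sketch of a six-frame ORF scan follows. The function names and the `min_codons` cut-off are illustrative assumptions, not part of the slides, and only the first ATG after each stop is reported.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=10):
    """Yield (strand, frame, start, end, orf) for each ATG...stop stretch.

    start/end are 0-based offsets on the strand being read; end is
    exclusive and includes the stop codon. min_codons is an arbitrary
    illustrative threshold.
    """
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, s[start:i + 3]
                    start = None                   # look for the next ATG


for orf in find_orfs("CCATGAAATTTGGGTAACC", min_codons=3):
    print(orf)
```

On real genomic sequence the same scan would report many short spurious ORFs, which is why the length threshold matters.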
       Exercise 1: Open Reading Frames
• 1. Get sequence 2 (AF008216) of the HMR195 dataset from
  http://www.cs.ubc.ca/~rogic/evaluation/dataset.html
• 2. Either paste the sequence from there or get it from
  NCBI’s Entrez:
  http://www.ncbi.nlm.nih.gov/entrez/

• 3. Submit the sequence to a DNA translator:
• E.g. http://www.ebi.ac.uk/emboss/transeq/
• Or http://www.expasy.org/tools/#translate
Scheme of a
eukaryotic gene
                  What Makes an Intron?




                         Branch point
                         Adenosine
 5’ splice site          (usually closer
                         to 3’ss)
                                           3’ splice site




R: Purine A or G
Y: Pyrimidine U or C
           Splicing Mechanism




   Note: small exons in an
       ‘ocean’ of introns
typical exon – hundreds of bp
typical intron – thousands of bp
Splice sites are conserved (can be an
           important signal)
Eukaryotic splice sites




                Poly-pyrimidine
                     tract
              Splice site detection
              Donor site
5’                                                                         3’




                   Position
 %    -8    …    -2    -1     0     1     2    …    17
 A    26    …    60     9     0     1    54    …    21
 C    26    …    15     5     0     1     2    …    27
 G    25    …    12    78    99     0    41    …    27
 T    23    …    13     8     1    98     3    …    25


                           From lectures by Serafim Batzoglou (Stanford)
  Position-specific scoring matrix (PSSM)



     Pos    -3     -2     -1     +1     +2     +3     +4     +5     +6
      A     0.3    0.6    0.1    0.0    0.0    0.4    0.7    0.1    0.1
      C     0.4    0.1    0.0    0.0    0.0    0.1    0.1    0.1    0.2
      G     0.2    0.2    0.8    1.0    0.0    0.4    0.1    0.8    0.2
      T     0.1    0.1    0.1    0.0    1.0    0.1    0.1    0.0    0.5

Candidate site: $S = S_1 S_2 S_3 S_4 S_5 S_6 S_7 S_8 S_9$

Odds ratio:
$$R = \frac{P(S \mid +)}{P(S \mid -)} = \frac{P_{-3}(S_1)\,P_{-2}(S_2)\,P_{-1}(S_3) \cdots P_{+5}(S_8)\,P_{+6}(S_9)}{P_{bg}(S_1)\,P_{bg}(S_2)\,P_{bg}(S_3) \cdots P_{bg}(S_8)\,P_{bg}(S_9)}$$

Score: $s = \log_2 R$
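
The log-odds score can be sketched in a few lines using the matrix from this slide. The uniform background (0.25 per base) and the name `donor_score` are assumptions made for illustration. A zero entry in the matrix gives a score of minus infinity; a real tool would likely add pseudocounts instead.

```python
import math

# Columns correspond to positions -3, -2, -1, +1, +2, +3, +4, +5, +6.
PSSM = {
    "A": [0.3, 0.6, 0.1, 0.0, 0.0, 0.4, 0.7, 0.1, 0.1],
    "C": [0.4, 0.1, 0.0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2],
    "G": [0.2, 0.2, 0.8, 1.0, 0.0, 0.4, 0.1, 0.8, 0.2],
    "T": [0.1, 0.1, 0.1, 0.0, 1.0, 0.1, 0.1, 0.0, 0.5],
}
# Assumption: uniform background; the slide does not give P_bg.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def donor_score(site):
    """Return s = log2(P(S|+) / P(S|-)) for a 9-base candidate donor site."""
    if len(site) != 9:
        raise ValueError("site must cover positions -3..+6 (9 bases)")
    score = 0.0
    for i, base in enumerate(site.upper()):
        p = PSSM[base][i]
        if p == 0.0:
            return float("-inf")   # base never observed at this position
        score += math.log2(p / BACKGROUND[base])
    return score


print(donor_score("AAGGTAAGT"))   # strong candidate: positive score
print(donor_score("AAGCTAAGT"))   # no G at +1: -inf
```

Sliding this function along a sequence and keeping windows above a threshold gives a simple splice-site detector.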
Scheme of a
eukaryotic gene
      Pol II Promoter Elements

                                 Exon Intron Exon



GC box    CCAAT box   TATA box        Gene
~200 bp    ~100 bp     ~30 bp


                                  Transcription
                                  start site (TSS)
Scheme of a
eukaryotic gene
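               Poly-adenylation
• Pre-mRNA: 5’-cap, coding sequence, AAUAAA signal, cleavage site, GU-rich region
  – cleavage site lies 10 – 30 nts downstream of AAUAAA
  – GU-rich region lies 20 – 40 nts downstream of the cleavage site
• Endonuclease cuts the transcript at the cleavage site
• Polyadenylate polymerase adds the poly-A tail, AAA(A)n: 80 – 250 A’s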
    What other types of
 differences between coding
regions (exons) and noncoding
          sequences?
              CpG Islands
• CpG islands are regions of the genome
  with a higher frequency of CG
  dinucleotides (not base-pairs!) than
  the rest of the genome
• CpG islands often occur near the
  beginning of genes, which may be
  related to the binding of the
  transcription factor Sp1
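
A rough way to quantify "higher frequency of CG dinucleotides" is the observed/expected CpG ratio together with the GC fraction of a window. The sketch below is an illustration only. The formula and the commonly cited cut-offs (window of at least 200 bp, GC above 50%, ratio above 0.6, after Gardiner-Garden and Frommer) are not taken from these slides, and `cpg_stats` is a hypothetical name.

```python
def cpg_stats(seq):
    """Return (gc_fraction, observed/expected CpG ratio) for a DNA window.

    Expected CpG count is (#C * #G) / length, so the ratio is
    #CG * length / (#C * #G).
    """
    s = seq.upper()
    n = len(s)
    c, g = s.count("C"), s.count("G")
    cpg = s.count("CG")                       # CG dinucleotides on this strand
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = cpg * n / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


print(cpg_stats("CGCGCGCG"))   # extreme toy case
print(cpg_stats("ATATATAT"))   # no C or G at all
```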
                Codon Bias
• Unequal usage of codons in the coding
  regions is a universal feature of the
  genomes
  – uneven usage of synonymous codons
    (correlates with the abundance of
    corresponding tRNAs)
  – uneven usage of amino acids in existing
    proteins
               The Human Codon Usage Table
The human codon usage and codon preference table.

For each codon:
  3rd column: frequency of usage per thousand
  4th column: relative frequency of each codon among synonymous codons.
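
The two columns described above can be computed from any in-frame coding sequence. This is a minimal sketch, assuming the standard genetic code and a sequence that starts in frame; `codon_usage` is an illustrative name and the input below is a toy sequence, not real human data.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codon_usage(cds):
    """Return {codon: (per_thousand, fraction_among_synonymous_codons)}."""
    s = cds.upper()
    codons = (s[i:i + 3] for i in range(0, len(s) - len(s) % 3, 3))
    counts = Counter(c for c in codons if c in CODON_TABLE)  # skip ambiguous codons
    total = sum(counts.values())
    per_amino_acid = Counter()
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {
        codon: (1000.0 * n / total, n / per_amino_acid[CODON_TABLE[codon]])
        for codon, n in counts.items()
    }


# Toy CDS: three CTG and one TTA (both leucine) plus one ATG.
for codon, (per_thousand, relative) in sorted(codon_usage("CTGCTGCTGTTAATG").items()):
    print(codon, per_thousand, relative)
```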
            GC content

Higher GC content in coding regions
Gene density as a function of
        GC content
Basic characteristics of human genes

                          Median     Mean       Sample (size)
Internal exons            122 bp     145 bp     Confirmed by mRNA or EST (43317)
Exon number               7          8.8        RefSeq alignments to finished sequence (3501)
Introns                   1023 bp    3365 bp    RefSeq alignments to finished sequence (27238)
3’ UTR                    400 bp     770 bp     Confirmed by mRNA or EST on Chr22 (689)
5’ UTR                    240 bp     300 bp     Confirmed by mRNA or EST on Chr22 (463)
Coding sequence (CDS)     1100 bp    1340 bp    Selected RefSeq entries (1804)
                          367 aa     447 aa
Genomic extent            14 kb      27 kb      Selected RefSeq entries (1804)




[Figure: length distributions. Exons panel: percentage of exons vs. exon length (bp), 0 – 1000 bp. Introns panel: percentage of introns vs. intron length (bp), 0 – 500 bp.]
     Observed lengths

Internal exons   Single exons
Median intron and exon length as the function
                of GC content
Top Ten Intronic Pentamers

Arabidopsis   Drosophila   Human

TCTCT         ATATA        GTGGG
TTTTT         AAATA        CTGGG
TTTGT         TATAT        GAGGG
TCTTT         TGATT        CAGGG
TGTTT         ACTTA        TGGGG
TCTGT         ACATA        GCAGG
TTCTT         TTTGT        GGTGG
TGTGT         CATTT        GGAGG
CTTTT         TTAAA        GCGGG
TTTCT         TCATT        GCTGG
   Top Ten Exonic Pentamers

Arabidopsis   Drosophila   Human

TGAAG         GGCGG        GATGA
CAAAG         CGAGG        CAGAA
AGAAG         CGCTG        GAAGA
TGCTG         AGGAG        CAGCA
TCTGA         TGGCC        CACCA
TGCAG         AGCTG        CTGAA
TGGAG         TGCTG        GTGGA
GGAAG         AGCAG        CAGGA
CGAAG         AGAAG        GAGGA
GAAGG         TGCAG        CTGGA
     Finding Eukaryotic Genes
          Computationally
• Content-based Methods
  – GC content, hexamer repeats,
    composition statistics, codon frequencies
• Site-based Methods
  – donor sites, acceptor sites, promoter
    sites, start/stop codons, polyA signals,
    lengths
• Comparative Methods
  – sequence homology, EST searches
• Combined Methods
         Methods for gene prediction
• Ab initio (“From the beginning”)
   – use general knowledge of gene structure: rules and statistics
   – current best methods are all based on hidden Markov models,
     which use Dynamic Programming
      • Genscan (Burge)
      • FGENES (Solovyev)
      • HMMGene (Krogh)

• Similarity based
   – comparison to known proteins, cDNAs, ESTs
   – better, but only available if you have similar data to compare to
      • Genewise
      • Genomewise
      • Genomescan
              An early method: TestCode
                (implemented in GCG)


                                                        Coding

                                                              No opinion

                                                         Non-coding



TestCode finds ORFs based on compositional bias with a periodicity of three
Fickett (Nucl. Acids Res. 1982)
 Integrated methods: Hidden Markov
              Models
• Fully probabilistic, so can do proper
  statistics
  – Can estimate the parameters from labeled data
  – Can give confidence values
• Semi- or Generalized HMMs
  – A state explains a subsequence (e.g. a whole
    exon), rather than a single base
  – transition between states at features detected
    by other methods (e.g. splice site consensus)
     Models of Sequence Generation:
             Markov Chains

• A Markov chain is a model for stochastic
  generation of sequential phenomena

• The order of the Markov chain is the number of
  previous positions on which the current position
  depends – e.g. 5th order Markov Chain:



     A   C   G   A   T   C   G   T   C   C
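
A k-th order Markov chain can be estimated by counting which base follows each k-base context. The sketch below is illustrative: the function names and the add-one pseudocount are assumptions, and the training input is a toy string rather than real coding sequence. For the 5th-order chain on this slide one would call it with `order=5`.

```python
import math
from collections import defaultdict


def train_markov(seqs, order, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | previous `order` bases) from training sequences.

    Returns a function prob(context, base). The pseudocount (an assumption
    here) keeps unseen transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_prob(seq, prob, order):
    """log2 probability of seq under the chain (first `order` bases are given)."""
    return sum(math.log2(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))


prob = train_markov(["ACACACAC"], order=1)
print(prob("A", "C"), prob("A", "A"))
print(log_prob("ACA", prob, order=1))
```

Training one chain on coding sequence and another on non-coding sequence, then comparing the two log probabilities of a window, gives a simple content-based coding measure.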
       Hidden Markov Models
• Hidden Markov Models (HMMs) allow us to
  model complex sequences, in which the
  character emission probabilities depend
  upon the state

• Think of an HMM as a probabilistic or
  stochastic sequence generator, and what is
  hidden is the current state of the model
                  HMM Details
• An HMM is completely defined by its:
   – State-to-state transition matrix (Φ)
   – Emission matrix (H)
   – State vector (x)

• We want to determine the probability of any
  specific (query) sequence having been generated
  by the model

• Two algorithms are typically used for the
  likelihood calculation:
   – Viterbi
   – Forward
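
The Forward algorithm sums over all hidden paths to get the likelihood of a sequence. Below is a minimal sketch on a toy fair/loaded dice model. All parameter values and names are hypothetical, chosen only to make the arithmetic easy to check by hand.

```python
# Hypothetical toy model: F = fair die, L = loaded die (6 comes up half the time).
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
# TRANS[k][m] = probability of moving from state k to state m.
TRANS = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 0.1 for r in "12345"}, "6": 0.5},
}


def forward_likelihood(obs, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return P(obs | model), summed over every hidden state path."""
    # f[m] = probability of the observations so far, ending in state m.
    f = {m: start[m] * emit[m][obs[0]] for m in states}
    for y in obs[1:]:
        f = {m: emit[m][y] * sum(f[k] * trans[k][m] for k in states)
             for m in states}
    return sum(f.values())


print(forward_likelihood("66"))
print(forward_likelihood("12"))
```

For long sequences the products underflow, so practical implementations work with logarithms or rescale `f` at every step.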
An experiment with 2 dice: revealing hidden states with
                  Viterbi algorithm
  A Parse
S = ACTGACTACTACGACTACGATCTACTACGGGCGCGACCTATGCG
P = IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGGGG

    TATGTTTTGAACTGACTATGCGATCTACGACTCGACTAGCTAC
    GGGGGGGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
• For a given sequence, a parse is an assignment of gene structure to that sequence.
• In a parse, every base is labeled according to the content it belongs to (or is predicted to belong to).
• In our simple model, the parse contains only “I” (intergenic) and “G” (gene).
• A more complete model would contain, e.g., “-” for intergenic, “E” for exon and “I” for intron.
            The HMM Matrices: Φ and H

$$\Phi = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0.5 & 0.998 & 0.002 & 0 \\ 0.5 & 0.001 & 0.996 & 0 \\ 0 & 0.001 & 0.002 & 0 \end{pmatrix}$$

$$H = \begin{pmatrix} 0.28 & 0.32 \\ 0.22 & 0.18 \\ 0.25 & 0.18 \\ 0.25 & 0.32 \end{pmatrix}$$

$x_m(i)$ = probability of being in state m at position i;
$H(m, y_i)$ = probability of emitting character $y_i$ in state m;
$\Phi_{mk}$ = probability of transition from state k to m.
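Bacteriomaker HMM Machine
• Coding state: emits A 0.22, C 0.28, G 0.28, T 0.22
  – stays in Coding with probability 0.9, leaves through TAA with probability 0.1
• Intergene state: emits A 0.25, C 0.25, G 0.25, T 0.25
  – stays in Intergene with probability 0.9, leaves through ATG with probability 0.1
• ATG → Coding and TAA → Intergene, each with probability 1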
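
A parse of intergenic (I) and gene (G) labels can be recovered with the Viterbi algorithm. The sketch below uses the emission and 0.9/0.1 transition values of the Bacteriomaker machine but simplifies it to two states, leaving out the ATG and TAA states. The equal start probabilities are an assumption.

```python
import math

STATES = ("G", "I")                     # G = coding (gene), I = intergenic
EMIT = {
    "G": {"A": 0.22, "C": 0.28, "G": 0.28, "T": 0.22},
    "I": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}
# TRANS[k][m] = probability of moving from state k to state m.
TRANS = {"G": {"G": 0.9, "I": 0.1}, "I": {"I": 0.9, "G": 0.1}}
START = {"G": 0.5, "I": 0.5}            # assumption: not given on the slide


def viterbi(seq):
    """Return the most probable state path for seq as a string of G/I labels."""
    # v[m] = log probability of the best path so far that ends in state m.
    v = {m: math.log(START[m]) + math.log(EMIT[m][seq[0]]) for m in STATES}
    backpointers = []
    for y in seq[1:]:
        ptr, new_v = {}, {}
        for m in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][m]))
            ptr[m] = best
            new_v[m] = v[best] + math.log(TRANS[best][m]) + math.log(EMIT[m][y])
        backpointers.append(ptr)
        v = new_v
    # Trace back from the best final state.
    state = max(STATES, key=lambda m: v[m])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


seq = "AT" * 20 + "GC" * 20
print(seq)
print(viterbi(seq))
```

With emission differences this small, a GC-rich stretch has to be fairly long before switching state pays for the 0.1 transition.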
Genscan (Burge and Karlin, 1998)
• Dramatic
  improvement over
  previous methods
• Generalised HMM
• Different parameter
  sets for different GC
  content regions
  (intron length
  distribution and exon
  stats)
GENSCAN (Burge & Karlin)
[Figure: GENSCAN state diagram shown next to a genomic sequence listing (positions 6201 – 9261). States on the forward (+) strand: internal exons E0, E1, E2; introns I0, I1, I2; initial exon (E init); terminal exon (E term); single-exon gene (E single); 5'UTR; 3'UTR; promoter; poly-A. The intergenic region links the forward (+) strand states to the corresponding reverse (-) strand states.]
Exercise 2: Gene prediction with Genscan
 1. Submit sequence 2 of the HMR195 data set
to the Genscan server:
http://genes.mit.edu/GENSCAN.html

2. Compare the result to the annotated sequence:

>gi|2738513|gb|AAD12746.1| pp32r1 [Homo sapiens]
MEMGRRIHSELRNRAPSDVKELALDNSRSNEGKLEALTDEFEELEFLSKINGGLTSISDLPKLKLRKLEL
RVSGGLEVLAEKCPNLTHLYLSGNKIKDLSTIEPLKQLENLKSLDLFNCEVTNLNDYGENVFKLLLQLTY
LDSCYWDHKEAPYSDIEDHVEGLDDEEEGEHEEEYDEDAQVVEDEEGEEEEEEGEEEDVSGGDEEDEEGY
NDGEVDGEEDEEELGEEERGQKRK

>gi|GENSCAN_predicted_peptide_1|234_aa
MEMGRRIHSELRNRAPSDVKELALDNSRSNEGKLEALTDEFEELEFLSKINGGLTSISDLPKLKLRKLEL
RVSGGLEVLAEKCPNLTHLYLSGNKIKDLSTIEPLKQLENLKSLDLFNCEVTNLNDYGENVFKLLLQLTY
LDSCYWDHKEAPYSDIEDHVEGLDDEEEGEHEEEYDEDAQVVEDEEGEEEEEEGEEEDVSGGDEEDEEGY
NDGEVDGEEDEEELGEEERGQKRK
    Exercise 2 cont’d: Gene prediction with Genscan
3. Submit sequence 1 of HMR195 to Genscan server.

4. Analyze result:
>gi|GENSCAN_predicted_peptide_1|185_aa
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSS
AWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALGLEPGRDVDLGQLEQESCLDGWEFS
QDVYLSTIVTEVVSLASRPDGHFEEGTRKVLAVPGRCSMALYSFPYRPQPKNDNLVMPYT
GTLAP
>gi|GENSCAN_predicted_peptide_2|410_aa
MGKKPALCLPAQWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGM
QTGFSFLQIFSKNFEMFVVLFVLVGMGQISNYVAAFVLGTEILGKSVRIIFSTLGVCIFY
AFGYMVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQGRFEEAEVIIRK
AAKANGIVVPSTIFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYF
GLSLDTPNLHGDIFVNCFLSAMVEVPAYVLAWLLLQYLPRRYSMATALFLGGSVLLFMQL
VPPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPY
FVYLGAYDRFLPYILMGSLTILTAILTLFLPESFGTPLPDTIDQMLRVKX
>gi|4126718|dbj|BAA36712.1| OCTN2 [Homo sapiens]
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSS
AWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALGLEPGRDVDLGQLEQESCLDGWEFS
QDVYLSTIVTEWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGMQ
TGFSFLQIFSKNFEMFVVLFVLVGMGQISNYVAAFVLGTEILGKSVRIIFSTLGVCIFYA
FGYMVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQGRFEEAEVIIRKA
AKANGIVVPSTIFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYFG
LSLDTPNLHGDIFVNCFLSAMVEVPAYVLAWLLLQYLPRRYSMATALFLGGSVLLFMQLV
PPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYF
VYLGAYDRFLPYILMGSLTILTAILTLFLPESFGTPLPDTIDQMLRVKGMKHRKTPSHTR
MLKDGQERPTILKSTAF
  Exercise 3: Analyze Genscan output with
                  clustalw

Clustalw: multiple alignment program
Clustalx: graphical version
1. Download clustalx for Windows from
   http://www.protein.sdu.dk/bm131/html/downloads.html
   or download clustalw for Linux from
   ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/
   The Linux version of clustalx also includes a
   clustalw (text-oriented) version.
 Current performance of ab initio
            methods
• We can confirm gene structures
  experimentally by sequencing cDNA
• Current methods are not really good enough
  – 75% correct per exon, worse with initial and
    final exons
  – 20% correct per gene
  – easier for simpler organisms, e.g. nematode C.
    elegans
• Options are to improve methods, or get
  extra information
  – an attractive source of new information is whole
    genome sequence from a related species, e.g.
    mouse for man
             Assessing performance:
            Sensitivity and Specificity
• Testing of predictions is performed on sequences
  where the gene structure is known
• Sensitivity is the fraction of known genes (or bases
  or exons) that are correctly predicted: $Sn = N_{\text{True Positives}} / N_{\text{All True}}$
   – “Am I finding the things that I’m supposed to find?”

• Specificity is the fraction of predicted genes (or bases or
  exons) that correspond to true genes: $Sp = N_{\text{True Positives}} / N_{\text{All Positives}}$
   – “What fraction of my predictions are true?”

• In general, increasing one decreases the other
Graphic View of Specificity and Sensitivity




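$$Sn = \frac{\text{TruePositive}}{\text{AllTrue}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalseNegative}}$$

$$Sp = \frac{\text{TruePositive}}{\text{AllPositive}} = \frac{\text{TruePositive}}{\text{TruePositive} + \text{FalsePositive}}$$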
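
At the nucleotide level both measures can be computed directly from a true and a predicted parse. The sketch below uses the two formulas above; the function name `sn_sp` and the G/I labelling are illustrative assumptions. Note that Sp as defined here is what many other fields call precision.

```python
def sn_sp(true_labels, predicted_labels, positive="G"):
    """Return (Sn, Sp) for two equal-length label strings.

    Sn = TP / (TP + FN), Sp = TP / (TP + FP), counted per position.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label strings must have the same length")
    tp = fp = fn = 0
    for t, p in zip(true_labels, predicted_labels):
        if p == positive and t == positive:
            tp += 1            # predicted coding, really coding
        elif p == positive:
            fp += 1            # predicted coding, really not
        elif t == positive:
            fn += 1            # missed coding base
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


print(sn_sp("IIGGGGII", "IGGGIIII"))
```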
     How Well Do They Do? Comparing seven
        ab initio gene-finding programs.




(Sn) nucleotide level sensitivity; (Sp) nucleotide level specificity; (AC) approximate correlation;
(CC) correlation coefficient; (ESn) exon level sensitivity; (ESp) exon level specificity; (ME) missed
exons; (WE) wrong exons; (PCa) proportion of real exons that were partially predicted (only one exon
boundary correct); (PCp) proportion of predicted exons that were only partially correct; (OL)
proportion of predicted exons that overlap an actual exon.
S. Rogic, et al. Genome Research, 11: 817-832 (2001).
 Eukaryotic gene prediction tools and
                  web servers
• Genscan (ab initio), GenomeScan (hybrid)
  – http://genes.mit.edu/
• Twinscan (hybrid)
  – http://genes.cs.wustl.edu/
• FGENESH (ab initio)
  – http://www.softberry.com/berry.phtml?topic=gfind
• GeneMark.hmm (ab initio)
  – http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi
• MZEF (ab initio)
  – http://rulai.cshl.org/tools/genefinder/
• GrailEXP (hybrid)
  – http://grail.lsd.ornl.gov/grailexp/
• GeneID (hybrid)
  – http://www1.imim.es/geneid.html
     GeneWise (Ewan Birney)

• GeneWise aligns a protein sequence
  (or HMM) to genomic DNA taking into
  account splicing information
                                      GeneWise output
                                                                                                  -20bp
pkinase.hmm        1 YELGEKLGEGA                          GKVYKAKHK---TGKIVAVKILKKESLSLL                         REIQI
                       ++ LG +                            G+ Y+A +     ++I+ + +K + + +                            E+ +
                     INIKNLLGGDT                          GCLYMAPKVQATKQQIYKLCFIKIKTFVLQ                         TELNL
HSU71B4       -27753 aaaaactgggaGTGTGAGTA Intron 1   CAGTgtttagcagcgaaccatatttaaaaatgccAGGTCACTA Intron 2   CAGGagcac
                     tataattggac <2-----[27718:22469]-2> ggtatccataccaaataatgttatacttta <2-----[22375:21185]-2> catat
                     atcatggtata                         acatgaaaaaaaaaattagcctaaattgta                         tacct


                                                        +3bp       - 6bp
pkinase.hmm   45 LKRLN-HPNIVRLLGVFED-----SKDHLY                       LVLEYMEGGDLFDYLRRKG--PLSEKEAKKIALQILR
                 L++++ H+NIV ++G+F           L+                       +V+E++ G+ D++R+       L E+++ +I ++IL+
                 LRKYSFHKNIVSFYGAFFKLSPPGQRHQLW                       MVMELCAAGSVTDVVRMTSNQSLKEDWIAYICREILQ
HSU71B4   -21168 caatttcaaagtttggttacaccgccccctGTATGTT Intron 3    CAGagagttgggtgagggaaaaacataggtagtatcgacc
                 tgaactaaattctagcttatgccgagaatg<0-----[21078:15667]-0>tttatgccgctcattgtcgaagtaaagtcatggatta
                 gggctccactgcctaatcggtcttggcatg                       ggggataatgcttagagcttgtaaatgtttccaactg


                      - 66bp                                               - 8bp            - 1bp
pkinase.hmm     104                     GLEYLHSNGIVHRDLKPENILLDENGTVKI                          DFGLAKLLK-SGEKLTTFV
                                        GL++LH ++++HRD+K +N+LL++N VK+                           DFG++++++ ++++++F+
                                        GLAHLHAHRVIHRDIKGQNVLLTHNAEVKL                          DFGVSAQVSRTNGRRNSFI
HSU71B4   -15555 GTGAGTC Intron 4    CAGgtgcccgccgaccgaagcagccacagggacGGTAAGTT Intron 5    CAGTTgtggagcgaaaagaaaata
                 <0-----[15555:14066]-0>gtcatacagttagatagaatttcaacatat <1-----[13974:10915]-1> atgtgcatggcagggagtt
                                        catctcacaatcgccatgtgggttttaaag                          ttagtcggcattaagttct


                                                  0bp          - 3bp                      -1 bp            +2bp
pkinase.hmm  153 GTPWYMMAPEVILKG-----RGYSTK                       VDVWSLGVILYELLTGKL                          FPG-D
                 GTP++M APEV +       R Y+ +                       +DVWS+G++ +E++ G +                            +
                 GTPYWM-APEV-IDCDEDPRRSYDYR                       SDVWSVGITAIEMAEGAP                          LCNLQ
HSU71B4   -10855 gactta gcgg agtgggcacttgtaGTGAGTG Intron 6    CAGaggttggaagagaggggcCGTGAGTA Intron 7    CAGCTctacc
                 gccagt ccat tagaaacggcaaag<0-----[10783: 8881]-0>gatgctgtcctatcagcc <1-----[8825 : 4234]-1> tgata
                 gaacgg atgg tcttgcaaccttca                       ttggtgattctagtaact                          gtcta


                                            +1bp               +1bp
pkinase.hmm     196 PLEELFRIKKRLRLPLPPNC                         SEELKDLLKKCLNKDPSKRPTAKELLEHPW
                    PLE+LF I+++ ++ + ++                          S+ + +++KC K+     RPT   +L+HP+
                    PLEALFVILRESAPTVKSSG                         SRKFHNFMEKCTIKNFLFRPTSANMLQHPF
HSU71B4       -4214 ctggctgatcgtgcagatagTGGTAAAGA Intron 8   TAGGtcatcatagataaaatctccatgaacccct
                    ctactttttgacccctacgg <2-----[4154 : 3085]-2> cgataattaagctaatttgccccattaact
                    cgatccttggattcacacca                         ctgcctcgagtgaatcgtttttacgtacat
The birth of comparative genomics…




February 2001        December 2002
  9,471,780 alignments of mouse
shotgun reads to the human genome
            in Ensembl
Human-mouse homology

 Human         Mouse
         Comparative Genefinders

• Twinscan: based on Genscan (Korf et al, 2001)
• ROSETTA – global alignment of orthologous
  regions (Batzoglou et al, 2000)
• SLAM (Alexandersson) and Doublescan
  (Meyer)
  – jointly align while predicting genes
  – computationally expensive (extra search dimension)
    and hard to parameterise
  – still effectively requires a full alignment
• SGP2 (Parra, 2003)
  – Combines ab initio gene prediction with TBLASTX
     Twinscan (Korf et al, 2001)
- Based on Genscan
- Fits a “conservation sequence” alongside the target sequence
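A minimal sketch of how a conservation sequence could be built from local alignments of an informant (e.g. mouse) to the target (e.g. human). It assumes a three-symbol encoding ('|' match, ':' mismatch, '.' unaligned) and that informant gaps count as mismatches while target gaps are skipped. The function name and the toy alignment are hypothetical; this is not Twinscan's own code.

```python
def conservation_sequence(target_len, alignments):
    """Return one symbol per target base.

    alignments: list of (target_start, target_row, informant_row), where the
    rows are gapped alignment strings of equal length and target_start is the
    0-based target position of the first aligned target base.
    """
    cons = ["."] * target_len              # default: unaligned
    for start, target_row, informant_row in alignments:
        if len(target_row) != len(informant_row):
            raise ValueError("alignment rows must have equal length")
        pos = start
        for t, i in zip(target_row.upper(), informant_row.upper()):
            if t == "-":
                continue                   # gap in target: no target base here
            cons[pos] = "|" if t == i else ":"   # informant gap counts as mismatch
            pos += 1
    return "".join(cons)


# Toy example: a 12 bp target with one local alignment starting at position 2.
print(conservation_sequence(12, [(2, "GGC-TAAC", "GACGT--C")]))
```

The gene model can then be scored on both the target bases and this parallel symbol string.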
                  Evaluation of Twinscan




 Paul Flicek et al. Genome Res. 2003: TWINSCAN prediction (red), GENSCAN prediction (green), and an
aligned RefSeq transcript (blue). Yellow: low-complexity regions; black: mouse alignments
     Evaluation of Twinscan




Paul Flicek et al. (Genome Res. 2003): Accuracy of GENSCAN
and TWINSCAN by the exact gene, exact exon, and coding
nucleotide measures
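A small sketch of how the exact-exon and coding-nucleotide measures can be computed, assuming exons are given as inclusive (start, end) coordinates. Sensitivity is TP/(TP+FN); "specificity" is used here in the gene-finding sense, TP/(TP+FP). The coordinates are made-up toy values, not data from the paper.

```python
def sens_spec(true_items, predicted_items):
    """Sensitivity and specificity of a predicted set against a true set."""
    tp = len(true_items & predicted_items)
    sn = tp / len(true_items) if true_items else 0.0
    sp = tp / len(predicted_items) if predicted_items else 0.0
    return sn, sp


def coding_positions(exons):
    """Expand inclusive (start, end) exons into a set of coding positions."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


annotated = [(100, 200), (300, 400), (500, 600)]   # toy annotation
predicted = [(100, 200), (310, 400), (700, 750)]   # toy prediction

# Exact exon level: both boundaries must match.
exon_sn, exon_sp = sens_spec(set(annotated), set(predicted))
# Coding nucleotide level: partially correct exons still earn credit.
nuc_sn, nuc_sp = sens_spec(coding_positions(annotated), coding_positions(predicted))

print(f"exon level:       Sn={exon_sn:.2f} Sp={exon_sp:.2f}")
print(f"nucleotide level: Sn={nuc_sn:.2f} Sp={nuc_sp:.2f}")
```

The second exon is off by 10 bp at its start, so it counts as wrong at the exon level but mostly right at the nucleotide level; this is why nucleotide-level accuracy is always the higher figure.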
 Twinscan (N-SCAN) in UCSC genome browser




Region of seq1 (from HMR195 dataset) presented
Twinscan URL: http://ardor.wustl.edu
    ROSETTA (Batzoglou et al, 2000, Genome Res)
          Human Locus of proliferating cell nuclear antigen (PCNA)




 [Figure panels: Alignment; Parse; Mouse Locus.
  Legend: coding exons, noncoding exons, introns, intergenic regions,
  strong alignment, weak alignment]

- Global alignment of orthologous genomic loci from human and
mouse in an iterative fashion
- Identifies genes based on conservation of exonic features
- Requires a full alignment, which is likely to be incorrect where it is weak
(http://theory.lcs.mit.edu/crossspecies/)
        SLAM (Alexandersson, Genome Res, 2003)
- Probabilistic framework for gene structure and alignment
- Finds gene structure and the alignment of two syntenic genomic regions
simultaneously
- Finds the best alignment between the two syntenic regions
- Predicts both coding (SLAM_CDS) and conserved noncoding (SLAM_CNS) regions




   Fourteen thousand bp from the HoxA cluster showing the HoxA2 and HoxA3 genes
SLAM data in UCSC genome browser




 SLAM CNS: Conserved Noncoding Sequences
 Server: http://baboon.math.berkeley.edu/~syntenic/sla
             SGP2 (Parra & Guigo, 2003)
                                  •Combines ab initio gene finding with TBLASTX
                                  searches




TBLASTX pairwise comparison of
human & mouse genomic sequences
encoding the HLA class II alpha chain
                                          Rescoring of the exons predicted by GENEID
                                          according to the results of a TBLASTX search.


   •SGP2 server: http://genome.imim.es/software/sgp2/sgp2.html
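The rescoring idea can be sketched as follows: each ab initio exon keeps its own score and gains a bonus from every TBLASTX HSP that overlaps it. The bonus formula, the weight of 0.5 and all coordinates are illustrative assumptions, not SGP2's published scoring scheme.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def rescore_exons(exons, hsps, weight=0.5):
    """Add a similarity bonus to each candidate exon.

    exons: list of (start, end, ab_initio_score)
    hsps:  list of (start, end, hsp_score) from a similarity search
    The bonus is the HSP score scaled by the fraction of the HSP
    that falls inside the exon (hypothetical formula).
    """
    rescored = []
    for start, end, score in exons:
        bonus = 0.0
        for h_start, h_end, h_score in hsps:
            shared = overlap(start, end, h_start, h_end)
            bonus += h_score * shared / (h_end - h_start + 1)
        rescored.append((start, end, score + weight * bonus))
    return rescored


exons = [(100, 199, 5.0), (400, 499, 5.0)]   # toy candidate exons
hsps = [(150, 249, 40.0)]                    # toy HSP overlapping the first exon
for exon in rescore_exons(exons, hsps):
    print(exon)
```

Two exons that start with equal ab initio scores are now separated: the one supported by cross-species similarity is preferred when the final gene is assembled.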
                 Exercise 4
• Submit sequence 1 of HMR195 to Genomescan,
  FGENESH, SGP2 and Grail servers (besides Genscan)
   – for those servers that require a homologous
     sequence from a related species, use the mouse
     OCTN2 gene/protein, which you can look up using
     NCBI's Entrez (cf. the morning lecture on
     databases) or the expasy.org website
• Align the predicted peptide sequences (provided on
  the next pages) to the default gene, OCTN2, with
  clustalx, or use the EBI's clustalw server at
  http://www.ebi.ac.uk/clustalw/
• You can also visualise the resulting “tree”
  (automatically generated by clustalx) with TreeView,
  which you can download from:
  http://taxonomy.zoology.ac.uk/rod/treeview.html
                                  Exercise 4, cont’d
>OCTN2 [Homo sapiens] (the real protein)
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSSAWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALG
LEPGRDVDLGQLEQESCLDGWEFSQDVYLSTIVTEWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGMQTGFSFLQIFSKN
FEMFVVLFVLVGMGQISNYVAAFVLGTEILGKSVRIIFSTLGVCIFYAFGYMVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQG
RFEEAEVIIRKAAKANGIVVPSTIFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYFGLSLDTPNLHGDIFVNCFLSAMVEV
PAYVLAWLLLQYLPRRYSMATALFLGGSVLLFMQLVPPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYF
VYLGAYDRFLPYILMGSLTILTAILTLFLPESFGTPLPDTIDQMLRVKGMKHRKTPSHTRMLKDGQERPTILKSTAF
>genscan1 185_aa
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSSAWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALG
LEPGRDVDLGQLEQESCLDGWEFSQDVYLSTIVTEVVSLASRPDGHFEEGTRKVLAVPGRCSMALYSFPYRPQPKNDNLVMPYTGTLAP
>GENSCAN2 410_aa
MGKKPALCLPAQWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGMQTGFSFLQIFSKNFEMFVVLFVLVGMGQISNYVAAF
VLGTEILGKSVRIIFSTLGVCIFYAFGYMVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQGRFEEAEVIIRKAAKANGIVVPST
IFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYFGLSLDTPNLHGDIFVNCFLSAMVEVPAYVLAWLLLQYLPRRYSMATAL
FLGGSVLLFMQLVPPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYFVYLGAYDRFLPYILMGSLTILTA
ILTLFLPESFGTPLPDTIDQMLRVKX
>FGENESH1   1      2 exon (s)    222   -     937   167 aa, chain +
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSSAWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALG
LEPGRDVDLGQLEQESCLDGWEFSQDVYLSTIVTEVVSLASRPDGHFEEGTRKVLAVPGRCSMALCVQDLL
>FGENESH2   2   10 exon (s)     8459   -   24531   438 aa, chain +
MGKKPALCLPAQWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGMQTGFSFLQIFSKNFEMFVVLFVLVGMGQISNYVAAF
VLGTEILGKSVRIIFSTLGVCIFYAAKANGIVVPSTIFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYFAFGYMVLPLFAY
FIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQGRFEEAEVIIRKGLSLDTPNLHGDIFVNCFLSAMVEVPAYVLAWLLLQYLPRRYSMATAL
FLGGSVLLFMQLVPPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYFVYLGAYDRFLPYILMGSLTILTA
ILTLFLPESFGTPLPDTIDQMLRVKGMKHRKTPSHTRMLKDGQERPTILKSTAF
                       Exercise 4, cont’d
>genomescan1 185_aa ENSMUSP00000019044+:1150..1180:E=7e-76
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSSAWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALG
LEPGRDVDLGQLEQESCLDGWEFSQDVYLSTIVTEVVSLASRPDGHFEEGTRKVLAVPGRCSMALYSFPYRPQPKNDNLVMPYTGTLAP
>genomescan2 438_aa:ENSMUSP00000019044+:352..421:E=7e-40
MGKKPALCLPAQWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGMQTGFSFLQIFSKNFEMFVVLFVLVGMGQISNYVAAF
VLGTEILGKSVRIIFSTLGVCIFYAFGYMVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQGRFEEAEVIIRKAAKANGIVVPST
IFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYFGLSLDTPNLHGDIFVNCFLSAMVEVPAYVLAWLLLQYLPRRYSMATAL
FLGGSVLLFMQLVPPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYFVYLGAYDRFLPYILMGSLTILTA
ILTLFLPESFGTPLPDTIDQMLRVKGMKHRKTPSHTRMLKDGQERPTILKSTAF
>sgp2 _v1.0_predicted_protein_1|528_AA
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSSAWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALG
LEPGRDVDLGQLEQESCLDGWEFSQDVYLSTIVTEWNLVCEDDWKAPLTISLFFVGVLLGSFISGQLSDRFGRKNVLFVTMGMQTGFSFLQIFSKN
FEMFVVLFVLVGMGQISNYVAAFVLGTEILGKSVRIIFSTLGVCIFYAFGYMVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQG
RFEEAEVIIRKAAKANGIVVPSTIFDPSELQDLSSKKQQSHNILDLLRTWNIRMVTIMSIMLWMTISVGYFGLSLDTPNLHGDIFVNCFLSAMVEV
PAYVLAWLLLQYLPRRYSMATALFLGGSVLLFMQLVPPDLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYF
VYLGAYDRFLPYILMGSLTILTAILTLFLPESFGTPLPDTIDQMLRVKg
>Grail1 Gene 1, Var 1 protein
MRDYDEVTAFLGEWGPFQRLIFFLLSASIIPNGFTGLSSVFLIATPEHRCRVPDAANLSSAWRNHTVPLRLRDGREVPHSCRRYRLATIANFSALG
LEPGRDVDLGQLEQESCLDGWEFSQDVYLSTIVTEVGAGPCWG
>Grail2 Var 1 protein
MVLPLFAYFIRDWRMLLVALTMPGVLCVALWWFIPESPRWLISQGRFEEAEVIIRKAAKANGIVVPSTIFDPSELQDLSSKKQQSHNILDLLRTW
NIRMVTIMSIMLWYLWHAYHGLGTVLGDNDTQENKPGRTPGPHGTYTLDEKTDNKQEVIETDSPKNLGRKYVCFALNSCMPWVGTYSYPLSFASP
DLYYLATVLVMVGKFGVTAAFSMVYVYTAELYPTVVRNMGVGVSSTASRLGSILSPYFVYLGAYDRFLPYILMGSLTILTAILTLFLPESFGTPLP
DTIDQMLRVKG
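Before running clustalx, a quick first look at where a prediction departs from the real protein can be sketched with the Python standard library only. The helper names are hypothetical, and the two records are short fragments cut from the OCTN2 and genscan1 sequences above; paste the full FASTA records in their place for the exercise.

```python
from difflib import SequenceMatcher


def parse_fasta(text):
    """Parse FASTA text into {name: sequence}; stray spaces are removed."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.replace(" ", "")
    return records


def common_prefix_length(a, b):
    """Number of leading residues that are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_shared_block(a, b):
    """Longest stretch of residues that occurs unchanged in both sequences."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


fasta = """
>OCTN2_fragment
DVYLSTIVTEWNLVCEDDWKAPL
>genscan1_fragment
DVYLSTIVTEVVSLASRPDGHF
"""
seqs = parse_fasta(fasta)
ref, pred = seqs["OCTN2_fragment"], seqs["genscan1_fragment"]
print(common_prefix_length(ref, pred))
print(longest_shared_block(ref, pred))
```

A prediction that matches the reference up to a point and then diverges, as here, suggests a missed or mispredicted splice site at that position.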
        Combined Approaches
• Programs that combine site, comparative
  and composition information (3 in 1)
   – GenomeScan, FGENESH++, Twinscan
• Programs that use synteny between
  organisms
   – ROSETTA, SLAM, SGP
• Programs that combine predictions from
  multiple predictors
   – GeneComber, DIGIT
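One simple way to combine several predictors is a majority vote over exactly matching exons. The sketch below is a generic illustration under that assumption; it is not the algorithm used by GeneComber or DIGIT, and the program names and coordinates are placeholders.

```python
from collections import Counter


def consensus_exons(predictions, min_votes=2):
    """Keep exons predicted with identical boundaries by at least min_votes programs.

    predictions: {program_name: [(start, end), ...]}
    """
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))          # each program votes once per exon
    return sorted(exon for exon, n in votes.items() if n >= min_votes)


predictions = {
    "program_a": [(100, 200), (300, 400)],
    "program_b": [(100, 200), (310, 400)],
    "program_c": [(100, 200), (300, 400), (500, 600)],
}
print(consensus_exons(predictions))
```

Raising min_votes trades sensitivity for specificity: exons found by only one program are dropped, which removes many false positives but also any true exon that only one method detects.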
             GeneComber
http://www.bioinformatics.ubc.ca/genecomber/submit.php
              Summary
• Genes are complex structures, which
  are difficult to predict
• Different approaches to gene finding:
  – Ab Initio : GenScan
  – Ab Initio modified by BLAST homologies:
    GenomeScan
  – Homology guided: GeneWise
       Outstanding Issues
• Most gene finders cannot handle
  untranslated regions (UTRs)
• ~40% of human genes have non-coding
  1st exons (UTRs)
• Most gene finders cannot handle
  alternative splicing
• Most gene finders cannot handle
  overlapping or nested genes
            Bottom Line
• Gene finding in eukaryotes is not solved
• Accuracy of the best methods
  approaches 80% at the exon level (90%
  at the nucleotide level) in coding-rich
  regions (much lower for whole genomes)
• Gene predictions should always be
  verified by other means (cDNA
  sequencing, BLAST search, mass
  spectrometry)
The End

				