Yeast whole-genome analysis of conserved regulatory motifs

									                6.096
Algorithms for Computational Biology




            Prof. Manolis Kellis
            TA: Reina Riemann
                   Today’s Goals

• Introduction
   – Class introduction
   – Challenges in Computational Biology


• Gene Regulation: Regulatory Motif Discovery
   – Exhaustive search
   – Content-based indexing
   – Greedy optimization
                 Course Administrivia

• 6.096 – Algorithms for Computational Biology
   –   Taught jointly with 6.046, Introduction to Algorithms
   –   Explores specific application area of algorithms
   –   Algorithmic challenges in Computational Biology
   –   Design principles to address them
• Lectures
   – Fridays 9:30-11, in 32-123
   – http://theory.csail.mit.edu/classes/6.096/
   – Grading: 4 problem sets = 60%. Final: 30%.
     Attendance: 10%
Book references
ATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATA
TATCCATATCTAATCTTACTTATATGTTGTGGAAATGTAAAGAGCCCCATTATCTTAGCCTAAAAAAACCTTCTCTTTGGAACTTTC
TAATACGCTTAACTGCTCATTGCTATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTC
TGCGTCCTCGTCTTCACCGGTCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAATACT
CTTTTATGGTTATGAAGAGGAAAAATTGGCAGTAACCTGGCCCCACAAACCTTCAAATTAACGAATCAAATTAACAACCATAGGATG
AATGCGATTAGTTTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCGATGATTTTTGATCTATTAACAGATATATAAATGGAA
GCTGCATAACCACTTTAACTAATACTTTCAACATTTTCAGTTTGTATTACTTCTTATTCAAATGTCATAAAAGTATCAACAAAAAAT
TTAATATACCTCTATACTTTAACGTCAAGGAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
CTAGCGCAAAGGAATTACCAAGACCATTGGCCGAAAAGTGCCCGAGCATAATTAAGAAATTTATAAGCGCTTATGATGCTAAACCGG
TTTGTTGCTAGATCGCCTGGTAGAGTCAATCTAATTGGTGAACATATTGATTATTGTGACTTCTCGGTTTTACCTTTAGCTATTGAT
TGATATGCTTTGCGCCGTCAAAGTTTTGAACGAGAAAAATCCATCCATTACCTTAATAAATGCTGATCCCAAATTTGCTCAAAGGAA
TCGATTTGCCGTTGGACGGTTCTTATGTCACAATTGATCCTTCTGTGTCGGACTGGTCTAATTACTTTAAATGTGGTCTCCATGTTG
CACTCTTTTCTAAAGAAACTTGCACCGGAAAGGTTTGCCAGTGCTCCTCTGGCCGGGCTGCAAGTCTTCTGTGAGGGTGATGTACCA
TGGCAGTGGATTGTCTTCTTCGGCCGCATTCATTTGTGCCGTTGCTTTAGCTGTTGTTAAAGCGAATATGGGCCCTGGTTATCATAT
CCAAGCAAAATTTAATGCGTATTACGGTCGTTGCAGAACATTATGTTGGTGTTAACAATGGCGGTATGGATCAGGCTGCCTCTGTTT
GGTGAGGAAGATCATGCTCTATACGTTGAGTTCAAACCGCAGTTGAAGGCTACTCCGTTTAAATTTCCGCAATTAAAAAACCATGAA
TAGCTTTGTTATTGCGAACACCCTTGTTGTATCTAACAAGTTTGAAACCGCCCCAACCAACTATAATTTAAGAGTGGTAGAAGTCAC
CAGCTGCAAATGTTTTAGCTGCCACGTACGGTGTTGTTTTACTTTCTGGAAAAGAAGGATCGAGCACGAATAAAGGTAATCTAAGAG
TTCATGAACGTTTATTATGCCAGATATCACAACATTTCCACACCCTGGAACGGCGATATTGAATCCGGCATCGAACGGTTAACAAAG
GCTAGTACTAGTTGAAGAGTCTCTCGCCAATAAGAAACAGGGCTTTAGTGTTGACGATGTCGCACAATCCTTGAATTGTTCTCGCGA
AATTCACAAGAGACTACTTAACAACATCTCCAGTGAGATTTCAAGTCTTAAAGCTATATCAGAGGGCTAAGCATGTGTATTCTGAAT
TTAAGAGTCTTGAAGGCTGTGAAATTAATGACTACAGCGAGCTTTACTGCCGACGAAGACTTTTTCAAGCAATTTGGTGCCTTGATG
CGAGTCTCAAGCTTCTTGCGATAAACTTTACGAATGTTCTTGTCCAGAGATTGACAAAATTTGTTCCATTGCTTTGTCAAATGGATC
ATGGTTCCCGTTTGACCGGAGCTGGCTGGGGTGGTTGTACTGTTCACTTGGTTCCAGGGGGCCCAAATGGCAACATAGAAAAGGTAA
GAAGCCCTTGCCAATGAGTTCTACAAGGTCAAGTACCCTAAGATCACTGATGCTGAGCTAGAAAATGCTATCATCGTCTCTAAACCA
ATTGGGCAGCTGTCTATATGAATTATAAGTATACTTCTTTTTTTTACTTTGTTCAGAACAACTTCTCATTTTTTTCTACTCATAACT
AGCATCACAAAATACGCAATAATAACGAGTAGTAACACTTTTATAGTTCATACATGCTTCAACTACTTAATAAATGATTGTATGATA
GTTTTCAATGTAAGAGATTTCGATTATCCACAAACTTTAAAACACAGGGACAAAATTCTTGATATGCTTTCAACCGCTGCGTTTTGG
ACCTATTCTTGACATGATATGACTACCATTTTGTTATTGTACGTGGGGCAGTTGACGTCTTATCATATGTCAAAGTCATTTGCGAAG
CTTGGCAAGTTGCCAACTGACGAGATGCAGTAAAAAGAGATTGCCGTCTTGAAACTTTTTGTCCTTTTTTTTTTCCGGGGACTCTAC
GAACCCTTTGTCCTACTGATTAATTTTGTACTGAATTTGGACAATTCAGATTTTAGTAGACAAGCGCGAGGAGGAAAAGAAATGACA
AAAATTCCGATGGACAAGAAGATAGGAAAAAAAAAAAGCTTTCACCGATTTCCTAGACCGGAAAAAAGTCGTATGACATCAGAATGA
AATTTTCAAGTTAGACAAGGACAAAATCAGGACAAATTGTAAAGATATAATAAACTATTTGATTCAGCGCCAATTTGCCCTTTTCCA
TTCCATTAAATCTCTGTTCTCTCTTACTTATATGATGATTAGGTATCATCTGTATAAAACTCCTTTCTTAATTTCACTCTAAAGCAT
CCCATAGAGAAGATCTTTCGGTTCGAAGACATTCCTACGCATAATAAGAATAGGAGGGAATAATGCCAGACAATCTATCATTACATT
AGCGGCTCTTCAAAAAGATTGAACTCTCGCCAACTTATGGAATCTTCCAATGAGACCTTTGCGCCAAATAATGTGGATTTGGAAAAA
GTATAAGTCATCTCAGAGTAATATAACTACCGAAGTTTATGAGGCATCGAGCTTTGAAGAAAAAGTAAGCTCAGAAAAACCTCAATA
GCTCATTCTGGAAGAAAATCTATTATGAATATGTGGTCGTTGACAAATCAATCTTGGGTGTTTCTATTCTGGATTCATTTATGTACA
CAGGACTTGAAGCCCGTCGAAAAAGAAAGGCGGGTTTGGTCCTGGTACAATTATTGTTACTTCTGGCTTGCTGAATGTTTCAATATC
CACTTGGCAAATTGCAGCTACAGGTCTACAACTGGGTCTAAATTGGTGGCAGTGTTGGATAACAATTTGGATTGGGTACGGTTTCGT
(Same sequence as above, with overlaid labels:)
• Genes encode proteins
• Regulatory motifs control gene expression
     Extracting signal from noise
             Challenges in Computational Biology

    1 Gene Finding
    2 Sequence alignment
    3 Database lookup
    4 Genome Assembly
    5 Regulatory motif discovery
    6 Comparative Genomics
    7 Evolutionary Theory
    8 Gene expression analysis
    9 Cluster discovery
   10 Gibbs sampling
   11 Protein network analysis
   12 Regulatory network inference
   13 Emerging network properties

(Figure labels: DNA; RNA transcript; example aligned sites TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT)
       Algorithms and techniques covered

• Enumeration approaches
   – Exhaustive search, pruning, greedy algorithms, iterative
     refinement
• Content-based indexing
   – Hashing, database lookup, pre-processing
• Iterative methods
   – Combining sub-problems, memoization, dynamic programming
• Statistical methods
   – Hypothesis testing, maximum likelihood, Bayes’ Law, HMMs
• Machine learning techniques
   – Supervised and unsupervised learning, classification
                                                     Genomic Scales
Organism                     Base pairs       Genes         Notes
Phi-X 174                                5,386         10   virus of E. coli
Human mitochondrion                     16,569         37   Energy production for human cells
Epstein-Barr virus (EBV)               172,282         80   causes mononucleosis
nucleomorph of Guillardia theta        551,264        511   Remains of the nuclear genome of a red alga (eukaryote) engulfed long ago by another eukaryote
Mycoplasma genitalium                  580,073        483   One of the smallest true organisms
Treponema pallidum                   1,138,011      1,039   bacterium that causes syphilis
Mimivirus                            1,181,404      1,262   A virus (of an amoeba) with a genome larger than several cellular organisms above
Helicobacter pylori                  1,667,867      1,589   chief cause of stomach ulcers (not stress and diet)
Methanococcus jannaschii             1,664,970      1,783   Classified in a third kingdom: Archaea.
Haemophilus influenzae               1,830,138      1,738   bacterium that causes middle ear infections
Streptococcus pneumoniae             2,160,837      2,236   the pneumococcus
Propionibacterium acnes              2,560,265      2,333   causes acne
E. coli                              4,639,221      4,377   Most well-studied bacterium
Saccharomyces cerevisiae            12,495,682      5,770   Budding yeast. A eukaryote.
Neurospora crassa                   38,639,769     10,082   Green mold fungus.
Caenorhabditis elegans             100,258,171     19,000   The first multi-cellular eukaryote to be sequenced.
Arabidopsis thaliana               115,409,949     25,498   a flowering plant (angiosperm) See note.
Drosophila melanogaster            122,653,977     13,379   the fruit fly
Anopheles gambiae                  278,244,063     13,683   Mosquito vector of malaria.
Humans                           3,000,000,000     22,000   Sequenced in 1999, completed in 2004.
Tetraodon nigroviridis             342,000,000     27,918   Much less repetitive DNA, but slightly more genes.
Rice                             4,300,000,000     60,000   Extremely repetitive. Genes show GC gradient
Amphibians                     109,000,000,000 ?


       •     Importance of algorithm design for efficiency
               – Compare human vs. mouse (blocks of 1,000 nucleotides)
                       • 3,000,000 * 3,000,000 block comparisons, each 1,000 * 1,000 operations (with dynamic programming)
                       • At 1 trillion operations per second, this would take 104 days
               – Search all regulatory motifs of length 20 (11^20 candidates) in the human genome
                       • 426 years
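
The two running-time estimates above can be checked with a few lines of arithmetic. The sketch below is only a sanity check: the 104-day figure follows directly from the numbers on the slide, while the 426-year figure is reproduced here under the assumption (not stated on the slide) that each of the 11^20 candidate motifs costs about 20 operations, one per motif position. The function and constant names are illustrative.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY
OPS_PER_SECOND = 10**12  # "1 trillion operations per second" from the slide


def human_mouse_days():
    """Days needed to compare every human block with every mouse block."""
    blocks = 3_000_000                 # 3 billion bases / 1,000 bases per block
    block_pairs = blocks * blocks      # all-against-all block comparisons
    ops_per_pair = 1_000 * 1_000       # dynamic programming on a 1,000 x 1,000 grid
    return block_pairs * ops_per_pair / OPS_PER_SECOND / SECONDS_PER_DAY


def motif_enumeration_years(ops_per_motif=20):
    """Years needed to enumerate all length-20 motifs over an 11-letter alphabet.

    ops_per_motif=20 is an assumption made here to match the slide's figure.
    """
    motifs = 11**20
    return motifs * ops_per_motif / OPS_PER_SECOND / SECONDS_PER_YEAR


print(f"human vs. mouse: {human_mouse_days():.1f} days")
print(f"length-20 motifs: {motif_enumeration_years():.1f} years")
```

Both numbers make the same point: a naive all-against-all or enumerate-everything strategy is out of reach, which motivates the pruning, indexing, and greedy methods covered later.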
                                   Today:

         Gene Regulation and Motif Discovery




Gene regulation: The process by which genes are turned on or off, in
response to environmental stimuli

Regulatory motifs: sequences that control gene usage; short sequence
patterns, ~6-12 letters long, possibly degenerate
                 Why cellular programs change
• Environmental Response
 (Figure labels: temperature response, food supply)
 – Cells adapt to their environment, carry out different molecular
   processes, depending on their environment
 – Produce same nutrients in entirely different pathways

• Cell differentiation
 – Cells have distinct functions: hair, nail, skin, heart, eye, brain,
   muscle, bone
 – Cells differentiate, by using different parts of the same genome
 – These morphological changes are due to expression levels

                  • Genome Remains Unchanged!
                 How cellular programs change

Regulatory knobs
•   DNA level: gene dosage
     – How many copies of a particular gene
     – How many homologs, how many pathways
     – Accessibility of gene within chromatin
•   mRNA: Transcription initiation
     – Regulatory motifs recognized by transcription factors
     – Transcription factors recruit transcription machinery
     – Dictates number of messages sent to cytoplasm
•   mRNA: Post-transcriptional control
     – How long messages stay active
     – How fast messages are degraded
•   Protein: Translation level
     – How many times is each message translated to protein
     – How stable are protein products, how long before degraded
•   Protein: Post-translational modifications
     – Some proteins only perform their functions when phosphorylated
     – Some are only active as a hetero-dimer, so the cell can regulate only one of the two partners.
                   Regulatory motif discovery

(Figure: upstream of the GAL1 gene (ATGACTAAATCTCATTCAGAAGAA), two Gal4 sites (CGG ... CCG) and one Mig1 site (CCCCW))

• Regulatory motifs
    – Genes are turned on / off in response to changing environments
    – No direct addressing: subroutines (genes) contain sequence tags (motifs)
    – Specialized proteins (transcription factors) recognize these tags


•   What makes motif discovery hard?
    – Motifs are short (6-8 bp), sometimes degenerate
    – Can contain any set of nucleotides (no ATG or other rules)
    – Act at variable distances upstream (or downstream) of target gene
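
To make "short, sometimes degenerate" concrete, here is a minimal sketch of how a degenerate motif such as CCCCW (Mig1) or a gapped motif such as CGG-11-CCG (Gal4) can be matched against a sequence. It assumes the standard IUPAC nucleotide codes (W = A or T, R = A or G, and so on) and reads a number between dashes as a gap of that many arbitrary bases. The helper names `motif_to_regex` and `find_sites` are illustrative, and only the forward strand is scanned.

```python
import re

# Standard IUPAC nucleotide codes (subset sufficient for the motifs in this lecture).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate a motif like 'CCCCW' or 'CGG-11-CCG' into a regular expression."""
    parts = []
    for token in motif.split("-"):
        if token.isdigit():
            # A numeric token is a gap of that many unconstrained bases.
            parts.append("[ACGT]{%d}" % int(token))
        else:
            parts.append("".join(IUPAC[ch] for ch in token))
    return "".join(parts)


def find_sites(motif, sequence):
    """Return start positions of all (possibly overlapping) motif matches."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(%s))" % motif_to_regex(motif))
    return [m.start() for m in pattern.finditer(sequence)]


print(find_sites("CCCCW", "GGCCCCAGGCCCCTGGCCCCG"))
print(find_sites("CGG-11-CCG", "AAACGG" + "ACGTACGTACG" + "CCGTT"))
```

Because the motif carries no fixed anchor such as ATG, every position in the sequence has to be tested, and a short pattern like CCCCW will match by chance many times in a genome.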
    Protein/DNA contact dictates regulatory motifs

•   Sequence specificity
     – Topology of 3D contact dictates
       sequence specificity of binding
     – Some positions are fully
       constrained; other positions are
       degenerate
•   Protein-DNA interactions
     – Proteins read DNA by “feeling”
       the chemical properties of the
       bases
     – Without opening DNA (not by
       base complementarity)
          Computational approaches

• Method #1: Enumerate all motifs



• Method #2: Randomly sample the genome



• Method #3: Enumerate motif seeds + refinement



• Method #4: Content-based addressing
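
The slides only name these methods, so the following is a hedged sketch of how Method #1 and Method #4 might differ in practice, not the lecture's own implementation. `enumerate_counts` tries every possible k-mer and scans the sequence for each one; `build_index` makes a single pass and hashes each k-mer to the positions where it occurs. Both function names are illustrative.

```python
from collections import defaultdict
from itertools import product


def enumerate_counts(sequence, k):
    """Exhaustive enumeration: count occurrences of each of the 4^k possible k-mers."""
    counts = {}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        counts[kmer] = sum(
            1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == kmer
        )
    return counts


def build_index(sequence, k):
    """Content-based indexing: one pass, mapping each observed k-mer to its positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


seq = "ACGTACGT"
counts = enumerate_counts(seq, 3)
index = build_index(seq, 3)
print(counts["ACG"], index["ACG"])
```

Enumeration costs on the order of 4^k scans of the sequence, while the index is built in one scan and only stores k-mers that actually occur, which is the usual reason to prefer hashing when k grows.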
            Need: Evaluation method




(Diagram: Motif Generator -> Candidate Motifs -> Motif Evaluator, with "?" marking the missing evaluation step)




• To test whether a motif is meaningful:
   – Evaluate its conservation rate
Lecture continued on the blackboard



      Slides will be available soon
Regulatory motif discovery



     Study known motifs



   Derive conservation rules




    Discover novel motifs
            Comparison of related species

(Figure: phylogenetic tree of S.cerevisiae, S.paradoxus, S.mikatae, and S.bayanus; branch lengths shown: 0.13, 0.07, 0.10, 0.08, 0.19, 0.27)
Total length: 0.83 (substitutions per site)
Conserved islands match known regulatory sites
                                                   Gal4
                                Gal10                                 Gal1
            Scer   TTATATTGAATTTTCAAAAATTCTTACTTTTTTTTTGGATGGACGCAAAGAAGTTTAATAATCATATTACATGGCATTACCACCATATACA
            Spar   CTATGTTGATCTTTTCAGAATTTTT-CACTATATTAAGATGGGTGCAAAGAAGTGTGATTATTATATTACATCGCTTTCCTATCATACACA
   GAL10    Smik   GTATATTGAATTTTTCAGTTTTTTTTCACTATCTTCAAGGTTATGTAAAAAA-TGTCAAGATAATATTACATTTCGTTACTATCATACACA
            Sbay   TTTTTTTGATTTCTTTAGTTTTCTTTCTTTAACTTCAAAATTATAAAAGAAAGTGTAGTCACATCATGCTATCT-GTCACTATCACATATA
                    * * **** * * *     ** ** * *    **           ** ** * *     *    **   **    * * * ** * * *
                            TBP
   Scer   TATCCATATCTAATCTTACTTATATGTTGT-GGAAAT-GTAAAGAGCCCCATTATCTTAGCCTAAAAAAACC--TTCTCTTTGGAACTTTCAGTAATACG
   Spar   TATCCATATCTAGTCTTACTTATATGTTGT-GAGAGT-GTTGATAACCCCAGTATCTTAACCCAAGAAAGCC--TT-TCTATGAAACTTGAACTG-TACG
   Smik   TACCGATGTCTAGTCTTACTTATATGTTAC-GGGAATTGTTGGTAATCCCAGTCTCCCAGATCAAAAAAGGT--CTTTCTATGGAGCTTTG-CTA-TATG
   Sbay   TAGATATTTCTGATCTTTCTTATATATTATAGAGAGATGCCAATAAACGTGCTACCTCGAACAAAAGAAGGGGATTTTCTGTAGGGCTTTCCCTATTTTG
          **   ** *** **** ******* **    * *    *     * *     * *        ** **       * *** *    ***    * * *
                                           GAL4               GAL4              GAL4
   Scer   CTTAACTGCTCATTGC-----TATATTGAAGTACGGATTAGAAGCCGCCGAGCGGGCGACAGCCCTCCGACGGAAGACTCTCCTCCGTGCGTCCTCGTCT
   Spar   CTAAACTGCTCATTGC-----AATATTGAAGTACGGATCAGAAGCCGCCGAGCGGACGACAGCCCTCCGACGGAATATTCCCCTCCGTGCGTCGCCGTCT
   Smik   TTTAGCTGTTCAAG--------ATATTGAAATACGGATGAGAAGCCGCCGAACGGACGACAATTCCCCGACGGAACATTCTCCTCCGCGCGGCGTCCTCT
   Sbay   TCTTATTGTCCATTACTTCGCAATGTTGAAATACGGATCAGAAGCTGCCGACCGGATGACAGTACTCCGGCGGAAAACTGTCCTCCGTGCGAAGTCGTCT
                ** **           ** ***** ******* ****** ***** *** ****    * *** ***** * * ****** ***     * ***
                                           GAL4
   Scer   TCACCGG-TCGCGTTCCTGAAACGCAGATGTGCCTCGCGCCGCACTGCTCCGAACAATAAAGATTCTACAA-----TACTAGCTTTT--ATGGTTATGAA
   Spar   TCGTCGGGTTGTGTCCCTTAA-CATCGATGTACCTCGCGCCGCCCTGCTCCGAACAATAAGGATTCTACAAGAAA-TACTTGTTTTTTTATGGTTATGAC
   Smik   ACGTTGG-TCGCGTCCCTGAA-CATAGGTACGGCTCGCACCACCGTGGTCCGAACTATAATACTGGCATAAAGAGGTACTAATTTCT--ACGGTGATGCC
   Sbay   GTG-CGGATCACGTCCCTGAT-TACTGAAGCGTCTCGCCCCGCCATACCCCGAACAATGCAAATGCAAGAACAAA-TGCCTGTAGTG--GCAGTTATGGT
               ** *   ** *** *      *      ***** ** * *    ****** **     *   * **     * *             ** ***
                                      MIG1
   Scer   GAGGA-AAAATTGGCAGTAA----CCTGGCCCCACAAACCTT-CAAATTAACGAATCAAATTAACAACCATA-GGATGATAATGCGA------TTAG--T
   Spar   AGGAACAAAATAAGCAGCCC----ACTGACCCCATATACCTTTCAAACTATTGAATCAAATTGGCCAGCATA-TGGTAATAGTACAG------TTAG--G
   Smik   CAACGCAAAATAAACAGTCC----CCCGGCCCCACATACCTT-CAAATCGATGCGTAAAACTGGCTAGCATA-GAATTTTGGTAGCAA-AATATTAG--G
   Sbay   GAACGTGAAATGACAATTCCTTGCCCCT-CCCCAATATACTTTGTTCCGTGTACAGCACACTGGATAGAACAATGATGGGGTTGCGGTCAAGCCTACTCG
                 ****    *         *   *****     ***              * * *    * * *     *     *           **
                   MIG1                                                      TBP
   Scer   TTTTTAGCCTTATTTCTGGGGTAATTAATCAGCGAAGCG--ATGATTTTT-GATCTATTAACAGATATATAAATGGAAAAGCTGCATAACCAC-----TT
   Spar   GTTTT--TCTTATTCCTGAGACAATTCATCCGCAAAAAATAATGGTTTTT-GGTCTATTAGCAAACATATAAATGCAAAAGTTGCATAGCCAC-----TT
   Smik   TTCTCA--CCTTTCTCTGTGATAATTCATCACCGAAATG--ATGGTTTA--GGACTATTAGCAAACATATAAATGCAAAAGTCGCAGAGATCA-----AT
   Sbay   TTTTCCGTTTTACTTCTGTAGTGGCTCAT--GCAGAAAGTAATGGTTTTCTGTTCCTTTTGCAAACATATAAATATGAAAGTAAGATCGCCTCAATTGTA
           * *      *    ***       * **   * *      *** ***   * * ** ** * ********      ****    *
   Scer   TAACTAATACTTTCAACATTTTCAGT--TTGTATTACTT-CTTATTCAAAT----GTCATAAAAGTATCAACA-AAAAATTGTTAATATACCTCTATACT
   Spar   TAAATAC-ATTTGCTCCTCCAAGATT--TTTAATTTCGT-TTTGTTTTATT----GTCATGGAAATATTAACA-ACAAGTAGTTAATATACATCTATACT
   Smik   TCATTCC-ATTCGAACCTTTGAGACTAATTATATTTAGTACTAGTTTTCTTTGGAGTTATAGAAATACCAAAA-AAAAATAGTCAGTATCTATACATACA
   Sbay   TAGTTTTTCTTTATTCCGTTTGTACTTCTTAGATTTGTTATTTCCGGTTTTACTTTGTCTCCAATTATCAAAACATCAATAACAAGTATTCAACATTTGT
          *   *     *     *      * * ** ***     * *         *        * ** ** ** * * * *       * ***       *
   Scer   TTAA-CGTCAAGGA---GAAAAAACTATA
   Spar   TTAT-CGTCAAGGAAA-GAACAAACTATA
   Smik   TCGTTCATCAAGAA----AAAAAACTA..
   Sbay   TTATCCCAAAAAAACAACAACAACATATA
          *    *   ** *     ** ** **
                     GAL1

(Figure labels: Factor footprint, Conservation island)

  Increase power by testing conservation in many regions
             Genome-wide conservation


(Figure rows: Scer, Spar, Smik, Sbay)

Evaluate conservation within:          Gal4    Controls
  (1) All intergenic regions            13%       2%
  (2) Intergenic : coding               4:1       1:3
  (3) Upstream : downstream            12:0       1:1

               A signature for regulatory motifs
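
One simple way to turn "conservation" into a number is sketched below. It assumes a motif instance counts as conserved when the same pattern matches at the same alignment columns in every species, with no gaps inside the instance; the slides do not spell out the exact rule, so this definition and the function name `conserved_instances` are assumptions. The four sequences are the Gal4 site region copied from the GAL1-GAL10 alignment shown earlier.

```python
import re


def conserved_instances(aligned, pattern):
    """Return (conserved, total) counts of pattern matches in the first sequence.

    aligned: equal-length aligned sequences, reference species first.
    pattern: regular expression for the motif, e.g. 'CGG[ACGT]{11}CCG'.
    """
    reference, others = aligned[0], aligned[1:]
    regex = re.compile("(?=(%s))" % pattern)  # lookahead allows overlapping hits
    total = conserved = 0
    for match in regex.finditer(reference):
        start, end = match.start(1), match.end(1)
        total += 1
        # Conserved only if every other species matches at the same columns.
        if all(re.fullmatch(pattern, seq[start:end]) for seq in others):
            conserved += 1
    return conserved, total


gal4_region = [
    "TACGGATTAGAAGCCGCCGAG",  # Scer
    "TACGGATCAGAAGCCGCCGAG",  # Spar
    "TACGGATGAGAAGCCGCCGAA",  # Smik
    "TACGGATCAGAAGCTGCCGAC",  # Sbay
]
print(conserved_instances(gal4_region, "CGG[ACGT]{11}CCG"))
```

Summing the two counts over all intergenic regions, and separately over coding, upstream, and downstream regions, would give the kind of rates and ratios tabulated for Gal4 and the controls.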
         Hill-climbing in sequence space




• Seed selection
   – Three mini-motif conservation criteria (CC1, CC2, CC3)
• Motif extension
   – Non-random conservation of neighbors
• Motif collapsing
   – Merge neighbors using hierarchical clustering, avg-max-linkage
• Re-scoring complex motifs
   – Motif conservation score for full motifs (MCS)
                     Test 1: Selecting mini-motifs
• Estimate basal rate of conservation
   – Expected conservation rate at the evolutionary distances observed
   – Average conservation rate of non-outlier mini-motifs

• Score conservation of mini-motif
   – k: conserved motif occurrences
   – n: total motif occurrences
   – r: basal conservation rate
   – Evaluate binomial probability of observing k successes out of n trials
   – Binomial score: $p(k) = \binom{n}{k} p^k (1-p)^{n-k}$, where the success probability p is the basal rate r

• Assign z-score to each mini-motif
   – Bulk of distribution is symmetric
   – Estimate specificity as (R-L)/R
   – Select cutoff: 5.0 sigma
   – 1190 mini-motifs, 97.5% non-random

(Figures: histogram of N mini-motifs by conservation rate, with basal rate r marked; z-score distribution with cutoff, left tail L, right tail R, and specificity)
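
The scoring step can be sketched as follows. The binomial probability is the formula from the slide; the z-score uses the usual normal approximation to the binomial, and the specificity estimate reads R and L as the numbers of mini-motifs beyond the cutoff in the right and left tails. Both of these readings are assumptions, since the slide does not give the exact definitions, and the example counts are hypothetical.

```python
from math import comb, sqrt


def binomial_pmf(k, n, r):
    """Probability of exactly k conserved occurrences out of n at basal rate r."""
    return comb(n, k) * r**k * (1 - r) ** (n - k)


def binomial_tail(k, n, r):
    """Probability of observing k or more conserved occurrences out of n."""
    return sum(binomial_pmf(i, n, r) for i in range(k, n + 1))


def z_score(k, n, r):
    """Standard deviations by which k exceeds its expectation n*r (normal approximation)."""
    return (k - n * r) / sqrt(n * r * (1 - r))


def specificity(z_scores, cutoff):
    """Estimate (R - L) / R, treating the left tail as the expected noise in the right tail."""
    right = sum(1 for z in z_scores if z >= cutoff)
    left = sum(1 for z in z_scores if z <= -cutoff)
    return (right - left) / right if right else 0.0


# Hypothetical mini-motif: 13 of 100 occurrences conserved, basal rate 2%.
print(round(z_score(13, 100, 0.02), 2))
print(binomial_tail(13, 100, 0.02))
```

Under this reading, a mini-motif is kept when its z-score exceeds the 5.0 sigma cutoff, and the symmetry of the bulk distribution is what justifies using the left tail as a noise estimate.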
                  Test 1: Intergenic conservation




(Scatter plot: conserved count vs. total count for each mini-motif; CGG-11-CCG highlighted)
                          Test 2: Intergenic vs. Coding



(Scatter plot: intergenic conservation vs. coding conservation; CGG-11-CCG highlighted; region labeled "Higher Conservation in Genes")
                        Test 3: Upstream vs. Downstream

(Scatter plot: upstream conservation vs. downstream conservation; CGG-11-CCG highlighted; labels: "Most Patterns", "Downstream motifs?")
                  Extending mini-motifs

     • Separate conserved and non-conserved instances

     • Find maximally discriminating neighborhood

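(Figure: two sets of instances with their neighborhoods)
   Causal set (counts N1, M1):   T C A - 6 - A C G     neighborhood: R T C A G ... A C G W
   Random set (counts N2, M2):   T C x - 6 - A x G     neighborhood: Y T C x H ... A x G S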

     • Evaluate non-randomness of neighborhood
        – chi-square contingency test on [N1,M1], [N2,M2]
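
A short Python sketch of a chi-square contingency test on a 2x2 table, as one way to read the last bullet. The slide does not spell out what the four counts mean, so the sketch treats [N1, M1] and [N2, M2] as the two rows of a generic table; the function name is illustrative and all row and column sums are assumed to be nonzero.

```python
import math

def chi_square_2x2(n1, m1, n2, m2):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for the contingency table [[n1, m1], [n2, m2]]."""
    table = [[n1, m1], [n2, m2]]
    total = n1 + m1 + n2 + m2
    row_sums = [n1 + m1, n2 + m2]
    col_sums = [n1 + n2, m1 + m2]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if rows and columns were independent.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For one degree of freedom the survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A small p-value suggests that the neighboring bases are distributed differently in the two sets, which is what makes an extension "discriminating".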
                 Collapsing similar motifs

 • Motif similarity: sequence and genomic positions
    – Motifs share similar sequences, count bits in common
    – Motifs appear conserved in similar sets of regions



[Figure: Venn diagram of regions with motif 1, regions with motif 2, and regions containing both motifs]



• Collapsing: Hierarchical clustering
   – Sort the order of joins by decreasing similarity
   – Average max-linkage cluster similarity score
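
A rough Python sketch of collapsing by hierarchical clustering. It is a simplification under stated assumptions: similarity is the Jaccard overlap of the region sets only (the sequence-similarity term is left out), and clusters are compared by plain average linkage rather than the "average max-linkage" score named above. The motif names, regions, and threshold are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Overlap of two region sets: shared regions divided by all regions."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def average_linkage(c1, c2, regions):
    """Mean pairwise similarity between the motifs of two clusters."""
    scores = [jaccard(regions[x], regions[y]) for x in c1 for y in c2]
    return sum(scores) / len(scores)

def collapse_motifs(regions, threshold):
    """Join the most similar pair of clusters first; stop below the threshold."""
    clusters = [(name,) for name in sorted(regions)]
    while len(clusters) > 1:
        best = max(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], regions))
        if average_linkage(best[0], best[1], regions) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    return clusters

# Hypothetical motifs mapped to the regions where they appear conserved.
toy_regions = {"RTCAY": {1, 2, 3, 4}, "GTCAC": {1, 2, 3}, "ACGCGT": {7, 8, 9}}
toy_clusters = collapse_motifs(toy_regions, threshold=0.5)
```

Because joins happen in order of decreasing similarity, the two overlapping motifs merge first and the unrelated one stays in its own cluster.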
               Constructing full motifs

[Figure: pipeline from 2,000 mini-motifs through Test 1, Test 2, and Test 3; each branch goes through Extend and Collapse steps (example patterns: R T C A Y, A C G R, R T C G C, A C G A), and a final Merge step yields 72 full motifs. An unlabeled count of 65 also appears in the figure.]
             High sensitivity and specificity

Rank   Motif             Known
  1    RTCAY.....ACGR    Abf1
  2    RTTACCCGRM        Reb1
  3    gcGATGAGmtgaraw   Esr1
  4    TSGGCGGCTAWW      Ume6
  5    RTCACGTGV         Cbf1
  6    WTATWTACADG       New
  7    GRRAAAWTTTTCACT   Esr2
  8    TTCC.aAtt.GGAAA   Mcm1
  9    CGTTTCTTTTTCY     New
 10    TYYTCGAGA         Xbp1
 11    WTTTCGCGTT        Swi4
 12    TKACGCGTT         Mbp1
 13    STGCGG...ttTCT    New
 14    YCTATTGTT         New
 15    TTTTGCCACCG       Rpn4
 16    tTTGTTTAC.TTT     Fkh2
           (...)

• Most previously known motifs rediscovered
• Novel motifs discovered
              Assigning function to novel motifs

Rank   Whole Motif       5' vs. 3'   Annot   Cat
   1   CCGGGTAAC         Up          ChIP    REB1_YPD
   2   TCGTANNNARTGAT    Up          ChIP    ABF1_YPD
   3   TATAWAWA                              TATA box
   4   TGCGATGAGCT       Up          Expr    E_166:64.6.74
   5   TTATTTACA         Down        MIPS    Mitochondrion
   6   TCAYGTGGC         Up          ChIP    CBF1_YPD
   7   AAAAAARAA                             DNA bending
   8   TGAAAAWTTTT       Up          Expr    Protein Synthesis
   9   CGGCGGCTAW        Up          MIPS    Cell cycle
  10   AAGGGTAA          Up          Expr    Cell cycle
  11   TKWCGCGTT         Up          ChIP    MBP1_YPD
  12   TGAGTCAT          Up          MIPS    Amino Acid Metab
  13   CTTATCT           Up          Expr    Nitrogen & Sulfur
  14   AGATGAGATGA       Up          Expr    Protein Synthesis
  15   AATAAATAA         Down        MIPS    Localization
  16   RTAAACAA          Down        Expr    Mitochondrion
  17   MAGGGGH           Up          Expr    Energy metab
  18   AACAATAG          Up          MIPS    Lipid, fatty acid, isoprenoid metab
  19   CGT-6-CGA         Up          Prot    Membrane Biogenesis and traffic
  20   TTACNTAA          Down        MIPS    Localization
  21   ATATTC            Down        Expr    Respiration
  22   TCCCYTAW          Up          Expr    Energy
  23   TAAACGAG          Up          MIPS    Lipid, fatty acid, isoprenoid metab
  24   ATTACCC
  25   CGT-7-GAC         Up          ChIP    ABF1_YPD
  26   TGACACAA          Up          ChIP    DIG1_Alpha
  27   GGTGGCAAA         Up          MIPS    Cytoplasmic & Nuclear Degradation
  28   CTMTATA           Down        MIPS    Localization
  29   CCCACMCA          Up          ChIP    RAP1
  30   CCGATAA           Up          MIPS    Lipid, fatty acid, isoprenoid metab
  31   CACAAAA           Up          Expr    Cell cycle
  32   TCTCGAG           Up          Expr    MCM1_YPD
  33   AGGCAC            Up          Expr    Phosphate utilization
  34   CTTNTATA          Up          MIPS    Protein Fate
  35   TATGCAAA          Down        Prot    Protein synthesis and turnover
  36   CCCGGA            Up          MIPS    DNA recombination & repair
  37   AGGGAT            Up          MIPS    Respiratory chain
  38   CGGNNAAA          Up          ChIP    RGT1_Low
  39   CTGNAAA           Down        ChIP    Protein Synthesis & turnover
  40   ATANTTA           Down        MIPS    Localization
  41   ACGNGTA
  42   AAGGGTC
  43   GGGGTA            Up          Expr    Respiration
  44   CTANAAA           Up          ChIP    RLM1_14hr
  45   GTCACA            Up          ChIP    SUM1_YPD

Annotation sources:
• Functional Classes (MIPS)
• Chromatin IP (ChIP)
• Protein Complexes (Prot)
• Expression Clusters (Expr)
 New motifs show new functions
                        • 12 enriched in specific factors
                           – Glucose transport
                           – Chromatin silencing
                           – Stress response

                        • 8 enriched in expression clusters
                           –   Major facilitator genes
                           –   Lipid metabolism
                           –   Nitrogen synthesis
                           –   Vesicular trafficking and secretion

                        • 6 downstream motifs
                           – Mitochondrial proteins
                           – Stress response

                        • 2 variable gap motifs
                           – Swi4 and Ash1 show variable gap



  Most motifs show functional enrichment
           Application to human genome

[Figure: phylogenetic tree of the four yeast species. Branch lengths shown: 0.13 (S.cerevisiae), 0.10 (S.paradoxus), 0.19 and 0.27 (near S.mikatae and S.bayanus), 0.07 and 0.08 (internal branches)]

 Total branch length:
  0.83 substitutions per site

                Human             Yeast
Coding           1.5%             75%
Regulatory       3.5%              4%

• Signal: Lower by ~20-fold
• Branch length: Increase by ~3.0
• Needed branch length ~4
Excess conservation in specific regions
         173 promoter motifs discovered
Index   Discovered motif            MCS         Pc in promoters   Pc in introns   Enrichment Z-score
1       RCGCAnGCGY                  107.8       0.49       0.09        15.0
2       CACGTG                      85.3        0.47       0.01         8.8
3       SCGGAAGY                    80.4        0.44       0.02        22.4
4       ACTAYRnnnCCCR               69.5        0.61       0.06         8.1
5       GATTGGY                     64.6        0.51       0.04         9.8
6       GGGCGGR                     63.9        0.21       0.02        11.4
7       TGAnTCA                     62.8        0.38       0.08         6.5
8       TMTCGCGAnR                  55.7        0.64       0.08         9.4
9       TGAYRTCA                    55.7        0.50       0.07         6.1
10      GCCATnTTG                   54.7        0.72       0.03        12.2
11      MGGAAGTG                    51.6        0.43       0.02        13.9
12      CAGGTG                      47.6        0.26       0.06         9.9
13      CTTTGT                      46.0        0.42       0.05        13.6
14      TGACGTCA                    44.8        0.44       0.07         4.2
15      CAGCTG                      43.9        0.27       0.08         8.9
16      RYTTCCTG                    43.0        0.32       0.06         7.4
17      AACTTT                      42.1        0.43       0.04        11.1
18      TCAnnTGAY                   40.4        0.47       0.04         4.9
19      GKCGCnnnnnnnTGAYG           40.1        0.35       0.00         5.6
20      GTGACGY                     38.4        0.34       0.02         6.6
21      GGAAnCGGAAnY                37.7        0.68       0.00         7.0
22      TGCGCAnK                    37.4        0.24       0.02         8.2
23      TAATTA                      37.3        0.29       0.13         7.1
(…)     (…)                   (…)           (…)        (…)         (…)



                             Are they real?
  (1) Discovered motifs match TRANSFAC database

Factor    Known motif    MCS    Discovered motif   Factor           Known motif         MCS
SP-1      GGGGCGGGGC     46.8   GGGCGGR            IRF              bnCRSTTTCAnTTYY     4.6
YY1       GCCATnTT       34.7   GCCATnTTG          GATA             WGATAR              4.6
MYC       SCACGTG        32.7   CACGTG             MYB              GnCnGTT             4.4
NF-Y      YSATTGGYY      31.2   GATTGGY            MIF-1            GTTGCWWGGYAACnGS    4.3
AP-1      CTGASTCA       30.8   TGAnTCA            HSF2             GAAnnWTCK           4.0
MAZ       GGGGAGGG       29.7   GGGAGGRR           HNF-1            GGTTAATnWTTAMC      4.0
CREB      TGACGTMA       29.5   TGACGTMR           AREB6            WCAGGTGWnW          3.8
NF-MUE1   CGGCCATCT      26.0   CGGCCATYK          C-REL            SGGRnTTTCC          3.6
MYOD      RnCAGGTG       24.7   CAGGTG             TAL-1ALPHA/E47   AACAGATGKT          3.4
ELK-1     CCGGAART       22.6   CCGGAARY           POU6F1           GCATAAWTTAT         3.4
NRF-1     YGCGCATGCG     20.9   RCGCAnGCGY         FREAC-4          CTWAWGTAAACAnWG     3.4
TEL-2     CAGGAAGTAR     20.8   SMGGAAGT           BRN-2            YKnATTWYSnATG       3.4
GABP      vCCGGAAGnGCR   19.8   SCGGAAGY           AFP1             GTGYARTTAAT         3.3
STAT1     CAnTTCCS       17.9   CATTTCCK           TCF-1(P)         GKCRGKTT            3.2
CAC-BP    GRGGSTGGG      15.0   GGGTGG             HNF-4            TGAMCTTTGMMCYT      3.1
AP-4      GCAGCTGnY      14.9   CAGCTG             STAT             TCCMAGAA            3.0
SRY       KTWGTTT        14.6   TTGTTT             IRF1             AAGTGAA             2.8
TBP       TATAAATW       14.2   TATAAA             E4F1             GTGACGTARS          2.7
FOXO1     RWAAACAA       14.1   RTAAACA            NF-AT            WGGAAAnW            2.6
TFII-I    RGAGGKAGG      13.9   GnGGGAGG           CDC5             GATTTAACATAA        2.6
PEA3      MGGAWGT        13.6   SMGGAAGT           AML1             ACCACA              2.6
SF-1      TGRCCTTG       12.6   TGACCTTG           IPF1             KGTCATTAnndC        2.5
SOX-5     ATTGTT         12.5   YYATTGTT           FAC1             TnYGTGTTKTG         2.5




           Rediscovered most previously known motifs
     (2) Positional bias of discovered motifs




Fig. 4
(3) Tissue specificity of discovered motifs
     Comparative genomics reveals functional elements




TTACGGTACCGCTATACCCGAACGTCTAATAGAAAAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TgGCGGTACgGCTtTACCCGAtCGTCTAATAGcAAAtACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TTACGGTACCG-TATACCCGAAtaTCTAATAGAAAAAAtTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
cTACGGTACCGCcATACCCGAACGgCTAATAGAAAAgACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TTACGGTgCCGCTATACCCGAACGTCTAATAGAAcAAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TTACGGTACCaCTATACCCGAgCGTCTAATAGAgAAAgCTtTAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA
TaACGGTACCGtTATACCCGAACGTCTAATAGAAAgAACTATAATGACTAAATCTCATTCAGAAGAAGTGATTGTACCTGAGTTCAA


                    Functional bases            Conserved bases

 • Genomes change over time
 • Important things are conserved
      Comparing genomes reveals functional regions

 • Noise changed independently
 • Signal conserved in all
      Multiple genomes give more powerful comparison
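
The idea that signal is "conserved in all" genomes can be sketched in a few lines of Python: mark the alignment columns where every sequence carries the same base. This is only an illustration with a hypothetical three-sequence alignment; the function names are not from the lecture, and real conservation scoring would weigh the phylogeny rather than demand perfect identity.

```python
def conserved_columns(alignment):
    """Return one flag per column: True when every sequence has the same base."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    flags = []
    for column in zip(*alignment):
        bases = {base.upper() for base in column}
        # A gap character never counts as a conserved base.
        flags.append(len(bases) == 1 and "-" not in bases)
    return flags

def conservation_track(alignment):
    """Render the flags as a '*' track, like the line under an alignment."""
    return "".join("*" if flag else " " for flag in conserved_columns(alignment))

# Hypothetical toy alignment (not taken from the slide).
toy_alignment = ["ACGTTA", "ACGATA", "A-GTTA"]
toy_track = conservation_track(toy_alignment)
```

Adding more genomes can only remove stars from the track, which is why the columns that survive a multi-genome comparison are stronger candidates for functional bases.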
Regulatory motif discovery: Contributions

• Genome-wide conservation criteria
  – No prior knowledge necessary / no experimentation
  – Unbiased, systematic, exhaustive search
• Performance
  – High sensitivity and specificity
  – Nearly all previously known motifs re-discovered
  – Novel motifs discovered: Are they real?
               Inferring Motif Function



Part I                                        Part II

1. Yeast                           1. Genes
2. Alignment                       2. Regulation
3. Evolution                       3. Grammar
4. Duplication                     4. Human
                      Motif Function

 Gal4   Gal4   Gal4            Mig1   Mig1          GAL1

        Mig1   Gal4            Mig1                 GAL3

        Gal4   Mig1     Gal4          Gal4          GAL7


• Intuition
   – Genes of related function are frequently co-regulated
   – Regulatory motifs are enriched in functional categories
• Approach
   – Use biological knowledge to assign function to motifs
   – Mine public datasets for enrichment in the discovered motifs
   – Use functional categories to discover additional motifs
       Intersecting with Functional Categories


[Figure: CGG-11-CCG occurrences intersected with GO / MIPS functional categories (Transcription, Nucleus, Energy, Carbohydrate metabolism, Cell Cycle, Transport, Cell fate); occurrence sets labeled "all four" and "S.cerevisiae"]

  Gal genes
Gal1, Gal2, Gal3, Gal7, Gal10, Hxt3, Mth1, Pcl10, Gcy1

Significance of overlap: 10^-28
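
One common way to attach a significance to such an overlap is the hypergeometric tail probability. The slide does not say which test produced its value, so the Python sketch below is an assumption rather than the method used; the function and parameter names are illustrative.

```python
import math

def hypergeom_tail(overlap, motif_genes, category_genes, total_genes):
    """P(X >= overlap) when motif_genes genes are drawn at random from
    total_genes, of which category_genes belong to the functional category."""
    upper = min(motif_genes, category_genes)
    denom = math.comb(total_genes, motif_genes)
    numer = sum(
        math.comb(category_genes, x)
        * math.comb(total_genes - category_genes, motif_genes - x)
        for x in range(overlap, upper + 1)
    )
    return numer / denom
```

The inputs are the number of genes carrying the conserved motif, the size of the category, their overlap, and the genome-wide gene count; a tiny result means the overlap is unlikely to arise by chance.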
Intersecting with...

Each panel intersects the conserved motifs with one dataset and reports a specificity P-value.

• Functional Classification (GO, MIPS)
   – Categories: Transcription, Nucleus, Energy, C-metab, Cell Cycle, Transport, Cell fate
   – Example: CGG-11-CCG, P-value 10^-28
• Chromatin IP
   – Factors: GCN4, REB1, STE12, ABF1, FKH2, GAL4, SWI4
   – Example: TCA-6-ACG, P-value 10^-48
• Protein Complexes by MassSpec
   – Complexes: Complex68, Cell polarity complex, RNA metabolism complex,
     Membrane Biogenesis Complex, Complex180, Protein turnover complex
   – Example: GATGAG, P-value 10^-18
• Expression Clusters
   – Clusters: Cluster1, Diauxic Shift, Cluster3, Cell cycle, Heat Shock, Cluster6, Starvation
   – Example: ACGCGT, P-value 10^-29
 Regulatory motif function: Contributions


• Identification of candidate motif functions
   – Data mining of existing biological datasets
   – No new experiments were necessary
• Results
   – Majority of discovered motifs show enrichment
   – New biological knowledge gained
                 Combinatorial Control



         Explaining fine-grain regulation




Small number of
regulatory motifs

                                      Large number of
                                    regulated processes


        Versatility comes from motif combinations
                             Combinatorial regulation

          +              +                       -
   Gal4       Gal4       Gal4             Mig1   Mig1   GAL1
                     x
              Mig1       Gal4             Mig1          GAL3
                     x                     +
              Gal4       Mig1      Gal4          Gal4   GAL7


• Intuition
   – Protein-protein interactions may induce or repress binding
   – Transcription factors can bind cooperatively
   – Their regulatory motifs should co-occur
• Method
   – Discover meaningful motif combinations in a genome-wide fashion
   – Discover functional implications of combinatorial control
Genome-wide co-occurrence map

[Figure: network of co-occurring motifs; nodes shown include CBF1, Ste12, Tec1, Met31, rESR1, Abf1, Gcn4, Leu3, rESR2, Gcr1, Msn2]
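
A small Python sketch of how a co-occurrence map could be built from conserved motif occurrences. The slides do not state the statistic used, so the observed/expected ratio under independence, the `min_ratio` threshold, and the toy region sets are all assumptions made for illustration.

```python
from itertools import combinations

def cooccurrence_map(regions_by_motif, total_regions, min_ratio=2.0):
    """Return {(motif_a, motif_b): observed / expected} for enriched pairs."""
    edges = {}
    for a, b in combinations(sorted(regions_by_motif), 2):
        set_a, set_b = regions_by_motif[a], regions_by_motif[b]
        observed = len(set_a & set_b)
        # Expected number of shared regions if the two motifs were independent.
        expected = len(set_a) * len(set_b) / total_regions
        if expected > 0 and observed / expected >= min_ratio:
            edges[(a, b)] = observed / expected
    return edges

# Hypothetical region sets, not data from the lecture.
toy_regions = {
    "Ste12": {1, 2, 3, 4},
    "Tec1": {2, 3, 4, 5},
    "Abf1": {10, 11, 12, 13},
}
toy_edges = cooccurrence_map(toy_regions, total_regions=100)
```

Each surviving pair becomes an edge in the graph; in the toy data only the Ste12 and Tec1 pair is kept.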
    Motif combinations change specificity

[Figure: conserved occurrences of Ste12 and Tec1, shown as two overlapping sets; the Ste12 side is labeled Mating, the Tec1 side Budding, and the overlap Filamentation]
     Combinatorial control: Contributions


• Identification of significant motif combinations
   – Pairs identified solely based on motif conservation
   – Functional implications identified using public datasets
• Results
   – Genome-wide graph of motif interactions
   – Changing specificities of regulatory motifs
                 Human Genome



         Systematic gene identification in the human


Human
Dog
Mouse
Rat

  • Increased challenge
        – Exons can be much shorter
        – Non-coding regions are much larger
  • Methods
        – Combine RFC test with additional information
        – Incorporate knowledge of genetic code redundancy
        – Frame Dependent Substitutions (FDS) test
 Frame Dependent Substitution (FDS) Test




                                     Genes Intergenic       Separation
1st or 2nd codon positions changed     4%     58%
3rd codon position changed            60%     58%           13-fold




  CFTR region: power to discover all annotated exons
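
A simplified Python sketch of the idea behind the table: in coding sequence, substitutions concentrate at the third codon position, while intergenic sequence changes at a similar rate in every position. The slides do not give the details of the FDS test, so this only tallies mismatches by codon position for one aligned pair in a given frame; the names and toy sequences are hypothetical.

```python
def frame_dependent_substitutions(seq_a, seq_b, frame=0):
    """Return (rate at codon positions 1-2, rate at codon position 3).

    `frame` is the offset of the first complete codon; gap columns are skipped.
    """
    changed = [0, 0, 0]
    compared = [0, 0, 0]
    for i in range(frame, min(len(seq_a), len(seq_b))):
        a, b = seq_a[i].upper(), seq_b[i].upper()
        if a == "-" or b == "-":
            continue
        pos = (i - frame) % 3
        compared[pos] += 1
        if a != b:
            changed[pos] += 1
    first_second = (changed[0] + changed[1]) / max(compared[0] + compared[1], 1)
    third = changed[2] / max(compared[2], 1)
    return first_second, third

# Hypothetical pair: both differences fall on third codon positions.
toy_rates = frame_dependent_substitutions("ATGGCCAAA", "ATGGCTAAG")
```

A large gap between the two rates in one frame is the gene-like signature; similar rates point to non-coding sequence.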
             Process-specific regulatory motif discovery
                                                               Vamsi Mootha
    Oxidative Phosphorylation genes (OXPHOS)
       500+ genes
        - Coordinately regulated
        - Repressed in diabetes
        - Human muscle cells
    Patti et al 2003; Mootha et al 2003

[Figure: Exercise, Diabetes, and Energy requirements linked to PGC-1a and to OXPHOS (ADP, ATP), with unknown intermediates marked "?". Lin et al. (2002) Nature; Wu et al. (1999) Cell]
  Tissue-specific regulatory sub-networks
                                                      Vamsi Mootha

[Figure: network linking stimuli, Ppargc1 (labeled "Tissue-specific drug target"), Erra, Gabpa, and Targets, with a double positive feedback loop]

                Increased system stability / robustness
            Genome-wide motif discovery
                                                    Xiaohui Xie
        Transcription Start




          -1 kb   +1 kb

• Increased challenges
   – Intergenic regions are much longer
   – Motifs can appear at very large distances
• Methods
   – Focus on promoter-proximal regions
   – Multiple alignment of human, dog, mouse, rat


        Hundreds of significantly conserved patterns
        Genome-wide motif discovery
human   CTCTTAATGGTACACGTTCTGCCT----AAGTAGCCTAGACGCTCCCGTGCGCCC-GGGG
dog     CTCTTA-CGGGGCACATTCTGCTTTCAACAGTGGGGCAGACGGTCCCGCGCGCCCCAAGG
mouse   GTCTTAGGAGGCT-CGATCGCC---------------------GCCTGCATTATT-----
rn      GTCTTAGTTGGCCACGACCTGC---------------------TCATGCATAATT-----
         *****   *    *   * *                       * *
                                            Erra
human   CGGGTAGGCCTGGCCGAAAATCTCTCCCGCGCGCCTGACCTTGGGTTGCCCCAGCCAGGC
dog     CAGGC---CCGGGCTGCAGACCTGCCCTGAGGGAATGACCTTGGGCGGCCGCAGCGGGGC
mouse   --------------CACAAGCCTGTGGCGCGC-CGTGACCTTGGGCTGCCCCAGGCGGGC
rn      --------------CACAAGTTTCTC---TGC-CCTGACCTTGGGTTGCCCCAGGCGAG-
                         *    *       *    ********** *** ***     *

human   TGCGGGCCCGAGACCCCCG-------------------GGCCTCCCTGCCCCCCGCGCCG
dog     CGCGGGCCCAGGCCCCCCTCCCTCCCTCCCTCCCTCCCTCCCTCCCTGCCCCCCGGACCG
mouse   TGCAGGCTCACCACCCCGTCTTTTCT---------------------GCTTTTCGAGTCG
rn      -GCATACACCCCGCCTTTTTTTTTTTTTT---------TTTTTTTTTGCCGTTCAAG-AG
         **   * *    **                                **    *     *

human   CCCCGATTTGCCCTCAGAGAGGGTATC----GATCTTATTTCTGGGTCTACGGCAAACTC
dog     CCCCGCTTCACCCTCCCAGCTGGGAACCCCGGGCCTGAATACGGAGTCAGCCGCACACTT
mouse   GCCCGCTCTGCTCCCA-GGAGAGCATTCACGGTCTTATTTAGTGAGCGTAAGGCAAATCT
rn      CCCTGTTCTGCTCTCA-AAAGGGTATTAACGGTCTTATTTATTGGGCGCAAAGCAAACTT
         ** * *   * * *       * *      *   *   *   * *      *** *
                         New?        New?
human   CAAGGTCTACAAACGTAGAGGTCAGCTGTGACCCCGGGCCAGGCCGTGAAGGTCCCCAGG
dog     CACGGCCCAAACGCGGCGAGGTCAACAGCGACCCCGGGCCGGGCGGTGAAGGTGCCCGGG
mouse   GAATACCCAGCAGGGCCGAGGTCACCTGTGACCCCAGGCCAGGCCAGGACGGTGCCAAGG
rn      TAATACCCAGCAGGGCGGAGGTCACCTGTGACCCCAGGCCAGGCCATAAAGGTGCCAAGG
         *    * *     * ******* * * ****** **** ***     * *** ** **


       60% of previously known motifs rediscovered
  (remaining show poor conservation: diverged / incorrect)
       Seed + extension motif discovery




• Seed selection
   – N-gap-M motifs. Suffix-tree like search
• Motif extension
   – Search motif instances in the genome, build consensus
• Motif collapsing
   – Two levels of clustering. 173 clusters. 750 motifs.
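
Seed selection can be illustrated with a short Python sketch that enumerates every N-gap-M seed in a set of sequences and counts it. The slide mentions a suffix-tree like search; the sketch swaps in a plain hash-table count instead, and the default values for `n`, `m`, and `max_gap` are arbitrary choices for the example.

```python
from collections import Counter

def count_gapped_seeds(sequences, n=3, m=3, max_gap=2):
    """Count every N-gap-M seed, written like 'CGG-11-CCG', across sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for gap in range(max_gap + 1):
            span = n + gap + m
            for start in range(len(seq) - span + 1):
                left = seq[start:start + n]
                right = seq[start + n + gap:start + span]
                counts[f"{left}-{gap}-{right}"] += 1
    return counts

# Hypothetical input sequence.
toy_counts = count_gapped_seeds(["CGGAACCG"])
```

The same counting run on conserved and on all occurrences would supply the conserved and total counts that the conservation tests compare.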
(3) Tissue-specific expression
What about 3’ UTR (downstream) motifs?
3’ UTR motifs show directional preference
3’ UTR motifs show distinguishing features
              microRNA genes




• Specifically repress target genes
• Act via double-stranded RNA duplex
• Newly discovered in worm, plants, human, etc
                   3’ motifs hit known microRNA genes


miRNA            mature miRNA sequence     Motif        Pc    miRNA            mature miRNA sequence      Motif
hsa-miR-98       UGAGGUAGuaaguuguauuguu    CTACCTCA    0.57   hsa-miR-1        UGGAAUGUaaagaaguaugua      ACATTCCA
hsa-let-7i       UGAGGUAGuaguuugugcu       CTACCTCA    0.57   hsa-miR-135b     UAUGGCUUuucauuccuaugug     AAGCCATA
hsa-let-7g       UGAGGUAGuaguuuguacagu     CTACCTCA    0.57   hsa-miR-135a     UAUGGCUUuuuauuccuauguga    AAGCCATA
hsa-let-7f       UGAGGUAGuagauuguauaguu    CTACCTCA    0.57   hsa-miR-17-5p    CAAAGUGCuuacagugcagguagu   GCACTTTG
hsa-let-7e       UGAGGUAGgagguuguauagu     CTACCTCA    0.57   hsa-miR-93       AAAGUGCUguucgugcagguag     AGCACTTT
hsa-let-7c       UGAGGUAGuagguuguaugguu    CTACCTCA    0.57   hsa-miR-372      AAAGUGCUgcgacauuugagcgu    AGCACTTT
hsa-let-7b       UGAGGUAGuagguugugugguu    CTACCTCA    0.57   hsa-miR-106a     aAAAGUGCUuacagugcagguagc   AGCACTTT
hsa-let-7a       UGAGGUAGuagguuguauaguu    CTACCTCA    0.57   hsa-miR-367      aAUUGCACUuuagcaaugguga     AGTGCAAT
hsa-miR-124a     uUAAGGCACgcggugaaugcca    GTGCCTTA    0.54   hsa-miR-25       cAUUGCACUugucucggucuga     AGTGCAAT
hsa-miR-92       UAUUGCACuugucccggccugu    GTGCAATA    0.52   hsa-miR-219      UGAUUGUCcaaacgcaauucu      GACAATCA
hsa-miR-32       UAUUGCACauuacuaaguugc     GTGCAATA    0.52   hsa-miR-182      UUUGGCAAugguagaacucaca     TTGCCAAA
hsa-miR-30e      UGUAAACAuccuugacugga      TGTTTACA    0.47   hsa-miR-125b     UCCCUGAGacccuaacuuguga     CTCAGGGA
hsa-miR-30d      UGUAAACAuccccgacuggaag    TGTTTACA    0.47   hsa-miR-125a     UCCCUGAGacccuuuaaccugug    CTCAGGGA
hsa-miR-30c      UGUAAACAuccuacacucucagc   TGTTTACA    0.47   hsa-miR-301      CAGUGCAAuaguauugucaaagc    TTGCACTG
hsa-miR-30b      UGUAAACAuccuacacucagc     TGTTTACA    0.47   hsa-miR-130b     CAGUGCAAugaugaaagggcau     TTGCACTG
hsa-miR-30a-5p   UGUAAACAuccucgacuggaagc   TGTTTACA    0.47   hsa-miR-130a     CAGUGCAAuguuaaaagggc       TTGCACTG
hsa-miR-20       UAAAGUGCuuauagugcaggua    GCACTTTA    0.46   hsa-miR-142-3p   UGUAGUGUuuccuacuuuaugga    ACACTACA
hsa-miR-106b     UAAAGUGCugacagugcagau     GCACTTTA    0.46   hsa-miR-373      gAAGUGCUUcgauuuuggggugu    AAGCACTT
hsa-miR-9        UCUUUGGUuaucuagcuguauga   ACCAAAGA    0.46   hsa-miR-302d     uAAGUGCUUccauguuugagugu    AAGCACTT
hsa-miR-29c      UAGCACCAuuugaaaucgguua    TGGTGCTA    0.45   hsa-miR-302c     uAAGUGCUUccauguuucagugg    AAGCACTT
hsa-miR-29b      UAGCACCAuuugaaaucagu      TGGTGCTA    0.45   hsa-miR-302b     uAAGUGCUUccauguuuuaguag    AAGCACTT
hsa-miR-29a      cUAGCACCAucugaaaucgguu    TGGTGCTA    0.45   hsa-miR-302a     uAAGUGCUUccauguuuugguga    AAGCACTT
hsa-let-7d       aGAGGUAGUagguugcauagu     ACTACCTC    0.44   hsa-miR-152      UCAGUGCAugacagaacuugg      TGCACTGA
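
The pattern in the table can be checked with a few lines of Python: each 8-mer motif is the reverse complement of the capitalized 5' end of the mature miRNA, starting at the first or second base. The sketch below encodes that reading of the table; the function names are illustrative, and the two start offsets are inferred from the capitalization rather than stated on the slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def seed_target_motifs(mature_mirna, length=8):
    """DNA motifs complementary to the windows starting at miRNA bases 1 and 2."""
    dna = mature_mirna.upper().replace("U", "T")
    return [reverse_complement(dna[start:start + length]) for start in (0, 1)]

def matches_mirna(motif, mature_mirna):
    """True when the motif is complementary to the 5' end of the miRNA."""
    return motif in seed_target_motifs(mature_mirna, len(motif))

# Two rows taken from the table: hsa-let-7i and hsa-miR-1.
let7_hit = matches_mirna("CTACCTCA", "UGAGGUAGuaguuugugcu")
mir1_hit = matches_mirna("ACATTCCA", "UGGAAUGUaaagaaguaugua")
```

Scanning the discovered 3' UTR motifs against known mature sequences in this way is one plausible route to the "hits" listed in the table.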
New micro-RNA genes discovered
Genome duplication in a vertebrate
         So much more to be done!!



       Open Questions / Final Projects

• What does it all do?
   – Ultra-conserved elements in the human genome
   – New types of transcripts, non-coding genes, miRNAs
• How is it all controlled?
   – Combinatorial relationships, motif grammars
   – Multi-cellular coordination
• How does it all evolve?
   – Genes, protein domains, messenger RNA
   – Regulatory motifs, networks, circuits
                 Summary



    Summary of contributions

Alignment      • Genome alignment
                 – Graph-theoretic framework
                 – Aligned complete genomes
                 – Discovered evolutionary changes
Genes          • Gene identification
                 – Systematic classification approach
                 – High sensitivity and specificity (>99%)
                 – Changes affect 15% of all genes
Regulation     • Regulatory motif discovery
                 – Sequence pattern search and refinement
                 – Candidate functions for novel motifs
                 – Combinatorial interactions
Evolution      • Evolutionary innovation
                 – Understanding of genome ancestry
                 – Mechanisms and regions of change
                 – Emergence of new functions
                    Acknowledgements

Broad Institute of MIT and Harvard   Whitehead Institute
   Eric Lander                          Gerry Fink
   Bruce Birren                         Rick Young
   Nick Patterson                       Julia Zeitlinger
   Vamsi Mootha                         Trey Ideker
   Xiaohui Xie                          Susan Lindquist
                                     SGD / Stanford
                                        David Botstein
                                        Mike Cherry
                                        Kara Dolinski
                                        Dianna Fisk
                                        SGD curators

								