Questions bioinformatics III, day 5



  Question                                         Hints

1 The SARS virus, which belongs to the
  group of the Nidovirales, has been predicted
  to encode 11 proteins, some of which are
  actually split into separate peptides. The set
  of 11 can be found here.


         11 proteins
     identified in SARS
  >SARS_1 * putative orf1ab
  polyprotein *
  join(250..13383,13383..21470) *
  29836505
  MESLVLGVNE KTHVQLSLPV LQVRDVLVRG
  FGDSVEEALS EAREHLKNGT CGLVELEKGV
  LPQLEQPYVF IKRSDALSTN HGHKVVELVA
  EMDGIQYGRS GITLGVLVPH VGETPIAYRN
  VLLRKNGNKG AGGHSYGIDL KSYDLGDELG
  TDPIEDYEQN WNTKHGSGAL RELTRELNGG
  AVTRYVDNNF CGPDGYPLDC IKDFLARAGK
  SMCTLSEQLD YIESKRGVYC CRDHEHEIAW
  FTERSDKSYE HQTPFEIKSA KKFDTFKGEC
  PKFVFPLNSK VKVIQPRVEK KKTEGFMGRI
  RSVYPVASPQ ECNNMHLSTL MKCNHCDEVS
  WQTCDFLKAT CEHCGTENLV IEGPTTCGYL
  PTNAVVKMPC PACQDPEIGP EHSVADYHNH
  SNIETRLRKG GRTRCFGGCV FAYVGCYNKR
  AYWVPRASAD IGSGHTGITG DNVETLNEDL
  LEILSRERVN INIVGDFHLN EEVAIILASF
  SASTSAFIDT IKSLDYKSFK TIVESCGNYK
  VTKGKPVKGA WNIGQQRSVL TPLCGFPSQA
  AGVIRSIFAR TLDAANHSIP DLQRAAVTIL
  DGISEQSLRL VDAMVYTSDL LTNSVIIMAY
  VTGGLVQQTS QWLSNLLGTT VEKLRPIFEW
  IEAKLSAGVE FLKDAWEILK FLITGVFDIV
  KGQIQVASDN IKDCVKCFID VVNKALEMCI
  DQVTIAGAKL RSLNLGEVFI AQSKGLYRQC
  IRGKEQLQLL MPLKAPKEVT FLEGDSHDTV
  LTSEEVVLKN GELEALETPV DSFTNGAIVG
  TPVCVNGLML LEIKDKEQYC ALSPGLLATN
  NVFRLKGGAP IKGVTFGEDT VWEVQGYKNV
  RITFELDERV DKVLNEKCSV YTVESGTEVT
  EFACVVAEAV VKTLQPVSDL LTNMGIDLDE
WSVATFYLFD   DAGEENFSSR   MYCSFYPPDE
EEEDDAECEE   EEIDETCEHE   YGTEDDYQGL
PLEFGASAET   VRVEEEEEED   WLDDTTEQSE
IEPEPEPTPE   EPVNQFTGYL   KLTDNVAIKC
VDIVKEAQSA   NPMVIVNAAN   IHLKHGGGVA
GALNKATNGA   MQKESDDYIK   LNGPLTVGGS
CLLSGHNLAK   KCLHVVGPNL   NAGEDIQLLK
AAYENFNSQD   ILLAPLLSAG   IFGAKPLQSL
QVCVQTVRTQ   VYIAVNDKAL   YEQVVMDYLD
NLKPRVEAPK   QEEPPNTEDS   KTEEKSVVQK
PVDVKPKIKA   CIDEVTTTLE   ETKFLTNKLL
LFADINGKLY   HDSQNMLRGE   DMSFLEKDAP
YMVGDVITSG   DITCVVIPSK   KAGGTTEMLS
RALKKVPVDE   YITTYPGQGC   AGYTLEEAKT
ALKKCKSAFY   VLPSEAPNAK   EEILGTVSWN
LREMLAHAEE   TRKLMPICMD   VRAIMATIQR
KYKGIKIQEG   IVDYGVRFFF   YTSKEPVASI
ITKLNSLNEP   LVTMPIGYVT   HGFNLEEAAR
CMRSLKAPAV   VSVSSPDAVT   TYNGYLTSSS
KTSEEHFVET   VSLAGSYRDW   SYSGQRTELG
VEFLKRGDKI   VYHTLESPVE   FHLDGEVLSL
DKLKSLLSLR   EVKTIKVFTT   VDNTNLHTQL
VDMSMTYGQQ   FGPTYLDGAD   VTKIKPHVNH
EGKTFFVLPS   DDTLRSEAFE   YYHTLDESFL
GRYMSALNHT   KKWKFPQVGG   LTSIKWADNN
CYLSSVLLAL   QQLEVKFNAP   ALQEAYYRAR
AGDAANFCAL   ILAYSNKTVG   ELGDVRETMT
HLLQHANLES   AKRVLNVVCK   HCGQKTTTLT
GVEAVMYMGT   LSYDNLKTGV   SIPCVCGRDA
TQYLVQQESS   FVMMSAPPAE   YKLQQGTFLC
ANEYTGNYQC   GHYTHITAKE   TLYRIDGAHL
TKMSEYKGPV   TDVFYKETSY   TTTIKPVSYK
LDGVTYTEIE   PKLDGYYKKD   NAYYTEQPID
LVPTQPLPNA   SFDNFKLTCS   NTKFADDLNQ
MTGFTKPASR   ELSVTFFPDL   NGDVVAIDYR
HYSASFKKGA   KLLHKPIVWH   INQATTKTTF
KPNTWCLRCL   WSTKPVDTSN   SFEVLAVEDT
QGMDNLACES   QQPTSEEVVE   NPTIQKEVIE
CDVKTTEVVG   NVILKPSDEG   VKVTQELGHE
DLMAAYVENT   SITIKKPNEL   SLALGLKTIA
THGIAAINSV   PWSKILAYVK   PFLGQAAITT
SNCAKRLAQR   VFNNYMPYVF   TLLFQLCTFT
KSTNSRIRAS   LPTTIAKNSV   KSVAKLCLDA
GINYVKSPKF   SKLFTIAMWL   LLLSICLGSL
ICVTAAFGVL   LSNFGAPSYC   NGVRELYLNS
SNVTTMDFCE   GSFPCSICLS   GLDSLDSYPA
LETIQVTISS   YKLDLTILGL   AAEWVLAYML
FTKFFYLLGL   SAIMQVFFGY   FASHFISNSW
LMWFIISIVQ   MAPVSAMVRM   YIFFASFYYI
WKSYVHIMDG   CTSSTCMMCY   KRNRATRVEC
TTIVNGMKRS   FYVYANGGRG   FCKTHNWNCL
NCDTFCTGST   FISDEVARDL   SLQFKRPINP
TDQSSYIVDS   VAVKNGALHL   YFDKAGQKTY
ERHPLSHFVN   LDNLRANNTK   GSLPINVIVF
DGKSKCDESA   SKSASVYYSQ   LMCQPILLLD
QALVSDVGDS   TEVSVKMFDA   YVDTFSATFS
VPMEKLKALV   ATAHSELAKG   VALDGVLSTF
VSAARQGVVD   TDVDTKDVIE   CLKLSHHSDL
EVTGDSCNNF   MLTYNKVENM   TPRDLGACID
CNARHINAQV   AKSHNVSLIW   NVKDYMSLSE
QLRKQIRSAA   KKNNIPFRLT   CATTRQVVNV
ITTKISLKGG   KIVSTCFKLM   LKATLLCVLA
ALVCYIVMPV   HTLSIHDGYT   NEIIGYKAIQ
DGVTRDIIST   DDCFANKHAG   FDAWFSQRGG
SYKNDKSCPV   VAAIITREIG   FIVPGLPGTV
LRAINGDFLH   FLPRVFSAVG   NICYTPSKLI
EYSDFATSAC   VLAAECTIFK   DAMGKPVPYC
YDTNLLEGSI   SYSELRPDTR   YVLMDGSIIQ
FPNTYLEGSV   RVVTTFDAEY   CRHGTCERSE
VGICLSTSGR   WVLNNEHYRA   LSGVFCGVDA
MNLIANIFTP   LVQPVGALDV   SASVVAGGII
AILVTCAAYY   FMKFRRVFGE   YNHVVAANAL
LFLMSFTILC   LVPAYSFLPG   VYSVFYLYLT
FYFTNDVSFL   AHLQWFAMFS   PIVPFWITAI
YVFCISLKHC   HWFFNNYLRK   RVMFNGVTFS
TFEEAALCTF   LLNKEMYLKL   RSETLLPLTQ
YNRYLALYNK   YKYFSGALDT   TSYREAACCH
LAKALNDFSN   SGADVLYQPP   QTSITSAVLQ
SGFRKMAFPS   GKVEGCMVQV   TCGTTTLNGL
WLDDTVYCPR   HVICTAEDML   NPNYEDLLIR
KSNHSFLVQA   GNVQLRVIGH   SMQNCLLRLK
VDTSNPKTPK   YKFVRIQPGQ   TFSVLACYNG
SPSGVYQCAM   RPNHTIKGSF   LNGSCGSVGF
NIDYDCVSFC   YMHHMELPTG   VHAGTDLEGK
FYGPFVDRQT   AQAAGTDTTI   TLNVLAWLYA
AVINGDRWFL   NRFTTTLNDF   NLVAMKYNYE
PLTQDHVDIL   GPLSAQTGIA   VLDMCAALKE
LLQNGMNGRT   ILGSTILEDE   FTPFDVVRQC
SGVTFQGKFK   KIVKGTHHWM   LLTFLTSLLI
LVQSTQWSLF   FFVYENAFLP   FTLGIMAIAA
CAMLLVKHKH   AFLCLFLLPS   LATVAYFNMV
YMPASWVMRI   MTWLELADTS   LSGYRLKDCV
MYASALVLLI   LMTARTVYDD   AARRVWTLMN
VITLVYKVYY   GNALDQAISM   WALVISVTSN
YSGVVTTIMF   LARAIVFVCV   EYYPLLFITG
NTLQCIMLVY   CFLGYCCCCY   FGLFCLLNRY
FRLTLGVYDY   LVSTQEFRYM   NSQGLLPPKS
SIDAFKLNIK   LLGIGGKPCI   KVATVQSKMS
DVKCTSVVLL   SVLQQLRVES   SSKLWAQCVQ
LHNDILLAKD   TTEAFEKMVS   LLSVLLSMQG
AVDINRLCEE   MLDNRATLQA   IASEFSSLPS
YAAYATAQEA   YEQAVANGDS   EVVLKKLKKS
LNVAKSEFDR   DAAMQRKLEK   MADQAMTQMY
KQARSEDKRA   KVTSAMQTML   FTMLRKLDND
ALNNIINNAR   DGCVPLNIIP   LTTAAKLMVV
VPDYGTYKNT   CDGNTFTYAS   ALWEIQQVVD
ADSKIVQLSE   INMDNSPNLA   WPLIVTALRA
NSAVKLQNNE   LSPVALRQMS   CAAGTTQTAC
TDDNALAYYN   NSKGGRFVLA   LLSDHQDLKW
ARFPKSDGTG   TIYTELEPPC   RFVTDTPKGP
KVKYLYFIKG   LNNLNRGMVL   GSLAATVRLQ
AGNATEVPAN   STVLSFCAFA   VDPAKAYKDY
LASGGQPITN   CVKMLCTHTG   TGQAITVTPE
ANMDQESFGG   ASCCLYCRCH   IDHPNPKGFC
DLKGKYVQIP   TTCANDPVGF   TLRNTVCTVC
GMWKGYGCSC   DQLREPLMQS   ADASTFLNRV
CGVSAARLTP   CGTGTSTDVV   YRAFDIYNEK
VAGFAKFLKT   NCCRFQEKDE   EGNLLDSYFV
VKRHTMSNYQ   HEETIYNLVK   DCPAVAVHDF
FKFRVDGDMV   PHISRQRLTK   YTMADLVYAL
RHFDEGNCDT   LKEILVTYNC   CDDDYFNKKD
WYDFVENPDI   LRVYANLGER   VRQSLLKTVQ
FCDAMRDAGI   VGVLTLDNQD   LNGNWYDFGD
FVQVAPGCGV   PIVDSYYSLL   MPILTLTRAL
AAESHMDADL   AKPLIKWDLL   KYDFTEERLC
LFDRYFKYWD   QTYHPNCINC   LDDRCILHCA
NFNVLFSTVF   PPTSFGPLVR   KIFVDGVPFV
VSTGYHFREL   GVVHNQDVNL   HSSRLSFKEL
LVYAADPAMH   AASGNLLLDK   RTTCFSVAAL
TNNVAFQTVK   PGNFNKDFYD   FAVSKGFFKE
GSSVELKHFF   FAQDGNAAIS   DYDYYRYNLP
TMCDIRQLLF   VVEVVDKYFD   CYDGGCINAN
QVIVNNLDKS   AGFPFNKWGK   ARLYYDSMSY
EDQDALFAYT   KRNVIPTITQ   MNLKYAISAK
NRARTVAGVS   ICSTMTNRQF   HQKLLKSIAA
TRGATVVIGT   SKFYGGWHNM   LKTVYSDVET
PHLMGWDYPK   CDRAMPNMLR   IMASLVLARK
HNTCCNLSHR   FYRLANECAQ   VLSEMVMCGG
SLYVKPGGTS   SGDATTAYAN   SVFNICQAVT
ANVNALLSTD   GNKIADKYVR   NLQHRLYECL
YRNRDVDHEF   VDEFYAYLRK   HFSMMILSDD
AVVCYNSNYA   AQGLVASIKN   FKAVLYYQNN
VFMSEAKCWT   ETDLTKGPHE   FCSQHTMLVK
QGDDYVYLPY   PDPSRILGAG   CFVDDIVKTD
GTLMIERFVS   LAIDAYPLTK   HPNQEYADVF
HLYLQYIRKL   HDELTGHMLD   MYSVMLTNDN
TSRYWEPEFY   EAMYTPHTVL   QAVGACVLCN
SQTSLRCGAC   IRRPFLCCKC   CYDHVISTSH
KLVLSVNPYV   CNAPGCDVTD   VTQLYLGGMS
YYCKSHKPPI   SFPLCANGQV   FGLYKNTCVG
SDNVTDFNAI   ATCDWTNAGD   YILANTCTER
LKLFAAETLK   ATEETFKLSY   GIATVREVLS
DRELHLSWEV   GKPRPPLNRN   YVFTGYRVTK
NSKVQIGEYT   FEKGDYGDAV   VYRGTTTYKL
NVGDYFVLTS   HTVMPLSAPT   LVPQEHYVRI
TGLYPTLNIS   DEFSSNVANY   QKVGMQKYST
LQGPPGTGKS   HFAIGLALYY   PSARIVYTAC
SHAAVDALCE   KALKYLPIDK   CSRIIPARAR
VECFDKFKVN   STLEQYVFCT   VNALPETTAD
IVVFDEISMA   TNYDLSVVNA   RLRAKHYVYI
GDPAQLPAPR   TLLTKGTLEP   EYFNSVCRLM
KTIGPDMFLG   TCRRCPAEIV   DTVSALVYDN
KLKAHKDKSA   QCFKMFYKGV   ITHDVSSAIN
RPQIGVVREF   LTRNPAWRKA   VFISPYNSQN
AVASKILGLP   TQTVDSSQGS   EYDYVIFTQT
TETAHSCNVN   RFNVAITRAK   IGILCIMSDR
DLYDKLQFTS   LEIPRRNVAT   LQAENVTGLF
KDCSKIITGL   HPTQAPTHLS   VDIKFKTEGL
CVDIPGIPKD   MTYRRLISMM   GFKMNYQVNG
YPNMFITREE   AIRHVRAWIG   FDVEGCHATR
DAVGTNLPLQ   LGFSTGVNLV   AVPTGYVDTE
NNTEFTRVNA   KPPPGDQFKH   LIPLMYKGLP
WNVVRIKIVQ   MLSDTLKGLS   DRVVFVLWAH
GFELTSMKYF   VKIGPERTCC   LCDKRATCFS
TSSDTYACWN   HSVGFDYVYN   PFMIDVQQWG
FTGNLQSNHD   QHCQVHGNAH   VASCDAIMTR
CLAVHECFVK   RVDWSVEYPI   IGDELRVNSA
CRKVQHMVVK   SALLADKFPV   LHDIGNPKAI
KCVPQAEVEW   KFYDAQPCSD   KAYKIEELFY
SYATHHDKFT   DGVCLFWNCN   VDRYPANAIV
CRFDTRVLSN   LNLPGCDGGS   LYVNKHAFHT
PAFDKSAFTN   LKQLPFFYYS   DSPCESHGKQ
VVSDIDYVPL   KSATCITRCN   LGGAVCRHHA
NEYRQYLDAY   NMMISAGFSL   WIYKQFDTYN
LWNTFTRLQS   LENVAYNVVN   KGHFDGHAGE
APVSIINNAV   YTKVDGIDVE   IFENKTTLPV
NVAFELWAKR   NIKPVPEIKI   LNNLGVDIAA
NTVIWDYKRE   APAHVSTIGV   CTMTDIAKKP
TESACSSLTV   LFDGRVEGQV   DLFRNARNGV
LITEGSVKGL   TPSKGPAQAS   VNGVTLIGES
VKTQFNYFKK   VDGIIQQLPE   TYFTQSRDLE
DFKPRSQMET   DFLELAMDEF   IQRYKLEGYA
FEHIVYGDFS   HGQLGGLHLM   IGLAKRSQDS
PLKLEDFIPM   DSTVKNYFIT   DAQTGSSKCV
CSVIDLLLDD   FVEIIKSQDL   SVISKVVKVT
IDYAEISFML   WCKDGHVETF   YPKLQASRAW
QPGVAMPNLY   KMQRMLLEKC   DLQNYGENAV
IPKGIMMNVA   KYTQLCQYLN   TLTLAVPYNM
RVIHFGAGSD   KGVAPGTAVL   RQWLPTGTLL
VDSDLNDFVS   DAYSTLIGDC   ATVHTANKWD
LIISDMYDPR   TKHVTKENDS   KEGFFTYLCG
FIKQKLALGG   SIAVKITEHS   WNADLYKLMG
HFSWWTAFVT   NVNASSSEAF   LIGANYLGKP
KEQIDGYTMH   ANYIFWRNTN   PIQLSSYSLF
DMSKFPLKLR   GTAVMSLKEN   QINDMIYSLL
EKGRLIIREN   NRVVVSSDIL   VNN

>SARS_2 * orf1a polyprotein *
250..13398 * 29836495
MESLVLGVNE KTHVQLSLPV LQVRDVLVRG
FGDSVEEALS EAREHLKNGT CGLVELEKGV
LPQLEQPYVF IKRSDALSTN HGHKVVELVA
EMDGIQYGRS GITLGVLVPH VGETPIAYRN
VLLRKNGNKG AGGHSYGIDL KSYDLGDELG
TDPIEDYEQN WNTKHGSGAL RELTRELNGG
AVTRYVDNNF CGPDGYPLDC IKDFLARAGK
SMCTLSEQLD YIESKRGVYC CRDHEHEIAW
FTERSDKSYE HQTPFEIKSA KKFDTFKGEC
PKFVFPLNSK VKVIQPRVEK KKTEGFMGRI
RSVYPVASPQ ECNNMHLSTL MKCNHCDEVS
WQTCDFLKAT CEHCGTENLV IEGPTTCGYL
PTNAVVKMPC PACQDPEIGP EHSVADYHNH
SNIETRLRKG GRTRCFGGCV FAYVGCYNKR
AYWVPRASAD IGSGHTGITG DNVETLNEDL
LEILSRERVN INIVGDFHLN EEVAIILASF
SASTSAFIDT IKSLDYKSFK TIVESCGNYK
VTKGKPVKGA WNIGQQRSVL TPLCGFPSQA
AGVIRSIFAR TLDAANHSIP DLQRAAVTIL
DGISEQSLRL   VDAMVYTSDL   LTNSVIIMAY
VTGGLVQQTS   QWLSNLLGTT   VEKLRPIFEW
IEAKLSAGVE   FLKDAWEILK   FLITGVFDIV
KGQIQVASDN   IKDCVKCFID   VVNKALEMCI
DQVTIAGAKL   RSLNLGEVFI   AQSKGLYRQC
IRGKEQLQLL   MPLKAPKEVT   FLEGDSHDTV
LTSEEVVLKN   GELEALETPV   DSFTNGAIVG
TPVCVNGLML   LEIKDKEQYC   ALSPGLLATN
NVFRLKGGAP   IKGVTFGEDT   VWEVQGYKNV
RITFELDERV   DKVLNEKCSV   YTVESGTEVT
EFACVVAEAV   VKTLQPVSDL   LTNMGIDLDE
WSVATFYLFD   DAGEENFSSR   MYCSFYPPDE
EEEDDAECEE   EEIDETCEHE   YGTEDDYQGL
PLEFGASAET   VRVEEEEEED   WLDDTTEQSE
IEPEPEPTPE   EPVNQFTGYL   KLTDNVAIKC
VDIVKEAQSA   NPMVIVNAAN   IHLKHGGGVA
GALNKATNGA   MQKESDDYIK   LNGPLTVGGS
CLLSGHNLAK   KCLHVVGPNL   NAGEDIQLLK
AAYENFNSQD   ILLAPLLSAG   IFGAKPLQSL
QVCVQTVRTQ   VYIAVNDKAL   YEQVVMDYLD
NLKPRVEAPK   QEEPPNTEDS   KTEEKSVVQK
PVDVKPKIKA   CIDEVTTTLE   ETKFLTNKLL
LFADINGKLY   HDSQNMLRGE   DMSFLEKDAP
YMVGDVITSG   DITCVVIPSK   KAGGTTEMLS
RALKKVPVDE   YITTYPGQGC   AGYTLEEAKT
ALKKCKSAFY   VLPSEAPNAK   EEILGTVSWN
LREMLAHAEE   TRKLMPICMD   VRAIMATIQR
KYKGIKIQEG   IVDYGVRFFF   YTSKEPVASI
ITKLNSLNEP   LVTMPIGYVT   HGFNLEEAAR
CMRSLKAPAV   VSVSSPDAVT   TYNGYLTSSS
KTSEEHFVET   VSLAGSYRDW   SYSGQRTELG
VEFLKRGDKI   VYHTLESPVE   FHLDGEVLSL
DKLKSLLSLR   EVKTIKVFTT   VDNTNLHTQL
VDMSMTYGQQ   FGPTYLDGAD   VTKIKPHVNH
EGKTFFVLPS   DDTLRSEAFE   YYHTLDESFL
GRYMSALNHT   KKWKFPQVGG   LTSIKWADNN
CYLSSVLLAL   QQLEVKFNAP   ALQEAYYRAR
AGDAANFCAL   ILAYSNKTVG   ELGDVRETMT
HLLQHANLES   AKRVLNVVCK   HCGQKTTTLT
GVEAVMYMGT   LSYDNLKTGV   SIPCVCGRDA
TQYLVQQESS   FVMMSAPPAE   YKLQQGTFLC
ANEYTGNYQC   GHYTHITAKE   TLYRIDGAHL
TKMSEYKGPV   TDVFYKETSY   TTTIKPVSYK
LDGVTYTEIE   PKLDGYYKKD   NAYYTEQPID
LVPTQPLPNA   SFDNFKLTCS   NTKFADDLNQ
MTGFTKPASR   ELSVTFFPDL   NGDVVAIDYR
HYSASFKKGA   KLLHKPIVWH   INQATTKTTF
KPNTWCLRCL   WSTKPVDTSN   SFEVLAVEDT
QGMDNLACES   QQPTSEEVVE   NPTIQKEVIE
CDVKTTEVVG   NVILKPSDEG   VKVTQELGHE
DLMAAYVENT   SITIKKPNEL   SLALGLKTIA
THGIAAINSV   PWSKILAYVK   PFLGQAAITT
SNCAKRLAQR   VFNNYMPYVF   TLLFQLCTFT
KSTNSRIRAS   LPTTIAKNSV   KSVAKLCLDA
GINYVKSPKF   SKLFTIAMWL   LLLSICLGSL
ICVTAAFGVL   LSNFGAPSYC   NGVRELYLNS
SNVTTMDFCE   GSFPCSICLS   GLDSLDSYPA
LETIQVTISS   YKLDLTILGL   AAEWVLAYML
FTKFFYLLGL   SAIMQVFFGY   FASHFISNSW
LMWFIISIVQ   MAPVSAMVRM   YIFFASFYYI
WKSYVHIMDG   CTSSTCMMCY   KRNRATRVEC
TTIVNGMKRS   FYVYANGGRG   FCKTHNWNCL
NCDTFCTGST   FISDEVARDL   SLQFKRPINP
TDQSSYIVDS   VAVKNGALHL   YFDKAGQKTY
ERHPLSHFVN   LDNLRANNTK   GSLPINVIVF
DGKSKCDESA   SKSASVYYSQ   LMCQPILLLD
QALVSDVGDS   TEVSVKMFDA   YVDTFSATFS
VPMEKLKALV   ATAHSELAKG   VALDGVLSTF
VSAARQGVVD   TDVDTKDVIE   CLKLSHHSDL
EVTGDSCNNF   MLTYNKVENM   TPRDLGACID
CNARHINAQV   AKSHNVSLIW   NVKDYMSLSE
QLRKQIRSAA   KKNNIPFRLT   CATTRQVVNV
ITTKISLKGG   KIVSTCFKLM   LKATLLCVLA
ALVCYIVMPV   HTLSIHDGYT   NEIIGYKAIQ
DGVTRDIIST   DDCFANKHAG   FDAWFSQRGG
SYKNDKSCPV   VAAIITREIG   FIVPGLPGTV
LRAINGDFLH   FLPRVFSAVG   NICYTPSKLI
EYSDFATSAC   VLAAECTIFK   DAMGKPVPYC
YDTNLLEGSI   SYSELRPDTR   YVLMDGSIIQ
FPNTYLEGSV   RVVTTFDAEY   CRHGTCERSE
VGICLSTSGR   WVLNNEHYRA   LSGVFCGVDA
MNLIANIFTP   LVQPVGALDV   SASVVAGGII
AILVTCAAYY   FMKFRRVFGE   YNHVVAANAL
LFLMSFTILC   LVPAYSFLPG   VYSVFYLYLT
FYFTNDVSFL   AHLQWFAMFS   PIVPFWITAI
YVFCISLKHC   HWFFNNYLRK   RVMFNGVTFS
TFEEAALCTF   LLNKEMYLKL   RSETLLPLTQ
YNRYLALYNK   YKYFSGALDT   TSYREAACCH
LAKALNDFSN   SGADVLYQPP   QTSITSAVLQ
SGFRKMAFPS   GKVEGCMVQV   TCGTTTLNGL
WLDDTVYCPR   HVICTAEDML   NPNYEDLLIR
KSNHSFLVQA   GNVQLRVIGH   SMQNCLLRLK
VDTSNPKTPK   YKFVRIQPGQ   TFSVLACYNG
SPSGVYQCAM   RPNHTIKGSF   LNGSCGSVGF
NIDYDCVSFC   YMHHMELPTG   VHAGTDLEGK
FYGPFVDRQT   AQAAGTDTTI   TLNVLAWLYA
AVINGDRWFL   NRFTTTLNDF   NLVAMKYNYE
PLTQDHVDIL   GPLSAQTGIA   VLDMCAALKE
LLQNGMNGRT   ILGSTILEDE   FTPFDVVRQC
SGVTFQGKFK   KIVKGTHHWM   LLTFLTSLLI
LVQSTQWSLF   FFVYENAFLP   FTLGIMAIAA
CAMLLVKHKH   AFLCLFLLPS   LATVAYFNMV
YMPASWVMRI   MTWLELADTS   LSGYRLKDCV
MYASALVLLI   LMTARTVYDD   AARRVWTLMN
VITLVYKVYY   GNALDQAISM   WALVISVTSN
YSGVVTTIMF   LARAIVFVCV   EYYPLLFITG
NTLQCIMLVY   CFLGYCCCCY   FGLFCLLNRY
FRLTLGVYDY   LVSTQEFRYM   NSQGLLPPKS
SIDAFKLNIK   LLGIGGKPCI   KVATVQSKMS
DVKCTSVVLL   SVLQQLRVES   SSKLWAQCVQ
LHNDILLAKD   TTEAFEKMVS   LLSVLLSMQG
AVDINRLCEE   MLDNRATLQA   IASEFSSLPS
YAAYATAQEA   YEQAVANGDS   EVVLKKLKKS
LNVAKSEFDR   DAAMQRKLEK   MADQAMTQMY
KQARSEDKRA   KVTSAMQTML   FTMLRKLDND
ALNNIINNAR   DGCVPLNIIP   LTTAAKLMVV
VPDYGTYKNT   CDGNTFTYAS   ALWEIQQVVD
ADSKIVQLSE   INMDNSPNLA   WPLIVTALRA
NSAVKLQNNE   LSPVALRQMS   CAAGTTQTAC
TDDNALAYYN   NSKGGRFVLA   LLSDHQDLKW
ARFPKSDGTG   TIYTELEPPC   RFVTDTPKGP
KVKYLYFIKG   LNNLNRGMVL   GSLAATVRLQ
AGNATEVPAN   STVLSFCAFA   VDPAKAYKDY
LASGGQPITN   CVKMLCTHTG   TGQAITVTPE
ANMDQESFGG   ASCCLYCRCH   IDHPNPKGFC
DLKGKYVQIP   TTCANDPVGF   TLRNTVCTVC
GMWKGYGCSC   DQLREPLMQS   ADASTFLNGF
AV

>SARS_3 * putative E2 glycoprotein
precursor * 21477..25244 * 29836496
MFIFLLFLTL TSGSDLDRCT TFDDVQAPNY
TQHTSSMRGV YYPDEIFRSD TLYLTQDLFL
PFYSNVTGFH TINHTFGNPV IPFKDGIYFA
ATEKSNVVRG WVFGSTMNNK SQSVIIINNS
TNVVIRACNF ELCDNPFFAV SKPMGTQTHT
MIFDNAFNCT FEYISDAFSL DVSEKSGNFK
HLREFVFKNK DGFLYVYKGY QPIDVVRDLP
SGFNTLKPIF KLPLGINITN FRAILTAFSP
AQDIWGTSAA AYFVGYLKPT TFMLKYDENG
TITDAVDCSQ NPLAELKCSV KSFEIDKGIY
QTSNFRVVPS GDVVRFPNIT NLCPFGEVFN
ATKFPSVYAW ERKKISNCVA DYSVLYNSTF
FSTFKCYGVS ATKLNDLCFS NVYADSFVVK
GDDVRQIAPG QTGVIADYNY KLPDDFMGCV
LAWNTRNIDA TSTGNYNYKY RYLRHGKLRP
FERDISNVPF SPDGKPCTPP ALNCYWPLND
YGFYTTTGIG YQPYRVVVLS FELLNAPATV
CGPKLSTDLI KNQCVNFNFN GLTGTGVLTP
SSKRFQPFQQ FGRDVSDFTD SVRDPKTSEI
LDISPCAFGG VSVITPGTNA SSEVAVLYQD
VNCTDVSTAI HADQLTPAWR IYSTGNNVFQ
TQAGCLIGAE HVDTSYECDI PIGAGICASY
HTVSLLRSTS QKSIVAYTMS LGADSSIAYS
NNTIAIPTNF SISITTEVMP VSMAKTSVDC
NMYICGDSTE CANLLLQYGS FCTQLNRALS
GIAAEQDRNT REVFAQVKQM YKTPTLKYFG
GFNFSQILPD PLKPTKRSFI EDLLFNKVTL
ADAGFMKQYG ECLGDINARD LICAQKFNGL
TVLPPLLTDD MIAAYTAALV SGTATAGWTF
GAGAALQIPF AMQMAYRFNG IGVTQNVLYE
NQKQIANQFN KAISQIQESL TTTSTALGKL
QDVVNQNAQA LNTLVKQLSS NFGAISSVLN
DILSRLDKVE AEVQIDRLIT GRLQSLQTYV
TQQLIRAAEI RASANLAATK MSECVLGQSK
RVDFCGKGYH LMSFPQAAPH GVVFLHVTYV
PSQERNFTTA PAICHEGKAY FPREGVFVFN
GTSWFITQRN FFSPQIITTD NTFVSGNCDV
VIGIINNTVY DPLQPELDSF KEELDKYFKN
HTSPDVDLGD ISGINASVVN IQKEIDRLNE
VAKNLNESLI DLQELGKYEQ YIKWPWYVWL
GFIAGLIAIV MVTILLCCMT SCCSCLKGAC
SCGSCCKFDE DDSEPVLKGV KLHYT

>SARS_4 * putative uncharacterized
protein * 25253..26077 * 29836497
MDLFMRFFTL GSITAQPVKI DNASPASTVH
ATATIPLQAS LPFGWLVIGV AFLAVFQSAT
KIIALNKRWQ LALYKGFQFI CNLLLLFVTI
YSHLLLVAAG MEAQFLYLYA LIYFLQCINA
CRIIMRCWLC WKCKSKNPLL YDANYFVCWH
THNYDYCIPY NSVTDTIVVT EGDGISTPKL
KEDYQIGGYS EDRHSGVKDY VVVHGYFTEV
YYQLESTQIT TDTGIENATF FIFNKLVKDP
PNVQIHTIDG SSGVANPAMD PIYDEPTTTT
SVPL

>SARS_5 * putative uncharacterized
protein * 25674..26138 * 29836498
MMPTTLFAGT HITMTTVYHI TVSQIQLSLL
KVTAFQHQNS KKTTKLVVIL RIGTQVLKTM
SLYMAISPKF TTSLSLHKLL QTLVLKMLHS
SSLTSLLKTH RMCKYTQSTA LQELLIQQWI
QFMMSRRRLL ACLCKHKKVS TNLCTHSFRK
KQVR

>SARS_6 * putative small envelope
protein E * 26102..26332 * 29836499
MYSFVSEETG TLIVNSVLLF LAFVVFLLVT
LAILTALRLC AYCCNIVNVS LVKPTVYVYS
RVKNLNSSEG VPDLLV

>SARS_7 * putative protein M *
26383..27048 * 29836504
MADNGTITVE ELKQLLEQWN LVIGFLFLAW
IMLLQFAYSN RNRFLYIIKL VFLWLLWPVT
LACFVLAAVY RINWVTGGIA IAMACIVGLM
WLSYFVASFR LFARTRSMWS FNPETNILLN
VPLRGTIVTR PLMESELVIG AVIIRGHLRM
AGHSLGRCDI KDLPKEITVA TSRTLSYYKL
GASQRVGTDS GFAAYNRYRI GNYKLNTDHA
GSNDNIALLV Q

>SARS_8 * putative uncharacterized
protein * 27059..27250 * 29836500
MFHLVDFQVT IAEILIIIMR TFRIAIWNLD
VIISSIVRQL FKPLTKKNYS ELDDEEPMEL
DYP

>SARS_9 * putative uncharacterized
protein * 27258..27626 * 29836501
MKIILFLTLI VFTSCELYHY QECVRGTTVL
LKEPCPSGTY EGNSPFHPLA DNKFALTCTS
THFAFACADG TRHTYQLRAR SVSPKLFIRQ
EEVQQELYSP LFLIVAALVF LILCFTIKRK
TE

>SARS_10 * putative nucleocapsid
 protein * 28105..29373 * 29836503
 MSDNGPQSNQ RSAPRITFGG PTDSTDNNQN
 GGRNGARPKQ RRPQGLPNNT ASWFTALTQH
 GKEELRFPRG QGVPINTNSG PDDQIGYYRR
 ATRRVRGGDG KMKELSPRWY FYYLGTGPEA
 SLPYGANKEG IVWVATEGAL NTPKDHIGTR
 NPNNNAATVL QLPQGTTLPK GFYAEGSRGG
 SQASSRSSSR SRGNSRNSTP GSSRGNSPAR
 MASGGGETAL ALLLLDRLNQ LESKVSGKGQ
 QQQGQTVTKK SAAEASKKPR QKRTATKQYN
 VTQAFGRRGP EQTQGNFGDQ DLIRQGTDYK
 HWPQIAQFAP SASAFFGMSR IGMEVTPSGT
 WLTYHGAIKL DDKDPQFKDN VILLNKHIDA
 YKTFPPTEPK KDKKKKTDEA QPLPQRQKKQ
 PTVTLLPAAD MDDFSRQLQN SMSGASADST
 QA

 >SARS_11 * putative uncharacterized
 protein * 28115..28411 * 29836502
 MDPNQTNVVP PALHLVDPQI QLTITRMEDA
 MGQGQNSADP KVYPIILRLG SQLSLSMARR
 NLDSLEARAF QSTPIVVQMT KLATTEELPD
 EFVVVTAK
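
Before pasting one of the sequences above into a Blast form, it can help to strip the ten-residue block formatting and to check the length against the genomic coordinates in the header. The following is a minimal sketch, not part of the original exercise. The function names are made up for illustration, and the length check assumes (this is not stated on the sheet) that a simple start..end range is 1-based, inclusive, and ends with a stop codon. It does not apply to the join(...) entry of SARS_1.

```python
def strip_blocks(block_text: str) -> str:
    """Join a sequence printed in whitespace-separated blocks into one string."""
    return "".join(block_text.split())


def expected_protein_length(start: int, end: int) -> int:
    """Residues encoded by a start..end range.

    Assumption (not from the sheet): the range is 1-based, inclusive,
    and its last codon is a stop codon that is not translated.
    """
    n_nt = end - start + 1
    if n_nt % 3 != 0:
        raise ValueError("range is not a whole number of codons")
    return n_nt // 3 - 1


# SARS_8 as printed above (header coordinates 27059..27250)
sars_8 = """
MFHLVDFQVT IAEILIIIMR TFRIAIWNLD
VIISSIVRQL FKPLTKKNYS ELDDEEPMEL
DYP
"""

seq = strip_blocks(sars_8)
print(len(seq), expected_protein_length(27059, 27250))
```

If the two numbers disagree for a sequence, either the copy is incomplete or the coordinate assumption does not hold for that entry.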




a Find out with which other completely    The levels of sequence identity might be
  sequenced Nidovirales SARS has homologs quite low; when in doubt, do reciprocal
  (make a table).                         PSI-Blast searches. So that you know what
                                          to look for in your Blast output, here are
                                          the sequenced Nidovirales:

                                                 Avian infectious bronchitis virus
                                                 Bovine coronavirus
                                                 Equine arteritis virus
                                                 Human coronavirus 229E
                                                 Lactate dehydrogenase-elevating
                                                  virus
                                                 Murine hepatitis virus
                                                 Porcine epidemic diarrhea virus
                                                 Porcine reproductive and
                                                  respiratory syndrome virus
                                                 SARS coronavirus
                                                 Simian hemorrhagic fever virus
                                                 Transmissible gastroenteritis
                                                  virus

b Make a phylogeny of the Nidovirales based
  on the proteins that they all share.
 c Is the sequence-based phylogeny consistent
   with the species with which the SARS virus
   shares the most genes?
 d If the answer to c is 'yes', can you think     Are we able to detect all sequence
   of a pitfall of that result?                   homologies by sequence comparison?
                                                  And how does that depend on the
                                                  evolutionary distance between the
                                                  sequences?
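
For questions 1a and 1c, the Blast results can be collected in a presence/absence table and the number of shared proteins per genome counted from it. The sketch below shows one possible way to do that. The function names are hypothetical, and the hits are made-up placeholders (virus_A, virus_B, virus_C), not real Blast results; replace them with your own findings.

```python
def homolog_table(query_ids, hits_by_genome):
    """Return (genome, marks, shared_count) rows for a presence/absence table."""
    rows = []
    for genome in sorted(hits_by_genome):
        hits = hits_by_genome[genome]
        marks = ["+" if q in hits else "-" for q in query_ids]
        rows.append((genome, marks, marks.count("+")))
    return rows


def shared_by_all(query_ids, hits_by_genome):
    """Query proteins with a homolog in every genome (candidates for question 1b)."""
    return [q for q in query_ids
            if all(q in hits for hits in hits_by_genome.values())]


# Placeholder data only: these hits are invented to show the data layout.
query_ids = ["SARS_1", "SARS_3", "SARS_7", "SARS_10"]
toy_hits = {
    "virus_A": {"SARS_1", "SARS_3", "SARS_7", "SARS_10"},
    "virus_B": {"SARS_1", "SARS_3"},
    "virus_C": {"SARS_1"},
}

rows = homolog_table(query_ids, toy_hits)
for genome, marks, n_shared in rows:
    print(f"{genome:10s} {' '.join(marks)}  shared={n_shared}")
print("shared by all:", shared_by_all(query_ids, toy_hits))
```

A genome that shares few proteins may simply be too distant for its homologs to be detected, which is the pitfall question 1d asks about.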

2 A question about horizontal gene transfer.
  For the following sequence:
  cmntqapiae   atkkavsmgp
  ekvieevfks   nlvgrggagf rtgkkwesay ktpasdkyvv cnadeglpst
  ykdwcllnne
  akrkevftgm   gicaktigak rcfmylryey rnlvpaleqs ikdvqstcpe
  ladlkyeirl
  gggpyvagee   naqfesiegr aplprkdrpg nifptmeglf hkptvinnve
  tffaiphiiq
  qgsqsfgegk   mpkllsvtgd vdepilietn lnnyslnhll qeisakdiva
  aeiggctepi
  ifgskfdtlf   gfgrgtlnav gsvvlfnssc dlgkiyenkl kfmaeesckq
  cvpcrdgsyi
  fhrafkelrd   tgkssynmra lavasesaar ssicahgkal eslfksacdf
  mnktkpiyqp
  hstyhq




 a From which species does it come? Which
   domain does it contain?

 b Note that the sequence above is part of a      You could make a tree with the best
   larger protein. Here, however, we are          Blast hits. For the purpose of this
   specifically interested in the evolutionary    question it is sufficient to let the Blast
   history of this part of the protein itself.    tool do this for you. Click on “distance
   The species which this protein comes from      tree of results” to obtain a phylogeny of
   is a ciliate, a phylum of the eukaryotes that  the sequences. You can zoom in on the tree
   contains, amongst others, “Paramecium”         by mousing over it and clicking on
   (the slipper animalcule). Is the protein       “select subtree”. The query sequence is
   above closely related to proteins from other   all the way at the bottom of the page.
   ciliates?

 c Does Nyctotherus have a “distant” homolog      In order to find that in the Blast results
   of this protein (one that is less than 50%     of the first protein you have to make sure
   identical)?                                    that you select “1000 alignments” and
                                                  “1000 descriptions” in the Blast page
                                                  where you enter the sequence.

 d Is this protein (the homolog) closely          You will have to do a second Blast
   related to proteins from other ciliates?       search… + a tree

 e Are both Nyctotherus versions of this gene
   the result of a recent gene duplication? Or
   do they have different origins? How do you
   think these proteins ended up in the
   genome? Horizontal Gene Transfer or
   Vertical Inheritance?
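
The "less than 50% identical" criterion in question 2 can be checked by hand from a pairwise alignment. The sketch below is one way to do so, under a stated assumption: identity is counted over columns where neither sequence has a gap. Blast itself reports identities over the full alignment length, so its percentage can be slightly lower. The function names and the toy alignments are invented for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identical residues over the ungapped columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip columns with a gap in either sequence
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * identical / compared


def is_distant(aligned_a: str, aligned_b: str, threshold: float = 50.0) -> bool:
    """True if the pair falls below the identity threshold used in question 2."""
    return percent_identity(aligned_a, aligned_b) < threshold


# Toy alignments, not taken from the exercise sequences.
print(percent_identity("MKV-LA", "MRVQLA"))  # 4 of 5 ungapped columns match
print(is_distant("AAAA", "AGGG"))            # 1 of 4 columns match
```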

3 In contrast to what one would expect,
   hyperthermophiles (species that live at a
   temperature of 80 °C or higher) do not
   always have a high GC level in their DNA
   to stabilize it. Instead they have found
   another solution to stabilize their DNA.
   By combining comparative genome analysis
   with an examination of the experimental
   literature, we can identify candidate
   proteins for this.
 a Find the genes that are shared between                     Hyperthermophiles are: A. fulgidus, A.
   nearly all hyperthermophiles in the COG                    pernix, M. jannaschii, P. horikoshii, P.
   database and that are not shared by most of                abyssi, A. aeolicus and T. maritima.
   the non-thermophiles.                                      Examples of non-thermophiles are E.
                                                              coli, B. subtilis, Synechocystis,
                                                              M. tuberculosis, and H. influenzae. Use
                                                              the “old” COG database:
                                                              www.ncbi.nlm.nih.gov/COG/old and
                                                              then select “phylogenetic patterns
                                                              search”
 b Interpret the functions of the proteins.                   Search in Pubmed at
                                                              www.ncbi.nlm.nih.gov/entrez.
 c Which one is likely responsible for the
   stabilization of the DNA?
 d What is the mechanism of DNA
   stabilization?
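
The phylogenetic pattern search of question 3a amounts to a filter over presence/absence patterns. The sketch below illustrates that logic only; it does not query the COG database. The thresholds ("nearly all" as at most one hyperthermophile missing, "not shared by most" as present in fewer than half of the non-thermophiles) are assumptions, and the COG identifiers and patterns are invented placeholders.

```python
HYPER = ["A. fulgidus", "A. pernix", "M. jannaschii", "P. horikoshii",
         "P. abyssi", "A. aeolicus", "T. maritima"]
NON_THERMO = ["E. coli", "B. subtilis", "Synechocystis",
              "M. tuberculosis", "H. influenzae"]


def candidate_cogs(patterns, max_missing=1, max_fraction_non=0.5):
    """Return COG ids present in nearly all hyperthermophiles and few non-thermophiles.

    patterns maps a COG id to the set of species in which it occurs.
    The two thresholds are assumptions, not values given in the exercise.
    """
    selected = []
    for cog, species in sorted(patterns.items()):
        n_hyper = sum(s in species for s in HYPER)
        n_non = sum(s in species for s in NON_THERMO)
        nearly_all_hyper = n_hyper >= len(HYPER) - max_missing
        rare_in_non = n_non / len(NON_THERMO) < max_fraction_non
        if nearly_all_hyper and rare_in_non:
            selected.append(cog)
    return selected


# Invented patterns, only to show the expected input shape.
toy_patterns = {
    "COG_toy1": set(HYPER),                    # all hyperthermophiles, nothing else
    "COG_toy2": set(HYPER) | set(NON_THERMO),  # universal, so not specific
    "COG_toy3": set(HYPER[:3]),                # too few hyperthermophiles
    "COG_toy4": set(HYPER[1:]) | {"E. coli"},  # one missing, one non-thermophile
}

print(candidate_cogs(toy_patterns))
```

Loosening or tightening the two thresholds changes how many candidates need to be followed up in the literature for questions 3b to 3d.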

4 An important part of comparative genome
  analysis is the interpretation of the results in
  light of the biological context. Below are
  two sequences that are shared between the
  parasitic species R. prowazekii and E.
  cuniculi.
   >E.cuniculi
   MNEVENNNHS FPREDIPTED   EIEEEANSRQ GILRYFRVAR AEYTKFALLG
   LMFGIIGFIY
   SFMRILKDMF VMVRQEPTTI   LFIKIFYILP VSMALVFLIQ YMLGTKTVSR
   IFSIFCGGFA
   SLFFLCGAVF LIEEQVSPSK   FLFRDMFIDG KMSSRSLNVF KSMFLTLNEP
   LATIVFISAE
   MWGSLVLSYL FLSFLNESCT   IRQFSRFIPP LIIITNVSLF LSATVAGAFF
   KLREKLAFQQ
   NQVLLSGIFI FQGFLVVLVI   FLKIYLERVT MKRPLFIVSS GSRRKKAKAN
   VSFSEGLEIM
   SQSKLLLAMS LIVLFFNISY   NMVESTFKVG VKVAAEYFNE EKGKYSGKFN
   RIDQYMTSVV
   VICLNLSPFS SYVETRGFLL   VGLITPIVTL MAIVLFLGSA LYNTSMEESG
   LGIVNGLFPG
   GKPLYVLENY FGVIFMSLLK   ITKYSAFDIC KEKLGMRINP TYRARFKSVY
   DGIFGKLGKS
   IGSIYGLLMF EALDTEDLRK   ATPITAGIIF IFIVMWVKAI IYLSRSYESA
   VQHNRDVDID
  MTEKAKKSLE TPEEPKVVD

  >R.prowazekii
  MSTSKSENYL SELRKIIWPI   EQYENKKFLP LAFMMFCILL NYSTLRSIKD
  GFVVTDIGTE
  SISFLKTYIV LPSAVIAMII   YVKLCDILKQ ENVFYVITSF FLGYFALFAF
  VLYPYPDLVH
  PDHKTIESLS LAYPNFKWFI   KIVGKWSFAS FYTIAELWGT MMLSLLFWQF
  ANQITKIAEA
  KRFYSMFGLL ANLALPVTSV   VIGYFLHEKT QIVAEHLKFV PLFVIMITSS
  FLIILTYRWM
  NKNVLTDPRL YDPALVKEKK   TKAKLSFIES LKMIFTSKYV GYIALLIIAY
  GVSVNLVEGV
  WKSKVKELYP TKEAYTIYMG   QFQFYQGWVA IAFMLIGSNI LRKVSWLTAA
  MITPLMMFIT
  GAAFFSFIFF DSVIAMNLTG   ILASSPLTLA VMIGMIQNVL SKGVKYSLFD
  ATKNMAYIPL
  DKDLRVKGQA AVEVIGGRLG   KSGGAIIQST FFILFPVFGF IEATPYFASI
  FFIIVILWIF
  AVKGLNKEYQ VLVNKNEK



 a What is their function?
 b Can you relate their function to the parasitic
   lifestyle of the species that harbor them?
 c How many other homologous sequences
   can you discover in either genome?
 d In which other species do these proteins
   occur?
 e What, if anything, do they do in these
   species?

5 The gene for methylcitrate synthase has
  only been discovered relatively recently. It
  is homologous to citrate synthase, and
  wrongly annotated in the COG database
  (COG0372). Below are a number of protein
  sequences, some of which are citrate
  synthases, some are methyl-citrate
  synthases, while for others there is no
  experimental evidence regarding their
  function.
  >E.coli methylcitrate_synthase
  MSDTTILQNS THVIKPKKSV ALSGVPAGNT   ALCTVGKSGN DLHYRGYDIL
  DLAKHCEFEE
  VAHLLIHGKL PTRDELAAYK TKLKALRGLP   ANVRTVLEAL PAASHPMDVM
  RTGVSALGCT
  LPEKEGHTVS GARDIADKLL ASLSSILLYW   YHYSHNGERI QPETDDDSIG
  GHFLHLLHGE
  KPSQSWEKAM HISLVLYAEH EFNASTFTSR   VIAGTGSDMY SAIIGAIGAL
  RGPKHGGANE
  VSLEIQQRYE TPDEAEADIR KRVENKEVVI   GFGHPVYTIA DPRHQVIKRV
  AKQLSQEGGS
  LKMYNIADRL ETVMWESKKM FPNLDWFSAV   SYNMMGVPTE MFTPLFVIAR
  VTGWAAHIIE
  QRQDNKIIRP SANYVGPEDR PFVALDKRQ

  >Ralstonia   methylcitrate_synthase
  MSEAQPLVTP   KPKKSVALSG VTAGNTALCT VGRTGNDLHY RGYDILDIAE
  TCEFEEIAHL
  LVHGKLPTKS   ELAAYKAKLK SLRGLPANVK AALEWVPASA HPMDVMRTGV
  SVLGTVLPEK
  EDHNTPGARD   IADRLMASLG SMLLYWYHYS HNGRRIEVET DDDSIGGHFL
  HLLHGEKPSA
  LWERAMNTSL   NLYAEHEFNA STFTARVIAG TGSDMYSSIS GAIGALRGPK
  HGGANEVAFE
  IQKRYDNPDE   AQADITRRVE NKEVVIGFGH PVYTTGDPRN QVIKEVAKKL
  SKDAGSMKMF
  DIAEALETVM   WDIKKMFPNL DWFSAVSYHM MGVPTAMFTA LFVIARTSGW
  AAHIIEQRID
  NKIIRQSANY   TGPENLKFVP LKDRK

  >Corynebacterium methylcitrate synthase
 MSSATTTDVR   KGLYGVIADY TAVSKVMPET NSLTYRGYAV EDLVENCSFE
 EVFYLLWHGE
 LPTAQQLAEF   NERGRSYRSL DAGLISLIHS LPKEAHPMDV MRTAVSYMGT
 KDSEYFTTDS
 EHIRKVGHTL   LAQLPMVLAM DIRRRKGLDI IAPDSSKSVA ENLLSMVFGT
 GPESPASNPA
 DVRDFEKSLI   LYAEHSFNAS TFTARVITST KSDVYSAITG AIGALKGPLH
 GGANEFVMHT
 MLAIDDPNKA   AAWINNALDN KNVVMGFGHR VYKRGDSRVP SMEKSFRELA
 ARHDGEKWVA
 MYENMRDAMD   ARTGIKPNLD FPAGPAYHLL GFPVDFFTPL FVIARVAGWT
 AHIVEQYENN
 SLIRPLSEYN   GEEQREVAPI EKR

 >E.coli citrate synthase
 MADTKAKLTL NGDTAVELDV LKGTLGQDVI   DIRTLGSKGV FTFDPGFTST
 ASCESKITFI
 DGDEGILLHR GFPIDQLATD SNYLEVCYIL   LNGEKPTQEQ YDEFKTTVTR
 HTMIHEQITR
 LFHAFRRDSH PMAVMCGITG ALAAFYHDSL   DVNNPRHREI AAFRLLSKMP
 TMAAMCYKYS
 IGQPFVYPRN DLSYAGNFLN MMFSTPCEPY   EVNPILERAM DRILILHADH
 EQNASTSTVR
 TAGSSGANPF ACIAAGIASL WGPAHGGANE   AALKMLEEIS SVKHIPEFVR
 RAKDKNDSFR
 LMGFGHRVYK NYDPRATVMR ETCHEVLKEL   GTKDDLLEVA MELENIALND
 PYFIEKKLYP
 NVDFYSGIIL KAMGIPSSMF TVIFAMARTV   GWIAHWSEMH SDGMKIARPR
 QLYTGYEKRD
 FKSDIKR

 >B.subtilis citrate synthase
 MVHYGLKGIT CVETSISHID GEKGRLIYRG   HHAKDIALNH SFEEAAYLIL
 FGKLPSTEEL
 QVFKDKLAAE RNLPEHIERL IQSLPNNMDD   MSVVRTVVSA LGENTYTFHP
 KTEEAIRLIA
 ITPSIIAYRK RWTRGEQAIA PSSQYGHVEN   YYYMLTGEQP SEAKKKALET
 YMILATEHGM
 NASTFSARVT LSTESDLVSA VTAALGTMKG   PLHGGAPSAV TKMLEDIGEK
 EHAEAYLKEK
 LEKGERLMGF GHRVYKTKDP RAEALRQKAE   EVAGNDRDLD LALHVEAEAI
 RLLEIYKPGR
 KLYTNVEFYA AAVMRAIDFD DELFTPTFSA   SRMVGWCAHV LEQAENNMIF
 RPSAQYTGAI
 PEEVLS

 >A.fulgidus
 MKDGLEDVIA CKTTISRIAL   ENGRAILEYR GYDIRDLARK ASYEEVAYLL
 LYGELPKKYE
 LQDFKIELAE RRELPPQIIG   LLTHLPPYTH PMVVLRTATS YLGSLDKKIA
 VRTREETFNK
 AKDLIAKFPT IVAYYHRIRT   GRNIIPPALE FSHAANFLYM LHGEEPTKTA
 ERALDMDLIL
 HAEHELNAST FAARIAASTL   ADIYACVVAA TGTLMGPLHG GAAQEVMRML
 REVASPRRAE
 EYVKRKIEAG ERIMGFGHRV   YRGVMDPRAE LLRYLAKRLA AEGSTKWFEI
 SEAIAKAAYK
 YKKLLPNVDF YSASVYANLG   IPDDLFVNIF AMGRISGWTA HIIEQYENNR
 LIRPRAEYVG
 EKEKKFIPLS KR

 >Synechocystis
 MNYMMTDNEV FKEGLAGVPA   AKSRVSHVDG TDGILEYRGI RIEELAKSSS
 FIEVAYLLIW
 GKLPTQAEIE EFEYEIRTHR   RIKYHIRDMM KCFPETGHPM DALQTSAAAL
 GLFYARRALD
 DPKYIRAAVV RLLAKIPTMV   AAFHMIREGN DPIQPNDKLD YASNFLYMLT
 EKEPDPFAAK
 VFDVCLTLHA EHTMNASTFS   ARVTASTLTD PYAVVASAVG TLAGPLHGGA
 NEEVLNMLEE
 IGSVENVRPY VEKCLANKQR   IMGFGHRVYK VKDPRAIILQ DLAEQLFAKM
 GHDEYYEIAV
 ELEKVVEEYV GQKGIYPNVD   FYSGLVYRKL DIPADLFTPL FAIARVAGWL
 AHWKEQLSVN
 KIYRPTQIYI GDHNLSYVPM   TERVVSVARN EDPNAII



a Make a phylogenetic tree of the sequences.
b Which ones are also likely to be
  (methyl)citrate synthases based on the
  phylogenetic tree?
c Are the methylcitrate synthases
  monophyletic?
d B.subtilis has a second gene (besides the one
  above) that in the COGs is annotated as a
  citrate synthase. Add it to the tree.
 >B.subtilis
 MEEKQHYSPG LDGVIAAETH                 ISYLDTQSSQ
 ILIRGYDLIE LSETKSYLEL                 VHLLLEGRLP
 EESEMETLER KINSASSLPA                 DHLRLLELLP
 EDTHPMDGLR TGLSALAGYD                 RQIDDRSPSA
 NKERAYQLLG KMPALTAASY                 RIINKKEPIL
 PLQTLSYSAN FLYMMTGKLP                 SSLEEQIFDR
 SLVLYSEHEM PNSTFAARVI                 ASTHSDLYGA
 LTGAVASLKG NLHGGANEAV                 MYLLLEAKTT
 SDFEQLLQTK LKRKEKIMGF                 GHRVYMKKMD
 PRALMMKEAL QQLCDKAGDH                 RLYEMCEAGE
 RLMEKEKGLY PNLDYYAAPV                 YWMLGIPIPL
 YTPIFFSART SGLCAHVIEQ                 HANNRLFRPR
 VSYMGPRYQT KS

e Do you think, based on the tree, that it is a
  citrate synthase or a methyl-citrate
  synthase?
f Methylcitrate synthase activity has also         You will have to add all homologs of
  been found in S. cerevisiae, but the gene        citrate synthase in S. cerevisiae to the
  responsible has not been identified. Can you     alignment and remake the tree.
  find a sequence in S. cerevisiae that is
  orthologous to the methylcitrate synthases
  above?
g If you cannot find a gene that is orthologous    Remember the story about the malate
  to methyl-citrate synthases in S. cerevisiae,    and lactate dehydrogenases in
  but you know that there has to be one (the       Trichomonas on day 3.
  activity has been measured), what does this
  imply for the monophyly of methyl-citrate
  synthases?
h Another methyl-citrate synthase has
  recently been identified in the fungi. Add it
  to the tree.
 >Aspergillus    nidulans methylcitrate_synthase
 MALPLRTARH   ASRLAQTIGR RGYATAEPDL KSALKAVIPA KRELLAEVKK
 QGDEVIGEVK
 VSNVIGGMRG   LKSMLWEGSV LDADEGIRFH GKTIKDCQKE LPKGPTGTEM
 LPEAMFWLLL
 TGEVPSTSQV   RAFSKQLAEE SHLPDHILDL AKSFPKHMHP MTQISIITAA
 LNTESKFAKL
 YEKGINKADY   WEPTFDDAIS LLAKIPRVAA LVFRPNEIDV VGRQKLDPAQ
 DWSYNFAELL
 GKGGANNADF   HDLLRLYLAL HGDHEGGNVS AHATHLVGSA LSDPFLSYSA
 GLLGLAGPLH
 GLAAQEVLRW   ILAMQEKIGT KFTDEDVRAY LWDTLKSGRV VPGYGHGVLR
 KPDPRFQALM
 DFAATRKDVL   ANPVFQLVKK NSEIAPGVLT EHGKTKNPHP NVDAASGVLF
 YHYGFQQPLY
 YTVTFGVSRA   LGPLVQLIWD RALGLPIERP KSINLLGLKK



i Which S. cerevisiae gene is the likely
  methyl-citrate synthase?
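
Whether a group such as the methylcitrate synthases is monophyletic (questions 5c and 5g) can be read off a rooted tree: the group is monophyletic if some subtree contains exactly its members and nothing else. The sketch below shows that test on a tree written as nested tuples. The tree, the leaf labels (mcs_*, cs_*) and the function names are invented for illustration and do not represent the answer to the exercise. The result depends on where the tree is rooted.

```python
def leaves(tree):
    """Return the set of leaf labels under a nested-tuple tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Yield the leaf set of every subtree of a rooted tree, including the leaves."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some subtree contains exactly the members of group."""
    group = frozenset(group)
    return any(clade == group for clade in clades(tree))


# Invented rooted tree: two "mcs" leaves cluster together, a third sits elsewhere.
toy_tree = ((("mcs_1", "mcs_2"), "cs_1"), ("cs_2", "mcs_3"))

print(is_monophyletic(toy_tree, {"mcs_1", "mcs_2"}))
print(is_monophyletic(toy_tree, {"mcs_1", "mcs_2", "mcs_3"}))
```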

						