Questions bioinformatics III, day 5
Question Hints
1 The SARS virus, which belongs to the
group of the Nidovirales, has been predicted
to encode 11 proteins, some of which are
actually split into separate peptides. The set
of 11 is given below.
11 proteins identified in SARS:
>SARS_1 * putative orf1ab
polyprotein *
join(250..13383,13383..21470) *
29836505
MESLVLGVNE KTHVQLSLPV LQVRDVLVRG
FGDSVEEALS EAREHLKNGT CGLVELEKGV
LPQLEQPYVF IKRSDALSTN HGHKVVELVA
EMDGIQYGRS GITLGVLVPH VGETPIAYRN
VLLRKNGNKG AGGHSYGIDL KSYDLGDELG
TDPIEDYEQN WNTKHGSGAL RELTRELNGG
AVTRYVDNNF CGPDGYPLDC IKDFLARAGK
SMCTLSEQLD YIESKRGVYC CRDHEHEIAW
FTERSDKSYE HQTPFEIKSA KKFDTFKGEC
PKFVFPLNSK VKVIQPRVEK KKTEGFMGRI
RSVYPVASPQ ECNNMHLSTL MKCNHCDEVS
WQTCDFLKAT CEHCGTENLV IEGPTTCGYL
PTNAVVKMPC PACQDPEIGP EHSVADYHNH
SNIETRLRKG GRTRCFGGCV FAYVGCYNKR
AYWVPRASAD IGSGHTGITG DNVETLNEDL
LEILSRERVN INIVGDFHLN EEVAIILASF
SASTSAFIDT IKSLDYKSFK TIVESCGNYK
VTKGKPVKGA WNIGQQRSVL TPLCGFPSQA
AGVIRSIFAR TLDAANHSIP DLQRAAVTIL
DGISEQSLRL VDAMVYTSDL LTNSVIIMAY
VTGGLVQQTS QWLSNLLGTT VEKLRPIFEW
IEAKLSAGVE FLKDAWEILK FLITGVFDIV
KGQIQVASDN IKDCVKCFID VVNKALEMCI
DQVTIAGAKL RSLNLGEVFI AQSKGLYRQC
IRGKEQLQLL MPLKAPKEVT FLEGDSHDTV
LTSEEVVLKN GELEALETPV DSFTNGAIVG
TPVCVNGLML LEIKDKEQYC ALSPGLLATN
NVFRLKGGAP IKGVTFGEDT VWEVQGYKNV
RITFELDERV DKVLNEKCSV YTVESGTEVT
EFACVVAEAV VKTLQPVSDL LTNMGIDLDE
WSVATFYLFD DAGEENFSSR MYCSFYPPDE
EEEDDAECEE EEIDETCEHE YGTEDDYQGL
PLEFGASAET VRVEEEEEED WLDDTTEQSE
IEPEPEPTPE EPVNQFTGYL KLTDNVAIKC
VDIVKEAQSA NPMVIVNAAN IHLKHGGGVA
GALNKATNGA MQKESDDYIK LNGPLTVGGS
CLLSGHNLAK KCLHVVGPNL NAGEDIQLLK
AAYENFNSQD ILLAPLLSAG IFGAKPLQSL
QVCVQTVRTQ VYIAVNDKAL YEQVVMDYLD
NLKPRVEAPK QEEPPNTEDS KTEEKSVVQK
PVDVKPKIKA CIDEVTTTLE ETKFLTNKLL
LFADINGKLY HDSQNMLRGE DMSFLEKDAP
YMVGDVITSG DITCVVIPSK KAGGTTEMLS
RALKKVPVDE YITTYPGQGC AGYTLEEAKT
ALKKCKSAFY VLPSEAPNAK EEILGTVSWN
LREMLAHAEE TRKLMPICMD VRAIMATIQR
KYKGIKIQEG IVDYGVRFFF YTSKEPVASI
ITKLNSLNEP LVTMPIGYVT HGFNLEEAAR
CMRSLKAPAV VSVSSPDAVT TYNGYLTSSS
KTSEEHFVET VSLAGSYRDW SYSGQRTELG
VEFLKRGDKI VYHTLESPVE FHLDGEVLSL
DKLKSLLSLR EVKTIKVFTT VDNTNLHTQL
VDMSMTYGQQ FGPTYLDGAD VTKIKPHVNH
EGKTFFVLPS DDTLRSEAFE YYHTLDESFL
GRYMSALNHT KKWKFPQVGG LTSIKWADNN
CYLSSVLLAL QQLEVKFNAP ALQEAYYRAR
AGDAANFCAL ILAYSNKTVG ELGDVRETMT
HLLQHANLES AKRVLNVVCK HCGQKTTTLT
GVEAVMYMGT LSYDNLKTGV SIPCVCGRDA
TQYLVQQESS FVMMSAPPAE YKLQQGTFLC
ANEYTGNYQC GHYTHITAKE TLYRIDGAHL
TKMSEYKGPV TDVFYKETSY TTTIKPVSYK
LDGVTYTEIE PKLDGYYKKD NAYYTEQPID
LVPTQPLPNA SFDNFKLTCS NTKFADDLNQ
MTGFTKPASR ELSVTFFPDL NGDVVAIDYR
HYSASFKKGA KLLHKPIVWH INQATTKTTF
KPNTWCLRCL WSTKPVDTSN SFEVLAVEDT
QGMDNLACES QQPTSEEVVE NPTIQKEVIE
CDVKTTEVVG NVILKPSDEG VKVTQELGHE
DLMAAYVENT SITIKKPNEL SLALGLKTIA
THGIAAINSV PWSKILAYVK PFLGQAAITT
SNCAKRLAQR VFNNYMPYVF TLLFQLCTFT
KSTNSRIRAS LPTTIAKNSV KSVAKLCLDA
GINYVKSPKF SKLFTIAMWL LLLSICLGSL
ICVTAAFGVL LSNFGAPSYC NGVRELYLNS
SNVTTMDFCE GSFPCSICLS GLDSLDSYPA
LETIQVTISS YKLDLTILGL AAEWVLAYML
FTKFFYLLGL SAIMQVFFGY FASHFISNSW
LMWFIISIVQ MAPVSAMVRM YIFFASFYYI
WKSYVHIMDG CTSSTCMMCY KRNRATRVEC
TTIVNGMKRS FYVYANGGRG FCKTHNWNCL
NCDTFCTGST FISDEVARDL SLQFKRPINP
TDQSSYIVDS VAVKNGALHL YFDKAGQKTY
ERHPLSHFVN LDNLRANNTK GSLPINVIVF
DGKSKCDESA SKSASVYYSQ LMCQPILLLD
QALVSDVGDS TEVSVKMFDA YVDTFSATFS
VPMEKLKALV ATAHSELAKG VALDGVLSTF
VSAARQGVVD TDVDTKDVIE CLKLSHHSDL
EVTGDSCNNF MLTYNKVENM TPRDLGACID
CNARHINAQV AKSHNVSLIW NVKDYMSLSE
QLRKQIRSAA KKNNIPFRLT CATTRQVVNV
ITTKISLKGG KIVSTCFKLM LKATLLCVLA
ALVCYIVMPV HTLSIHDGYT NEIIGYKAIQ
DGVTRDIIST DDCFANKHAG FDAWFSQRGG
SYKNDKSCPV VAAIITREIG FIVPGLPGTV
LRAINGDFLH FLPRVFSAVG NICYTPSKLI
EYSDFATSAC VLAAECTIFK DAMGKPVPYC
YDTNLLEGSI SYSELRPDTR YVLMDGSIIQ
FPNTYLEGSV RVVTTFDAEY CRHGTCERSE
VGICLSTSGR WVLNNEHYRA LSGVFCGVDA
MNLIANIFTP LVQPVGALDV SASVVAGGII
AILVTCAAYY FMKFRRVFGE YNHVVAANAL
LFLMSFTILC LVPAYSFLPG VYSVFYLYLT
FYFTNDVSFL AHLQWFAMFS PIVPFWITAI
YVFCISLKHC HWFFNNYLRK RVMFNGVTFS
TFEEAALCTF LLNKEMYLKL RSETLLPLTQ
YNRYLALYNK YKYFSGALDT TSYREAACCH
LAKALNDFSN SGADVLYQPP QTSITSAVLQ
SGFRKMAFPS GKVEGCMVQV TCGTTTLNGL
WLDDTVYCPR HVICTAEDML NPNYEDLLIR
KSNHSFLVQA GNVQLRVIGH SMQNCLLRLK
VDTSNPKTPK YKFVRIQPGQ TFSVLACYNG
SPSGVYQCAM RPNHTIKGSF LNGSCGSVGF
NIDYDCVSFC YMHHMELPTG VHAGTDLEGK
FYGPFVDRQT AQAAGTDTTI TLNVLAWLYA
AVINGDRWFL NRFTTTLNDF NLVAMKYNYE
PLTQDHVDIL GPLSAQTGIA VLDMCAALKE
LLQNGMNGRT ILGSTILEDE FTPFDVVRQC
SGVTFQGKFK KIVKGTHHWM LLTFLTSLLI
LVQSTQWSLF FFVYENAFLP FTLGIMAIAA
CAMLLVKHKH AFLCLFLLPS LATVAYFNMV
YMPASWVMRI MTWLELADTS LSGYRLKDCV
MYASALVLLI LMTARTVYDD AARRVWTLMN
VITLVYKVYY GNALDQAISM WALVISVTSN
YSGVVTTIMF LARAIVFVCV EYYPLLFITG
NTLQCIMLVY CFLGYCCCCY FGLFCLLNRY
FRLTLGVYDY LVSTQEFRYM NSQGLLPPKS
SIDAFKLNIK LLGIGGKPCI KVATVQSKMS
DVKCTSVVLL SVLQQLRVES SSKLWAQCVQ
LHNDILLAKD TTEAFEKMVS LLSVLLSMQG
AVDINRLCEE MLDNRATLQA IASEFSSLPS
YAAYATAQEA YEQAVANGDS EVVLKKLKKS
LNVAKSEFDR DAAMQRKLEK MADQAMTQMY
KQARSEDKRA KVTSAMQTML FTMLRKLDND
ALNNIINNAR DGCVPLNIIP LTTAAKLMVV
VPDYGTYKNT CDGNTFTYAS ALWEIQQVVD
ADSKIVQLSE INMDNSPNLA WPLIVTALRA
NSAVKLQNNE LSPVALRQMS CAAGTTQTAC
TDDNALAYYN NSKGGRFVLA LLSDHQDLKW
ARFPKSDGTG TIYTELEPPC RFVTDTPKGP
KVKYLYFIKG LNNLNRGMVL GSLAATVRLQ
AGNATEVPAN STVLSFCAFA VDPAKAYKDY
LASGGQPITN CVKMLCTHTG TGQAITVTPE
ANMDQESFGG ASCCLYCRCH IDHPNPKGFC
DLKGKYVQIP TTCANDPVGF TLRNTVCTVC
GMWKGYGCSC DQLREPLMQS ADASTFLNRV
CGVSAARLTP CGTGTSTDVV YRAFDIYNEK
VAGFAKFLKT NCCRFQEKDE EGNLLDSYFV
VKRHTMSNYQ HEETIYNLVK DCPAVAVHDF
FKFRVDGDMV PHISRQRLTK YTMADLVYAL
RHFDEGNCDT LKEILVTYNC CDDDYFNKKD
WYDFVENPDI LRVYANLGER VRQSLLKTVQ
FCDAMRDAGI VGVLTLDNQD LNGNWYDFGD
FVQVAPGCGV PIVDSYYSLL MPILTLTRAL
AAESHMDADL AKPLIKWDLL KYDFTEERLC
LFDRYFKYWD QTYHPNCINC LDDRCILHCA
NFNVLFSTVF PPTSFGPLVR KIFVDGVPFV
VSTGYHFREL GVVHNQDVNL HSSRLSFKEL
LVYAADPAMH AASGNLLLDK RTTCFSVAAL
TNNVAFQTVK PGNFNKDFYD FAVSKGFFKE
GSSVELKHFF FAQDGNAAIS DYDYYRYNLP
TMCDIRQLLF VVEVVDKYFD CYDGGCINAN
QVIVNNLDKS AGFPFNKWGK ARLYYDSMSY
EDQDALFAYT KRNVIPTITQ MNLKYAISAK
NRARTVAGVS ICSTMTNRQF HQKLLKSIAA
TRGATVVIGT SKFYGGWHNM LKTVYSDVET
PHLMGWDYPK CDRAMPNMLR IMASLVLARK
HNTCCNLSHR FYRLANECAQ VLSEMVMCGG
SLYVKPGGTS SGDATTAYAN SVFNICQAVT
ANVNALLSTD GNKIADKYVR NLQHRLYECL
YRNRDVDHEF VDEFYAYLRK HFSMMILSDD
AVVCYNSNYA AQGLVASIKN FKAVLYYQNN
VFMSEAKCWT ETDLTKGPHE FCSQHTMLVK
QGDDYVYLPY PDPSRILGAG CFVDDIVKTD
GTLMIERFVS LAIDAYPLTK HPNQEYADVF
HLYLQYIRKL HDELTGHMLD MYSVMLTNDN
TSRYWEPEFY EAMYTPHTVL QAVGACVLCN
SQTSLRCGAC IRRPFLCCKC CYDHVISTSH
KLVLSVNPYV CNAPGCDVTD VTQLYLGGMS
YYCKSHKPPI SFPLCANGQV FGLYKNTCVG
SDNVTDFNAI ATCDWTNAGD YILANTCTER
LKLFAAETLK ATEETFKLSY GIATVREVLS
DRELHLSWEV GKPRPPLNRN YVFTGYRVTK
NSKVQIGEYT FEKGDYGDAV VYRGTTTYKL
NVGDYFVLTS HTVMPLSAPT LVPQEHYVRI
TGLYPTLNIS DEFSSNVANY QKVGMQKYST
LQGPPGTGKS HFAIGLALYY PSARIVYTAC
SHAAVDALCE KALKYLPIDK CSRIIPARAR
VECFDKFKVN STLEQYVFCT VNALPETTAD
IVVFDEISMA TNYDLSVVNA RLRAKHYVYI
GDPAQLPAPR TLLTKGTLEP EYFNSVCRLM
KTIGPDMFLG TCRRCPAEIV DTVSALVYDN
KLKAHKDKSA QCFKMFYKGV ITHDVSSAIN
RPQIGVVREF LTRNPAWRKA VFISPYNSQN
AVASKILGLP TQTVDSSQGS EYDYVIFTQT
TETAHSCNVN RFNVAITRAK IGILCIMSDR
DLYDKLQFTS LEIPRRNVAT LQAENVTGLF
KDCSKIITGL HPTQAPTHLS VDIKFKTEGL
CVDIPGIPKD MTYRRLISMM GFKMNYQVNG
YPNMFITREE AIRHVRAWIG FDVEGCHATR
DAVGTNLPLQ LGFSTGVNLV AVPTGYVDTE
NNTEFTRVNA KPPPGDQFKH LIPLMYKGLP
WNVVRIKIVQ MLSDTLKGLS DRVVFVLWAH
GFELTSMKYF VKIGPERTCC LCDKRATCFS
TSSDTYACWN HSVGFDYVYN PFMIDVQQWG
FTGNLQSNHD QHCQVHGNAH VASCDAIMTR
CLAVHECFVK RVDWSVEYPI IGDELRVNSA
CRKVQHMVVK SALLADKFPV LHDIGNPKAI
KCVPQAEVEW KFYDAQPCSD KAYKIEELFY
SYATHHDKFT DGVCLFWNCN VDRYPANAIV
CRFDTRVLSN LNLPGCDGGS LYVNKHAFHT
PAFDKSAFTN LKQLPFFYYS DSPCESHGKQ
VVSDIDYVPL KSATCITRCN LGGAVCRHHA
NEYRQYLDAY NMMISAGFSL WIYKQFDTYN
LWNTFTRLQS LENVAYNVVN KGHFDGHAGE
APVSIINNAV YTKVDGIDVE IFENKTTLPV
NVAFELWAKR NIKPVPEIKI LNNLGVDIAA
NTVIWDYKRE APAHVSTIGV CTMTDIAKKP
TESACSSLTV LFDGRVEGQV DLFRNARNGV
LITEGSVKGL TPSKGPAQAS VNGVTLIGES
VKTQFNYFKK VDGIIQQLPE TYFTQSRDLE
DFKPRSQMET DFLELAMDEF IQRYKLEGYA
FEHIVYGDFS HGQLGGLHLM IGLAKRSQDS
PLKLEDFIPM DSTVKNYFIT DAQTGSSKCV
CSVIDLLLDD FVEIIKSQDL SVISKVVKVT
IDYAEISFML WCKDGHVETF YPKLQASRAW
QPGVAMPNLY KMQRMLLEKC DLQNYGENAV
IPKGIMMNVA KYTQLCQYLN TLTLAVPYNM
RVIHFGAGSD KGVAPGTAVL RQWLPTGTLL
VDSDLNDFVS DAYSTLIGDC ATVHTANKWD
LIISDMYDPR TKHVTKENDS KEGFFTYLCG
FIKQKLALGG SIAVKITEHS WNADLYKLMG
HFSWWTAFVT NVNASSSEAF LIGANYLGKP
KEQIDGYTMH ANYIFWRNTN PIQLSSYSLF
DMSKFPLKLR GTAVMSLKEN QINDMIYSLL
EKGRLIIREN NRVVVSSDIL VNN
>SARS_2 * orf1a polyprotein *
250..13398 * 29836495
MESLVLGVNE KTHVQLSLPV LQVRDVLVRG
FGDSVEEALS EAREHLKNGT CGLVELEKGV
LPQLEQPYVF IKRSDALSTN HGHKVVELVA
EMDGIQYGRS GITLGVLVPH VGETPIAYRN
VLLRKNGNKG AGGHSYGIDL KSYDLGDELG
TDPIEDYEQN WNTKHGSGAL RELTRELNGG
AVTRYVDNNF CGPDGYPLDC IKDFLARAGK
SMCTLSEQLD YIESKRGVYC CRDHEHEIAW
FTERSDKSYE HQTPFEIKSA KKFDTFKGEC
PKFVFPLNSK VKVIQPRVEK KKTEGFMGRI
RSVYPVASPQ ECNNMHLSTL MKCNHCDEVS
WQTCDFLKAT CEHCGTENLV IEGPTTCGYL
PTNAVVKMPC PACQDPEIGP EHSVADYHNH
SNIETRLRKG GRTRCFGGCV FAYVGCYNKR
AYWVPRASAD IGSGHTGITG DNVETLNEDL
LEILSRERVN INIVGDFHLN EEVAIILASF
SASTSAFIDT IKSLDYKSFK TIVESCGNYK
VTKGKPVKGA WNIGQQRSVL TPLCGFPSQA
AGVIRSIFAR TLDAANHSIP DLQRAAVTIL
DGISEQSLRL VDAMVYTSDL LTNSVIIMAY
VTGGLVQQTS QWLSNLLGTT VEKLRPIFEW
IEAKLSAGVE FLKDAWEILK FLITGVFDIV
KGQIQVASDN IKDCVKCFID VVNKALEMCI
DQVTIAGAKL RSLNLGEVFI AQSKGLYRQC
IRGKEQLQLL MPLKAPKEVT FLEGDSHDTV
LTSEEVVLKN GELEALETPV DSFTNGAIVG
TPVCVNGLML LEIKDKEQYC ALSPGLLATN
NVFRLKGGAP IKGVTFGEDT VWEVQGYKNV
RITFELDERV DKVLNEKCSV YTVESGTEVT
EFACVVAEAV VKTLQPVSDL LTNMGIDLDE
WSVATFYLFD DAGEENFSSR MYCSFYPPDE
EEEDDAECEE EEIDETCEHE YGTEDDYQGL
PLEFGASAET VRVEEEEEED WLDDTTEQSE
IEPEPEPTPE EPVNQFTGYL KLTDNVAIKC
VDIVKEAQSA NPMVIVNAAN IHLKHGGGVA
GALNKATNGA MQKESDDYIK LNGPLTVGGS
CLLSGHNLAK KCLHVVGPNL NAGEDIQLLK
AAYENFNSQD ILLAPLLSAG IFGAKPLQSL
QVCVQTVRTQ VYIAVNDKAL YEQVVMDYLD
NLKPRVEAPK QEEPPNTEDS KTEEKSVVQK
PVDVKPKIKA CIDEVTTTLE ETKFLTNKLL
LFADINGKLY HDSQNMLRGE DMSFLEKDAP
YMVGDVITSG DITCVVIPSK KAGGTTEMLS
RALKKVPVDE YITTYPGQGC AGYTLEEAKT
ALKKCKSAFY VLPSEAPNAK EEILGTVSWN
LREMLAHAEE TRKLMPICMD VRAIMATIQR
KYKGIKIQEG IVDYGVRFFF YTSKEPVASI
ITKLNSLNEP LVTMPIGYVT HGFNLEEAAR
CMRSLKAPAV VSVSSPDAVT TYNGYLTSSS
KTSEEHFVET VSLAGSYRDW SYSGQRTELG
VEFLKRGDKI VYHTLESPVE FHLDGEVLSL
DKLKSLLSLR EVKTIKVFTT VDNTNLHTQL
VDMSMTYGQQ FGPTYLDGAD VTKIKPHVNH
EGKTFFVLPS DDTLRSEAFE YYHTLDESFL
GRYMSALNHT KKWKFPQVGG LTSIKWADNN
CYLSSVLLAL QQLEVKFNAP ALQEAYYRAR
AGDAANFCAL ILAYSNKTVG ELGDVRETMT
HLLQHANLES AKRVLNVVCK HCGQKTTTLT
GVEAVMYMGT LSYDNLKTGV SIPCVCGRDA
TQYLVQQESS FVMMSAPPAE YKLQQGTFLC
ANEYTGNYQC GHYTHITAKE TLYRIDGAHL
TKMSEYKGPV TDVFYKETSY TTTIKPVSYK
LDGVTYTEIE PKLDGYYKKD NAYYTEQPID
LVPTQPLPNA SFDNFKLTCS NTKFADDLNQ
MTGFTKPASR ELSVTFFPDL NGDVVAIDYR
HYSASFKKGA KLLHKPIVWH INQATTKTTF
KPNTWCLRCL WSTKPVDTSN SFEVLAVEDT
QGMDNLACES QQPTSEEVVE NPTIQKEVIE
CDVKTTEVVG NVILKPSDEG VKVTQELGHE
DLMAAYVENT SITIKKPNEL SLALGLKTIA
THGIAAINSV PWSKILAYVK PFLGQAAITT
SNCAKRLAQR VFNNYMPYVF TLLFQLCTFT
KSTNSRIRAS LPTTIAKNSV KSVAKLCLDA
GINYVKSPKF SKLFTIAMWL LLLSICLGSL
ICVTAAFGVL LSNFGAPSYC NGVRELYLNS
SNVTTMDFCE GSFPCSICLS GLDSLDSYPA
LETIQVTISS YKLDLTILGL AAEWVLAYML
FTKFFYLLGL SAIMQVFFGY FASHFISNSW
LMWFIISIVQ MAPVSAMVRM YIFFASFYYI
WKSYVHIMDG CTSSTCMMCY KRNRATRVEC
TTIVNGMKRS FYVYANGGRG FCKTHNWNCL
NCDTFCTGST FISDEVARDL SLQFKRPINP
TDQSSYIVDS VAVKNGALHL YFDKAGQKTY
ERHPLSHFVN LDNLRANNTK GSLPINVIVF
DGKSKCDESA SKSASVYYSQ LMCQPILLLD
QALVSDVGDS TEVSVKMFDA YVDTFSATFS
VPMEKLKALV ATAHSELAKG VALDGVLSTF
VSAARQGVVD TDVDTKDVIE CLKLSHHSDL
EVTGDSCNNF MLTYNKVENM TPRDLGACID
CNARHINAQV AKSHNVSLIW NVKDYMSLSE
QLRKQIRSAA KKNNIPFRLT CATTRQVVNV
ITTKISLKGG KIVSTCFKLM LKATLLCVLA
ALVCYIVMPV HTLSIHDGYT NEIIGYKAIQ
DGVTRDIIST DDCFANKHAG FDAWFSQRGG
SYKNDKSCPV VAAIITREIG FIVPGLPGTV
LRAINGDFLH FLPRVFSAVG NICYTPSKLI
EYSDFATSAC VLAAECTIFK DAMGKPVPYC
YDTNLLEGSI SYSELRPDTR YVLMDGSIIQ
FPNTYLEGSV RVVTTFDAEY CRHGTCERSE
VGICLSTSGR WVLNNEHYRA LSGVFCGVDA
MNLIANIFTP LVQPVGALDV SASVVAGGII
AILVTCAAYY FMKFRRVFGE YNHVVAANAL
LFLMSFTILC LVPAYSFLPG VYSVFYLYLT
FYFTNDVSFL AHLQWFAMFS PIVPFWITAI
YVFCISLKHC HWFFNNYLRK RVMFNGVTFS
TFEEAALCTF LLNKEMYLKL RSETLLPLTQ
YNRYLALYNK YKYFSGALDT TSYREAACCH
LAKALNDFSN SGADVLYQPP QTSITSAVLQ
SGFRKMAFPS GKVEGCMVQV TCGTTTLNGL
WLDDTVYCPR HVICTAEDML NPNYEDLLIR
KSNHSFLVQA GNVQLRVIGH SMQNCLLRLK
VDTSNPKTPK YKFVRIQPGQ TFSVLACYNG
SPSGVYQCAM RPNHTIKGSF LNGSCGSVGF
NIDYDCVSFC YMHHMELPTG VHAGTDLEGK
FYGPFVDRQT AQAAGTDTTI TLNVLAWLYA
AVINGDRWFL NRFTTTLNDF NLVAMKYNYE
PLTQDHVDIL GPLSAQTGIA VLDMCAALKE
LLQNGMNGRT ILGSTILEDE FTPFDVVRQC
SGVTFQGKFK KIVKGTHHWM LLTFLTSLLI
LVQSTQWSLF FFVYENAFLP FTLGIMAIAA
CAMLLVKHKH AFLCLFLLPS LATVAYFNMV
YMPASWVMRI MTWLELADTS LSGYRLKDCV
MYASALVLLI LMTARTVYDD AARRVWTLMN
VITLVYKVYY GNALDQAISM WALVISVTSN
YSGVVTTIMF LARAIVFVCV EYYPLLFITG
NTLQCIMLVY CFLGYCCCCY FGLFCLLNRY
FRLTLGVYDY LVSTQEFRYM NSQGLLPPKS
SIDAFKLNIK LLGIGGKPCI KVATVQSKMS
DVKCTSVVLL SVLQQLRVES SSKLWAQCVQ
LHNDILLAKD TTEAFEKMVS LLSVLLSMQG
AVDINRLCEE MLDNRATLQA IASEFSSLPS
YAAYATAQEA YEQAVANGDS EVVLKKLKKS
LNVAKSEFDR DAAMQRKLEK MADQAMTQMY
KQARSEDKRA KVTSAMQTML FTMLRKLDND
ALNNIINNAR DGCVPLNIIP LTTAAKLMVV
VPDYGTYKNT CDGNTFTYAS ALWEIQQVVD
ADSKIVQLSE INMDNSPNLA WPLIVTALRA
NSAVKLQNNE LSPVALRQMS CAAGTTQTAC
TDDNALAYYN NSKGGRFVLA LLSDHQDLKW
ARFPKSDGTG TIYTELEPPC RFVTDTPKGP
KVKYLYFIKG LNNLNRGMVL GSLAATVRLQ
AGNATEVPAN STVLSFCAFA VDPAKAYKDY
LASGGQPITN CVKMLCTHTG TGQAITVTPE
ANMDQESFGG ASCCLYCRCH IDHPNPKGFC
DLKGKYVQIP TTCANDPVGF TLRNTVCTVC
GMWKGYGCSC DQLREPLMQS ADASTFLNGF
AV
>SARS_3 * putative E2 glycoprotein
precursor * 21477..25244 * 29836496
MFIFLLFLTL TSGSDLDRCT TFDDVQAPNY
TQHTSSMRGV YYPDEIFRSD TLYLTQDLFL
PFYSNVTGFH TINHTFGNPV IPFKDGIYFA
ATEKSNVVRG WVFGSTMNNK SQSVIIINNS
TNVVIRACNF ELCDNPFFAV SKPMGTQTHT
MIFDNAFNCT FEYISDAFSL DVSEKSGNFK
HLREFVFKNK DGFLYVYKGY QPIDVVRDLP
SGFNTLKPIF KLPLGINITN FRAILTAFSP
AQDIWGTSAA AYFVGYLKPT TFMLKYDENG
TITDAVDCSQ NPLAELKCSV KSFEIDKGIY
QTSNFRVVPS GDVVRFPNIT NLCPFGEVFN
ATKFPSVYAW ERKKISNCVA DYSVLYNSTF
FSTFKCYGVS ATKLNDLCFS NVYADSFVVK
GDDVRQIAPG QTGVIADYNY KLPDDFMGCV
LAWNTRNIDA TSTGNYNYKY RYLRHGKLRP
FERDISNVPF SPDGKPCTPP ALNCYWPLND
YGFYTTTGIG YQPYRVVVLS FELLNAPATV
CGPKLSTDLI KNQCVNFNFN GLTGTGVLTP
SSKRFQPFQQ FGRDVSDFTD SVRDPKTSEI
LDISPCAFGG VSVITPGTNA SSEVAVLYQD
VNCTDVSTAI HADQLTPAWR IYSTGNNVFQ
TQAGCLIGAE HVDTSYECDI PIGAGICASY
HTVSLLRSTS QKSIVAYTMS LGADSSIAYS
NNTIAIPTNF SISITTEVMP VSMAKTSVDC
NMYICGDSTE CANLLLQYGS FCTQLNRALS
GIAAEQDRNT REVFAQVKQM YKTPTLKYFG
GFNFSQILPD PLKPTKRSFI EDLLFNKVTL
ADAGFMKQYG ECLGDINARD LICAQKFNGL
TVLPPLLTDD MIAAYTAALV SGTATAGWTF
GAGAALQIPF AMQMAYRFNG IGVTQNVLYE
NQKQIANQFN KAISQIQESL TTTSTALGKL
QDVVNQNAQA LNTLVKQLSS NFGAISSVLN
DILSRLDKVE AEVQIDRLIT GRLQSLQTYV
TQQLIRAAEI RASANLAATK MSECVLGQSK
RVDFCGKGYH LMSFPQAAPH GVVFLHVTYV
PSQERNFTTA PAICHEGKAY FPREGVFVFN
GTSWFITQRN FFSPQIITTD NTFVSGNCDV
VIGIINNTVY DPLQPELDSF KEELDKYFKN
HTSPDVDLGD ISGINASVVN IQKEIDRLNE
VAKNLNESLI DLQELGKYEQ YIKWPWYVWL
GFIAGLIAIV MVTILLCCMT SCCSCLKGAC
SCGSCCKFDE DDSEPVLKGV KLHYT
>SARS_4 * putative uncharacterized
protein * 25253..26077 * 29836497
MDLFMRFFTL GSITAQPVKI DNASPASTVH
ATATIPLQAS LPFGWLVIGV AFLAVFQSAT
KIIALNKRWQ LALYKGFQFI CNLLLLFVTI
YSHLLLVAAG MEAQFLYLYA LIYFLQCINA
CRIIMRCWLC WKCKSKNPLL YDANYFVCWH
THNYDYCIPY NSVTDTIVVT EGDGISTPKL
KEDYQIGGYS EDRHSGVKDY VVVHGYFTEV
YYQLESTQIT TDTGIENATF FIFNKLVKDP
PNVQIHTIDG SSGVANPAMD PIYDEPTTTT
SVPL
>SARS_5 * putative uncharacterized
protein * 25674..26138 * 29836498
MMPTTLFAGT HITMTTVYHI TVSQIQLSLL
KVTAFQHQNS KKTTKLVVIL RIGTQVLKTM
SLYMAISPKF TTSLSLHKLL QTLVLKMLHS
SSLTSLLKTH RMCKYTQSTA LQELLIQQWI
QFMMSRRRLL ACLCKHKKVS TNLCTHSFRK
KQVR
>SARS_6 * putative small envelope
protein E * 26102..26332 * 29836499
MYSFVSEETG TLIVNSVLLF LAFVVFLLVT
LAILTALRLC AYCCNIVNVS LVKPTVYVYS
RVKNLNSSEG VPDLLV
>SARS_7 * putative protein M *
26383..27048 * 29836504
MADNGTITVE ELKQLLEQWN LVIGFLFLAW
IMLLQFAYSN RNRFLYIIKL VFLWLLWPVT
LACFVLAAVY RINWVTGGIA IAMACIVGLM
WLSYFVASFR LFARTRSMWS FNPETNILLN
VPLRGTIVTR PLMESELVIG AVIIRGHLRM
AGHSLGRCDI KDLPKEITVA TSRTLSYYKL
GASQRVGTDS GFAAYNRYRI GNYKLNTDHA
GSNDNIALLV Q
>SARS_8 * putative uncharacterized
protein * 27059..27250 * 29836500
MFHLVDFQVT IAEILIIIMR TFRIAIWNLD
VIISSIVRQL FKPLTKKNYS ELDDEEPMEL
DYP
>SARS_9 * putative uncharacterized
protein * 27258..27626 * 29836501
MKIILFLTLI VFTSCELYHY QECVRGTTVL
LKEPCPSGTY EGNSPFHPLA DNKFALTCTS
THFAFACADG TRHTYQLRAR SVSPKLFIRQ
EEVQQELYSP LFLIVAALVF LILCFTIKRK
TE
>SARS_10 * putative nucleocapsid
protein * 28105..29373 * 29836503
MSDNGPQSNQ RSAPRITFGG PTDSTDNNQN
GGRNGARPKQ RRPQGLPNNT ASWFTALTQH
GKEELRFPRG QGVPINTNSG PDDQIGYYRR
ATRRVRGGDG KMKELSPRWY FYYLGTGPEA
SLPYGANKEG IVWVATEGAL NTPKDHIGTR
NPNNNAATVL QLPQGTTLPK GFYAEGSRGG
SQASSRSSSR SRGNSRNSTP GSSRGNSPAR
MASGGGETAL ALLLLDRLNQ LESKVSGKGQ
QQQGQTVTKK SAAEASKKPR QKRTATKQYN
VTQAFGRRGP EQTQGNFGDQ DLIRQGTDYK
HWPQIAQFAP SASAFFGMSR IGMEVTPSGT
WLTYHGAIKL DDKDPQFKDN VILLNKHIDA
YKTFPPTEPK KDKKKKTDEA QPLPQRQKKQ
PTVTLLPAAD MDDFSRQLQN SMSGASADST
QA
>SARS_11 * putative uncharacterized
protein * 28115..28411 * 29836502
MDPNQTNVVP PALHLVDPQI QLTITRMEDA
MGQGQNSADP KVYPIILRLG SQLSLSMARR
NLDSLEARAF QSTPIVVQMT KLATTEELPD
EFVVVTAK
a Find out with which other completely
sequenced Nidovirales SARS has homologs
(make a table).
Hint: The levels of sequence identity might be
quite low; when in doubt, do reciprocal
PSI-Blast searches. So that you know what to
look for in your Blast output, here are
the sequenced Nidovirales:
Avian infectious bronchitis virus
Bovine coronavirus
Equine arteritis virus
Human coronavirus 229E
Lactate dehydrogenase-elevating virus
Murine hepatitis virus
Porcine epidemic diarrhea virus
Porcine reproductive and respiratory syndrome virus
SARS coronavirus
Simian hemorrhagic fever virus
Transmissible gastroenteritis virus
b Make a phylogeny of the Nidovirales based
on the proteins that they all share.
c Is the sequence-based phylogeny consistent
with the species with which the SARS virus
shares the most genes?
d If the answer to c is 'yes', can you think of a
pitfall of that result?
Hint: Are we able to detect all sequence
homologies by sequence comparison?
And how does that depend on the
evolutionary distance between the
sequences?
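The homolog table asked for in question 1 can be kept as a simple presence/absence matrix. Below is a minimal Python sketch, assuming you have already decided by hand which Blast hits count as homologs. The function names and the toy data (`protein_A`, `virus_1`, and so on) are placeholders invented for illustration; they are not real results.

```python
def shared_by_all(hits, species):
    """Return the query proteins that have a homolog in every listed species."""
    return sorted(p for p, found in hits.items() if set(species) <= found)


def genes_shared_per_species(hits, species):
    """Count, per species, how many query proteins have a homolog there."""
    return {s: sum(s in found for found in hits.values()) for s in species}


def format_table(hits, species):
    """Render the presence/absence table as tab-separated text."""
    lines = ["protein\t" + "\t".join(species)]
    for protein in sorted(hits):
        marks = ["+" if s in hits[protein] else "-" for s in species]
        lines.append(protein + "\t" + "\t".join(marks))
    return "\n".join(lines)


# Placeholder data only: replace with your own curated Blast results.
toy_species = ["virus_1", "virus_2", "virus_3"]
toy_hits = {
    "protein_A": {"virus_1", "virus_2", "virus_3"},
    "protein_B": {"virus_1", "virus_2"},
    "protein_C": {"virus_1"},
}

if __name__ == "__main__":
    print(format_table(toy_hits, toy_species))
    print("shared by all:", shared_by_all(toy_hits, toy_species))
    print("genes shared:", genes_shared_per_species(toy_hits, toy_species))
```

The proteins returned by `shared_by_all` are the candidates for the phylogeny in part b, and the per-species counts are what part c compares against that phylogeny.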
2 A question about horizontal gene transfer.
For the following sequence:
cmntqapiae atkkavsmgp
ekvieevfks nlvgrggagf rtgkkwesay ktpasdkyvv cnadeglpst
ykdwcllnne
akrkevftgm gicaktigak rcfmylryey rnlvpaleqs ikdvqstcpe
ladlkyeirl
gggpyvagee naqfesiegr aplprkdrpg nifptmeglf hkptvinnve
tffaiphiiq
qgsqsfgegk mpkllsvtgd vdepilietn lnnyslnhll qeisakdiva
aeiggctepi
ifgskfdtlf gfgrgtlnav gsvvlfnssc dlgkiyenkl kfmaeesckq
cvpcrdgsyi
fhrafkelrd tgkssynmra lavasesaar ssicahgkal eslfksacdf
mnktkpiyqp
hstyhq
a From which species is it? Which
domain does it contain?
b Note that the sequence above is part of a
larger protein. Here, however, we are
specifically interested in the evolutionary
history of this part of the protein itself.
The species which this protein comes from
is a ciliate, a phylum of the eukaryotes that
contains amongst others “Paramecium”
(the slipper animalcule). Is the protein above
closely related to proteins from other ciliates?
Hint: You could make a tree with the best
Blast hits. For the purpose of this
question it is sufficient to let the Blast
tool do this for you. Click on “distance
tree of results” to obtain a phylogeny of
the sequences. You can zoom in on the tree
by mousing over it and clicking on
“select subtree”. The query sequence is
all the way at the bottom of the page.
c Does Nyctotherus have a “distant” homolog
of this protein (one that is less than 50%
identical)?
Hint: In order to find that in the Blast results
of the first protein you have to make sure
that you select “1000 alignments” and
“1000 descriptions” in the Blast page
where you enter the sequence.
d Is this protein (the homolog) closely related
to proteins from other ciliates?
Hint: You will have to do a second Blast
search, plus a tree.
e Are both Nyctotherus versions of this gene
the result of a recent gene duplication? Or
do they have different origins? How do you
think these proteins ended up in the
genome? Horizontal Gene Transfer or
Vertical Inheritance?
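The "less than 50% identical" criterion in question 2 can be checked directly on a pairwise alignment. Below is a minimal Python sketch, assuming the two sequences are already aligned (equal length, gaps written as `-`). The function name and the short toy alignment are invented for illustration. Other conventions exist, for example dividing by the full alignment length including gap columns, so the number may differ slightly from what a tool reports.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # skip gap columns (an assumption of this sketch)
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


if __name__ == "__main__":
    # Toy alignment, not taken from the exercise sequences.
    print(percent_identity("MKV-LA", "MRVTLA"))
```

In the toy alignment five columns are compared and four are identical.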
3 In contrast to what one would expect,
hyperthermophiles (species that live at a
temperature of 80 °C or higher) do not
always have a high GC level in their DNA
to stabilize it. Instead they have found
another solution to stabilize their DNA.
By combining comparative genome analysis
with an examination of the experimental
literature, we can identify candidate proteins
for this.
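To make "GC level" concrete: it is the fraction of G and C bases in a DNA sequence. Below is a minimal Python sketch, assuming that ambiguous symbols such as N should be ignored. The function name and the toy sequences are invented for illustration.

```python
def gc_content(dna):
    """Fraction of G and C among the unambiguous bases A, C, G and T."""
    seq = dna.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("no A/C/G/T bases found")
    return (seq.count("G") + seq.count("C")) / counted


if __name__ == "__main__":
    # Toy sequence: 4 of the 6 bases are G or C.
    print(round(gc_content("ATGCGC"), 3))
```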
a Find the genes that are shared between
nearly all hyperthermophiles in the COG
database and that are not shared by most of
the non-thermophiles.
Hint: Hyperthermophiles are: A. fulgidus, A.
pernix, M. jannaschii, P. horikoshii, P.
abyssi, A. aeolicus and T. maritima.
Examples of non-thermophiles are E.
coli, B. subtilis, Synechocystis,
M. tuberculosis, and H. influenzae. Use
the “old” COG database:
www.ncbi.nlm.nih.gov/COG/old and
then select “phylogenetic patterns
search”.
b Interpret the functions of the proteins.
Hint: Search in Pubmed at
www.ncbi.nlm.nih.gov/entrez.
c Which one is likely responsible for the
stabilization of the DNA?
d What is the mechanism of DNA
stabilization?
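The phylogenetic pattern search in question 3a amounts to a filter on a presence/absence table. Below is a minimal Python sketch of that idea, not the COG website's own implementation. The species lists come from the hints above. The thresholds (`max_missing`, `max_present`), the function names, and the `COG_example_*` entries are assumptions made for illustration.

```python
HYPERTHERMOPHILES = [
    "A. fulgidus", "A. pernix", "M. jannaschii", "P. horikoshii",
    "P. abyssi", "A. aeolicus", "T. maritima",
]
NON_THERMOPHILES = [
    "E. coli", "B. subtilis", "Synechocystis",
    "M. tuberculosis", "H. influenzae",
]


def matches_pattern(present_in, max_missing=1, max_present=1):
    """True if a COG is in nearly all hyperthermophiles and few others."""
    present = set(present_in)
    missing = sum(s not in present for s in HYPERTHERMOPHILES)
    extra = sum(s in present for s in NON_THERMOPHILES)
    return missing <= max_missing and extra <= max_present


def candidate_cogs(cog_table, max_missing=1, max_present=1):
    """Return the sorted COG identifiers whose species set fits the pattern."""
    return sorted(
        cog for cog, species in cog_table.items()
        if matches_pattern(species, max_missing, max_present)
    )


# Hypothetical table: COG identifier -> species in which it occurs.
toy_table = {
    "COG_example_1": set(HYPERTHERMOPHILES),
    "COG_example_2": set(HYPERTHERMOPHILES) | set(NON_THERMOPHILES),
    "COG_example_3": (set(HYPERTHERMOPHILES) - {"A. pernix"}) | {"E. coli"},
    "COG_example_4": {"A. fulgidus", "P. abyssi", "T. maritima"},
}

if __name__ == "__main__":
    print(candidate_cogs(toy_table))
```

Loosening or tightening the two thresholds is one way to express "nearly all" and "most of"; which values are appropriate is a judgment call.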
4 An important part of comparative genome
analysis is the interpretation of the results in
light of the biological context. Below are
two sequences that are shared between the
parasitic species R. prowazekii and E.
cuniculi.
>E. cuniculi
MNEVENNNHS FPREDIPTED EIEEEANSRQ GILRYFRVAR AEYTKFALLG
LMFGIIGFIY
SFMRILKDMF VMVRQEPTTI LFIKIFYILP VSMALVFLIQ YMLGTKTVSR
IFSIFCGGFA
SLFFLCGAVF LIEEQVSPSK FLFRDMFIDG KMSSRSLNVF KSMFLTLNEP
LATIVFISAE
MWGSLVLSYL FLSFLNESCT IRQFSRFIPP LIIITNVSLF LSATVAGAFF
KLREKLAFQQ
NQVLLSGIFI FQGFLVVLVI FLKIYLERVT MKRPLFIVSS GSRRKKAKAN
VSFSEGLEIM
SQSKLLLAMS LIVLFFNISY NMVESTFKVG VKVAAEYFNE EKGKYSGKFN
RIDQYMTSVV
VICLNLSPFS SYVETRGFLL VGLITPIVTL MAIVLFLGSA LYNTSMEESG
LGIVNGLFPG
GKPLYVLENY FGVIFMSLLK ITKYSAFDIC KEKLGMRINP TYRARFKSVY
DGIFGKLGKS
IGSIYGLLMF EALDTEDLRK ATPITAGIIF IFIVMWVKAI IYLSRSYESA
VQHNRDVDID
MTEKAKKSLE TPEEPKVVD
> R.prowazekii
MSTSKSENYL SELRKIIWPI EQYENKKFLP LAFMMFCILL NYSTLRSIKD
GFVVTDIGTE
SISFLKTYIV LPSAVIAMII YVKLCDILKQ ENVFYVITSF FLGYFALFAF
VLYPYPDLVH
PDHKTIESLS LAYPNFKWFI KIVGKWSFAS FYTIAELWGT MMLSLLFWQF
ANQITKIAEA
KRFYSMFGLL ANLALPVTSV VIGYFLHEKT QIVAEHLKFV PLFVIMITSS
FLIILTYRWM
NKNVLTDPRL YDPALVKEKK TKAKLSFIES LKMIFTSKYV GYIALLIIAY
GVSVNLVEGV
WKSKVKELYP TKEAYTIYMG QFQFYQGWVA IAFMLIGSNI LRKVSWLTAA
MITPLMMFIT
GAAFFSFIFF DSVIAMNLTG ILASSPLTLA VMIGMIQNVL SKGVKYSLFD
ATKNMAYIPL
DKDLRVKGQA AVEVIGGRLG KSGGAIIQST FFILFPVFGF IEATPYFASI
FFIIVILWIF
AVKGLNKEYQ VLVNKNEK
a What is their function?
b Can you relate their function to the parasitic
lifestyle of the species that harbor them?
c How many other homologous sequences
can you discover in either genome?
d In which other species do these proteins
occur?
e What, if anything, do they do in these
species?
5 The gene for methylcitrate synthase has
only been discovered relatively recently. It
is homologous to citrate synthase, and
wrongly annotated in the COG database
(COG0372). Below are a number of protein
sequences, some of which are citrate
synthases, some are methyl-citrate
synthases, while for others there is no
experimental evidence regarding their
function.
>E.coli methylcitrate_synthase
MSDTTILQNS THVIKPKKSV ALSGVPAGNT ALCTVGKSGN DLHYRGYDIL
DLAKHCEFEE
VAHLLIHGKL PTRDELAAYK TKLKALRGLP ANVRTVLEAL PAASHPMDVM
RTGVSALGCT
LPEKEGHTVS GARDIADKLL ASLSSILLYW YHYSHNGERI QPETDDDSIG
GHFLHLLHGE
KPSQSWEKAM HISLVLYAEH EFNASTFTSR VIAGTGSDMY SAIIGAIGAL
RGPKHGGANE
VSLEIQQRYE TPDEAEADIR KRVENKEVVI GFGHPVYTIA DPRHQVIKRV
AKQLSQEGGS
LKMYNIADRL ETVMWESKKM FPNLDWFSAV SYNMMGVPTE MFTPLFVIAR
VTGWAAHIIE
QRQDNKIIRP SANYVGPEDR PFVALDKRQ
>Ralstonia methylcitrate_synthase
MSEAQPLVTP KPKKSVALSG VTAGNTALCT VGRTGNDLHY RGYDILDIAE
TCEFEEIAHL
LVHGKLPTKS ELAAYKAKLK SLRGLPANVK AALEWVPASA HPMDVMRTGV
SVLGTVLPEK
EDHNTPGARD IADRLMASLG SMLLYWYHYS HNGRRIEVET DDDSIGGHFL
HLLHGEKPSA
LWERAMNTSL NLYAEHEFNA STFTARVIAG TGSDMYSSIS GAIGALRGPK
HGGANEVAFE
IQKRYDNPDE AQADITRRVE NKEVVIGFGH PVYTTGDPRN QVIKEVAKKL
SKDAGSMKMF
DIAEALETVM WDIKKMFPNL DWFSAVSYHM MGVPTAMFTA LFVIARTSGW
AAHIIEQRID
NKIIRQSANY TGPENLKFVP LKDRK
>Corynebacterium methylcitrate synthase
MSSATTTDVR KGLYGVIADY TAVSKVMPET NSLTYRGYAV EDLVENCSFE
EVFYLLWHGE
LPTAQQLAEF NERGRSYRSL DAGLISLIHS LPKEAHPMDV MRTAVSYMGT
KDSEYFTTDS
EHIRKVGHTL LAQLPMVLAM DIRRRKGLDI IAPDSSKSVA ENLLSMVFGT
GPESPASNPA
DVRDFEKSLI LYAEHSFNAS TFTARVITST KSDVYSAITG AIGALKGPLH
GGANEFVMHT
MLAIDDPNKA AAWINNALDN KNVVMGFGHR VYKRGDSRVP SMEKSFRELA
ARHDGEKWVA
MYENMRDAMD ARTGIKPNLD FPAGPAYHLL GFPVDFFTPL FVIARVAGWT
AHIVEQYENN
SLIRPLSEYN GEEQREVAPI EKR
>E.coli citrate synthase
MADTKAKLTL NGDTAVELDV LKGTLGQDVI DIRTLGSKGV FTFDPGFTST
ASCESKITFI
DGDEGILLHR GFPIDQLATD SNYLEVCYIL LNGEKPTQEQ YDEFKTTVTR
HTMIHEQITR
LFHAFRRDSH PMAVMCGITG ALAAFYHDSL DVNNPRHREI AAFRLLSKMP
TMAAMCYKYS
IGQPFVYPRN DLSYAGNFLN MMFSTPCEPY EVNPILERAM DRILILHADH
EQNASTSTVR
TAGSSGANPF ACIAAGIASL WGPAHGGANE AALKMLEEIS SVKHIPEFVR
RAKDKNDSFR
LMGFGHRVYK NYDPRATVMR ETCHEVLKEL GTKDDLLEVA MELENIALND
PYFIEKKLYP
NVDFYSGIIL KAMGIPSSMF TVIFAMARTV GWIAHWSEMH SDGMKIARPR
QLYTGYEKRD
FKSDIKR
>B.subtilis citrate synthase
MVHYGLKGIT CVETSISHID GEKGRLIYRG HHAKDIALNH SFEEAAYLIL
FGKLPSTEEL
QVFKDKLAAE RNLPEHIERL IQSLPNNMDD MSVVRTVVSA LGENTYTFHP
KTEEAIRLIA
ITPSIIAYRK RWTRGEQAIA PSSQYGHVEN YYYMLTGEQP SEAKKKALET
YMILATEHGM
NASTFSARVT LSTESDLVSA VTAALGTMKG PLHGGAPSAV TKMLEDIGEK
EHAEAYLKEK
LEKGERLMGF GHRVYKTKDP RAEALRQKAE EVAGNDRDLD LALHVEAEAI
RLLEIYKPGR
KLYTNVEFYA AAVMRAIDFD DELFTPTFSA SRMVGWCAHV LEQAENNMIF
RPSAQYTGAI
PEEVLS
>A.fulgidus
MKDGLEDVIA CKTTISRIAL ENGRAILEYR GYDIRDLARK ASYEEVAYLL
LYGELPKKYE
LQDFKIELAE RRELPPQIIG LLTHLPPYTH PMVVLRTATS YLGSLDKKIA
VRTREETFNK
AKDLIAKFPT IVAYYHRIRT GRNIIPPALE FSHAANFLYM LHGEEPTKTA
ERALDMDLIL
HAEHELNAST FAARIAASTL ADIYACVVAA TGTLMGPLHG GAAQEVMRML
REVASPRRAE
EYVKRKIEAG ERIMGFGHRV YRGVMDPRAE LLRYLAKRLA AEGSTKWFEI
SEAIAKAAYK
YKKLLPNVDF YSASVYANLG IPDDLFVNIF AMGRISGWTA HIIEQYENNR
LIRPRAEYVG
EKEKKFIPLS KR
>Synechocystis
MNYMMTDNEV FKEGLAGVPA AKSRVSHVDG TDGILEYRGI RIEELAKSSS
FIEVAYLLIW
GKLPTQAEIE EFEYEIRTHR RIKYHIRDMM KCFPETGHPM DALQTSAAAL
GLFYARRALD
DPKYIRAAVV RLLAKIPTMV AAFHMIREGN DPIQPNDKLD YASNFLYMLT
EKEPDPFAAK
VFDVCLTLHA EHTMNASTFS ARVTASTLTD PYAVVASAVG TLAGPLHGGA
NEEVLNMLEE
IGSVENVRPY VEKCLANKQR IMGFGHRVYK VKDPRAIILQ DLAEQLFAKM
GHDEYYEIAV
ELEKVVEEYV GQKGIYPNVD FYSGLVYRKL DIPADLFTPL FAIARVAGWL
AHWKEQLSVN
KIYRPTQIYI GDHNLSYVPM TERVVSVARN EDPNAII
a Make a phylogenetic tree of the sequences.
b Which ones are also likely to be
(methyl)citrate synthases based on the
phylogenetic tree?
c Are the methylcitrate synthases
monophyletic?
d B.subtilis has a second gene (besides the one
above) that in the COGs is annotated as a
citrate synthase. Add it to the tree.
>B.subtilis
MEEKQHYSPG LDGVIAAETH ISYLDTQSSQ
ILIRGYDLIE LSETKSYLEL VHLLLEGRLP
EESEMETLER KINSASSLPA DHLRLLELLP
EDTHPMDGLR TGLSALAGYD RQIDDRSPSA
NKERAYQLLG KMPALTAASY RIINKKEPIL
PLQTLSYSAN FLYMMTGKLP SSLEEQIFDR
SLVLYSEHEM PNSTFAARVI ASTHSDLYGA
LTGAVASLKG NLHGGANEAV MYLLLEAKTT
SDFEQLLQTK LKRKEKIMGF GHRVYMKKMD
PRALMMKEAL QQLCDKAGDH RLYEMCEAGE
RLMEKEKGLY PNLDYYAAPV YWMLGIPIPL
YTPIFFSART SGLCAHVIEQ HANNRLFRPR
VSYMGPRYQT KS
e Do you think, based on the tree, that it is a
citrate synthase or a methyl-citrate
synthase?
f Methylcitrate synthase activity has also
been found in S. cerevisiae, but the gene
responsible has not been identified. Can you
find a sequence in S. cerevisiae that is
orthologous to the methylcitrate synthases
above?
Hint: You will have to add all homologs of
citrate synthase in S. cerevisiae to the
alignment and remake the tree.
g If you cannot find a gene that is orthologous
to methyl-citrate synthases in S. cerevisiae,
and you know that there has to be one (the
activity has been measured), what does this
imply for the monophyly of methyl-citrate
synthases?
Hint: Remember the story about the malate
and lactate dehydrogenases in
Trichomonas on day 3.
h Another methyl-citrate synthase has
recently been identified in the fungi. Add it
to the tree.
>Aspergillus nidulans methylcitrate_synthase
MALPLRTARH ASRLAQTIGR RGYATAEPDL KSALKAVIPA KRELLAEVKK
QGDEVIGEVK
VSNVIGGMRG LKSMLWEGSV LDADEGIRFH GKTIKDCQKE LPKGPTGTEM
LPEAMFWLLL
TGEVPSTSQV RAFSKQLAEE SHLPDHILDL AKSFPKHMHP MTQISIITAA
LNTESKFAKL
YEKGINKADY WEPTFDDAIS LLAKIPRVAA LVFRPNEIDV VGRQKLDPAQ
DWSYNFAELL
GKGGANNADF HDLLRLYLAL HGDHEGGNVS AHATHLVGSA LSDPFLSYSA
GLLGLAGPLH
GLAAQEVLRW ILAMQEKIGT KFTDEDVRAY LWDTLKSGRV VPGYGHGVLR
KPDPRFQALM
DFAATRKDVL ANPVFQLVKK NSEIAPGVLT EHGKTKNPHP NVDAASGVLF
YHYGFQQPLY
YTVTFGVSRA LGPLVQLIWD RALGLPIERP KSINLLGLKK
i Which S. cerevisiae gene is the likely
methyl-citrate synthase?
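The monophyly questions in question 5 (parts c and g) ask whether all methylcitrate synthases fall into one clade that contains nothing else. Below is a minimal Python sketch of that test on a rooted tree written as nested tuples. The representation, the function names, and the labels (`mcs_1`, `cs_1`, and so on) are placeholders invented for illustration; the toy tree is not the result of the exercise. Trees from alignment programs are often unrooted, in which case the answer depends on where the root is placed.

```python
def leaves(tree):
    """Return the set of leaf names under a node (a leaf is a string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly `group`."""
    group = set(group)
    here = leaves(tree)
    if here == group:
        return True
    if isinstance(tree, str) or not group <= here:
        return False  # a leaf, or a clade that lacks part of the group
    return any(is_monophyletic(child, group) for child in tree)


# Placeholder topology only, not a real result.
toy_tree = (("mcs_1", "mcs_2"), ("cs_1", ("cs_2", "mcs_3")))

if __name__ == "__main__":
    print(is_monophyletic(toy_tree, {"mcs_1", "mcs_2"}))
    print(is_monophyletic(toy_tree, {"mcs_1", "mcs_2", "mcs_3"}))
```

In the toy tree the first group forms its own clade, while the second does not, because the smallest clade containing all three `mcs` leaves also contains `cs` leaves.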