European Molecular Biology
Open Software Suite
EMBOSS
• European Molecular Biology Open Software Suite
• A properly constructed toolkit for creating robust
bioinformatics applications or workflows
• What can you do with it? Almost everything!
Sequence alignment
Rapid database searching with sequence patterns
Protein motif identification, including domain analysis
Nucleotide sequence pattern analysis---for example to identify CpG islands or repeats
Codon usage analysis for small genomes
Rapid identification of sequence patterns in large scale sequence sets
• http://emboss.sourceforge.net/
• EMBOSS and SHELL commands pipeline
Useful EMBOSS commands
command description
showdb Displays information on the currently available
databases
wossname Finds programs by keywords in their one-line
documentation
tfm Get the manual entries for each program in EMBOSS
seealso Find the relevant programs of a certain program
seqret Reads and writes (returns) sequences
entret Reads and writes (returns) flatfile entries
extractfeat Extract features from a sequence
extractseq Extract regions from a sequence
transeq Translate nucleic acid sequences
Get help from EMBOSS itself
# showdb
Shows the currently available databases
# tfm wossname
How to use a EMBOSS command? Just (r)tfm it
# wossname alignment
Which commands can handle alignments?
# seealso seqret
Are there any other commands able to do the similar thing?
Try these commands first, and you will find you can go further
step by step without help
Command line options
• All EMBOSS programs react to a number of
command line options. The most important
ones are
–help Get help
–help –verbose Get elaborate help
–auto “no questions asked”
–stdout Write to standard output
–filter Read stdin, write stdout
SEQRET parameters
zonnebloem> seqret -help
Standard (Mandatory) qualifiers:
[-sequence] seqall (Gapped) sequence(s) filename and optional
format, or reference (input USA)
[-outseq] seqoutall [.] Sequence set(s)
filename and optional format (output USA)
Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers:
-feature boolean Use feature information
-firstonly boolean Read one sequence and stop
General qualifiers:
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
SEQRET parameters
zonnebloem> seqret -help -verbose
Standard (Mandatory) qualifiers:
[-sequence] seqall (Gapped) sequence(s) filename and
optional
format, or reference (input USA)
[-outseq] seqoutall [.] Sequence set(s)
filename and optional format (output USA)
Additional (Optional) qualifiers: (none)
Advanced (Unprompted) qualifiers:
-feature boolean Use feature information
-firstonly boolean Read one sequence and stop
Associated qualifiers:
"-sequence" associated qualifiers
-sbegin1 integer Start of each sequence to be used
///
SEQRET parameters
///
"-sequence" associated qualifiers
-sbegin1 integer Start of each sequence to be used
-send1 integer End of each sequence to be used
-sreverse1 boolean Reverse (if DNA)
-sask1 boolean Ask for begin/end/reverse
-snucleotide1 boolean Sequence is nucleotide
-sprotein1 boolean Sequence is protein
-slower1 boolean Make lower case
-supper1 boolean Make upper case
-sformat1 string Input sequence format
-sdbname1 string Database name
-sid1 string Entryname
-ufo1 string UFO features
-fformat1 string Features format
-fopenfile1 string Features file name
///
SEQRET parameters
///
"-outseq" associated qualifiers
-osformat2 string Output seq format
-osextension2 string File name extension
-osname2 string Base file name
-osdirectory2 string Output directory
-osdbname2 string Database name to add
-ossingle2 boolean Separate file for each entry
-oufo2 string UFO features
-offormat2 string Features format
-ofname2 string Features file name
-ofdirectory2 string Output directory
///
SEQRET parameters
///
General qualifiers:
-auto boolean Turn off prompts
-stdout boolean Write standard output
-filter boolean Read standard input, write standard output
-options boolean Prompt for standard and additional values
-debug boolean Write debug output to program.dbg
-verbose boolean Report some/full command line options
-help boolean Report command line options. More
information on associated and general
qualifiers can be found with -help -verbose
-warning boolean Report warnings
-error boolean Report errors
-fatal boolean Report fatal errors
-die boolean Report dying program messages
Universal Sequence Address
Type Example Description
filename xxx.seq A sequence file "xxx.seq" in any format
format::filename fasta::xxx.seq A sequence file "xxx.seq" in fasta format
EMBL entry PAAMIR, using whatever access method is defined locally
db:IDname embl:paamir
for the EMBL database
EMBL entry X13776, using whatever access method is defined locally
db:AccessionNumber embl:X13776 for the EMBL database and searching by accession number and entry
name (X13776 is the accession number in this case)
EMBL entry X13776, using whatever access method is defined locally
db-acc:AccessionNumber embl-acc:X13776
for the EMBL database and searching by accession number only
EMBL entry PAAMIR, using whatever access method is defined locally
db-id:IDname embl-id:paamir
for the EMBL database, and searching by ID only
db-searchfield:word embl-des:lectin EMBL entries containing the word 'lectin' in the Description line
db-searchfield:wildcard- EMBL entries containing the wildcarded word 'human' in the Organism
embl-org:*human*
word fields
EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical
db:wildcard-ID embl:paami* order, using whatever access method is defined locally for the EMBL
database
Universal Sequence Address
Type Example Description
db or db:* embl or EMBL:* All sequences in the EMBL database
Reads file mylist and uses each line as a separate USA. List files can
@listfile @mylist
contain references to other lists files or any other standard USA.
list:listfile list:mylist Same as "@mylist" above
The pipe character "|" causes EMBOSS to fire up getz (the SRS
sequence retrieval program) to extract entry PAAMIR from EMBL in
'program parameters |' 'getz -e [embl-id:paamir] |'
EMBL format. Any application or script which writes one or more
sequences to stdout can be used in this way.
So far the shortest USA we could invent. In 'asis' format the name is
the sequence so no file needs to be opened. This is a special case. It
asis::sequence asis::atacgcagttatctgaccat
was intended as a joke, but could be quite useful for generating
command lines.
Each of the above can have '[start : end]' or '[start : end : r]' appended to them.
The 'file' and 'dbname' forms of USA can have 'format::' in front of them
(although a database knows which format it is and so this is redundant and
error-prone)
Exercise
Retrieve the corresponding nucleotide sequence of a protein fragment
Known:
• UniProt Accession number: Q5ZKN6
• Amino acid fragment: VAEEVAEE
Object:
• Retrieve the corresponding nucleotide sequence by linking the UniProt
entry to EMBL
• Understand the EMBOSS commands
Find out the fragment’s offset
# seqret –auto uniprot:Q5ZKN6 -stdout
>Q5ZKN6_CHICK Q5ZKN6 SubName: Full=Putative uncharacterized protein;
MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK
ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR
LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES
SHDSASGESTLESGKTEAALSEISAQEPKKITYSQIVREGRRFNIDLVSKLLYSRGLLIE
LLIKSNVSRYAEFKNATRILAFREGKVEQVPCSRADVFNSRQLAMVEKRMLMKFLTFCLE
YEQHPDEYQDYKNSTFAQFLKTRKLTPSLQHFILHSIAMVSEKDCNTLEGLQATRKFLQC
LGRYGNTPFLFPLYGQGEIPQCFCRMCAVFGGIYCLRHSVQCLVVDKESGRCKAVVDHFG
QRISANYFIVEDSYLSESVCENVCYRQLSRAVLITDQSVLKTDSEQQVSILMVPPVDLGQ
PAVCVIELCSSTMTCMKDTYLVHLTCPSTKTAREDLEPVVQKLFSLNAETEKETEDEVLE
KPRVLWALYFNMRDSSGIDRNSYSGLPSNVYVCSGPDSALGNDCAVKQAETIFQEMFPTE
EFCPAPPNPEDIIYDEDEIASEETGFNNSPETKPESSLQESSSRGSSTAVKEHIEE
Get the sequence of this protein entry in FASTA format
Figure out the offset of “VAEEVAEE” in this entry. How?
Find out the fragment’s offset
# extract sequence from UniProt
$seqAmino = `seqret -auto uniprot:Q5ZKN6 stdout –osf=raw`;
# remove the newlines (if any)
$seqAmino =~ tr/\n//;
# look for location of the repeat
$offset = index($seqAmino, "VAEEVAEE") + 1;
# print the offset
print "Offset = ", $offset, "\n";
Calling an EMBOSS command from within PERL
Get a cross-reference
entret uniprot:Q5ZKN6 -auto stdout |grep "DR "
Get the feature table of this protein entry
Understand the cross-reference
DR EMBL; AJ720048; CAG31707.1; -; mRNA.
Link to EMBL Status identifier
Protein ID Molecule Type
EMBL accession number
Database cross reference
The corresponding
cross reference
in EMBL
Read the detailed documentation of UniProt cross reference
http://www.expasy.org/sprot/userman.html#DR_line
Get a cross-reference
entret uniprot:Q5ZKN6 -auto stdout | grep "DR "
|grep "EMBL;"
Get the EMBL cross-reference for this protein
In Perl, you can use a regular expression to locate the EMBL reference
line, and extract the EMBL accession number and the protein-ID
Link protein to coding DNA
extractfeat embl:AJ720048 -value CAG31707.1 stdout >dnaSeq
Put its corresponding coding DNA into file “dnaSeq”
Figure out the offset in DNA
Offset in amino acid sequence: 128
Offset in corresponding nucleotide sequence:
((128-1) x 3) + 1
OR
(128 x 3)-2
= 382
Position is from 382 to (382+ 8x3)=406
Figure out the position of its corresponding coding DNA
sequence (is there anything wrong here?)
Extract the DNA sequence
# extractseq –auto dnaSeq -reg "382-406"
stdout
Now you got the corresponding dna sequence of the protein
fragment
It should be: “gttgctgaggaggttgctgaagaac”
But is that correct? You still have to translate it for verification…
Verify the result
extractseq dnaSeq -reg "382-406" stdout
>verify_seq
transeq verify_seq stdout
Result is “VAEEVAEEX” but not “VAEEVAEE”
You know what’s wrong here? So be careful on each step you did and
never trust you result before making verification!
Practical assignment
Known:
• UniProt Accession number: A0L7N9
• Fragment offset: 305
• Fragment length: 8 (Find out the fragment yourself this time)
Object:
• Find out the coding DNA of this amino acid fragment and verify it
• You may use different EMBOSS commands
• Build a pipeline in PERL to achieve the whole job