Embed
Email

Linux

Document Sample

Shared by: yurtgc548
Categories
Tags
Stats
views:
0
posted:
1/26/2012
language:
pages:
23
European Molecular Biology

Open Software Suite

EMBOSS

• European Molecular Biology Open Software Suite

• A properly constructed toolkit for creating robust

bioinformatics applications or workflows

• What can you do with it? Almost everything!

Sequence alignment

Rapid database searching with sequence patterns

Protein motif identification, including domain analysis

Nucleotide sequence pattern analysis---for example to identify CpG islands or repeats

Codon usage analysis for small genomes

Rapid identification of sequence patterns in large scale sequence sets





• http://emboss.sourceforge.net/

• EMBOSS and SHELL commands pipeline

Useful EMBOSS commands

command description

showdb Displays information on the currently available

databases

wossname Finds programs by keywords in their one-line

documentation

tfm Get the manual entries for each program in EMBOSS

seealso Find the relevant programs of a certain program

seqret Reads and writes (returns) sequences

entret Reads and writes (returns) flatfile entries

extractfeat Extract features from a sequence

extractseq Extract regions from a sequence

transeq Translate nucleic acid sequences

Get help from EMBOSS itself

# showdb

Shows the currently available databases

# tfm wossname

How to use a EMBOSS command? Just (r)tfm it

# wossname alignment

Which commands can handle alignments?

# seealso seqret

Are there any other commands able to do the similar thing?



Try these commands first, and you will find you can go further

step by step without help

Command line options

• All EMBOSS programs react to a number of

command line options. The most important

ones are

–help Get help

–help –verbose Get elaborate help

–auto “no questions asked”

–stdout Write to standard output

–filter Read stdin, write stdout

SEQRET parameters

zonnebloem> seqret -help

Standard (Mandatory) qualifiers:

[-sequence] seqall (Gapped) sequence(s) filename and optional

format, or reference (input USA)

[-outseq] seqoutall [.] Sequence set(s)

filename and optional format (output USA)



Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers:

-feature boolean Use feature information

-firstonly boolean Read one sequence and stop



General qualifiers:

-help boolean Report command line options. More

information on associated and general

qualifiers can be found with -help -verbose

SEQRET parameters

zonnebloem> seqret -help -verbose

Standard (Mandatory) qualifiers:

[-sequence] seqall (Gapped) sequence(s) filename and

optional

format, or reference (input USA)

[-outseq] seqoutall [.] Sequence set(s)

filename and optional format (output USA)



Additional (Optional) qualifiers: (none)

Advanced (Unprompted) qualifiers:

-feature boolean Use feature information

-firstonly boolean Read one sequence and stop



Associated qualifiers:



"-sequence" associated qualifiers

-sbegin1 integer Start of each sequence to be used

///

SEQRET parameters

///

"-sequence" associated qualifiers

-sbegin1 integer Start of each sequence to be used

-send1 integer End of each sequence to be used

-sreverse1 boolean Reverse (if DNA)

-sask1 boolean Ask for begin/end/reverse

-snucleotide1 boolean Sequence is nucleotide

-sprotein1 boolean Sequence is protein

-slower1 boolean Make lower case

-supper1 boolean Make upper case

-sformat1 string Input sequence format

-sdbname1 string Database name

-sid1 string Entryname

-ufo1 string UFO features

-fformat1 string Features format

-fopenfile1 string Features file name

///

SEQRET parameters

///

"-outseq" associated qualifiers

-osformat2 string Output seq format

-osextension2 string File name extension

-osname2 string Base file name

-osdirectory2 string Output directory

-osdbname2 string Database name to add

-ossingle2 boolean Separate file for each entry

-oufo2 string UFO features

-offormat2 string Features format

-ofname2 string Features file name

-ofdirectory2 string Output directory

///

SEQRET parameters

///

General qualifiers:

-auto boolean Turn off prompts

-stdout boolean Write standard output

-filter boolean Read standard input, write standard output

-options boolean Prompt for standard and additional values

-debug boolean Write debug output to program.dbg

-verbose boolean Report some/full command line options

-help boolean Report command line options. More

information on associated and general

qualifiers can be found with -help -verbose

-warning boolean Report warnings

-error boolean Report errors

-fatal boolean Report fatal errors

-die boolean Report dying program messages

Universal Sequence Address

Type Example Description



filename xxx.seq A sequence file "xxx.seq" in any format



format::filename fasta::xxx.seq A sequence file "xxx.seq" in fasta format



EMBL entry PAAMIR, using whatever access method is defined locally

db:IDname embl:paamir

for the EMBL database

EMBL entry X13776, using whatever access method is defined locally

db:AccessionNumber embl:X13776 for the EMBL database and searching by accession number and entry

name (X13776 is the accession number in this case)

EMBL entry X13776, using whatever access method is defined locally

db-acc:AccessionNumber embl-acc:X13776

for the EMBL database and searching by accession number only

EMBL entry PAAMIR, using whatever access method is defined locally

db-id:IDname embl-id:paamir

for the EMBL database, and searching by ID only



db-searchfield:word embl-des:lectin EMBL entries containing the word 'lectin' in the Description line



db-searchfield:wildcard- EMBL entries containing the wildcarded word 'human' in the Organism

embl-org:*human*

word fields

EMBL entries PAAMIB, PAAMIE and so on, usually in alphabetical

db:wildcard-ID embl:paami* order, using whatever access method is defined locally for the EMBL

database

Universal Sequence Address

Type Example Description



db or db:* embl or EMBL:* All sequences in the EMBL database



Reads file mylist and uses each line as a separate USA. List files can

@listfile @mylist

contain references to other lists files or any other standard USA.



list:listfile list:mylist Same as "@mylist" above



The pipe character "|" causes EMBOSS to fire up getz (the SRS

sequence retrieval program) to extract entry PAAMIR from EMBL in

'program parameters |' 'getz -e [embl-id:paamir] |'

EMBL format. Any application or script which writes one or more

sequences to stdout can be used in this way.

So far the shortest USA we could invent. In 'asis' format the name is

the sequence so no file needs to be opened. This is a special case. It

asis::sequence asis::atacgcagttatctgaccat

was intended as a joke, but could be quite useful for generating

command lines.





Each of the above can have '[start : end]' or '[start : end : r]' appended to them.

The 'file' and 'dbname' forms of USA can have 'format::' in front of them

(although a database knows which format it is and so this is redundant and

error-prone)

Exercise

Retrieve the corresponding nucleotide sequence of a protein fragment



Known:

• UniProt Accession number: Q5ZKN6

• Amino acid fragment: VAEEVAEE



Object:

• Retrieve the corresponding nucleotide sequence by linking the UniProt

entry to EMBL

• Understand the EMBOSS commands

Find out the fragment’s offset

# seqret –auto uniprot:Q5ZKN6 -stdout



>Q5ZKN6_CHICK Q5ZKN6 SubName: Full=Putative uncharacterized protein;

MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK

ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR

LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES

SHDSASGESTLESGKTEAALSEISAQEPKKITYSQIVREGRRFNIDLVSKLLYSRGLLIE

LLIKSNVSRYAEFKNATRILAFREGKVEQVPCSRADVFNSRQLAMVEKRMLMKFLTFCLE

YEQHPDEYQDYKNSTFAQFLKTRKLTPSLQHFILHSIAMVSEKDCNTLEGLQATRKFLQC

LGRYGNTPFLFPLYGQGEIPQCFCRMCAVFGGIYCLRHSVQCLVVDKESGRCKAVVDHFG

QRISANYFIVEDSYLSESVCENVCYRQLSRAVLITDQSVLKTDSEQQVSILMVPPVDLGQ

PAVCVIELCSSTMTCMKDTYLVHLTCPSTKTAREDLEPVVQKLFSLNAETEKETEDEVLE

KPRVLWALYFNMRDSSGIDRNSYSGLPSNVYVCSGPDSALGNDCAVKQAETIFQEMFPTE

EFCPAPPNPEDIIYDEDEIASEETGFNNSPETKPESSLQESSSRGSSTAVKEHIEE









Get the sequence of this protein entry in FASTA format

Figure out the offset of “VAEEVAEE” in this entry. How?

Find out the fragment’s offset

# extract sequence from UniProt

$seqAmino = `seqret -auto uniprot:Q5ZKN6 stdout –osf=raw`;

# remove the newlines (if any)

$seqAmino =~ tr/\n//;

# look for location of the repeat

$offset = index($seqAmino, "VAEEVAEE") + 1;

# print the offset

print "Offset = ", $offset, "\n";



Calling an EMBOSS command from within PERL

Get a cross-reference

entret uniprot:Q5ZKN6 -auto stdout |grep "DR "

Get the feature table of this protein entry

Understand the cross-reference



DR EMBL; AJ720048; CAG31707.1; -; mRNA.



Link to EMBL Status identifier

Protein ID Molecule Type

EMBL accession number

Database cross reference



The corresponding

cross reference

in EMBL



Read the detailed documentation of UniProt cross reference

http://www.expasy.org/sprot/userman.html#DR_line

Get a cross-reference

entret uniprot:Q5ZKN6 -auto stdout | grep "DR "

|grep "EMBL;"







Get the EMBL cross-reference for this protein



In Perl, you can use a regular expression to locate the EMBL reference

line, and extract the EMBL accession number and the protein-ID

Link protein to coding DNA

extractfeat embl:AJ720048 -value CAG31707.1 stdout >dnaSeq





Put its corresponding coding DNA into file “dnaSeq”

Figure out the offset in DNA

Offset in amino acid sequence: 128

Offset in corresponding nucleotide sequence:

((128-1) x 3) + 1

OR

(128 x 3)-2

= 382

Position is from 382 to (382+ 8x3)=406



Figure out the position of its corresponding coding DNA

sequence (is there anything wrong here?)

Extract the DNA sequence

# extractseq –auto dnaSeq -reg "382-406"

stdout

Now you got the corresponding dna sequence of the protein

fragment



It should be: “gttgctgaggaggttgctgaagaac”



But is that correct? You still have to translate it for verification…

Verify the result

extractseq dnaSeq -reg "382-406" stdout

>verify_seq



transeq verify_seq stdout



Result is “VAEEVAEEX” but not “VAEEVAEE”



You know what’s wrong here? So be careful on each step you did and

never trust you result before making verification!

Practical assignment

Known:

• UniProt Accession number: A0L7N9

• Fragment offset: 305

• Fragment length: 8 (Find out the fragment yourself this time)



Object:

• Find out the coding DNA of this amino acid fragment and verify it

• You may use different EMBOSS commands

• Build a pipeline in PERL to achieve the whole job



Related docs
Other docs by yurtgc548
Machine
Views: 0  |  Downloads: 0
M_amp;M
Views: 0  |  Downloads: 0
M _1931_
Views: 0  |  Downloads: 0
lyle.smu.edusysHinderer730107_dsgn.ppt
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!