					                       Submitting DNA Barcode Sequences to GenBank: A Tutorial
                                   Todd Osmundson, Garbelotto Lab
                                          September, 2008

Contents

General Introduction
Introduction to NCBI’s Barcode Submission Tool
Sequence files
Chromatograph files
Attribute tables
    Table 1: Sequence attributes
    Table 2: Trace file attributes
    Generating file lists
Submitting the barcode data to GenBank using BarSTool
Using NCBI’s Sequin and tbl2asn
    Element 1: The submit-block template file
    Element 2: The sequence data
    Element 3: The feature annotation table
    Element 4: The source annotation table
    Submitting the data
Appendices
    Appendix I: A sample FASTA file for a DNA barcode submission
    Appendix II: Source modifiers for barcode submissions through BarSTool
    Appendix III: Using the tar utility to make file archives
    Appendix IV: Conducting batch BLAST searches using SeqTools
    Appendix V: Source modifiers for FASTA definition lines or tbl2asn source tables



General Introduction

     DNA barcode sequences can be submitted to GenBank (the genetic sequence database at
the National Center for Biotechnology Information, NCBI) using several different methods.
The emphasis in this tutorial is on methods for batch data checking and submission, so that
many sequences can be handled at one time. There are two main ways of making batch
sequence submissions to GenBank: NCBI's Barcode Submission Tool (BarSTool), which is
specifically for DNA barcode sequences, and Sequin (or the similar but more automated
tool tbl2asn), which handles either barcode or non-barcode sequences. When submitting true
DNA barcode sequences (i.e., sequences that meet the length, quality and voucher criteria for
official barcodes), it is preferable to use BarSTool, as it has functionality for batch
submission of the necessary ancillary data (voucher specimen data, chromatograph trace
files, etc.). However, in working with fungi, we have a problem: whereas DNA barcoding
for most animal groups uses a portion of the mitochondrial cytochrome c oxidase subunit 1
gene (CO1 or cox1), mycologists have adopted the nuclear ribosomal internal transcribed
spacer region (nrDNA-ITS) as a barcoding standard, and BarSTool is currently configured to
accept only CO1 sequences. I am currently in a dialogue with NCBI to reconfigure BarSTool
to accept ITS data, but am unsure of when, or if, such a change will take place. Therefore, this
tutorial will cover BarSTool in hopes that NCBI will reconfigure it in the near future, and
Sequin/tbl2asn in case they do not.

Introduction to NCBI’s Barcode Submission Tool

   GenBank allows bulk submission of DNA barcode sequences and ancillary descriptive
data via its Barcode Submission Tool (BarSTool). The following tutorial describes how to
prepare data for submission and how to submit these data using BarSTool. For additional
information, see the BarSTool website:
http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode


   The data that you will need consist of 3 parts:
      * The sequences themselves, in FASTA format
      * Chromatograph trace files for each sequence (one forward, one reverse per
        sequence)
      * Ancillary data: 2 tab-delimited tables (in .txt format), one containing the descriptive
        data for the sequences and one for the trace files, plus the names and sequences of
        the forward and reverse primers used for sequencing.


Sequence files
   The following instructions assume that you have a set of completed sequence contigs
(e.g., prepared in Sequencher) of known provenance (i.e., checked against closest matches in
GenBank using BLAST). See Appendix IV for instructions on how to conduct BLAST
searches.
   The sequence data should be in a single FASTA file. A sequence in FASTA format
consists of a description line, which begins with a greater-than symbol (">"), a carriage
return, and then one or more lines of sequence data. The sequence data can be in one
continuous line, but for ease of reading GenBank recommends that all lines of text be
shorter than 80 characters in length. The sequence data are followed by a carriage return,
followed by the next sequence. An example FASTA file containing 3 sequences is:
   [Example FASTA file: not reproduced. See Appendix I for a sample file.]
   Create a FASTA file for each sequence by exporting the contig's consensus
sequence from Sequencher (File -> Export -> Consensus). The resulting FASTA file will
have as its descriptor line whatever you have named the contig, so make sure that the contig
has a unique name that will distinguish it from all other sequences in the dataset.
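
Because every description line must carry a unique name, it can help to check a FASTA file before going further. The following Python sketch is one possible way to do that; it is not part of the NCBI tools. The function names and the example identifiers (specimen_A, specimen_B) are made up for illustration, and the sketch assumes that the first word after the ">" is the sequence ID.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA text into a list of (sequence ID, sequence) pairs."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            # Assumption: the first word of the description line is the ID.
            records.append([words[0] if words else "", []])
        elif records:
            records[-1][1].append(line)
    return [(seq_id, "".join(parts)) for seq_id, parts in records]


def duplicate_ids(records):
    """Return the IDs that occur more than once, in sorted order."""
    counts = Counter(seq_id for seq_id, _ in records)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)


def long_lines(text, limit=80):
    """Return 1-based numbers of lines that are `limit` characters or longer."""
    return [i for i, line in enumerate(text.splitlines(), start=1)
            if len(line) >= limit]


# Made-up example: specimen_A is (wrongly) used for two different sequences.
example = """>specimen_A
ACGTACGT
ACGT
>specimen_B
GGGTTTAA
>specimen_A
TTTT
"""

records = read_fasta(example)
repeated = duplicate_ids(records)
too_long = long_lines(example)
print("Duplicate IDs:", repeated)
print("Lines of 80+ characters:", too_long)
```

To check a real file, read its contents (for example with open("combined.fasta").read(), where the file name is whatever you chose) and pass the text to the same functions.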
   The individual FASTA files can be joined into a single file using a script; the easiest way
to implement such a script is using the Automator program included in Mac OS. Set up the
following workflow in Automator (see right side of window):




Tasks are added to the workflow using the two windows on the left. For step one, select
“Finder” in the far left window (under Library -> Applications), then select “Get Specified
Finder Items” under the Action column. Steps 2 and 3 are found within TextEdit in the
Library column. Before running the workflow, remove all files from the first (Get Specified
Finder Items) step (select the files, and click the “-” button), and choose an appropriate
name in the third (New Text File) step. Run the workflow by clicking the “Run” button in
the upper right-hand corner. Workflows can be saved so you can use them again.
  The following UNIX shell command (this can be run through the Terminal program in
Mac OS) will do the same concatenation operation, but requires typing in all of the file
names:
                               $ cat file1 file2 > cat.out
This operation will concatenate the files “file1” and “file2” into the output file “cat.out”
(the “$” is the shell prompt and is not typed). For a slightly easier way than typing all of the
file names, see instructions for generating file lists, later in this document.
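
The same concatenation can also be scripted so that no file names need to be typed. The Python sketch below is one way to do it, under the assumption that the individual files sit in one folder and share the extension .fasta; the function name, the folder, and the file names are all hypothetical. The demonstration at the end runs in a temporary folder so that it does not touch real data.

```python
import tempfile
from pathlib import Path


def concatenate_fasta(input_dir, output_path, pattern="*.fasta"):
    """Join every file matching `pattern` in `input_dir` into one FASTA file."""
    output_path = Path(output_path)
    # Sort for a reproducible order, and never read the output file itself.
    inputs = sorted(p for p in Path(input_dir).glob(pattern)
                    if p.resolve() != output_path.resolve())
    with output_path.open("w") as out:
        for path in inputs:
            text = path.read_text()
            if text:
                # Make sure each file ends with a line break before the next ">".
                out.write(text if text.endswith("\n") else text + "\n")
    return [p.name for p in inputs]


# Demonstration with two made-up single-sequence files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "contig_A.fasta").write_text(">contig_A\nACGT\n")
    Path(folder, "contig_B.fasta").write_text(">contig_B\nGGCC")  # no final line break
    joined = concatenate_fasta(folder, Path(folder, "combined.fasta"))
    combined_text = Path(folder, "combined.fasta").read_text()

print(joined)
print(combined_text)
```

Adding the missing line break matters: without it, the description line of the next sequence would be glued onto the end of the previous sequence.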
   If you prepare your FASTA files in a word processing program (e.g., Microsoft Word)
rather than in a text editor (e.g., TextEdit, BBEdit, TextWrangler, etc.), be sure to save your
file as a plain text (.txt) file rather than as a word processor document (.doc, etc.), as the
embedded formatting tags in the latter can cause problems downstream.
   See Appendix I for a sample FASTA file for a DNA barcode submission.


Chromatograph files
   The output from the ABI sequencer includes 3 files for each sequence: the
chromatograph trace file (.ab1), a Phred file that includes the base call quality scores (.phd.1),
and a text file that includes the sequence itself (.seq). The .seq file includes the raw base calls
and is therefore virtually useless without editing, but the .ab1 and .phd.1 files are essential for
the barcode submission. The attribute tables (see below) will be used to tie together the
trace, Phred, and sequence (as edited contigs in FASTA format) files.
   The chromatograph trace (.ab1) files must be submitted as a single compressed format /
archive file (.zip, .gz, .tar, etc.). To prepare the trace file, first create a new directory (folder)
named “traces” containing all the traces for this submission. Then, assemble the .ab1 files
into a single archive file and compress it. The archiving and compression steps can be done
simultaneously or in sequence, depending on the tools at your disposal. To do these steps
simultaneously, use a zip utility (e.g., WinZip for Windows). In Mac OS, you can easily
prepare a .zip file as follows:
    1. Select the files to be put in the archive (this should be all files in the “traces” folder
       that you just created).
    2. Select the cogwheel drop-down menu and select “Create Archive of X items”; a .zip
       file containing a compressed archive will appear in the folder. This is the file that
       you will submit through BarSTool.




Alternatively, you can produce the archive using the tar utility, and compress it using the gzip
utility. For instructions on using the UNIX tar utility, see Appendix III. For instructions
on using the UNIX gzip utility, see the gzip home page: http://www.gzip.org/#intro
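
If you prefer to script the archiving step, Python's standard zipfile module can build the compressed archive. The sketch below is only an illustration: the function name and the file names are made up, and it assumes that all traces end in .ab1 and sit directly inside the “traces” folder. Each file is stored under the folder name (for example traces/specimen_A_F.ab1) so that the paths inside the archive begin with "traces/".

```python
import tempfile
import zipfile
from pathlib import Path


def zip_traces(traces_dir, archive_path):
    """Compress all .ab1 files in `traces_dir` into one .zip archive."""
    traces_dir = Path(traces_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for trace in sorted(traces_dir.glob("*.ab1")):
            # Store each file as "<folder name>/<file name>".
            archive.write(trace, arcname=f"{traces_dir.name}/{trace.name}")
    # Re-open the archive and report what it contains.
    with zipfile.ZipFile(archive_path) as archive:
        return archive.namelist()


# Demonstration with placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as work:
    traces = Path(work, "traces")
    traces.mkdir()
    Path(traces, "specimen_A_F.ab1").write_bytes(b"placeholder")
    Path(traces, "specimen_A_R.ab1").write_bytes(b"placeholder")
    Path(traces, "notes.txt").write_text("not a trace file")
    members = zip_traces(traces, Path(work, "traces.zip"))

print(members)
```

Listing the archive contents after writing it is a cheap way to confirm that only trace files were included and that the stored paths look as intended.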


Attribute tables
   The main difference between submission of barcode sequences and that of other DNA
sequence data is that barcode sequences are held to a higher standard: they must
correspond to vouchered specimens, must be from particular (agreed-upon) loci, and must
be of high quality. High quality here means a low percentage of ambiguous bases (“N”s),
both forward and reverse sequences, and a link to a chromatograph file and/or a file of read
quality metrics. The attribute tables are the way to link each sequence to its appropriate
chromatograph file and voucher specimen data. These tables must be submitted to
GenBank as tab-delimited text files. One can use a text editor to make the files, but it is
probably easier to use a spreadsheet application such as Excel; just be sure to save the file as
tab-delimited text when you're finished.


Table 1: Sequence attributes

  This is a tab-delimited text file that includes information about the sequence and the
specimen from which it is derived (NCBI refers to this information as “source modifiers”).

The NCBI website has downloadable templates for a table including the source modifiers
recommended by the Barcoding Consortium, and a table including all possible source
modifiers. In general, it is sufficient to use just the recommended source modifiers.
Here are the links:
* Template including recommended source modifiers:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/source-table-recommended.txt

* Template including all source modifiers:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/source-table-all.txt

To download the files, it is better to go directly to NCBI and download them from there:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/source-table.html

Here is a sample source modifier table:




   The first column includes the sequence ID. This must be the same ID that was used in
the description line of that sequence in the FASTA file. The rest of the columns include
information about the collection and the accession number of the voucher specimen. While
more information is usually better, GenBank will accept a table that includes only the first
(Sequence_ID) and last (Specimen_voucher) columns. For an example file with the 2-
column format, see the following URL:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/source-table-2-col-sample.txt
Official barcode submissions also require the “Country” modifier (country where the
specimen was collected).
   So, the end result of this table is to tell GenBank which voucher specimen corresponds
to each barcode sequence.
   Note the following requirements for the source modifiers table:
         * The heading for the first column must be exactly Sequence_ID as shown in the
           sample table.
         * Each specimen in the set must have a line in the source modifiers file, even if
           there are no modifiers to apply to the specimen.
         * Each Sequence_ID may appear only once in the source modifier file.

See Appendix II for descriptions of all source modifier fields.
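
The three requirements above are easy to check automatically before uploading. The following Python sketch shows one way; it is not an NCBI tool. The function name, the example IDs, and the voucher codes are hypothetical, and the list of FASTA IDs is assumed to have been collected separately from the description lines of the sequence file.

```python
import csv
import io


def check_source_table(table_text, fasta_ids):
    """Return a list of problems found in a tab-delimited source modifier table."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter="\t"))
    problems = []
    # Requirement 1: the first column heading must be exactly Sequence_ID.
    if not rows or not rows[0] or rows[0][0] != "Sequence_ID":
        problems.append("first column heading must be exactly Sequence_ID")
    table_ids = [row[0] for row in rows[1:] if row]
    # Requirement 3: each Sequence_ID may appear only once.
    for seq_id in sorted(set(table_ids)):
        if table_ids.count(seq_id) > 1:
            problems.append(f"{seq_id} appears more than once")
    # Requirement 2: every sequence in the FASTA file needs a line in the table.
    for seq_id in fasta_ids:
        if seq_id not in table_ids:
            problems.append(f"{seq_id} has no line in the table")
    return problems


# Made-up tables: the first is acceptable, the second breaks all three rules.
good_table = ("Sequence_ID\tCountry\tSpecimen_voucher\n"
              "specimen_A\tItaly\tHERB 0001\n"
              "specimen_B\tItaly\tHERB 0002\n")
bad_table = ("sequence_id\tSpecimen_voucher\n"
             "specimen_A\tHERB 0001\n"
             "specimen_A\tHERB 0001\n")

good_problems = check_source_table(good_table, ["specimen_A", "specimen_B"])
bad_problems = check_source_table(bad_table, ["specimen_A", "specimen_B"])
print(good_problems)
print(bad_problems)
```

An empty list means that none of the three rules was violated; it does not guarantee that GenBank will accept the modifier values themselves.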


Table 2: Trace file attributes

  This is a tab-delimited text file that includes information about the trace files.

The NCBI website has a downloadable template for this table; to see the format, go to the
following URL:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/templates/trace-table-required.txt

To download the template, go to the NCBI website:
http://www.ncbi.nlm.nih.gov/WebSub/html/help/trace-table.html

Here is a sample trace attribute table:




The first row of the table includes the column headings.

The columns of the table are as follows (descriptions taken from the NCBI website):

         * Template_ID - identifies the sequence. This identifier must be the same value
           as the Sequence_ID used in the source modifier table and in the nucleotide
           FASTA file, and allows GenBank to tie together the sequence, trace file, and
           voucher specimen data for each barcode.

         * Trace_file - the path to a specific trace in the trace archive file. If you set up the
           trace archive by putting all the traces into a directory (folder) named “traces”, the
           path would start with "traces/". For example: traces/filename.ab1.

           Note: If you set up your traces directory with subdirectories (e.g., for each separate
           submission set or for each separate organism), the path listed in the Trace_file
           column must include the subdirectory name. For example:
           traces/subdirectory_name/filename.scf.

         * Trace_format - names the format of the provided trace file. Trace_format can
           have the following values: SCF, SFF, ZTR, and ABI.

         * Center_project - a sequencing center's internal designation for a specific
           sequencing project. This field can be useful for grouping related traces.

         * Program_ID - the base calling program. This field is free text. Program name,
           version numbers or dates are very useful. Examples include:

                   phred-19980904e
                   abi-3.1
                   ATQA
                   TraceTuner
                   Licor
                   Megabase
                   Beckman

         * Trace_end - labels which end of the sequence is contained in the read. Possible
           values: F, R, N for Forward, Reverse, and uNknown.

Note that a Trace_file may appear only once in a Trace Information file; however, a
Template_ID may appear more than once (for example, once for the forward trace and once
for the reverse trace of the same sequence).
For more documentation on the NCBI Trace Archive, including additional fields and their
descriptions, consult the following website:
http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc_b&m=doc&s=rfc_b#PROGRAM_ID
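
If your trace files follow a regular naming scheme, the trace attribute table can be generated rather than typed. The Python sketch below assumes a hypothetical naming convention, <Sequence_ID>_F.ab1 for forward reads and <Sequence_ID>_R.ab1 for reverse reads, which is not something NCBI requires; adapt the parsing to your own file names. The column order follows the order in which the columns are described above and should be checked against the NCBI template before use.

```python
def trace_table(filenames, program_id, center_project="", folder="traces"):
    """Build a tab-delimited trace attribute table from .ab1 file names.

    Assumes names of the form <Sequence_ID>_F.ab1 or <Sequence_ID>_R.ab1
    (a hypothetical convention used only for this sketch).
    """
    rows = [["Template_ID", "Trace_file", "Trace_format",
             "Center_project", "Program_ID", "Trace_end"]]
    for name in sorted(filenames):
        stem, _, extension = name.rpartition(".")
        template_id, _, end = stem.rpartition("_")
        if extension != "ab1" or not template_id or end not in ("F", "R"):
            raise ValueError(f"unexpected trace file name: {name}")
        # Trace_file is the path inside the archive; .ab1 files are ABI format.
        rows.append([template_id, f"{folder}/{name}", "ABI",
                     center_project, program_id, end])
    return "\n".join("\t".join(row) for row in rows) + "\n"


# Made-up file names for one specimen with a forward and a reverse read.
table = trace_table(["specimen_A_R.ab1", "specimen_A_F.ab1"],
                    program_id="phred-19980904e")
print(table)
```

Here the same Template_ID appears twice (once per read) while each Trace_file appears once, which matches the rule stated above.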



Generating file lists

   If it sounds like a lot of work to enter all of the filenames into the spreadsheet, here is
some good news: it is not necessary to manually type the file names or copy/paste them
individually into the spreadsheet. Let your operating system do most of the work for you by
generating a file list! Here's how:
1. In Mac OS, open a Terminal window (or run cmd in Windows), then navigate to the
folder where the trace files and score files have been placed (use the cd command in all
operating systems). Note that in Mac OS Terminal, you do not need to manually type the
folder path following the “cd” command: just type “cd” followed by a space, then drag the
folder icon (from a Finder window) into the Terminal window and Terminal will write the
path for you.

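2. Use the following commands to generate a list of all files in the folder and write them to a
file entitled list.txt (the /b switch makes the Windows dir command print bare file names
without dates and sizes; note that list.txt itself will usually appear in the resulting list):
Windows:             dir /b > list.txt
MacOS:               ls > list.txt
Linux/Unix:          ls > list.txt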


3. Generally, in building the spreadsheets, however, you will not want a list of all files, but
rather a list of all files within a certain class (e.g., trace files, or Phred files). These
commands will generate a list of files having a particular file extension in the current folder,
and save it in a text file entitled list.txt. Each command overwrites any existing list.txt, so
use a different output name for each list if you want to keep both.
To generate a list of the trace files use the following commands:
Windows:             dir /b *.ab1 > list.txt
MacOS:               ls *.ab1 > list.txt
Linux/Unix:          ls *.ab1 > list.txt

To generate a list of the score files use:
Windows:             dir /b *.phd.1 > list.txt
MacOS:               ls *.phd.1 > list.txt
Linux/Unix:          ls *.phd.1 > list.txt
You can then open list.txt directly in Excel, or open list.txt in a text editor and copy/paste
into a column in an Excel spreadsheet.
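
The same lists can be produced with a short script that behaves identically on Windows, Mac OS, and Linux. The Python sketch below is one possible approach; the function names and the output file name trace_list.txt are made up, and the demonstration uses empty placeholder files in a temporary folder instead of real sequencer output.

```python
import tempfile
from pathlib import Path


def files_with_suffix(folder, suffix):
    """Return the sorted names of files in `folder` whose names end in `suffix`."""
    return sorted(p.name for p in Path(folder).iterdir()
                  if p.is_file() and p.name.endswith(suffix))


def write_list(names, list_path):
    """Write one file name per line, ready to paste into a spreadsheet column."""
    Path(list_path).write_text("".join(name + "\n" for name in names))


# Demonstration with empty placeholder files.
with tempfile.TemporaryDirectory() as folder:
    for name in ("specimen_A_F.ab1", "specimen_A_R.ab1",
                 "specimen_A_F.phd.1", "notes.txt"):
        Path(folder, name).write_text("")
    trace_names = files_with_suffix(folder, ".ab1")
    score_names = files_with_suffix(folder, ".phd.1")
    write_list(trace_names, Path(folder, "trace_list.txt"))
    trace_list_text = Path(folder, "trace_list.txt").read_text()

print(trace_names)
print(score_names)
```

Because the names are collected before the list file is written, the list file never ends up listing itself.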


Submitting the barcode data to GenBank using BarSTool
[Note: The NCBI BarSTool information page can be found at the following URL:
http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode]


   Once you have all of the data prepared, it is time to submit them to GenBank. First, you
will need to register for a My NCBI account:
                http://www.ncbi.nlm.nih.gov/entrez/login.fcgi
    To begin the submission process, make sure that you have the following available:
       * A web browser that supports both JavaScript and cookies
       * The title of a published or in-press paper that discusses the Barcode Set
       * A text file of the set of nucleotide sequences in FASTA format
       * The names and sequences of the forward and reverse primers
       * A tab-delimited table of source modifier data for the set
       * A text file of the set of protein sequences in FASTA format (optional for CO1; not
         applicable for ITS)
       * A tab-delimited table of trace attributes and a compressed archive containing the
         traces (optional, but highly recommended)

  From the NCBI Barcode of Life home page
(http://www.ncbi.nlm.nih.gov/WebSub/index.cgi?tool=barcode), select the link “Sign in to use
barcode” in the upper right corner, then work through the submission pages in order:
    1. First page: enter your contact information.
    2. Second page: enter the names of the sequence authors and the study (either a published
       or in-press paper, or the name of an unpublished study). Sequence authors for the
       Venice study should include Matteo Garbelotto, Lydia Baker, and anyone involved in
       the data acquisition and/or submission of the particular group of sequences being
       submitted. For the study title we should use a single, agreed-upon name so that all of
       the sequences can be grouped together even if they are submitted separately. I propose
       using the name of the study that appears on the lab website: “Barcoding the Venice
       Fungal Collection.”
    3. Third page: select a release date for the sequences (this step should be done in
       consultation with Matteo), and upload the nucleotide FASTA file containing the
       sequence data.
    4. Fourth page: you may upload a protein translation file; since ITS is not a
       protein-coding gene, continue past this step.
    5. Fifth page: enter the primer information. If all sequences were generated using the
       same primers (e.g., ITS1-F and ITS4-B), choose the option “Set one value for all
       sequences” and then enter the primer name and sequence.
    6. Sixth page: upload your sequence source modifier file. Note that, if your file does not
       contain all of the columns recommended by the Barcode Consortium, your submission
       will still be accepted but may not be given the official “Barcode” label in GenBank.
    7. Seventh page: upload the trace information table and the trace archive (the
       compressed archive file containing all of the chromatograph trace files). Be sure to
       upload the correct file in the correct place.
Following all information entry and upload, BarSTool presents a text file (in GenBank flat
file format) containing your submission as it will appear in GenBank. Review this file
carefully to confirm that the specimen, locus, author, and study information are correct. If
you are submitting a protein-coding sequence (e.g., CO1), make sure that the translation
makes sense (e.g., no internal stop codons, which are signified by an asterisk in the protein
sequence). Finalize your submission, and you're finished!


Using NCBI’s Sequin and tbl2asn

   Though one can submit sequences directly to GenBank via a web interface (BankIt), this
method only accommodates submission of sequences one at a time, certainly an
unpalatable option if one has many sequences to submit. Batch submission of sequences is
facilitated by NCBI's Sequin utility. Sequin allows the creation of a single file containing
descriptive information for a batch of sequences (author information, etc.) through
forms completed by the user, and then packages this file with the sequence files (in FASTA
format) into a single Sequin (.sqn) file that can be submitted to GenBank via e-mail. The
utility tbl2asn, as its name (albeit cryptically) suggests, converts information from tables to
ASN.1 (Abstract Syntax Notation 1), the file format used by GenBank. It is a command-line
program, so there are no menus; you must enter commands directly into a shell (UNIX,
DOS, etc.) window. Regardless of whether Sequin or tbl2asn is used, a submission consists
of the following elements:
       * Sequence data in FASTA format
       * General information about the submission (e.g., author information)
       * Annotation of sequence features (e.g., coding regions, non-coding regions)
       * Source information (e.g., organism, collection information, etc.)
Besides the difference in the way that the programs are run (command line vs. forms),
Sequin and tbl2asn differ primarily in the way in which some of these submission elements
are organized. The general information (author names and institutions, etc.) is in both cases
entered into a Sequin project, but when preparing for tbl2asn the user stops after this
information is entered and exports the results into a standalone file that can be read by
tbl2asn. The feature annotations can be either entered into a form or imported as a
tab-delimited text file for Sequin, and must be imported as a table for tbl2asn. Source
information can either be entered into a form (the tedious way) or embedded in the
FASTA definition line (the more straightforward way) for Sequin, and is either embedded
in the FASTA definition line or stored in a tab-delimited text table for tbl2asn.
   Because there is a fair amount of overlap between Sequin and tbl2asn and because
tbl2asn is a bit more efficient for large numbers of sequences, I will describe the use of
tbl2asn here. Instructions specific to Sequin are available from the following URL:
http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm.
   First, download tbl2asn. A link to the FTP site containing the download is available at
http://www.ncbi.nlm.nih.gov/Genbank/tbl2asn2.html. Be sure to download the correct
version for your platform, then uncompress the file and change the file permissions if
necessary. Also download Sequin (at http://www.ncbi.nlm.nih.gov/Sequin/), as you will
need this program for the initial steps of the process.
   The submission can contain up to 6 types of items, as follows:
    1. Template file containing a text ASN.1 Submit-block object (suffix .sbt).
    2. Nucleotide sequence data in FASTA format (suffix .fsa).
    3. Tab-delimited text file with a table containing sequence features (suffix .tbl).
    4. Protein sequence if the gene encodes a protein (suffix .pep).
    5. Source Table (suffix .src).
    6. Quality Scores (suffix .qvl).
    In our nrDNA-ITS sequence submissions, we can omit #4 since ITS is non-coding. We
    will also omit #6. The remaining four elements are described in more detail below.

Element 1: The submit-block template file
   This file is generated through Sequin. To make the file, first open Sequin. Choose
   “GenBank” as the database for submission, then select the button “Start New
   Submission.” The following form will open:




   Select a date for when the sequence record may be released (in consultation with
   Matteo), and fill in a tentative manuscript title. Then, select the other tabs and enter the
   contact, author, and affiliation information. After you have done this, return to the
   submission tab and use File->Export Submitter Info. Save the file as template.sbt.


Element 2: The sequence data

      Sequence data should be given in FASTA format, just as in a BarSTool submission.
   As in a BarSTool submission, it is easiest if all sequences are combined into a single
   FASTA file. The FASTA file should be placed in the same directory as the template and
   table files generated in the other steps of this process. It is possible to provide source
   information in the FASTA definition line (See Appendix V), or to store it in a separate
   tab-delimited table. Keep in mind that the sequence identifier (sequence “title”) used in
   the definition line (i.e., following the “>” symbol) must be identical to the one used in the
   source modifier and feature annotation tables. This sequence ID will be changed to a
   GenBank accession number by the NCBI staff after the sequences are submitted. For
   our submissions, we will put the source modifier data in a separate file (See Element 4),
   so the FASTA definition lines need only contain a unique identifier.


Element 3: The feature annotation table
          Sequence features such as the location of coding regions, introns, or different
      structural parts of a gene must be identified prior to submission. For example, a typical
      ITS1/ITS4 – primed sequence product contains a small portion of the 18S ribosomal
      RNA gene, followed by the first internal transcribed spacer (ITS1), the 5.8S ribosomal
      RNA gene, the second internal transcribed spacer (ITS2), and a small portion of the 28S
      ribosomal RNA gene. Identification of these elements and their positions will be, by far,
      the most time-consuming part of this process. These feature annotations must then be
      stored in a tab-delimited table having a specific 5-column format. The file begins with a
      definition line similar to that in a FASTA-formatted sequence; for example:
>Feature SeqId table_name
The sequence identifier (SeqId) must be the same as that used in the sequence FASTA file.
The table name portion is optional. Subsequent lines of the table list the features, each on a
separate line. Each feature can carry additional notes or qualifiers, which are placed on the
lines below the feature type and location. The columns are as follows:
       * Column 1: Start location of feature
       * Column 2: Stop location of feature
       * Column 3: Feature key (type)
       * Column 4: Qualifier key (placed on the row below the information in the first 3
         columns)
       * Column 5: Qualifier value (also placed on the row below the information in the first
         3 columns)
An example table may look like this:
>Feature Lp_1625
1       629     source
                        organism            Laccaria pseudomontana
                        mol_type            genomic DNA
                        isolate             pse1625
                        specimen_voucher    Cripps 1625 (type)
                        db_xref             taxon:344594
                        tissue_type         basidiome
                        country             USA: Colorado, Ten Mile Range, Blue Lake Dam
                        note                type strain of Laccaria sp. CLC 1771
<1      14      rRNA
                        product             18S ribosomal RNA
15      249     misc_RNA
                        product             internal transcribed spacer 1
250     407     rRNA
                        product             5.8S ribosomal RNA
408     614     misc_RNA
                        product             internal transcribed spacer 2
615     >629    rRNA
                        product             28S ribosomal RNA
Note that, in columns 1 and 2 (feature start and stop positions), the first entry begins
with <1, and the last entry ends with >629. The “<” denotes that the feature actually begins
before the first nucleotide of our sequence; the “>” denotes that the feature actually ends
after the last nucleotide of our sequence. This annotation table will yield a GenBank file
having the following feature annotations (See “FEATURES,” below):

LOCUS        DQ149871                 629 bp    DNA      linear  PLN 13-MAR-2006
DEFINITION   Laccaria pseudomontana isolate pse1625 18S ribosomal RNA gene,
             partial sequence; internal transcribed spacer 1, 5.8S ribosomal
             RNA gene, and internal transcribed spacer 2, complete sequence;
             and 28S ribosomal RNA gene, partial sequence.
ACCESSION    DQ149871
VERSION      DQ149871.1 GI:76781901
KEYWORDS     .
SOURCE       Laccaria pseudomontana
  ORGANISM Laccaria pseudomontana
             Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
             Agaricomycetes; Agaricomycetidae; Agaricales;
             Tricholomataceae; Laccaria.
REFERENCE    1 (bases 1 to 629)
  AUTHORS    Osmundson,T.W., Cripps,C.L. and Mueller,G.M.
  TITLE      Morphological and molecular systematics of Rocky Mountain
             alpine Laccaria
  JOURNAL    Mycologia 97 (5), 949-972 (2006)
REFERENCE    2 (bases 1 to 629)
  AUTHORS    Osmundson,T.W., Cripps,C.L. and Mueller,G.M.
  TITLE      Direct Submission
  JOURNAL    Submitted (29-JUL-2005) Ecology, Evolution and
             Environmental Biology, Columbia University, 1200 Amsterdam
             Avenue, MC 5557, New York, NY 10027, USA
FEATURES              Location/Qualifiers
      source          1..629
                      /organism="Laccaria pseudomontana"
                      /mol_type="genomic DNA"
                      /isolate="pse1625"
                      /specimen_voucher="Cripps 1625 (type)"
                      /db_xref="taxon:344594"
                      /tissue_type="basidiome"
                      /country="USA: Colorado, Ten Mile Range, Blue Lake Dam"
                      /note="type strain of Laccaria sp. CLC 1771"
      rRNA            <1..14
                      /product="18S ribosomal RNA"
      misc_RNA        15..249
                      /product="internal transcribed spacer 1"
      rRNA            250..407
                      /product="5.8S ribosomal RNA"
      misc_RNA        408..614
                      /product="internal transcribed spacer 2"
      rRNA            615..>629
                      /product="28S ribosomal RNA"
ORIGIN
        1 aggatcatta ttgaataaac ctgatgtggc tgttagctgg cttttcaaag catgtgctcg
       61 tccgtcatct ttaatttctc cacctgtgca cattttgtag tcttggatac ctctcgaggc
      121 aactcggatt ttaggatcgc cgtgctgtaa aagtcagctt tcctctcatt tccaagacta
      181 tgttttcata tacaccaaag tatgtttaaa gaatgtcatc aatgggaact tgtttcctat
      241 aaaattatac aactttcagc aacggatctc ttggctctcg catcgatgaa gaacgcagcg
      301 aaatgcgata agtaatgtga attgcagaat tcagtgaatc atcgaatctt tgaacgcacc
      361 ttgcgctcct tggtattccg aggagcatgc ctgtttgagt gtcattaaat tctcaacctt
      421 ccaactttta ttagcttggt taggcttgga tgtgggggtt gcgggcttca tcaatgaggt
      481 cggctctcct taaatgcatt agcggaactt ttgtggaccg tctattggtg tgataattat
      541 ctacgccgtg gatttgaagc agctttatga agttcagcct ctaaccgtcc attgacttgg
      601 acaattttga caatttgacc tcaaatcag
//



                 It is best to make this table, as well as the source modifier table, in a text editor
                 rather than a word processor, because word processors can silently insert hidden
                 formatting codes. Save the feature annotation table with the file extension .tbl.
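
One way to confirm that a saved table is free of hidden formatting is to scan it for characters outside plain ASCII (curly quotes, non-breaking spaces, byte-order marks). The Python sketch below is only an illustration of that idea; the function name find_suspect_characters is hypothetical, and it assumes the table is meant to contain nothing but printable ASCII, tabs, and line breaks.

```python
def find_suspect_characters(text):
    """Return (line_number, column, character) for every character that is
    not printable ASCII, a tab, or a line break."""
    problems = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for column, character in enumerate(line, start=1):
            # Tabs separate columns; codes 32-126 are printable ASCII.
            if character == "\t" or 32 <= ord(character) < 127:
                continue
            problems.append((line_number, column, character))
    return problems
```

To check a file, read it as text (for example with open("mycena.tbl", encoding="utf-8"), where the file name is a placeholder) and pass the contents to the function; an empty list means nothing suspicious was found.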


           Element 4: The source annotation table

                    This table contains information about the biological source of the sequence. It can
                 contain a wealth of different information (see Appendix V for a complete list of
                 accepted source modifiers), but usually includes only a small subset of the possible
                 fields. The first column must contain the sequence ID (SeqID); this code must be
                 identical to the one in the definition line of the corresponding FASTA file. The second
                 column should contain the organism name (Latin binomial). For our submissions, we
                 should also use the fields recommended for DNA barcode data by the Consortium for
                 the Barcode of Life, as doing so will facilitate obtaining official barcode designation
                 for our sequences; these fields are Collected-by, Collection-date, Country,
                 Identified-by, Lat-Lon, and Specimen-voucher. An example table is as follows:


SeqID   organism      Collected-by      Collection-date Country Identified-by  Lat-lon         Specimen-voucher
Mp_MG15 Mycena pura   Matteo Garbelotto 01-Apr-2008     USA     Todd Osmundson 13.57 N 24.68 W MG 15
Lb_MG99 Mycena impura Matteo Garbelotto 01-Apr-2008     USA     Doug Schmidt   13.57 N 24.68 W MG 99



           The table must be saved as a tab-delimited file with a .src extension.
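
If the source data already live in a spreadsheet or script, the table can be generated rather than typed. The following Python sketch is a minimal illustration, not part of the NCBI tools: the function names are hypothetical, the column headers are copied from the example table above, and it assumes the SeqID is the first word of each FASTA definition line.

```python
import csv
import io

COLUMNS = ["SeqID", "organism", "Collected-by", "Collection-date",
           "Country", "Identified-by", "Lat-lon", "Specimen-voucher"]


def source_table_text(records):
    """Render a list of dicts as tab-delimited text, one row per sequence.
    Fields missing from a record are written as empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def missing_seq_ids(fasta_text, records):
    """Return table SeqIDs that have no matching FASTA definition line."""
    fasta_ids = {line[1:].split()[0]
                 for line in fasta_text.splitlines()
                 if line.startswith(">") and line[1:].split()}
    return [record["SeqID"] for record in records
            if record["SeqID"] not in fasta_ids]
```

Writing the returned text to a file whose name ends in .src (for example mycena.src, a placeholder name) gives a table in the layout described above; missing_seq_ids returns an empty list when every row has a matching sequence.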

           Submitting the data
                Now we will run the tbl2asn program, which will generate a Sequin (.sqn) file that we
                 can submit to GenBank via e-mail. First, copy all files into the directory that contains
                 the tbl2asn program, as this simplifies the path specification in the command line. In
                 MacOS, open a Terminal window; in Windows, open a DOS command window; then
                 navigate to the directory that contains the tbl2asn program. Typing “tbl2asn -” at the
                 shell prompt will produce the full list of command line arguments; the table below
                 summarizes the most common ones and is followed by some example command lines.
     We will use the following command line, for a batch submission with multiple sequences
     per .fsa file:
     tbl2asn -t template.sbt -p . -a s -V v

     This command line makes several assumptions: that the Sequin template is named
     “template.sbt”, that all sequences are in a single FASTA file, and that all files are in the
     same directory as the tbl2asn program; make sure that your data meet these assumptions,
     or change the command line accordingly. Note also that the .fsa, .tbl, and .src files must
     have the same filename prefix (e.g., mycena.fsa and mycena.tbl), or tbl2asn will not
     match them correctly.
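
The filename-prefix requirement is easy to verify before running the program. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, and the argument list simply mirrors the batch command line given above (it does not run tbl2asn itself).

```python
from pathlib import Path


def unmatched_prefixes(directory):
    """Map each .fsa filename prefix to the companion extensions
    (.tbl, .src) that are missing from the same directory."""
    directory = Path(directory)
    missing = {}
    for fasta in sorted(directory.glob("*.fsa")):
        absent = [extension for extension in (".tbl", ".src")
                  if not fasta.with_suffix(extension).exists()]
        if absent:
            missing[fasta.stem] = absent
    return missing


def batch_command(template="template.sbt", path="."):
    """Argument list for the batch command line used in this tutorial."""
    return ["tbl2asn", "-t", template, "-p", path, "-a", "s", "-V", "v"]
```

An empty dictionary from unmatched_prefixes(".") means every .fsa file has both companions. The list from batch_command() could then be handed to subprocess.run, assuming the tbl2asn executable is in the current directory or on the search path.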


     Most common command line arguments for tbl2asn (from the NCBI website).

-p    Path to the directory. If files are in the current directory, -p . should be used.
-r    Path for the resulting .sqn file(s). If the -r argument is not used, the .sqn files will be
      saved in the source directory.
-t    Specifies the template file (.sbt). If the .sbt file is in a different directory, the full
      path must be specified.
-i    Creates a single submission from the indicated .fsa file in a directory of multiple .fsa
      files.
-a    Specifies the file type:
          s : FASTA Set (s Batch, s1 Pop, s2 Phy, s3 Mut, s4 Eco)
          l : FASTA+Gap Alignment
          z : FASTA with Gap Lines
          e : PHRAP/ACE
          d : FASTA Delta, di FASTA Delta with Implicit Gaps
          a : Any (default)
      Sample command line: -a s
-s    Instructs tbl2asn to read multiple FASTA components in one file as a set of unrelated
      sequences. Equivalent to "-a s". This creates a single file of multiple submissions.
      (1000 sequences per file is the usual maximum.)
-j    Allows the addition of source qualifiers that will be the same for each submission.
      Example: -j "[organism=Saccharomyces cerevisiae] [strain=S288C]"
-V    Verification (combine any of the following letters):
          v : Validates the data records. The output is saved to files with a .val suffix.
          b : Generates GenBank flatfiles with a .gbf suffix.
          r : Validates without Country Check
      Sample command line: -V vb
-k    CDS flags (combine any of the following letters):
          c : Instructs tbl2asn to annotate the longest open reading frame (ORF) if a .tbl file
              is not provided. The product name will be 'unknown' unless a product name is
              included in the FASTA definition, [product=xyz].
          m : Allows alternative start codons to be used in ORF searches.
          r : Allows Runon ORFs
      Sample command line: -k c
-y    Adds a COMMENT to each submission. Example: -y "Contigs larger than 2kb have been
      annotated, representing approx. 87% of the total genome"
-Y    Like -y, but adds a COMMENT to each submission from a file.
-Z    Runs the Discrepancy Report. Must supply an output file name. Recommended only for
      annotated genome submissions, complete or WGS. See the Discrepancy Report page for
      information about its output.
-o    Creates a single submission from multiple fasta files.

Example Command Lines:
Single submission: one sequence per .fsa file:
tbl2asn -t template.sbt -p path_to_files -V v
Batch submission: multiple sequences per .fsa file:
tbl2asn -t template.sbt -p path_to_files -a s -V v
Single submission: one .fsa file in directory of multiple .fsa files:
tbl2asn -t template.sbt -i x.fsa -V v

The -V v portion of the command line generates a validation file with a .val extension.
Before submitting your .sqn files to GenBank, review the .val files (open them with a text
editor); you will need to correct every error-level message. Taxonomy-related errors about
missing lineages can (and should) generally be ignored. To correct errors, open the newly
created .sqn file in Sequin by double-clicking on it. Double-click on the portion of the
GenBank output that contains an error; the program will draw a black, vertical line to the
left of that portion and open a dialog box where you can correct the error (see the annotated
screenshots that follow). When you open the Sequin file, the first sequence in the set will be
displayed; to go to another sequence, click in the box that contains the sequence ID number.
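
When a batch produces a long .val file, a short script can give a first overview before the file is read in detail. The Python sketch below is an illustration only. It assumes that each validator message occupies one line and begins with a severity word followed by a colon (for example "ERROR:" or "WARNING:"); check this against your own .val files before relying on it. The function names are hypothetical.

```python
from collections import Counter


def count_severities(val_text):
    """Count messages by the word that precedes the first colon on each line.
    Lines without such a leading word are skipped."""
    counts = Counter()
    for line in val_text.splitlines():
        severity, separator, _ = line.partition(":")
        if separator and severity.strip().isalpha():
            counts[severity.strip().upper()] += 1
    return counts


def error_lines(val_text):
    """Return the lines whose severity word is ERROR."""
    return [line for line in val_text.splitlines()
            if line.lstrip().upper().startswith("ERROR:")]
```

The lines returned by error_lines are the ones that need to be fixed in Sequin; remaining warnings can then be reviewed by eye.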



[Screenshots of the Sequin window. Callouts: "Double-click here to go to another sequence";
"Double-click here to correct the source information".]

Unfortunately, if we wish to submit chromatograph trace files to the NCBI Trace Archive,
we will have to do so separately, rather than as part of the same submission as in BarSTool.
At present, a trace file does not appear to be required to obtain official barcode designation
for a sequence, so we may wish to skip this step for now.
                                              Appendices


Appendix I: A Sample FASTA File for a DNA Barcode Submission
(Source: NCBI)
>Seq1 [organism=Carpodacus mexicanus]
CCTTTATCTAATCTTTGGAGCATGAGCTGGCATAGTTGGAACCGCCCTCAGCCTCCTCATCCGTGCAGAA
CTTGGACAACCTGGAACTCTTCTAGGAGACGACCAAATTTACAATGTAATCGTCACTGCCCACGCCTTCG
TAATAATTTTCTTTATAGTAATACCAATCATGATCGGTGGTTTCGGAAACTGACTAGTCCCACTCATAAT
CGGCGCCCCCGACATAGCATTCCCCCGTATAAACAACATAAGCTTCTGACTACTTCCCCCATCATTTCTT
TTACTTCTAGCATCCTCCACAGTAGAAGCTGGAGCAGGAACAGGGTGAACAGTATATCCCCCTCTCGCTG
GTAACCTAGCCCATGCCGGTGCTTCAGTAGACCTAGCCATCTTCTCCCTCCACTTAGCAGGTGTTTCCTC
TATCCTAGGTGCTATTAACTTTATTACAACCGCCATCAACATAAAACCCCCAACCCTCTCCCAATACCAA
ACCCCCCTATTCGTATGATCAGTCCTTATTACCGCCGTCCTTCTCCTACTCTCTCTCCCAGTCCTCGCTG
CTGGCATTACTATACTACTAACAGACCGAAACCTAAACACTACGTTCTTTGACCCAGCTGGAGGAGGAGA
CCCAGTCCTGTACCAACACCTCTTCTGATTCTTCGGCCATCCAGAAGTCTATATCCTCATTTTAC

>Seq2 [organism=Vireo solitarius]
GGTAGGTACCGCCCTAAGNCTCCTAATCCGAGCAGAACTANGCCAACCCGGAGCCCTTCTGGGAGACGAC
CAAATCTACAACGTAGTCGTTACGGCCCACGCCTTCGTAATAATCTTTTTCATAGTAATGCCAATCATAA
TCGGAGGATTCGGGAACTGACTAGTTCCTCTAATGATTGGGGCCCCAGACATAGCATTCCCTCGAATAAA
CAACATAAGCTTTTGACTACTACCACCATCATTCCTACTCCTAATAGCCTCCTCAACAGTAGAAGCAGGA
GCCGGAACCGGATGAACCGTGTACCCACCACTAGCTGGAAACCTGGCCCACGCCGGAGCCTCAGTAGACC
TAGCTATCTTCTCCCTACACCTAGCAGGTATCTCATCCATCCTGGGGGCAATTAACTTCATTACAACAGC
AATCAACATAAAACCACCCGCCCTCTCACAATACCAAACACCACTATTCGTGTGATCCGTCCTAATTACG
GCCGTACTACTCCTACTATCTCTCCCAGTACTAGCCGCCGGTATCACCATGCTACTCACAGACCGCAACC
TCAACACCACCTTCTTTGACCCAGCAGGAGGAGGAGACCCAGTACTATACCAGCACCTATTCTGATTCTT
CGGACACCCAGAAGTCTACATCCTAATTCTC

>Seq3 [organism=Dendroica tigrina]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATGGTGGGTACCGCTCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTACAACGTGGTTGTCACGGCCCATGCCTTCG
TAATAATCTTCTTTATAGTTATGCCGATTATAATCGGAGGATTCGGAAACTGACTAGTCCCCCTAATAAT
CGGAGCCCCAGACATAGCATTTCCGCGAATAAACAACATAAGCTTCTGACTACTCCCACCATCATTCCTC
CTCCTCTTAGCATCCTCCACAGTGGAAGCAGGCGTAGGTACAGGCTGAACAGTGTATCCCCCACTAGCTG
GCAACCTAGCTCATGCCGGGGCCTCAGTCGACCTCGCAATCTTCTCCTTACACCTAGCTGGTATTTCCTC
AATCCTCGGAGCAATTAACTTCATTACAACAGCAATTAACATGAAACCTCCTGCCCTCTCACAATACCAA
ACCCCACTATTCGTCTGATCAGTGTTAATTACTGCAGTCCTCCTTCTCCTTTCCCTTCCAGTTCTAGCTG
CAGGAATCACAATGCTCCTCACAGACCGCAACCTCAACACCACATTCTTCGACCCTGCCGGAGGAGGAGA
TCCCGTCCTATATCAACATCTCTTCTGATTCTTCGGCCACCCAGAAGTCTACATCCTAATCCTC

>Seq4 [organism=Vireo gilvus]
CATGAGCTGGAATAGTAGGTACCGCCCTAAGCCTCCTAATTCGAGCAGAGCTAGGCCAACCCGGAGCCCT
ACTGGGAGACGACCAAATCTACAACGTAGTCGNCACGGCCCATGCTTTTGTAATAATCTTCTTCATAGTA
ATGCCAATCATAATCGGAGGGTTTGGAAACTGACTGGTCCCCCTAATAATTGGAGCTCCAGACATAGCAT
TCCCCCGAATAAACAACATGAGTTTCTGACTACTTCCCCCATCATTCCTACTACTAATAGCCTCCTCAAC
AGTAGAAGCAGGCGTTGGAACAGGATGAACCGTATATCCACCACTAGCCGGAAACCTAGCCCATGCAGGA
GCCTCAGTAGACCTAGCTATCTTCTCCCTACACCTAGCAGGTATCTCCTCCATCCTAGGGGCAATCAACT
TCATTACAACAGCAATCAACATAAAACCACCCGCCCTATCACAATACCAAACACCACTATTCGTATGATC
CGTCCTAATCACAGCCGTACTACTCCTCCTATCACTCCCAGTGCTAGCTGCTGGAATTACCATGCTACTT
ACAGACCGCAACCTCAACACTACCTTCTTTGACCCAGCAGGGGGAGGAGACCCAGTGCTATACCAACATC
TATTCTGATTCTTCGGACACCCAGAAGTTTACATCCTAATTCTC

>Seq5 [organism=Dendroica castanea]
CCTATACCTAATTTTCGGCGCATGAGCCGGAATAGTGGGTACCGCCCTAAGCCTCCTCATTCGAGCAGAA
CTAGGCCAACCCGGAGCCCTTCTGGGAGACGACCAAGTCTATAACGTAGTTGTCACGGCCCATGCCTTCG
TAATAATTTTCTTTATAGTTATGCCGATTATAATCGGAGGATTCGGAAACTGACTAGTCCCCCTAATAAT
CGGAGCCCCAGACATAGCATTCCCACGAATAAACAACATAAGCTTCTGACTACTCCCACCATCATTCCTT
CTCCTCCTAGCATCCTCCACAGTCGAAGCAGGCGTAGGTACAGGCTGAACAGTATACCCCCCACTAGCTG
GCAACCTAGCTCACGCCGGAGCCTCAGTCGACCTCGCAATCTTCTCTCTACACCTAGCTGGTATTTCCTC
AATCCTCGGAGCAATCAACTTCATTACAACAGCAATTAACATAAAACCTCCTGCCCTCTCACAATACCAA
ACCCCACTGTTCGTCTGATCCGTCCTAATCACTGCAGTCCTCCTGCTCCTTTCCCTTCCAGTTCTAGCTG
CAGGAATCACAATACTCCTCACAGACCGCAACCTAAACACCACATTCTTCGACCCTGCTGGAGGAGGAGA
TCCCGTCCTATATCAACACCTTTTCTGATTCTTCGGCCACCCAGAAGTCTACATCCTAATCNTC

>Seq6 [organism=Vireo gilvus]
CATGAGCTGGAATAGTAGGTACCGCCCTAAGCCTCCTAATTCGAGCAGAGCTAGGCCAACCCGGAGCCCT
ACTGGGAGACGACCAAATCTACAACGTAGTCGTCACGGCCCATGCTTTTGTAATAATCTTCTTCATAGTA
ATGCCAATCATAATCGGAGGGTTTGGAAACTGACTGGTCCCCCTAATAATTGGAGCTCCAGACATAGCAT
TCCCCCGAATAAACAACATGAGTTTCTGACTACTTCCCCCATCATTCCTACTACTAATAGCCTCCTCAAC
AGTAGAAGCAGGCGTTGGAACAGGATGAACTGTATACCCGCCACTAGCCGGTAACCTAGCCCATGCAGGA
GCCTCAGTAGACCTAGCTATCTTCTCCCTACACCTAGCAGGTATCTCCTCCATCCTAGGGGCAATCAACT
TCATTACAACAGCAATCAACATAAAACCACCCGCCCTATCACAATACCAAACACCACTATTCGTATGATC
CGTCCTAATCACAGCCGTACTACTCCTCCTATCACTCCCAGTGCTAGCTGCTGGAATTACCATGCTACTT
ACAGACCGCAACCTCAACACTACCTTCTTTGACCCAGCAGGGGGAGGAGACCCAGTGCTATACCAACATC
TATTCTGATTCTTCGGACACCCAGAAGTTTACATCCTAATTCTC
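
A FASTA file such as the sample above can be checked programmatically before submission. The Python sketch below is a minimal illustration, not an NCBI tool: the function names are hypothetical, whitespace inside sequence lines is simply discarded, and the set of accepted letters is the standard IUPAC nucleotide alphabet (which includes N for an unknown base).

```python
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def read_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) pairs.
    Whitespace inside sequence lines is discarded."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append("".join(line.split()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def invalid_bases(sequence):
    """Return the sorted characters that are not IUPAC nucleotide codes."""
    return sorted(set(sequence.upper()) - IUPAC_NUCLEOTIDES)
```

Running invalid_bases on each parsed sequence flags stray characters such as alignment gaps; comparing the first word of each definition line across records is a simple way to spot duplicate sequence IDs.
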
Appendix II: Source Modifiers for Barcode Submissions through BarSTool
(from the NCBI website http://www.ncbi.nlm.nih.gov/WebSub/html/help/source-table.html)


In addition to the Sequence ID, the following source modifiers are required for Barcode
submissions:
      Country - The country of origin of DNA samples used.
      Specimen_voucher - An identifier of the individual or collection of the source
       organism and the place where it is currently stored, usually an institution.


The following source modifiers are recommended for Barcode submissions:
      Collected_by - Name of person who collected the sample.
      Collection_date - Date the specimen was collected, in the format DD-Mon-YYYY, that
       is, 2-digit day, three-character abbreviation of month, and 4-digit year (e.g.,
       11-Feb-2002). Mon-YYYY and YYYY are alternate formats to use when date information is
       less complete.
      Identified_by - name of the person or persons who identified by taxonomic name
       the organism from which the sequence was obtained
      Lat_Lon - Latitude and longitude, in decimal degrees, of where the sample was
       collected.
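
Field notes and GPS units often record coordinates as signed decimal degrees, whereas the example source table earlier in this tutorial writes them as "13.57 N 24.68 W". The Python sketch below converts one form to the other. It is an illustration only: the function name is hypothetical, the output layout is copied from that example table rather than from an NCBI specification, and the default of two decimal places is an arbitrary choice.

```python
def format_lat_lon(latitude, longitude, places=2):
    """Format signed decimal degrees as 'D.DD N D.DD W'.
    North and east are positive; south and west are negative."""
    if not -90 <= latitude <= 90:
        raise ValueError("latitude must be between -90 and 90")
    if not -180 <= longitude <= 180:
        raise ValueError("longitude must be between -180 and 180")
    north_south = "N" if latitude >= 0 else "S"
    east_west = "E" if longitude >= 0 else "W"
    return (f"{abs(latitude):.{places}f} {north_south} "
            f"{abs(longitude):.{places}f} {east_west}")
```

For example, format_lat_lon(13.57, -24.68) gives the value used in the example table.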


The following optional source modifiers are available to further describe the sequences in a
Barcode set:
      Authority - The author or authors of the organism name from which sequence was
       obtained.
      Biotype - Variety of a species (usually a fungus, bacteria, or virus) characterized by
       some specific biological property (often geographical, ecological, or physiological).
       Same as biovar.
      Biovar - See Biotype
      Breed - The named breed from which sequence was obtained (usually applied to
       domesticated mammals).
      Cell_line - Cell line from which sequence was obtained.
      Cell_type - Type of cell from which sequence was obtained.
      Chemovar - Variety of a species (usually a fungus, bacteria, or virus) characterized by
    its biochemical properties.
   Clone - Name of clone from which sequence was obtained.
   Cultivar - Cultivated variety of plant from which sequence was obtained.
   Dev_stage - Developmental stage of organism.
   Ecotype - The named ecotype (population adapted to a local habitat) from which
    sequence was obtained (customarily applied to populations of Arabidopsis thaliana).
   Forma - The forma (lowest taxonomic unit governed by the nomenclatural codes) of
    organism from which sequence was obtained. This term is usually applied to plants
    and fungi.
   Forma_specialis - The physiologically distinct form from which sequence was
    obtained (usually restricted to certain parasitic fungi).
   Genotype - Genotype of the organism.
   Haplotype - Haplotype of the organism.
   Isolate - Identification or description of the specific individual from which this
    sequence was obtained.
   Isolation_source - Describes the local geographical source of the organism from
    which the sequence was obtained.
   Lab_host - Laboratory host used to propagate the organism from which the
    sequence was obtained.
   Natural_host - When the sequence submission is from an organism that exists in a
    symbiotic, parasitic, or other special relationship with some second organism, the
    'natural host' modifier can be used to identify the name of the host species.
   Note - Any additional information that you wish to provide about the sequence.
   Pathovar - Variety of a species (usually a fungus, bacteria or virus) characterized by
    the biological target of the pathogen. Examples include Pseudomonas syringae
    pathovar tomato and Pseudomonas syringae pathovar tabaci.
   Pop_variant - name of the population variant from which the sequence was obtained
   Serogroup - Variety of a species (usually a fungus, bacteria, or virus) characterized by
    its antigenic properties. Same as serotype and serovar.
   Serotype - See Serogroup
   Serovar - See Serogroup
   Sex - Sex of the organism from which the sequence was obtained.
   Strain - Strain of organism from which sequence was obtained.
   Sub_species - Subspecies of organism from which sequence was obtained.
   Subclone - Name of subclone from which sequence was obtained.
   Subtype - Subtype of organism from which sequence was obtained.
   Substrain - Sub-strain of organism from which sequence was obtained.
   Tissue_lib - Tissue library from which the sequence was obtained.
   Tissue_type - Type of tissue from which sequence was obtained.
   Type - Type of organism from which sequence was obtained.
   Variety - Variety of organism from which sequence was obtained.
Appendix III: Using the tar utility to make file archives
By Owen L. Astrachan, Duke University
(http://www.cs.duke.edu/~ola/courses/programming/tar.html)
The program tar (originally for tape archive) is useful for archiving and transmitting files. For
example, you may want to 'tar up' all your work for a course on the acpub and save it to your
own computer's disk drive so you don't run into quota problems. You might also want to
submit (e.g., for cps 108 or cps 100) an entire directory at once rather than the individual
files in the directory. The tar program is useful for these and other tasks and is simple to use.
You can see more information by reading the man page (type man tar). The examples below
are not meant to be exhaustive. You can also use the utility gtar instead.

Create, Extract, See Contents
The tar program takes one of three function command line arguments (there are two others I
won't talk about).
     c --- to create a tar file, writing the file starts at the beginning.
     t --- table of contents, see the names of all files or those specified in other command
        line arguments.
     x --- extract (restore) the contents of the tar file.
(the other options are u for update and r for append; see the man page for details).
Exactly one function argument, c, t, x, is used in conjunction with other command line
arguments shown below. Again, these examples are not meant to be complete, just useful.

Compression, Verbose, File specified
In addition to a function command line argument the arguments below are useful. I usually
use z and f all the time, and v when creating/extracting.

     f --- specifies the filename (which follows the f) used to tar into or to tar out from;
      see the examples below.
    z --- use zip/gzip to compress the tar file or to read from a compressed tar file.
    v --- verbose output, show, e.g., during create or extract, the files being stored into or
      restored from the tar file.
Examples
To tar all .cc and .h files into a tar file named foo.tgz use:
  tar cvzf foo.tgz *.cc *.h
This creates (c) a compressed (z) tar file named foo.tgz (f) and shows the files being stored
into the tar file (v). The .tgz suffix is a convention for gzipped tar files, it's useful to use the
convention since you'll know to use z to restore/extract.
It's often more useful to tar a directory (which tars all files and subdirectories recursively
unless you specify otherwise). The nice part about tarring a directory is that it is untarred as a
directory rather than as individual files.
 tar cvzf foo.tgz cps100
will tar the directory cps100 (and its files/subdirectories) into a tar file named foo.tgz.
To see a tar file's table of contents use:
  tar tzf foo.tgz
To extract the contents of a tar file use:
  tar xvzf foo.tgz

This untars/extracts (x) into the directory from which the command is invoked, and prints
the files being extracted (v).
If you want to untar into a specified directory, change into that directory and then use tar.
For example, to untar into a directory named newdir:
  mkdir newdir
 cd newdir
  tar xvzf ../foo.tgz
You can extract only one (or several) files if you know the name of the file. For example, to
extract the file named anagram.cc from the tarfile foo.tgz:
  tar xvzf foo.tgz anagram.cc
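
On machines without a tar command (for example, a Windows PC without extra software), the same three operations can be carried out with Python's standard tarfile module. The sketch below is an illustration; the function names are hypothetical, and unlike the tar command it stores each item under its base name rather than its full path.

```python
import tarfile
from pathlib import Path


def create_archive(archive_path, paths):
    """Create a gzip-compressed tar file (compare: tar czf).
    Directories are added recursively."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in paths:
            archive.add(path, arcname=Path(path).name)


def list_archive(archive_path):
    """Return the names stored in the archive (compare: tar tzf)."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def extract_archive(archive_path, destination):
    """Extract everything into destination (compare: tar xzf).
    Only extract archives that come from a source you trust."""
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(destination)
```

For instance, create_archive("foo.tgz", ["cps100"]) archives the directory cps100, and extract_archive("foo.tgz", "newdir") restores it inside an existing directory named newdir.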


Other Archiving/Compression Tools
Many PC/Mac programs will be able to restore files that have been archived using tar. For
example, on Macs, the Stuffit Deluxe program can handle Unix tar files. On PCs, the
pkunzip program will handle Unix tar files. This makes it possible to tar files up on [a server]
and then use ftp to bring them to your personal machine where you can store the tar files
and restore when needed. Of course you can run Linux too.
The zip and unzip commands available on some systems are very useful replacements for tar.
Zip/unzip programs are nearly standard on Windows 95/NT machines and zip will archive
entire directory structures with the right options (type zip by itself for help).
Appendix IV. Conducting batch BLAST searches using SEQTools
(Thanks to Silvia for the software suggestion!)
   SEQTools is a versatile software package for sequence manipulation and analysis.
Among the many tasks that SEQTools can accomplish is facilitating the submission of
batches of DNA or protein sequences to the NCBI BLAST web interface. The following
tutorial describes how to create a sequence project and conduct a batch BLAST search using
SEQTools.
   SEQTools can be downloaded for a 60-day trial period (this license can be extended in
60-day increments free of charge for students, or investigators can purchase a long-term
license following the trial period) from the website http://www.seqtools.dk . The software
is currently available only for the Windows operating system.
   Once you have downloaded the program, build a new project. It is easiest if you put all of
your sequences together in a folder. Note that all files must be of the same type (e.g.,
FASTA, chromatograph trace files, etc.) and correspond to the same class of macromolecule
(e.g., nucleotide and protein sequences will not be handled correctly in a single project).
From the File menu, select “Open Sequence Files.” From the subsequent menu, select
“Sequences, All Types.” In the Project References dialogue box, select a name for your
project:
Through the Main dialogue box, select the folder that contains your sequences, then select
the “Add to List” button, followed by the “Load Files” button.




Batch BLAST searches can be run on either a local BLAST database or using the internet to
search GenBank online; we will do the latter. To conduct the search, open the Search
menu, select “Blast Batch Search,” then select “Sequential, NCBI – QBlast.” This will open
the BLAST dialogue box.
In the “Final Blast Program” tab, select BlastN as the Blast program for final search.
Choose the number of top BLAST hits that you would like the program to report (“Number
of description returned from NCBI”), and the number of alignments that you would like the
program to report (if you are only interested in the top hits, select zero for this value). You
can also select a maximum expect value for reporting. It is best to deselect the checkbox for
“Result in HTML format,” as it is easier to automate downstream applications using a text
output rather than HTML.




In the “Final Databases” tab, you may choose the GenBank databases to search. For our
purposes, choose the nr (general nucleotide) database:
The “Advanced Options” tab offers further options for database selection.
   The “Destination” tab contains options for specifying the output of your search. You
could choose to “parse results into sequence headers” if you plan to use the BLAST results
further (e.g., using a Perl script to modify the output), or choose to “save results as separate
files without parsing” if you wish to simply view the files to check whether the top hits make
sense. If you use the second option, be sure to save the results as text files, not HTML.




A few words about the remaining tabs: the “Range” tab allows you to choose a subset of
the sequences to submit to BLAST, and the “View Search Progress” tab contains a window
that allows you to follow the progress of your batch search.
Appendix V. Source Modifiers for FASTA Definition Lines or tbl2asn Source Tables
(from the GenBank website: http://www.ncbi.nlm.nih.gov/BankIt/examples/eukrrna.html)


   Source modifiers contain information about the biological source of the sequence to be
submitted. These modifiers can either be embedded in the definition line of a FASTA-
formatted sequence by placing them in square brackets, or can be stored in a separate, tab-
delimited table. The proper format for embedding multiple modifiers in the FASTA
definition line is demonstrated in this example:
>MG104 [organism=Mycena pura] [molecule=DNA] [collection-date=Oct-2005]
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGAGTTGCCCTGTTCATTCATATTATT
TATTACTGACCAGTGAGGATCCACCTAGCAGTATAGATTCGGATGCAGTATAGGCGTATGATTACAACATC
...
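
The bracketed modifiers in a definition line can be pulled apart with a short regular expression, which is handy for checking a file before submission. The Python sketch below is an illustration only; the function name and pattern are hypothetical, and it assumes that modifier values never contain square brackets.

```python
import re

# One "[name=value]" group; neither part may contain a square bracket.
MODIFIER_PATTERN = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_definition_line(line):
    """Split a FASTA definition line into (sequence_id, modifiers),
    where modifiers is a dict of the bracketed name=value pairs."""
    body = line[1:] if line.startswith(">") else line
    parts = body.split(None, 1)
    sequence_id = parts[0] if parts else ""
    modifiers = {name.strip(): value.strip()
                 for name, value in MODIFIER_PATTERN.findall(body)}
    return sequence_id, modifiers
```

Applied to the definition line in the example above, this returns the ID MG104 together with the organism, molecule, and collection-date values.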



   Accepted source modifiers are presented below. Note that some modifier names
have restricted values or formats (Note: the following information is taken directly
from the NCBI website).


         organism should use the unabbreviated scientific name. Example:
          [organism=Drosophila melanogaster]
         molecule should use either "DNA" or "RNA". Example: [molecule=DNA]
         moltype should use one of the following values. Example:
          [moltype=genomic]
            genomic
            precursor RNA
            mRNA
            rRNA
            tRNA
            snRNA
            scRNA
            other-genetic
            cRNA
            snoRNA
            transcribed RNA

         location should use one of the following values. Example:
          [location=mitochondrion]
            genomic
            chloroplast
            kinetoplast
            mitochondrion
            plastid
            macronuclear
            extrachromosomal
            plasmid
            cyanelle
            proviral
            virion
            nucleomorph
            apicoplast
            leucoplast
            proplastid
            endogenous-virus
            hydrogenosome

         collection-date should be in the form YYYY or Mmm-YYYY or DD-Mmm-YYYY.
          Example: [collection-date=2005] or [collection-date=Oct-2005] or
          [collection-date=25-Oct-2005]
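
The three accepted date forms can be checked mechanically. The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, month abbreviations are taken to be the English three-letter forms with an initial capital, and the day is required to have two digits, as in the examples above.

```python
import re
from datetime import date

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
# Optional "DD-" and optional "Mmm-" in front of a four-digit year.
DATE_PATTERN = re.compile(r"(?:(?:(\d{2})-)?([A-Z][a-z]{2})-)?(\d{4})")


def is_valid_collection_date(value):
    """True for YYYY, Mmm-YYYY, or DD-Mmm-YYYY with a real calendar date."""
    match = DATE_PATTERN.fullmatch(value)
    if not match:
        return False
    day, month, year = match.groups()
    if month is None:
        return True
    if month not in MONTHS:
        return False
    if day is None:
        return True
    try:
        date(int(year), MONTHS.index(month) + 1, int(day))
    except ValueError:
        return False
    return True
```

A value such as 1-April-2008 is rejected by this check, while 01-Apr-2008 is accepted.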


The following modifiers should use only TRUE or FALSE. Example:
[transgenic=TRUE].
                      environmental-sample
                      germline
                      metagenomic
                      rearranged
                      transgenic
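
The restricted values listed in this appendix lend themselves to a simple lookup table. The Python sketch below is an illustration only: the names ALLOWED_VALUES and restricted_value_problems are hypothetical, the value lists are copied from this appendix, and treating the comparison as case-sensitive is an assumption.

```python
ALLOWED_VALUES = {
    "molecule": {"DNA", "RNA"},
    "moltype": {"genomic", "precursor RNA", "mRNA", "rRNA", "tRNA", "snRNA",
                "scRNA", "other-genetic", "cRNA", "snoRNA", "transcribed RNA"},
    "location": {"genomic", "chloroplast", "kinetoplast", "mitochondrion",
                 "plastid", "macronuclear", "extrachromosomal", "plasmid",
                 "cyanelle", "proviral", "virion", "nucleomorph", "apicoplast",
                 "leucoplast", "proplastid", "endogenous-virus",
                 "hydrogenosome"},
}
# Modifiers that accept only TRUE or FALSE.
for flag in ("environmental-sample", "germline", "metagenomic",
             "rearranged", "transgenic"):
    ALLOWED_VALUES[flag] = {"TRUE", "FALSE"}


def restricted_value_problems(modifiers):
    """Return (name, value) pairs whose value is not in the allowed set.
    Modifiers without a restricted set are not checked."""
    return [(name, value) for name, value in modifiers.items()
            if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]]
```

The function takes a dictionary of modifier names and values for one sequence; an empty list means that none of the restricted modifiers has an unexpected value.
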
Other accepted modifiers for nucleotide sequences are:

				