Subway Worksheets

					                                          Red Line Walkthrough
A. Identifying repetitive DNA
Example Sequence:           Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb
Tool(s):                    RepeatMasker
Concept(s):                 Non-coding DNA, sequence repeats, mobile genetic elements (transposons)

 Simple repeats: 1-5 bp repeats (e.g. repetitive dinucleotides ‘AT’ etc.)

 Low Complexity DNA: Poly-purine/poly-pyrimidine stretches, or regions of extremely high AT or GC content.

 Processed Pseudogenes, SINES, Retrotranscripts: Non-functional RNAs present within genomic sequence.

 Transposons (DNA, Retroviral, LINES): Genetic elements which have the ability to be amplified and redistributed within a genome.

I. Create Project

     1. Log in to DNA Subway. (dnasubway.iplantcollaborative.org)

     2. Click ‘Annotate a genomic sequence.’ (Red Square)

     3. Select sample sequence: Arabidopsis thaliana (mouse-ear cress) Synthetic Contig.

     4. Provide your project with a title, then click ‘Continue.’

II. Identify and Mask Repeats

     1. Click ‘RepeatMasker.’
          - Wait until the flashing icon displays ‘V’ (view).

     2. Click ‘RepeatMasker’ again to view the results.

                                               Questions:

Q.1:       How many hits were detected in your sample?                                                      __________


Q.2:       RepeatMasker reports the length of the repetitive sequences
           (Length) as well as the class (Attributes):

           a. What is the average length of sequences identified as “simple repeats”?                       __________

           b. What is the average length of sequences identified as “low complexity”?                       __________


Q.3:       What is the total percentage of repetitive DNA in your sequence?
           (Sum of the lengths of all repetitive sequences / sequence length (16.47 kb))                    __________
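
The arithmetic behind Q.2 and Q.3 can be sketched in a few lines of Python. This is only an illustration: it assumes you have copied each hit's class and length out of the RepeatMasker table by hand, and the hits listed here are made-up placeholders, not the actual results for the sample contig.

```python
from collections import defaultdict

# Hypothetical (class, length in bp) pairs copied from the RepeatMasker table.
# Replace these placeholder values with the hits from your own project.
hits = [
    ("Simple_repeat", 40),
    ("Simple_repeat", 60),
    ("Low_complexity", 100),
]
SEQUENCE_LENGTH = 16470  # the 16.47 kb contig, in bp


def average_length_by_class(hits):
    """Return {class: mean hit length} for the listed hits."""
    lengths = defaultdict(list)
    for repeat_class, length in hits:
        lengths[repeat_class].append(length)
    return {c: sum(v) / len(v) for c, v in lengths.items()}


def percent_repetitive(hits, sequence_length):
    """Sum of all hit lengths as a percentage of the sequence length."""
    return 100 * sum(length for _, length in hits) / sequence_length


print(average_length_by_class(hits))
print(round(percent_repetitive(hits, SEQUENCE_LENGTH), 2))
```

Note that this simple sum counts overlapping hits twice; if two hits in your table overlap, the true percentage is slightly lower.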


Additional Investigation: In the results table under ‘Attributes,’ each repeat sequence is labeled “RepeatMasker#-XXX.” The ‘#’
is the ordinal number of the hit; the XXX is the class of DNA element (e.g. “Simple_repeat” or “Low_complexity”). There are
other types of repetitive elements, such as pseudogenes and transposons (e.g. Helitron and Copia). Use online resources to
learn more: (http://gydb.org/index.php/Main_Page).




B. Making Gene Predictions
Example Sequence:               Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb from part A
Tool(s):                        Augustus, FGenesH, Snap, tRNA Scan
Concept(s):                     Genomic DNA, Gene Structure, Canonical sequences
 Gene Predictor: A program that makes use of multiple sensors to model entire gene(s).

 Sensor: An algorithm that works to predict specific features within a sequence (e.g. a canonical splice site or an exon).

 Ab Initio Prediction: Gene prediction based solely on a genomic DNA sequence.

 Hidden Markov Model: An algorithm that represents (in this case) sequence data and signals (nucleotide patterns) as
 states. The probabilities of transitions between states (even if some states are unknown) can be used to model a gene
 and its components. (See: doi:10.1038/nbt1004-1315)

 CDS: The protein-coding exons of a gene sequence.

III. Predict Genes

     1. Click ‘Augustus’ and wait until a green ‘V’ icon appears.

     2. Click ‘Augustus’ again to view a table of results. Use the results determined by ‘Augustus’ to answer question 4.

     3. Repeat steps 1-2 (one at a time) with ‘FGenesH’ and ‘Snap.’ Also run ‘tRNA Scan’ to answer question 6.

                                               Questions:

Q.4:       Look at the ‘Type’ column in the gene prediction report.
           Find the first mention of the term ‘gene’ and copy down the gene’s ‘start’ (i.e. the starting basepair).
           Note the number of times you see the term ‘exon’ (i.e. number of exons predicted).

           Gene Predictor           Start               Exons
           Augustus gene 1          746                 1
           Augustus gene 2          __________          __________
           Augustus gene 3          __________          __________
           FGenesH  gene 1          __________          __________
           FGenesH  gene 2          __________          __________
           FGenesH  gene 3          __________          __________
           Snap     gene 1          __________          __________
           Snap     gene 2          __________          __________
           Snap     gene 3          __________          __________
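
The bookkeeping in Q.4 (record each gene's start, then count its exons) can be sketched as follows. This is an illustration only: the rows are hypothetical placeholders, and the sketch assumes that the exon rows for a gene appear directly after that gene's row in the report.

```python
# Hypothetical rows (type, start, end) in the order they appear in a
# gene prediction report; the coordinates are placeholders.
rows = [
    ("gene", 100, 2100),
    ("mRNA", 100, 2100),
    ("exon", 100, 1200),
    ("exon", 1350, 2100),
    ("gene", 3000, 4500),
    ("exon", 3000, 4500),
]


def summarize_genes(rows):
    """Return a list of (gene start, exon count), one entry per 'gene' row."""
    genes = []
    for feature_type, start, _end in rows:
        if feature_type == "gene":
            genes.append([start, 0])      # open a new gene record
        elif feature_type == "exon" and genes:
            genes[-1][1] += 1             # count the exon toward the latest gene
    return [tuple(g) for g in genes]


for number, (start, exons) in enumerate(summarize_genes(rows), start=1):
    print(f"gene {number}: start={start}, exons={exons}")
```

Running the same summary on the rows from each predictor gives the three sets of values the chart asks for.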

Q.5: Based on the chart in question 4, did all the gene predictors yield genes starting at the same location? Did
all the gene predictions have the same number of exons?
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.6: Looking at the number of results returned by tRNA Scan, why are they so different from the results returned by
the other predictors? Are there places in the genome where tRNAs are more or less densely concentrated?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: Look for the background link at the bottom of the DNA Subway home page and review the section
entitled ‘Gene Finding.’




C. Viewing Gene Predictions in a Browser
Example Sequence:            Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb from part B
Tool(s):                     Local Browser (GBrowse)
Concept(s):                  Gene orientation/structure, transposons, chromosome organization

 Gene Browser: A GUI (Graphical User Interface) for viewing biological information. GBrowse (DNA Subway’s
 Browser) is “designed to view genomes. It displays a graphical representation of a section of a genome, and shows
 the positions of genes and other functional elements. It can be configured to show both qualitative data such as the
 splicing structure of a gene, and quantitative data such as microarray expression levels.” -
 http://gmod.org/wiki/GBrowse_FAQ

 Track: The individual regions of the display where information imported into the browser is shown. For each type
 (or source) of information, there is usually an associated track.

IV. View Gene Predictions

     1. Click ‘Local Browser’ and allow the browser to load.

     2. Under ‘Scroll/Zoom’ select ‘Show 25kbp.’ When the browser reloads, it should show ‘Show 16.47 kbp.’
        Answer question 6.

     3. Under ‘Reports & Analysis,’ ‘Download Decorated FASTA File’ should be selected. Click ‘Configure.’

     4. On the line ‘Augustus Predicted Genes’ click the radio button to select ‘BKG’ as ‘red.’ On the line
        ‘FGenesH Predicted Genes’ click ‘Underline.’

     5. Click ‘Go’ to view the results and answer question 7.



                                                        Questions:

Q.6: What observations can you make about the locations of transposons and repetitive DNA in relation to
predicted genes?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.7: Red highlighted sequence demarcates Augustus predictions, underlined sequences are predictions made
by FGenesH. What do the gene predictions have in common, and is there any pattern to how the FGenesH
predictions begin and end?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: Play with the Decorated FASTA file to see sequence differences between different gene predictors.
Copy and paste sequence into the NCBI ‘BLAST’ server to get information on predicted genes. (http://blast.ncbi.nlm.nih.gov/)




D. Adding Experimental (Biological) Evidence
Example Sequence:            Arabidopsis thaliana (mouse-ear cress) Synthetic Contig, 16.4 kb from part C
Tool(s):                     BLASTN, BLASTX, Upload Data
Concept(s):                  RNA, cDNAs, ESTs, Biological Databases

 BLAST: Basic Local Alignment Search Tool (BLAST) is an algorithm that searches databases of biological sequence
 information (e.g. DNA, RNA, or protein sequence) and returns matches. The BLASTN program is specific to
 nucleotide data, and the BLASTX algorithm works with sequence data translated into amino acid sequences.

 UniGene: A database of transcript data; “each UniGene entry is a set of transcript sequences that appear to come
 from the same transcription locus (gene or expressed pseudogene), together with information on protein
 similarities, gene expression, cDNA clone reagents, and genomic location.” - http://www.ncbi.nlm.nih.gov/unigene

 cDNA: DNA produced by reverse transcribing mRNA using reverse transcriptase. cDNAs are used to investigate
 mRNA within a biological sample.

 ESTs: “Small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either
 one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in
 certain cells, tissues, or organs from different organisms.” - http://www.ncbi.nlm.nih.gov/About/primer/est.html

V. Search databases for Biological Evidence

     1. Click ‘BLASTN.’
          - Wait until the flashing icon displays ‘V’ (view).

     2. Click ‘BLASTN’ again to view the results.

     3. Click ‘BLASTX.’
          - Wait until the flashing icon displays ‘V’ (view).

     4. Click ‘BLASTX’ again to view the results.

                                               Questions:

Q.8: Both BLASTN and BLASTX return the ‘Length’ of your resulting matches. Do you notice differences in the
average lengths of BLASTN and BLASTX matches? Explain.
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________

Q.9: Under ‘Type,’ both BLASTN and BLASTX return ‘match’ and ‘match_part.’ ‘Match’ describes the overall length
of a single match (globally), but individual significant matches may be fragmented, i.e. ‘match_part.’ Do BLASTN
and BLASTX return ‘match’ and ‘match_part’ results in different frequencies? Explain.
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________
___________________________________________________________________

Additional Investigation: Under Attributes in the BLASTN and BLASTX results there is a section called ‘description.’
Use an internet search engine and/or other resources to learn about the functional features of significant hits.
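
To make the BLASTX idea of “sequence data translated into amino acid sequences” concrete, the sketch below translates a DNA sequence in all six reading frames (three per strand) using the standard genetic code. It illustrates the concept only and is not how BLASTX itself is implemented; the function names are chosen for this example.

```python
BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return {frame label: protein string} for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translation("ATGGCCTAA").items():
    print(label, protein)
```

Because every three nucleotides become one amino acid, matches reported at the protein level cover fewer residues than the stretch of DNA they span, which is worth keeping in mind when comparing BLASTN and BLASTX match lengths.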




               Apollo Annotation Tips for Protein Coding Genes
                      This example assumes you have run the following routines:
     RepeatMasker, Augustus, FGenesH, Snap, BLASTN, BLASTX, User BLASTN (with A. thaliana EST data)

Prepare your workspace

1. Choose one strand to work on.

        (View > Show forward strand or Show reverse strand – check/uncheck your selection)

        - Apollo displays data on both strands; most users will want to work on one strand at a time.

2. Display all models and evidence by expanding tiers.

        (Tiers > Expand all tiers)

        - Apollo may represent multiple data in a single evidence track; expanding the tiers to see all data will
        make it easier to manipulate the data during your annotation process.




 Apollo 2 strand view          Apollo 1 strand view         Tiers collapsed            Tiers expanded

3. Hide unnecessary data.

        (Tiers > Show types panel > Show (uncheck your selection, e.g. BLASTX, BLASTX_USER, BLASTN,
         BLASTN_USER))

        - Protein (BLASTX/ BLASTX_USER) and EST (BLASTN_USER) data can usually be worked on in later steps;
        hide them in the tiers menu for now. You will add these data back to your analysis when you are ready to
        consider them.

After preparing your workspace, there are 4 steps to creating a basic manually curated annotation within
Apollo:

A. Create a Gene Model
B. Determine transcript length
C. Determine splice sites and variants
D. Determine start/stop sites




A. Create a Gene Model

Possible decision components:

Biological
        UniGene Model (BLASTN)
        Why? – UniGene models are derived from cDNA and ESTs (transcriptome evidence) produced by
         experiment. (http://www.ncbi.nlm.nih.gov/UniGene/help.cgi?item=build2)

Hypothetical
       Gene model of choice
       Why? - A gene model generated by any of the prediction algorithms is based on known biological
       constraints, and is an a priori hypothesis based only on the genomic sequence.

1. Select a gene model as a scaffold

         - Use transcriptome evidence (UniGene - BLASTN) to select the best possible gene model for a scaffold. If
         no gene model exists, or none closely reflects the UniGene model, use the UniGene model itself as a
         scaffold (see examples 1, 2).

2. Drag the gene model of choice into the workspace and label the new scaffold. Name the model using the
   ‘Annotation info editor.’

         (Right Click/      Click on the model > Annotation info editor)




                             Ex.1: Models and evidence: Top: Augustus; Middle: FGenesH; Bottom: (BLASTN)

Ex.1: The Augustus and Unigene models are very close. FGenesH (which does not predict UTRs) could be used as a scaffold if you were not
concerned about modeling the UTR. SNAP (not shown) did not predict a gene at this locus. In this case, Augustus is probably the best choice
for a scaffold.




                   Ex.2: Models and evidence: Top: Augustus; 2nd Track: SNAP; 3rd Track: FGenesH; Bottom: (BLASTN)

Ex.2: The Augustus and UniGene models are very close; however, SNAP predicts 2 genes at this locus. Barring additional evidence, Augustus
may be the best gene model to start a scaffold.




                                                 Ex.3: Named Augustus model in workspace


B. Determine transcript length

Possible decision components:

Biological
        UniGene Model (BLASTN)
        Why? – Full length cDNAs (which are components of the Unigene model) give experimentally
         determined boundaries for the transcript.

Hypothetical
       Gene model from part A

1.       Drag the BLASTN model into the workspace, and then name it using the ‘Annotation info editor.’

         (Right Click/      Click on the model > Annotation info editor)




                                    Ex.4: cDNA evidence and Augustus based model in the workspace

Ex.4: The cDNA supports a transcript that is shorter than the Augustus based model at both the 5’ and 3’ ends of the transcript.


2.       Use the ‘Exon detail editor’ to adjust the lengths of the model transcript.

         (Right Click/      Click on the model > Exon detail editor)




C. Determine splice sites and variants

Possible decision components:

Biological
        EST data (BLASTN_USER)
        Why? – Like full length cDNAs, ESTs give valuable information on transcript diversity. ESTs are
        generated by high throughput methods, and although the data may be fragmentary, it may
        capture biologically relevant information about splice variants.

         UniProt Protein data (BLASTX/BLASTX_USER)
         Why? – Proteins do not contain UTRs, but do contain the initiating amino acid (methionine). Their
         lengths may give clues to the actual length of the translated protein.

Hypothetical
       Gene model from part B


1. Use the tiers menu to show all available data.

         (Tiers > Show types panel > Show (check your selection, e.g. BLASTX, BLASTX_USER, BLASTN,
          BLASTN_USER))

         - Depending on the database you upload (in the example case, Arabidopsis ESTs) you will have to consider
         how to interpret the possible splice variants. BLASTX_USER returns hits from UniProt and may contain hits
         from genomes other than the one you are annotating. Gene duplications may also give hits which
         annotate to other loci.




                                     Ex.5: AT5G13220 (JAZ10) In Apollo (left) and Phytozome (right)

Ex.5: At a locus where there is alternative splicing, gene models may disagree, and the biological evidence may also seem in conflict.
According to Phytozome (right) there is the primary annotated transcript (highlighted green) and three alternative transcripts displayed
below it. In Apollo, the UniGene model (BLASTN – bottom track) seems to suggest the first alternative transcript (AT5G13220.2), but other
EST evidence (BLASTN_USER) suggests other transcripts. Depending on the amount of evidence for alternative transcripts at a locus, you
may have to create several models.




2. Based on available evidence, drag any additional hypothetical gene models (and/or transcriptome models –
BLASTN, BLASTX, BLASTN_USER, BLASTX_USER) into the workspace. Rename each model using the ‘Annotation
info editor.’

        (Right Click/   Click on the model > Annotation info editor)




                                Ex.6: Gene models with ‘non-canonical’ splice sites highlighted


3. Use the ‘Exon detail editor’ to “fix” non-canonical splice sites (highlighted with yellow arrows).

        (Right Click/   Click on the model > Exon detail editor)

3a. (Optional) In some cases you may want to change the exonic structure of your model. You can do this by either
splitting an exon (select the exon of choice, then Right Click/ Click on the model > Split exon) or merging two exons
(select the exons of choice, then Right Click/ Click on the models > Merge exons). Lengths of introns and exons can
always be changed in the exon editor.
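
What makes a splice site “canonical” can be sketched in code: most introns begin with GT and end with AG. The example below flags introns that lack these boundaries. It is a simplified illustration for forward-strand models with 1-based, inclusive exon coordinates; the function names and the toy sequences are made up for this example, and Apollo's own check may differ.

```python
def intron_is_canonical(sequence, intron_start, intron_end):
    """Check a forward-strand intron for GT...AG boundaries.

    Coordinates are 1-based and inclusive, as in a genome browser.
    """
    intron = sequence[intron_start - 1:intron_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def non_canonical_introns(sequence, exons):
    """Return the introns between consecutive exons that lack GT...AG."""
    exons = sorted(exons)
    flagged = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        intron = (left_end + 1, right_start - 1)
        if not intron_is_canonical(sequence, *intron):
            flagged.append(intron)
    return flagged


# Toy example: exon 1 (1-6), intron (7-14), exon 2 (15-20).
good = "ATGGCC" + "GTAAAAAG" + "GGATAA"
bad = "ATGGCC" + "GCAAAAAG" + "GGATAA"
exons = [(1, 6), (15, 20)]
print(non_canonical_introns(good, exons))
print(non_canonical_introns(bad, exons))
```

Shifting an exon boundary by a base or two in the Exon detail editor is often enough to move a flagged intron onto a GT...AG pair, provided the evidence supports the change.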




                                  Ex.7: BLASTX Model with non-canonical splice sites fixed.


4. Calculate the longest ORF (open reading frame) in your model.

        (Right Click/   Click on the model > Calculate longest ORF)
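
As a rough idea of what a longest-ORF calculation involves, the sketch below scans the three forward reading frames of a transcript sequence for the longest stretch that runs from an ATG to the next in-frame stop codon. It is an illustration under simplifying assumptions (spliced transcript sequence as input, forward strand only, complete ORFs only) and is not Apollo's actual routine.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return (start, end, orf) of the longest ATG...stop ORF on the forward strand.

    start and end are 1-based inclusive positions; the stop codon is included.
    Returns None if no complete ORF is found.
    """
    sequence = sequence.upper()
    best = None
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if orf_start is None and codon == "ATG":
                orf_start = i                      # first ATG opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                length = i + 3 - orf_start
                if best is None or length > len(best[2]):
                    best = (orf_start + 1, i + 3, sequence[orf_start:i + 3])
                orf_start = None                   # look for the next ORF
    return best


print(longest_orf("CCATGAAATGGTAGCC"))
```

The input should be the exons of the model joined together, since introns are removed before translation.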




                                       Ex.8: BLASTX Model with longest ORF displayed




D. Determine start/stop sites

Possible decision components:

Biological
        UniProt Protein data (BLASTX/BLASTX_USER)
        Why? – Proteins do not contain UTRs, but do contain the initiating amino acid (methionine).
        The lengths of the protein hits give clues to the actual length of the translated protein at that locus
        and its reading frame.

Hypothetical
       Gene model from part C


1. Use the protein data to establish probable start and stop sites. Drag the start and stop icons into your model
from those displayed above the workspace.




                 Ex.9: Start and stop codons are highlighted in green and red, respectively, in the uppermost part of the screen.



2. Ensure that your model is complete after all changes by calculating the longest ORF (open reading frame) in
your final model(s).

        (Right Click/   Click on the model > Calculate longest ORF)


                  Once you have finished your model, upload the results back to DNA Subway.

        (File > Upload to DNA Subway)



                                        Apollo Visual Glossary

Screen regions labeled in the figure:

        - Tiers of gene evidence (predicted models and biological data)
        - Model building workspace
        - Position
        - Zoom, Pan, Horizontal Scroll
        - Additional info (active when model is selected)

Model features labeled in the figure:

        - Possible start (green) and stop (red) sites
        - Exon
        - Intron (bent line)
        - Alignment gap (straight line)
        - Designated start site
        - UTR
        - Designated stop site
        - Non-canonical splice




                  Some Useful Apollo Menus (right-click/ click)




                         Exon Detail Editor - adjusts exon boundaries (right-click/   click)




Tiers Menu – color coding of tiers, and show/hide (Tiers menu)

Sequence menu – extract sequence at selected locus (right-click/ click)




               Annotation Info Editor – name gene models, and add comments (right-click/       click)




Summary of mouse functions
Mouse buttons perform many functions in Apollo. The table below summarizes the functions performed
by the three mouse buttons, with and without holding down the shift key.

If you are on a Mac and have a single-button mouse, you can simulate a right mouse click by holding
down the control or alt key while clicking the mouse, and you can simulate a middle mouse click by
holding down the apple key while clicking the mouse. If you are trying to copy text from Apollo (for
example, sequence residues from a Sequence window) to paste into another application, use ctrl-c to
copy the text and apple-v to paste it (or, if you are trying to paste into a Web browser, you can use the
'Paste' command from the browser's Edit menu).

Please note that if you are running an old version of Mac OS X (10.2.2 or earlier), you may find strange
mouse-button behavior. With a three-button mouse, you may find that the behavior of the right and
middle buttons is switched. On a Windows laptop you may find that the middle mouse button will pop
up a little scrollbar. To simulate middle mouse you might have to use the Alt key with the left mouse
button.

Mouse key            Action
Left                 Select feature (or deselect if you're not over any features)
Shift-Left           Add feature to current selection (or remove feature if it's already selected)
Left drag            If you drag a feature into the annotation tier, it will be added as a new transcript (if editing is enabled)
Shift-Left drag      If you shift-drag a feature onto an annotation, it will be added as a new exon (if editing is enabled)
Middle click         Center display on clicked location
Middle drag          Rubberband multiple features
Shift-Middle drag    Rubberband multiple features, adding to current selection (or remove if they are already selected)
Right                Popup menu
Shift-Right drag     Tier drag--move currently selected tier




                           DNA Subway Annotation “Cheat Sheet”



     1. Establish a project or open an existing project.
     2. Run RepeatMasker.
     3. Run Gene Predictors (e.g. Augustus, FGenesH, etc.).
     4. View results in Local Browser; compare and contrast predictions.
     5. Run BLAST searches.
     6. View results in Local Browser; compare and contrast predictions and BLAST results.
     7. Add additional biological evidence in form of cDNA, ESTs, genes, proteins (optional):
            a. To download ESTs for a sample sequence, open the “Annotation” directory at
               http://gfx.dnalc.org/files/evidence/, right-click the appropriate file and save it to your
               computer. (Do not open the file, but download or save it to your computer.)
            b. After saving the file to your computer go back to your project in DNA Subway.
            c. Click ‘Upload Data,’ browse to the file, and upload it to DNA Subway.
            d. Click ‘User BLASTN (or User BLASTX for protein)’ to search the uploaded data.
     8. Synthesize predictions and BLAST search results into gene models using Apollo:
            a. General navigation tools (scrolling, zooming) can be accessed on the main Apollo screen
            b. General tools to handle files and data can be accessed in the tab menu at the top of the
               Apollo screen.
            c. Editing tools can be accessed by right-clicking (command-click on Macs) an item on the
               workspace.
            d. A utility to record the changes and change the name of a model is included in the editing
               tools as “Annotation info editor.”
            e. A tool to lengthen or shorten exons is included in the editing tools as “Exon detail editor.”
            f. Apollo indicates items that require specific attention by using triangles; right- or left-
               pointing green or red triangles point at potentially missing start or stop codons. Yellow
               triangles indicate non-canonical splice sites.




                                 Advanced Genome Annotation
                                             Annotating a DNA Contig

Experiment 1: Predict Genes in an Arabidopsis Contig

      I.   Create a Project
      1.   Enter DNA Subway at http://www.dnasubway.org.
      2.   Click the red square to annotate a genomic sequence.
      3.   Select sample sequence Arabidopsis thaliana (mouse-ear cress) Chr5, 100.00 kb.
      4.   Provide a title (required), a project description (optional) and click ‘Continue’.

     II. Mask Repeats
      1. Click ‘RepeatMasker.’
      2. Once the bullet has finished blinking, click ‘RepeatMasker’ again to view a listing of repetitive
         DNA sequences ‘RepeatMasker’ has identified and masked.

           You may wish to note how many and which types of repetitive DNA ‘RepeatMasker’ identified.
           Under the “Attributes” menu you may find unfamiliar terms for defined repeats such as the
           transposons Copia and Harbinger. You can use a search engine to get additional information.

      3.   Close the table to return to DNA Subway.
      4.   Click ‘Local Browser’ to view the results in a graphical interface.
      5.   Maximize the browser window.
      6.   Change Show 10 kbp to Show 100 kbp in the Scroll/Zoom utility.


      7. Close the Local Browser screen to return to DNA Subway.

     III. Predict Genes
       1. Click ‘Augustus.’
       2. Once ‘Augustus’ has finished click ‘FGenesH’. Then, click ‘SNAP’. Finally, click ‘tRNA Scan’ (Note:
          the Augustus, FGenesH and SNAP algorithms predict protein-coding genes; tRNA Scan identifies
          tRNA genes).

           These gene prediction programs all use different – though sometimes similar – methods to
           generate predictions. Did you notice any difference in runtime?

      3. View the results for each predictor by clicking the predictor button again to generate a table. You
         should also examine the results in the Local Browser.

           Do the different programs predict the same genes or can you identify differences among the
           predictions?

      4. Close the table and browser screens to return to DNA Subway.


  IV. Search Databases for Transcriptome Evidence
   1. Click the ‘BLASTN’ button to search a database of known genes and transcripts (e.g. cDNAs and
      ESTs) for a match to the contig sequence.
   2. When BLASTN is complete, click ‘BLASTX’ to search for matches to the contig sequence in a
      database of experimentally verified protein sequences.
   3. Go to: http://gfx.dnalc.org/files/evidence/Annotation and download the file “at chr5 est
      evidence.fasta.”
   4. Click ‘Upload Data’, and upload the above file under “Add DNA data in FASTA format.”
   5. Click ‘User BLASTN.’



     6. View BLAST matches in the table view by clicking the respective BLAST buttons again. You may
        also choose to view the results in the Local Browser.
     7. Close the table and browser screens to return to DNA Subway.


Experiment 2: Synthesize Gene Predictions and Transcriptome Evidence into Gene Models

Technique 1: Edit Exons

     I. Build a Gene Model
     1. Click ‘Apollo.’




        Your screen should look something like this, with multiple evidence types (gene predictions,
        repeats, transcriptome evidence, etc.) displayed by color coded icons.




     2. Click the Tiers menu and select Expand Tiers to view all available evidence. Apollo initially
        collapses the different evidence types onto a single line each, regardless of how many pieces of
        evidence are available for each position.
     3. Under the View menu, uncheck “Show reverse strand.”
     4. Zoom, pan and scroll to nucleotide position 29,500-33,500 until you can comfortably view details
        for a gene on the forward strand in this location.




     5. You should now be able to distinguish gene features such as exons and introns.

        Compare the predictions with each other and with the BLAST evidence – what similarities and
        differences can you identify? The Augustus gene prediction has the same structure as the other
        predictions and the BLASTN evidence; however, it is longer than the other predictions and
        therefore agrees more strongly with the BLASTN evidence.

     6. Double-click the Augustus prediction and move it onto the workspace – this is the foundation for
        a model for the gene in this location.
     7. Right-Click on your gene model to name it using the “Annotation Info Editor”
     8. Scroll down the page and identify the BLASTN evidence. Double-click and move the longest piece
        of BLASTN evidence onto the workspace.




     9. Name the BLAST-based model by right-clicking the model and using the “Annotation Info
        Editor.”




        Comparing the Augustus prediction and the BLASTN evidence, you will find that they share the
        same exon-intron structure, but differ in the overall lengths: the gene model starts and ends
        further down-stream than the BLASTN evidence.


     1. Use Exon Detail Editor to adjust the lengths of the flanking exons of the model:
           a. Double-click the gene model.
           b. Right-click the gene model;
           c. Select Exon detail editor in the pop-up window to open the Exon Editor;
           d. the Exon Editor displays the sequences of the gene model and the BLASTN evidence side-
               by-side; a red frame highlights the gene model;
           e. Grab and hold the edge at the beginning of the model’s first exon and move it to the left
               to position it flush with the start of the BLASTN-match;




           f. Click the end of the gene model depicted in the schematic view at the bottom of the
               Editor window to edit that part of the sequence;
           g. Grab and hold the edge at the end of the last exon and move it 89 nucleotides to the left
               and up to position it flush with the end of the BLASTN-match;
           h. Close the Exon Editor.

     2. To conclude your annotation for this gene’s structure:
            a. Right-click the BLASTN evidence on your workspace;
            b. Select Delete selection;
            c. Delete any other evidence or prediction from the workspace until only your final gene
               model remains;
            d. Click menu tab File and select Upload to DNA Subway.

     II.   Browse Your Gene Model
      1.   Minimize or close Apollo.
      2.   Bring up the DNA Subway window.
      3.   Click ‘Local Browser’ to browse your gene model.



Technique 2: Fix Start Codons

     1. Navigate to nucleotide position 14,000.




        Identify the differences among the predictions and the BLAST evidence.
        Specifically, what start and end points for the gene do the different prediction and evidence
        items indicate?

     2. Move the Augustus gene prediction and the BLASTN evidence for this gene onto the workspace;
        adjust the 5’- and 3’ ends of the model. Name your respective models using the “Annotation Info
        Editor.”
        Examine the model’s beginning: Does it have a start codon? Zoom in on the first third of the
        first exon (position 14,060 through 14,200) to answer this question.




     3. To define a start codon for your model:
           a. Zoom into the first exon;
           b. Evaluate whether the biological evidence (BLASTX) provides evidence for a start codon;
               if the biological evidence does not provide a position for a start codon, choose the
               first ATG/methionine instead;
           c. Move your cursor to the upper edge of your screen;
           d. Grab and hold the first green rectangle located within the first exon;


            e. Move the green rectangle all the way down onto your model to insert it as a new start
                codon.
     4. To finalize your annotation:
            a. Zoom out and verify your model;
            b. Delete from the workspace any evidence or predictions other than your final model for
                this gene;
            c. Upload your result to DNA Subway.
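
The fallback rule in step 3b (choose the first ATG when the evidence gives no start position) can be sketched in a few lines of Python. This is only an illustration: the function name, the example sequence and the exon start coordinate are made up for the example and do not come from the sample contig or from Apollo.

```python
from typing import Optional


def first_start_codon(exon_seq: str, exon_start: int) -> Optional[int]:
    """Return the genomic position of the first ATG in an exon.

    exon_seq   -- forward-strand sequence of the exon
    exon_start -- genomic coordinate of the exon's first base
    Returns None when the exon contains no ATG.
    """
    offset = exon_seq.upper().find("ATG")
    if offset == -1:
        return None
    return exon_start + offset


# Hypothetical exon sequence starting at position 14060 (not real data).
exon = "ccgtctaacgATGGCTTCAAAG"
print(first_start_codon(exon, 14060))
```

Note that the sketch ignores reading frame and evidence; in Apollo you still verify the choice visually against the BLASTX track.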

Technique 3: Delete Exons

     1. Navigate to nucleotide position 47,500.




        Identify the differences among the predictions and the BLAST evidence.
        What is the number of exons for the different predictions and evidence items?

     2. Move the Augustus gene prediction and the BLASTN evidence onto the workspace. Name your
        respective models using the “Annotation Info Editor.”

     3. Compare the Augustus-derived gene model and the BLASTN evidence. You will find that the
        model’s leading exon is not supported by BLAST evidence. To remove it:
            a. Click the first exon in the gene model.
            b. Right-click the model;
            c. Click Delete selection.
     4. Adjust the 5’- and 3’ ends of the model by using Exon Detail Editor to match it to the BLASTN
        evidence.

     5. To finalize your annotation:
            a. Zoom out and verify your model;
            b. Delete from the workspace any evidence or predictions other than your final model for
                this gene;
            c. Upload your result to DNA Subway.



Technique 4: Split Exons

     1. Navigate to nucleotide position 18,500-21,000.




        Identify the differences among the predictions and the BLAST evidence.
        Specifically, what is the number of exons for the different predictions and evidence items?

     2. Move the Augustus gene prediction and the BLASTN evidence for this gene onto the workspace;
        adjust the 5’- and 3’ ends of the model.
     3. Compare the gene model and the BLASTN evidence. You will find that the gene model shows one
        long leading exon where the BLASTN evidence has two. To split this exon:
            a. Zoom into the first exon in the gene model;
            b. Click the first exon in the gene model.
            c. Right-click in the first exon approximately at the position where you wish to split it;
            d. Select Split exon to split the first exon into two fragments;
            e. Double-click the gene model.
            f. Right-click the gene model;
            g. Select Exon detail editor in the pop-up window to open the Exon Editor;
                the Exon Editor displays the sequences of the gene model and the BLASTN evidence side-
                by-side; a red frame highlights the gene model;
            h. Maximize the Exon Editor window;
            i. Find the gap in the highlighted sequence at the spot at which the background color in the
                former first exon changes – this is the position where the exon has been split;
            j. Grab the 3’-edge of the first exon fragment and move it to the left and up to position it
                flush with the end of the first BLASTN exon;
            k. Grab the 5’-edge of the downstream fragment and move it to the right and down to
                position it flush with the beginning of the second BLASTN exon;
            l. Close the Exon Editor.




     4. You will find that by splitting the first exon into two you generated a non-canonical splice site. To
        adjust the splice site:
            a. Double-click the gene model.
            b. Right-click the gene model;
            c. Select Exon detail editor in the pop-up window to open the Exon Editor;
            d. Adjust the beginning of the gene model’s second (new) exon to start following (in 3’-
                direction) the nearest AG;
            e. Close the Exon Editor.
     5. To finalize your annotation:
            a. Zoom out and verify your model;
            b. Delete from the workspace any evidence or predictions other than your final model for
                this gene;
            c. Upload your result to DNA Subway.
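
Step 4d above (start the new exon directly after the nearest AG) amounts to a small search problem. The following Python sketch shows one way to express it, assuming 0-based string indices and an arbitrary tie-break toward the upstream site; the function name and the test sequence are hypothetical.

```python
from typing import Optional


def exon_start_after_nearest_ag(seq: str, current_start: int) -> Optional[int]:
    """Return the index of the base that follows the AG nearest to current_start.

    seq           -- genomic sequence on the strand of the gene (0-based indexing)
    current_start -- index of the exon's current first base
    An exon that begins right after an AG has a canonical acceptor site.
    Ties between an upstream and a downstream AG go to the upstream one
    (an arbitrary choice made for this sketch).
    """
    seq = seq.upper()
    # Every AG at index i yields a candidate exon start at i + 2.
    candidates = [i + 2 for i in range(len(seq) - 2) if seq[i:i + 2] == "AG"]
    if not candidates:
        return None
    return min(candidates, key=lambda p: (abs(p - current_start), p))


# Hypothetical sequence with AG dinucleotides at indices 4 and 10.
region = "TTTCAGGCTTAGCC"
print(exon_start_after_nearest_ag(region, 8))
```

In practice the nearest AG is not always the right one; the BLASTN evidence should still support the boundary you pick.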

Techniques 5 & 6: Merge Exons and Build Alternative Transcripts

     1. Navigate to nucleotide position 89,500-92,500.




     2. Move the Augustus gene prediction and the longest BLASTN transcript evidence that resembles
        the model (5 exons, Exon #4 about 60 nucleotides) onto the workspace; adjust the 5’- and 3’
        ends of the model.
     3. Record your edits and name the model.
     4. Delete the BLASTN evidence from the workspace.




     5. Compare the gene model with the various biological evidence items. You will find that some
         BLASTN evidence shows Exon #4 to be about 110 nucleotides long as opposed to 58 nucleotides
         in the first model.
     6. To build an alternative transcript for this gene:
             a. Double-click the first model;
             b. Right-click the first model;
             c. Select Duplicate transcript to generate the foundation for an alternative transcript.
     7. Move the BLASTN evidence that contains five exons with an Exon #4 of about 110 nucleotides in
         length onto the workspace.
     8. Extend the 3’-end of Exon #4 in the alternative model to the 3’-edge of the BLASTN evidence
         using Exon Detail Editor.
     9. To update the open reading frame/coding sequence:
             a. Double-click the new model;
             b. Right-click the model;
             c. Select Calculate longest ORF.
     10. Delete the BLASTN evidence from the workspace.
     11. Record your changes and name the alternative gene model.
     12. Compare the biological evidence with the two gene models. You will find that some BLASTN
         evidence shows a large fourth exon that encompasses Exon #4 and Exon #5 in the current two
         models.
     13. To build a third alternative transcript:
             a. Right-click the first model and select “duplicate”;
             b. Shift-click the fourth and fifth exons in the third model;
             c. Right-click one of the exons;
             d. Select Merge exons.
     14. Update the third model’s open reading frame/coding sequence.
     15. Record your changes and name the new alternative gene model.
     16. Compare the biological evidence with the two gene models. You will find that some BLASTX
         PROTEIN evidence shows a large second exon that encompasses Exon #2 and Exon #3 in the
         previous two models. However, the problem with using this information to build a fourth
         alternative transcript is that no biological evidence is available that would allow you to determine
         what other exons would be part of this fourth transcript – therefore you should not build a
         fourth alternative model without further evidence.
     17. To finalize your annotation:
             a. Zoom out and verify your model;
             b. Delete from the workspace any evidence or predictions other than your final model for
                  this gene;
             c. Upload your result to DNA Subway.
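
Steps 9 and 14 above recalculate the longest ORF after an exon has been extended or merged, because changing exon boundaries changes the spliced transcript. The Python sketch below illustrates the idea on a forward-strand transcript: join the exons, then scan the three reading frames for the longest ATG-to-stop stretch. It is a simplified illustration, not Apollo's implementation, and the exon sequences are invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(transcript: str) -> str:
    """Return the longest ATG...stop ORF in the three forward reading frames."""
    transcript = transcript.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF, if any
        for i in range(frame, len(transcript) - 2, 3):
            codon = transcript[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = transcript[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


# Hypothetical two-exon model; splicing is simulated by joining the exons.
exons = ["CCATGGC", "TAAAGGTTGACC"]
print(longest_orf("".join(exons)))
```

ORFs that run off the end of the transcript without a stop codon are ignored here, which is a simplifying assumption.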




Experiment 3: Identify Gene Homologs in Yellow Line

      I. Search Genomes for Homologs to Annotated Genes
      1. Enter DNA Subway and open a project.
      2. Click Transfers, then click Continue to open the Local Browser.
      3. Click a gene model, prediction, evidence item or repeat.
      4. Select Detail View and Transfers
      5. Select a sequence to transfer and click the Genome Prospecting button to transfer the selected
         sequence to the Yellow Line.
     II. Example
      1. Transfer the different gene models built to the Yellow Line.
      2. Transfer the genes, ORFs and proteins and compare the number of matches for each.


Experiment 4: Compare Your Results to Existing Records

Note: This function may only be available for projects based on DNA Subway sample sequences.

      I.    Compare Your Annotations to Those in the Phytozome Genome Hub
      1.    Enter DNA Subway and open a project.
      2.    Click Export.
      3.    Select the Phytozome Genome Hub and click Continue.
      4.    Find the chromosome and location in the field labeled Landmark or Region.
      5.    To compare your results to those presented by Phytozome, zoom into the region that contains
            your annotation and compare it with the gene model present in Phytozome’s Transcript track.
     II.    Examine Gene Functions Listed in Phytozome
      1.    Roll over a gene model in Phytozome’s Transcript track to check whether Phytozome lists a
            potential function for the gene.
     III.   Examine Gene Homologs Listed in Phytozome
       1.   To identify in what plants homologs to the gene have been found:
                a. scroll down to the Tracks section of the Phytozome window;
                b. check the boxes next to BLASTX and BLATX Plant Peptides;
                c. click Update Image;
                d. roll over entries in the Peptide BLASTX and BLATX tracks to identify what plant species
                    encode homologous genes and/or proteins.
                e. Close the Phytozome window.




                             Biological and Gene Annotation Concepts
RepeatMasker
    A genome is an organism’s entire complement of DNA.
    DNA is a directional molecule composed of two anti-parallel strands.
    The genetic code is read in a 5’ to 3’ direction, referring to the 5’ and 3’ carbons of deoxyribose.
    Eukaryotic genomes contain large amounts of repetitive DNA, including simple repeats and transposons.
    Transposons can be located in intergenic regions (between genes) or in introns (within genes).
    Genes and transposons are directional, and can be encoded on either DNA strand.
    Repeats are non-directional, and, in effect, do occur on both strands.
    Transposons can mutate like any other DNA sequence.

Gene Predictors
    Protein-coding information in DNA and RNA begins with a start codon, is followed by codons, and ends with a
       stop codon.
    Codons in mRNA (5’-AUG-3’, etc.) have sequence equivalents in DNA (5’-ATG-3’, etc.).
    The DNA strand that is equivalent to mRNA is called the “coding strand.” The complementary strand is called
       the “template strand,” because it serves as the template for synthesizing mRNA.
    Non-spliced genes, which are characteristic of prokaryotes, are also found in eukaryotes.
    Even in a spliced gene, the protein-coding information may be organized as an Open Reading Frame (ORF).
    Most eukaryotic genes are spliced, whereby intervening segments (introns) are removed and the remaining
       segments (exons) are spliced together.
    Splice sites (exon-intron boundaries) have sequence patterns that are recognized by the splicing apparatus
       (spliceosome).
    Gene prediction programs use consensus sequences around splice sites to predict exon-intron boundaries.
    Over 90% of eukaryotic introns have “canonical splice sites,” whereby introns begin with GT (mRNA: GU) and
       end in AG (mRNA: AG).
    The protein coding sequence of a eukaryotic mRNA (or gene) is flanked by 5’- and 3’-untranslated regions
       (UTRs); introns can be located in UTRs.
    In most eukaryotic genes, transcripts are alternatively spliced, yielding different mRNAs and proteins.
    UTRs hold information for the half-lives of mRNAs and for regulatory purposes.
    Gene > mRNA > CDS.
    CDS = nucleotides that encode amino acid sequence.
    In mRNA: CDS = ORF.
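
The canonical splice site rule listed above (introns begin with GT and end in AG) is easy to test for any pair of neighboring exons. Here is a minimal Python sketch, assuming 1-based inclusive genomic coordinates on the forward strand; the function name and the toy sequence are illustrative assumptions, not output of any gene predictor.

```python
def intron_is_canonical(genome: str, exon_end: int, next_exon_start: int) -> bool:
    """Check whether the intron between two exons has GT...AG boundaries.

    genome          -- forward-strand genomic sequence
    exon_end        -- 1-based position of the last base of the upstream exon
    next_exon_start -- 1-based position of the first base of the downstream exon
    """
    # 1-based positions exon_end+1 .. next_exon_start-1 map to this 0-based slice.
    intron = genome[exon_end:next_exon_start - 1].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy sequence: exon 1 = positions 1-6, intron = 7-17, exon 2 = 18-23.
toy = "ATGGCCGTAAGTTTCAGGATTGA"
print(intron_is_canonical(toy, 6, 18))   # boundaries placed correctly
print(intron_is_canonical(toy, 5, 18))   # exon 1 ends one base too early
```

A check like this flags the kind of non-canonical boundary that manual exon editing can introduce.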

BLAST Searches
    Basic Local Alignment Search Tool (BLAST) searches databases for matches to a query DNA or protein
        sequence.
    Gene or protein homologs share sequence similarities due to descent from a common ancestor.
    Biological evidence is needed to edit and confirm gene models predicted by computer algorithms.
    Biological evidence is most often derived from mRNA transcripts (ESTs, cDNAs, RNAseq). Protein sequence
        data are available, too, but much less common.
    Many ESTs and cDNAs are disrupted by “introns” when they are aligned against genomic DNA.
    ESTs & cDNAs may be incomplete.
    The BLAST algorithm does not resolve intron/exon boundaries.
    The BLAST algorithm is not restricted to detecting sequences that fully match a query (“global” matches) but,
        instead, matches query subsequences as well (“local” matches).
    The BLAST algorithm matches sequences to the fullest extent possible and, often, realigns the same sequence
        twice.
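
To illustrate what “local” means in the points above, the sketch below scores the best-matching pair of subsequences from two sequences using the Smith-Waterman recurrence. This is a teaching aid only: BLAST itself uses a faster seed-and-extend heuristic rather than this exhaustive calculation, and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for the example.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            # A score never drops below 0, so an alignment can start anywhere.
            score = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# The two sequences share only the internal stretch ACGT.
print(local_alignment_score("GGGACGTCCC", "TTACGTAA"))
```

Because scores reset at zero, unrelated flanking sequence does not penalize the shared stretch, which is why a partial EST can still match its gene.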




                    Web Resources for Genome Annotation
A. Major Plant Genome Hubs:
      DOE JGI: http://www.phytozome.net
      University of Iowa: http://www.plantgdb.org/
      CSHL: http://www.gramene.org/
      ENSEMBL: http://plants.ensembl.org/index.html
      NCBI: http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
      NCBI: http://www.ncbi.nlm.nih.gov/mapview/

B. Some Plant Genome Portals:
      Arabidopsis, TAIR: http://www.arabidopsis.org/
      Corn: http://www.maizesequence.org/index.html
      Grape: http://www.cns.fr/externe/GenomeBrowser/Vitis/
      Poplar: http://genome.jgi-psf.org/poplar/poplar.home.html
      Rice: http://rice.plantbiology.msu.edu/
      Tomato: http://solgenomics.net/about/tomato_sequencing.pl

C. Browsers:
      Ensembl: http://www.ensembl.org
      GBrowse: http://gmod.org/wiki/GBrowse
      JBrowse: http://jbrowse.org/
      UCSC Browser: http://genome.ucsc.edu
      xGDB: http://brendelgroup.org/bioinformatics2go/bioinformatics2go.php

D. Annotation Tools:
      Apollo: http://apollo.berkeleybop.org/current/index.html
      Artemis: http://www.sanger.ac.uk/resources/software/artemis/
      yrGATE: http://brendelgroup.org/bioinformatics2go/bioinformatics2go.php

E. Other Resources:
      Course download site: http://gfx.dnalc.org/files/evidence
      DynamicGene: http://www.sanger.ac.uk/resources/software/artemis/
      GeneBoy: http://www.dnai.org/geneboy/
      BioServers: http://www.bioservers.org/bioserver/
      mRNA/gDNA: http://www.ncbi.nlm.nih.gov/spidey/
      mRNA/gDNA: http://pbil.univ-lyon1.fr/sim4.php
      Splice site predictor: http://www.fruitfly.org/seq_tools/splice.html
      Promoter predictor: http://www.fruitfly.org/seq_tools/promoter.html



                                           Yellow Line Walkthrough
A. Examining Transposons
Example Sequences:           mPing Mite Element, Ping Transposase Gene, Ping Transposase Protein
Tool(s):                     Yellow Line TARGeT
Concept(s):                  Mobile genetic elements (transposons), Non-autonomous

 TARGeT: TARGeT (Tree Analysis of Related Genes and Transposons) uses either a DNA or amino acid
 ‘seed’ query to: (i) automatically identify and retrieve gene family homologs from a genomic
 database, (ii) characterize gene structure and (iii) perform phylogenetic analysis. Due to its high
 speed, TARGeT is also able to characterize very large gene families, including transposable
 elements (TEs). (From the abstract of the TARGeT paper, doi: 10.1093/nar/gkp295)

 Transposons (DNA, Retroviral, LINES): Genetic elements which have the ability to be amplified and
 redistributed within a genome.

 Non-autonomous transposons: Transposons which lack an active transposase gene, thus requiring help
 from another transposon to move.

 Autonomous transposons: Transposons which have a functional transposase and can move within the
 genome.

     I. Create Project
        1. Log-in to DNA Subway (dnasubway.iplantcollaborative.org)
        2. Click ‘Prospect Genomes using TARGeT’ (Yellow Square)
        3. Select sample: mPing Mite Element (Oryza sativa/ Rice)
        4. Provide your project with a title, then Click ‘Continue’

     II. Search the O.sativa genome using TARGeT
        1. Click ‘Oryza sativa japonica’ in the ‘Select Genomes’ stop
        2. Click ‘Run’ again to search the genome.

     III. Identify the number of mPing elements in the O.sativa genome
        1. Click ‘Alignment Viewer’ to see results returned.

           Key to results naming in alignment viewer: Genome name, Hit #, Project #

           *Double clicking the hit name opens the sequence and location in new browser tab.
        2. Record the number of hits in the table below.

IV. Identify the number of Ping transposons (using DNA sequence and protein)
Repeat the steps above (Sections I-III) using the Ping transposase gene and the Ping transposase protein to
collect the following data.

                                    mPing mite element             Ping Transposon (DNA)            Ping Transposon (Protein)
Number of hits in O.sativa                 52
Hit number 1 – locus                     Chr: 6
Hit number 2 – locus
Hit number 3 – locus
Hit number 4 – locus
Hit number 5 – locus




                            Advanced Yellow Line Example
Prospecting example: Finding and analyzing DNA transposons (Ping - DNA transposon in rice)
Background Reading: http://www.nature.com/nature/journal/v421/n6919/full/nature01214.html

Example:

 1.   Open DNA Subway and start a new project in the yellow line selecting the mPing Mite Element
      from the sample sequences.
 2.   Enter a project title and click ‘Continue.’
 3.   In the ‘Search Genomes’ stop select Oryza sativa japonica and click ‘Run.’
      a. Click ‘Alignment Viewer’ to view the results of your search. This will open up two screens, one
         displaying a tree and another displaying sequence alignments. How many matches did the
         search yield? What is the relationship between the match and the query?
      b. Close all viewers and return to DNA Subway.
 4.   Create a new project, this time querying rice with the Ping transposase Gene [ORF] as query.
      a. How many matches did this search yield? (Again, use the alignment screen to count.)
      b. To view details about a match, double-click its ID (left-most column in Alignment Viewer;
         enable pop-ups in your browser). This screen also has a link to open Phytozome at the location
         of the match.
      c. Using the tree, determine the relationships among the hits. As the query sequence originates in
         the rice genome you can identify the match that’s identical to the query sequence.
      d. Close all viewers and return to DNA Subway.
 5.   Create a new project, this time querying rice with the Ping Transposase Protein.
      a. How many matches did this search yield? Explain the differences in the number of results for
         the three queries.
      b. In the alignment screen, find the row for the query (ID=Ping), click its ID field once (left-most
         column), then bring the tree screen to the foreground and find Ping among the matches
         displayed.
      c. All matches constitute sequences that are contained in the genome of the rice plant that was
         sequenced to determine the sequence of the entire rice genome. What do the lengths of tree
         branches indicate?
      d. Transposable elements that diverged from a common ancestor more recently will differ from
         each other less than they would differ from those that diverged in the more distant past. How
         many groups of transposons contain matches that seem to have diverged from each other
         more recently? What would you be looking for in order to answer this question?
 6.   Repeat the different kinds of searches and analyses in other genomes. To date only rice, maize,
      and Arabidopsis have been exhaustively studied for TEs. Prospecting other genomes will reveal
      new information about these organisms.




                                              Blue Line Walkthrough
A. Examining DNA Sequence
Example Sequences:            rbcL sample 1
Tool(s):                      Sequence Viewer
Concept(s):                   DNA Barcoding, Sanger DNA Sequencing
 DNA Barcoding: The process of species identification by examination of DNA sequence.

 rbcL: A gene coding the large subunit of the enzyme RuBisCo, and one of the important loci for
 species identification of plants.

 Sanger DNA Sequencing: A method of DNA sequencing that uses fluorescently labeled
 dideoxynucleotide terminators to generate the sequence of a DNA sample.

 Quality (Phred) Score: Nucleotide calls read from sequencing output files are assigned a quality
 score of 10, 20, 30, 40, or 50. A score of 50 means that the base is called with 99.999% accuracy.
 A score of 20 is the cut-off for high quality sequence; scores below 20 indicate low quality.

     I. Create Project
        1. Log-in to DNA Subway (dnasubway.iplantcollaborative.org)
        2. Click ‘Determine Sequence Relationships.’ (Blue Square)
        3. Select project type ‘Barcoding: rbcL.’
        4. Select sample sequence ‘rbcL sample 1.’
        5. Provide your project with a title, then Click ‘Continue.’ Alternatively, if you have
           sequenced your DNA using your Genewiz account, Select ‘Import trace files from DNALC,’
           then select sequences to import.

     II. View Sequence
        6. Click ‘Sequence Viewer’ to show a list of your sequences.
        7. Click on a sequence name to show the sequence’s trace file.
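
The Quality (Phred) Score note above follows from the standard Phred relationship Q = -10 · log10(P), where P is the probability that a base call is wrong. The short Python sketch below converts scores to error probabilities and accuracies; the function names are chosen for this example only.

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy_percent(q: float) -> float:
    """Accuracy of a base call with Phred score q, as a percentage."""
    return 100 * (1 - phred_to_error_probability(q))


# Tabulate the five score levels mentioned in the worksheet.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 error in {round(1 / p):,} calls, "
          f"accuracy {phred_to_accuracy_percent(q):.3f}%")
```

Each step of 10 in the score corresponds to a tenfold drop in the error rate, so Q20 means one expected error per 100 calls and Q50 one per 100,000.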
                                                           Questions:
Q.1:    What do you notice about the electropherogram peaks and quality scores at nucleotide positions
labeled “N”?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.2:   Where do the ‘N’s in the sequence tend to be distributed, and why?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________


Additional Investigation: Learn more about Sanger sequencing at:
http://www.dnalc.org/view/15479-Sanger-method-of-DNA-sequencing-3D-animation-with-narration.html




B. Assembling and Editing DNA Sequence
Example Sequences:         rbcL sample 1 from Part A
Tool(s):                   Sequence Trimmer, Pair Builder, Consensus Builder
Concept(s):                Sanger DNA Sequencing, bidirectional reads
 Bidirectional sequence: DNA sequence generated by sequencing a DNA strand in the forward and
 reverse orientation.

 Consensus sequence: A sequence that sums the consensus of two or more DNA sequences.

     I. Trim 5’/3’ ends
        1. Click ‘Sequence Trimmer.’
        2. Click ‘Sequence Trimmer’ again to examine the changes made to the sequence.

     II. Pair Builder
        1. Click ‘Pair Builder.’
        2. Select the check boxes next to the sequences that represent bidirectional reads of the
           same sequence set. Alternatively, select the ‘Auto Pair’ function and verify the pairs
           generated.
        3. As necessary, Reverse Complement sequences that were sequenced in the reverse orientation
           by clicking the ‘F’ next to the sequence name. The ‘F’ will become an ‘R’ to indicate the
           sequence has been reverse complemented.
        4. Save the created pairs.

     III. Consensus Builder
        1. Click ‘Consensus Builder.’
        2. Click ‘Consensus Builder’ again to examine the created consensus files. Any differences
           between two reads will be highlighted in yellow in the consensus builder.
        3. Make needed edits, and Save your changes.
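
The two ideas behind the Pair Builder and Consensus Builder steps above, reverse complementing the reverse read and then comparing the two reads position by position, can be sketched in Python as follows. This is a deliberately simplified illustration: it assumes both reads already cover exactly the same region with equal length, whereas the real tools align the reads first and take quality scores into account. All names and sequences are invented for the example.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def simple_consensus(forward: str, reverse_read: str) -> Tuple[str, List[int]]:
    """Build a consensus from a forward read and a reverse read.

    Returns the consensus and the 0-based positions where the reads disagree.
    An N in one read is resolved by the other read; true conflicts become N.
    """
    oriented = reverse_complement(reverse_read)
    if len(oriented) != len(forward):
        raise ValueError("this sketch requires reads of equal length")
    consensus, conflicts = [], []
    for i, (f, r) in enumerate(zip(forward, oriented)):
        if f == r:
            consensus.append(f)
        elif f == "N":
            consensus.append(r)
        elif r == "N":
            consensus.append(f)
        else:
            consensus.append("N")
            conflicts.append(i)
    return "".join(consensus), conflicts


# Hypothetical pair of reads (not taken from the rbcL sample).
print(simple_consensus("ATGNCGTA", "TTCGGCAT"))
```

The conflict positions correspond to the differences that the Consensus Builder highlights for manual inspection against the trace files.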

                                                       Questions:
Q.3:    Sequence identified by DNA Subway as low quality is marked by a symbol. What problems might it
cause to generate consensus sequence from low-quality DNA sequence?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________



C. Matching sequence to databases
Example Sequences:           rbcL sample 1 from Part B
Tool(s):                     BLAST, Upload Data, Reference Data
Concept(s):                  BLAST Searches, GenBank, BOLD Database
 BLAST: Basic Local Alignment Search Tool (BLAST) is an algorithm that searches databases of
 biological sequence information (e.g. DNA, RNA, or protein sequence) and returns matches. The
 BLASTN program is specific to nucleotide data.

 GenBank: The largest database of publicly available nucleotide sequences. As of 2011 the database
 contains well over 100 billion nucleotides of generated sequence data.

 BOLD: Barcode of Life Online Database (BOLD) is an online repository for sequence data generated
 by DNA barcoding projects worldwide.

     I. Check for matches in GenBank
        1. Click ‘BLASTN.’
        2. Click the ‘BLAST’ link for the sequence of interest.
        3. Examine the BLAST matches for candidate identification. Clicking the species name given in
           the BLAST hit will also give additional information/photos of the listed species.
        4. If desired, select the check box next to any hit, and select ‘Add BLAST hits to project’
           to add selected sequences to your project.

     II. Upload Data (optional)
        1. If desired, Click ‘Upload Data’ to import additional data into your project. You will
           need to repeat steps in the ‘Assemble Sequences’ stop on DNA Subway.

     III. Reference Data (optional)
        1. Click ‘Reference Data.’
        2. Select one or more groups of sequences from selected reference samples of rbcL sequence.
        3. Select ‘Add ref data’ to add the data to your project.

                                                         Questions:
Q.4:    BLAST will return the closest matches present in GenBank. Will you be able to identify an unknown
species using BLAST alone? Why or why not?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Additional Investigation: See the laboratory: “Using Barcoding to identify and Classify Living Things.”
(http://www.urbanbarcodeproject.org/files/Barcoding_Protocol.pdf)




D. Building Phylogenetic Trees
Example Sequences:          rbcL sample 1 from Part C
Tool(s):                    Select Data, MUSCLE, PHYLIP NJ, PHYLIP ML
Concept(s):                 Sequence alignment, phylogenetics
 Multiple Alignment: A (usually)   I. Select Data for Alignment
 computer generated alignment
 sequences. Under the                      1. Click ‘Select Data.’
 assumption that all sequences
 within the alignment are
 similar (e.g. of a common
                                           2. Select any and all sequences you wish to add to your tree.
 genetic origin, from a common
 locus, in the same strand                 3. Click ‘Save.’
 orientation) gaps are
 introduced where
                                   II. Generate Multiple Sequence Alignment
 misalignments (e.g. insertions
 or deletions/missing data)
 appear.                                   1. Click ‘MUSCLE.’
 Phylogenetic tree: A diagram              2. Click ‘MUSCLE’ again to open the sequence alignment window.
 which represents inferred
 evolutionary relationships
 between organisms. As applied             3. At the start and end of the alignment for all sequences, left-click
 here, sequences are displayed                the position number above the alignment to open a menu, and trim
 with branch lengths that are                 both ends of the alignment.
 proportional to the differences
 between the sequences.
                                           4. Click ‘Submit Trimmed Alignment.’
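
The trimming step above can be sketched in code. This is a minimal illustration, not DNA Subway's own implementation: it assumes one simple rule (drop leading and trailing columns until every sequence has a base), and the function name `trim_alignment` and the example sequences are hypothetical.

```python
def trim_alignment(aligned, gap="-"):
    """Trim leading and trailing columns until every sequence has a base.

    `aligned` maps a sequence name to its aligned string. All strings must
    have the same length, as in any multiple alignment. Gaps inside the
    retained region are left untouched.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all be the same length")
    width = lengths.pop()

    def column_is_complete(i):
        # A column is complete when no sequence has a gap at position i.
        return all(seq[i] != gap for seq in aligned.values())

    # Walk inward from the left edge to the first complete column.
    start = 0
    while start < width and not column_is_complete(start):
        start += 1

    # Walk inward from the right edge to the last complete column.
    end = width
    while end > start and not column_is_complete(end - 1):
        end -= 1

    return {name: seq[start:end] for name, seq in aligned.items()}


# Hypothetical aligned sequences (illustration only, not real rbcL data).
example = {
    "sample_1": "--ATGGCTAC-GT-",
    "ref_A":    "TTATGGCTACAGTA",
    "ref_B":    "-TATGGTTACAGT-",
}
trimmed = trim_alignment(example)
for name, seq in trimmed.items():
    print(name, seq)
```

Trimming the ragged ends keeps missing data at the start or end of a read from being counted as differences between sequences when the tree is built.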
 PHYLIP NJ and PHYLIP ML:
 Tree-building algorithms based    III. Construct Phylogenetic Tree
 on the “Neighbor Joining” and
 “Maximum Likelihood” methods,
 respectively. See:
 http://www.icp.be/~opperd/private/neighbor.html and
 http://www.icp.ucl.ac.be/~opperd/private/max_likeli.html

                                           1. Click either ‘PHYLIP NJ’ or ‘PHYLIP ML’ to run the tree construction
                                              algorithm.

                                           2. Click the button for the algorithm you chose again to launch a
                                              viewer for the multiple alignment and tree.
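
Distance methods such as Neighbor Joining start from a table of pairwise distances between the aligned sequences. The sketch below computes the simplest such measure, the proportion of differing sites (p-distance). It is a simplified illustration, not what PHYLIP runs: PHYLIP's distance programs apply substitution models, and the function names and example sequences here are hypothetical.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of differing sites, counting only columns without gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must be the same length")
    compared = 0
    differing = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable sites between the two sequences")
    return differing / compared


def distance_matrix(aligned):
    """Return {(name_1, name_2): p-distance} for every pair of sequences."""
    return {
        (m, n): p_distance(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


# Hypothetical trimmed alignment (illustration only, not real rbcL data).
example = {
    "sample_1": "ATGGCTAC-GT",
    "ref_A":    "ATGGCTACAGT",
    "ref_B":    "ATGGTTACAGT",
}
distances = distance_matrix(example)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A tree-building method then turns a table like this into branch lengths, so it is worth comparing these numbers with the alignment when examining the tree.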

                                                        Questions:
Q.5:    What relationship do you see between sequences that have more mutations in the alignment (those that
align less well with the majority of sequences) and the length of each sequence’s branch on the tree?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________

Q.6:    Do you see differences between the phylogenetic trees generated by the Neighbor Joining and Maximum
Likelihood methods?
_____________________________________________________________________________________________
_____________________________________________________________________________________________
_____________________________________________________________________________________________


