092708 Shah Bioinformatics Handout by 8KY1ZJPu


The Magic Fit
1. Introduction to NCBI’s resources and educational tools.
2. Use NCBI Entrez to search for a gene sequence or a protein sequence.
3. Use NCBI Blast to search for homologous sequences.
4. Use PDB to download protein coordinates.
5. Use Swiss-PDB viewer to analyze protein structures.
6. Use Swiss-PDB viewer to overlay related structures.

California Grade Seven Science Content Standards

2d. Students know plant and animal cells contain many thousands of different genes.
2e. Students know DNA is the genetic material of living organisms.

3a. Students know genetic variation [is a] cause of evolution and diversity of organisms.

Investigation and Experimentation
7a. Select and use appropriate tools and technology (computers).
7b. Use a variety of print and electronic resources (including the World Wide Web) to
    collect information and evidence.
7c. Communicate the logical connection among science concepts, data collected and
    conclusions drawn.
7d. Construct appropriately labeled diagrams to communicate scientific knowledge.
7e. Communicate the steps and results from an investigation in written reports and oral
    presentations.

California Grades 9 to 12 Biology and Life Sciences Content Standards

Cell Biology
1d. Students know the central dogma of molecular biology.
1h. Students know most macromolecules are synthesized from precursors.
4. Genes are a set of instructions encoded in the DNA sequence of each organism that
    specify the sequence of amino acids in proteins.

8f. Students know how to use DNA or protein sequence comparisons to show probable
    evolutionary relationships.

Kandarp Shah (UCI GK-12)                  Page 1                                6/27/2012
Note: Make a folder on your desktop to save your search results and files during this

Search for a MAP kinase gene sequence, FUS3 from Saccharomyces cerevisiae,
using Entrez.

1. Go to http://www.ncbi.nlm.nih.gov.

2. Click on All Databases link at the top.

3. Enter Fus3 in the search box and click on Go.

You will see a number on the left side of each database indicating the number of hits for
the search. We will use the Nucleotide database to search for the FUS3 gene sequence.

4. Click on Nucleotide.

You will see a selection of 20 hits/results listed on the page from a total of 78 items. More
results are listed on adjacent pages.

You will also see a list of organisms related to the hits on the top right side of the
page.

At the top of the page before the hits, there is a list of three genes related to our search
from NCBI Gene. Our gene for FUS3 from Saccharomyces cerevisiae is the second one
on the list. However, let’s try to refine our results to reduce the number of hits.

5. Click on the Limits tab. Select Title under Fields and Genomic DNA/RNA under
Molecule. Click on Go. The number of hits should go down to 7.

6. Click on the History tab. You will see a list of the two searches we have done so far
under Most Recent Queries.

7. In the search box, clear any text and enter the following text (do not hit Go at this
point): "Saccharomyces cerevisiae"[Organism]

Saccharomyces cerevisiae is the name of our organism (baker’s yeast). The quotation marks
tell NCBI to look for the words together as one unit. The word Organism in brackets tells
NCBI to limit the query to only that organism.

8. Left click the number for the latest search result (Search Fus3 Field: Title
Limits: Genomic DNA/RNA) at the top of the list in Most Recent Queries.

9. In the menu that appears, click on AND. You will see AND (#2) added to the search
box after your text.

Selecting AND tells NCBI to combine the search results for the new text query in the
search box and the query you select from the history and only return hits that meet both
criteria.

10. Click on Go. You should see only a single result.

We have now refined our query to give us exactly what we wanted. The result is the
coding sequence for FUS3 without upstream/downstream regions or introns.
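
The query built in steps 5 to 10 can also be written out as a single search string. The sketch below is only an illustration: the helper names are hypothetical, the E-utilities search address and the `nuccore` database name are assumptions that do not appear in this handout, and the Genomic DNA/RNA molecule limit is left out. It only assembles the text and a URL; nothing is sent to NCBI.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities search address (not part of this handout).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_query(term, field, organism):
    """Combine a fielded search term with an organism restriction using AND."""
    # The quotation marks keep the two-word organism name together as one unit;
    # each bracketed tag limits where Entrez looks for that part of the query.
    return f'{term}[{field}] AND "{organism}"[Organism]'


def build_esearch_url(query, db="nuccore"):
    """Return a search URL for the query. No request is made here."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


query = build_entrez_query("Fus3", "Title", "Saccharomyces cerevisiae")
print(query)
print(build_esearch_url(query))
```

Writing the query this way makes the role of AND visible: both conditions must hold for a record to be returned.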

11. Click on M31132 (accession number) for the result to browse for further information.

Try the options in the Display and Show drop down menus to see the possibilities.

Further down on the page, you will see information such as gi number, some links, the
locus for the gene, number of base pairs, organism, authors and publication reference
including link to PubMed, related protein sequence and finally, the coding sequence
(CDS) for the gene.

An important step in the analysis of genome information is deciphering the complete
coding potential or protein coding sequence (CDS) region of each gene. CDS is a
sequence of nucleotides that corresponds with the sequence of amino acids in a protein. A
typical CDS starts with ATG and ends with a stop codon. CDS can be a subset of an open
reading frame (ORF) [1].
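
As a rough illustration of the description above, the following sketch checks whether a DNA string has the shape of a typical CDS: it starts with ATG, ends with a stop codon, and has no stop codon in between. The function name and the short test sequences are made up for this example, and the three stop codons are those of the standard genetic code.

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def looks_like_cds(sequence):
    """Return True if the DNA sequence has the shape of a typical CDS."""
    seq = sequence.upper()
    # A CDS is read three bases at a time, so its length is a multiple of 3.
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    starts_ok = codons[0] == "ATG"
    ends_ok = codons[-1] in STOP_CODONS
    no_early_stop = not any(codon in STOP_CODONS for codon in codons[1:-1])
    return starts_ok and ends_ok and no_early_stop


print(looks_like_cds("ATGGCCAAATAA"))  # codons: ATG GCC AAA TAA
print(looks_like_cds("GCCATGAAATAA"))  # first codon is GCC, not ATG
```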

12. Click back on your browser to return to the results page. Copy the sequence
identification number in the first line (gi|171532). We will use this for the BLAST search.

Note: You can also get the GI number from the gene information page but the format is
not correct to use in a BLAST search.

Performing a BLAST search
BLAST = Basic Local Alignment Search Tool

1. Go to NCBI homepage. Click on BLAST at the top.

2. Under Basic BLAST, select nucleotide blast.

3. In the box labeled Enter Query Sequence/ Enter accession number, gi, or FASTA
sequence, paste the GI number you copied for the gene or enter it manually (See figure

4. Enter a name for the job. You can keep the default name or assign your own.

5. Choose Human Genomic + Transcript for the database.

We want to find related genes/transcripts in the Human Genome/Transcriptome.

6. Under Program Selection, select More dissimilar sequences (discontiguous
megablast).

Food for thought: Selecting Highly similar sequences for program selection will give
you zero results for the search. The message on screen reads “No significant similarity
found”. Why? (Answer is at the end of this guided tour on the last page).

7. Click on BLAST.

8. You will see a page similar to below during the search.

9. And finally… results with sequence alignments. MAPK1 is the gene we are interested
in.

Search for Fus3 protein sequence using Entrez and BLAST human proteome for
similar proteins.

1. Use the steps above to search for the Fus3 protein sequence.

2. BLAST it (GI|536007) against the human protein database (protein blast) to search for
similar proteins. Enter Homo sapiens under Organism to restrict the search to human
proteins. You should find MAPK1 (also known as Erk2) as the top hit.

Take a look at the sequence alignment for Fus3 and Erk2 on the results page. From the
BLAST results and sequence alignment we know that the two proteins share 50% sequence
identity and 68% sequence similarity. Let’s download the protein coordinates for Erk2 and
Fus3 from PDB and look at the structures.
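
To make the idea of percent identity concrete, here is a minimal sketch that counts identical positions in two already aligned sequences and divides by the alignment length. The function name and the two short sequences are invented for illustration and are not the real Fus3 and Erk2 alignment. Percent similarity is not computed here, because it also depends on a scoring matrix that decides which non-identical amino acids count as similar.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences have the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A column counts as identical only if both letters match and neither is a gap.
    matches = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned_a)


# Made-up aligned fragments: 4 of the 7 columns are identical.
print(round(percent_identity("MKV-LAT", "MRVALST"), 1))
```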

Downloading PDB files for Fus3 and Erk2

You can do this by searching for Fus3 under “Structure” on Entrez or directly from the PDB
website. We will use the RCSB Protein Data Bank (PDB) database.
RCSB = Research Collaboratory for Structural Bioinformatics; more information
at http://home.rcsb.org/.

1. Go to http://www.pdb.org.

2. Enter Fus3 in the search box and click on Site Search.

You will see 7 structure hits for your search. We will look at Crystal structure of
non-phosphorylated Fus3 at the bottom of the page (PDB ID: 2b9f).

Each structure in the PDB is represented by a 4-character identifier of the form
[0-9][a-z,0-9][a-z,0-9][a-z,0-9]. For example, 4HHB and 9INS are identification codes for
PDB entries for hemoglobin and insulin. Many of the PDB web pages, including the PDB home
page, allow you to enter a PDB ID and retrieve information for the corresponding
structure. Historically, 30% of queries to the PDB sites are of this type [2].
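
The identifier pattern described above can be checked with a short regular expression. This is only a sketch of that pattern: the function name is hypothetical, and matching is made case-insensitive on the assumption that 2b9f and 2B9F refer to the same entry.

```python
import re

# One digit followed by three characters that are each a letter or a digit.
PDB_ID_PATTERN = re.compile(r"[0-9][a-z0-9]{3}", re.IGNORECASE)


def is_pdb_id(text):
    """Return True if the text has the form of a 4-character PDB identifier."""
    return PDB_ID_PATTERN.fullmatch(text) is not None


for candidate in ["2b9f", "1erk", "4HHB", "9INS", "Fus3", "2b9f1"]:
    print(candidate, is_pdb_id(candidate))
```

Note that a name such as Fus3 fails the check because it does not begin with a digit, which is why a protein name and a PDB ID are easy to tell apart.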

3. Click on 2b9f.

Browse the page to look at the available information such as title, author information,
date the structure information was deposited in the database, experimental method,
molecule, source and related structures.

4. Click on Display Files in the left panel and then click on PDB File to open the file.

Take a look at the contents of a typical PDB file. The left column identifies the type of
information in the right column. PDB files contain a Header, Title, Compound information
(protein name, source, experimental data gathering technique), author, journal, remarks,
etc. The last part is the list of 3-D coordinates for each atom in the protein and related
heteroatoms such as from water.
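
The coordinate lines in a PDB file are fixed-width: each piece of information sits in a set range of columns. The sketch below reads a few fields from one such line, assuming the standard column layout of ATOM and HETATM records. The function names are hypothetical, and the demo line is a made-up record rather than a line copied from 2b9f.

```python
def parse_atom_line(line):
    """Extract a few fields from one fixed-width ATOM or HETATM record."""
    record = line[0:6].strip()  # the left column names the record type
    if record not in ("ATOM", "HETATM"):
        return None
    return {
        "record": record,
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def ca_atoms(lines):
    """Keep only the alpha-carbon (CA) atoms of the protein chain."""
    atoms = (parse_atom_line(line) for line in lines)
    # Requiring ATOM records avoids calcium ions, which are also named CA.
    return [a for a in atoms if a and a["record"] == "ATOM" and a["atom"] == "CA"]


# Made-up demo record, assembled with explicit field widths so that every
# value lands in the expected columns.
demo_line = (
    f"{'ATOM':<6}{1:>5} {' CA ':<4} {'MET':>3} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)
print(parse_atom_line(demo_line))
```

To try this on a downloaded file, the lines of 2b9f.pdb could be passed to `ca_atoms`, for example via `open("2b9f.pdb")`, assuming the file was saved under that name.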

5. Save the file in your folder on your desktop. You can save the file from this page by
using Save As or you can go back and use Download Files feature on the page for 2b9f.
Note the location of the file on your desktop.

Download the PDB file for Erk2 using the steps above. You will find only one structure
that is not in a complex: Structure of Signal-Regulated Kinase (1erk) from Rat. This
structure is fine for our purpose.

Looking at the 3-D structures of Fus3 and Erk2

Download and extract Swiss-PDB Viewer DeepView into your folder on the desktop
from http://spdbv.vital-it.ch/. The download link is in the left panel. Once extracted, the
viewer is ready for use.

1. Double-click on spdbv application to open the viewer.

2. Select the File menu and Open PDB File to open the pdb file for Fus3 (2b9f.pdb) from
your folder on the desktop.

Take a look at the 3-D structure of Fus3. Select the Display menu, then Render in 3D and
Render in Solid 3D. Try different color options under the Color menu (suggestion: Color
-> act on ribbon AND Color -> Secondary Structure). Try the features in the Control
Panel (under the Wind menu): compare left-clicks vs. right-clicks under different columns.
A left-click selects individual amino acid residues; a right-click selects all. Right-click
under the columns labeled show, side and label to see what happens. I find it easier to
look at the backbone structure without the sidechains.

3. Select the File menu and Open PDB File to open the pdb file for Erk2 (1erk.pdb) from
your folder on the desktop.

Now that you have two structures open, the Control Panel needs to know which structure you
are working with at the moment. Check or uncheck the Visible box at the top of the Control
Panel to show or hide a structure.

Left-click the name 1erk to select the proper structure.

4. Remove sidechains/labels from the view. Render in 3D and Solid 3D.

5. Select Fit menu -> Magic Fit -> CA only -> Layers 2b9f and 1erk -> OK.

Watch the structures align in space. Center the structures on screen using the button
under the File menu. Compare the two structures for similarity.
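
One common way to put a number on how closely two fitted structures agree is the root-mean-square deviation (RMSD) between matching atoms. This handout does not compute it; the sketch below is only an illustration. It assumes the two coordinate lists are already superimposed and paired atom by atom (for example, matching CA atoms after a fit), and it uses made-up coordinates.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two paired lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # Squared distance between the two atoms of this pair.
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


# Made-up coordinates: the first pair coincides, the second differs by 2 along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
print(rmsd(structure_a, structure_b))
```

A value near zero means the paired atoms almost coincide; larger values mean the backbones differ more.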

Answer for BLAST search:

The Highly similar sequences option does not work for our BLAST search because yeast has
relatively few introns compared to the human genome. The introns in the
sequence make the sequences dissimilar. Selecting this option makes the search highly
stringent. On the other hand, selecting the More dissimilar sequences option allows for
discontinuity in the sequences (discontiguous sequences).


1. Furuno M et al. CDS annotation in full-length cDNA sequence. Genome Res. June
2003. [PMID: 12819146]

2. http://www.rcsb.org/robohelp_f/#site_navigation/introduction_to_site_navigation.htm
Help link from PDB website.
