Docstoc

NCBI SOAP

Document Sample
NCBI SOAP Powered By Docstoc
					Obtaining Promoter Sequences via Eutils

           - scripting from the ground up using Perl




 NCBI Eutils – perl script construct , January, 2008   1
                                   Background
The promoter sequence is an essential part of any gene.
  The lack of characterization and annotation, however,
  makes it hard to define.

For this reason, it is NOT an integral part of the NCBI gene
  record. So obtaining the sequence requires extra work.
                   Gene boundary as defined in NCBI Entrez Gene

                                                                  stop




                                  Biological Gene boundary

   Question to be addressed : How to get this?

  NCBI Eutils – perl script construct , January, 2008                    2
                    Common Approaches

• Starting from mRNA:
   – identify the gene
   – get the genomic coordinates
   – modify the coordinates to obtain the upstream
     sequence


• Corresponding Eutils call
   – (esearch, to get gi) elink to gene
   – efetch (esummary) to get the coordinates
   – efetch to get the subsequence

  NCBI Eutils – perl script construct , January, 2008   3
      Analysis of the Conservation in the
       Promoters of Homolgous Genes
• Requires a bit more leg work
Entrez                                              Eutils
Identify the gene                                   Elink from mRNA to gene
Identify the homologous gene set                    Elink from gene to homologene


Link back to gene set                               Elink from homologene to gene


Get DocSum with gene coordinate                     Esummary
information
Get subsequences using modified                     Efetch
coordinates



   NCBI Eutils – perl script construct , January, 2008                              4
                         Web Work Through
Search in coreNucleotide with example accession
         U49897 (human PAH gene mRNA)
         M36917 (human GDP1 gene mRNA)
Click the Link Menu and select Gene




Follow the NC link within the Gene to get transcribed region




  Extract upstream subsequence through coordinates adjustment
  See web document for details:
  http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/promoter.html
   NCBI Eutils – perl script construct , January, 2008          5
Entrez Steps for Homologous Genes



Follow the link to HomoloGene




                                                        Link back
                                                        to Gene to
                                                        collect the
                                                        list



  NCBI Eutils – perl script construct , January, 2008                 6
                        Entrez Live Search




NCBI Eutils – perl script construct , January, 2008   7
    Detailed Eutils Work Flow (batch)
The workflow for this under eutils is outlined below.
• Esearch with accession to get gi,
          Note: eutils services operate on uid, or gi for sequences
• Elink with gi from nuccore to Gene to get geneid
• Elink from the gene to homologene to get homologene id
• Elink from the homologene id back to gene
• Esummary to get summary of the list of genes
• Parse the genome accession/coordinates, adjust the coordinates
accordingly
• Construct URL for individual efetch of the subsequences


   NCBI Eutils – perl script construct , January, 2008                8
                               Coding in Perl
We will code this in Perl from ground up using ConText text editor.
        Make sure it shows line numbers and syntax highlight.

To simplify the task, we code each step separately.

Test to make sure it functions correctly by output to screen
        before adding code for the next step.

Save the code to desktop

Execute it under DOS prompt:
       Start  Program AccessoriesCommand Prompt
       perl test8_1.pl



 NCBI Eutils – perl script construct , January, 2008                  9
                     Perl Implementation :
                  esearch in nuccore to get GI
use strict;
use LWP::Simple; # required package

my $base="http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
my ($esearch, $elink, $esummary) = ("esearch.fcgi?", "elink.fcgi?", "esummary.fcgi?");
my ($query, $param, $term, $web, $key, $result);
$query="U49897";
$param ="&db=nuccore&term=$query&usehistory=y";

# esearch with query accession to get gi, in the form of WebEnv and QueryKey
$result=get ($base.$esearch.$param);


# print out the result and the parsed strings
print "$result\n";

$web=$1 if ($result=~/<WebEnv>(.+)</);
$key=$1 if ($result=~/<QueryKey>(.+)</);

# print the parsed WebEnv and QueryKey strings
print "$web\n$key\n";




Save the about to a text file (such as test8_1.pl), test it using: perl test8_1.pl

     NCBI Eutils – perl script construct , January, 2008                                 10
                 Output from esearch Step

<?xml version="1.0"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "htt
p://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
        <Count>1</Count>
        <RetMax>1</RetMax>
        <RetStart>0</RetStart>
        <QueryKey>1</QueryKey>
        <WebEnv>0QxDy1AtMneqpLb3HlCGNNuwCKgvJoQ9cbo5Yqkp5uuIZeaNLbtIso2cRm5@1FC2
543279DDE7C0_0018SID</WebEnv>
        <IdList>
                 <Id>2462721</Id>
        </IdList>
        <TranslationSet>
        </TranslationSet>
        <QueryTranslation></QueryTranslation>
</eSearchResult>

0QxDy1AtMneqpLb3HlCGNNuwCKgvJoQ9cbo5Yqkp5uuIZeaNLbtIso2cRm5@1FC2543279DDE7C0_001
8SID

1




     NCBI Eutils – perl script construct , January, 2008                           11
                        Perl Implementation:
                    elink from nuccore to gene
Comment out the previous print code first:
# print "$result\n";
# print "$web\n$key\n";
before adding the following code.

# elink from nuccore to gene using the gi (in the form of WebEnv and query_key)
# to get geneid (in the form of WebEnv and QueryKey)
$param="&db=gene&dbfrom=nuccore&WebEnv=$web&query_key=$key&cmd=neighbor_history";
$result=get($base.$elink.$param);

# print out the result
print "$result\n";

$web=$1 if ($result=~/<WebEnv>(.+)</);
$key=$1 if ($result=~/<QueryKey>(.+)</);

# print out the parsed strings
print "$web\n$key\n";




Save the about to a text file (i.e. test8_2.pl), test it using: perl test8_2.pl

     NCBI Eutils – perl script construct , January, 2008                            12
         Perl Implementation :
     esummary to get gene DocSum

If this is the only gene of interest, we can get the esummary directly by
adding the following code:

$param="&db=gene&WebEnv=$web&query_key=$key";
$result=get ($base.$esummary.$param);

print $result;




For a set of homologous genes, we will need to explore homologene link to
get the calculated homologene group. Then link back to gene to get the set
first.




  NCBI Eutils – perl script construct , January, 2008                        13
              Perl Implementation:
    elink to homologene then back to gene
To get the homologous group, then the individual gene records, we do:
# elink from gene to homologene
$param="&db=homologene&dbfrom=gene&WebEnv=$web&query_key=$key&cmd=neighbor_history";
$result=get($base.$elink.$param);
$web=$1 if ($result=~/<WebEnv>(.+)</);
$key=$1 if ($result=~/<QueryKey>(.+)</);

# elink from homologene back to gene
$param="&db=gene&dbfrom=homologene&WebEnv=$web&query_key=$key&cmd=neighbor_history";
$result=get($base.$elink.$param);
$web=$1 if ($result=~/<WebEnv>(.+)</);
$key=$1 if ($result=~/<QueryKey>(.+)</);

# esummary to get the gene docsums for the list of homologous genes
$param="&db=gene&WebEnv=$web&query_key=$key";
$result=get($base.$esummary.$param);


The next task:
        Parse the genome accession and coordinates
        modify the coordinates according to orientation
        construct efetch URL for the subsequence retrieval


     NCBI Eutils – perl script construct , January, 2008                               14
     Interim Output of Gene DocSum
<eSummaryResult>
<DocSum>
           <Id>5053</Id>
           <Item Name="Name" Type="String">PAH</Item>
           <Item Name="Description" Type="String">phenylalanine hydroxylase</Item>
           <Item Name="Orgname" Type="String">Homo sapiens</Item>
           <Item Name="Status" Type="Integer">0</Item>
           <Item Name="CurrentID" Type="Integer">0</Item>
           <Item Name="Chromosome" Type="String">12</Item>
           <Item Name="GeneticSource" Type="String">genomic</Item>
           <Item Name="MapLocation" Type="String">12q22-q24.2</Item>
           <Item Name="OtherAliases" Type="String">PKU, PKU1</Item>
           <Item Name="OtherDesignations" Type="String">amino acid hydroxylase</Item>
           <Item Name="NomenclatureSymbol" Type="String">PAH</Item>
           <Item Name="NomenclatureName" Type="String">phenylalanine hydroxylase</Item>
           <Item Name="NomenclatureStatus" Type="String">Official</Item>
           <Item Name="TaxID" Type="Integer">9606</Item>
           <Item Name="Mim" Type="List">
                      <Item Name="int" Type="Integer">261600</Item>
           </Item>
           <Item Name="GenomicInfo" Type="List">
                      <Item Name="GenomicInfoType" Type="Structure">
                                  <Item Name="ChrLoc" Type="String">12</Item>
                                  <Item Name="ChrAccVer" Type="String">NC_000012.10</Item>
                                  <Item Name="ChrStart" Type="Integer">101835510</Item>
                                  <Item Name="ChrStop" Type="Integer">101756233</Item>
                      </Item>
           </Item>
           <Item Name="GeneWeight" Type="Integer">96054</Item>
</DocSum>
</eSummaryResult>

   NCBI Eutils – perl script construct , January, 2008                                       15
       Parsing SeqID and Coordinates
# rountine to parse the chromosome accession, gene start and stop coordinates
my ($tmp, $accn, $start, $stop, $strand, $seq, $id);
while ($result =~ /<DocSum>(.+?)<\/DocSum>/gs){
# two very important letters that allows global matching and match span the new lines
      $tmp = $1;
      $id = $1 if ($tmp =~ /<Id>(\d+)</);
      $accn = $1 if ($tmp =~ /ChrAccVer.+>(.+)</);
      $start = $1 if ($tmp =~ /ChrStart.+>(\d+)</);
      $stop = $1 if ($tmp =~ /ChrStop.+>(\d+)</);
      if ($start==0 || $stop == 0){
         # code to catch the gene record without assembled chromosome
         print STDERR "Skipping $id due to ambiguous coordinates ...\n";
         next;
      }
      if ($start >$stop){        # adjusting coordinates for reverse ori
         $strand=2;
         $stop=$start+1500;
         $start-=1;
      } else {                   # adjusting coordinates for forward ori
         $strand=1;
         $stop=$start-1;
         $start-=1500;
      }
      $param = "&db=nuccore&retmode=text&rettype=fasta&id=$accn”;
      $param .= "&seq_start=$start&seq_stop=$stop&strand=$strand";
      $seq = get ($base.$efetch.$param);
      print $seq;
      $seq =""; $start=$stop=$strand=$id=0;
}


    NCBI Eutils – perl script construct , January, 2008                                 16
  “We have the sequences, then what?”

Common analyses we can do for those sequences:

Multiple sequence alignment
          http://bioinfo.genopole-toulouse.prd.fr/multalin/multalin.html
          http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

Transcription factor binding sites analysis
          http://www.cbil.upenn.edu/cgi-bin/tess/tess?RQ=SEA-FR-Query
          http://motif.genome.jp
          http://www.dcode.org

Adapt the code for your favorite gene of interest to see what you can get.

Tidy the code up, add functions to make it do more, etc …




  NCBI Eutils – perl script construct , January, 2008                        17

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:24
posted:11/6/2011
language:English
pages:17