Bsml Assembly Sequence Encoding
Overview

Genes and associated elements, including transcript, CDS, and exon encodings, are
mapped to a genomic assembly as bsml features. Each feature's attributes specify its
location on the assembly sequence in genomic coordinates. Clusters of features
representing genes are organized as sets of feature groups, where each feature group
associates the elements describing a single transcriptional unit. See Appendix A for a
reference eukaryotic assembly encoding.

Document Structure

Figure 1 summarizes the general structure of a bsml assembly encoding, highlighting
the usage of feature group sets to cluster elements by transcriptional unit and the
association of these feature group sets to encode a gene.

Figure 1

<Bsml>
  <Definitions>
    <Sequences>
         <Sequence id="Genomic Assembly">
            <Feature-tables>
                  <Feature-table>
                            <Feature class="GENE" id="Gene_001" />
                             <Feature class="TRANSCRIPT" id="Gene_001_Transcript_001" />
                            <Feature class="CDS" id="Gene_001_CDS_001" />
                            <Feature class="EXON" id="Gene_001_Exon_001" />
                            <Feature class="EXON" id="Gene_001_Exon_002" />
                            <Feature class="TRANSCRIPT" id="Gene_001_Transcript_002" />
                            <Feature class="CDS" id="Gene_001_CDS_002" />
                            <Feature class="EXON" id="Gene_001_Exon_003" />
                            <Feature class="EXON" id="Gene_001_Exon_004" />
                  </Feature-table>
                  <Feature-group group-set="Gene_001">
                            <Link rel="GENE" href="#Gene_001" />
                            <Feature-group-member feature-type="TRANSCRIPT"
featref="Gene_001_Transcript_001" />
                            <Feature-group-member feature-type="CDS" featref="Gene_001_CDS_001" />
                            <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_001" />
                            <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_002" />
                  </Feature-group>
                  <Feature-group group-set="Gene_001">
                            <Link rel="GENE" href="#Gene_001" />
                            <Feature-group-member feature-type="TRANSCRIPT"
featref="Gene_001_Transcript_002" />
                            <Feature-group-member feature-type="CDS" featref="Gene_001_CDS_002" />
                            <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_003" />
                            <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_004" />
                  </Feature-group>
            </Feature-tables>
         </Sequence>
    </Sequences>
  </Definitions>
</Bsml>
Assembly Sequence Element

Genomic assembly sequences are represented as bsml sequence elements. Genes
and associated elements are mapped to the sequence through usage of bsml feature
groups as described above. The sequence element provides access to the assembly's
nucleotide sequence. This sequence may be encoded "in-line" through the usage of a
Seq-data element or externally by a Seq-data-import element. External encodings need
to specify the format of the referenced file through the "format" attribute. Current
software and API support only "fasta" encodings. The "id" attribute specifies the external
fasta id of the sequence in multi-fasta files and the "source" attribute provides a full
system file path.

Figure 2 - In-line nucleotide sequence representation

<Sequence id="Genomic Assembly">
       <Seq-data>AGCTACG...</Seq-data>
</Sequence>

Figure 3 - External nucleotide sequence representation

<Sequence id="Genomic Assembly">
       <Seq-data-import format="fasta" id="chado_tryp_lma2_98_assembly"
       source="/usr/local/scratch/BSML_REPOSITORY/chado_tryp.fsa">
       </Seq-data-import>
</Sequence>
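
The choice between an in-line and an external encoding can be illustrated with a short Python sketch. It is not the Bsml Object Layer API; the function names assembly_sequence and read_fasta_record are hypothetical, and the sketch assumes that the fasta id is the first whitespace-delimited token of a header line.

```python
import xml.etree.ElementTree as ET


def read_fasta_record(path, record_id):
    """Return the sequence of the fasta record whose id equals record_id."""
    found = False
    chunks = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if found:
                    break  # reached the record after the one we wanted
                # Assumption: the id is the first token after '>'.
                header = line[1:].split()
                found = bool(header) and header[0] == record_id
            elif found:
                chunks.append(line)
    if not found:
        raise KeyError(record_id)
    return "".join(chunks)


def assembly_sequence(sequence_elem):
    """Return the nucleotide string for a <Sequence> element."""
    inline = sequence_elem.find("Seq-data")
    if inline is not None:
        # In-line encoding: drop any whitespace used to wrap the sequence.
        return "".join((inline.text or "").split())
    imported = sequence_elem.find("Seq-data-import")
    if imported is None:
        raise ValueError("no Seq-data or Seq-data-import child")
    if imported.get("format") != "fasta":
        raise ValueError("only fasta imports are handled in this sketch")
    # External encoding: 'source' is the file path, 'id' the fasta record id.
    return read_fasta_record(imported.get("source"), imported.get("id"))
```

Trying Seq-data first and falling back to Seq-data-import keeps the caller unaware of where the nucleotides are stored.
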

The Bsml Object Layer API uses data referenced in these elements to extract a gene's
genomic sequence, transcript sequence, CDS sequence, and individual exons.

To provide consistency with amino acid sequence representations, a Bsml Attribute
element is provided referencing the assembly id of the genomic sequence.

Figure 4 - Bsml Attribute Element : Assembly

<Sequence id="Genomic Assembly">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"/>
</Sequence>

Feature Elements

Feature elements identify gene, transcript, cds, and exon locations on an assembly
sequence. Each feature type is identified using the "class" attribute which has the
following controlled vocabulary.

{ GENE, CDS, TRANSCRIPT, EXON }

Feature "id" attributes uniquely identify feature elements and are currently assigned the
identifiers found in the TIGR databases. Note that XML ids, and the idrefs that point to
them, cannot start with a numeric digit.

Features are assigned location elements of type Interval-loc or Site-loc. Interval-loc
elements have "startpos", "endpos", and "complement" attributes. "Startpos" and
"endpos" correspond to nucleotide positions relative to the genomic assembly
sequence. "Complement" is 0 if the feature lies on the same strand as the reference
sequence and 1 if it lies on the reverse complement. Site-loc elements define a single
position through the "sitepos" attribute, as well as a "class" attribute of either "START"
or "STOP". The "complement" attribute is set in the same manner as for an Interval-loc.
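
As an illustration of how these attributes select a subsequence, the following Python sketch extracts the nucleotides for one location. It assumes 1-based, inclusive positions; the text does not state this, but the coordinates in Appendix A are consistent with it (for example, 3565 - 1259 + 1 = 2307 = 3 x 769, matching the length of protein_22120). The function names are hypothetical and are not part of the Bsml API.

```python
# Translation table for complementing DNA; other characters pass through as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_interval(assembly, startpos, endpos, complement):
    """Return the subsequence described by an Interval-loc.

    Assumes 1-based inclusive coordinates. Reverse-strand features in
    Appendix A have startpos > endpos, so the bounds are sorted first.
    """
    lo, hi = sorted((startpos, endpos))
    sub = assembly[lo - 1:hi]
    return reverse_complement(sub) if complement == 1 else sub
```

For the string "AACCGGTT", extract_interval("AACCGGTT", 3, 5, 0) gives "CCG" and extract_interval("AACCGGTT", 5, 3, 1) gives "CGG".
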

A gene feature contains an Assembly attribute specifying the assembly id of the
sequence it references. This is a temporary fix serving to facilitate API lookup table
queries and will be removed before the system goes into production. The referenced
assembly should be implied by the bsml hierarchy.

Each CDS feature contains a bsml Link element specifying the bsml id of the amino acid
sequence the CDS encodes. The sequences are defined in the <Sequences> section of
the bsml document and provide inline sequence data.

Figure 5 - Amino Acid Sequence definition referenced by a CDS feature

<Sequence length="61" molecule="aa" id="protein_22125">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>
       <Seq-data>MTRCTEATVREPAAA...</Seq-data>
</Sequence>

<Feature class="CDS" id="cds_22124">
             <Site-loc complement="1" class="START" sitepos="4620"></Site-loc>
             <Site-loc complement="1" class="STOP" sitepos="4438"></Site-loc>
             <Link href="#protein_22125" rel="SEQ"></Link>
</Feature>
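
Following the Link from a CDS feature to its amino acid sequence can be sketched in Python as below. This is a minimal illustration using the standard library rather than the Bsml API; the function name protein_for_cds and the ids cds_1 and protein_1 in the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# A reduced, hypothetical document with one CDS and the protein it encodes.
BSML = """<Bsml><Definitions><Sequences>
  <Sequence id="asm" molecule="dna">
    <Feature-tables><Feature-table>
      <Feature class="CDS" id="cds_1">
        <Site-loc complement="0" class="START" sitepos="1"/>
        <Site-loc complement="0" class="STOP" sitepos="9"/>
        <Link href="#protein_1" rel="SEQ"/>
      </Feature>
    </Feature-table></Feature-tables>
  </Sequence>
  <Sequence id="protein_1" molecule="aa"><Seq-data>MT*</Seq-data></Sequence>
</Sequences></Definitions></Bsml>"""


def protein_for_cds(root, cds_id):
    """Return the in-line amino acid sequence linked from a CDS, or None."""
    # Index every <Sequence> by its id so the href can be resolved directly.
    sequences = {s.get("id"): s for s in root.iter("Sequence")}
    for feature in root.iter("Feature"):
        if feature.get("class") == "CDS" and feature.get("id") == cds_id:
            for link in feature.findall("Link"):
                if link.get("rel") == "SEQ":
                    # The href is a fragment reference such as "#protein_1".
                    target = sequences[link.get("href").lstrip("#")]
                    return "".join((target.findtext("Seq-data") or "").split())
    return None


root = ET.fromstring(BSML)
protein = protein_for_cds(root, "cds_1")
```
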

Feature Groups

Feature groups associate feature elements into transcriptional units. That is, they
associate a TRANSCRIPT feature, a CDS feature, and a set of EXON features as a group.
Prokaryotic data, for which CDS=TRANSCRIPT=GENE, contain equivalent CDS and
TRANSCRIPT features as well as a single exon spanning this interval. This conforms
to the unified gene model used in the Chado systems. Feature groups contain a bsml
Link element referencing the gene to which the transcript belongs as well as a
"group-set" attribute which associates feature groups with each other. The current
scheme sets the "group-set" attribute to the feature id of the gene element, making the
use of a "gene link" redundant.

Feature-group-member elements contain the attributes "feature-type" and "featref."
"Feature-type" attributes define the type of feature the feature-group-member
references and may be set to TRANSCRIPT, CDS, or EXON. "Featref" sets the bsml
identifier associated with the referenced feature element.

Sets of feature groups, linked by having equivalent "group-set" attributes, define the
transcript structure of a gene.

Figure 6 - Feature group encoding

<Feature-group group-set="lma2_98.t00003_feature_id_gene:22128">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22127"></Feature-group-member>
             <Feature-group-member feature-type="EXON" featref="exon_22131"></Feature-group-member>
             <Feature-group-member feature-type="CDS" featref="cds_22129"></Feature-group-member>
             <Link href="#lma2_98.t00003_feature_id_gene:22128" rel="GENE"></Link>
</Feature-group>
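
Reassembling a gene's transcript structure from its feature groups amounts to collecting the groups that share a "group-set" value. The Python sketch below shows one way to do this with the standard library; it reuses the identifiers from Figure 1, and the function name transcripts_by_gene is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Two feature groups belonging to the same gene, as in Figure 1.
FRAGMENT = """<Feature-tables>
  <Feature-group group-set="Gene_001">
    <Link rel="GENE" href="#Gene_001"/>
    <Feature-group-member feature-type="TRANSCRIPT" featref="Gene_001_Transcript_001"/>
    <Feature-group-member feature-type="CDS" featref="Gene_001_CDS_001"/>
    <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_001"/>
    <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_002"/>
  </Feature-group>
  <Feature-group group-set="Gene_001">
    <Link rel="GENE" href="#Gene_001"/>
    <Feature-group-member feature-type="TRANSCRIPT" featref="Gene_001_Transcript_002"/>
    <Feature-group-member feature-type="CDS" featref="Gene_001_CDS_002"/>
    <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_003"/>
    <Feature-group-member feature-type="EXON" featref="Gene_001_Exon_004"/>
  </Feature-group>
</Feature-tables>"""


def transcripts_by_gene(root):
    """Map each group-set value to a list of transcriptional units.

    Each unit is a dict from feature-type (TRANSCRIPT, CDS, EXON) to the
    list of featref ids in that feature group, in document order.
    """
    genes = defaultdict(list)
    for group in root.iter("Feature-group"):
        unit = defaultdict(list)
        for member in group.findall("Feature-group-member"):
            unit[member.get("feature-type")].append(member.get("featref"))
        genes[group.get("group-set")].append(dict(unit))
    return dict(genes)


structure = transcripts_by_gene(ET.fromstring(FRAGMENT))
```

Here structure["Gene_001"] holds two units, one per transcript, each listing its transcript, CDS, and exon ids.
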
Appendix A - Reference Eukaryotic Assembly Encoding

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Bsml PUBLIC "-//EBI//Labbook, Inc. BSML DTD//EN"
"http://www.labbook.com/dtd/bsml3_1.dtd">

<Bsml>
  <Definitions>
    <Sequences>
      <Sequence length="15561" title="unspecified" topology="circular" molecule="dna"
id="lma2_98_assembly">
        <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>
        <Feature-tables>
           <Feature-table id="Bsml4147">
            <Feature class="GENE" id="lma2_98.t00001_feature_id_gene:22118">
               <Qualifier value="lma2_98_assembly" value-type="name"></Qualifier>
               <Interval-loc endpos="3565" startpos="1259" complement="0"></Interval-loc>
            </Feature>
            <Feature class="TRANSCRIPT" id="transcript_22117">
               <Site-loc complement="0" class="START" sitepos="1259"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="3565"></Site-loc>
            </Feature>
            <Feature class="EXON" id="exon_22121">
               <Interval-loc endpos="3565" startpos="1259" complement="0"></Interval-loc>
            </Feature>
            <Feature class="CDS" id="cds_22119">
               <Site-loc complement="0" class="START" sitepos="1259"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="3565"></Site-loc>
               <Link href="#protein_22120" rel="SEQ"></Link>
            </Feature>
            <Feature class="GENE" id="lma2_98.t00002_feature_id_gene:22123">
               <Qualifier value="lma2_98_assembly" value-type="name"></Qualifier>
               <Interval-loc endpos="4438" startpos="4620" complement="1"></Interval-loc>
            </Feature>
            <Feature class="TRANSCRIPT" id="transcript_22122">
               <Site-loc complement="1" class="START" sitepos="4620"></Site-loc>
               <Site-loc complement="1" class="STOP" sitepos="4438"></Site-loc>
            </Feature>
            <Feature class="EXON" id="exon_22126">
               <Interval-loc endpos="4438" startpos="4620" complement="1"></Interval-loc>
            </Feature>
            <Feature class="CDS" id="cds_22124">
               <Site-loc complement="1" class="START" sitepos="4620"></Site-loc>
               <Site-loc complement="1" class="STOP" sitepos="4438"></Site-loc>
               <Link href="#protein_22125" rel="SEQ"></Link>
            </Feature>
            <Feature class="GENE" id="lma2_98.t00003_feature_id_gene:22128">
               <Qualifier value="lma2_98_assembly" value-type="name"></Qualifier>
               <Interval-loc endpos="6944" startpos="6546" complement="0"></Interval-loc>
            </Feature>
            <Feature class="TRANSCRIPT" id="transcript_22127">
               <Site-loc complement="0" class="START" sitepos="6546"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="6944"></Site-loc>
            </Feature>
            <Feature class="EXON" id="exon_22131">
               <Interval-loc endpos="6944" startpos="6546" complement="0"></Interval-loc>
            </Feature>
            <Feature class="CDS" id="cds_22129">
               <Site-loc complement="0" class="START" sitepos="6546"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="6944"></Site-loc>
               <Link href="#protein_22130" rel="SEQ"></Link>
            </Feature>
            <Feature class="GENE" id="lma2_98.t00004_feature_id_gene:22133">
               <Qualifier value="lma2_98_assembly" value-type="name"></Qualifier>
               <Interval-loc endpos="7017" startpos="6616" complement="0"></Interval-loc>
            </Feature>
            <Feature class="TRANSCRIPT" id="transcript_22132">
               <Site-loc complement="0" class="START" sitepos="6616"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="7017"></Site-loc>
            </Feature>
            <Feature class="EXON" id="exon_22136">
               <Interval-loc endpos="7017" startpos="6616" complement="0"></Interval-loc>
            </Feature>
            <Feature class="CDS" id="cds_22134">
               <Site-loc complement="0" class="START" sitepos="6616"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="7017"></Site-loc>
               <Link href="#protein_22135" rel="SEQ"></Link>
            </Feature>
            <Feature class="GENE" id="lma2_98.t00005_feature_id_gene:22138">
               <Qualifier value="lma2_98_assembly" value-type="name"></Qualifier>
               <Interval-loc endpos="11487" startpos="10144" complement="0"></Interval-loc>
            </Feature>
            <Feature class="TRANSCRIPT" id="transcript_22137">
               <Site-loc complement="0" class="START" sitepos="10144"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="11487"></Site-loc>
            </Feature>
            <Feature class="EXON" id="exon_22141">
               <Interval-loc endpos="11487" startpos="10144" complement="0"></Interval-loc>
            </Feature>
            <Feature class="CDS" id="cds_22139">
               <Site-loc complement="0" class="START" sitepos="10144"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="11487"></Site-loc>
               <Link href="#protein_22140" rel="SEQ"></Link>
            </Feature>
            <Feature class="GENE" id="lma2_98.t00006_feature_id_gene:22143">
               <Qualifier value="lma2_98_assembly" value-type="name"></Qualifier>
               <Interval-loc endpos="14989" startpos="14138" complement="0"></Interval-loc>
            </Feature>
            <Feature class="TRANSCRIPT" id="transcript_22142">
               <Site-loc complement="0" class="START" sitepos="14138"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="14989"></Site-loc>
            </Feature>
            <Feature class="EXON" id="exon_22146">
               <Interval-loc endpos="14989" startpos="14138" complement="0"></Interval-loc>
            </Feature>
            <Feature class="CDS" id="cds_22144">
               <Site-loc complement="0" class="START" sitepos="14138"></Site-loc>
               <Site-loc complement="0" class="STOP" sitepos="14989"></Site-loc>
               <Link href="#protein_22145" rel="SEQ"></Link>
            </Feature>
           </Feature-table>
           <Feature-group id="Bsml4148" group-set="lma2_98.t00001_feature_id_gene:22118">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22117"></Feature-group-member>
             <Feature-group-member feature-type="EXON"
featref="exon_22121"></Feature-group-member>
             <Feature-group-member feature-type="CDS"
featref="cds_22119"></Feature-group-member>
             <Link href="#lma2_98.t00001_feature_id_gene:22118" rel="GENE"></Link>
           </Feature-group>
           <Feature-group id="Bsml4149" group-set="lma2_98.t00002_feature_id_gene:22123">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22122"></Feature-group-member>
             <Feature-group-member feature-type="EXON"
featref="exon_22126"></Feature-group-member>
             <Feature-group-member feature-type="CDS"
featref="cds_22124"></Feature-group-member>
             <Link href="#lma2_98.t00002_feature_id_gene:22123" rel="GENE"></Link>
           </Feature-group>
           <Feature-group id="Bsml4150" group-set="lma2_98.t00003_feature_id_gene:22128">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22127"></Feature-group-member>
             <Feature-group-member feature-type="EXON"
featref="exon_22131"></Feature-group-member>
             <Feature-group-member feature-type="CDS"
featref="cds_22129"></Feature-group-member>
             <Link href="#lma2_98.t00003_feature_id_gene:22128" rel="GENE"></Link>
           </Feature-group>
           <Feature-group id="Bsml4151" group-set="lma2_98.t00004_feature_id_gene:22133">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22132"></Feature-group-member>
             <Feature-group-member feature-type="EXON"
featref="exon_22136"></Feature-group-member>
             <Feature-group-member feature-type="CDS"
featref="cds_22134"></Feature-group-member>
             <Link href="#lma2_98.t00004_feature_id_gene:22133" rel="GENE"></Link>
           </Feature-group>
           <Feature-group id="Bsml4152" group-set="lma2_98.t00005_feature_id_gene:22138">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22137"></Feature-group-member>
             <Feature-group-member feature-type="EXON"
featref="exon_22141"></Feature-group-member>
             <Feature-group-member feature-type="CDS"
featref="cds_22139"></Feature-group-member>
             <Link href="#lma2_98.t00005_feature_id_gene:22138" rel="GENE"></Link>
           </Feature-group>
           <Feature-group id="Bsml4153" group-set="lma2_98.t00006_feature_id_gene:22143">
             <Feature-group-member feature-type="TRANSCRIPT"
featref="transcript_22142"></Feature-group-member>
             <Feature-group-member feature-type="EXON"
featref="exon_22146"></Feature-group-member>
             <Feature-group-member feature-type="CDS"
featref="cds_22144"></Feature-group-member>
             <Link href="#lma2_98.t00006_feature_id_gene:22143" rel="GENE"></Link>
           </Feature-group>
        </Feature-tables>
        <Seq-data-import format="fasta" id="chado_tryp_lma2_98_assembly"
source="/usr/local/scratch/BSML_REPOSITORY/chado_tryp.fsa"></Seq-data-import>
     </Sequence>
     <Sequence length="769" title="unspecified" molecule="aa" id="protein_22120">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>

<Seq-data>MLNTDRSMTLNSHYSSSDSAAGMPAPGLAVSAAGQTGHKDMVSSPSISVDQAPTTPGSDAI
NFFAKAPSSYKSHFGDSAAGVVASSTMPVYSAPDTPLQKGINSGYSARAPSRISLSGSVGASASSCGLT
GSPAFASGNGVIGAGAGLHSNSSSFGNSPQTVSITPLVKLGLPSSVCIVAQPTSPSAADGVACGGEAAVA
SDIVNAPEPMGSTCIPNLVDGVGAPLQAIYSSSSSSSTSSAPSPLAQTTSSPQCGGNTSCVSPSGTCSLL
FEAEVNISSPNAETASAAQALDEDALGETHQCTFISPPENNAKVDDGRTRSPKTPSRGGPSKSCGSAPS
AKSNPRADHAASPPKNSPSHSKQRQQRAGAASRRQNVSASNCSLSTGQMAPESLQKGCSSPASQGEF
VSVYDSDFETLLSIPASQVLHRPNPALGVSRLVLCRKFNLENPQSCSKGEMCKFVHADIRKASRCSIHVN
YAWRSLALCTYPRLPAGDEVTVLAPNERSPSEVIPSERILVTRGSTNWREHTAPLSHCAHYYFNRMCNR
GKRCNFIHAVHVDPNVQGDFKHAPAPRAVAPIASKPSNSAASRAASGAQAASNEGHHQSSAHHGGAAA
RTAPLAKQPQLLPPVQQNSANNVANAGGGIAYAFAPSIFALPPHGCMAYPLLNPTSSSTAMGVPQSPGG
VAGGAPYVMLMNSAGQQVGCSYVPATLMPSTPQQMGGSNGVFLIVPTANGSGGSLGGSGAPLGSPGV
PVSPLGQSFSSLGNGPVNW*</Seq-data>
     </Sequence>
     <Sequence length="61" title="unspecified" molecule="aa" id="protein_22125">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>

<Seq-data>MTRCTEATVREPAAAVLEKTKTERRAETNTLMQKGDEEASIVCRLSFRFTDHPSLHRNHQ*</Seq-data>
     </Sequence>
     <Sequence length="133" title="unspecified" molecule="aa" id="protein_22130">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>

<Seq-data>MHAIYMYSSVLETDASACVCVRVYACVGRLTPSTNHCHLLPPSPKNASAACDAICLLPFLLGP
VSALVSCLPLRTRSLAAHHSLFPLLQLLLLQRRPRRAQVYAPSPTHTHAHISDPHPSVRGIEGSGNLTM*</Seq-data>
     </Sequence>
     <Sequence length="134" title="unspecified" molecule="aa" id="protein_22135">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>

<Seq-data>MRALGDSPLQRITATSSPPPPRTPPRLVMLSVYCLSSSALSALSCHASHCAPAALPRTTPSF
RFFNFYFCSVAHDVPRCMLPLPHTRMHTSPTPTPQCVVSKGPVTSPCEAKPSSSSSSSPLPPLPVPNHF
WR*</Seq-data>
     </Sequence>
     <Sequence length="448" title="unspecified" molecule="aa" id="protein_22140">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>

<Seq-data>MCFHLGASWVGDKLQPLLFRCARASPRVHILAPPPFRLLLLLSSFRFFGRCISLLPSSCALSA
FPFLFYVSGRLVEFARGSEERKTKIRSVSSAKRDRKCTFTNPREPYAAPSSLSLLIRTHPITHARPRLRRTH
ATGLRVVYSSLFLYSSILCFIFPFLVSSCTVDLVSTSSPTSLLSPSYIYIYIYIYSDVRVDQRTFFVLESGRLAS
LTCFTSPHRRRRLSPLAHHLLPLHVFDSDVSGDTFAIFPPFLLTSPPFCLPIRGTHSVTATNQHKTTTTTNN
TREGRRVRQKDSTQGFFFHTPRPFLPSPSSPLPFSSLSLCLVDCPSDTRPRPTPRSTDASAWLAAVETSI
CLSPTFSLCVFVCRYSRQFPSGCKACFFTLSPSPLPPTFCRLACSSGLSFRITSFFSRYLPTPFVLLSGLS
CRFFCCCSIYIFSVSLIRYCEKLH*</Seq-data>
     </Sequence>
     <Sequence length="284" title="unspecified" molecule="aa" id="protein_22145">
       <Attribute name="ASSEMBLY" content="lma2_98_assembly"></Attribute>

<Seq-data>MFSPVYEDVSGLFGHPMPISGFGPSTELQFPQTSYMTTTTNAPQTPTQQQQQQLAGRATVY
SLVDASGVQLHRQAAPPQQIPMSSVPFGAFDGASIVPYEVFMQHQQQQQLAAAGKPTAVHAQDGVKYY
LSSSTGAPVMASAGPAATTQANIMQMQQHGGTLYTLESNGTTTALTSTPPAGGLFSGAAAANGAPPSGE
YPFYAAATGFLSSGPLGMAPATVGSASVTPYAIQGHMPSASPRARGAVNRSPAATIAGNQHHSNSGNNS
SSASTRRENQATLHEM*</Seq-data>
      </Sequence>
    </Sequences>
  </Definitions>
</Bsml>

				